My benchmark (RgbenchMM) for testing floating-point performance on Android is now published on the Play Store.
It is a reasonably optimized matrix multiplication kernel that is fully multithreaded and written in C++ using the NDK. Here is the ARMv7-A assembly that GCC produces for the innermost loop:
[code]
adds r2, r2, #1
adds r1, r1, #8
adds r0, r0, #8
cmp r2, r4
fldd d7, [r1, #0]
fldd d6, [r0, #0]
fldd d5, [r3, #-24]
fldd d4, [r3, #-16]
fldd d3, [r3, #-8]
fldd d2, [r3, #0]
fmacd d1, d7, d5
add r3, r3, r5
fmacd d0, d7, d4
fmacd d8, d7, d3
fmacd d9, d7, d2
fmacd d11, d5, d6
fmacd d12, d4, d6
fmacd d13, d3, d6
fmacd d10, d2, d6
bne .L4
[/code]
As you can see, it does 6 loads and 8 multiply-accumulates (16 flops) per iteration of the loop. The load instructions (FLDD) are VFP instructions, as are the FMACD instructions. Thus, the benchmark is testing VFP performance almost exclusively. One other detail about the code is that the threads are set up so that, ideally, they read the same columns of one of the input matrices. This is beneficial on architectures with at least one level of shared cache, and thus you may see more than a 2x speedup on a dual-core processor.
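For readers who prefer source to assembly, here is a minimal C++ sketch of a 2x4 register-blocked inner loop that would plausibly compile to code of this shape: two values from one input matrix, four from the other, and eight accumulators. It is a reconstruction from the assembly, not the actual RgbenchMM source; the function name `micro_kernel_2x4`, the row-major layout, and all parameter names are assumptions.

```cpp
#include <cstddef>

// Hypothetical reconstruction, NOT the benchmark's real source.
// Computes a 2x4 block of C:  C[i..i+1][j..j+3] += A[i..i+1][0..K) * B[0..K)[j..j+3]
// All matrices are assumed row-major.
//   a0, a1 : pointers to two consecutive rows of A (K doubles each)
//   b      : pointer to B[0][j]; ldb is the row stride of B in doubles
//   c0, c1 : pointers to C[i][j] and C[i+1][j]
void micro_kernel_2x4(const double* a0, const double* a1,
                      const double* b, std::size_t ldb,
                      std::size_t K, double* c0, double* c1) {
    // Eight accumulators, intended to stay in VFP registers.
    double c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    double c10 = 0, c11 = 0, c12 = 0, c13 = 0;

    for (std::size_t k = 0; k < K; ++k) {
        // 6 loads per iteration: 2 from A, 4 from B.
        const double x = a0[k];
        const double y = a1[k];
        const double b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];

        // 8 multiply-accumulates = 16 flops per iteration.
        c00 += x * b0;  c01 += x * b1;  c02 += x * b2;  c03 += x * b3;
        c10 += y * b0;  c11 += y * b1;  c12 += y * b2;  c13 += y * b3;

        b += ldb;  // step to the next row of B
    }

    c0[0] += c00;  c0[1] += c01;  c0[2] += c02;  c0[3] += c03;
    c1[0] += c10;  c1[1] += c11;  c1[2] += c12;  c1[3] += c13;
}
```

The point of the blocking is the ratio: each loaded value is reused several times, so 6 loads feed 8 multiply-accumulates instead of the 2 loads per multiply-accumulate of a naive triple loop.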
With this background in mind, let us examine some of the data reported by testers.
Snapdragon S3 dual-core Scorpion @ 1.5 GHz = 1175 MFlops
Exynos 4 dual-core @ 1.2 GHz = 920 MFlops
Tegra 3 T30L quad-core @ 1.2 GHz = 1488 MFlops
OMAP 4460 dual-core @ 1.2 GHz = 900 MFlops
These results are thanks to ChronoReverse, willyjwebb, derFunkenstein, and DancinJack on the Tech Report forums.
A back-of-the-envelope calculation shows that each core executes one iteration of the innermost loop in about 40-42 cycles on OMAP, Exynos, and Snapdragon S3, but in about 50 cycles on the Tegra 3. The Tegra 3 result is somewhat surprising to me, given that it uses the same Cortex-A9 core as Exynos and OMAP. One possible culprit is that the L2 cache cannot keep up with feeding 4 cores. However, more information is necessary to draw definitive conclusions. In particular, if you have tested it on another quad-core Cortex-A9 device, such as an Exynos 4 Quad, that would be helpful.
It would be very interesting to see how the newer generation of processors (like Cortex-A15 and Qualcomm Krait) perform.
One thing is clear: there is much to be learned about these ARM processors. The poor state of benchmarks on Android today (except mine, of course :P) and the lack of documentation from the vendors mean that there are a LOT of misperceptions out there.
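The arithmetic behind that estimate can be sketched as follows. This is my reading of the calculation, not necessarily the exact one used: it assumes the reported MFlops are split evenly across cores and that each loop iteration performs 16 flops. The function name `cycles_per_iteration` is made up for illustration.

```cpp
#include <cassert>

// Rough estimate of core cycles spent per inner-loop iteration.
// Assumptions: throughput is divided evenly among the cores, and each
// iteration performs 16 flops (8 FMACD instructions).
//   clock_mhz : core clock in MHz (1e6 cycles per second)
//   cores     : number of cores running the benchmark
//   mflops    : reported aggregate throughput in MFlops (1e6 flops per second)
double cycles_per_iteration(double clock_mhz, int cores, double mflops) {
    assert(cores > 0 && mflops > 0.0);
    const double flops_per_iteration = 16.0;
    const double mflops_per_core = mflops / cores;
    // cycles per flop = clock / per-core throughput; scale by flops per iteration.
    return flops_per_iteration * clock_mhz / mflops_per_core;
}
```

Plugging in the reported numbers, for example `cycles_per_iteration(1200, 2, 900)` for the OMAP 4460, gives roughly 41 to 43 cycles for the three dual-core chips and a little under 52 for the Tegra 3 with `cores = 4`, in line with the rough figures quoted above.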