Prelim analysis of RgbenchMM

My benchmark (RgbenchMM) for testing floating-point performance on Android is now published on the Play Store.

It is a reasonably optimized matrix multiplication kernel that is fully multithreaded and written in C++ using the NDK. Here is the ARMv7-A assembly code that GCC produces for the innermost loop:

	adds	r2, r2, #1
	adds	r1, r1, #8
	adds	r0, r0, #8
	cmp	r2, r4
	fldd	d7, [r1, #0]
	fldd	d6, [r0, #0]
	fldd	d5, [r3, #-24]
	fldd	d4, [r3, #-16]
	fldd	d3, [r3, #-8]
	fldd	d2, [r3, #0]
	fmacd	d1, d7, d5
	add	r3, r3, r5
	fmacd	d0, d7, d4
	fmacd	d8, d7, d3
	fmacd	d9, d7, d2
	fmacd	d11, d5, d6
	fmacd	d12, d4, d6
	fmacd	d13, d3, d6
	fmacd	d10, d2, d6
	bne	.L4

As you can see, each iteration of the loop does 6 loads and 8 multiply-accumulates (16 flops). The load instructions (FLDD) are VFP instructions, as are the FMACD instructions, so the benchmark tests VFP performance almost exclusively. One other detail about the code is that the threads are set up so that, ideally, they are reading the same columns of one of the input matrices. This is beneficial on architectures with at least one level of shared cache, and thus you may see more than a 2x speedup on a dual-core processor.
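For readers who do not read ARM assembly, here is a rough C++ sketch of the kind of loop that compiles to code like the listing above. It is a reconstruction from the assembly, not the benchmark's actual source: the function name `kernel_2x4`, the parameter names, and the assumed layout (two rows of one matrix read contiguously, four adjacent columns of a row-major second matrix read with a stride) are all assumptions made for illustration.

```cpp
#include <cstddef>

// Illustrative 2x4 register-blocked inner kernel. Names and memory layout
// are assumptions, not the benchmark's real source.
//   a0, a1 : two rows of the first matrix, each contiguous along k
//   b      : first of four adjacent columns of the row-major second matrix
//   ldb    : row stride of the second matrix, in elements
//   k_len  : length of the dot products
//   c      : 2x4 tile of results; the kernel adds its sums into it
void kernel_2x4(const double* a0, const double* a1,
                const double* b, std::size_t ldb,
                std::size_t k_len, double c[2][4]) {
    // Eight accumulators, intended to stay in registers for the whole loop.
    double c00 = 0.0, c01 = 0.0, c02 = 0.0, c03 = 0.0;
    double c10 = 0.0, c11 = 0.0, c12 = 0.0, c13 = 0.0;

    for (std::size_t k = 0; k < k_len; ++k) {
        // 6 loads: one element from each of the two rows, four from one
        // row of the second matrix.
        const double x0 = a0[k];
        const double x1 = a1[k];
        const double b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];

        // 8 multiply-accumulates = 16 flops.
        c00 += x0 * b0;  c01 += x0 * b1;  c02 += x0 * b2;  c03 += x0 * b3;
        c10 += x1 * b0;  c11 += x1 * b1;  c12 += x1 * b2;  c13 += x1 * b3;

        b += ldb;  // step down to the next row of the second matrix
    }

    c[0][0] += c00;  c[0][1] += c01;  c[0][2] += c02;  c[0][3] += c03;
    c[1][0] += c10;  c[1][1] += c11;  c[1][2] += c12;  c[1][3] += c13;
}
```

The point of blocking like this is that each loaded value is reused several times (each of the 6 loads feeds 2 or 4 multiply-accumulates), which is what keeps the loop dominated by VFP arithmetic rather than by memory traffic.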

With this background in mind, let us examine some of the data reported by testers.

Snapdragon S3 dual-core Scorpion @ 1.5GHz = 1175 MFlops

Exynos 4 dual-core @ 1.2 GHz = 920 MFlops

Tegra 3 T30L quad-core @ 1.2 GHz = 1488 MFlops

OMAP 4460 dual-core @ 1.2 GHz = 900 MFlops

These results are thanks to ChronoReverse, willyjwebb, derFunkenstein, DancinJack on Tech Report forums.

A back-of-the-envelope calculation (16 flops per iteration, divided by the per-core flops per clock cycle implied by each score and clock speed) shows that the innermost loop executes on each core in about 40-42 cycles on the OMAP, Exynos, and Snapdragon S3, but about 50 cycles on the Tegra 3. The Tegra 3 result is somewhat surprising to me given that it uses the same Cortex A9 core as the Exynos and the OMAP. One possible culprit is that the L2 cache is not keeping up with feeding 4 cores. However, more information is necessary to draw definitive conclusions. In particular, if you have tested it on another quad-core Cortex A9 device, like an Exynos 4 Quad, that will be helpful.
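The arithmetic behind that estimate can be sketched as follows. This is only a rough model: it assumes that all of the time is spent in the inner loop, that the work is split evenly across cores, and that each chip runs at its advertised clock for the whole test. The function name `cycles_per_iteration` is made up for this illustration.

```cpp
// Rough estimate of clock cycles spent per inner-loop iteration on one core.
// Assumptions: work is split evenly across cores, the advertised clock is
// sustained, and time outside the inner loop is negligible.
//   mflops              : total score reported by the benchmark
//   cores               : number of cores (threads) used
//   clock_mhz           : advertised clock speed in MHz
//   flops_per_iteration : 16 for the loop shown above (8 multiply-accumulates)
double cycles_per_iteration(double mflops, int cores, double clock_mhz,
                            double flops_per_iteration = 16.0) {
    const double mflops_per_core = mflops / cores;
    // MHz divided by MFlops gives cycles per flop.
    const double cycles_per_flop = clock_mhz / mflops_per_core;
    return cycles_per_flop * flops_per_iteration;
}
```

For example, `cycles_per_iteration(1175, 2, 1500)` models the Snapdragon S3 result and `cycles_per_iteration(1488, 4, 1200)` the Tegra 3 result. With the scores listed above, the three dual-core chips land in the low 40s and the Tegra 3 a little above 50.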

It would be very interesting to see how the newer generation of processors (like Cortex A15 and Qualcomm Krait) will perform.

One thing is clear. There is much to be learned from these ARM processors. The poor state of benchmarks on Android today (except mine of course :P) and the lack of documentation from the vendors mean that there are a LOT of misperceptions out there.

Posted on September 25, 2012, in Uncategorized. 20 Comments.

  1. I would wager that Tegra didn’t pay as much attention to scalar performance, thinking that devs would take advantage of their GPU cores. Heterogeneous computing is the future; using only the scalar cores is a mistake.

  2. @monkey, the same thing came to mind when I saw Tegra noted as well. ARM is part of the HSA foundation, so maybe we’ll see something there in the long term. However, the HSA foundation is headed up by AMD, so Nvidia may still remain primarily concerned with CUDA.

  3. Alexander Lehmann

    with my nexus4 I get about 2950 MFlops…
    nice benchmark, better than all this meaningless crap

  4. Galaxy Note 2, 4 threads.
    2455 MFlops

  5. I have a Nexus 10 (Exynos 5250 Dual, dual-core Cortex A15 @ 1.7GHz), and am getting different results with different runs. Note: I added the GFLOPS conversion to make it more search-engine friendly.

    A 4-thread test gets around 1950 MFLOPS (1.95 GFLOPS) most of the time, though the number often goes to 2200 MFLOPS (2.2 GFLOPS), and it has gone as high as 2700 MFLOPS (2.7 GFLOPS)!

    A 2-thread test gets around 2250 MFLOPS (2.25 GFLOPS) consistently. I haven’t run it enough to notice any spikes.

    A 1-thread test gets around 970 MFLOPS (0.97 GFLOPS) consistently. I also haven’t run this enough to notice spikes.

    I suspect that the workload on my system during each run is causing the disparity in results. Needless to say, these are impressive results given that this is a dual-core CPU. A quad-core A15 (like those in this year’s CPUs) at high clocks will likely result in 4 to 5 GFLOPS. Of course, the dreaded (and likely) cache-miss will inhibit real-world performance.

    I would *love* to get a feel for optimized GPU-compute performance. ;)

    • Thanks for those numbers! Possibly something related to the load, and perhaps something funny going on with the dynamic frequency scaling of the processor as well. I wonder if the 1.7GHz quoted by Samsung is a sustained speed.
      I do think I need to rework the benchmark a little bit. The results reported are not optimal. It is possible to go faster on cores like the Cortex A15 as the code is not properly utilizing all the VFP registers yet. The code is compiled for a baseline VFP, such as that present in the Tegra 2, but the VFP in newer cores has more registers.

      • You’re most welcome! Thank you for the benchmark!

        That’s a great insight. I can test the frequency scaling. I have an app called CPU spy, which is a timer for the CPU frequency states. As such, if I reset the timers, run your app, and then quickly refresh, I should get an idea of how long your benchmark is running at each frequency. It’s not scientific, but it will provide good insight into whether or not it’s running at peak speeds for the duration.

        I was frankly blown away at the 2+ GFLOPS, but further optimization to use the additional registers of VFP would be incredible (I’m going to look into the VFPv3 right after I finish posting this). I’m planning on doing some het-compute targeting the A15 specifically, and all of this gives me excellent insight into the performance of this CPU architecture. I can’t wait to get a feel for the GPU! :)

      • Ok, I just ran a test with CPU spy after a deep sleep.

        1) With 2 threads, I got a whopping 2.624 GFLOPS. The CPU speed was as follows:

        1700 MHz — 62% of the time (52 seconds)
        1600 MHz — 14% of the time (12 seconds)
        1500 MHz — 10% of the time (8 seconds)
        1000 MHz — 2% of the time (2 seconds)

        Since it took me around 4 seconds to reset the timers and switch the app (or around 8 seconds total), and start running the test, it would seem that the frequency downscaling is very aggressive and it scales up gradually to peak only after around 14 seconds of full-utilization.

        2) The second test was interesting as the frequencies were wildly different. I got 2.1 GFLOPs with the CPU speeds as follows:

        1700 MHz — 41% of the time (37 seconds)
        1600 MHz — 15% of the time (14 seconds)
        1500 MHz — 21% of the time (20 seconds)
        1400 MHz — 13% of the time (12 seconds)
        1000 MHz — 2% of the time (2 seconds)

        This follows from the test I did earlier when I got a high peak GFLOP throughput. It had been the first thing that I did with the tablet after ~8 hours of inactivity, and I got a speed burst (2.7 GFLOPs). The tests thereafter were the normal 2.2 GFLOPs. It would seem that the CPU is being throttled. I wouldn’t be surprised if this were motivated by a heat measurement, as I can clearly feel warmth on the device after running the test a few times, and the speeds drop dramatically after long periods of inactivity.

        3) The third test gave me similar amounts of time at 1700MHz and 1500MHz:

        1700 MHz — 29% of the time (27 seconds)
        1600 MHz — 17% of the time (16 seconds)
        1500 MHz — 22% of the time (21 seconds)
        1400 MHz — 14% of the time (14 seconds)
        1400 MHz — 8% of the time (8 seconds)
        1000 MHz — 2% of the time (2 seconds)

        Next experiment: put the device in the freezer and see what effect this has on the timings! (don’t mind me, I’m just excited)…

  6. Ok, I just put my device in a ziplock bag, stuck it in the freezer first for 3 mins, and then for 5mins and ran the test and timings on both occasions. On each test I got around 2.45 GFLOPs consistently with the CPU running at 1700MHz 89% of the time, both times.

    The Exynos 5250 is definitely throttling the frequency based on the temperature, though there seem to be other factors inhibiting performance. I’m not sure what accounts for the peak 2.6 and 2.7 GFLOPs that I got after a period of deep sleep, but those numbers were significantly higher than the sustained rate, and even the ‘cold’ rate, and seem to be paradoxically independent of the CPU frequency.

    All in all, great performance all around! I’m quite happy with 2.2 GFLOPs sustained on the CPU, and I’m sure with additional optimizations this number can be pushed higher. I’m also really happy with the sustained ~5.5GB/s memory bandwidth. The Exynos 5250 Dual truly is a wonderful performer, and I’m looking forward to seeing how the Exynos 5 Octa performs with its 4 A15 cores!

    • Thanks for the detailed info! Yes, the Cortex A15 is looking like a good core.

    • Are you able to set your CPU governor? It would be interesting to set it to Performance which would force the CPU to run at max frequency continuously. I wonder if the throttling will bypass it though.

  7. ChronoReverse

    Here’s something interesting I’ve found when running RgbenchMM on the Snapdragon 600 (quad) @ 1.9GHz

    With a single thread, the result is about 900.
    With two threads, it’s about 2000.
    But with four threads, I only get 3200.

    Either something is strange with the 4 core test or there’s some sort of throttling with the Snapdragon 600. My phone didn’t get particularly warm though.

    Interestingly enough, the Nexus 4, with a 1.5GHz Snapdragon S4Pro (quad) ALSO gets 3200 for 4 threads.

    Food for thought?

    • Hi Chrono. Difficult to say. Either it is thermal throttling, or some issue with the L2 cache not being able to feed the 4 cores sufficiently. I think the thermal throttling is more likely, but difficult to say. I do know that people got slightly higher scores on the HTC One.

  8. I get 4.016GFlops on my Galaxy S4 with Exynos 5410 octa! Nothing valuable to add except I’m pretty excited to see such performance from a year old CPU.

    • That is a really impressive score! I’m now curious to see how the slightly faster (and allegedly less buggy) Exynos 5420 performs. I will also be very excited to see how well the upcoming 20nm Cortex A57s perform, as they are supposed to be clocked higher and more efficient.

      • For 64-bit ARMv8 chips, we will need a better/newer test. ARMv8 adds new NEON instructions for fp64 and any matrix multiply test should try and take that into account.

      • Yes indeed! As I understand, the registers are larger as well and would need to be packed appropriately.

        Do you have any insight into how well the A15 (or A57) would handle trig, dot, or cross products? This is something that I’ve been very curious about.

        Finally, I know that you’re a fan of OpenCL. Have you considered writing a post on OpenGL ES 3.1 and its compute shaders? This seems relevant for Android as the Android team seems bullish on Renderscript, and this provides an alternate avenue where the GPU can be specifically targeted and offers the use of compressed resources (e.g. compressed textures, framebuffers).

      • Yeah, GL compute shaders do look nice and I will write about them sometime.
