Qualcomm has updated its Adreno Profiler, and it now runs on Linux. To clarify the setup: the development PC runs Linux, and the device being profiled runs Android. When I downloaded the tar file containing the Adreno Profiler for Linux, it contained only “exe” files, which confused me at first. A thread on the Qualcomm developer forum, however, reported that the profiler appears to work under Mono.
I tried the following steps and they seem to work, at least for profiling OpenCL. I used them to profile OpenCL apps on an IFC6410 development board (Snapdragon 600) running Android 4.2.2. I have not tried OpenGL apps yet, so please feel free to report your experiences with OpenGL profiling. The steps were as follows:
1. Install Mono and its associated libraries, especially the “core 4.0” and WinForms libraries for Mono.
2. Connect your device over USB and make sure it is listed when you do “adb devices”.
3. Set the device properties required for Adreno profiling using “adb shell setprop” commands. See the documentation accompanying Adreno Profiler for the exact settings.
4. Start Adreno Profiler using “mono AdrenoProfiler.exe”.
5. Start the app on the device, for example from adb shell. It will block (appear to hang) while it waits to connect to Adreno Profiler.
6. In Adreno Profiler, hit “connect” and connect to your device. The app will remain frozen. Now start “Scrubber CL” and hit the red “record” button. The app will resume.
7. Once the app finishes, examine the data.
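The host-side part of steps 2 to 4 can be scripted. The following is only a minimal sketch in Python, not something shipped with Adreno Profiler. The function names are my own, and the property names and values passed to run_host_setup are deliberately left to the caller, since the exact settings come from the Adreno Profiler documentation. It also assumes that adb and mono are on the PATH and that the script is run from the directory containing AdrenoProfiler.exe.

```python
import shutil
import subprocess


def parse_adb_devices(output):
    """Return serials of devices that `adb devices` lists in the 'device' state."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Skips the "List of devices attached" header, blank lines, and
        # entries in other states such as "offline" or "unauthorized".
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def setprop_commands(props):
    """Build one `adb shell setprop NAME VALUE` command per entry in props."""
    return [["adb", "shell", "setprop", name, value] for name, value in props.items()]


def run_host_setup(props):
    """Check the tools and the device, set the properties, then launch the profiler."""
    for tool in ("adb", "mono"):
        if shutil.which(tool) is None:
            raise SystemExit(f"{tool} not found on PATH")

    listing = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    if not parse_adb_devices(listing):
        raise SystemExit("no device in 'device' state; check the USB connection")

    # props must be filled in from the Adreno Profiler documentation;
    # no property names are assumed here.
    for command in setprop_commands(props):
        subprocess.run(command, check=True)

    # Assumption: AdrenoProfiler.exe is in the current working directory.
    subprocess.run(["mono", "AdrenoProfiler.exe"], check=True)
```

Calling run_host_setup with a dictionary of property names and values leaves you at step 5, with the profiler window open and the device ready for the app to be started.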
Overall, Adreno Profiler offers some really helpful data for OpenCL, such as a timeline of the various threads and GPU queues, which is similar to what, say, VTune provides. However, I have not yet found a way to view detailed hardware counter data, such as cache misses, that the documentation says should be available.
Also, while Adreno Profiler offers some static kernel analysis metrics, such as the number of ALU instructions, MOVs and NOPs in the compiled code, I would really like to view the generated assembly itself. The instruction metrics shown do not distinguish between instructions inside and outside loops, which makes them less useful for kernels that contain loops.
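To illustrate why static counts can mislead for kernels with loops, here is a small sketch with made-up numbers. The kernel layout, instruction counts and trip count are all hypothetical and are not taken from any Adreno Profiler output. A static count adds up each instruction once, while the work actually executed depends on how many times each block runs.

```python
def static_count(blocks):
    """Sum instruction counts once per block, ignoring how often each block runs."""
    return sum(count for count, _trips in blocks)


def dynamic_count(blocks):
    """Weight each block's instruction count by the number of times it executes."""
    return sum(count * trips for count, trips in blocks)


# Hypothetical kernel: 10 ALU instructions of setup that run once,
# plus a 4-instruction loop body that runs 256 times per work-item.
kernel_blocks = [(10, 1), (4, 256)]

print(static_count(kernel_blocks))   # 14
print(dynamic_count(kernel_blocks))  # 1034
```

In this example the loop body makes up less than a third of the static count but nearly all of the executed instructions, which is the kind of information the generated assembly would make visible.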