Renderscript from the perspective of an OpenCL/CUDA/C++ AMP programmer
Now that Renderscript Compute supposedly works on GPUs, here are some points to ponder about this strange “compute” API:
1. In OpenCL or CUDA, you specify a thread grid to launch a kernel. In Renderscript, there is no concept of a thread grid. Instead, you specify input and/or output arrays and each “thread” processes one output item. It reminds me of the limitations of very old GPGPU technologies like the original Brook language, and is essentially a pixel shader model (each shader thread writes one item). You can’t even query the thread ID in Renderscript (the equivalent of, say, get_global_id() in OpenCL).
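To make the contrast concrete, here is a rough sketch in plain C++ that emulates both launch styles on the CPU. It is not Renderscript or OpenCL code; the names launch_grid, launch_per_element, reverse_with_grid and scale_per_element are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Grid-style launch (OpenCL/CUDA flavour): the caller picks the number of
// threads, and every invocation is handed its own global ID.
void launch_grid(std::size_t global_size,
                 const std::function<void(std::size_t)>& kernel) {
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        kernel(gid);  // on a GPU these iterations would run in parallel
    }
}

// Per-element launch (the model described above): the runtime walks the
// input and output arrays, and the kernel sees only one element pair.
void launch_per_element(
    const std::vector<float>& in, std::vector<float>& out,
    const std::function<void(const float&, float&)>& kernel) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        kernel(in[i], out[i]);  // no index is ever passed to the kernel
    }
}

// Reversing an array is trivial once the kernel knows its own ID.
std::vector<float> reverse_with_grid(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    launch_grid(in.size(), [&](std::size_t gid) {
        out[in.size() - 1 - gid] = in[gid];
    });
    return out;
}

// A pure element-wise map is what the per-element model handles well.
std::vector<float> scale_per_element(const std::vector<float>& in,
                                     float factor) {
    std::vector<float> out(in.size());
    launch_per_element(in, out,
                       [factor](const float& x, float& y) { y = factor * x; });
    return out;
}
```

The reverse cannot be written against launch_per_element as it stands, because the kernel never learns which element it is working on.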
Even scatter is really complicated and inefficient. You cannot really scatter writes to the output array. However, you can scatter to separately bound arrays, so you have to adopt the following hack:
a) Do not pass in the actual input and output arrays directly. Instead, bind them separately as dynamic pointers.
b) Pass an array containing the output indices as input.
c) For each index in the passed array, do the computation and write the result to that index.
This is just INEFFICIENT. There is no need for such inefficiency on modern hardware. (See also this stackoverflow thread: http://stackoverflow.com/questions/10576583/passing-array-to-rsforeach-in-renderscript-compute )
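The shape of the hack in steps a) to c) can be sketched as a plain C++ emulation on the CPU. Again, this is not Renderscript code. The globals g_src, g_dst and g_len stand in for the separately bound arrays, the function names are hypothetical, and filling the index array with 0..n-1 is my assumption about how one would typically use it.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// (a) Stand-ins for the separately bound arrays: the kernel reaches them
// through globals instead of receiving them as launch arguments.
static const float* g_src = nullptr;
static float* g_dst = nullptr;
static std::size_t g_len = 0;

// (c) Kernel body: the only per-item input is an index taken from the
// index array. The reversed target is an arbitrary scatter pattern chosen
// for this example.
static void scatter_kernel(std::size_t idx) {
    assert(idx < g_len);
    std::size_t target = g_len - 1 - idx;
    g_dst[target] = g_src[idx] * 2.0f;
}

std::vector<float> scatter_with_index_array(const std::vector<float>& src) {
    std::vector<float> dst(src.size(), 0.0f);
    g_src = src.data();
    g_dst = dst.data();
    g_len = src.size();

    // (b) An extra array that holds nothing but indices (assumed here to
    // be 0..n-1) is allocated, filled, and passed as the "input".
    std::vector<std::size_t> indices(src.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    // The runtime would run one kernel invocation per element of `indices`.
    for (std::size_t idx : indices) {
        scatter_kernel(idx);
    }
    return dst;
}
```

The cost is visible in the sketch: an extra array of indices has to be allocated, filled and read back by every invocation, only to recover what a thread ID would give you for free.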
2. In Renderscript, the API chooses which device to run your code on. That’s right, you have no idea if your code is running on the CPU, the GPU, a DSP, etc. The work is supposedly distributed automatically between processors by the Renderscript runtime according to the driver implemented by the SoC vendor, and currently no guidelines are given about how to ensure code runs on the GPU beyond “simple code should run on GPU”.
3. Renderscript’s philosophy is to not expose the actual hardware information and properties to the programmer. OpenCL lets you query a lot of information about the hardware properties, like the amount of local memory available. I guess given that the programmer can’t even decide where to run the code, this is not surprising.
4. CUDA introduced on-chip shared memory, and that concept has been adopted by almost every GPGPU API today, including OpenCL and C++ AMP. However, Renderscript does not have any concept of on-chip shared memory. Thus, performance will be lower than that of well-optimized OpenCL kernels on many families of GPUs.
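For readers who have not used shared memory, here is a rough CPU emulation in plain C++ of the classic pattern it enables: a work group stages a tile in fast local storage and then reduces it cooperatively. This is not GPU code. kGroupSize, reduce_group and reduce_all are hypothetical names, and the local array only stands in for on-chip memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 4;  // hypothetical work-group size

// Sum one tile the way a work group would with shared memory.
float reduce_group(const std::vector<float>& in, std::size_t group) {
    float local[kGroupSize] = {};  // stands in for on-chip shared memory
    const std::size_t base = group * kGroupSize;
    assert(base < in.size());
    const std::size_t count = std::min(kGroupSize, in.size() - base);

    // Phase 1: each "thread" copies one element from global memory.
    for (std::size_t lid = 0; lid < count; ++lid) {
        local[lid] = in[base + lid];
    }
    // A real kernel needs a work-group barrier here.

    // Phase 2: tree reduction carried out entirely in the local buffer.
    for (std::size_t stride = kGroupSize / 2; stride > 0; stride /= 2) {
        for (std::size_t lid = 0; lid < stride; ++lid) {
            local[lid] += local[lid + stride];
        }
        // A real kernel needs a barrier after every step as well.
    }
    return local[0];
}

// Combine the per-group partial sums.
float reduce_all(const std::vector<float>& in) {
    float total = 0.0f;
    const std::size_t groups = (in.size() + kGroupSize - 1) / kGroupSize;
    for (std::size_t g = 0; g < groups; ++g) {
        total += reduce_group(in, g);
    }
    return total;
}
```

Without a shared scratch buffer, the intermediate sums in phase 2 would have to go through global memory or be recomputed, and that is exactly the kind of optimization you cannot express in Renderscript.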
5. Renderscript is not available directly from the Android NDK. This is a significant limitation because high-performance applications (such as performance-sensitive games) will often be written using the NDK.
Overall, I do not think that the current iteration of Renderscript is meant for writing high-performance code. Well-optimized OpenCL/CUDA/C++ AMP kernels will always significantly outperform Renderscript code, simply because Renderscript tries to present a simple abstraction and gives no control over performance. Performance will depend entirely upon the Renderscript compiler and driver, and will only come close to an API like OpenCL, CUDA or C++ AMP in very simple cases where the compiler may have the right heuristics built in.
At the same time, Renderscript has very weird programming model limitations, such as the scatter limitation outlined above. I think Renderscript was designed with only one application in mind: simple image processing filters. And as @jimrayvaughn pointed out on Twitter, many of those can be done efficiently in GLSL using well-understood techniques.
I hope that the SoC vendors and mobile handset vendors are reading this blog, and I hope that GPGPU on Android does not remain limited to Renderscript. Mobile vendors are wasting the power and potential of modern GPUs by not exposing the full power of the hardware to the developers. If you want to unlock the performance of your GPU, Renderscript is not the solution you are looking for.
Disclaimer: I am not a Renderscript expert. Finding documentation on Renderscript has been very tough, and my comments here are based upon what I could glean from the docs. If you find errors in this article, please point them out and I will update the article.
edited: Added NDK issue.
edited: Originally stated gather requires similar hack. However, gather works just fine. Only scatter is problematic.