Found an interesting upcoming single-board computer (SBC) with a quad-core Krait @ 1.7 GHz and an Adreno 320 GPU. At $149, it looks tempting. Still only in pre-order though. Might get one if the software support is right. Have been burnt before by devboards with interesting hardware but crap drivers, so will wait and watch for now. Particularly interested to see if it supports OpenCL, and whether OS options include fully-supported Linux and not just Android.
Update: The company confirmed that the board supports OpenCL on the GPU! Currently they have an Android image based on Jelly Bean 4.1.2. Linux seems to be on the roadmap, but with no ETA, I believe.
Many scientific computations are done in double precision floating-point (i.e. fp64). Support for fp64 varies between GPU architectures as well as GPGPU APIs. Here I just recap the capabilities of various APIs, assuming the hardware support is present:
1. CUDA: Full support for fp64 including exponentials, trigonometry etc.
2. OpenCL: Full support for fp64, similar to CUDA.
3. OpenGL: An extension called ARB_gpu_shader_fp64 is available, but it only supports basics like addition, multiplication and division. It does not support exponentials, trigonometry etc.
4. DirectCompute: On Windows 7, fp64 support covers add, multiply and a few comparison operators, but not division or exponentials. On Windows 8, some GPUs support double-precision division, reciprocal and FMA. However, as far as I know there is still no support for exponentials, trigonometry etc.
So, if you want full fp64 support, I guess OpenCL and CUDA are the way to go currently.
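To make the OpenCL case concrete, here is a minimal kernel sketch (illustrative only; the kernel name is mine). fp64 lives behind the cl_khr_fp64 extension, which must be enabled explicitly, after which the full double-precision math library becomes available:

```c
// Illustrative OpenCL C kernel fragment, assuming the device reports cl_khr_fp64.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void fp64_demo(__global const double *in, __global double *out)
{
    size_t i = get_global_id(0);
    // Division, exponentials and trigonometry are all available in fp64 here,
    // unlike the OpenGL and DirectCompute extensions discussed above.
    out[i] = exp(in[i]) / (1.0 + sin(in[i]) * sin(in[i]));
}
```

Note that a conforming implementation will refuse to compile double-precision code if the extension pragma is missing, so checking for cl_khr_fp64 in the device extension string first is the safe pattern.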
Intel has released an updated OpenCL driver and SDK for Ivy Bridge and Haswell, along with GPU profiling tools for OpenCL. I wrote about them briefly over at Anandtech.
I have had the Blackberry Z10 for more than two months now as my full-time device. The impression is mostly positive.
- The email and messaging experience is great. The always-available unified Blackberry Hub really does work well. Integration with Exchange ActiveSync accounts works great with full push support, as is expected from a smartphone these days. IM, SMS, BBM, Twitter etc are all integrated into the hub too, and the whole messaging experience is very well thought out. At first blush, it may not seem like a big deal, but there are a lot of little things throughout the UI that make the experience very good. Still waiting for the Skype app though.
- Virtual keyboard is up there with the best.
- Overall gesture-based UI works fine and is quite consistent, at least for apps that are not Android ports.
- Multitasking system is actually pretty nice and much easier to understand conceptually and operate than the multitasking view on, say, Android ICS.
- Game availability is also decent. Some fun titles I have seen so far include Jetpack Joyride, Bejeweled 2 and Chimpact.
- HDMI functionality is also pretty decent. I connected the Z10 to an HDMI monitor using the micro-HDMI port and it mirrored the display on the monitor. However, the Z10's aspect ratio is not exactly 16:9, so there are small black bars on the sides of the mirrored output. That is not a huge issue though, and playing back media, opening apps and web browsing all work fine. Mirroring the display over HDMI appears to have no impact on performance at all. Games also work flawlessly over HDMI.
- Under the hood, it is powered by the POSIX-compliant QNX, and Blackberry provides very good tools for a variety of languages including C/C++ and HTML5+JS. Android apps can also be ported over in many cases as mentioned previously, but the “native” option is to write in C/C++ using the Cascades UI toolkit. I will cover the dev environment separately sometime. Suffice to say, it is very good.
- Maps is one of the weaker areas of Blackberry 10. The default Blackberry Maps app offers turn-by-turn navigation for driving, but it has no support for walking directions or public transport, and it requires a constant data connection with no offline support. Third-party apps, such as Wisepilot, are also not all that great. In short, the offerings do not match up to competitors like Nokia Maps or Google Maps.
- Camera is also only average. Autofocus often takes its sweet time, and image quality is passable. Not bad in a pinch, but nowhere close to my old Nokia N8.
Now that Renderscript Compute supposedly works on GPUs, here are some points to ponder about this strange “compute” API:
1. In OpenCL or CUDA, you specify a thread grid to launch a kernel. In Renderscript, there is no concept of a thread grid. Instead, you specify input and/or output arrays and each “thread” processes one output item. It reminds me of the limitations of the very old GPGPU technologies like the original Brook language, and is essentially similar to a pixel shader model (each shader thread writes one item). You can’t even query the thread ID (similar to say get_global_id() in OpenCL) in Renderscript.
Scatter, meanwhile, is really complicated and inefficient. You cannot scatter writes to the output array directly. However, you can scatter to separately bound arrays, so you have to adopt the following hack:
a) Do not pass the actual input and output arrays directly. Instead, bind the input and output arrays separately as dynamic pointers.
b) Pass an array containing the output indices as the input.
c) For each index in the passed array, do the computation and write to that index.
This is just INEFFICIENT. There is no need for such inefficiency on modern hardware. (See also this stackoverflow thread: http://stackoverflow.com/questions/10576583/passing-array-to-rsforeach-in-renderscript-compute )
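The hack might look something like this in a Renderscript script file (a hedged sketch based on my reading of the docs and the stackoverflow thread; the package name, g_out and compute_value are my own placeholders, not from any official sample):

```c
#pragma version(1)
#pragma rs java_package_name(com.example.scatter)  // hypothetical package name

int32_t *g_out;  // output, bound separately from Java as a dynamic pointer

static int32_t compute_value(int32_t idx) {
    return idx * idx;  // stand-in for the real computation
}

// The "input" allocation passed to rsForEach holds output *indices*, not
// data; each root() invocation scatters one write through the bound pointer.
void root(const int32_t *v_in) {
    int32_t idx = *v_in;
    g_out[idx] = compute_value(idx);
}
```

In OpenCL the same thing is simply `out[some_index] = value;` inside the kernel, with no extra index array or separate binding step.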
2. In Renderscript, the API chooses which device to run your code on. That’s right: you have no idea if your code is running on the CPU, GPU or DSP. The work is supposedly distributed between processors automatically by the Renderscript runtime according to the driver implemented by the SoC vendor, and currently no guidelines are given about how to ensure code runs on the GPU beyond “simple code should run on GPU”.
3. Renderscript’s philosophy is to not expose the actual hardware information and properties to the programmer. OpenCL lets you query a lot of information about the hardware properties, like the amount of local memory available. I guess given that the programmer can’t even decide where to run the code, this is not surprising.
4. CUDA introduced on-chip shared memory, and that concept has been adopted by almost every GPGPU API today, including OpenCL and C++ AMP. However, Renderscript has no concept of on-chip shared memory. Thus, performance will be lower than that of well-optimized OpenCL kernels on many families of GPUs.
5. Renderscript is not available directly from the Android NDK. This is a significant limitation because high-performance applications (such as performance sensitive games) will often be written using the NDK.
Overall I do not think that the current iteration of Renderscript is meant for writing high performance code. Well optimized OpenCL/CUDA/C++ AMP kernels will always significantly outperform Renderscript code simply because Renderscript tries to present a simple abstraction and gives no control over performance. Performance will be entirely dependent upon the Renderscript compiler and driver, and will only come close to an API like OpenCL, CUDA or C++ AMP in very simple cases where the compiler may have the right heuristics built in.
At the same time, Renderscript has very weird programming model limitations, such as the scatter limitation outlined above. I think Renderscript was designed with only one application in mind: simple image processing filters. And as @jimrayvaughn pointed out on twitter, many of those can be done efficiently in GLSL using well-understood techniques.
I hope that the SoC vendors and mobile handset vendors are reading this blog, and I hope that GPGPU on Android does not remain limited to Renderscript. Mobile vendors are wasting the power and potential of modern GPUs by not exposing the full power of the hardware to the developers. If you want to unlock the performance of your GPU, Renderscript is not the solution you are looking for.
Disclaimer: I am not a Renderscript expert. Finding documentation on Renderscript has been very tough, and my comments here are based upon what I could glean from the docs. If you find errors in this article, please point them out and I will update the article.
edited: Added NDK issue.
edited: Originally stated gather requires similar hack. However, gather works just fine. Only scatter is problematic.
I am quite excited about Blackberry 10. RIM’s development tools are looking pretty good. As a C++ programmer, I am pleased with my initial experience with the toolchain. Blackberry 10 comes with a full Qt 4.8 implementation in the firmware. As a Qt programmer, you have 3 options to write apps for BB10:
- Use Cascades. Cascades is a proprietary (but quite cool) UI framework built by RIM. It uses QML as its markup but does not use the QML painting engine at all. Instead, it uses its own painting engine and is incompatible with the QWidget and Qt Quick frameworks. You can either use Cascades to show the UI, or you can use Qt’s painting engine (QtGui or Qt Quick), but not both. However, you can still use QtCore, QtNetwork etc in your app. I experimented a little bit with Cascades and it is pretty nice. For Cascades, I recommend that you simply use the QNX Momentics IDE provided by RIM as part of the standard NDK download. However, I have decided not to pursue this route as I want my code to be platform independent. Currently I am a single developer working on Qt projects for fun, and I prefer to maintain a single code base across platforms as much as possible.
- Use QWidget. QtGui module (the basis of QWidget) is fully supported but if you use QWidget, you cannot use Cascades.
- Use Qt Quick (perhaps with some QWidgets thrown in). This is also supported but will again exclude the use of Cascades.
If you are interested in options 2 or 3 (QWidget or Qt Quick), then I recommend that you use Qt Creator. I tested with Qt Creator 2.6.1 on Ubuntu 12.04 64-bit and things are working fine. I have not managed to get QWidget and Qt Quick working under QNX Momentics; there are always some compilation or build errors for non-Cascades projects.
There are some good instructions for configuring Qt Creator in the NDK documentation as well as on the Qt Project Wiki. I do not have a BB10 device, but I was able to compile a QWidget-based app in Qt Creator and run it in the BB10 simulator.
One piece of advice: ignore the “simulator-debug” configuration section mentioned on the Qt Project Wiki. It appears to be required only for Cascades projects, and trying to make that section work in a QWidget-based app wasted a lot of my time. In the end, I omitted it and things started working. I simply defined the BB10 kit in Qt Creator as described, modified the bar-descriptor.xml example given in the official NDK documentation about porting Qt apps, then added a “blackberry-x86-qcc” spec to QMake as recommended by the Qt Project wiki, and everything worked brilliantly.
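For reference, the packaging descriptor I ended up with looked roughly like this (a from-memory sketch; the id, names and asset path are placeholders, not the exact values from the NDK example):

```xml
<qnx xmlns="http://www.qnx.com/schemas/application/1.0">
    <id>com.example.mywidgetapp</id>
    <name>MyWidgetApp</name>
    <versionNumber>1.0.0</versionNumber>
    <!-- The compiled Qt binary, built with the blackberry-x86-qcc mkspec -->
    <asset path="MyWidgetApp" entry="true" type="Qnx/Elf">MyWidgetApp</asset>
    <!-- Deliberately no "simulator-debug" configuration section here -->
</qnx>
```

The key point is simply how little is needed for a non-Cascades app compared to the wiki's Cascades-oriented example.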
I have not yet tried compiling a Qt Quick based application for BB10, but I expect the process to be similar to the QWidget app above.
Nexus 10 comes with the Mali T604 GPU and is the first device to support Renderscript on the GPU; previously, Renderscript only ran on CPUs. The documentation still remains sparse to non-existent though. For example, it is not clear which code can run on the GPU successfully, and there are no performance guidelines either. There are no good code samples, no books and poor tutorials.
Nexus 10 does not appear to support OpenCL, which is VERY disappointing.
New release of RaijinCL (with preview support for reduction operators) is now up on the RaijinCL site. Get it here. Some more routines, such as matrix-vector multiplication, are landing very soon.
Texas Instruments recently announced their Keystone II chips. Essentially, these combine a multi-core Cortex A15 with DSPs on a single chip. The number of cores and DSP configuration varies depending on the SKU. Here I focus on the top-end SKU 66AK2H12.
The chip has the following integrated:
- 4 Cortex A15 cores @ 1.4 GHz giving 44.8 GFlops SP (NEON), 22.4 GFlops SP (IEEE-754), 11.2 GFlops DP
- 8 C66-family DSPs @ 1.2 GHz giving 153.6 GFlops SP, 57.6 GFlops DP?
- DDR3 memory controller, 2 × 64-bit, up to 1600 MHz, giving 25.6 GB/s bandwidth
- ARM cores: L1 data cache 4 × 32 kB, L1 instruction cache 4 × 32 kB, L2 cache 4 MB shared across cores
- DSP cores: L1 data cache 8 × 32 kB, L1 instruction cache 8 × 32 kB, L2 cache 1 MB/DSP = 8 MB total
- 6 MB of cache (separate from the L2 caches) shared by the DSPs and ARM cores
- Up to 14 W power consumption
- OpenMP programming tools; an alpha version of an OpenCL driver is also available
You should not think of this chip as a GPU-like accelerator. It is intended to be a standalone solution, with the 4 general-purpose ARM cores capable of running any regular ARM applications, including a full Linux OS. Certain parts of your application can be offloaded to the DSPs, or they can be used in concert with the ARM cores. The DSPs themselves have a fairly flexible instruction set, and my understanding is that you can do function calls, recursion etc without issue (correct me if I am wrong; I will confirm from the documentation). The DSPs and the ARM cores both read and write the same memory, eliminating the data-copy bottleneck that exists on many PCIe accelerator type solutions.
The base specifications are looking really good. The perf/W looks to be competitive with GPU based solutions. The low power consumption means that it can be used in many applications where big, power-hungry solutions (such as Teslas or Xeon Phis) are not applicable. The shared memory model is also very enticing for everyone, including, say, supercomputing uses.
TI have a good solution on their hands and should push more aggressively into the HPC space. They should put money into getting libraries such as BLAS and OpenCV optimized for the system. TI should invest in developing good compilers, debuggers and profilers. They should particularly continue to invest in standards-based solutions like OpenMP and OpenCL. As a newcomer and a smaller player, they cannot afford to introduce yet another proprietary solution.
They also need to gain some mindshare as well as marketshare. To gain mindshare, they should make ALL of this available in a nicely packaged fashion with good, descriptive documentation and webpages. They should also make low-cost boards available to really gain some marketshare. People underestimate how convenient Nvidia makes getting and using their tools for CUDA. I can just buy a cheap Nvidia card for a desktop (or buy a decent laptop), download the CUDA SDK for free without any agreements, and off I go. Everything is packaged nicely, easy to find and comes with good documentation. Capturing mindshare IS important, and TI should learn those lessons from Nvidia.
I do wish TI all the best in the HPC field. They have built some solid and interesting technology, and economics also potentially works out as their DSP technology investments can be leveraged in multiple product lines much like how Nvidia is able to use the same designs for both HPC and consumer products. If they invest in building a good software ecosystem around their products, they can certainly compete in this space.
If anyone from TI is reading this, I would love to port all of my software (such as my Python compilers and numerical libraries, see here and here) to your hardware, so please let me know whom I can contact.