DirectX 12 will be unveiled soon. Given that DirectCompute forms the core of Microsoft's GPGPU computing stack (it powers their C++ AMP implementation, for example), I really hope that DirectCompute 12 delivers on the compute side. Unlike OpenCL, DirectCompute has had the massive advantage of tight integration with a graphics API, which is why DirectCompute has been adopted in 3D apps such as games where OpenCL really wasn't. However, DirectCompute has fallen behind the times, with little support for modern GPU features, and this in turn hurts GPGPU solutions such as Microsoft's implementation of C++ AMP. Five things I am hoping to see:
1. Support for shared-memory architectures, eliminating CPU-GPU data transfers where possible and also allowing platform-level atomics (see the C++ AMP sketch after this list). This will benefit everything from mobile (where SoCs rule the roost and discrete GPUs basically don't exist) to servers.
2. Exposing multiple command queues per GPU, thus allowing concurrent execution of kernels.
3. Launching GPU kernels from within GPU kernels.
4. A stable low-level bytecode that is more suitable as a target for high-level compilers. D3D 11 has a bytecode, but ISVs are not given a proper specification and are discouraged from using it. This needs to be opened up to enable third-party compilers for high-level languages.
5. Compatibility with Windows 8.
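To illustrate the first point: C++ AMP today already expresses the intent of avoiding copies through array_view, but on discrete GPUs the runtime still has to stage data across the bus. A shared-memory-aware DirectCompute could make a sketch like the following genuinely zero-copy. This is a minimal, hypothetical example; the kernel and sizes are made up:

```cpp
#include <amp.h>
#include <vector>

// Doubles each element of a host vector on the accelerator. With a true
// shared-memory architecture exposed by the API, the array_view below could
// alias the host allocation directly instead of triggering a hidden copy.
void scale_on_gpu(std::vector<float>& data) {
    using namespace concurrency;
    array_view<float, 1> av(static_cast<int>(data.size()), data);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
        av[idx] *= 2.0f;
    });
    av.synchronize(); // on shared memory this could be (nearly) free
}
```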
Broadcom has decided to open-source their graphics driver for one of their VideoCore IV powered Android chipsets. This is an awesome and welcome step. They have also released an architecture manual documenting many of the details. I will try to summarize some of what is known about VideoCore IV so far.
VideoCore IV refers to a family of closely-related GPUs. Implementations have shown up in various chipsets: for example, the BCM2835 used in the Raspberry Pi, the BCM2763 used in several Nokia Symbian Belle handsets (e.g. the Nokia PureView 808, 701, 700, etc.), the BCM21553 in Android handsets such as the Samsung Galaxy Y, and the BCM28155 in Android handsets such as the Samsung Galaxy SII Plus.
Overview: Various chipsets have their own peculiarities. In the Raspberry Pi and Nokia flavors, the VideoCore IV consists of two distinct processors. The first is the actual programmable graphics core, which I will refer to as the PGC. The second is a coprocessor. This embedded processor, not to be confused with the main CPU, runs its own operating system and handles almost all the actual work of the OpenGL driver. For example, on the Raspberry Pi and the Nokia flavors, shader compilation is done on this embedded processor and not on the main CPU. The OpenGL driver on these devices is just a shim that passes calls to the embedded coprocessor via an RPC-like mechanism. My speculation (low confidence) is that the BCM21553, for which Broadcom released the source code, does not have the embedded coprocessor and the driver runs on the main CPU. The Nokia variants add another wrinkle: they feature 128MB of on-package LPDDR2 memory dedicated to the GPU, separate from the 512MB of system RAM in these devices, providing a (for the time) high-bandwidth graphics RAM. The Raspberry Pi does not have this dedicated memory and its GPU reads/writes main memory.
GPU core: VideoCore's PGC is a tile-based renderer (TBR). Apart from the fixed-function parts, the programmable portion of the chip is organized into "slices", which are similar to, say, "compute units" in GCN. Each slice consists of up to 4 SIMD units called QPUs, one special function unit (SFU), one or two texture and memory units (TMUs), as well as some caches. The architectural diagram shows up to 4 slices, but I guess the actual number may vary between chipsets (not confirmed).
QPU (SIMD ALUs): Each QPU consists of two SIMD ALUs. The ALUs are not symmetric: each is physically 4-wide (i.e. 128-bit), but one is an "add" unit and the other a "mul" unit, handling floating-point adds and multiplies respectively, along with some other operations such as integer and logical ops. The QPU is a dual-issue processor, capable of feeding one add and one mul instruction per cycle to the respective units. Logically, each ALU in the QPU is a 16-way machine that executes a 16-way instruction over 4 cycles. Overall, each QPU can perform 8 flops/cycle, so each slice can do up to 32 flops/cycle. Each QPU has access to 4kB of registers, as well as a few accumulators. The registers are organized as two register files of 2kB each, and each register file holds 32 vector registers, where each vector register is 64 bytes (16 x 4 bytes), which makes sense given the 16-way logical view of the QPU. Each QPU can run two threads.
Memory (TMUs and VPM): The TMUs have their own L1 cache, and there is also a separate L2 cache that is shared across slices. Cache sizes are unknown. QPUs read/write vertex data through a separate path called the Vertex Pipe Manager (VPM). The VPM is a system-wide shared unit and appears to have a buffer of either 8kB or 16kB. It performs DMA from main memory to read/write vertex data into the buffer, and is essentially optimized for moving vectors of data between main memory and the QPUs' vector register files. Vertex fetch is general enough to implement memory gather operations, but it is not clear if scatter is also supported.
RPi and Conclusions: Consider the Raspberry Pi. We already know that the published frequency is 250MHz, that the QPUs can do 24 GFLOPS, and that the TMUs can do 1.5 GTexel/s. Thus, per clock, the GPU performs 96 flops/cycle and 6 texels/cycle. Likely, this is achieved through 3 slices, each with 4 QPUs and 2 TMUs. Overall, VideoCore IV is an interesting architecture. Performance-wise, the implementation in the Raspberry Pi does not compare to modern mobile GPUs such as the Adreno 330 or the Mali T600 series, but then again the Raspberry Pi is using an old SoC that was meant to be cost-conscious even at the time. For a low-cost GPU, VideoCore IV looks to be quite competent. It will be interesting to see what Broadcom is cooking up for VideoCore V.
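Here is the arithmetic behind that guess, written out as a tiny snippet. The clock, GFLOPS and GTexel/s figures are the published ones; the mapping to 3 slices of 4 QPUs and 2 TMUs is my inference, not something Broadcom has confirmed:

```cpp
#include <cstdio>

int main() {
    const double clock_hz    = 250e6; // published GPU clock
    const double peak_flops  = 24e9;  // published 24 GFLOPS
    const double peak_texels = 1.5e9; // published 1.5 GTexel/s

    const double flops_per_cycle  = peak_flops  / clock_hz; // 96
    const double texels_per_cycle = peak_texels / clock_hz; // 6

    // Each QPU does 8 flops/cycle (dual-issue: 4-wide add + 4-wide mul),
    // so 96 flops/cycle implies 12 QPUs, i.e. 3 slices of 4 QPUs each.
    // 6 texels/cycle then matches 3 slices with 2 TMUs each.
    printf("flops/cycle = %.0f -> %.0f QPUs\n", flops_per_cycle, flops_per_cycle / 8);
    printf("texels/cycle = %.0f -> %.0f TMUs\n", texels_per_cycle, texels_per_cycle);
    return 0;
}
```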
Wrote two articles recently for Anandtech. “A deep dive on HSA” was a theoretical look at HSA and related technologies such as HSAIL, hUMA and hQ as well as the programming tools infrastructure. Next I wrote a bit about the floating point performance (both on CPU and GPU) of recent Intel and AMD chips including Kaveri.
Do check them out and let me know what you think.
I have been using Qt Creator on Linux for a while now for my C++ based projects. Qt Creator has a nice editor with fairly speedy autocomplete and good refactoring support. I typically use CMake, and Qt Creator has decent support for CMake under both Windows and Linux, as well as decent integration with tools like Mercurial. On Windows, however, I was using Visual Studio (either 2010 or 2013, depending on the project dependencies) for C++. While VS has a nice debugger, its editing and refactoring functionality for C++ is quite a bit behind Qt Creator in my experience. I finally got around to using Qt Creator on Windows, even for non-Qt projects, and I find that it offers the same great experience there. The install process was pretty painless and Qt Creator detected all my tools (various VS versions, the CMake binary, etc.).
Kudos to Digia and the Qt Project for the great tools. I am constantly amazed at the quality of work they put out. I only use VS now for occasional debugging. Thoughts welcome.
Write combining is a technique where writes may get buffered into a temporary buffer and then written to memory in a single, larger transaction. This can apparently give a nice boost to write bandwidth. Write-combined memory is not cached, so reads from write-combined memory are still very slow. I came upon the concept of write combining while looking at data transfer from CPU to GPU on AMD's APUs. It turns out that if you use the appropriate OpenCL flags (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY) while creating a GPU buffer on an AMD APU, then AMD's driver exposes these buffers as write-combined memory on the CPU. AMD claims that you can write to these buffers at pretty high speeds, so this can act as a fast path for CPU to GPU data copies. In addition to regular and write-combined memory, there is also a third type: uncached memory without write combining.
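For reference, the OpenCL side of that fast path looks roughly like the following sketch. The flags are the ones mentioned above; the idea of filling the buffer from a mapped pointer and the buffer size are just illustrative assumptions on my part:

```cpp
#include <CL/cl.h>
#include <string.h>

/* Create a device buffer that AMD's APU driver can back with
   write-combined host memory, then fill it from the CPU via map/unmap. */
cl_mem create_and_fill(cl_context ctx, cl_command_queue queue,
                       const float *src, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                bytes, NULL, &err);

    /* Mapping for write gives the CPU a pointer into the (write-combined)
       allocation; the memcpy below is the "fast path" write. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, &err);
    memcpy(p, src, bytes);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```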
I wanted to understand the characteristics of write-combined memory, as well as uncached memory, compared with "regular" memory allocated using, say, the default "new" operator. On Windows, we can allocate write-combined or uncached memory using the VirtualAlloc function by passing the flags PAGE_WRITECOMBINE and PAGE_NOCACHE respectively. So I wrote a simple test. The code is open-source and can be found here.
For each memory type (regular, write-combined and uncached), we run the following test. The test allocates a buffer, copies data from a regular CPU array into the buffer, and measures the time. We do the copy (to the same buffer) multiple times, measure the time of each copy, and report the timing and bandwidth of the first run as well as the average of the subsequent runs. The first-run timings give us an idea of the overhead of first use, which can be substantial. For bandwidth, if I am copying N bytes of data, then I report bandwidth computed as N/(time taken). Some people prefer to report bandwidth as 2*N/(time taken) because they count both the read and the write, so that's something to keep in mind.
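The core of the test is quite small. A minimal sketch of the write-combined case looks like this (the buffer size, iteration count and use of std::chrono for timing are my choices here, not necessarily what the actual test code does):

```cpp
#include <windows.h>
#include <chrono>
#include <cstring>
#include <cstdio>
#include <vector>

int main() {
    const size_t bytes = 32u * 1024 * 1024;   // ~32MB, as in the test
    std::vector<char> src(bytes, 1);          // regular source array

    // PAGE_WRITECOMBINE must be combined with an access protection;
    // swap in PAGE_NOCACHE to test the uncached case instead.
    void* dst = VirtualAlloc(nullptr, bytes, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE | PAGE_WRITECOMBINE);

    for (int run = 0; run < 10; ++run) {
        auto t0 = std::chrono::high_resolution_clock::now();
        memcpy(dst, src.data(), bytes);
        auto t1 = std::chrono::high_resolution_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        printf("run %d: %.2f ms, %.2f GB/s\n", run, s * 1e3, bytes / s / 1e9);
    }
    VirtualFree(dst, 0, MEM_RELEASE);
    return 0;
}
```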
I ran the test on a laptop with an AMD A10-5750M (Richland), 8GB of 1600MHz DDR3, Windows 8.1 x64, and VS 2013 x64.
For "double" datatype arrays (size ~32MB), the average bandwidth was 3.8GB/s for regular memory, 5.7GB/s for write-combined memory and 0.33GB/s for uncached memory. The bandwidth reported here is for the average runs, not including the first run. The first-use penalty was found to be substantial: the first run took about 22ms for regular memory, 81ms for write-combined memory and 164ms for uncached memory. Clearly, if you are only transferring data once, then write-combined memory is not the best solution. In this case you need around 20 runs for write-combined memory to break even in terms of total copy time (roughly: each average copy takes ~8.4ms for regular memory vs ~5.6ms for write-combined, so the ~59ms difference in first-run cost is recovered after about 21 copies). But if you are going to be reusing the buffer many times, then write-combined memory is a definite win.
Sony published a blog post about OpenCL being available on Xperia devices such as Xperia Z, ZL, ZR, Z Ultra, Z1 and Tablet Z. These are Snapdragon S4 Pro and Snapdragon 800 devices and according to Sony come with OpenCL drivers for the Adreno GPUs. They also indicate that they intend to continue to support OpenCL. Very good news.
Android 4.4 has been announced and brings an interesting new feature: Renderscript Compute is now accessible from C++ code via the NDK. There is very little documentation at the moment, or at least I haven't found it, but you can download NDK r9b (the latest at the time of writing) to see a HelloComputeNDK sample in the samples folder. It looks like a fully native solution, i.e. you do not need to write any Java code to access it afaik. The Android 4.4 release notes say that you should be able to access all Renderscript functionality, including intrinsics and user-defined kernels, from the NDK. This is quite a nice development and kudos to the Renderscript team, but they do desperately need to address other concerns such as documentation. I still think that they need to provide access to a lower-level API, such as OpenCL, to complement the high-level Renderscript. But I guess such is life.
As an aside, the release notes also say that Android 4.4 enables GPU acceleration of Renderscript (I am assuming only for Filterscript) on all recent Nexus devices (Nexus 4, Nexus 5, Nexus 10, Nexus 7 2013), with the exception of the Tegra 3 powered Nexus 7 2012, which has an ancient GPU architecture.
Wrote an article about Altera’s SDK for OpenCL for their Stratix FPGAs. You can find it here. If you find any correctness issues, or if you have any comments, let me know.
Finally, I have managed to port a small desktop app to Android and Blackberry 10. The app was based on QWidgets and Qt 4.8. The app was not for open distribution, unfortunately, so I can't share it. However, I can share some experiences:
Blackberry 10: I just switched to the Blackberry-provided qmake, ran qmake and make, and the app compiled without any changes for Blackberry 10. However, packaging the app requires a few additional steps. You have to obtain signing keys (which I already had), provide a bar-descriptor.xml and then run the blackberry-nativepackager command. The process is generally well-documented and went fine. I was able to test on a Blackberry Z10. I initially ran into an issue because I had forgotten to specify the proper file-access permissions in the bar-descriptor, but once I corrected that, things ran "ok". Most of the widgets actually rendered fine and the layout scheme worked as expected. However, things were not perfect either. For example, the file picker dialog was not very nice, but at least the app was serviceable. The lack of integration with BB10 UI conventions is an issue though, so I will have to see if there is a way to integrate things like swipe-down menus into a pure QWidget (or perhaps Qt Quick 1/2) based app without using Cascades.
Android: For Android, I used Qt 5.1 and Qt Creator 2.8.1 to set everything up. Changing to Qt 5.1 required very minimal changes, such as a couple of ifdefs for changed header file names. Qt Creator asked for the Android SDK and NDK paths, and after that set up most things automatically, including built-in simulator support. If you connect a device over adb, it is recognized as well for deployment and testing. Qt Creator was automatically generating unsigned APKs but doesn't seem to generate signed APKs, or at least I couldn't figure out how.
Overall, things were going smoothly on Android .. until I actually ran the app. It turns out QWidgets look really ugly on Android and some of the layouting doesn't really work the way you would want. Some widgets, like the QComboBox used for implementing drop-down menus, did not render correctly, while others, like file dialogs, were ugly and almost unusable. After a few careful changes I was able to make it work, but the integration was still poor.
I would say that, at least on Android, you should try to avoid QWidgets and use QML instead. QML will allow you far more control over the look, feel and layout of your app. I do hope that something like Qt Quick Controls comes to Android, because raw QML is a bit too much DIY. The other issue is that I am still not clear on how Qt based apps can follow Android UI guidelines, for example integrating with the back button. I am also not clear on how you can invoke, say, the email application or the web browser from a Qt app. Need to figure these out next.
UPDATE: I think the 6900 series' double-precision support is fine after all, but Trinity/Richland remains unexplained.
AMD has two VLIW4 based product lines: the Radeon 6900 series and the GPUs in Trinity/Richland APUs (Radeon 7660D, 8650G, etc.). Some of the launch media coverage stated that these GPUs have fp64 capability. I recently got a Richland based system to work on and realized the following:
a) AMD does not support cl_khr_fp64 (i.e. the standard OpenCL extension for fp64) on the 8650G GPU and only supports cl_amd_fp64. But AMD’s documentation is not very clear about the difference.
b) Earlier driver versions for Trinity (which afaik has the same silicon as Richland) definitely had cl_khr_fp64 support, but it was later removed and demoted to only cl_amd_fp64.
c) Richland’s GPU (8650G) does not seem to support double precision under Direct3D either.
Forum postings indicate that the latest drivers for 6900 series GPUs also do not support cl_khr_fp64, and only support cl_amd_fp64. I am not sure about the fp64 support status under DirectCompute.
My speculation is that AMD discovered some issue with IEEE compliance on fp64 units in the VLIW4 GPUs and hence AMD is unable to support APIs where full IEEE compliance is required. If anyone has any insight into the issue, then let me know.
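If you want to check what your own driver reports, the extension string is easy to query. Here is a minimal sketch (device selection is hard-coded to the first GPU of the first platform for brevity, and error checking is omitted):

```cpp
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char ext[8192];

    // Grab the first GPU device of the first platform and dump its extensions.
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);

    printf("cl_khr_fp64: %s\n", strstr(ext, "cl_khr_fp64") ? "yes" : "no");
    printf("cl_amd_fp64: %s\n", strstr(ext, "cl_amd_fp64") ? "yes" : "no");
    return 0;
}
```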