Write combining is a technique where writes are accumulated in a small intermediate buffer and then written to memory in a single, larger transaction. This can apparently give a nice boost to write bandwidth. Write-combined memory is not cached, so reads from write-combined memory are still very slow. I came upon the concept of write combining while looking at data transfer from CPU to GPU on AMD's APUs. It turns out that if you use the appropriate OpenCL flags (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY) while creating a GPU buffer on an AMD APU, AMD's driver exposes these buffers as write-combined memory on the CPU. AMD claims that you can write to these buffers at pretty high speeds, so they can act as a fast path for CPU-to-GPU data copies. In addition to regular and write-combined memory, there is also a third type: uncached memory without write combining.
I wanted to understand the characteristics of write-combined and uncached memory compared with "regular" memory allocated using, say, the default "new" operator. On Windows, we can allocate write-combined or uncached memory with the VirtualAlloc function by passing the PAGE_WRITECOMBINE or PAGE_NOCACHE flag respectively. So I wrote a simple test. The code is open-source and can be found here.
For each memory type (regular, write-combined and uncached), we run the following test. We allocate a buffer of the given type, copy data into it from a regular CPU array, and measure the time. We do the copy (to the same buffer) multiple times, timing each run, and report the timing and bandwidth of the first run as well as the average over the subsequent runs. The first-run timings give us an idea of the overhead of first use, which can be substantial. For bandwidth, if I am copying N bytes of data, I report N/(time taken). Some people prefer to report 2*N/(time taken) because they count both the read and the write, so that's something to keep in mind.
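As a rough sketch of what such a test looks like (the actual code in the repo differs in details such as datatypes, run counts and reporting, so treat this as illustrative only):

```cpp
#include <windows.h>
#include <chrono>
#include <cstdio>
#include <cstring>

int main() {
    const size_t N = 32u << 20;        // ~32MB buffer, as in the test
    char* src = new char[N]();         // regular source array

    // PAGE_WRITECOMBINE (or PAGE_NOCACHE for uncached memory) is OR'd
    // into the usual page-protection flags.
    void* dst = VirtualAlloc(NULL, N, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE | PAGE_WRITECOMBINE);

    for (int run = 0; run < 20; ++run) {
        auto t0 = std::chrono::high_resolution_clock::now();
        memcpy(dst, src, N);           // the timed copy
        std::chrono::duration<double> dt =
            std::chrono::high_resolution_clock::now() - t0;
        // Bandwidth reported as N/(time taken); run 0 shows the first-use cost.
        std::printf("run %2d: %.2f GB/s\n", run, N / dt.count() / 1e9);
    }

    VirtualFree(dst, 0, MEM_RELEASE);
    delete[] src;
}
```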
I ran the test on a laptop with an AMD A10-5750M (Richland), 8GB of 1600MHz DDR3, Windows 8.1 x64 and VS 2013 x64.
For "double" datatype arrays (size ~32MB), the average bandwidth was 3.8 GB/s for regular memory, 5.7 GB/s for write-combined memory and 0.33 GB/s for uncached memory. These averages do not include the first run. The first-use penalty was substantial: the first run took about 22 ms for regular memory, 81 ms for write-combined memory and 164 ms for uncached memory. Clearly, if you are only transferring the data once, write-combined memory is not the best solution. With these numbers, you need around 20 runs for write-combined memory to break even on total copy time: the roughly 59 ms of extra first-use cost is recovered at about 3 ms saved per steady-state run (roughly 8.8 ms vs 5.9 ms per 32MB copy). But if you are going to be reusing the buffer many times, write-combined memory is a definite win.
Sony published a blog post about OpenCL being available on Xperia devices such as the Xperia Z, ZL, ZR, Z Ultra, Z1 and Tablet Z. These are Snapdragon S4 Pro and Snapdragon 800 devices and, according to Sony, they come with OpenCL drivers for the Adreno GPUs. Sony also indicates that they intend to continue supporting OpenCL. Very good news.
Android 4.4 has been announced and brings an interesting new feature: Renderscript Compute is now accessible from C++ code in the NDK. There is very little documentation at the moment, or at least I haven't found it, but as of this writing you can download NDK r9b and look at the HelloComputeNDK sample in the samples folder. It looks like a fully native solution, i.e. as far as I can tell you do not need to write any Java code to access it. The Android 4.4 release notes say that you should be able to access all Renderscript functionality, including intrinsics and user-defined kernels, from the NDK. This is quite a nice development and kudos to the Renderscript team, but they do desperately need to address other concerns such as documentation. I still think they need to provide access to a lower-level API, such as OpenCL, to complement the high-level Renderscript. But I guess such is life.
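I haven't dug through the sample in detail yet, but based on the C++ wrapper headers (rsCppStructs.h), the native path looks roughly like the sketch below. Names and signatures here are from memory and may not match the shipped sample exactly:

```cpp
#include <RenderScript.h>
#include <cstdint>

using namespace android::RSC;

// Blur an RGBA image from native code using the blur intrinsic.
void blurImage(uint8_t* pixels, uint32_t width, uint32_t height) {
    sp<RS> rs = new RS();
    rs->init();  // later NDK revisions take the app's cache-dir path here

    sp<const Type> t = Type::create(rs, Element::RGBA_8888(rs), width, height, 0);
    sp<Allocation> ain  = Allocation::createTyped(rs, t);
    sp<Allocation> aout = Allocation::createTyped(rs, t);
    ain->copy2DRangeFrom(0, 0, width, height, pixels);

    // One of the built-in intrinsics; user-defined ScriptC kernels are
    // supposed to be usable from here as well.
    sp<ScriptIntrinsicBlur> blur =
        ScriptIntrinsicBlur::create(rs, Element::RGBA_8888(rs));
    blur->setRadius(3.0f);
    blur->setInput(ain);
    blur->forEach(aout);

    aout->copy2DRangeTo(0, 0, width, height, pixels);
}
```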
As an aside, the release notes also say that Android 4.4 enables GPU acceleration of Renderscript (I am assuming only for Filterscript) on all recent Nexus devices (Nexus 4, Nexus 5, Nexus 10, Nexus 7 2013), with the exception of the Tegra 3 powered Nexus 7 2012, which has an ancient GPU architecture.
Wrote an article about Altera’s SDK for OpenCL for their Stratix FPGAs. You can find it here. If you find any correctness issues, or if you have any comments, let me know.
Finally I have managed to port a small desktop app to Android and Blackberry 10. The app was based on QWidgets and Qt 4.8. Unfortunately the app is not for open distribution, so I can't share it. However, I can share some experiences:
Blackberry 10: I just switched to the Blackberry-provided qmake, ran qmake and make, and the app compiled without any changes for Blackberry 10. However, packaging the app requires a few additional steps. You have to obtain signing keys (which I already had), provide a bar-descriptor.xml and then run the blackberry-nativepackager command. The process is generally well-documented and went fine. I was able to test on a Blackberry Z10. I initially ran into an issue because I had forgotten to specify the proper file-access permissions in the bar-descriptor, but once I corrected that, things ran "ok". Most of the widgets actually rendered fine and the layout scheme worked as expected. Things were not perfect, though. For example, the file picker dialog was not very nice, but at least the app was serviceable. The lack of integration with BB10 UI conventions is an issue, so I will have to see whether there is a way to integrate things like swipe-down menus into a pure QWidget (or perhaps Qt Quick 1/2) based app without using Cascades.
Android: For Android, I moved the app to Qt 5.1 and used Qt Creator 2.8.1 to set everything up. Changing to Qt 5.1 required very minimal changes, such as a couple of ifdefs for changed header file names. Qt Creator asked for the Android SDK and NDK paths, and after that it set up most things automatically, including built-in emulator support. If you connect a device over adb, it is recognized as well for deployment and testing. Qt Creator automatically generated unsigned APKs, but it doesn't seem to generate signed APKs, or at least I couldn't figure out how.
Overall, things were going smoothly on Android... until I actually ran the app. It turns out QWidgets look really ugly on Android, and some of the layout doesn't work the way you would want. Some widgets, like the QComboBox used for implementing drop-down menus, did not render correctly, while others, like the file dialogs, were ugly and almost unusable. After a few careful changes I was able to make it work, but the integration was still poor.
I would say that, at least on Android, you should try to avoid QWidgets and use QML instead. QML will give you far more control over the look, feel and layout of your app. I do hope that something like Qt Quick Controls comes to Android, because raw QML is a bit too much DIY. The other issue is that I am still not clear on how Qt-based apps can integrate with Android UI guidelines, such as handling the back button. I am also not clear on how you can invoke, say, the email application or the web browser. I need to figure these out next.
Wrote about the Android 4.3 update for Nexus devices removing OpenCL, and about various Renderscript/OpenCL issues, on Anandtech.
If you see any inaccuracy, let me know.
(This was originally posted on a forum in reply to Tim Murray of Google. Posting it here for readers who come to my blog for Renderscript information).
(TLDR: Renderscript is a fine idea, and I think has good intentions, but let people experiment with alternate solutions too.)
Tim, I see and partially agree with your vision. I understand that mobile architectures are complicated. There are considerations such as dynamic power distribution and shared memory bandwidth that one had not seen on the desktop (before, say, Ivy Bridge). I agree that trying to write 6 different codepaths for 6 different architectures is hard. But I do not agree that Renderscript solves these issues either. At present, all it does (compared to, say, CUDA) is prevent the programmer from specifying certain parameters.
Think about it this way. You can divide programming tools into two categories:
a) Close to the metal. OpenCL is fairly close to the metal, exposing individual devices, the memory hierarchy and thread dispatch mechanics.
Even OpenCL, however, leaves some room for optimization by the driver. For example, you can leave out the thread-group size and let the driver choose a suitable one (see the sketch after this list). Vectorization may also be performed by the driver (for example, Intel's driver does this on Ivy Bridge). It is also possible to write OpenCL drivers that automatically use local memory (on-chip shared memory in CUDA parlance) if the programmer does not. But let us ignore even OpenCL's compiler-optimization possibilities and think of it as close to the metal.
b) Middleware solutions that offer higher-level programming languages. These typically include a supposedly smart compiler and some kind of scheduler plus runtime. Renderscript falls in this category. My current research area happens to be this exact field, and I am a big proponent of the need for more productive languages in parallel computing, so I sympathize with your goals.
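To make the group-size example under (a) concrete, here is a minimal host-side sketch (assuming an already created command queue and kernel) of deferring that choice to the OpenCL driver:

```cpp
#include <CL/cl.h>

// Enqueue `kernel` over `n` work-items without fixing a thread-group size:
// passing NULL as local_work_size tells the driver to choose one itself.
cl_int enqueueWithDriverChosenGroupSize(cl_command_queue queue,
                                        cl_kernel kernel, size_t n) {
    return clEnqueueNDRangeKernel(queue, kernel, 1,
                                  NULL, /* global work offset */
                                  &n,   /* global work size */
                                  NULL, /* local work size: driver picks */
                                  0, NULL, NULL);
}
```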
But I think Renderscript is essentially just one particular middleware solution. The Renderscript compiler and drivers will have one particular set of heuristics. You asked: how can the developer code for an architecture he has not seen before? As a compiler writer, I often face the opposite issue: how can you ensure that your compiler and driver have the right heuristics for algorithms and use cases you have never seen?
You said yourself that mobile architectures are more complicated than typical workstations, and people still haven't solved building good middleware for workstations. A compiler that attempts to automatically schedule computations on the right hardware needs at least some performance model of the application and an idea of how it will map to the architecture. This is very hard (and completely unsolved) even on "simple" desktop architectures. How do you think a driver will do this automatically on mobile, where things are more complicated by your own admission? We have been trying to solve some of the same problems in our lab, and I think years of work remain.
There is also the issue you raised of over-optimizing for a particular architecture with OpenCL. This can certainly happen, but I think it can very well happen with Renderscript too. Just to give an example: let's say the Renderscript driver on my machine always happens to choose the CPU. It is entirely possible to write and ship an algorithm that performs wonderfully on my particular device's CPU but performs very badly on devices where the driver happens to choose the GPU. CPU algorithms are not always suitable for GPUs, for example, so having the same source code for both can be more disastrous than trying to run code optimized for one CPU on another. No matter how smart the compiler is, it cannot replace the algorithm with another one.
I think a better approach is to let people build different middleware for different types of applications. Let a thousand middlewares bloom, and more particularly, let a thousand domain-specific tools bloom. This cannot be done on top of Renderscript. Middleware built on top of another layer of middleware with an undocumented (and potentially ever-changing) set of optimizations is a very bad idea, but it can be done on top of lower-level interfaces like OpenCL. For example, let game engine programmers decide where and how they want to run their physics code. Let people build domain-specific tools like Halide and let them choose how, when and for which architecture to optimize. Let people build their own dynamic schedulers (like StarPU) and experiment with whichever scheduling algorithm suits them best. I understand Googlers are smart, but so are people at Unity or MIT or many other places.
I am not saying Renderscript is bad. I think Renderscript is a good idea that tries to tackle a real issue, and I think you should continue developing it further. But it is not, and can never be, a good solution for everyone, and forcing people to use only this tool will limit the exploration of alternate technical solutions. This is why choice is important. Limiting the choice to one middleware solution, which happens to choose one particular set of parameters in this vast, unexplored design space of middleware solutions, will be bad for everyone in the long run.
UPDATE: The 6900 series support for double precision seems fine after all, but Trinity/Richland remains unexplained.
AMD has two VLIW4-based product lines: the Radeon 6900 series and the GPUs in Trinity/Richland APUs (Radeon 7660D, 8650G, etc.). Some of the launch media coverage stated that these GPUs have fp64 capability. I recently got a Richland-based system to work on and realized the following:
a) AMD does not support cl_khr_fp64 (i.e. the standard OpenCL extension for fp64) on the 8650G GPU and only supports cl_amd_fp64 (a sketch of how to check this follows the list below). AMD's documentation is not very clear about the difference between the two.
b) Earlier driver versions for Trinity (which, as far as I know, has the same silicon as Richland) definitely had cl_khr_fp64 support, but it was later removed and demoted to cl_amd_fp64 only.
c) Richland’s GPU (8650G) does not seem to support double precision under Direct3D either.
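For reference, here is a small sketch of how to check which fp64 extension a device actually reports. On these parts the extension string contains cl_amd_fp64 but not cl_khr_fp64, and kernels consequently need "#pragma OPENCL EXTENSION cl_amd_fp64 : enable" rather than the khr pragma:

```cpp
#include <CL/cl.h>
#include <cstring>

// Returns 2 if the device reports cl_khr_fp64, 1 if it reports only
// cl_amd_fp64, and 0 if it reports no fp64 support at all.
int fp64Flavor(cl_device_id dev) {
    char exts[8192] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(exts), exts, NULL);
    if (std::strstr(exts, "cl_khr_fp64")) return 2; // standard fp64 extension
    if (std::strstr(exts, "cl_amd_fp64")) return 1; // AMD-specific variant only
    return 0;
}
```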
Forum postings indicate that the latest drivers for 6900 series GPUs also no longer support cl_khr_fp64, only cl_amd_fp64. I am not sure about the fp64 support status under DirectCompute.
My speculation is that AMD discovered some issue with IEEE compliance in the fp64 units of the VLIW4 GPUs, and hence is unable to support APIs where full IEEE compliance is required. If anyone has any insight into the issue, let me know.
Currently, most of my GPGPU experience is with OpenCL and CUDA. I have recently been looking at DirectCompute as another IHV-neutral API besides OpenCL. I have tried porting some of my OpenCL code to DirectCompute to gain experience. Here are some notes, in no particular order, from the perspective of writing compute code which has no graphics component:
1. The basic programming paradigm is similar to OpenCL 1.2 and basic CUDA. You have threads organized into thread groups, you have access to local memory (on-chip shared memory in CUDA parlance), and synchronization etc. is fairly similar as well.
2. However, it is far behind the functionality in CUDA 5.x and OpenCL 2.0. For example, there is no support for dynamic parallelism. Microsoft may be considering adding such features, but with no public roadmap it is difficult to say anything. DirectCompute has not really evolved much since it started shipping with Windows 7 in late 2009, almost 4 years ago.
3. No support for multiple command queues per context. CUDA has streams and OpenCL can create multiple command queues per context, but DirectCompute appears to have only one implicit command queue per device context. I think this will be a problem in many compute scenarios.
4. CPU-GPU shared memory support is very limited. D3D 11.2 introduces some features that take one step towards shared memory, but it is not fully there yet. In OpenCL, we already have decent shared memory support under OpenCL 1.2 on Intel platforms, and OpenCL 2.0 is going to bring proper shared virtual memory support on many platforms.
5. Double-precision support in HLSL is limited. There are no trigonometric or exponential functions, and on Windows 7 you don't even get double-precision FMA or divide in the shader bytecode. You can potentially implement the missing functions yourself, but a serious compute API should include them. Using Microsoft's C++ AMP instead of DirectCompute takes care of some of this on Windows 8.
6. Vendor tools are geared for games and graphics applications. Profilers from various vendors all provide “per frame” analysis, which is useful for graphics applications but useless for pure compute scenarios. OpenCL and CUDA tools are geared for compute and are getting pretty good. I think this will again be different for C++ AMP.
7. Driver quality for DirectCompute is far more consistent across vendors than for OpenCL. With OpenCL, it is not uncommon to run into frustrating bugs in various drivers. Also, driver writers sometimes interpret the OpenCL spec quite "creatively", which is very frustrating and often requires multiple codepaths even in host API code. DirectCompute drivers are far more robust and less buggy, and program behavior is usually what you expect across all vendors.
8. Hardware-vendor-independent shader bytecode is great to have in DirectCompute. OpenCL SPIR will tackle this, but it is not yet implemented.
9. The thread-group size is a compile-time constant in DirectCompute (see the sketch after this list). In OpenCL and CUDA, you can delay specifying the group size until dispatch and can use a different group size in every invocation. Even OpenGL compute shaders are getting this ability with a new extension (GL_ARB_compute_variable_group_size).
10. The documentation is not that great. I guess I am used to downloading the OpenCL specs directly and reading them, while MSDN is a bit harder to navigate. For example, the Direct3D 11.2 docs are essentially diffs over D3D 11.1, which makes it hard to get the complete, up-to-date picture in one place. Vendor documentation is also woefully inadequate on many DirectCompute-related things. For example, just trying to find out which GPUs from a given vendor support all double-precision instructions and which don't is hard. Vendors also don't seem to bother providing detailed optimization guides for DirectCompute.
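To illustrate point 9: in HLSL the group size is baked into the shader source at compile time, while OpenCL defers the choice to each dispatch. A minimal comparison (the HLSL lives in a string here only to keep everything in one snippet):

```cpp
// HLSL: [numthreads] fixes the group size when the shader is compiled;
// ID3D11DeviceContext::Dispatch() then only specifies the *number* of groups:
//     context->Dispatch(numGroupsX, 1, 1);
const char* hlslSource = R"(
[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    // ... kernel body ...
}
)";

// OpenCL: the group size is an argument at enqueue time, so every invocation
// may use a different one (or NULL to let the driver pick):
//     size_t global = n, local = 64;
//     clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
//                            0, NULL, NULL);
```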
My experience is limited, however, and it is likely I have gotten some things wrong. If you have any corrections to offer, please let me know.
Overall, I feel that if your app is not already using Direct3D, you probably should not use DirectCompute. You are probably better off choosing OpenCL for many compute scenarios: OpenCL has some technical advantages over DirectCompute as outlined above, is a more future-proof and platform-independent path, and today has much better documentation and tooling for pure compute. Alternatively, if you want to stick to the Microsoft stack, you are probably better off choosing C++ AMP over DirectCompute.
If we look only at programmability and floating-point performance, the progress we have made on GPUs is remarkable. Consider the following:
- Xbox 360 (2005 console): 240 GFlops and DirectX 10 level (mostly)
- GTX 280 (mid-2008 flagship): 622 GFlops, DirectX 10 and CUDA 1.0
- AMD Richland 8650G (integrated, 2013): 550+ GFlops, DirectX 11 and OpenCL 1.2
- Intel Iris Pro 5200 (integrated, 2013): 650+ GFlops, DirectX 11 and OpenCL 1.2
Integrated graphics today, with a TDP of perhaps 20W for the graphics component, offers more floating-point performance than a flagship GPU from just 5 years earlier. Bandwidth constraints still remain, though potential solutions are emerging, whether on-package eDRAM as on Intel's parts or GDDR5 as in the PS4. But it is impressive to see that integrated GPUs have advanced so much.