Some things I want to experiment with as a fun exercise:
F# : F# is an awesome language that I started experimenting with a few months back. While I know the basic syntax of many parts of the language now, I am nowhere near proficient. Even in my brief time with the language, I am enjoying it a lot. It is more productive than even Python and light years ahead of C++. Looking forward to playing with some of the interesting features such as async workflows and type providers. Two books that I have on my radar are “F# deep dives” and “Purely functional data-structures” by Okasaki.
C# and .net ecosystem: Playing with F# actually has made me interested in playing with more .net technologies. From the productivity standpoint, C# looks a lot nicer than Java or C++ and does have some interesting technologies such as async construct and LINQ. On the performance side, CLR looks like a good JIT and a lot of innovation and pragmatic decisions seem to have been taken in the .net ecosystem. For example, inclusion of value types, SIMD types introduced recently with RyuJIT and libraries such as TPL should make it possible to write reasonably high-performance code despite CLR being a managed runtime. Recent open-sourcing of the .net core is also an interesting move.
ZeroMQ: I don’t have much experience with message-queue based systems and ZeroMQ looks like a good place to start. Have heard a lot of good things about it.
C++11: I have read up on many of the features in C++11, and have a basic understanding, but have not used them in non-trivial ways so I am not confident about them yet. Overall I am not at all liking where C++ is going. However, as a professional programmer who works a lot with C++ I feel I should keep myself updated because I expect to see more C++11 going forward.
OpenCL 2.0: I have read the specs and am familiar with many of the features theoretically but want to spend some time with features such as device-side enqueue and SVM to see the types of algorithms that are now possible on modern hardware.
Direct3d 11 and 12: Well, quite self-evident. Going with the .net theme, I might try out SharpDX instead of going the native route.
I was reading the Anandtech review of the Nvidia Shield Tablet, and in particular its power consumption data. While at first glance the GPU performance looked very impressive, the battery life data provided by the authors, Joshua Ho and Andrei Frumusanu, gives very good insights. Consider the battery life of the Tab S 8.4 (using an Exynos chipset with Mali graphics) and the Shield Tablet running GFXbench 3.0. We can get the average power consumption as (battery energy in Wh)/(battery life in hours). They tested the Shield Tablet in two modes: default (i.e. high performance) and capped performance. They reported observing GPU frequencies of ~750MHz and ~450MHz in the two modes respectively. The battery life for the capped mode is inferred from the graph at about 14000 seconds (3.88 hours). For a very rough comparison, we will also compare with phablets such as the Galaxy Note 3.
This gives us the following data:
1. Nvidia shield tablet (default, ~750MHz): 8.8W
2. Nvidia shield tablet (capped, ~450MHz): 5.09W
3. Tab S 8.4 (default): 5.5W
4. Galaxy Note 3: 3.1W
It is immediately obvious that in the default mode, the Shield Tablet is consuming way too much power compared to the Tab S. Given the massive difference in power consumption from just reducing the GPU frequency, and the fact that the Shield Tablet gives good battery life results in non-GPU-bound tests, it is clear that most of the ~9W of power is being consumed by the Tegra K1’s GPU.
The power consumption data is for the whole device, and hence includes components such as the screen, which can be very different across display types and sizes. So we can only make very rough calculations here. As a guess, let us assume that the components other than the SoC and DRAM are consuming ~1.5W in the tablets and ~0.8W in the phone.
We get the following (VERY ROUGH) data for SoC + DRAM power consumption:
1. Tegra K1 (default, 750MHz): 7.3W
2. Tegra K1 (capped, 450MHz): 3.6W
3. Exynos 5420 (tablet): 4W
4. Snapdragon 800 (phone): 2.3W
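The arithmetic behind both lists is simple enough to sketch in a few lines of Python. This is only an illustration of the calculation, not anything from the review: the function names are made up, and the 19.75 Wh battery energy is an assumption back-computed from the 5.09W and 3.88 hour figures above rather than a number quoted from Anandtech.

```python
# Rough power arithmetic from the post. All inputs are the post's own
# estimates and guesses; nothing here is measured.

def average_power_w(battery_wh, battery_life_hours):
    """Average device power = battery energy / battery life."""
    return battery_wh / battery_life_hours


def soc_dram_power_w(device_w, other_components_w):
    """Subtract the guessed screen/other-component power from device power."""
    return device_w - other_components_w


# Assumption: implied by 5.09 W x 3.88 h, not taken from the review.
SHIELD_BATTERY_WH = 19.75

# name -> (device power in W, guessed non-SoC power in W)
devices = {
    "Tegra K1 (default, ~750MHz)": (8.8, 1.5),
    "Tegra K1 (capped, ~450MHz)": (5.09, 1.5),
    "Exynos 5420 (tablet)": (5.5, 1.5),
    "Snapdragon 800 (phone)": (3.1, 0.8),
}

# Rounded to one decimal place, since the inputs are rough anyway.
estimates = {
    name: round(soc_dram_power_w(device_w, other_w), 1)
    for name, (device_w, other_w) in devices.items()
}

if __name__ == "__main__":
    capped = average_power_w(SHIELD_BATTERY_WH, 3.88)
    print(f"Shield Tablet (capped) device power: {capped:.2f} W")
    for name, watts in estimates.items():
        print(f"{name}: {watts} W")
```

Changing the two guessed constants (1.5W and 0.8W) is the quickest way to see how sensitive the SoC + DRAM numbers are to those assumptions.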
Overall, I think it is quite reasonable to state that Tegra K1’s stated GPU frequency targets of ~900MHz are not realizable in devices such as phones. I get the feeling that the Shield Tablet has been built more as a showcase device where the maximum GPU frequency has been set a bit too high in order to win benchmarks. I think if Tegra K1 ever ships in phones, it is likely that the GPU frequency will not exceed ~450MHz, and the GPU will not perform any better than its current mobile competitors. Perhaps Tegra K1 (particularly its GPU) is better suited to larger devices such as large tablets and ultraportable laptops where it can stretch its legs more.
Recently, there was some discussion about a set of microbenchmarks reported in a study called Clash of the Lambdas which compared a simple stream/sequence benchmark using Java 8 Streams, Scala, C# LINQ and F#. I am learning F# and as a learning exercise I decided to re-implement one of the benchmarks (Sum of Squares Even) myself in F# without referring to the code provided by the authors.
The source of my implementation can be found on Bitbucket and binaries are also provided. My interest was to test/compare various F# implementations and not cross-language comparison. I implemented it in four different ways:
- Imperative sequential for-loop
- Imperative parallel version using Parallel.For from Task Parallel Library
- Functional sequential version using F# sequences
- Functional parallel version using F# PSeq from FSharp.ParallelSeq
- UPDATE: I added a functional version using the Nessos Streams package as suggested by Nick Palladinos on twitter
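For readers who have not seen the benchmark, here is a rough Python sketch of the three basic shapes (imperative loop, functional pipeline, chunked parallel reduction). It is not my F# code, and it assumes that "Sum of Squares Even" means: keep the even values, square them, and add them up. The function names are made up for illustration, and the thread-based version only shows the structure of a parallel reduction; with CPython's GIL it will not actually run faster.

```python
# Python illustration only; the implementations discussed in the post are F#.
from concurrent.futures import ThreadPoolExecutor


def sum_sq_even_loop(values):
    """Imperative version: one explicit loop with an accumulator."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * v
    return total


def sum_sq_even_pipeline(values):
    """Functional version: filter -> map -> sum, evaluated lazily."""
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return sum(squares)


def sum_sq_even_chunked(values, workers=4):
    """Parallel-shaped version: split into chunks, reduce each, add partial sums."""
    chunk = max(1, (len(values) + workers - 1) // workers)
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_sq_even_loop, parts))


if __name__ == "__main__":
    data = list(range(1, 11))
    print(sum_sq_even_loop(data))
    print(sum_sq_even_pipeline(data))
    print(sum_sq_even_chunked(data))
```

The pipeline version is the one that reads most like the F# sequence code: each stage is a separate, composable step, which is also why it leaves more work for the compiler or runtime to optimize away.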
I compiled using VS 2013 Express and F# 3.1 with “Release” settings, Any CPU (32-bit not preferred) and ran it on my machine on 3 different CLR implementations: MS CLR from .net SDK 4.5.2 running on Windows 8.1, MS CLR with RyuJIT CTP4, and finally Mono 3.4 (sgen GC, no LLVM) on OpenSUSE 13.1.
The results are as follows:
|MS RyuJIT CTP4||18||7||168||76||44|
Some observations for this microbenchmark:
- Imperative version is far faster than the functional version, but the functional version was shorter and clearer to me. I wonder if there is some opportunity for compiler optimizations in the F# compiler for the functional version, such as inlining sequence operations or fusing a pipeline of operations where possible.
- MS RyuJIT CTP4, which is the beta version of the next-gen MS CLR JIT, is performing similarly to the current MS CLR. This is good to see.
- Mono is much slower than the MS CLR. Also, it absolutely hates F# parallel sequences for some reason. I guess I will have to try and install Mono with LLVM enabled and then check the performance again.
- Streams package from Nessos looks to be faster than F# sequences in this microbenchmark. It is currently sequential only but performs much faster than even PSeq.
These observations only apply to this microbenchmark, and probably should not be considered as general results. Overall, it was a fun learning experience, especially as a newcomer to both F# and the .net ecosystem. F# looks like a really elegant and powerful language and is a joy to write. There is still a LOT more to learn about both. For example, I am not quite clear what the best way to distribute .net projects as open-source is. Should I distribute VS solution files? I am more used to distributing build files for CMake, Make, scons, ant etc., so I am looking more into FAKE. NuGet is also nice-ish: it appears to be useful but not very powerful (eg: can’t remove packages) and merits further investigation.
Getting F# running on Linux took a lot more effort than I anticipated. I am documenting the process here in the hope it may benefit someone (maybe myself) in the future. For reference, I am using OpenSuse 13.1.
- F# is not compatible with all versions of Mono. For example, my distro repos have Mono 3.0.6 which appears to have some issues with F#. Instead, I found some people make new Mono packages available for various distros using Opensuse Build Service (OBS). For example, check out tpokorra repos for various distros such as OpenSUSE, CentOS, Debian etc. I installed “mono-opt” and related packages. It installed mono 3.4 into /opt/mono directory.
- If you install mono into /opt/mono, then make sure you append “/opt/mono/lib” to the LD_LIBRARY_PATH environment variable and “/opt/mono/bin” to the PATH variable. I did this in my .bashrc.
- By default, /opt/mono/bin/mono turned out to be a symlink to /opt/mono/bin/mono-sgen. It appears that Mono has two versions: one using the sgen GC and one using the Boehm GC. I have had trouble compiling F# using mono-sgen, so I removed /opt/mono/bin/mono and then recreated it as a symlink to /opt/mono/bin/mono-boehm.
- Now open up a new shell. In this shell, set up a few environment variables temporarily required for building F#. First, “export PKG_CONFIG_PATH=/opt/mono/lib/pkgconfig”. Next, we need to set up some GC parameters for Mono. It turns out compiling F# requires a lot of memory and Mono craps out with the default GC parameters. I have a lot of memory in my laptop, so I set the Mono GC to use up to 2GB as follows: “export MONO_GC_PARAMS=max-heap-size=2G”. These two settings likely won’t be required after you have compiled and installed F#.
- Now you can follow the instructions given on the F# webpage.
Specifically I did
- git clone https://github.com/fsharp/fsharp
- cd fsharp
- ./autogen.sh --prefix /opt/mono #Keep things consistent with rest of mono install
- make #Takes a lot of time
- make install
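If you would rather not touch your .bashrc while experimenting, the environment changes above can be collected in one place and handed to the build. The sketch below is just one way to do that in Python; the helper names are hypothetical, the /opt/mono prefix is the one from my install, and the GC option is spelled max-heap-size here, which is the name I believe Mono's documentation uses.

```python
# Collects the environment tweaks described above into one dictionary.
# Helper names are made up; paths assume Mono was installed into /opt/mono.

def _append_path(existing, new_entry):
    """Append new_entry to a colon-separated path list (Linux convention)."""
    return f"{existing}:{new_entry}" if existing else new_entry


def fsharp_build_env(base_env, prefix="/opt/mono", max_heap="2G"):
    """Return a copy of base_env with the variables needed to build F#."""
    env = dict(base_env)
    env["PATH"] = _append_path(env.get("PATH", ""), f"{prefix}/bin")
    env["LD_LIBRARY_PATH"] = _append_path(
        env.get("LD_LIBRARY_PATH", ""), f"{prefix}/lib"
    )
    # Only needed while building F#, per the notes above.
    env["PKG_CONFIG_PATH"] = f"{prefix}/lib/pkgconfig"
    env["MONO_GC_PARAMS"] = f"max-heap-size={max_heap}"
    return env


if __name__ == "__main__":
    import os

    for key in ("PATH", "LD_LIBRARY_PATH", "PKG_CONFIG_PATH", "MONO_GC_PARAMS"):
        print(key, "=", fsharp_build_env(os.environ)[key])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when invoking the build steps from a script, which keeps the temporary settings out of your normal shell.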
As I get somewhat older, I have come to the realization that attempting to keep track of all the buzzwords and current fashions in the software industry is counterproductive to actual work. At some point, I have to draw a line. Trends, programming styles, languages and APIs all come and go.
It is important to keep oneself updated, but like anything else, it should be done in moderation. It is also perhaps more important to get better at fundamental computer science ideas than necessarily knowing 10 APIs to do the same thing.
Anyway, I intend to be a bit more selective about which technologies I learn, focusing on a few at any given point of time. The idea is not to stand still, but rather to have an attention span longer than a goldfish’s. For the rest of 2014, I have made a much bigger learning list about fundamental CS ideas and a much shorter list of “technologies”. I will keep posting about some of the books I am reading.
I have been reading some Metal API documents. Some brief notes about Metal compute from the perspective of pure compute and not graphics:
Kernels: If you know previous GPU compute APIs such as OpenCL or CUDA, you will be at home. You have work-items organized in work-groups. A work-group has access to up to 16kB of local memory. Items within a work-group can synchronize but different work-groups cannot synchronize. You do have atomic instructions on global and local memory. You don’t have function pointers and, while the documentation doesn’t mention it, likely no recursion either. There is no dynamic parallelism either. You also cannot do dynamic memory allocation inside kernels. This is all very similar to OpenCL 1.x.
Memory model: You create buffers and kernels read/write buffers. Interestingly, you can create buffers from pre-allocated memory (i.e. from a CPU pointer) with zero copy provided the pointer is aligned to page boundary. This makes sense because obviously on the A7, both CPU and GPU have access to same physical pool of memory.
I think the CPU and GPU cannot simultaneously write to a buffer. The CPU is only guaranteed to see updates to a buffer when the GPU command completes execution, and the GPU is only guaranteed to see CPU updates if they occur before the GPU command is “committed”. So we are far from HSA-type functionality.
Currently I am unclear about how pointers work in the API. For example, can you store a pointer value in a kernel, and then reload it in a different kernel? You can do this in CUDA and OpenCL 2.0 “coarse grained” SVM for example, but not really in OpenCL 1.x. I am thinking/speculating they don’t support such general pointer usage.
Command queues: This is the point where I am not at all clear about things but I will describe how I think things work. You can have multiple command queues similar to multiple streams in CUDA or multiple command queues in OpenCL. Command queues contain a sequence of “command buffers” where each command buffer can actually contain multiple commands. To reduce driver overhead, you can “encode” or record commands in two different command buffers in parallel.
Command queues can be thought of as in-order but superscalar. Command buffers are ordered in the order they were encoded. However, API keeps track of resource dependencies between command buffers and if two command buffers in sequence can be issued in parallel, they may be issued in parallel. I am speculating that the “superscalar” part applies to purely compute driven scenarios, and will likely apply more to mixed scenarios where a graphics task and a compute task may be issued in parallel.
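To make the "in-order but superscalar" idea concrete, here is a toy Python model of how I imagine the grouping working. It is pure speculation on my part and not how Metal's driver is documented to behave: the `CommandBuffer` class, the `reads`/`writes` resource sets and the `issue_groups` function are all invented for this sketch. Buffers are walked in encode order, and a buffer joins the current group only if it has no read/write hazard against anything already in that group.

```python
# Toy model of in-order, superscalar issue of command buffers.
# Everything here is hypothetical; it is not the Metal API.
from dataclasses import dataclass


@dataclass
class CommandBuffer:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def conflicts(earlier, later):
    """True if 'later' must wait for 'earlier' because of a shared resource."""
    return bool(
        earlier.writes & later.reads      # read-after-write
        or earlier.writes & later.writes  # write-after-write
        or earlier.reads & later.writes   # write-after-read
    )


def issue_groups(buffers):
    """Group buffers in encode order; each group could be issued in parallel."""
    groups = []
    for cb in buffers:
        if groups and not any(conflicts(prev, cb) for prev in groups[-1]):
            groups[-1].append(cb)   # independent of the current group
        else:
            groups.append([cb])     # hazard (or first buffer): start a new group
    return groups


if __name__ == "__main__":
    queue = [
        CommandBuffer("A", writes=frozenset({"buf0"})),
        CommandBuffer("B", writes=frozenset({"buf1"})),
        CommandBuffer("C", reads=frozenset({"buf0"}), writes=frozenset({"buf2"})),
    ]
    for group in issue_groups(queue):
        print([cb.name for cb in group])
```

Because issue is in-order, a buffer never jumps ahead of an earlier one; independence only lets neighbours overlap, which is the sense in which I am using "superscalar".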
GPU-only: Currently only works on GPUs, and not say the CPU or the DSP.
Images/textures: Haven’t read this yet. TODO.
Overall, Metal is similar in functionality to OpenCL 1.x, and it is more about having niceties such as C++11 support in the kernel language (the static subset) so you can use templates, overloading, some static usage of classes etc. Graphics programmers will also appreciate the tight integration with the graphics pipeline. To conclude, if you have used OpenCL or CUDA, then your skills will transfer over easily to Metal. From a theory perspective it is not a revolutionary API, and does not bring any new execution or memory model niceties. It is essentially Apple’s take on the same concepts, focused on tackling practical issues.
There has been a lot of discussion about driver overhead in graphics and compute APIs recently. A lot of it has been centred around desktop-type scenarios with discrete GPUs. But I just wanted to point out that driver overhead matters even more on SoCs which integrate both CPU and GPU on the same chip.
The simple reason is SoCs have a fixed total power budget and modern SoCs dynamically distribute power budget between CPU and GPU. If there is a lot of driver overhead, which means CPU is doing a lot of work, then CPU eats a bigger part of the fixed power budget and thus the SoC may be forced to reduce the GPU frequency. In addition to power, caches and memory bandwidth may also be shared.
I have done some benchmarking and tuning of OpenCL code for Intel’s Core chips, and getting the best performance out of the GPU often required being more efficient on the CPU. I am pretty sure a similar strategy is applicable on smartphone SoCs, with the added constraint that smartphone CPUs are usually wimpy due to power constraints.
UPDATE: Issues related to texture arrays appear to be an application error. Michael Marks provides a fork that corrects some issues.
UPDATE 2: I reran some of the Linux benchmarks, as the earlier Linux results appear to have had a bug. Performance on Linux and Windows is now similar.
The strengths and weaknesses of OpenGL compared to other APIs (such as D3D11, D3D12 and Mantle) and the recent talk Approaching Zero Driver Overhead (AZDO) have become topics of hot discussion. The AZDO talk included a nice tool called “apitest” that allows us to compare a number of solutions in OpenGL and D3D. Hard data is always better than hand-wavy arguments. In the AZDO talk, data from “apitest” was shown for Nvidia hardware but no numbers were given for either Intel or AMD hardware. Michael Marks ran the tool on Linux and had some interesting results to report, which imply that AMD’s driver has higher overhead than Nvidia’s driver.
However, I wanted to answer slightly different questions. For example, if we just restrict ourselves to AMD hardware, how does the OpenGL performance compare to D3D? What is the performance and compatibility difference between Windows and Linux? And what is the performance of various approaches across hardware generations? With these questions in mind, I built and ran apitest on some AMD hardware on both Linux and Windows.
Hardware: AMD A10-5750M APU with 8650G graphics (VLIW4) + 8750M (GCN) switchable graphics. Catalyst 14.4 installed on both Linux and Windows. Catalyst allows explicit selection of graphics processor. Laptop has a 1366×768 screen.
Build: On Windows, built for Win32 (i.e. 32-bit) using VS 2012 Express and DX SDK (June 2010). Release setting was used. On Linux, built for 64-bit using G++ 4.8 on OpenSUSE 13.1. Required one patch in SDL cmake file.
Run: Tool was run using “apitest.exe -a oglcore -b -t 15” which is the same setting as Michael Marks. On Linux, it was run under KDE and desktop effects were kept disabled in case that makes a difference.
I encountered some issues. I am not sure if the error is in the application, the user (i.e me) or the driver.
- Solutions using shader draw parameters (often abbreviated as SDP in the talk) appear to lead to driver hangs on GCN and are unsupported on VLIW4. Therefore I have not reported any SDP results here. Michael Marks also saw the same driver hangs on GCN on Linux, did some investigation and has posted some discussion here.
- Solutions involving ARB_shader_image_load_store (which is core in OpenGL 4.2 and not some arcane extension) appear to be broken on Windows but are working on Linux despite installing the same Catalyst version. On Windows, the driver appears to be reporting a compilation error for some shaders saying that “readonly” is not supported unless you enable the extension. UPDATE: This was an application bug.
- GCN based 8750M should support bindless textures. However, some of the bindless based solutions failed to work. For example GLBindlessMultiDraw failed. Sparse bindless also failed to work.
I did not test 8750M on Linux, partially because I am lazy and partially because I did not want to disturb my Linux setup which I use for my university work. Anyway, here is the data for 3 problems covered by apitest.
|Solution||8650G Windows (FPS)||8650G Linux (FPS)||8750M Windows (FPS)|
|Solution||8650G Windows (FPS)||8650G Linux (FPS)||8750M Windows (FPS)|
|Solution||8650G Windows (FPS)||8650G Linux (FPS)||8750M Windows (FPS)|
- The theoretical principles discussed in the AZDO talk appear to be sound. The “modern GL” techniques discussed do appear to substantially reduce driver overhead compared to older GL techniques. The reduction was seen on AMD hardware on both Windows and Linux and worked on two different architectures (VLIW4 based APU, GCN based discrete). In particular, persistent buffer mapping (sometimes called PBM) and multi-draw-indirect (MDI) based techniques seem useful.
- On Windows, the best OpenGL solutions do appear to significantly outperform D3D. I am not an expert on D3D so I am not sure if better D3D11 solutions exist.
- If a test ran successfully on both Windows and Linux, then the performance was qualitatively similar in most cases.
- However, while theoretically things look good, in practice some issues were encountered. Some of the solutions failed to execute despite theoretically being supported by the hardware. In particular, shader draw parameters as well as some variations of bindless textures appear to be problematic. I am not sure if it was the fault of the application, the user (me) or the driver.
Well, I really got this one wrong. Previously I had (mistakenly) claimed that OpenGL compute shader, OpenGL ES compute shader etc. don’t really have specified minimums for some things but I got that completely wrong. I guess I need to be more careful while reading these specs as the required minimums are not necessarily located at the same place where they are being explained. Some of the OpenCL claims still stand though OpenCL’s relaxed specs are a bit more understandable given that it has to run on more hardware than others.
Here are the minimums:
- OpenGL 4.3: 1024 work-items in a group, and 32kB of local memory
- OpenGL ES 3.1: 128 work-items in a group, and 16kB of local memory (updated from 32kB, Khronos “fixed” the spec)
- OpenCL 1.2: 1 work-item in a group, 32kB of local memory
- OpenCL 1.2 embedded: 1 work-item in a group, 1kB of local memory
- DirectCompute 11 (for reference): 1024 work-items in a group, 32kB of local memory
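One practical use of this list is checking whether a kernel configuration is safe to hard-code or whether you must query the implementation first. The Python sketch below just encodes the minimums listed above as a lookup table; the function names are made up for illustration, and the numbers are only as good as my reading of the specs.

```python
# Required minimums from the list above: (work-items per group, local memory bytes).
GUARANTEED_MINIMUMS = {
    "OpenGL 4.3": (1024, 32 * 1024),
    "OpenGL ES 3.1": (128, 16 * 1024),
    "OpenCL 1.2": (1, 32 * 1024),
    "OpenCL 1.2 embedded": (1, 1 * 1024),
    "DirectCompute 11": (1024, 32 * 1024),
}


def is_guaranteed(api, work_items, local_bytes):
    """True if every conforming implementation of 'api' must support this config."""
    min_items, min_local = GUARANTEED_MINIMUMS[api]
    return work_items <= min_items and local_bytes <= min_local


def apis_without_query(work_items, local_bytes):
    """APIs where this config can be hard-coded without querying limits."""
    return [
        api for api in GUARANTEED_MINIMUMS
        if is_guaranteed(api, work_items, local_bytes)
    ]


if __name__ == "__main__":
    # Example: 256 work-items per group using 16kB of local memory.
    print(apis_without_query(256, 16 * 1024))
```

For anything outside the guaranteed range you still have to query the actual limits at runtime and pick a work-group size that fits.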
Thanks to Graham Sellers and Daniel Koch on twitter for pointing out the error. I guess I got schooled today.
UPDATE: This post is just plain wrong. See correction HERE. Thanks to various people on twitter, especially Graham Sellers and Daniel Koch for pointing this out.
Just venting some frustration here. One of the annoying things in Khronos standards is the lack of required minimum capabilities, which makes writing portable code that much harder. The minimum guarantees are very lax. Just as an example, take work-group sizes in both OpenCL and OpenGL compute shaders. In both of these, you have to query to find out the maximum work group size supported which may turn out to be just 1.
Similarly, in OpenGL (and ES) compute shaders, there is no minimum guaranteed amount of local memory per workgroup. You have to query to ask how much shared memory per workgroup is supported, and the implementation can just say zero because there is no minimum mandated in the specification.
edit: Contrast this with DirectCompute where you have mandated specifications for both the amount of local memory and the work-group sizes which makes life so much simpler.