Posted by samzenpus on Friday January 20, 2012 @06:28PM
from the read-all-about-it dept.

asgard4 writes "In recent years GPUs have become powerful computing devices whose power is not only used to generate pretty graphics on screen but also to perform heavy computation jobs that were exclusively reserved for high-performance supercomputers in the past. Considering the vast diversity and rapid development cycle of GPUs from different vendors, it is not surprising that the ecosystem of programming environments has flourished fairly quickly as well, with multiple vendors, such as NVIDIA, AMD, and Microsoft, all coming up with their own solutions on how to program GPUs for more general-purpose computing (also abbreviated GPGPU) applications. With OpenCL (short for Open Computing Language) the Khronos Group provides an industry standard for programming heavily parallel, heterogeneous systems, including a C-like language for writing so-called kernels. The OpenCL Programming Guide gives you all the necessary knowledge to get started developing high-performing, parallel applications for such systems with OpenCL 1.1." Keep reading for the rest of asgard4's review.

The authors of the book certainly know what they are talking about. Most of them have been involved in the standardization effort that went into OpenCL. Munshi, for example, is the editor of the OpenCL specification. So all the information in the book is first-hand knowledge from experts in OpenCL. The reader is expected to be familiar with the C programming language and basic programming concepts. Some experience in parallelizing problems is a benefit but not a requirement.

The book consists of two major parts. The first part is a detailed description of the OpenCL C language and the API used by the host to control the execution of programs written in that language. The second part comprises various case studies that show OpenCL in action.

The authors get straight to the point in the introduction, discussing the conceptual foundations of OpenCL in detail. They explain what kernels are (basically functions that are scheduled for execution on a compute device), how the kernel execution model works, how the host manages the command queues that schedule memory transfers or kernel execution on compute devices, and the memory model.
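To make the kernel idea concrete, here is a minimal pure-Python sketch of the execution model. It is not OpenCL and not code from the book: the names `vec_add_kernel` and `run_nd_range` are hypothetical, and the loop only stands in for work-items that a real device would run concurrently.

```python
# Pure-Python illustration of the kernel model; NOT the OpenCL API.
# All names here (vec_add_kernel, run_nd_range) are hypothetical.

def vec_add_kernel(gid, a, b, out):
    """Body of one work-item: handles exactly one element, chosen by its global ID."""
    out[gid] = a[gid] + b[gid]

def run_nd_range(kernel, global_size, *args):
    """Stand-in for enqueueing a kernel over a 1-D index space.

    A real device would run these work-items concurrently; the loop only
    models the fact that each invocation sees a different global ID.
    """
    for gid in range(global_size):
        kernel(gid, *args)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)          # plays the role of an output buffer
run_nd_range(vec_add_kernel, len(a), a, b, out)
print(out)
```

The point of the sketch is the shape of the kernel function: it contains no loop over the data, only the work for a single index.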

While this first chapter is all prose, the second chapter dives right in with some code and a first HelloWorld example. The following chapters introduce more and more of the OpenCL language and API step-by-step. All API functions are described in somewhat of a reference style with a lot of detail, including possible error codes. However, the text is not a reference. There is always a good explanation with examples or short code listings, the only notable exception being chapter three, which presents the OpenCL C language. A few more examples would have made the text less dry in this chapter.

An important chapter is chapter nine on events and synchronization between multiple compute devices and the host. This chapter is important because — as any experienced parallel programmer knows — getting synchronization right is often tricky but obviously essential for correct execution of a parallel program.
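As a loose analogy for event-based ordering, the sketch below uses Python futures in place of OpenCL events. It is an assumption-laden illustration, not the OpenCL API: `produce` and `consume` are hypothetical commands, and a thread pool stands in for a command queue.

```python
# Event-style ordering sketched with Python futures; an analogy for
# OpenCL events, not the OpenCL API. Function names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def produce():
    """First command: fills a buffer."""
    return [i * i for i in range(8)]

def consume(produce_event):
    """Second command: waits on the first command's event before reading."""
    data = produce_event.result()   # blocks until the producer has finished
    return sum(data)

with ThreadPoolExecutor(max_workers=2) as pool:
    ev_produce = pool.submit(produce)              # returns immediately
    ev_consume = pool.submit(consume, ev_produce)  # depends on ev_produce
    total = ev_consume.result()                    # host waits for the last event

print(total)
```

Without the wait inside `consume`, the second command could read the buffer before it is filled, which is the kind of bug the chapter is about.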

An interesting feature in OpenCL is the built-in interoperability with OpenGL and, surprisingly, Direct3D. Various functions in the OpenCL API allow creating buffers from OpenGL/Direct3D objects, such as textures or vertex buffers, that can be used by an OpenCL kernel. This opens up interesting possibilities for doing a lot more work on the GPU in graphics applications, such as running a fluid simulation on the GPU in OpenCL, which directly writes its results into vertex buffers or textures to be used directly for rendering without the host CPU having to intervene.

Before delving into the case studies the book briefly discusses the embedded profile that is available for OpenCL and the standardized C++ API that the Khronos Group provides in addition to the regular OpenCL API (which is defined exclusively as C functions). The C++ API makes using some of the OpenCL objects a little bit easier and somewhat nicer.

The second part of the book contains various interesting case studies that show off what OpenCL can be used for, such as computing a Sobel filter or a histogram for an image, computing FFTs, doing cloth simulation, or multiplying dense and sparse matrices. The choice and variety of case studies is definitely interesting, and most will be immediately applicable to the reader when going forward developing applications using OpenCL. All the code for the examples and the case studies in the book is available for download on the book's website.
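For a flavor of how such a case study decomposes into parallel work, here is a hypothetical pure-Python sketch of an image histogram built from per-group partial histograms. It is not the book's code; the function names and the grouping scheme are assumptions chosen for illustration.

```python
# Hypothetical sketch of a work-group style histogram (not code from the book).

def partial_histogram(pixels, bins=256):
    """Histogram of one group of pixels; groups do not depend on each other."""
    hist = [0] * bins
    for p in pixels:
        hist[p] += 1
    return hist

def histogram(pixels, group_size, bins=256):
    """Split the image into groups, count each group, then merge the counts."""
    groups = [pixels[i:i + group_size] for i in range(0, len(pixels), group_size)]
    # On a device, each group's partial histogram could be computed in parallel.
    partials = [partial_histogram(g, bins) for g in groups]
    if not partials:
        return [0] * bins
    # Merge step: sum the partial histograms bin by bin.
    return [sum(column) for column in zip(*partials)]

image = [0, 1, 1, 255, 3, 3, 3, 0]   # toy 8-pixel grayscale "image"
hist = histogram(image, group_size=3)
print(hist[0], hist[1], hist[3], hist[255])
```

Keeping a private histogram per group avoids many work-items contending for the same counters, at the cost of a final merge.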

Overall, the OpenCL Programming Guide succeeds in being a great introduction to OpenCL 1.1. The book covers all of the specification and more, has an easy-to-read writing style, and yet provides all the necessary details to be an all-encompassing guide to OpenCL. The good selection of case studies makes the book even more appealing and demonstrates what can be done with real-life OpenCL code (and also how it needs to be optimized to get the best performance out of current OpenCL platforms, such as GPUs).

Martin Ecker has been involved in real-time graphics programming for more than 15 years and works as a professional game developer for Sony Computer Entertainment America in sunny San Diego, California.

A GPU is just another computer. Coding for it in OpenCL isn't much different than writing C code that is just a wrapper around some assembly. There is no reason a MUCH more human friendly interface couldn't be made with the compiler taking care of using the appropriate memory and instructions to optimize for GPU usage.

Hell, considering how many people are doing this, it's amazing there isn't even anything approaching a real comprehensive OpenCL tutorial on the web. Just because you CAN learn to use something a

CPUs are an assembly line. If you have a quad-core system, you have 4 assembly lines, and they may be very long. Those 4 assembly lines don't get to talk to each other except on either end. They can all be doing the same activity, or a different activity, and operate asynchronously. When they finish what they are doing, they wipe out the assembly line.

GPUs are synchronous and parallel. Every assembly line in a GPU can only do the same instruction code until cleared. So if there are 2048 assembly lines, each of those does the same instructions, with different pieces of data.
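The "same instructions, different data" idea can be sketched in plain Python. This is a simplified model with hypothetical helper names (`simd_step`, `simd_select`), not how any particular GPU is implemented; it only shows that a branch becomes "every lane runs the instruction, a mask decides who keeps the result".

```python
# Simplified lockstep model; helper names are hypothetical.

def simd_step(op, lanes):
    """Apply one instruction to every lane; lanes differ only in their data."""
    return [op(x) for x in lanes]

def simd_select(mask, if_true, if_false):
    """Per-lane choice between two already-computed results."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

data = [3, -1, 4, -5]
mask = simd_step(lambda x: x < 0, data)      # which lanes take the "negative" branch
negated = simd_step(lambda x: -x, data)      # every lane executes the negate...
result = simd_select(mask, negated, data)    # ...but only masked lanes keep it
print(result)
```

Here the result is the absolute value of each element, computed without any lane running a different instruction from its neighbors.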

So in principle, if you can't parallelize it (eg zlib), it is better run on the CPU. If it can be parallelized (image, video and sound compression, FFT, specific math functions) you can run it on the GPU.

What we haven't done yet is discover any lossless parallelizable compression schemes. The problem is that the more fragments you break it up into, the less compression you can do, because compression is purely serial. Lossy compression, however, is not serial: you can go "here's a 64x64 block of data, compress it", and it will do that on the entire image at once, because those 64x64 blocks don't rely on the compression of any of the other blocks in the image. The compression code may be a simple XOR or a motion vector with the previous image. It can't rely on the neighboring 64x64 blocks.
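The trade-off described here, independent blocks versus compression ratio, can be tried out with a small sketch using Python's standard `zlib` module. The helper names and the 4096-byte block size are assumptions for illustration, and threads are used only to show that the blocks are independent.

```python
# Sketch: compress independent blocks versus the whole input.
# compress_blocks / decompress_blocks are hypothetical helper names.
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data, block_size):
    """Compress fixed-size blocks independently; no block sees another's history."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.compress, blocks))

def decompress_blocks(chunks):
    """Decompress each block and concatenate the results in order."""
    return b"".join(zlib.decompress(c) for c in chunks)

data = b"the quick brown fox jumps over the lazy dog. " * 2000
whole = zlib.compress(data)
chunks = compress_blocks(data, 4096)

# Compare the total sizes; splitting typically costs compression ratio
# because each block starts with an empty history.
print(len(whole), sum(len(c) for c in chunks))
```

Both variants are lossless; what the blocked version gives up is the shared history across block boundaries.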

This is why you see "accelerated" video tear: it doesn't wait for all the fragments in the frame to complete before flipping the video buffer. Adobe Flash is especially guilty of this; you'll see screen tearing on dual-core and quad-core CPU systems because Flash assumes it has 100% use of the CPU, even though that same CPU is doing other stuff. If Flash used the GPU, it would suffer the same problem, since the GPU is still used by the accelerated composited desktop in Windows Vista and 7.

Anyway. CPU programming and GPU programming are completely different animals.

One thing that GPUs have high potential for is independent computations. For example, back in 1992, if you were playing a game, the game could only compute the NPCs that are just off the screen. Today, you could use the GPU to compute all the NPCs' positions simultaneously. This is currently done with physics computations. Not simply doing "AI" on the GPU, but actually creating neural networks for many NPCs to react to the Playing Character, not just simple "is PC visible, shoot it."

What we haven't done yet is discover any lossless parallelizable compression schemes.

Uh, there are a lot of ways to parallelize lossless compression schemes. I've been involved in projects doing this a couple of times over the last decade. One example out of a half dozen I can think of off the top of my head is that the history buffer search in LZ77 can be parallelized. How you go about that will make a huge difference in how fast it is.

The history probe may be parallel but the overall compression isn't because those searches still have to be serially executed - until the probe completes you don't know how much input was consumed or what to start the next probe with. The parallelism doesn't scale beyond speeding up the serial steps.
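Both sides of this exchange show up in a toy LZ77 coder. The sketch below is a simplified illustration, not any production format: the token layout, the 255-byte window, and the minimum match length of 3 are assumptions. The inner search over candidate start positions is independent per candidate, while the outer loop cannot advance until the best match is known.

```python
# Toy LZ77 sketch; token format and parameters are illustrative assumptions.

def match_length(data, pos, start, max_len):
    """Length of the match between data[start:] and data[pos:] (overlap allowed)."""
    n = 0
    while n < max_len and pos + n < len(data) and data[start + n] == data[pos + n]:
        n += 1
    return n

def lz77_encode(data, window=255, max_len=255):
    tokens = []
    pos = 0
    while pos < len(data):   # serial: the next pos depends on this step's match
        starts = range(max(0, pos - window), pos)
        # Each candidate start is independent of the others, so this
        # search is the part that could be spread across work-items.
        lengths = [match_length(data, pos, s, max_len) for s in starts]
        best_len = max(lengths, default=0)
        if best_len >= 3:
            best_start = starts[lengths.index(best_len)]
            tokens.append((pos - best_start, best_len))   # (distance, length)
            pos += best_len
        else:
            tokens.append(data[pos])                      # literal byte
            pos += 1
    return tokens

def lz77_decode(tokens):
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):      # byte by byte, so overlapping copies work
                out.append(out[-dist])
        else:
            out.append(t)
    return bytes(out)

sample = b"abracadabra abracadabra abracadabra"
encoded = lz77_encode(sample)
print(len(sample), len(encoded))
```

The list comprehension is where a parallel probe would go; the `while` loop around it is the serial dependency the reply points at.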

The same applies to decoding variable-length codes. I have SIMD-accelerated yanking the next Huffman code from a bitstream, but I have to know where the first bit is to perform the detection. The overall loop is st

A GPU is a computing device, but it's not another CPU. So while it may be fairly flexible it's still designed with one thing in mind. Debugging isn't nearly as nice as it is on the CPU, you can't do things such as print to the console from within a kernel (on a GPU) without an extension, and if you initiate a very time consuming process on the GPU your monitor will probably be locked until it finishes. Not to mention memory management is difficult on the GPU since you have to think about things such as coal

Coding for it in OpenCL isn't much different than writing C code that is just a wrapper around some assembly. There is no reason a MUCH more human friendly interface couldn't be made with the compiler taking care of using the appropriate memory and instructions to optimize for GPU usage.

As someone who has actually done some OpenCL programming, I can tell you why you're wrong. Learning OpenCL syntax isn't hard; if you know C# you can probably write some useful OpenCL code in just an hour or two. It is, after all, a C-like language just like C# is a C-like language.

That said, don't expect your OpenCL code to run faster than similar C code compiled with SSE. That's because making OpenCL run fast is an exercise in looking at memory access patterns, understanding how to share data between hundreds of threads efficiently, etc. My first OpenCL program was actually slower (by 1/2) than a similar program using all 8 cores of my CPU. I got it on par with the CPU using a top-of-the-line AMD GPU within a day or so, and then spent another two weeks trying different things until finally finding the magic bullet, which removed a memory collision I was having and by itself increased the performance of my routine by ~32x. Running the same code on an NVIDIA GPU put me back in the ballpark of my CPUs again, requiring more time to make it fast on those GPUs. Time I wasn't willing to spend.

The bottom line is that OpenCL could be any language, but what is necessary is the ability to make changes which affect how data is laid out in memory, and how that data is being read/written. Furthermore, you need the ability to specify where the memory is used, because GPUs have an unforgiving memory hierarchy. So if you're not comfortable with the nitty-gritty details of how computers (or in this case GPUs) actually work (not some CompSci abstraction), you're not going to write good OpenCL code. You also need a gut feeling for how fast something could be, based on the specification of a particular device. Otherwise you won't know when to give up.
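One common example of such a layout decision is array-of-structs versus struct-of-arrays. The sketch below is not necessarily the fix the poster found; it just computes which byte offsets neighboring work-items would touch under each layout, assuming 4-byte floats, 4 fields per element, and 1024 elements (all hypothetical numbers).

```python
# Address arithmetic for two data layouts; all sizes are illustrative assumptions.
FLOAT_BYTES = 4

def aos_offsets(n_items, n_fields, field):
    """Offsets read when item i fetches `field` from an array of structs."""
    return [(i * n_fields + field) * FLOAT_BYTES for i in range(n_items)]

def soa_offsets(n_items, n_total, field):
    """Offsets for the same reads when each field lives in its own array."""
    return [(field * n_total + i) * FLOAT_BYTES for i in range(n_items)]

def span_bytes(offsets):
    """Size of the memory range covered by a set of reads."""
    return max(offsets) - min(offsets) + FLOAT_BYTES

aos = aos_offsets(8, 4, 0)      # 8 neighboring work-items, field 0
soa = soa_offsets(8, 1024, 0)
print(aos, span_bytes(aos))
print(soa, span_bytes(soa))
```

In the struct-of-arrays case the eight reads are contiguous, which is the kind of access pattern memory hardware can usually serve in fewer transactions.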

I prefer to save C for the Linux kernel. C# does just fine for regular programming, and doesn't make you hate yourself when you forget to properly terminate a string.

And I know classes have, for some odd reason, fallen out of style for programmers, but I like them. I've tried functional programming, and I just don't like it. I prefer my code to be more...organized / sane.

Yeah, except that the C# binaries run like dogs on the user's computer compared with a C equivalent. I avoid .NET and Java software whenever possible for this reason. If I wanted my workflow to behave like it's on a Pentium 75, I'd just use a Pentium 75.

I'd settle for some choice GPU-optimized libraries. Most seem to be very specific and limited in scope. For instance, I can find a million and one GPU-accelerated libs to find a substring, but so far a basic PCRE lib for any language is completely elusive.

I did some OpenCL in Python with PyOpenCL recently (http://mathema.tician.de/software/pyopencl). I found it very, very easy to get going with. You simply prepare all your data in high-level, friendly Python and then you fire it off to the graphics card and wait for the result. Sure, the OpenCL part is written in a language most resembling C, but there is no need not to use a better tool for your non-computational parts.

Although generality is good, we might ask what the "ideal" abstraction for a particular application is. In my opinion, it is a programming language that is designed precisely for that application: one in which a person can quickly and effectively develop a complete software system. It is not general at all; it should capture precisely the semantics of the application domain - no more and no less. In my opinion, a domain-specific language is the "ultimate abstraction".

One approach is not to program the GPU directly, but to use library-provided (domain-specific) high-level parallel primitives (map, fold, reduce, ...) to describe the computation. The library in question then compiles the final low-level code. These libraries are often implemented as domain-specific embedded languages. The topic is a subject of active research, but some more or less mature implementations already exist, some of which are:

I read this book back in August. I've been using OpenGL for almost 10 years now but knew little to nothing about OpenCL.

This book was really good. There were some typos that I found while reading it (other people had already found and reported them). If you get this book make sure you visit the author's addendum & corrections page.

I agree with the review, 9/10. If there were NO typos at all, it would be 10/10 for me.

WTF is the Khronos Group? Good question. It sure sounds like one of those faux "we really do talk to each other while going our own separate ways" PR initiatives of the African UNIX warlord alliance of so many bland bodies from ages ago whose names we can no longer recall.

Is that it's not really useful for learning OpenCL. Sure, it will teach you the syntax and how to write an OpenCL program. That isn't the problem. The problem is that if you're writing something in OpenCL, you probably want it to be fast. Learning the language is doable by someone with C experience in just a couple of hours with just the SDKs shipped by AMD/NVIDIA/Intel. Learning how to optimize a routine for a particular GPU/etc. is the hard part, and is application specific. It also requires knowledge of how compute devices actually work at an extremely low level. I don't believe this book teaches that. Save your money, download the spec and an SDK for your device. Start reading the architecture docs.

You mean learning how to actually program, where it's the algorithms that make the difference, is difficult? I thought I could buy this book (I have) and then figure out how to break the NSA's latest encryption standards on my iPhone :(

...Could I utilise this programming method to say, encode video streams to a common format, in an efficient (ie fully utilising available GPU/CPU cores) manner? Because right now I have a compute cluster comprising a pair of dual core laptops, one of which has an AMD Radeon HD GPU on-die, the other an Intel chipset GPU (but that's not really important), two P4 desktop machines with NVidia GF7 GPUs and a Sempron box with AMD Radeon HD pci-express. Altogether, that's 7 processor cores and 4 (possibly usable b