>> Tuesday, August 30, 2011

I've been discussing the possibility of writing an article for Intel's Visual Computing site. The proposed article involves accelerating ray-triangle intersection on GPUs with OpenCL. This computation is necessary to convert 2-D mouse clicks into 3-D object selection.

The Intel reviewer seems happy with my outline, but wants me to emphasize balancing the computational load between CPUs and GPUs. This makes sense for Intel, which doesn't make OpenCL-compliant GPUs. But there's a problem. OpenCL devices can only share buffer data, such as vertex colors and coordinates, if they belong to the same context. For an overview of devices and contexts, I recommend this article.

The OpenCL spec doesn't mention this explicitly, but OpenCL devices can only be placed in the same context if they belong to the same platform. Platforms correspond to device vendors, so if you have an Intel CPU and an AMD GPU, the devices belong to different platforms. Therefore, they can't be placed in the same context, which means they can't share buffer data. Intel doesn't make OpenCL-compliant GPUs and Nvidia doesn't make CPUs, so the only way I know to place a CPU and a GPU in the same context is if both devices are made by AMD. When I demonstrate DynLab at SC11, that's the setup I'm going to use.

For more information, I recommend this thread on the Khronos forum. If I were on the OpenCL Working Group, my first priority for OpenCL 2.0 would be making sure developers can create contexts with devices in different platforms.

>> Thursday, August 18, 2011

While searching for information on OpenGL's uniform buffer objects, I stumbled across an online book called Learning Modern 3D Graphics Programming. This is the best book on OpenGL 3.x/4.x I've encountered. It's well-written and the material proceeds gracefully from the simple to the complex. There are plenty of code samples, and unlike the examples in the OpenGL SuperBible, the code calls the OpenGL API directly instead of hiding it behind a custom set of helper functions.

The book is so polished that I wish there were a PayPal link or some other way to reimburse the author. I'd also like to know if he intends to finish the final chapters. But there's no e-mail link, and except for this brief biography, I can't find any information on Jason McKesson.

>> Tuesday, August 16, 2011

OpenGL provides two main methods of combining vertices into primitives (lines, triangles, etc.):

glDrawArrays/glMultiDrawArrays - For each primitive, every vertex is sent to the GPU. If Vertex A is connected to four primitives, Vertex A will be sent to the GPU four times.

glDrawElements/glMultiDrawElements - Vertices are sent in one large block and an index list is used to determine which vertices belong to which primitive. If Vertex A is connected to four primitives, Vertex A will only be sent to the GPU once.

The first method is simpler, but in general, the second method provides better performance. This is because CPU-GPU data transfer is a significant time sink and the second method usually results in less data to transfer. This is particularly true when primitives are combined to form complex 3-D shapes like spheres and tetrahedra.
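To put rough numbers on the difference, here's a minimal C sketch that counts the bytes each method transfers for a triangle mesh. It assumes each vertex carries only three float coordinates and that the indexed method uses 16-bit indices (the size of GLushort); the function names are hypothetical and not part of OpenGL.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: each vertex holds only a position made of three floats. */
#define FLOATS_PER_VERTEX 3

/* glDrawArrays-style transfer: every triangle carries its own copy of
   each of its three vertices. */
size_t bytes_non_indexed(size_t num_triangles) {
    return num_triangles * 3 * FLOATS_PER_VERTEX * sizeof(float);
}

/* glDrawElements-style transfer: each unique vertex is sent once, and
   each triangle refers to its vertices through three 16-bit indices. */
size_t bytes_indexed(size_t num_vertices, size_t num_triangles) {
    return num_vertices * FLOATS_PER_VERTEX * sizeof(float)
         + num_triangles * 3 * sizeof(uint16_t);
}
```

For a tetrahedron (four vertices shared by four triangles), bytes_non_indexed(4) gives 144 bytes and bytes_indexed(4, 4) gives 72, assuming 4-byte floats. The gap widens as each vertex is shared by more triangles or carries more attributes, such as colors and normals.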

But I want to process vertices on the GPU using OpenCL work-items. In this case, it may be easier to have each work-item access a separate group of vertices than to have multiple work-items access the same input data through an index list. As long as each work-item processes different vertices, I don't need to synchronize them with barriers. So I'm currently using glMultiDrawArrays instead of glMultiDrawElements.
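Using glMultiDrawArrays this way means the host has to expand any indexed geometry so each triangle owns its own vertices. Here's a minimal C sketch of that step, assuming a hypothetical 3-float position layout and 16-bit indices; expand_indexed_mesh is an illustrative name, not a library function. With this layout, a kernel that assigns one triangle per work-item could read the nine floats starting at 9 * get_global_id(0) without touching any other work-item's input.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: expands an indexed triangle list into a flat array
   in which each triangle owns its three vertices (3 floats per vertex).
   Returns a malloc'ed array of num_indices * 3 floats, or NULL if the
   allocation fails. The caller must free() the result. */
float *expand_indexed_mesh(const float *positions, const uint16_t *indices,
                           size_t num_indices) {
    float *flat = malloc(num_indices * 3 * sizeof(float));
    if (flat == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < num_indices; i++) {
        /* Copy the position referenced by the i-th index */
        const float *src = positions + 3 * (size_t)indices[i];
        flat[3 * i + 0] = src[0];
        flat[3 * i + 1] = src[1];
        flat[3 * i + 2] = src[2];
    }
    return flat;
}
```

The trade-off is the one described above: the expanded array is larger than the indexed original, but no two work-items read the same input.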

However, some operations are beyond OpenCL, and dynamic memory allocation is an important one. For example, if a model contains one thousand 3-D objects that need to be processed mathematically, OpenCL is great. But if the user deletes Objects 5 and 612, OpenCL can't deallocate the memory. Instead, the CPU needs to free the memory and re-transfer the remaining vertices to the GPU.
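The host-side cleanup might look something like this minimal C sketch. It assumes a hypothetical layout in which objects are stored back to back, in order, in one vertex array, described by first[] and count[] arrays like the ones glMultiDrawArrays accepts. compact_objects is an illustrative name, not part of any API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: object k owns count[k] vertices (3 floats each)
   starting at vertex first[k], and objects appear in increasing order of
   first[k]. Removes every object whose deleted[k] flag is set, packs the
   remaining vertices toward the front of the array, and rewrites first[]
   and count[] in place. Returns the new number of objects. */
size_t compact_objects(float *vertices, int *first, int *count,
                       const bool *deleted, size_t num_objects) {
    size_t kept = 0;
    int next_vertex = 0;
    for (size_t k = 0; k < num_objects; k++) {
        if (deleted[k]) {
            continue;
        }
        /* Source and destination may overlap, so use memmove */
        memmove(vertices + 3 * (size_t)next_vertex,
                vertices + 3 * (size_t)first[k],
                3 * (size_t)count[k] * sizeof(float));
        first[kept] = next_vertex;
        count[kept] = count[k];
        next_vertex += count[k];
        kept++;
    }
    return kept;
}
```

After compaction, the host would re-upload the packed vertices (for example, with glBufferData) and pass the updated first[], count[], and object count to glMultiDrawArrays.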

So the question is whether the speed-up provided by OpenCL kernels is sufficient to offset the delay imposed by using glMultiDrawArrays instead of glMultiDrawElements. In the end, the only way to know is to profile both methods and go with the one that provides better performance.

>> Tuesday, August 9, 2011

SIGGRAPH 2011 is taking place in Vancouver, and the Khronos Group has made two interesting announcements:

OpenGL 4.2 has been released along with GLSL 4.20.6. The new standard supports atomic operations and read/write operations on textures. It looks like the difference in capability between OpenCL kernels and OpenGL shaders is narrowing.

The Khronos Group is seeking corporate participation with regard to the WebCL and StreamInput standards. The more I read about StreamInput, the more interesting it seems. I'd love to sit in on the standard deliberation process, but I don't think I'd have anything productive to contribute.

>> Wednesday, August 3, 2011

UPDATE: Certain versions of Mac OS do support OpenCL 1.1, but there's no official announcement. To determine if your system supports 1.1, I recommend that you compile and run the code listed here.
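The core of such a check is reading the platform's version string. The OpenCL spec defines CL_PLATFORM_VERSION as "OpenCL<space><major.minor><space><platform-specific information>", and in real code the string would come from clGetPlatformInfo(platform, CL_PLATFORM_VERSION, sizeof(buf), buf, NULL). The minimal C sketch below parses only the string, so it doesn't need the OpenCL headers; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: parses a CL_PLATFORM_VERSION-style string
   ("OpenCL <major>.<minor> <vendor info>") and reports whether it
   indicates OpenCL 1.1 or later. Returns false if the string doesn't
   match the expected format. */
bool supports_opencl_1_1(const char *version) {
    int major = 0, minor = 0;
    if (sscanf(version, "OpenCL %d.%d", &major, &minor) != 2) {
        return false;
    }
    return major > 1 || (major == 1 && minor >= 1);
}
```

A device's CL_DEVICE_VERSION string uses the same format, so the same parser could be applied per device.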

The book is nearly finished, but one reviewer had a significant concern: a handful of my code examples didn't compile on his Mac OS system. But my code wasn't the problem; the problem is that Mac OS doesn't support OpenCL 1.1.

What's odd about this is that Apple was the driving force behind OpenCL 1.0. As I understand it, they told their device vendors to come up with a non-proprietary toolset for accessing high-performance devices like CPUs and GPUs. In response, Nvidia, AMD, and IBM put their heads together and wrote the first draft of the OpenCL spec. But now, over a year after the release of OpenCL 1.1, Mac OS doesn't support the new standard. And I haven't found any plan to support OpenCL 1.1 in future releases.

The same goes for OpenGL, which is a far greater concern than OpenCL. OpenGL 4.1 was released in July 2010, but according to this table, Apple's latest operating system supports nothing higher than version 3.2. In other words, Apple's hardware is capable of high-performance rendering but the OS won't allow modern rendering applications to execute.

Very frustrating. One of the selling points of OpenGL/OpenCL over Microsoft technologies is that they're cross-platform. But if Apple isn't willing to support modern versions of these tools, cross-platform means Windows and Linux only.

So here's my plan regarding the book's example code. I'll release a separate archive for Mac OS users, and the code won't be any different from the code for GNU users. But I'll remove every project that requires OpenCL 1.1 and OpenGL. This may upset Mac users, but there's nothing I can do.

>> Monday, August 1, 2011

I excerpted two sections of the book and combined them into an article called A Gentle Introduction to OpenCL. OpenCL is a tough subject, so I've relied on analogies to explain how host applications and kernels work. Neither analogy is perfect, but I hope they'll help newcomers to the topic.

Regarding kernel execution, there's one point I wanted to mention that didn't make it into the article. As OpenCL developers, we can control the total number of work-items generated for a kernel. We can also control the total number of work-groups. But we can't control the maximum number of work-items in a work-group. This depends on the device's resources and the resources required by the kernel. To determine this limit in code, call clGetKernelWorkGroupInfo with the CL_KERNEL_WORK_GROUP_SIZE parameter.
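Once the limit is known, the host still has to pick a work-group size and a compatible global size. The minimal C sketch below assumes kernel_max has already been read with clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_max, NULL) and that the application has a hypothetical preferred work-group size. Because OpenCL 1.x requires the global size to be evenly divisible by the work-group size, the global size is rounded up, and the kernel would need to skip the padding work-items, for example by comparing get_global_id(0) against the real item count.

```c
#include <stddef.h>

/* Picks a work-group size no larger than the kernel's limit. "preferred"
   is a hypothetical tuning value chosen by the application. This sketch
   ignores other limits such as CL_DEVICE_MAX_WORK_ITEM_SIZES. */
size_t choose_local_size(size_t preferred, size_t kernel_max) {
    return preferred < kernel_max ? preferred : kernel_max;
}

/* Rounds global_size up to the nearest multiple of local_size, since
   OpenCL 1.x requires the global size passed to clEnqueueNDRangeKernel
   to be a multiple of the work-group size. local_size must be nonzero. */
size_t round_up_global_size(size_t global_size, size_t local_size) {
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

Dividing the padded global size by the local size gives the number of work-groups, which is how the host controls that number indirectly.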