The code in the previous article showed each line of the fractal immediately after it was calculated. This article changes that: the Mandelbrot image is rendered completely off-screen and only shown once the entire image has been calculated. This reduces the overhead of displaying the fractal line by line, which matters especially for the C++ AMP version: it will be so fast that this overhead would otherwise become substantial.

I will also switch to single precision floating point numbers, because the GPU and driver combination on my Windows 7 machine does not support double precision. C++ AMP itself supports both single and double precision floating point arithmetic; however, whether double precision arithmetic actually works depends on your specific GPU hardware and the drivers from your GPU vendor. A side effect of using single precision arithmetic in the Mandelbrot renderer is that the image gets blocky at high zoom levels. At the end of this article, a piece of code shows how to check whether an accelerator in your system supports double precision arithmetic. You could use that to decide at runtime whether to use the single precision or the double precision implementation; this is left as an exercise for the reader.

Single-Threaded Implementation

All implementations, the single-threaded, the PPL, and the C++ AMP Mandelbrot version, use the following setup:

Here is the single-threaded implementation from my previous article, but updated to use single precision floating point arithmetic, and to render one Mandelbrot image to the buffer before displaying it:
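The listing is missing from this extract; below is a sketch of a single-precision, single-threaded escape-time loop that renders the whole image into the buffer before anything is displayed. The function names and the grayscale color mapping are assumptions; only the buffer indexing scheme comes from the code shown later in the article:

```cpp
#include <vector>

// Escape-time iteration count for one point c = cr + ci*i (single precision).
inline unsigned int MandelbrotIterations(float cr, float ci, unsigned int maxIter)
{
    float zr = 0.0f, zi = 0.0f;
    unsigned int iter = 0;
    while (zr * zr + zi * zi < 4.0f && iter < maxIter)
    {
        float newZr = zr * zr - zi * zi + cr;   // z = z^2 + c
        zi = 2.0f * zr * zi + ci;
        zr = newZr;
        ++iter;
    }
    return iter;
}

// Render the whole image into pBuffer; nothing is displayed until this returns.
void RenderMandelbrotSingleThreaded(unsigned int* pBuffer, int width, int height,
                                    float centerX, float centerY, float scale,
                                    unsigned int maxIter)
{
    const int halfWidth = width / 2, halfHeight = height / 2;
    for (int y = -halfHeight; y < halfHeight; ++y)
    {
        for (int x = -halfWidth; x < halfWidth; ++x)
        {
            float cr = centerX + x * scale;
            float ci = centerY + y * scale;
            unsigned int iter = MandelbrotIterations(cr, ci, maxIter);
            // Map the iteration count to a grayscale color (placeholder mapping);
            // points inside the set (iter == maxIter) are drawn black.
            unsigned int result =
                (iter == maxIter) ? 0 : (iter * 255 / maxIter) * 0x010101;
            pBuffer[(y + halfHeight) * width + (x + halfWidth)] = result;
        }
    }
}
```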

Multi-Threaded Implementation (PPL)

Parallelizing this implementation using the Microsoft Parallel Patterns Library (PPL) is shown in my previous article. Again, this implementation has been updated to use single precision arithmetic and to render one whole frame to the buffer before displaying it. The updated code is as follows:
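The listing is missing from this extract; a sketch of the same escape-time loop parallelized with PPL's parallel_for over the image rows follows. As in the single-threaded sketch, the function name and color mapping are assumptions; only the buffer indexing comes from the article:

```cpp
#include <ppl.h>   // Microsoft Parallel Patterns Library

void RenderMandelbrotPPL(unsigned int* pBuffer, int width, int height,
                         float centerX, float centerY, float scale,
                         unsigned int maxIter)
{
    const int halfWidth = width / 2, halfHeight = height / 2;
    // Distribute the rows of the image over all available CPU cores.
    concurrency::parallel_for(-halfHeight, halfHeight, [=](int y)
    {
        for (int x = -halfWidth; x < halfWidth; ++x)
        {
            float cr = centerX + x * scale;
            float ci = centerY + y * scale;
            float zr = 0.0f, zi = 0.0f;
            unsigned int iter = 0;
            while (zr * zr + zi * zi < 4.0f && iter < maxIter)
            {
                float newZr = zr * zr - zi * zi + cr;   // z = z^2 + c
                zi = 2.0f * zr * zi + ci;
                zr = newZr;
                ++iter;
            }
            unsigned int result =
                (iter == maxIter) ? 0 : (iter * 255 / maxIter) * 0x010101;
            pBuffer[(y + halfHeight) * width + (x + halfWidth)] = result;
        }
    });
}
```

Each row is an independent unit of work, so no synchronization between threads is needed; the only shared state is the output buffer, and no two iterations write to the same pixel.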

Mandelbrot Using C++ AMP

C++ AMP Implementation

C++ AMP is a new technology being developed by Microsoft that allows you to easily write C++ code that takes advantage of the computing power of accelerators, such as GPUs. For now, C++ AMP requires DirectX 11 GPUs. C++ AMP abstracts the accelerators for you: when using it, you don't need to worry about what kind of accelerator is in the system. For example, there is no need to write one implementation for an NVIDIA GPU and another for an AMD GPU; you write one implementation, and C++ AMP handles the rest. If you have multiple GPUs in your system, C++ AMP can use them all at the same time, and they don't even need to be from the same vendor. For example, if you have a system with both an AMD and an NVIDIA GPU in it, C++ AMP can use them both together.

Microsoft intends to make C++ AMP an open specification, so that other compiler vendors can also add support for it.

For this simple Mandelbrot renderer, once you have a PPL implementation, it's almost trivial to transform it to C++ AMP. Without further ado, here is the C++ AMP version, with a detailed explanation after the code:
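The listing is missing from this extract; below is a sketch of what such a kernel looks like using the Developer Preview API that this article describes (writeonly<>, the grid property, and restrict(direct3d)). The function name and color mapping are assumptions carried over from the earlier sketches:

```cpp
#include <amp.h>
using namespace concurrency;

void RenderMandelbrotAMP(unsigned int* pBuffer, int width, int height,
                         float centerX, float centerY, float scale,
                         unsigned int maxIter)
{
    const int halfWidth = width / 2, halfHeight = height / 2;
    // Write-only 2D view over the CPU-side buffer.
    array_view<writeonly<unsigned int>, 2> a(height, width, pBuffer);

    // One GPU thread per pixel; idx[0] is the row (y), idx[1] the column (x).
    parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d)
    {
        float cr = centerX + (idx[1] - halfWidth) * scale;
        float ci = centerY + (idx[0] - halfHeight) * scale;
        float zr = 0.0f, zi = 0.0f;
        unsigned int iter = 0;
        while (zr * zr + zi * zi < 4.0f && iter < maxIter)
        {
            float newZr = zr * zr - zi * zi + cr;   // z = z^2 + c
            zi = 2.0f * zr * zi + ci;
            zr = newZr;
            ++iter;
        }
        unsigned int result =
            (iter == maxIter) ? 0 : (iter * 255 / maxIter) * 0x010101;
        a[idx] = result;   // no manual buffer index calculation needed
    });

    a.synchronize();   // block until the GPU is done and results are copied back
}
```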

Everything in C++ AMP lives in the concurrency namespace. Certain Direct3D interoperability functionality, such as fast_log(), is in the concurrency::direct3d namespace.

The first thing the C++ AMP implementation does is define a 2D array_view over the buffer. The array_view class creates a multi-dimensional view over a user-supplied buffer; C++ AMP copies the data of this buffer to and from GPU memory on demand. The basic use of array_view is:

array_view<type, dim>

where type is the type of data in the buffer and dim is the dimensionality of your data. The dimensionality can be anything you want: 1D, 2D, 3D, 4D, and so on. Defined like this, the array_view is a read-write view over your buffer, which means that the initial data is copied from system memory to GPU memory, and the end result is copied from GPU memory back to system memory. If your GPU kernel only writes to the buffer, the array_view can be defined as a write-only view over your buffer as follows:

array_view<writeonly<type>, dim>

This optimizes memory transfers: no initial data is copied from system memory to GPU memory; data is only copied from GPU memory to system memory after the calculations finish.

The opposite is also possible. You can create a read-only array_view as follows:

array_view<const type, dim>

This will only copy initial data from system memory to GPU memory, and will not copy it back from GPU memory to system memory.

The above Mandelbrot implementation only writes to the buffer, so the array_view is defined as write-only.

Another option is to use the array class instead of array_view. The array class allocates a buffer, while array_view creates a view over a user allocated buffer.
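The three declaration forms above, written out over a hypothetical vector of pixels (Developer Preview syntax), look like this:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

const int width = 1916, height = 951;
std::vector<unsigned int> pixels(static_cast<size_t>(width) * height);

// Read-write view: data is copied to the GPU before the kernel runs
// and copied back afterwards.
array_view<unsigned int, 2> rw(height, width, pixels);

// Write-only view: nothing is copied to the GPU up front; results are
// only copied back after the kernel finishes.
array_view<writeonly<unsigned int>, 2> wo(height, width, pixels);

// Read-only view: data is copied to the GPU, but never copied back.
array_view<const unsigned int, 2> ro(height, width, pixels);
```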

The first parameter to the parallel_for_each() is a compute domain, called a grid. If you have an array_view, you can simply use its grid property as shown in the above implementation. For this Mandelbrot renderer, we have a two dimensional array_view, so the grid will also be 2D. On each position of the grid, a computation will be performed.

The second parameter to parallel_for_each() is a lambda expression marked restrict(direct3d), which says that the code in the lambda should be executed on your GPU instead of your CPU. The restrict(direct3d) attribute triggers a compile-time check: the compiler verifies that the code is able to run on GPUs. For example, if you make a system call inside a restrict(direct3d) GPU kernel, the code cannot work on the GPU, and compilation fails. In other words, if you use restrict(direct3d) and the code compiles without errors, you are not violating any GPU restrictions and the code is valid for execution on GPUs. Any function that you call from inside a GPU kernel must also have the restrict(direct3d) specifier. There are a number of restrictions on the code in a restrict(direct3d) function; MSDN lists them as follows:

The function can call only functions that have the restrict(direct3d) clause.

The function must be inlinable.

The function can declare only int, unsigned int, float, and double variables, and classes and structures that contain only these types.

References and single-indirection pointers are supported only as local variables and function arguments.

The following are not allowed:

recursion.

variables declared with the volatile keyword.

virtual functions.

pointers to functions.

pointers to member functions.

pointers in structures.

pointers to pointers.

goto statements.

labeled statements.

try, catch, or throw statements.

global variables.

static variables. Use the tile_static keyword instead.

dynamic_cast casts.

the typeid operator.

asm declarations.

varargs.
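As an illustration of the first rule, any helper function a kernel calls must itself carry the restrict(direct3d) specifier. The function below is a hypothetical example, not from the article's code:

```cpp
// Squared magnitude of a complex number. Callable from inside a GPU kernel
// because it carries the restrict(direct3d) specifier; without it, calling
// this function from a kernel would fail to compile.
float MagnitudeSquared(float re, float im) restrict(direct3d)
{
    return re * re + im * im;
}
```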

The lambda expression accepts only one parameter. This parameter is an index into the compute domain (= grid). For the 2D Mandelbrot case, this index is a 2D index denoted as:

index<2> idx

You can access each dimensionality of this index using array notation; for this 2D index you can use idx[0] and idx[1]. This gives you the position in the grid. An added bonus of using an index is that the code looks cleaner. In the C++ AMP version of the Mandelbrot renderer there is no ugly buffer index calculation. Compare the C++ AMP version:

a[idx] = result;

with the PPL version and single-threaded version:

pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;

The lambda expression should capture all variables it needs by value, except concurrency::array objects, which have to be captured by reference. MSDN also states that the following restrictions apply on the object types that can be captured:

No pointers or references (exception: array)

No char, short, or long long types

No bool types

Structs or classes that contain supported types are allowed

No objects that contain virtual functions or virtual bases

Structs or classes must be PODs (more formally, they must be copyable by blitting)

A call to parallel_for_each() can also fail at runtime with an exception, for example in the following cases:

Failure to create the shader, which is the code that runs on the accelerator

Failure to create buffers

Invalid grid passed

Mismatched accelerators

The body of the lambda expression supplied to parallel_for_each() (the kernel) runs asynchronously with respect to the CPU code following the parallel_for_each() call. This means that your CPU code after the call continues to execute in parallel with the kernel, until a synchronization point is reached. If your CPU code at some point observes the result of the kernel, by inspecting the array_view in this example, C++ AMP synchronizes at that point and blocks until the GPU kernel is finished. A synchronization is also forced when the array_view object goes out of scope. However, in that case the synchronize() method is called from within the array_view destructor and thus is not allowed to throw any exceptions; any GPU and CPU exceptions that happen during the synchronization, including while transferring results from GPU memory back to system memory, are lost. For this reason, it is recommended to manually call synchronize() at a suitable point in time, and to properly catch any exceptions.
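A sketch of that recommended pattern, assuming an array_view named a as in the Mandelbrot implementation (the exception type caught here, concurrency::runtime_exception, is the base class C++ AMP uses for its runtime errors):

```cpp
try
{
    // Block until the kernel has finished and the results have been
    // copied from GPU memory back to the CPU-side buffer.
    a.synchronize();
}
catch (const concurrency::runtime_exception& ex)
{
    // Without this explicit call, the same exception would be raised inside
    // the array_view destructor and silently swallowed.
    std::cerr << ex.what() << std::endl;
}
```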

C++ AMP also includes a function called copy_async() that copies data from a source to a destination. The function returns an std::future, which you can check whenever you want to see whether the copy has finished or not.

You can use the accelerator and accelerator_view classes to retrieve information about the accelerators installed in a system, and the get_accelerators() function to get a vector of all accelerators. This introductory article does not go deeper into the functionality of these classes; only a small example is given.

You can get hold of the default accelerator in a system by using the default constructor of the accelerator class. Once you have it, you can query it for information using the supports() method and the accelerator_restriction enumeration. For example, the following code checks whether the default accelerator supports double precision calculations:
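The original listing is missing from this extract; a sketch using the supports() method and accelerator_restriction enumeration named above follows. The exact enumerator name (double_precision here) is an assumption, as the Developer Preview naming may differ:

```cpp
#include <amp.h>
using namespace concurrency;

accelerator acc;   // default constructor yields the default accelerator

if (acc.supports(accelerator_restriction::double_precision))
{
    // Safe to run the double precision implementation on this accelerator.
}
else
{
    // Fall back to the single precision implementation.
}
```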


Results

On my test machine (Intel Core i7-2600k, 16GB RAM, AMD Radeon HD 6800) I get the following averages in milliseconds for rendering one frame at the default zoom level at a resolution of 1916x951:

Single threaded version: 1167 ms

PPL version: 194 ms

C++ AMP version: 31 ms

The Intel Core i7-2600K has 4 physical cores; with Hyper-Threading enabled, the operating system sees 8 cores. The single-threaded version utilizes only one of these cores, while the PPL version fully loads all 8. The C++ AMP version uses almost no CPU power at all; it only uses the GPU cores. On my Radeon HD 6800, it executes around 220 threads.

When you are running Windows 8 Preview with Visual C++ 11 Preview, you can use GPU debugging and put breakpoints in your GPU code; see Debugging a C++ AMP Application. At the moment, GPU debugging does not work on Windows 7.

Sample Application

This article comes with a sample application. You need the Visual Studio 11 Developer Preview to compile it. Unfortunately, I am not able to include a pre-compiled binary because that binary would depend on the Visual C++ 11 Preview redistributable, which is not yet available.

The sample application has a toolbar with the following 'programmers-art' buttons:

R: Render one frame using the selected method

-: Zoom out on the Mandelbrot fractal

+: Zoom in on the Mandelbrot fractal

S: Select the single-threaded implementation

PPL: Select the PPL implementation

AMP: Select the C++ AMP implementation

B: Benchmark: Render a number of frames and display the average in the window title bar.

After rendering a frame, the title bar shows how long it took to render and at what resolution. You can use the left mouse button to move around in the Mandelbrot fractal, and the mouse wheel to zoom in and out. Note however that this interactive moving and zooming is only useful with the C++ AMP implementation, because the single-threaded and PPL versions are too slow for interactivity.


About the Author

Marc Gregoire

Marc graduated from the Catholic University of Leuven, Belgium, with a degree in "Burgerlijk ingenieur in de computer wetenschappen" (equivalent to a Master of Science in Engineering in Computer Science) in 2003. In 2004, he obtained, cum laude, a Master in Artificial Intelligence degree at the same university. In 2005 he started working for a large software consultancy company. His main expertise is C/C++, specifically Microsoft VC++ and the MFC framework. Besides C/C++, he also likes C# and uses PHP for creating web pages. In addition to his main interest in Windows development, he has experience in developing C++ programs that run 24x7 on Linux platforms, and in developing critical 2G/3G software running on Solaris for large telecom operators.
