
Easy OpenCL with Python

Use OpenCL with very little code -- and test it from the Python console.

As an example, I'll walk through the 12 steps required to build and deploy a kernel with PyOpenCL. If you have experience with C++ OpenCL host applications, you will be able to use PyOpenCL to prepare host applications that build and deploy your OpenCL kernels.

The following lines show the code for an OpenCL kernel that computes the product of a matrix and a vector:
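A minimal sketch of such a kernel, consistent with the description below (the kernel name matrix_dot_vector matches the one the host code calls later), might look like this:

```c
__kernel void matrix_dot_vector(__global const float4 *matrix,
                                __global const float4 *vector,
                                __global float *result)
{
    // Each work-item handles one matrix row.
    int gid = get_global_id(0);
    // dot() multiplies the four packed floats pairwise and sums them.
    result[gid] = dot(matrix[gid], vector[0]);
}
```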

Both the matrix and vector kernel arguments are of type float4 and reside in the device's global address space, also known as "global memory." The kernel code retrieves the global ID number (gid) and uses it to compute the dot product of the float4 matrix row whose index is equal to the global ID number and the float4 vector. The float result is stored in result at the global ID index. Figure 1 shows an example of a 4x4 matrix multiplied by a 4-element vector.

Figure 1: A 4x4 matrix multiplied by a 4-element vector with the result.

The matrix-vector multiplication shown in Figure 1 requires one dot product per matrix row: each row's four elements are multiplied pairwise with the vector's four elements, and the four products are summed.

Each row in the matrix is a float4 vector, so the kernel performs just one dot operation to compute the product of a row and the float4 vector. For example, the first matrix row is (1.0, 2.0, 4.0, 8.0), and the only element of vector is also (1.0, 2.0, 4.0, 8.0). The dot operation for the first matrix row therefore receives two arguments with four float values packed in each: (1.0, 2.0, 4.0, 8.0) for matrix[gid] and (1.0, 2.0, 4.0, 8.0) for vector[0]. The code takes advantage of the vector-processing capabilities of OpenCL and demonstrates the support for vector types that PyOpenCL brings to Python.
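To make the arithmetic concrete, here is the first row's dot product computed with plain NumPy (only the first row's values are given in the text; the kernel performs the same per-row reduction in a single dot operation):

```python
import numpy

row = numpy.array([1.0, 2.0, 4.0, 8.0], dtype=numpy.float32)  # matrix[0]
vec = numpy.array([1.0, 2.0, 4.0, 8.0], dtype=numpy.float32)  # vector[0]

# OpenCL's dot(matrix[gid], vector[0]) multiplies pairwise and sums:
result = numpy.dot(row, vec)  # 1*1 + 2*2 + 4*4 + 8*8 = 85.0
```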

The following lines show Python code that uses PyOpenCL and Numpy to perform the steps required for an OpenCL host program. The code includes comments that indicate which blocks of code are performing each of the 12 steps of the typical OpenCL C++ host program. You can also run different parts of the code in the Python console.
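A sketch of such a host program follows. It assumes PyOpenCL is installed and that at least one OpenCL platform and device are available; the mapping of the comments to the 12 steps is approximate, and the Figure 1 values beyond the first row are not reproduced here:

```python
import numpy
import pyopencl as cl
import pyopencl.array  # provides the cl.array.vec types

kernel_source = """
__kernel void matrix_dot_vector(__global const float4 *matrix,
        __global const float4 *vector, __global float *result) {
    int gid = get_global_id(0);
    result[gid] = dot(matrix[gid], vector[0]);
}
"""

# Step 1: Create and initialize the matrix and the vector.
vector = numpy.zeros(1, cl.array.vec.float4)
matrix = numpy.zeros(4, cl.array.vec.float4)
matrix[0] = (1.0, 2.0, 4.0, 8.0)  # first row, as described in the text
# matrix[1], matrix[2], matrix[3] take the remaining Figure 1 rows.
vector[0] = (1.0, 2.0, 4.0, 8.0)

# Step 2: Retrieve the first available platform.
# (A real host program should check the available extensions here.)
platform = cl.get_platforms()[0]

# Step 3: Retrieve the first device for the platform.
# (A real host program should check the device type here.)
device = platform.get_devices()[0]

# Step 4: Create a context for the selected device.
context = cl.Context([device])

# Steps 5-6: Create a program for the context with the kernel
# source code, then build the kernel.
program = cl.Program(context, kernel_source).build()

# Step 7: Create a command queue for the target device.
queue = cl.CommandQueue(context)

# Steps 8-9: Allocate device memory and move input data to the device.
mem_flags = cl.mem_flags
matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR,
                       hostbuf=matrix)
vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR,
                       hostbuf=vector)
matrix_dot_vector = numpy.zeros(4, numpy.float32)
destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY,
                            matrix_dot_vector.nbytes)

# Steps 10-11: Associate the arguments to the kernel and deploy it
# for device execution, one work-item per matrix row.
program.matrix_dot_vector(queue, matrix_dot_vector.shape, None,
                          matrix_buf, vector_buf, destination_buf)

# Step 12: Move the kernel's output data to host memory.
cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)

print(matrix_dot_vector)
```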

The first lines create two variables that initialize both the matrix and the vector. Notice that vector is an array of cl.array.vec.float4 with a single element, and matrix is an array of cl.array.vec.float4 with four elements. I used numpy.zeros to create each array with the cl.array.vec.float4 type and then additional code to initialize the values shown in Figure 1, so you can easily see how to use the cl.array.vec types:
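A sketch of that initialization (the Figure 1 values beyond the first row are not reproduced in the text, so only the first row and the vector are filled in here):

```python
import numpy
import pyopencl as cl
import pyopencl.array  # provides the cl.array.vec types

vector = numpy.zeros(1, cl.array.vec.float4)
matrix = numpy.zeros(4, cl.array.vec.float4)
matrix[0] = (1.0, 2.0, 4.0, 8.0)  # first row, as given in the text
# matrix[1], matrix[2], matrix[3] take the remaining Figure 1 rows.
vector[0] = (1.0, 2.0, 4.0, 8.0)
```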

The code retrieves the first available platform, then the first device for this platform. There is no code to check either the available extensions or the device type. However, I placed comments in the code as a reminder that these tasks are necessary in a more complex host program.

Then, the code creates an OpenCL context for the selected device and calls cl.Program to create a program for the context with the kernel source code as one of the arguments. The call to the build() method for the created cl.Program instance builds the kernel.

The code calls cl.CommandQueue with the context as an argument to create a command queue (queue) for the target device. Then, it allocates device memory and moves input data from the host to the device memory. The following lines use the most basic features provided by PyOpenCL to do this:
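Assuming the context, matrix, and vector created earlier, the buffer allocation might look like this (matrix_dot_vector is the host array that will later receive the result):

```python
mem_flags = cl.mem_flags
matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR,
                       hostbuf=matrix)
vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR,
                       hostbuf=vector)
matrix_dot_vector = numpy.zeros(4, numpy.float32)
destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY,
                            matrix_dot_vector.nbytes)
```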

- matrix_buf: a read-only buffer that copies the data from the matrix variable. The kernel reads this buffer from the global memory space.

- vector_buf: a read-only buffer that copies the data from the vector variable. The kernel reads this buffer from the global memory space.

- destination_buf: a write-only buffer that will hold the result of the matrix-by-vector multiplication. The kernel writes to this buffer in the global memory space.

The following line associates the arguments to the kernel and deploys it for device execution by calling the method that PyOpenCL generates on program from the built kernel's name, matrix_dot_vector. The previously created queue is the first argument:
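A sketch of that call, with one work-item per matrix row (the global size comes from the shape of the host result array):

```python
program.matrix_dot_vector(queue, matrix_dot_vector.shape, None,
                          matrix_buf, vector_buf, destination_buf)
```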

When the kernel finishes, it is time to move the kernel's output data (result) stored in destination_buf to the host program memory. The following line calls cl.enqueue_copy to do this, and the result will be available in the matrix_dot_vector variable.

cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)

In this example, the code doesn't take advantage of the different events that fire when the kernel finishes executing. Even so, because PyOpenCL performs all the necessary cleanup operations, you don't need to worry about reference counts or about releasing the underlying OpenCL structures and resources.

Conclusion

This example shows basic features that PyOpenCL provides to Python developers who want to create OpenCL host applications. In the next article in this series, I'll dive deep into more advanced features that reduce the code required to build and deploy OpenCL kernels for many common parallel algorithms.

