Abstract

Recently, Intel announced the release of the Intel® SDK for OpenCL Applications 2013 with certified OpenCL* 1.2 support on 3rd and 4th generation Intel® Core™ processors with Intel® Iris Graphics and the Intel® HD Graphics family. Visual computing applications such as content creation tools and software for home movies, music, and personal images benefit from OpenCL general-purpose programmability together with access to both the CPU and the hardware acceleration capabilities of Intel® Iris Graphics and Intel® HD Graphics, which can increase performance and improve battery life.

In fact, with the launch of the 3rd generation Intel® Core™ processors, Intel has made it possible for software developers to heterogeneously program both CPU and Intel HD Graphics through the OpenCL* framework.

This article documents lessons learned while accelerating video processing with OpenCL* on the 3rd generation Intel® Core™ processors. Common video effects in a commercial application were optimized with OpenCL, and then evaluated on Intel® HD Graphics 4000. Using OpenCL, key video effects were sped up by as much as 2.3x, and with further tuning for Intel® HD Graphics 4000, additional performance gains of up to 4.3x were achieved.

Many of the lessons learned and documented in this article are also applicable to the newer versions of Intel® Graphics Processors.

Introduction

A popular software title for professional video editing was updated in 2011 to accelerate video processing effects with OpenCL. Intel used this as an opportunity to test compatibility and runtime performance of OpenCL on the 3rd generation Intel® Core™ processor with Intel HD Graphics 4000, and to determine whether there were opportunities to enhance performance further. During development of the initial application release, over 60 video effects were accelerated with OpenCL, for which over 120 OpenCL kernels were implemented. With so many effects accelerated with OpenCL, it was essential to test the functionality and assess the performance of every OpenCL kernel.

This article outlines lessons learned and some optimization techniques used while testing and assessing the performance of the OpenCL kernels. In addition, performance and/or bottleneck issues found with some kernels and their solutions are outlined.

This article assumes the reader is familiar with the OpenCL programming model [1]. This article refers to the Intel® HD Graphics 4000 OpenCL as HDG or HDG OpenCL throughout. The HDG OpenCL capability was first introduced with the Intel 3rd Generation Core Family of Processors codenamed Ivy Bridge.

Analysis of the Title

A set of workloads was released as part of the application’s release kit to highlight video processing acceleration with OpenCL. The release kit consisted of seven workloads designed to test different video effects that are accelerated with OpenCL. These seven workloads were used for performance and functional analysis throughout the testing of the HDG implementation of OpenCL. A number of issues were encountered in the OpenCL compiler and runtime, and these were resolved. HDG runtime challenges were also observed, and it took time to understand them before steps could be taken to address the performance bottlenecks. Figure 1 compares the initial performance to the optimized performance observed with HDG today.

The application release kit workloads were developed to measure the increase in playback performance and the decrease in render time. The kit is divided into seven workloads, each showcasing different video effects. The workloads described below include the following video effects, all implemented with OpenCL kernels.

Workload 3: includes a key framed picture-in-picture sequence with clock wipe transition and an animated title. All of this is happening over a transition from the slow motion background image (using the Cross Effect transition) to the next clip.

Workload 4: includes an iris transition, a 2-track composite shot with 2 levels of secondary color correction and key framed cookie cutter effect, and a crossfade to a slow motion clip.

Workload 5: includes a PNG file which uses a Chroma key filter to composite it over a generated media lower third and a slow motion clip with a key framed Bump Map filter.

Workload 6: includes a two-track composite created by a Mask Generator filter on the top track’s event. It also features a key framed black & white effect, slow motion, and fades.

Workload 7: includes a PNG file with transparency over a clip with a key framed sepia effect, a key framed lens flare effect, and fades.

Figure 1. Performance Improvements

Figure 1 shows how early OpenCL performance compared to the performance observed with the majority of the OpenCL kernels optimized for HDG.

General Optimizations

This section outlines lessons learned and general optimizations for creating measurable performance improvements in OpenCL kernels. These optimizations were integrated into the release of the workloads shown in Figure 1.

Use native built-in functions cautiously. Most native functions yield better performance, but not all. For example, to raise y to the power 3.0f, use y = y * y * y instead of pow(y, 3.0f) or even native_powr(y, 3.0f). The code generated for the multiplication was more than 2 ms faster on HDG than either built-in function.

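Use bit-wise operations in place of logical operations in Boolean tests whenever possible; e.g., use if (y & x) instead of if (y && x). The two forms give the same result only when both operands are already 0 or 1, and unlike &&, the & operator does not short-circuit, so apply this only where both conditions hold safely. This optimization improves performance especially when the kernel is large, usually 4K instructions or more.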

Eliminate arithmetic on invariant values in kernel code. Move such computations to host code whenever possible. For example, in y = a * b + c, if c is the only value that varies, compute a * b in host code and pass the product to the kernel as an argument.
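The three tips above can be illustrated with a small sketch. It is written in plain Python rather than OpenCL C, purely to show the arithmetic and logical equivalences involved; all function names are hypothetical, and the snippet says nothing about timing on HDG.

```python
import math

def cube_by_pow(y: float) -> float:
    """Cube through the general-purpose power routine."""
    return math.pow(y, 3.0)

def cube_by_multiply(y: float) -> float:
    """Cube through two multiplications, the form the tip recommends."""
    return y * y * y

def both_set_logical(a: int, b: int) -> bool:
    """Logical test, equivalent to if (a && b)."""
    return bool(a and b)

def both_set_bitwise(a: int, b: int) -> bool:
    """Bit-wise test, equivalent to if (a & b).

    Matches the logical form only when a and b are each 0 or 1.
    """
    return bool(a & b)

def scale_offset_in_kernel(a: float, b: float, values: list) -> list:
    """a * b is recomputed for every element, as it would be per work-item."""
    return [a * b + c for c in values]

def scale_offset_hoisted(ab: float, values: list) -> list:
    """ab = a * b was computed once on the host and passed in as an argument."""
    return [ab + c for c in values]
```

Calling both_set_logical(1, 2) and both_set_bitwise(1, 2) returns True and False respectively, which is why the bit-wise substitution is only safe for operands that are already 0 or 1.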

Performance Case Studies

This section documents three case studies of OpenCL kernels optimized for HDG: 1) use of shared local memory, 2) use of defined filters in the HDG OpenCL engine, as well as 3) transfer and handling of texture data on HDG. The OpenCL kernels used in these studies came from a professional video editing application. The optimizations outlined in the article were scheduled to release in subsequent application updates.

Performance Case Study 1 – Lookup Tables

This case study documents why and how lookup tables (LUTs) present performance bottlenecks in OpenCL kernels. The study examines the performance of the “color curves” video effect program, where three lookup tables are used. Analysis of HDG OpenCL capabilities suggested that HDG Shared Local Memory (SLM) would speed up kernels that use LUTs. This case study shows that using HDG SLM does in fact speed up OpenCL kernels that use LUTs. It also shows how simple it is to further accelerate video processing using SLM in OpenCL kernels where appropriate.

Lookup Tables in OpenCL Kernels

Processing lookup table data in OpenCL kernels in most cases creates a performance bottleneck. This is due to the large number of data transfers for lookup table data that occur between main memory and the memory available for the OpenCL device. The more the kernel is “compute” bound (limited by the number of computations and not the number of pixels rendered), the more severe the performance impact is. This is especially important since kernels should always be designed to be “compute” bound to maximize performance as hardware becomes more capable. If an OpenCL kernel is not compute bound, the kernel program should be redesigned or the algorithm should not be programmed using OpenCL at all. In general, lookup tables should be avoided in OpenCL kernels if at all possible because LUTs preclude a kernel from being compute bound. Avoiding lookup tables and whether the kernel is compute bound or not are topics for another paper. For now, we’ll examine how lookup tables can manifest as performance bottlenecks and what can be done to prevent bottlenecks.

Lookup tables will almost always create a performance bottleneck when the lookup table data is large, generally more than 256 bytes. As lookup table data is being accessed by hundreds of OpenCL HW threads, lots of data transfers occur between main system memory and HDG memory. Inherently, these data transfers between the two system memories incur latency and/or collisions. The data transfer latency and access hits slow the HDG OpenCL compute engine, which in turn prevents the kernel from running optimally.

OpenCL Code on Kernels with LUTs

Consider the kernel code from the “color curves” video effect in Figure 2. Notice that in addition to other parameters, the kernel has three parameters (global pointers) for lookup tables: lutR, lutG, and lutB. The OpenCL keyword “__global” implies that the memory comes from the global memory pool, which in the HDG architecture usually means cache memory. Data held in this memory reaches the kernel along a slow path. The highlighted code shows the kernel indexing into the tables to retrieve LUT data. The indexes are computed from the incoming image pixel data (not shown here). Notice that each kernel thread retrieves six values from system memory. The latency impact is compounded because a large number of OpenCL hardware threads run the kernel and try to access the same memory address space.

So what can be done to avoid bottlenecks when using lookup tables? In most cases the answer is as easy as using the HDG local memory. Local memory is also referred to as SLM (Shared Local Memory) because variables which use the “__local” prefix are allocated in local memory and local memory is shared by all work-items in a work-group. For more details on OpenCL semantics refer to the OpenCL Specification.

Performance Optimization

If at all possible, lookup table data should be copied to shared local memory. Using SLM prevents excessive shuttling of lookup table data between kernel threads thus greatly minimizing data transfer penalties. With the memory latency removed or minimized, the kernel compute throughput will no longer be bogged down and should show substantially better performance. The code in Figure 3 illustrates how to use SLM for lookup tables to prevent performance bottlenecks in this OpenCL kernel.

get_local_size(0|1): each call returns the local work size specified when the kernel was enqueued. When the local work size is not specified, the OpenCL kernel engine selects the most appropriate one based on the global work size. These values remain constant across all kernel HW threads. In this case the local work size was not specified and the HDG OpenCL engine selected [64, 8], so get_local_size(0) always returned 64 and get_local_size(1) always returned 8.

get_local_id(0|1): each call returns a value within the range of the local work size. Here, get_local_id(0) returns a value from 0 to 63, while get_local_id(1) returns a value from 0 to 7.

The barrier(CLK_LOCAL_MEM_FENCE) call blocks all kernel HW threads in the work-group until every thread has copied its chunk of the global lookup table data to local memory. The kernel code still indexes through the lookup tables, but those tables now reside in local memory. Because the data resides in local memory, the data transfer latency is avoided, which speeds up the compute part of the kernel and yields much better performance. Removing the performance bottleneck in kernels that use lookup tables is often just as simple as using SLM.
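The cooperative copy described above can be emulated on the host to make the indexing concrete. The following is a minimal sketch in plain Python, not OpenCL kernel code: it assumes a 256-entry table (the article does not state the table length), uses the [64, 8] work-group shape reported above, and uses hypothetical function names. Finishing the copy loops before any lookup runs stands in for the barrier.

```python
LOCAL_SIZE = (64, 8)   # work-group shape the article reports HDG selecting
LUT_ENTRIES = 256      # assumed table length, not taken from the article

def stage_lut(global_lut, local_size=LOCAL_SIZE):
    """Copy a LUT into emulated local memory the way a work-group would.

    Each work-item copies the entry at its flattened local id, then
    strides by the work-group size until the whole table is covered.
    """
    group_items = local_size[0] * local_size[1]
    local_lut = [None] * len(global_lut)
    copies_per_item = {}
    for ly in range(local_size[1]):          # plays the role of get_local_id(1)
        for lx in range(local_size[0]):      # plays the role of get_local_id(0)
            lid = ly * local_size[0] + lx
            count = 0
            for i in range(lid, len(global_lut), group_items):
                local_lut[i] = global_lut[i]
                count += 1
            copies_per_item[lid] = count
    # Returning only after both loops finish models the barrier: every
    # entry is written before any work-item reads local_lut.
    return local_lut, copies_per_item

def apply_curve(pixels, local_lut):
    """Per-pixel lookup that reads only the staged copy."""
    return [local_lut[p] for p in pixels]

lut = [255 - i for i in range(LUT_ENTRIES)]      # toy inverting curve
staged, copies = stage_lut(lut)
print(apply_curve([0, 10, 255], staged))         # prints [255, 245, 0]
```

With 512 work-items and 256 entries, no work-item copies more than one entry, so the staging cost per work-item stays small compared with the per-pixel lookups that follow.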

Table 1 shows the performance measured for the color curves effect, which was optimized to use SLM. A stand-alone application was written, and a single image (1440x1080) was used to assess the performance of the OpenCL kernel for the color curves effect. The stand-alone application host code looped the execution of the OpenCL kernel 100 times. The kernel with no SLM took about 76.5 milliseconds to process the effect on the image, while the kernel optimized with SLM took only 18 milliseconds. The kernel is therefore about 4.3x faster with SLM than without it (76.5 / 18.0 ≈ 4.25).

Table 1. Performance of Color Curves with and without SLM

| Color Curves run type | Elapsed Time (ms) |
| --- | --- |
| OpenCL kernel with no SLM | 76.5 |
| OpenCL kernel with SLM | 18.0 |

Performance was measured on a 3rd generation Intel® Core™ processor with HDG OpenCL and compared to the same processor running the kernel on four logical threads. For additional system details, refer to the system information in Appendix A.

As observed in this study, using SLM clearly benefits the runtime performance of OpenCL kernels. However, using SLM comes with some restrictions. It is important to understand the OpenCL device hardware limitations and capabilities when programming OpenCL kernels to use SLM. For HDG OpenCL, as a rule of thumb, do not use SLM if the LUT size is larger than 4K bytes.
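A quick size check against this rule of thumb might look like the sketch below. The entry counts and the 4-byte entry size are illustrative assumptions, the helper names are hypothetical, and the article does not say whether the 4K limit applies per table or to all tables combined, so the helper simply checks whatever byte count it is given.

```python
SLM_LUT_LIMIT_BYTES = 4 * 1024   # rule-of-thumb ceiling quoted above

def lut_footprint_bytes(entries: int, bytes_per_entry: int, tables: int = 1) -> int:
    """Total bytes occupied by `tables` lookup tables of equal size."""
    return entries * bytes_per_entry * tables

def fits_rule_of_thumb(total_bytes: int) -> bool:
    """True when the footprint does not exceed the 4K-byte guideline."""
    return total_bytes <= SLM_LUT_LIMIT_BYTES

# Three hypothetical 256-entry tables of 4-byte values: 3072 bytes.
print(fits_rule_of_thumb(lut_footprint_bytes(256, 4, tables=3)))
```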

Figure 4 is a screenshot of the color curve (infrared) effect, showing the output after the different colors have been computed.

Figure 4. Sample Video Output of the Color Curves Effect

Performance Case Study 2 – Software Bilinear Interpolation

The OpenCL 1.1 specification supports bilinear interpolation (BLI) through the required CLK_FILTER_LINEAR filter mode. When a read image function is called on a 2D image with a sampler that uses the CLK_FILTER_LINEAR filter, it returns the equivalent of a bilinearly interpolated pixel.

Early on in the effort to accelerate video effects with HDG OpenCL, the HDG BLI capability was found to produce incorrect results. Due to these incorrect results, the video editing app programmers ended up writing their own bilinear interpolation functionality in OpenCL kernel code. The incorrect results issue was promptly fixed in the HDG graphics driver and the performance of both BLI implementations was compared. This case study shows that HDG hardware implementation for BLI achieves better performance than that of the functionality written with OpenCL.

Kernel Pseudo Code for Bilinear Interpolation

The code for the page peel video effect is shown in Figure 5 and it illustrates both the software implementation for BLI as well as the use of the CLK_FILTER_LINEAR to achieve bilinear interpolation with HDG hardware.

As the code shows, the software implementation of BLI relies on four calls to the “read image” built-in function, plus arithmetic operations to blend the results. In contrast, the code in the second row of Figure 5 uses the CLK_FILTER_LINEAR filter in a sampler. The sampler is passed as an argument to the “read image” function, which tells HDG to use hardware BLI. Note that the ScaledSampler sampler is defined so that the CLK_FILTER_LINEAR filter is used, and that with this sampler only one call to “read image” is needed. The highlighted code in Figure 5 shows the differences between the two implementations.

The x and y values are derived from the work-item IDs returned by get_global_id(0) and get_global_id(1), respectively.
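To make the software path concrete, the following is a minimal Python sketch of a four-read bilinear interpolation on a single-channel image. It is not the application's kernel code: the function names are hypothetical, clamp-to-edge addressing is an assumption, and the half-texel coordinate conventions used by hardware samplers are ignored. A sampler configured with CLK_FILTER_LINEAR performs the equivalent blend inside a single read.

```python
import math

def read_pixel(img, x, y):
    """Nearest-pixel read with clamp-to-edge addressing (assumed)."""
    height, width = len(img), len(img[0])
    return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

def bilinear_software(img, x, y):
    """Four reads plus weighted blending, as a hand-written BLI would do."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0                 # fractional position inside the cell
    p00 = read_pixel(img, x0, y0)
    p10 = read_pixel(img, x0 + 1, y0)
    p01 = read_pixel(img, x0, y0 + 1)
    p11 = read_pixel(img, x0 + 1, y0 + 1)
    top = p00 * (1 - fx) + p10 * fx         # blend along x on the upper row
    bottom = p01 * (1 - fx) + p11 * fx      # blend along x on the lower row
    return top * (1 - fy) + bottom * fy     # blend the two rows along y

image = [[0.0, 10.0],
         [20.0, 30.0]]
print(bilinear_software(image, 0.5, 0.5))   # prints 15.0
```

The four reads and six multiplications per output pixel are the work that moves into the sampler hardware when the filter mode is used instead.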

Performance Optimization

With the HDG BLI capability fixed, the performance of the software implementation of BLI was compared to the performance of the HDG hardware implementation. As expected, HDG hardware BLI outperformed the software version.

When optimizing an OpenCL kernel for performance on HDG, a kernel that uses software BLI is unlikely to perform optimally. The good news is that this performance issue can easily be avoided by using HDG’s own BLI instead of an implementation programmed in OpenCL. BLI is commonly used in several video effects, some of which were used to gauge the performance deltas and the consistency of the output. The HDG BLI consistently outperformed the software BLI implementation. Depending on the video effect, HDG BLI was between 1.25x and 1.3x as fast as the software function.

Figure 6 is a screenshot of the page peel effect being used in a video transition scenario.

Figure 6. Sample Video Output with Page Transition Effect

For those curious about the performance of the C/C++ and OpenCL implementations for the page peel effect, Table 2 shows the performance achieved in frames per second. The HDG OpenCL implementation showed a 2.5x better performance over the CPU C/C++ implementation, and achieved an additional 1.25x with use of HDG HW bilinear interpolation.

For additional test system details, refer to the system information in Appendix A.

Performance Case Study 3 – Lens Flare

Although OpenCL best practices and optimization guidelines suggest programming kernels with as few instructions as possible, there are exceptions to this advice. This case study explores performance shortcomings of the lens flare video effect, which required six kernels, and compares its performance against a monolithic one-kernel solution.

Texture Traffic Overhead Using Multiple Kernels

Multiple OpenCL kernels are usually viewed as an optimal design solution for video effects where multiple independent video elements are added to the video output. In practice, a single kernel would minimize texture traffic overhead and it might be a better solution in terms of performance. This case study highlights the lens flare video effect which uses six kernels. Each kernel was designed to draw a lens effect element: poly, ring, circle, diffused ray, thin ring, and sunburst onto the video output.

Depending on the lens flare effect setting, a particular kernel would be executed multiple times to draw multiple instances of the same element on the same video frame. This required multiple passes over the same image, and each pass added texture data traffic. This traffic overhead was determined to slow down the processing of the video effect. A one-kernel approach was proposed to eliminate the extra texture traffic and improve performance. This case study outlines the performance results with the one-kernel solution.

Performance Optimization

To consolidate the six kernels into one, the unique code from each of the six kernels was turned into a function that is called from the main kernel. Each function is called within a loop to draw as many elements as needed. Surprisingly, not all of the settings of the lens flare effect showed a performance improvement with this approach. In fact, two of the settings showed minor performance degradation. Table 3 lists the performance observed with the six-kernel and one-kernel implementations. The one-kernel solution sped up three of the five settings and slowed down the other two.
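The structural difference between the two designs can be sketched as follows. This is a plain Python emulation with hypothetical stand-in element functions, not the lens flare kernels themselves; it assumes each element updates a pixel independently of its neighbors, and it only counts full passes over the frame rather than measuring time.

```python
def draw_ring(pixel, strength):
    """Stand-in for a real element function (hypothetical)."""
    return pixel + strength

def draw_circle(pixel, strength):
    """Stand-in for a real element function (hypothetical)."""
    return pixel + 2 * strength

def render_multi_pass(frame, elements):
    """One full pass over the frame per element, as with one kernel launch each."""
    frame_passes = 0
    for draw, strength in elements:
        frame = [draw(p, strength) for p in frame]   # reads and writes every pixel
        frame_passes += 1
    return frame, frame_passes

def render_single_pass(frame, elements):
    """One pass over the frame; elements are applied in an inner loop."""
    out = []
    for p in frame:
        for draw, strength in elements:
            p = draw(p, strength)
        out.append(p)                                # each pixel written once
    return out, 1

elements = [(draw_ring, 1), (draw_circle, 1), (draw_ring, 3)]
print(render_multi_pass([0, 1, 2], elements))    # prints ([6, 7, 8], 3)
print(render_single_pass([0, 1, 2], elements))   # prints ([6, 7, 8], 1)
```

Both designs produce the same pixels; the single-pass form trades repeated frame traffic for a longer kernel with more work per pixel, which is consistent with the mixed results reported for settings with few elements.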

| Lens Flare run type | 6-kernels FPS | 1-kernel FPS | Scale (1-kernel / 6-kernels) |
| --- | --- | --- | --- |
| Setting 1 – 8 elements | 16.9 | 14.2 | 0.84 |
| Setting 2 – 13 elements | 12.6 | 15.7 | 1.25 |
| Setting 3 – 9 elements | 17.8 | 16.7 | 0.94 |
| Setting 4 – 23 elements | 8.9 | 14.4 | 1.62 |
| Setting 5 – 16 elements | 11.8 | 15.7 | 1.33 |

Table 3. One-Kernel vs. Multiple Kernel Performance Metrics

Table 3 shows that as the number of elements to draw increases, the one-kernel implementation achieves better performance. It also shows that when fewer than 10 elements are drawn, the six-kernel implementation yields better performance. At the time of this article’s publication, further optimization of the functions engaged for setting 1 and setting 3 was still being pursued. It is still possible that the one-kernel solution might perform better even for lens flare effects with fewer than 10 elements.
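The Scale column and the crossover near 10 elements can be rechecked directly from the frame rates in Table 3. The short Python sketch below copies the table values as given; the helper names are hypothetical.

```python
# (setting, elements, six-kernel FPS, one-kernel FPS), copied from Table 3
TABLE_3 = [
    (1, 8, 16.9, 14.2),
    (2, 13, 12.6, 15.7),
    (3, 9, 17.8, 16.7),
    (4, 23, 8.9, 14.4),
    (5, 16, 11.8, 15.7),
]

def scale(six_fps: float, one_fps: float) -> float:
    """Frame-rate ratio of the one-kernel version to the six-kernel version."""
    return one_fps / six_fps

def element_counts(rows, one_kernel_faster: bool):
    """Element counts of the settings where the one-kernel version wins or loses."""
    return sorted(
        elements
        for _, elements, six_fps, one_fps in rows
        if (scale(six_fps, one_fps) > 1.0) == one_kernel_faster
    )

for setting, elements, six_fps, one_fps in TABLE_3:
    print(setting, elements, round(scale(six_fps, one_fps), 2))

print(element_counts(TABLE_3, one_kernel_faster=True))    # prints [13, 16, 23]
print(element_counts(TABLE_3, one_kernel_faster=False))   # prints [8, 9]
```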

Figure 7 is a screenshot of the video with setting 5 of lens flare effect.

Figure 7. Sample Video Output with Lens Flare Effect

Pseudo Code for Lens Flare

The pseudo code below includes both host and OpenCL code for the six kernels as well as for the one-kernel implementations of the lens flare effect. Figure 8 shows the host code and Figure 9 shows the OpenCL kernel code. Some code is omitted to simplify and to help illustrate key code changes.

Conclusion

In summary, video and image processing can be accelerated with OpenCL. Further optimization can be achieved on HDG with additional work. For most applications, excellent performance improvements will be observed with Intel® 3rd Generation Core Processors with HD Graphics running OpenCL, when compared to equivalent C/C++ code. Should the performance for a given kernel not improve as expected, consider the optimization techniques outlined in this paper. Refer to the Intel OpenCL optimization guide for detailed optimization techniques and methods to best performance tune OpenCL kernels for HDG.