This article is an overview of the OpenCL support provided in System Analyzer and Platform Analyzer on the Windows* OS. Note that finer-grained profiling methods exist, specifically in the Intel® VTune™ Amplifier XE. Still, both System Analyzer and Platform Analyzer are good starting points for understanding general platform (CPU and GPU) utilization via user-level OpenCL APIs.

What’s New

System Analyzer and Platform Analyzer profile the GPU cores while your application runs. Together with capturing general CPU activity, this enables you to correlate activities on both devices. You can also identify which device (and API) is the main bottleneck for your application.

Improvements to OpenCL API support are numerous and include:

Support for profiling OpenCL applications. Previously, only CPU metrics were available for OpenCL applications, such as console-based applications.

Tracing the application startup. This feature is particularly useful when you need to measure the initialization costs or analyze very short applications.

Improved support for metrics. Now you can track selected GPU metrics for your application over time with much better accuracy (at less than 1 ms granularity).

More importantly, the metrics are no longer tied to the frames. The fine-grained metrics are now available for all analysis types:

In the System Analyzer’s real-time view, in both the Heads-up Display (HUD) and the standalone version:

Figure 1. Fine-grained metrics in the System Analyzer.

In the Platform Analyzer, where you can zoom in to the specific time region or frame and inspect it in detail offline. Refer to the “API Level Analysis with Platform Analyzer” section below.

Explore what contribution Microsoft DirectX* calls make to the same GPU software queue and the resulting GPU load (via DMA packets execution), and how these calls interfere with DMA packets that originate from the OpenCL API.

Configuring an Analysis Profile

Right-click the Monitor icon in the system tray and select Profiles… in the context menu to open the Profiles dialog:

Figure 3. Tracing tab in the Profiles dialog box.

New options are now available on the Tracing tab of the Profiles dialog box:

Capture Application Startup checkbox enables you to start tracing immediately upon application startup, which is useful for short applications that can be hard to trace via the regular Ctrl+Shift+T shortcut.

Note: You can specify the trace duration with the spin box. To specify various types of events used to trigger the tracing, switch to the Trigger tab.

Now your application runs in the instrumented mode and you can capture the trace with Ctrl+Shift+T. Refer to the Online help on how to specify metrics of interest, how to conditionally capture the trace, and so on. Note: If your application is very short in execution, consider using the new Capture Application Startup configuration option (described above).

API Level Analysis with Platform Analyzer

Once you have generated a trace of your application, the next step is inspecting the timeline by opening the trace in the Platform Analyzer. It offers very handy ways to pinpoint the hotspots in execution and correlate them to the API calls.

In general, the particular analysis depends on the identified areas for improvement. For example, you may see that DirectX calls dominate the GPU execution path. In this case, use Frame Analyzer because it relies on the frame capture file (generated similarly to the trace capture, but with the Ctrl+Shift+C shortcut). The frame capture helps you understand exactly what is happening within your application on a frame-by-frame basis.

For the rest of the document we focus on features related to the OpenCL API. Specifically, we use the OpenCL and Intel Media SDK Interoperability code sample that exploits Intel Media SDK for initial video decoding, processes the decoded video frame with the OpenCL API, and finally displays the resulting image on the screen with a DirectX API. So all three APIs are utilized in one sample!

Below is an example trace, viewed in the Platform Analyzer:

Figure 5. Example trace viewed in the Platform Analyzer. Note how the execution path (marked in red) of the OpenCL device queue (in blue) correlates to the DMA packets queue (in black). Hovering over any packet in the DMA queue highlights its path through the queue to the actual execution by the GPU.

As you can see in Figure 5, the application has two kernels, Mouse and Process, in the OpenCL queue. Both follow the execution path, where the commands from the queue are executed back-to-back. Since Mouse has only a single work-item (hover over an object to get the pop-up hint with kernel execution parameters), it executes so fast that you need to zoom in to spot it on the execution path.

In turn, the OpenCL queue execution path submits kernels to the driver, where DMA packets of different types get multiplexed into a single DMA queue. This Render and GPGPU queue serves both graphics-originated (tagged “GHAL”) and compute-originated (tagged “OpenCL”) packets. Note that video transcoding tasks pass through a dedicated Video Codec queue, which enables the Intel Media SDK commands to run on the GPU in parallel in the majority of cases.

Unlike the OpenCL device queue where different colors are assigned to different kernels (matching the colors in the OpenCL execution path), the DMA queues have just two colors: light green for the packets still stacked in the DMA queue, and yellow for the packets currently being processed by the GPU. The DMA packets with DirectX “Present” calls are marked with cross-hatching, and the color scheme is the same: green for queued DMA packets, yellow for packets being executed.

For more details on the Platform Analyzer and its GUI, refer to the Online help.

Metrics for OpenCL Kernel Analysis

As discussed in the previous section, application optimization typically starts with a user-space analysis: for example, API-level tracing in the Platform Analyzer to sanity-check the general application flow and to verify that the overall GPU utilization is acceptable.

After this phase of API-level analysis, you can focus on the most expensive OpenCL kernels with the help of specific metrics. The metrics appear on the same timeline:

Figure 6. Example of fine-grained metrics (charts at the bottom of the screenshot) in the Platform Analyzer. Notice that metrics appear on the same timeline. Also notice the resolution of the metrics (time scale on top). Hover over any specific point on the chart to get a popup hint with the exact value.

The Platform Analyzer supports the same types of OpenCL kernel metrics as the previous release:

CPU-specific metrics, such as core utilization

Intel HD Graphics execution units (EUs) metrics, of which GPU EUs active/idle/stalled are the most important

Memory metrics, such as GPU memory reads/writes

Power metrics for CPUs, Intel HD Graphics device, and the whole package

Unless your algorithm is memory-bound, the execution units (EUs) are likely to gate the performance of your application. The EU metrics can provide information on these bottlenecks. The goal is to maximize the utilization of EUs with useful computations. Refer to the OpenCL Applications Optimization Guide for tips and tricks.

The following information briefly describes the EU-related metrics:

GPU EUs Active represents the percentage of time when the GPU execution units (EUs) were actively executing instructions. GPU EUs Idle is the percentage of time when the EUs were neither actively executing instructions nor stalled (see below).

GPU EUs Stalled metric represents the percentage of time when the GPU execution units (EUs) were stalled. An EU becomes stalled when all of its threads are waiting for results from fixed function units, for example, requesting data from the Data Port or Sampler.

If GPU EUs Stalled is quite high, this might indicate inefficient memory bandwidth usage (for example, suboptimal data access granularity or cache thrashing, so that the GPU waits for data to arrive). See the Intel SDK for OpenCL Optimization Guide for theoretical memory performance and hints on saturating the bandwidth.

Finally, if the number of work-groups in flight is insufficient, EU utilization might be very low (GPU EUs Idle will be high). Too small a local size passed to the clEnqueueNDRangeKernel call can also leave units idle. Again, refer to the Intel SDK for OpenCL Optimization Guide for details.
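To make the relationship between global size, local size, and the number of work-groups concrete, here is a minimal sketch in plain C. The helper names (round_up_global_size, work_group_count) are hypothetical and not part of any SDK; the only OpenCL fact it relies on is that in OpenCL 1.x the global size must be evenly divisible by an explicitly specified local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round a global work size up to the next multiple
 * of the local size, so the pair can be used as the global_work_size and
 * local_work_size arguments of one dimension. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}

/* Hypothetical helper: number of work-groups launched in one dimension
 * for the padded global size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    return round_up_global_size(global_size, local_size) / local_size;
}
```

For example, 1000 work-items with a local size of 64 are padded to 1024 and run as 16 work-groups, while a local size of 1 turns 1024 work-items into 1024 single-item work-groups. When the global size is padded, the kernel needs a bounds check on the global ID. If you are unsure which local size to use, passing NULL as the local size lets the OpenCL implementation choose one.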

Also try the Intel SDK for OpenCL and Intel Media SDK Interoperability code sample, which enables you to pause/resume Intel Media SDK decoding and OpenCL code processing with a simple GUI. You can experiment with it to understand what effect the different sample pipeline stages have on the metrics.
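The interpretation of the three EU metrics described above can be summarized in a small sketch. It assumes that the metrics partition the time exactly (Active + Stalled + Idle = 100%), which follows from the definitions but may hold only approximately for sampled values. All names here (eu_state, eu_idle_percent, dominant_eu_state, eu_hint) are hypothetical illustrations, not part of the Platform Analyzer.

```c
#include <assert.h>

typedef enum { EU_ACTIVE, EU_STALLED, EU_IDLE } eu_state;

/* Idle is whatever is left after Active and Stalled (all in percent).
 * Clamped at zero in case sampled values slightly exceed 100 in total. */
double eu_idle_percent(double active, double stalled)
{
    assert(active >= 0.0 && stalled >= 0.0);
    double idle = 100.0 - active - stalled;
    return idle < 0.0 ? 0.0 : idle;
}

/* Return the state in which the EUs spent the largest share of time. */
eu_state dominant_eu_state(double active, double stalled)
{
    double idle = eu_idle_percent(active, stalled);
    if (active >= stalled && active >= idle)
        return EU_ACTIVE;
    return stalled >= idle ? EU_STALLED : EU_IDLE;
}

/* Map the dominant state to the first thing to check, per the text. */
const char *eu_hint(eu_state state)
{
    switch (state) {
    case EU_STALLED:
        return "check memory access granularity and cache behavior";
    case EU_IDLE:
        return "check the number of work-groups and the local size";
    default:
        return "EUs are mostly busy; check that the work is useful";
    }
}
```

This is a coarse first-pass triage only. Real traces vary over time, so it is more informative to apply such a check to the time region of a specific kernel than to averages over the whole trace.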

Using Instrumentation and Tracing Technology (Intel® ITT) APIs for Custom Instrumentation of the CPU Code

Since the primary target for Platform Analyzer is GPU efficiency, it does not offer many insights into CPU code beyond tracing the recognized OpenCL or DirectX API calls. For example, it does not provide hotspots for your general C/C++ code (unlike Intel VTune Amplifier XE, which provides much deeper analysis and a source-level hotspots view). Still, you can check the overall CPU core utilization with the Platform Analyzer.

You can also annotate any CPU code with the ITT API to explore how the execution of a particular code region appears on the timeline relative to other activities. This sort of user instrumentation works well for both the Platform Analyzer and VTune Amplifier XE.
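As an illustration, the sketch below wraps a CPU code region in an ITT task. The ITT calls used (__itt_domain_create, __itt_string_handle_create, __itt_task_begin, __itt_task_end) come from ittnotify.h. Everything else is an assumption made for the example: the USE_ITT build flag, the trace_* wrappers, the domain and task names, and the sum_squares workload are all hypothetical. Without USE_ITT the wrappers compile to no-ops, so the code builds even when the ITT headers are not installed.

```c
#include <assert.h>

#ifdef USE_ITT                      /* hypothetical build flag */
#include <ittnotify.h>              /* link with the ittnotify library */
typedef __itt_string_handle *region_name;
static __itt_domain *g_domain;
static void trace_init(const char *domain)
{
    /* Assumes a non-UNICODE build, where this takes a char string. */
    g_domain = __itt_domain_create(domain);
}
static region_name trace_name(const char *name)
{
    return __itt_string_handle_create(name);
}
static void trace_begin(region_name name)
{
    __itt_task_begin(g_domain, __itt_null, __itt_null, name);
}
static void trace_end(void) { __itt_task_end(g_domain); }
#else
/* No-op stand-ins so the example builds without the ITT headers. */
typedef const char *region_name;
static void trace_init(const char *domain) { (void)domain; }
static region_name trace_name(const char *name) { return name; }
static void trace_begin(region_name name) { (void)name; }
static void trace_end(void) { }
#endif

/* Hypothetical CPU workload: sum of squares 1..n, annotated as one task. */
long sum_squares(int n)
{
    assert(n >= 0);
    trace_init("MyApp.Domain");                  /* hypothetical name */
    region_name region = trace_name("sum_squares");

    trace_begin(region);                         /* region starts here */
    long sum = 0;
    for (int i = 1; i <= n; ++i)
        sum += (long)i * i;
    trace_end();                                 /* region ends here */
    return sum;
}
```

In a real application you would typically create the domain and the string handles once at startup rather than on every call, and keep only the begin/end pair around the region of interest.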

Summary

This paper covered the following key points:

Platform Analyzer provides a powerful way to analyze OpenCL applications with support for the OpenCL device queues.

You can inspect the overall GPU utilization and see the breakdown for each API, including the OpenCL API.

You can explore a set of accurate metrics that cover the host CPU and the Intel HD Graphics OpenCL device.