Executive Summary

This case study details the optimization of Total Immersion's D'Fusion* Augmented Reality pipeline, using the Intel® Integrated Performance Primitives (Intel® IPP) Asynchronous to execute key parts of the pipeline on the GPU. The paper explains the Total Immersion pipeline, the goals and strategy for the optimization, the results achieved, and the lessons learned.

Intel IPP Asynchronous

The Intel IPP Asynchronous (Intel IPP-A) library, available for Windows* 7, Windows 8, Linux*, and Android*, is a companion to the traditional CPU-based Intel IPP library. This library extends the successful Intel IPP acceleration library model to the GPU, providing a set of GPU-accelerated primitive functions that can be used to build visual computing algorithms. Intel IPP-A is a simple host-callable C API consisting of a set of functions that operate on matrix data, the basic data type used to represent image and video data. The functions provided by Intel IPP-A are low-, medium-, and high-level building blocks for video analysis algorithms. The library includes low-level functions such as basic math and Boolean logic operations; mid-level functions like filtering operations, morphological operations, and edge detection algorithms; and high-level functions including HAAR classification, optical flow, and Harris and Fast9 feature detection.

When a client application calls a function in the Intel IPP-A API, the library loads and executes the corresponding GPU kernel. The application does not explicitly manage GPU kernels; at application run time the library loads the correct highly optimized kernels for the specific processor. The Intel IPP-A library supports third generation Intel® Core™ processors (code named Ivy Bridge) and higher, and Intel® Atom™ processors, like the Bay Trail SoC, that include Intel® Processor Graphics. Allowing the library implementation to manage kernel selection, loading, dispatch, and synchronization simplifies the task of using the GPU for visual computing functionality. The Intel IPP-A library also includes a CPU-optimized implementation for fallback on legacy systems or application-level CPU/GPU balancing.

Like the traditional CPU-based Intel IPP library, when code is implemented using the Intel IPP-A API, the code does not need to be updated to take advantage of the additional resources provided by future Intel processors. For example, when a processor providing additional GPU execution units (EUs) is released, the existing Intel IPP-A kernels can automatically scale performance, taking advantage of the additional EUs. Or, if a future Intel processor provides new hardware acceleration blocks for video analysis operations, a new Intel IPP-A library implementation will use the accelerators while keeping the Intel IPP-A interface constant. Developers can simply recompile and relink with the new library implementation. Intel IPP-A provides a convenient abstraction layer for GPU-based visual computing that provides automatic performance scaling across processor generations.

It is easy to integrate Intel IPP-A code with the existing CPU-based code, so developers can take an incremental approach to optimization. They can identify key pixel processing hotspots and target those for offload to the GPU. But they must take care when offloading to the GPU so as not to introduce data transfer overhead. Instead, developers should create an algorithm pipeline that allows significant work to be performed on the GPU before the results are required by the CPU code, minimizing inter-processor data transfer.

Benefits of GPU Offload

Offloading time consuming pixel processing operations to the GPU can result in significant power and performance benefits. In particular, the GPU:

Has a lower operating frequency – the GPU runs at a lower clock frequency than the CPU, consuming less power for the same computation.

Has more hardware threads – the GPU has significantly more hardware threads, providing better performance for operations where performance scales with an increasing number of threads, such as the visual processing operations in Intel IPP-A.

Has the potential to run more complex algorithms – due to the better power and performance provided by the GPU, developers can use more computationally intensive algorithms to achieve improved results and/or process more pixels than they could otherwise using the CPU only.

Can free the CPU for other tasks – by moving processing to the GPU, developers can reduce CPU utilization, freeing up the CPU processing resources for other tasks.

The benefits offered by Intel IPP-A programming on the GPU can be applied in a variety of market segments to help ISVs reach specific goals. For example, in Digital Security and Surveillance (DSS), the primary metric is the number of channels of input video that a platform can process (the "channel density"), while in Augmented Reality, decreasing the time to acquire targets to track and increasing the number of objects that can be simultaneously tracked are key.

Augmented Reality

Augmented Reality (AR) enhances a user's perception with computer-generated input such as sound, video, or graphics data. AR merges the real world with computer-generated elements, either meta information or virtual objects, resulting in a composite that presents more information and capabilities than an un-augmented experience. AR applications usually overlay information about the environment and objects on a real-time video stream, making the virtual objects interactive. AR technology can be applied to many market segments including retail, medicine, entertainment, and education. For example:

Mobile augmented reality systems combine a mobile platform's camera, GPS, and compass sensors with its Internet connectivity to pinpoint the user's location, detect device orientation, and provide information about the scene, overlaying content on the screen.

Virtual dressing rooms allow customers to virtually try on clothes, shoes, jewelry, or watches, either in-store or at home, automatically sizing the item to the user in a 3D view on the device.

Construction managers can view and monitor work in progress, in real time, through Augmented Reality markers placed throughout a site.

Total Immersion

Total Immersion is an augmented reality company, founded in 1998, based in Suresnes, France. Through its patented D'Fusion software solution, Total Immersion combines the virtual world and the real world by integrating real-time interactive 3D graphics into a live video stream. The company maintains offices in Europe, North America, and Asia and supports the world's largest augmented reality partner network, with over 130 solution providers.

Today, mobile technology is everywhere. Total Immersion (TI) is developing compelling AR experiences for tablets and phones. Intel, recognizing Total Immersion as a leader in Augmented Reality, initiated a collaboration with TI to optimize the D'Fusion software for Intel processors, including GPU offloading. They aimed to improve the AR experience when running on Intel products that power mobile platforms, such as the Intel Atom SoC Z3680.

Optimization Goals and Strategy

Augmented Reality applications rely on computer vision algorithms to detect, recognize, and track objects in input video streams. While a large part of the AR processing doesn't deal directly with pixels, the pixel processing required is a computationally intensive, data parallel task appropriate for GPU offload. Intel and Total Immersion planned to offload the pixel processing to the GPU, using Intel IPP-A, so that the pipeline would handle the pixel processing, from capture to rendering, and only the metadata about the pixel information would be returned to the CPU as input for higher-level AR operations. By offloading all of the pixel processing to the GPU, the application was expected to achieve better performance with less power consumption, making D'Fusion-based applications run efficiently on mobile platforms while conserving battery life.

The D'Fusion AR Pipeline

The core of the D'Fusion software is a processing pipeline that consists of the following stages:

Figure 1 – The Design of the PixelFlow Framework

Capture – The first step in the pipeline is capturing input video from the camera. The video can be captured in a variety of formats, such as RGB24, NV12, or YUY2, depending on the specific camera. Frames are captured at the full frame rate, typically 30 FPS, and passed to the next stage in the pipeline. Each captured frame has an associated time stamp that specifies the precise time of capture.

Preparation – Computer vision algorithms usually operate on grayscale images, and the TI AR pipeline is no exception. The first step after Capture is to convert the color format of the captured image to grayscale. Next, because computer vision algorithms often do not require the full frame size to operate effectively, input frames can be downscaled to a lower resolution. The reduced number of pixels to process saves computational resources. Then, depending on the orientation of the image, mirroring may also be required. Finally, in addition to the grayscale image required by the computer vision processing, a color image must also be sent down the pipeline so that the scene can eventually be rendered along with the AR-generated information. It is also necessary to obtain a second color format conversion from the camera input format, like NV12, to a format appropriate for display, such as ARGB. All of the operations in the Preparation stage are pixel-intensive operations appropriate to target for offload to the GPU.
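The Preparation steps can be illustrated with a minimal CPU-side sketch. It is not the Intel IPP-A or D'Fusion implementation: the function names are hypothetical, the input is assumed to be an RGB24 frame stored as nested Python lists, the grayscale weights are the common integer BT.601 approximation, and the downscale is fixed at a factor of two using 2x2 averaging.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B), each in 0..255
ColorImage = List[List[Pixel]]
GrayImage = List[List[int]]


def to_grayscale(image: ColorImage) -> GrayImage:
    """Convert RGB to 8-bit luma with integer BT.601 weights (77+150+29 = 256)."""
    return [[(77 * r + 150 * g + 29 * b) >> 8 for (r, g, b) in row]
            for row in image]


def downscale_2x(image: GrayImage) -> GrayImage:
    """Halve width and height by averaging each 2x2 block of pixels."""
    height = len(image) // 2 * 2      # ignore a trailing odd row/column
    width = len(image[0]) // 2 * 2
    return [[(image[y][x] + image[y][x + 1] +
              image[y + 1][x] + image[y + 1][x + 1]) // 4
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]


def mirror_horizontal(image: GrayImage) -> GrayImage:
    """Flip the image left to right."""
    return [row[::-1] for row in image]


def prepare_for_analytics(frame: ColorImage, mirror: bool = False) -> GrayImage:
    """Grayscale conversion, downscale, and optional mirroring, in that order."""
    gray = downscale_2x(to_grayscale(frame))
    return mirror_horizontal(gray) if mirror else gray


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    frame = [[white, white, black, black],
             [white, white, black, black]]
    print(prepare_for_analytics(frame))
    print(prepare_for_analytics(frame, mirror=True))
```

Each of these steps touches every pixel independently of its neighbors in other blocks, which is why the text identifies them as good candidates for GPU offload.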

Detection – Once a frame is prepared, the pipeline applies a feature detection algorithm, either Harris or Fast9, to the reduced-size grayscale input image. The algorithm returns a list of feature points detected in the image. The feature detection algorithm can be controlled by various parameters, including the threshold level. These parameters continuously adjust the feature point detection to return an optimal number of feature points and to adapt to changing ambient conditions, such as the brightness of the input scene. Non-maximal suppression is applied to the feature point calculation to get a better distribution of feature points, avoiding local "clustering." Both feature detection and non-maximal suppression are targeted for offload to the GPU.
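To make the role of non-maximal suppression concrete, here is a small sketch of one common variant: a greedy, radius-based suppression over scored feature points. The source does not describe which NMS formulation the TI code or Intel IPP-A uses, so the function name, the `(x, y, score)` layout, and the radius parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

Feature = Tuple[int, int, float]   # (x, y, score); layout is an assumption


def non_maximal_suppression(features: List[Feature], radius: float) -> List[Feature]:
    """Keep the strongest feature in each neighborhood of the given radius.

    Features are visited from highest to lowest score; a feature is kept only
    if no already-kept feature lies within `radius` pixels of it.
    """
    kept: List[Feature] = []
    radius_sq = radius * radius
    for x, y, score in sorted(features, key=lambda f: f[2], reverse=True):
        if all((x - kx) ** 2 + (y - ky) ** 2 > radius_sq for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


if __name__ == "__main__":
    clustered = [(10, 10, 0.9), (11, 10, 0.8), (12, 11, 0.7), (40, 40, 0.5)]
    print(non_maximal_suppression(clustered, radius=3))
```

In the example, the three points near (10, 10) collapse to the single strongest one, which is the "better distribution" effect described above.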

Recognition – Once the features are generated by the Detection stage of the pipeline, the FERNS algorithm is used to match the features against a database of known objects. Instead of operating on the feature points directly, the FERNS algorithm uses a patch, a square region of pixels centered on the feature point. The patches are taken from a filtered version of the frame that has been convolved with a smoothing filter. Each of the patches is associated with a timestamp of the frame from which they were derived. Since the processing of each patch by the FERNS algorithm is an independent operation, it is easily parallelizable and a candidate for GPU offload. The frame smoothing can also happen on the GPU.
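The patch extraction step can be sketched as follows. This is only an illustration of the idea of "a square region of pixels centered on the feature point, taken from a smoothed frame": the 3x3 box filter stands in for the unspecified smoothing filter, and the function names and patch size are hypothetical. The FERNS classification itself is not shown.

```python
from typing import List, Optional

GrayImage = List[List[int]]


def box_smooth_3x3(image: GrayImage) -> GrayImage:
    """Smooth with a 3x3 mean filter, replicating pixels at the borders."""
    height, width = len(image), len(image[0])

    def at(y: int, x: int) -> int:
        return image[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    return [[sum(at(y + dy, x + dx)
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)) // 9
             for x in range(width)]
            for y in range(height)]


def extract_patch(image: GrayImage, cx: int, cy: int,
                  half_size: int) -> Optional[GrayImage]:
    """Return the square patch centered on (cx, cy), or None near the border."""
    height, width = len(image), len(image[0])
    if (cx - half_size < 0 or cy - half_size < 0 or
            cx + half_size >= width or cy + half_size >= height):
        return None
    return [row[cx - half_size:cx + half_size + 1]
            for row in image[cy - half_size:cy + half_size + 1]]


if __name__ == "__main__":
    frame = [[10 * y + x for x in range(5)] for y in range(5)]
    smoothed = box_smooth_3x3(frame)
    print(extract_patch(smoothed, 2, 2, 1))
```

Because each patch is cut and classified independently, the work for different feature points can proceed in parallel, which matches the reasoning given for GPU offload.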

Tracking – Many image processing algorithms operate on multi-resolution images called image pyramids, where each level of the pyramid is a further downscaled version of the original input frame. The Tracking stage of the pipeline provides the image pyramid to the Lucas-Kanade optical flow algorithm to track the objects in the scene. Both the image pyramid generation and the optical flow are good candidates to run on the GPU.
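An image pyramid can be sketched in a few lines. The version below is an assumption-laden simplification: it halves each level with plain 2x2 averaging, whereas a production pyramid (the TI function names elsewhere in this paper suggest a 3x3 Gaussian) would filter before subsampling. The function names are hypothetical.

```python
from typing import List

GrayImage = List[List[int]]


def halve(image: GrayImage) -> GrayImage:
    """Produce the next pyramid level by averaging each 2x2 block."""
    height = len(image) // 2 * 2
    width = len(image[0]) // 2 * 2
    return [[(image[y][x] + image[y][x + 1] +
              image[y + 1][x] + image[y + 1][x + 1]) // 4
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]


def build_pyramid(image: GrayImage, levels: int) -> List[GrayImage]:
    """Return [level 0 (input), level 1, ...], stopping early if too small."""
    pyramid = [image]
    for _ in range(levels - 1):
        previous = pyramid[-1]
        if len(previous) < 2 or len(previous[0]) < 2:
            break
        pyramid.append(halve(previous))
    return pyramid


if __name__ == "__main__":
    frame = [[x + 4 * y for x in range(4)] for y in range(4)]
    for level, img in enumerate(build_pyramid(frame, levels=3)):
        print(level, img)
```

A pyramidal Lucas-Kanade tracker would start at the coarsest level and refine its motion estimate level by level; that part is not shown here.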

Rendering – Rendering is the final stage of the pipeline. In this stage, the AR results are combined with the color video and rendered on the output, in this case using OpenGL*. The application renders the color video as an OpenGL texture and uses OpenGL functions to draw the graphics output, based on the video analysis, on top of the video frame.

Optimization Strategy

Initial profiling of the TI application confirmed that the pixel processing operations mentioned in the prior section were the primary bottlenecks in the AR pipeline. However, other bottlenecks existed, including a CPU-based copy of the color image data to an OpenGL texture.

To simplify collaboration, Intel delivered the optimizations to Total Immersion as a library to be incorporated into the TI software. The library, dubbed PixelFlow, encapsulates the pixel processing required by the TI AR pipeline and is implemented using the Intel IPP-A library. Intel and Total Immersion decided that PixelFlow would target the Preparation, Detection, and Rendering bottlenecks first, while also providing information required for the Recognition and Tracking stages. Moving the first stages of the pipeline to the GPU would be a milestone towards the eventual goal of handling all pixel processing operations on the GPU.

To implement the Preparation and Detection stages, the operations performed by PixelFlow on the GPU included color format conversion, resizing, mirroring, Fast9 and Harris feature point detection, and non-maximal suppression. To support the Recognition and Tracking stages, the library provides a smoothed frame to be used by the FERNS algorithm and an image pyramid of the input to be used by the optical flow algorithm. Finally, PixelFlow also provides a GPU texture of the color input frame suitable for use in OpenGL.

Implementation

The PixelFlow framework was conceived as a flexible framework for analysis of multiple video input streams derived from a single video capture source. The PixelFlow pipeline runs on the GPU, operating asynchronously with the CPU. Each video capture source serves frames to one or more logical video streams, where the color format and resolution of each stream is independently configurable. Each stream runs on a separate thread and can use Intel IPP-A to analyze the video frames, producing meta information. The following diagram shows the general design of the framework.

Figure 2 – The Design of the PixelFlow Framework

The TI Augmented Reality pipeline comprises two video streams: the Analytics Stream and the Graphics Stream. The Analytics Stream processes a grayscale input frame, performing feature detection with non-maximal suppression, image pyramid generation, and smoothing of the input frame. The Graphics Stream converts the color camera input to ARGB for display. In both cases, the resulting data is placed in a queue for access by the CPU-based code. The following diagram shows the basic organization of the pipeline and the functions targeted for offload to the GPU.

Figure 3 – The PixelFlow implementation for the TI AR pipeline

The information on each queue has a timestamp of the original frame capture, allowing the CPU software to correlate each frame with the corresponding data produced by the analytics stream.
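The timestamp-based correlation between the two queues might look like the following sketch. It is not PixelFlow code: the `QueueItem` type, the microsecond timestamp field, and the policy of dropping unmatched older items are assumptions, and both queues are assumed to be ordered by capture time.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


@dataclass
class QueueItem:
    timestamp_us: int      # capture time of the source frame (assumed unit)
    payload: object        # analytics metadata or a display-ready frame


def pop_matching_pair(
    analytics: Deque[QueueItem], graphics: Deque[QueueItem]
) -> Optional[Tuple[QueueItem, QueueItem]]:
    """Pop the oldest analytics/graphics pair that share a capture timestamp.

    Older items with no counterpart in the other queue are discarded.
    Returns None when no matching pair is currently available.
    """
    while analytics and graphics:
        a, g = analytics[0], graphics[0]
        if a.timestamp_us == g.timestamp_us:
            return analytics.popleft(), graphics.popleft()
        if a.timestamp_us < g.timestamp_us:
            analytics.popleft()    # analytics result for a frame never displayed
        else:
            graphics.popleft()     # display frame with no analytics result
    return None


if __name__ == "__main__":
    analytics = deque(QueueItem(t, f"features@{t}") for t in (100, 200, 300))
    graphics = deque(QueueItem(t, f"argb@{t}") for t in (200, 300))
    pair = pop_matching_pair(analytics, graphics)
    print(pair[0].payload, pair[1].payload)
```

In a threaded implementation the deques would be replaced by thread-safe queues; the matching logic stays the same.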

Implementation Challenges

Several challenges were encountered during the implementation of the PixelFlow framework:

Separate kernels for frame preparation – The initial PixelFlow implementation used separate Intel IPP-A functions for resizing, color format conversion, and mirroring. Because these functions did not support multi-channel images, preparing the ARGB output for the Graphics Stream required one Intel IPP-A function to split the input image into separate channels, further calls to resize and mirror each channel individually, and a final call to combine the channels back into an interleaved format. To minimize the kernel overhead and simplify programming, the Intel IPP-A team developed a single hppiAdvancedResize function that combines the resize, color format conversion, and mirroring into a single GPU kernel, allowing the frame to be prepared for the Analytics Stream or the Graphics Stream with a single function call.

Direct-to-GPU-memory video input – The intention of the PixelFlow pipeline was to have the entire pipeline, from video capture to graphics rendering, on the GPU. However, the graphics drivers for the targeted platforms did not yet support direct-to-GPU-memory video capture. Instead, each frame was captured to system memory and then copied to GPU memory. To minimize the impact of the copy, the PixelFlow implementation took advantage of the Fast Copy feature supported by the Intel IPP-A library. Using a 4K-aligned system memory buffer, the GPU kernel is able to use shared physical memory to access the data, thus avoiding a copy.
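The 4K alignment requirement is simple address arithmetic, sketched below in Python purely for illustration. The helper names are hypothetical, and nothing here calls Intel IPP-A; a C or C++ application would normally obtain such a buffer from its platform's aligned allocator instead. Any Fast Copy requirements beyond the 4K alignment mentioned above are not covered.

```python
import ctypes
from typing import Tuple

ALIGNMENT = 4096  # 4 KB, as described for the Fast Copy feature


def address_of(buffer) -> int:
    """Return the start address of a writable, contiguous buffer."""
    return ctypes.addressof((ctypes.c_char * len(buffer)).from_buffer(buffer))


def allocate_aligned(size: int, alignment: int = ALIGNMENT) -> Tuple[bytearray, memoryview]:
    """Over-allocate, then slice so that the view starts on an aligned address.

    Returns (backing, view); keep `backing` alive for as long as `view` is used.
    """
    backing = bytearray(size + alignment - 1)
    offset = (-address_of(backing)) % alignment   # bytes to skip to reach alignment
    view = memoryview(backing)[offset:offset + size]
    return backing, view


if __name__ == "__main__":
    # One 640x480 NV12 frame: 12 bits per pixel.
    frame_bytes = 640 * 480 * 3 // 2
    backing, frame = allocate_aligned(frame_bytes)
    print(len(frame), address_of(frame) % ALIGNMENT)
```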

NMS, weights, and orientation for Fast9 – The results produced by the Intel IPP-A Fast9 algorithm did not initially match the CPU-based function that it replaced. An investigation revealed that the TI code was also applying non-maximal suppression to the results of the Fast9 calculation. In addition, the TI code also calculated a weight and orientation value for each detected feature point. The team updated the Intel IPP-A Fast9 function to add NMS as an option and to return the weight and orientation values.

OpenGL surface sharing and DX9 surface import/export – OpenGL is used for rendering in this pipeline. The video frame is rendered as an OpenGL texture and other virtual elements are added by calling OpenGL drawing primitives. In the Frame Preparation stage of the pipeline, Intel IPP-A's AdvancedResize function converts the video frame from the input format (NV12, YUY2, etc.) to ARGB. A CPU-based copy of this image into an OpenGL texture was one of the top bottlenecks. The Intel IPP-A team added an import/export capability so that a DX9 surface handle could be extracted from an existing Intel IPP-A matrix, or an Intel IPP-A matrix could be created from an existing DX9 surface. This enabled the use of the OpenGL surface sharing capability in the Intel OpenGL driver. With this functionality, a DX9 surface could be shared with OpenGL as a texture, avoiding the CPU-based copy and keeping the data on the GPU.

Additional Non-PixelFlow Optimizations

After implementing the optimizations described in the previous section, a trace performed in the VTune™ analyzer showed that when tracking nine targets, with input video and analytics resolution at 1024x768, several hotspots remained in the computer vision module:

Remaining Hotspots – Ivy Bridge

| Function | % of CV | Description |
| --- | --- | --- |
| dcvGroupFernsRecognizer::RecognizeAll | 18.95 | Using x87 floating point. Should try using SIMD floating point instructions such as Intel® SSE3 or Intel® AVX. |
| dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim | 16.76 | General code generation issues. Expect these would be improved by using the Intel® compiler. |
| dcvPolynomSolver::solve_deg3 | 10.20 | General code generation issues. Expect these would be improved by using the Intel compiler. |

After building the computer vision module with the Intel® compiler with Intel® AVX instructions enabled, the hotspots were eliminated. A second trace showed the following distribution:

Remaining Hotspots – Ivy Bridge

| Function | % of CV | Description |
| --- | --- | --- |
| dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim | 33.56 | Image pyramid generation. |
| dcvCorrelationsDetectorLite::ComputerIntegralImage | 16.83 | Integral image computation. |
| dcvKtlOptim::__CalcOpticalFlowPyrLK_Optim_ResizeNN_levels | 13.0 | LK optical flow. |

The second trace uncovered an instance in the code that still used the old CPU-based image pyramid calculation. The instance was updated to use the image pyramid calculated by PixelFlow. The remaining hotspots were operations not yet included in PixelFlow: integral image and LK optical flow. The team will target these functions first when extending the PixelFlow functionality.
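For readers unfamiliar with the integral image that shows up as a remaining hotspot, the sketch below shows the standard summed-area-table construction and how it yields any rectangular pixel sum in constant time. It is a generic textbook formulation with hypothetical function names, not the TI implementation.

```python
from typing import List

GrayImage = List[List[int]]


def integral_image(image: GrayImage) -> List[List[int]]:
    """Build a summed-area table with one extra leading row and column of zeros.

    table[y][x] holds the sum of all pixels above and to the left of (x, y),
    exclusive, so table[h][w] is the sum of the whole image.
    """
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table: List[List[int]], x0: int, y0: int, x1: int, y1: int) -> int:
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]


if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    table = integral_image(img)
    print(table[3][3], box_sum(table, 1, 1, 3, 3))
```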

Results – Performance and Power

The resulting AR pipeline offloads its initial stages to the GPU and provides data for subsequent stages of AR processing. To analyze the PixelFlow implementation of the AR pipeline, the team used a test application from Total Immersion, the "AR Player." This configurable test application allows the user to set operating parameters like the number of targets to track, the video capture resolution and format, the analytics processing resolution, and so on. In addition to the power and performance statistics, the team was interested in the feasibility and impact of increasing the analytics resolution. For the pre-optimized CPU-based flow, the TI AR software used a 320x240 analytics resolution. The additional performance provided by the GPU offload allowed the team to experiment with higher resolutions and the resulting impact on responsiveness and quality. The team tested the PixelFlow implementation on Ivy Bridge and Bay Trail platforms.

Results: Ivy Bridge

We tested the software on the following Ivy Bridge platform:

Ivy Bridge Platform Details

| Item | Description |
| --- | --- |
| Computer | HP EliteBook* 8470p |
| Processor | Intel® Core™ i7 processor 3720QM |
| Clock Speed | 2.6 GHz (3.6 GHz Max Turbo Frequency) |
| # Cores, Threads | 4, 8 |
| L1, L2, L3 Cache | 256 KB, 1 MB, 6 MB |
| RAM | 8 GB |
| Graphics | Intel® HD Graphics 4000 |
| # of Execution Units | 16 |
| Graphics Driver | Igdumdim64, 9.18.10.3257, Win7 64-bit |
| OS | Windows* 7 Pro (Build 7601), 64-bit, SP1 |

The first test scenario tracked nine targets simultaneously, with both a video capture resolution and an analytics resolution of 640x480.

Test Scenario #1

| Metric | Value |
| --- | --- |
| Number of targets | 9 |
| Capture resolution | 640x480 |
| Analytics resolution | 640x480 |

Performance Results – Ivy Bridge, Test Scenario #1

| Metric | Software (ms) | PixelFlow (ms) | Difference (ms) | Difference (%) |
| --- | --- | --- | --- | --- |
| Rendering FPS | 60 | 60 | | |
| Analytics FPS | 30 | 30 | | |
| Tracking FPS | 30 | 30 | | |
| Frame Preprocessing | 0.399 | 0.088 | -0.311 | -77.83 |
| Tracking | 1.412 | 1.355 | -0.057 | -4.03 |
| Construct Pyramid | 0.548 | 0.025 | -0.523 | -95.44 |
| Recognition | 3.322 | 1.477 | -1.846 | -55.55 |
| Compute Interest Points | 1.358 | 0.035 | -1.323 | -97.43 |
| Smooth Image | 0.693 | 0.001 | -0.692 | -99.89 |
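The two difference columns in the results above follow from simple arithmetic, shown here as a short sketch with hypothetical helper names. The published percentages appear to have been computed from unrounded timings, so recomputing them from the rounded table values can differ slightly in the last digits; the Construct Pyramid row is used here because it reproduces exactly.

```python
from typing import Tuple


def offload_delta(software_ms: float, pixelflow_ms: float) -> Tuple[float, float]:
    """Return (difference in ms, difference in %) relative to the software path.

    Negative values mean the PixelFlow path is faster.
    """
    difference_ms = pixelflow_ms - software_ms
    difference_pct = 100.0 * difference_ms / software_ms
    return round(difference_ms, 3), round(difference_pct, 2)


def speedup(software_ms: float, pixelflow_ms: float) -> float:
    """Return how many times faster the PixelFlow path is."""
    return software_ms / pixelflow_ms


if __name__ == "__main__":
    # Construct Pyramid, Ivy Bridge, Test Scenario #1
    print(offload_delta(0.548, 0.025))
    print(round(speedup(0.548, 0.025), 1))
```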

The second test scenario also tracks nine targets, but increases the video capture resolution to 1024x768 with an analytics resolution of 640x480.

Test Scenario #2

| Metric | Value |
| --- | --- |
| Number of targets | 9 |
| Capture resolution | 1024x768 |
| Analytics resolution | 640x480 |

Performance Results – Ivy Bridge, Test Scenario #2

| Metric | Software (ms) | PixelFlow (ms) | Difference (ms) | Difference (%) |
| --- | --- | --- | --- | --- |
| Rendering FPS | 60 | 60 | | |
| Analytics FPS | 30 | 30 | | |
| Tracking FPS | 30 | 30 | | |
| Frame Preprocessing | 0.391 | 0.094 | -0.297 | -75.99 |
| Tracking | 1.355 | 0.900 | -0.455 | -33.58 |
| Construct Pyramid | 0.532 | 0.024 | -0.508 | -95.58 |
| Recognition | 2.844 | 0.917 | -1.927 | -67.77 |
| Compute Interest Points | 1.225 | 0.027 | -1.199 | -97.83 |
| Smooth Image | 0.708 | 0.001 | -0.707 | -99.93 |

Results: Bay Trail

Similar tests were run on the following Bay Trail platform:

Bay Trail Platform Details

| Item | Description |
| --- | --- |
| Computer | Intel® Atom™ (Bay Trail) Tablet PR1.1B |
| Processor | Intel® Atom™ processor Z3770 |
| Clock Speed | 1.46 GHz |
| # Cores, Threads | 4, 4 |
| L1, L2, L3 Cache | 128 KB, 2048 KB |
| RAM | 2 GB |
| Graphics | Intel® HD Graphics |
| # of Execution Units | 4 |
| Graphics Driver | Igdumdim32.dll, 10.18.10.3341, Win8 32-bit |
| OS | Windows* 8 (Build 9431), 32-bit |

The test scenario differs slightly from the first test scenario run on the Ivy Bridge platform because the camera on the Bay Trail system supports different resolutions.

Test Scenario #1

| Metric | Value |
| --- | --- |
| Number of targets | 9 |
| Capture resolution | 640x360 |
| Analytics resolution | 640x360 |

Performance Results – Bay Trail, Test Scenario #1

| Metric | Software (ms) | PixelFlow (ms) | Difference (ms) | Difference (%) |
| --- | --- | --- | --- | --- |
| Rendering FPS | 55 | 35 | | |
| Analytics FPS | 30 | 30 | | |
| Tracking FPS | 15 | 15 | | |
| Frame Preprocessing | 5.215 | 0.385 | -4.830 | -92.62 |
| Tracking | 15.484 | 10.411 | -5.074 | -32.77 |
| Construct Pyramid | 6.081 | 0.122 | -5.959 | -97.99 |
| Recognition | 28.389 | 15.590 | -12.799 | -45.09 |
| Compute Interest Points | 9.235 | 0.365 | -8.870 | -96.04 |
| Smooth Image | 7.236 | 0.011 | -7.225 | -99.85 |

The second scenario for Bay Trail tests the video capture resolution at 1280x720, while the analytics resolution remains at 640x360.

Test Scenario #2

| Metric | Value |
| --- | --- |
| Number of targets | 9 |
| Capture resolution | 1280x720 |
| Analytics resolution | 640x360 |

Performance Results – Bay Trail, Test Scenario #2

| Metric | Software (ms) | PixelFlow (ms) | Difference (ms) | Difference (%) |
| --- | --- | --- | --- | --- |
| Rendering FPS | 12 | 30 | | |
| Analytics FPS | 30 | 25 | | |
| Tracking FPS | 8 | 12 | | |
| Frame Preprocessing | 4.865 | 0.408 | -4.458 | -91.62 |
| Tracking | 16.158 | 9.718 | -6.440 | -39.86 |
| Construct Pyramid | 5.995 | 0.122 | -5.872 | -97.96 |
| Recognition | 32.398 | 14.532 | -17.865 | -55.14 |
| Compute Interest Points | 8.864 | 0.376 | -8.488 | -95.76 |
| Smooth Image | 7.337 | 0.013 | -7.324 | -99.82 |

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Power Analysis

After implementing GPU offload using the PixelFlow pipeline, the investigation into power savings yielded an unexpected result: instead of showing a significant power savings from moving the processing from the CPU to the GPU, the power consumption of the PixelFlow implementation was on par with the CPU-only implementation. The following GPUView trace shows why this occurred.

Figure 4 – GPUView trace of the processing for a single frame

The application dispatched the work to the GPU in separate chunks: CPU setup, GPU operation, wait for completion, CPU setup, GPU operation, wait for completion, etc. This approach impacted power consumption, causing the processor package to be continually active and not allowing the processor to enter deeper sleep states.

Instead, the pipeline should consolidate GPU operations and maximize CPU/GPU concurrency. The following diagram illustrates the ideal situation to achieve maximum power savings: GPU operations consolidated into a single block, executing concurrently with CPU threads and leaving a period of inactivity that allows the processor package to achieve deeper sleep states.

Figure 5 – Ideal pattern to maximize power savings
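The difference between the two dispatch patterns can be illustrated with a deliberately simple timing model. Everything in it is an assumption made for illustration: the per-stage CPU setup and GPU times are invented, the frame period corresponds to 30 FPS, and the model only counts total active time per frame. It ignores the fact that serialized dispatch also fragments the idle time, which is what actually prevents deeper sleep states.

```python
from typing import List, Tuple

Stage = Tuple[float, float]          # (CPU setup ms, GPU execution ms), hypothetical
FRAME_PERIOD_MS = 1000.0 / 30.0      # one frame at 30 FPS


def serialized_active_ms(stages: List[Stage]) -> float:
    """CPU setup, GPU operation, wait, repeated: no overlap between CPU and GPU."""
    return sum(cpu + gpu for cpu, gpu in stages)


def consolidated_active_ms(stages: List[Stage]) -> float:
    """GPU work issued as one block that runs concurrently with the CPU work."""
    total_cpu = sum(cpu for cpu, _ in stages)
    total_gpu = sum(gpu for _, gpu in stages)
    return max(total_cpu, total_gpu)


def idle_ms(active_ms: float, frame_period_ms: float = FRAME_PERIOD_MS) -> float:
    """Time left in the frame during which the package could sleep."""
    return max(0.0, frame_period_ms - active_ms)


if __name__ == "__main__":
    stages = [(2.0, 3.0), (1.0, 4.0), (2.0, 2.0)]   # invented numbers
    for name, active in (("serialized", serialized_active_ms(stages)),
                         ("consolidated", consolidated_active_ms(stages))):
        print(f"{name}: active {active:.1f} ms, idle {idle_ms(active):.1f} ms")
```

Even in this crude model, consolidating the GPU work and overlapping it with CPU work lengthens the idle window per frame, which is the direction the ideal pattern above aims for.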

Conclusion

Moving the key pixel processing bottlenecks of the Total Immersion AR pipeline to the GPU resulted in performance gains on Intel processors, allowing the application to use a larger input frame size for video analysis, find targets faster, track more targets, and track them more smoothly. We expect similar gains can be achieved for similar video analysis pipelines.

While achieving performance benefits using Intel IPP-A is fairly straightforward, achieving power benefits requires a careful design of the processing pipeline. The best design is one that consolidates the GPU operations and maximizes CPU/GPU concurrency to allow the processor to reach deeper sleep states. Diagnostic and profiling tools that are GPU-capable, like GPUView and Intel VTune analyzer, are essential as they can help to identify power-related problems with the pipeline. Consider using these tools during development to verify the power efficiency of a pipeline and avoid having to re-architect a pipeline to address power-related issues.

The PixelFlow pipeline offloaded several of the pixel processing bottlenecks in the TI pipeline. Work remains to move additional operations to the GPU such as integral image, optical flow, FERNS, etc. Once these operations are included in PixelFlow, all of the pixel processing will occur on the GPU with these operations returning metadata to the CPU as input for higher-level operations. The success of the current PixelFlow implementation, which uses IPP-A-based GPU offload, indicates that further gains are possible with additional offloading of pixel processing operations.

Finally, power and performance optimization can go beyond the vision processing algorithms and extend to other areas such as video input, codecs, and graphics output. Intel IPP-A allows for DX9-based surface sharing with related Intel technologies such as the Intel® Media SDK for codecs and the OpenGL graphics driver. Understanding the optimization opportunities with these related technologies is also important, because it allows developers to create entire GPU-based processing pipelines.

Author Biographies

Michael Jeronimo is a software architect and applications engineer in Intel's Software and Solutions Division (SSG), focused on helping customers to accelerate computer vision workloads using the GPU.