Jim Dempsey

Dr. Dobb's Bloggers

Two Variations of Parallel Pipelines

February 15, 2010


Today we examine two variations of parallel pipelines: the 3-stage parallel pipeline running a video frame body tracking program, and a dual 2-stage parallel pipeline used in situations where the input and output run in isolation from the computational phase of the application.

There are many classes of programming problems that are ideally suited to a parallel pipeline. As an illustration, let's look at the Body Tracking benchmark program from Princeton University's PARSEC benchmark suite.

The Body Tracking program uses four video cameras viewing an area from four different perspectives. The four cameras take synchronized video frames of this area and run for a duration of time. For this benchmark, 261 frames from each of the four cameras were used (1044 total frames). The programming problem is to interpret synchronized frames to identify if a person is within the field of view, and to identify the body position of a person walking within this field of view. An example of the output for one synchronized frame from the four video cameras follows:

To programmers unfamiliar with parallel pipelines, the parallelization strategy tends to be: Use a profiler to identify hot spots and address your parallelization efforts to those areas first. Repeat profiling and parallelizing the hot spots until diminishing returns indicate your parallelization phase is complete. The next step would be to look at improving the I/O by adding multi-buffering for input and output, as well as adding threads and program complexity to perform the I/O. Another description of this process is: as your returns diminish, increase the complexity of your program.

Programmers familiar with parallel pipelines will recognize that the above strategy has the priorities reversed, often produces non-optimal code, and consumes unnecessary programming effort. To illustrate this point, consult the following chart:

The above chart shows the performance in terms of frames per second for the output including a composite of the four images with body position identified. The performance results are charted for runs on four different platforms.

The serial performance for each system is listed as a baseline, and four different parallel programming techniques were compared: OpenMP, Threading Building Blocks, the Win32 threading model, and QuickThread.

The QuickThread technique is clearly the dominant method, at 30% faster than the nearest alternative when run on the dual Xeon 5570 system.

What makes QuickThread so dominant?

How much additional coding effort was involved?

The answers are:

The QuickThread adaptation uses the QuickThread 3-stage parallel pipeline, and the adaptation used the serial code with very minor changes.

The programming effort was reduced to relocating some static objects and arrays into dynamically allocated objects and arrays. Ostensibly, the serial code was untouched.

For the Body Tracking program, the QuickThread 3-stage pipeline looks like this:

Stage-1 is the Read Sequencer.

The Read Sequencer task runs on an I/O class thread whenever a buffer is available from the pipeline buffer pool. Its job is to read through an input store and consolidate the data into the buffer provided by the buffer pool. The input store could be a single file, a series of sequenced files, a database, or another data store. You program the Read Sequencer task to perform as little work as necessary in constructing the data within (or referenced by) the pipeline buffer. These pipeline buffers contain your input data plus control information for the pipeline. One item of control information is the input sequence number (zero based and incrementing). Once the Read Sequencer has assembled the data within the pipeline buffer, its work is done and the task exits. Upon exit, the parallel pipeline driver enqueues the pipeline buffer to the next stage (pipe) in the pipeline. Should another buffer be available in the buffer pool, the Read Sequencer task begins again. This process is automatic and continues until a termination condition occurs (end of file, error, etc.). For the Body Tracking benchmark, four frames are read from four files (one for each camera), and the pipeline buffer essentially contains the pointers to the four frame buffers.

Stage-2 is the Do Work stage.

In this example, the 2nd pipe is of the compute class of threads, with all compute class threads being participants. The QuickThread task scheduler selects a thread from the compute class thread pool based on thread availability. The QuickThread parallel pipeline buffers are NUMA node sensitive, with a preference for selecting a thread on the same NUMA node as the buffer. The processing order of the buffers tends to be, but is not assured to be, the order of delivery. The work on a set of buffers will most likely not complete in the order in which processing began. For the Body Tracking benchmark, the four frames are essentially processed by the original serial code, with minor changes as to placement of data. The data in each pipeline buffer is independent of the data in every other pipeline buffer, so the buffers can be processed in parallel using serial code.

Stage-3 is the Write Sequencer.

The Write Sequencer receives the buffers as they are completed, which is not necessarily in sequence order. Unless instructed otherwise, the Write Sequencer automatically re-sequences the buffers for delivery to your I/O task. Your I/O task would typically write to the output file or files, or insert records into a database. When your I/O task completes, the pipeline buffer is returned to the buffer pool and, when necessary, enqueued to the Read Sequencer. For the Body Tracking benchmark, four frames of output are written to four separate files, plus a fifth file holding the composite of the four result images (first image in this article).
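The three stages described above can be sketched in portable C++ with standard threads. This is only a minimal illustration of the idea, not QuickThread's API or implementation: `Buffer`, `Channel`, and `runPipeline` are hypothetical names, the NUMA-aware thread selection is omitted, and plain integers stand in for the frame data.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pipeline buffer: payload plus the zero-based sequence number.
struct Buffer {
    std::size_t seq = 0;
    int input = 0;
    int result = 0;
};

// Minimal blocking queue; close() lets consumers drain and then stop.
template <typename T>
class Channel {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    bool pop(T& out) {  // returns false once the channel is closed and empty
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// Runs read -> work -> write over `inputs`; returns results in input order.
std::vector<int> runPipeline(const std::vector<int>& inputs,
                             std::function<int(int)> doWork,
                             std::size_t poolSize, std::size_t workers) {
    assert(poolSize > 0 && workers > 0);
    std::vector<Buffer> storage(poolSize);
    Channel<Buffer*> pool, toWork, toWrite;
    for (auto& b : storage) pool.push(&b);

    // Stage 1: read sequencer; proceeds only when a pool buffer is free.
    std::thread reader([&] {
        for (std::size_t i = 0; i < inputs.size(); ++i) {
            Buffer* b = nullptr;
            pool.pop(b);  // the pool is never closed, so this always succeeds
            b->seq = i;
            b->input = inputs[i];
            toWork.push(b);
        }
        toWork.close();
    });

    // Stage 2: compute threads run serial code on independent buffers.
    std::vector<std::thread> crew;
    for (std::size_t w = 0; w < workers; ++w)
        crew.emplace_back([&] {
            Buffer* b = nullptr;
            while (toWork.pop(b)) {
                b->result = doWork(b->input);
                toWrite.push(b);  // may arrive out of sequence order
            }
        });

    // Stage 3: write sequencer; holds early arrivals until their turn.
    std::vector<int> out;
    std::thread writer([&] {
        std::map<std::size_t, Buffer*> pending;
        std::size_t next = 0;
        Buffer* b = nullptr;
        while (toWrite.pop(b)) {
            pending[b->seq] = b;
            // Emit every buffer that is now next in sequence.
            for (auto it = pending.find(next); it != pending.end();
                 it = pending.find(++next)) {
                out.push_back(it->second->result);  // stand-in for the I/O task
                pool.push(it->second);              // recycle the buffer
                pending.erase(it);
            }
        }
    });

    reader.join();
    for (auto& t : crew) t.join();
    toWrite.close();
    writer.join();
    return out;
}
```

Calling `runPipeline(inputs, doWork, 4, 3)` would run with four pooled buffers and three compute threads. The size of the buffer pool bounds how far the reader can run ahead of the writer, which is what keeps memory use fixed regardless of input length.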

Three Comments:

The first comment is: The reason for the performance advantage of the QuickThread implementation is twofold:

The I/O processing is separated from the compute processing but integrated into the solution by way of the QuickThread parallel pipeline.

The processing performed per pipeline buffer is devoid of parallel programming constructs and accompanying overheads.

The second comment is: The performance of an application such as this could be improved by storing each camera's series of frames in one file per camera, as opposed to storing one file per frame per camera, perhaps splitting the files at convenient storage intervals of 10 or 15 minutes. This would greatly reduce the file system overhead and improve the overall performance of the pipeline.

The third comment is: Although the diagram of the parallel pipeline looks complicated, and internally is complicated, all this complexity is hidden inside the parallel pipeline code. The user programming effort is almost trivial. You write three tasks:

The first task is the input stage (pipe). For the Body Tracking benchmark the coding effort consisted of moving (with cut and paste) the read frame n code and placing it into a separate function.

The second task is the Do Work stage (pipe) which essentially is to encapsulate that section of your serial code which calls the process frame n code into a function body. Another cut and paste operation.

The third task is the output stage (pipe), which is to encapsulate the section of your serial code that writes the results into a function body. Another cut and paste operation.

The only real coding change was a minor change to relocate some static buffers into allocatable buffers.

The remaining programming task is to insert code to assemble and run the parallel pipeline. For a 3-Stage Parallel Pipeline the additional code looks like this:

Where MyPipelineIOContext is a struct or class you create to contain context information for performing I/O (e.g. file path and/or name, and/or context information to be used when constructing your pipeline buffer objects).

And where MyPipelineBuffer is a struct or class you create to contain context information for performing the compute phase on the pipeline buffer. In this case, four frame buffers, one for each camera.
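As a rough idea of what these two user-defined types might hold, here is a small sketch. Only the type names come from the text above; every member, the `kCameras` constant, and the `allocate` helper are assumptions made for illustration and are not taken from the benchmark source or from QuickThread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kCameras = 4;  // assumption: one frame per camera per buffer

// Hypothetical I/O context: what the read and write stages need to find files.
struct MyPipelineIOContext {
    std::string inputPath;
    std::string outputPath;
    std::size_t frameCount = 0;  // frames per camera, e.g. 261 in the benchmark
};

// Hypothetical pipeline buffer: one synchronized set of camera frames.
struct MyPipelineBuffer {
    std::size_t sequence = 0;  // zero-based input sequence number
    std::array<std::vector<std::uint8_t>, kCameras> frames;  // heap-backed, not static

    // Sizes every frame buffer; intended to be called once per pooled buffer.
    void allocate(std::size_t width, std::size_t height) {
        assert(width > 0 && height > 0);
        for (auto& f : frames) f.assign(width * height, 0);
    }
};
```

Keeping the frame storage inside the buffer object, rather than in static arrays, is what lets several buffers be processed at the same time by otherwise serial code.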

The 3-stage pipeline works quite well for most applications such as the frame-by-frame processing of the PARSEC Body Tracking benchmark.

There are some classes of applications where a single linear pipeline is unsuitable: applications where there is no symmetry between the input and output. For these types of applications, you might find it advantageous to sandwich your work phase between two 2-stage pipelines.

(Your parallel processing of internal objects here)

Functionally this is still a three-step process: read-in, process, write-out.

To accomplish this in QuickThread using a 2-stage pipeline for input, your parallel code for process, and a second 2-stage pipeline for output you would use something like the following code:

The prep work for input might consist of converting ASCII text input into binary format and storing it into an array. The prep work for output might be converting binary format from arrays to ASCII text. You are, however, free to use the "prep" pipes in any manner you see fit.
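To make the prep work concrete, here is a minimal sketch of an unpack routine and a pack routine of the kind described. The function names `unpackAscii` and `packAscii` are hypothetical, and the whitespace-separated, one-value-per-line text format is an assumption; in a pipeline, each call would operate on the contents of a single pipeline buffer.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Input prep: parse whitespace-separated ASCII numbers into a binary array.
std::vector<double> unpackAscii(const std::string& text) {
    std::vector<double> values;
    std::istringstream in(text);
    double v;
    while (in >> v) values.push_back(v);
    assert(in.eof());  // sketch-level check that the whole input was numeric
    return values;
}

// Output prep: format binary values back into ASCII, one value per line.
std::string packAscii(const std::vector<double>& values) {
    std::ostringstream out;
    for (double v : values) out << v << '\n';
    return out.str();
}
```

Because each routine touches only the data handed to it, many buffers can be unpacked or packed concurrently without locks.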

An example of using two 2-stage pipelines is the QuickThread variation of the Black-Scholes benchmark (in the PARSEC benchmark suite from Princeton University).

In the QuickThread column (rightmost column), the 2-stage input parallel pipeline (blue) and 2-stage output parallel pipeline (yellow) illustrate the exceptional performance to be gained in the input and output phases of your application. While the interior (ROI) computation time of each threading model is approximately the same, QuickThread's parallel pipeline provided an overall 4x improvement in running this synthetic benchmark application.

The typical European Options application would run only one iteration of each option. When charting this, the computation phase is insignificant as compared to the I/O part of the application. The chart of the 1 iteration run on the 10,000,000 options becomes:

The parallelization improvements from using OpenMP, TBB, and threaded models are hardly worth the coding effort. However, the dual 2-stage parallel pipeline technique using QuickThread brings a 9x improvement in application run time. You can clearly see the advantage of using a parallel pipeline for applications of this nature.

Typically, a 3-stage pipeline assumes (requires) that every buffer passes all the way through the pipeline, i.e., the number of buffers read is the number of buffers written (although the data varies).

When this is not the case, and when using a 3-stage pipeline, you can accommodate a number of writes that differs from the number of reads by use of a flag in the pipeline buffer:

bool nothingToOutput; // true == skip output, false == do output

Then your output stage would observe the flag and bypass the I/O when set.
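A minimal sketch of the flag technique follows. The buffer layout, the `computeStage` and `outputStage` function names, and the rule that blank records produce no output are all assumptions for illustration; a vector of strings stands in for the output file.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical pipeline buffer carrying one record plus the skip flag.
struct MyPipelineBuffer {
    std::string record;
    bool nothingToOutput = false;  // true == skip output, false == do output
};

// Compute stage: decides whether this buffer produces any output.
// Assumption for illustration: blank records are dropped.
void computeStage(MyPipelineBuffer* buf) {
    assert(buf != nullptr);
    buf->nothingToOutput = buf->record.empty();
}

// Output stage: observes the flag and bypasses the I/O when it is set.
// Returns true if a record was written to `sink` (a stand-in for a file).
bool outputStage(const MyPipelineBuffer& buf, std::vector<std::string>& sink) {
    if (buf.nothingToOutput) return false;
    sink.push_back(buf.record);
    return true;
}
```

The buffer still travels through all three stages, so the pipeline's read and write counts stay matched even though fewer records reach the output.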

When the input is substantially different from the output (or when there is no I/O for input or no I/O for output) you will find it easier to code using one or two 2-stage pipelines.

In the code snippet for the two 2-stage pipelines, note that the work done in DoWorkOutsideOfPipeline(); can instead be contained within the compute pipe of either of the two 2-stage pipelines. However, this may be counter-productive when compared with using a simplified unpack and/or pack routine as the compute step of the pipeline.

Parallel pipelines work exceptionally well when the DoWork pipe is essentially serial code. That is to say, the DoWork pipe uses serial code to process data within the context of the buffer. The nature of the parallel pipeline is to run several of these buffers in parallel. Also note that with isolation of the data, many of the software locks can be eliminated in the intra-buffer work.

As long as the I/O pipes can outpace the compute pipe(s), your pipeline buffers run in parallel using serial code. And in the process, you have all but eliminated the task management overhead of the application while simplifying the coding effort.

Jim Dempsey
www.quickthreadprogramming.com
