Task-Based Programming in Windows

By John Revill, February 19, 2013

Convert programs into a series of independent executable parallel tasks.

Having completed initialization, let's move on to doing some work. State::SetMethod(new Method(&MyAsyncClass::Step)) sets the current method to Step(), and returning IO::Queue::ERROR_YIELD instructs the worker thread to return the instance to the input queue.

Bear in mind that in the intervening period, any number of other items might have been posted, so our instance ends up at the back of the queue. When it is retrieved, possibly on a different thread, Step() will be executed (Listing Five):

Step() posts a copy of itself to the output queue and, in place of doing any real work, decrements the counter. (In reality, this code would be an exercise in futility: the overhead of returning the instance to the input queue would far outweigh the utility of just decrementing a counter. For the moment we're just following the flow; a more realistic example is provided later.) We post a copy of the current instance rather than the instance itself. The state of our instance may change before it is fetched from the output queue, and if we wanted to ensure its integrity we would be back to needing to synchronize access. Making a copy relieves us of this burden. If we have exhausted the available work (that is, counter == 0), we arrange a transition to Finalize (Listing Six). Regardless of whether we have another step to take or need to finalize, we yield back to the input queue.

Listing Six: The Finalize function

In this case, Finalize() has little to do. Any required cleanup could be done here. We set the OnComplete handler and pass our instance to the output queue before returning ERROR_SUCCESS to indicate to the worker thread that we don't want this item posted again.

As our thread pool has been laboring to wind our counter back, the main thread has been fetching and dispatching our output queue items. And we've been receiving status reports:

It helps to have a convention that keeps the input and output sides of the fence separate in our minds; I've used an "On" prefix for that purpose. Our progress reporter simply writes some state to the console and deletes itself. Remember, this is a point-in-time copy of our working object. In a busy system, allocating and then deleting copies can become expensive, and you will run into synchronization contention in the default heap allocator. In that case, a pool allocator with a private heap can be very useful.

The final output item is the return of our original instance to the output queue:

This has been a fairly detailed look at a trivial example, with some excursions into the internals of the plumbing classes. However, I hope it is clear that for the most part we don't have to worry much about the details and can focus on the application code. This is as it should be. Now for a more interesting example.

Example: Particle Swarm Optimization (PSO)

I've chosen a PSO as my example application because it's interesting in its own right, relatively simple to code, and a good fit for the discussion at hand. If you haven't been exposed to them before, the idea is to randomly distribute a number of particles (or agents, if you prefer) across a search space and have the particles collaboratively search for an optimal solution.

The search space I'll use here is the Sombrero function. As a search space this function holds few mysteries but it does effectively demonstrate the issue of local optima.

Figure 2: The Sombrero function.

If the preceding paragraphs seem irretrievably vague, imagine a nest of ants wandering across the graph, shrouded in fog, searching for the highest point, or more typically for ants, the highest concentration of sugar. Each can sense its own altitude and can share that information. We can see that an individual ant could become stuck on the lower circular ridge. However, with a sufficiently large nest randomly but evenly scattered across the graph, the central spire would be quickly discovered. The analogy isn't entirely facetious; PSOs were inspired by animal behavior.

PSOs can be used to hunt for solutions in domains that are combinatorially much more complex than the example. The example application implements two versions of a swarm: a synchronous and an asynchronous version invoked as shown in Listing Seven(a) and (b) respectively:

Listing Seven(a): The synchronous version of the principal PSO function.

On the surface, the differences between the routines are minor. DoAsyncSwarm creates a thread pool and an AsyncSwarm instance; then, after setting up and starting the swarm, it runs the pool, which returns when all particles have run to completion.

The synchronous form is useful for coming to grips with the working of the swarm, if it's not familiar, without getting tangled in the asynchronous implementation. It's also a useful reality check to determine whether our performance has really improved.

The naive approach to this problem (that is, the one I tried first) is to simply return particles to the input queue between each step. As with the trivial example above, the overhead of doing this meant that the asynchronous version was consistently slower.

My next approach was to run each particle to completion but this overlooked a subtlety of the way PSOs work. Each particle can have global influence by updating the best fitness and position values for the entire swarm. Running a couple of particles to completion before any others defeated the purpose of the swarm in the first place. In the end, I settled on creating an asynchronous helper class that could be assigned a number of particles from the parent swarm. I went with a simple round robin particle distribution across a collection of helpers, each of which is then queued. A randomized distribution may be a better approach in a production system, but I haven't tested that idea as yet.

I mention the evolution of the example to make the point that even with plumbing taken care of, we don't escape the need to think carefully about how we structure our application code.

A test on an Intel Core Duo 1.58 GHz machine, with a run of 200 particles each stepping through 1000 iterations, sees improvements ranging from 25% to 75% using the asynchronous version. Occasionally, the asynchronous version will take the same amount of time or even longer. Bear in mind that the machine is also servicing every other thread on the system, and from time to time the thread pool in the app will simply miss out.

Running in the IDE, you will regularly see DoAsyncSwarm coming in behind DoSyncSwarm. This is the Heisenberg effect at work, caused by the overhead of the IDE keeping track of the internal state of your program, including its threads. The IDE also has its own collection of threads to deal with, so it's not too surprising that the timings differ. When you make a change, test your code in release mode to get a clear idea of the difference it will make.

Conclusion

I have used this architecture successfully in several systems, including a mine scheduling system, a video camera sharing component in a facial image capture and recognition system, a package management system, and now a particle swarm optimization. While lunch may no longer be free, it can still be had at quite a reasonable price.

