Wednesday, June 30, 2010

Professor David Patterson wrote an interesting article in IEEE Spectrum, 'The trouble with multicore'. The tag line is "Chipmakers are busy designing microprocessors that most programmers can't handle". The thrust of the article is that multicore processors are a hardware development that software is poorly equipped to utilise.

There are two main arguments made in the article. The first is that programming languages are very poor at describing parallelism. There has been a long list of languages that were either designed to tackle parallelism or have had parallelism imposed upon them. To be fair parallel programming is littered with the ill-conceived corpses of languages that were meant to solve the problem. So his view is correct, but perhaps this is not relevant.

The second point he makes is that not all tasks break down to independent work. His example is that of ten reporters writing the same story, and not being able to write the story ten times faster because each section of text has to build on the previous sections. Again, this is true. There are some tasks that have implicit (or explicit) dependencies, but perhaps this is not relevant.

The example in his paper that best illustrates how multicore is the solution, and not the problem, is that of cloud computing. As he says "Expert programmers can take advantage of the task-level parallelism inherent in cloud computing.". Are you an expert programmer when you type a search term into Google? A lot of computation goes into finding the results for you, but they appear nearly instantly. It could be argued that Google put a considerable amount of effort into designing a system that produced results so quickly. Of course they did. However, they did it once, and its used for millions of search queries every day.

Observation 1: Many problems just need parallelising once. Or conversely, not every developer needs to worry about the parallelism – in the same way as not every developer on a project needs to worry about the GUI.

But this only addresses part of the argument. It is all very well using an anecdotal example to demonstrate that it is possible to utilise multiple cores, but that does not disprove Professor Patterson's argument.

Lets return to the example of the reporters. The way the reporters are working is perhaps not the best use of their resources. Much of the work of reporting is fact checking, talking to people, and gathering data. The writing part of this is only the final step in a long pipeline. Perhaps a better way of utilising the ten reporters would be during the data gathering stages, multiple people could be interviewed simultaneously, multiple sources consulted at the same time. On the other hand, a newspaper would rarely allocate more than a single reporter to a single story. More progress would be made if each reporter was working on a different story. So perhaps the critical observation is that dependencies within a task are an indication that parallelism needs to be discovered outside that task.

Observation 2: It is rare that there are no other ways of productively utilising compute resources. Meaning that given a number of cores, it is almost always possible to find work to keep them busy. For example, rendering a movie could have cores working on separate frames, or separate segments of the same frame. Sequencing genes could have multiple genes being examined simultaneously. Simulation models of different scenarios could be completed in parallel.

But, it can be argued that there are times when you need to do a single task, and you care how long that task takes to complete. So, lets consider exactly what problems we encounter during our day where we would benefit from a faster processor.

"I waited for my PC to boot.". Well booting a PC is pretty much a serial process, however, the boot time is largely dominated by disk access time rather than processor speed.

"I waited for my e-mail to download". Any downloading activity, be it e-mail or webpages is going to be dominated by network latency or bandwidth issues. There is undoubtedly some processor activity in the mix, but it is unlikely that a fast processor would make a noticeable difference to performance.

"I was watching a video when my virus scanner kicked in and caused the movie to stutter." Assuming it wasn't a disc activity, this is a great example of where having multiple cores will help rather than hinder. Two cores would allow the video to continue playing while the virus scanner did its work. This was, of course, the frequently given example of why multicore processors were a good thing – as if virus scanner were a desirable use of processor time!

"I was compiling an application and it took all afternoon." Some stages of compilation, like linking or crossfile optimisation, are inherently serial. But, unless the entire source code was placed into a single file, most projects have multiple source files, so these could be compiled in parallel. Again, the performance can be dominated by disk or network performance, so it is not entirely a processor performance issue.

These are a few situations where you might possibly feel frustration at the length of time a task takes. You may have plenty more. The point is that it is rare that there is no parallelism available, and no opportunity to make parallel progress on some other task.

Observation 3: There are very few day to day tasks that are actually limited by processor performance. Most tasks have substantial bottlenecks in other parts of the system (disk, network, speed of devices). If anything having multiple cores enables a system to remain useful while other compute tasks are completed.

All this discussion has not truly refuted Professor Patterson's observation that there exist problems which are inherently serial, or fiendishly difficult to parallelise. But that's ok. Most commonly encountered computational activities are either easy to parallelise, or there are ways of extracting parallelism at other levels.

But what of software? There is great allure to using threads on a multicore processor to deliver many times the performance of a single core processor. And this is the crux of the matter. Advances in computer languages haven't 'solved' this problem for us. It can still be hard, for some problems, to write parallel programs that are both functionally correct and scale well.

However, we don't all need to solve the hard problems. There are plenty of opportunities for exploiting parallelism in a large number of common problems, and in other situations there are opportunities for task level parallelism. This combination should cover 90+% of the problem space.

Perhaps there are 10% of problems that don't map well to multicore processors, but why focus on those when the other 90% do?

Wednesday, June 9, 2010

I was pleasantly surprised to find support for runtime analysis embedded in the Solaris Studio IDE. This analysis uses the Performance Analyzer to gather data as the code is running and then presents this data both as timeline views over the runtime of the application, and also source code annotations. Here's the view as the data is gathered.

The tool gathers profile data which is shown as an aggregation of time spent in each routine, and also annotated against each line of source.

The other thing the tool is able to track is memory leaks, again reporting the amount leaked, and attributing the leaks to the lines of source where the data was allocated.

Tuesday, June 8, 2010

In some instances you can get better performance, or reproducibility, by restricting the processors that a thread runs on. Linux has pthread_set_affinity_np (the 'np' tag means non-portable). On Solaris you have a number of nearly equivalent options:

Processor sets where you create a set of processors and allow only particular processes to run on this set.

Processor_bind, where you bind a particular process or thread to a particular virtual CPU. The thread cannot migrate off this CPU, but other threads can run on it. This means that you need to coordinate between different processes to ensure that they are not all allocated to the same CPU.

Locality groups. On a NUMA system, a locality group is a set of CPUs which share the same memory local memory. Processes that remain executing on processors within their locality group will continue to get low memory access times, if they get scheduled to processors outside the group their memory access times may increase.

The Performance Analyzer has also had a number of new features and improvements. The most obvious one of these is the new call tree tab. This allows you to drill down into the call tree for an application and see exactly how the time is divided between the various call stacks.

Monday, June 7, 2010

The latest Solaris Studio Express release contains the tool discover, which tests for memory access errors. These are errors like reading past the end of an array or freeing a pointer twice. The best part of the tool is that it does not require a special build of the application. The sequence is:

$ discover a.out
$ a.out

The discover command adds instrumentation to the executable, and you then run the resulting binary in the same way that you would normally run your program. The output from discover is an html file containing details of any memory access errors that the tool discovered.