Life as a Physicist

CHEP Trends: Multi-Threading May 24, 2012

I find the topic of multi-threading fascinating. Moore’s law now gives us more cores rather than faster processors, but we’ve written all of our code as single threaded. So what do we do?

Before CHEP I was convinced that we needed an aggressive program to learn multithreaded programming techniques and to figure out how to re-implement many of our physics algorithms in that style. Now I’m not so sure – I don’t think we need to be nearly as aggressive.

Up to now we’ve solved the problem by just running multiple jobs – about one per core. That has worked out very well, and the scaling is very close to linear. Great! We’re done! Let’s go home!
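The “one job per core” model is easy to sketch. Here is a minimal illustration using Python’s multiprocessing; `reconstruct` is a hypothetical stand-in for a full single-threaded reconstruction, not real experiment code:

```python
# One independent, single-threaded "job" per core; no shared state at all.
from multiprocessing import Pool, cpu_count

def reconstruct(event):
    # Hypothetical stand-in for reconstructing one event.
    return sum(hit * hit for hit in event)

if __name__ == "__main__":
    events = [[float(i) for i in range(100)] for _ in range(1000)]
    # One worker process per core; each processes its own events.
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(reconstruct, events)
    print(len(results))
```

Because the jobs never talk to each other, adding cores just means adding workers, which is why the scaling stays close to linear.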

There are a number of efforts going on right now to convert algorithms to be multi-threaded – rather than just running jobs in parallel. For example, re-implementing a track-finding algorithm to run several threads of execution. This is hard work, takes a long time, and “costs” a lot in people’s time. Does it go faster? In the end, no. Or at least, not much faster than the parallel jobs – certainly not enough to justify the effort, IMHO.

This was one take-away from the conference that I’d not really appreciated previously, and it is actually a huge relief: making a reconstruction completely multi-threaded, so that it efficiently uses all the cores in the machine, is almost impossible.

But, wait. Hold your horses! Sadly, it doesn’t sound like it is quite that simple, at least in the long run. The problem is, first, the bandwidth between the CPU and the memory and, second, the cost of the memory. The second is easy to talk about: each running instance of the reconstruction needs something like 2 GB of memory. If you have 32 cores in one box, then that box needs 64 GB of main memory – or more, including room for the OS.

The CPU I/O bandwidth is a bit trickier. The CPU has to access the event data to process it. Internally it does this by first asking its cache for the data; if the data isn’t cached, it goes out to main memory to get it. The cache lookup is a very fast operation – perhaps a few clock cycles. Accessing main memory is very slow, however, often taking on the order of a hundred cycles or more. In short, the CPU stalls while waiting. And if there isn’t other work to do, the CPU really does sit idle, wasting time.

Normally, to get around this, you make sure the CPU is trying to do a number of different things at once: when it can’t make progress on one instruction, it can do its best to make progress on another. But here is the problem: if it is trying to do too many different things, it will be grabbing a lot of data from main memory. The cache is only finite in size, so eventually it fills up, and every memory request displaces something already in the cache. In short, the cache becomes useless and the CPU grinds to a halt.

The way around this is to make as many cores as possible work on the same data. For example, if you can make your tracking multi-threaded, the multiple threads will all be working on the same set of tracking hits: the data for one event in memory being worked on by, say, 4 threads. In the other case, you have 4 separate jobs, each doing tracking on a different set of tracking hits – which puts a much heavier load on the cache.
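As a toy sketch of the shared-working-set idea: four threads below work on slices of one and the same hit collection, so there is only one copy of the data in play. The `tracked_sum` name and the workload are illustrative, not real tracking code:

```python
import threading

def partial_sum(hits, start, stop, out, idx):
    # Each thread reads a slice of the SAME shared hit list --
    # one working set for all threads, not one per thread.
    out[idx] = sum(hits[start:stop])

def tracked_sum(hits, nthreads=4):
    out = [0.0] * nthreads
    chunk = (len(hits) + nthreads - 1) // nthreads
    threads = [threading.Thread(target=partial_sum,
                                args=(hits, i * chunk, (i + 1) * chunk, out, i))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out)
```

(A caveat: CPython’s GIL keeps these particular threads from running pure-Python work truly in parallel; the structural point – one shared working set instead of four separate ones – is what carries over to C++ threads.)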

In retrospect, the model in my head was all-or-nothing. You either ran a single-threaded job for every core, or you made one job use all the resources on the machine. Obviously, what we will move towards is a hybrid model: we will multi-thread those algorithms we easily can, and otherwise run a large number of jobs at once.
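The hybrid layout is really just arithmetic: pick a modest thread count per job and fill the machine with jobs. A sketch, with illustrative numbers:

```python
def hybrid_layout(total_cores, threads_per_job):
    # Run enough jobs that jobs * threads_per_job roughly fills the machine.
    jobs = max(1, total_cores // threads_per_job)
    return jobs, threads_per_job

# A 32-core box with 4 threads per job runs 8 jobs at once.
print(hybrid_layout(32, 4))
```

Note the memory win: at 2 GB per job instance, 8 four-threaded jobs need ~16 GB where 32 single-threaded jobs need ~64 GB.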

The key will be testing – making sure something like this actually runs faster. You can even imagine altering the scheduler in the OS to help you (yikes!). Up to now we’ve not hit the memory-bandwidth limit. I think I saw a talk several years ago claiming that, for a CMS reconstruction executable, the limit occurred somewhere around 16 cores per CPU. So we still have a ways to go.

So, we can relax here in HEP. How about the real world? There I see alarm bells going off – everyone is pushing multi-threading hard. Are we really different? I think the answer is yes: there is one fundamental difference between them and us. We have a simple way to take advantage of multiple cores: run multiple jobs. In the real world many problems can’t do that – so they get no benefit from the increasing number of cores unless they specifically do something about it. Now.

So, to conclude, some work moving forward on multi-threaded re-implementation of algorithms is a good idea. As far as solving the above problem goes, it is less useful to make the jet finding and the track finding run at the same time, and more important to make the jet-finding algorithm itself and the track-finding algorithm itself multi-threaded.


Do you know if we have any serious OpenMP-type development going on? That strikes me as a fairly low-impact way to try out some multithreading.

In my heart of hearts I agree with the previous comment; people in HEP have a distinct tendency to code in a very threading-unfriendly way, and a lot of that’s due to blithe disinterest in side effects.

An artificial intelligence like IBM’s Watson, or Google’s search engine, cannot be single-threaded. These problems require as many cores and computers as possible. Of course, distributed computing is hard: besides the CPU cache problem you have to deal with schedulers, memory capacity, network bandwidth, network topology, etc.

For classic desktop/server-style computing you usually need several threads to ensure responsiveness. E.g. if your web server is serving a movie to one user, other users should not have to wait for that download to finish.

And of course, there are scientific computations. Actually, there are at least two kinds of computation: memory-bandwidth-limited and compute-limited tasks. A simple matrix multiplication is usually bandwidth limited. Classic architectures are just not the best for these problems, so Intel/AMD invented the MMX/SSE instruction sets – a kind of vector instruction set, though it could be better. Cray introduced (really expensive) vector computers decades ago. And recently, nVidia and ATI (AMD) offer GPUs for these computations.
All of these have their strengths and weaknesses. I use GPUs for medical image reconstruction, and the speedup (compared to a classic Intel server CPU) is 10-100x. But GPUs are weak at branching instructions (like the ‘if’ statement). Intel is strong at out-of-order execution, so they announced the Larrabee architecture, many-core architectures, etc. (But it’s still premature.)
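To make the bandwidth point concrete, here is a naive matrix multiplication in plain Python; the point of the comment about loop order is that even without vector hardware, how you walk memory matters:

```python
def matmul_loops(a, b):
    # Naive triple loop; the i-k-j order reads rows of b contiguously,
    # which is gentler on the cache than the textbook i-j-k order.
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

# ~2*n^3 arithmetic operations over ~3*n^2 stored values: as n grows the
# data no longer fits in cache and memory traffic, not arithmetic,
# dominates -- the bandwidth-limited regime described above.
print(matmul_loops([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

Vectorized libraries (BLAS, or a GPU kernel) win precisely by blocking this loop so each cache-sized tile is reused many times before being evicted.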

If I have to use functional languages or graph-theory applications, etc., then I will use the Intel family. But for vector operations GPUs are just faster, and they have much bigger memory bandwidth.

If I have to parallelize hundreds of thousands of tasks with no connection between them and no time constraints, then I will use an architecture similar to the SETI@home project.

If I have a very challenging problem, like identifying never-before-seen phenomena, then I will use neural networks (or, actually, human beings – check out the Galaxy Zoo project).

If I need parallelism for fault tolerance, then I will use some really reliable hardware with ECC memory (error-correcting codes, etc.) and a mature functional programming language, like Erlang. (Check how long an uptime can be achieved with an Erlang-based router/switch. It also allows code and hardware changes without any interruption.)

Back to the problem. In physics most problems are matrix-vector operations. Using GPUs for these is one of the most cost-effective and fastest ways. However, starting a job (initial memory transfer, etc.) and finishing it (downloading the results) have some cost, so you cannot use them for fast real-time applications (e.g. coincidence measurements). For special real-time applications use an FPGA. And for the rest of the problems use a classic CPU.

Microoptimization is the root of all evil. (Or “ROOT is the root of all evil.” It depends on you.)

If you cannot avoid the classic CPU architecture, then don’t parallelize everything – parallelize the most important parts. Use a profiler (like cachegrind), find the hotspots, and parallelize those. Then measure, again and again.
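The “profile first” advice, in Python terms (cachegrind plays the same role for compiled code). The two workload functions are hypothetical stand-ins:

```python
import cProfile
import io
import pstats

def hotspot():
    # Hypothetical expensive inner loop -- the part worth parallelizing.
    return sum(i * i for i in range(200000))

def cold_path():
    # Cheap bookkeeping -- not worth touching.
    return list(range(1000))

profiler = cProfile.Profile()
profiler.enable()
hotspot()
cold_path()
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
# The report names the functions where the time actually goes;
# parallelize those, then measure again.
```

Only after the profile singles out `hotspot` does it make sense to spend effort making it parallel.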

I recently saw a project where people tried to parallelize a problem whose input was read from disk. It’s really hard to find an optimal way to parallelize IO operations (because of disk seeking, etc.), so in the end the result was slower (and much more complex) than the original.

So in general: do not create threads for fun. Sometimes they are useful, sometimes they are not. Choose a goal, make a plan, choose the best tool (maybe a less complex algorithm matters more than a parallelized version of a complex one), try it, measure it, compare the measurements to your expectations, and drop your code if it does not meet your needs. (If it isn’t fast when you finish it, drop it, because it’s much more complex than the original. Maintaining complex code consumes a lot of time in the long term. Physicists have a habit of keeping old code and spending a lot of time maintaining it – in large projects that makes sense, but I think it happens more often than it should.)

E.g. in my case, I use CUDA for GPU computing, because I need a huge amount of calculation. But I use a Python wrapper (PyCUDA) and Python tools (NumPy, SciPy, etc.), because I do not want to parallelize the config-file reading, the disk IO, and several other tasks.

I am not sure what the best solution is for you. But there are several tools, and there is definitely no ultimate tool.

A lot of the data-processing problems in experimental high energy physics are of the embarrassingly parallel type: you don’t need communication between different threads. It is trivial to parallelize this kind of computation – just submit parallel (single-threaded) jobs to a computing farm, collect the output from each job, and combine the outputs in a meaningful way at the end.
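That pattern, sketched in miniature (the histogramming is a hypothetical stand-in for real per-job output; on a real farm the “pool” would be batch jobs on separate machines):

```python
from multiprocessing import Pool

def run_job(event_values):
    # Each independent "job" fills its own 10-bin histogram;
    # no communication with any other job.
    hist = [0] * 10
    for v in event_values:
        hist[min(int(v), 9)] += 1
    return hist

def combine(histograms):
    # The "combine in a meaningful way" step: sum the histograms bin by bin.
    return [sum(bins) for bins in zip(*histograms)]

if __name__ == "__main__":
    chunks = [[0.5, 3.2, 9.9], [1.1, 3.8], [9.0]]
    with Pool(3) as pool:
        total = combine(pool.map(run_job, chunks))
    print(total)
```

Because histogram addition is associative, the combine step is cheap and order-independent – which is exactly what makes this workload embarrassingly parallel.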

In my opinion, this old-fashioned way is well suited to HEP. A good parallel implementation requires a lot of expertise and a deep understanding of the pitfalls and bottlenecks that are unique to the parallel world. Without those, you may frequently find that your prematurely parallelized code actually runs slower than its single-threaded counterpart.

david – awesome post, thanks! A good lay of the land. Feng and others have it right. We’ve never paid attention to our use of global variables, and thus our code is littered with them. This code will be with us for 20+ years, and is millions of lines, so we will only do major rewrites if we are forced to. So, embarrassingly parallel it is… Until, as I’ve said before: I think the only thing that will cause us to change is if we saturate the memory architecture of a machine we are running on, so that we have to use multiple cores at once on the same data (say, track finding) in order to keep the CPU fully occupied. Even that then starts to suffer from the GPU issues mentioned above (setup/tear-down time).