Life as a Physicist

I find the topic of multi-threading fascinating. Moore’s law means that we now are heading to a multi-core world rather than just faster processors. But we’ve written all of our code as single threaded. So what do we do?

Before CHEP I was convinced that we needed an aggressive program to learn multithreaded programming techniques and to figure out how to re-implement many of our physics algorithms in that style. Now I’m not so sure – I don’t think we need to be nearly as aggressive.

Up to now we’ve solved things by just running multiple jobs – about one per core. That has worked out very well up to now, and scaling is very close to linear. Great! We’re done! Lets go home!

There are a number of efforts gong on right now to convert algorithms to be multi-threaded –rather than just running jobs in parallel. For example, re-implementing a track finding algorithm to run several threads of execution. This is hard work and takes a long time and “costs” a lot in terms of people’s time. Does it go faster? In the end, no. Or at least, not much faster than the parallel job! Certainly not enough to justify the effort, IMHO.

This was one take away from the conference this time that I’d not really appreciated previously. This is actually a huge relief: trying to make a reconstruction completely multi-threaded so that it efficiently uses all the cores in the machine is almost impossible.

But, wait. Hold your horses! Sadly, it doesn’t sound like it is quite that simple, at least in the long run. The problem is first the bandwidth between the CPU and the memory and second the cost of the memory. The second one is easy to talk about: each running instance of reconstruction needs something like 2 GB of memory. If you have 32 cores in one box, then that box needs 64 GB of main memory – or more including room for the OS.

The CPU I/O bandwidth is a bit tricky. The CPU has to access the event data to process it. Internally it does this by first asking its cache for the data and if the data hasn’t been cached, then it goes out to main memory to get it. The cache lookup is a very fast operation – perhaps one clock cycle or so. Accessing main memory is very slow, however, often taking many 10’s or more of cycles. In short, the CPU stalls while waiting. And if there isn’t other work to do, then the CPU really does sit idle, wasting time.

Normally, to get around this, you just make sure that the CPU is trying to do a number of different things at once. When the CPU can’t make progress on one instruction, it can do its best to make progress on another. But here is the problem: if it is trying to do too many different things, then it will be grabbing a lot of data from main memory. And the cache is of only finite size – so eventually it will fill up, and every memory request will displace something already in the cache. In short, the cache becomes useless and the CPU will grind to a halt.

The way around this is to try to make as many cores as possible work on the same data. So, for example, if you can make your tracking multithreaded, then the multiple threads will be working on the same set of tracking hits. Thus you have data for one event in memory being worked on by, say, 4 threads. In the other case, you have 4 separate jobs, all doing tracking on 4 different sets of tracking hits – which puts a much heavier load on the cache.

In retrospect the model in my head was all one or the other. You either ran a job for every core and did it single threaded, or you made one job use all the resources on your machine. Obviously, what we will move towards is a hybrid model. We will multi-thread those algorithms we can easily, and otherwise run a large number of jobs at once.

The key will be testing – to make sure something like this actually works faster. And you can imagine altering the scheduler in the OS to help you even (yikes!). Up to now we’ve not hit the memory-bandwidth limit. I think I saw a talk several years ago that said for a CMS reconstruction executable that occurred somewhere around 16 or so cores per CPU. So we still have a ways to go.

So, relaxed here in HEP. How about the real world? Their I see alarm bells going off – everyone is pushing multi-threading hard. Are we really different? And I think the answer is yes: there is one fundamental difference between them and us. We have a simple way to take advantage of multiple cores: run multiple jobs. In the real world many problems can’t do that – so the are not getting the benefit of the increasing number of cores unless they specifically do something about it. Now.

To, to conclude, some work moving forward on multithreaded re-implementation of algorithms is a good idea. As far as solving the above problem it is less useful to make the jet finding and track finding run at the same time, and more important to make the jet finding algorithm itself and the track finding algorithm itself multithreaded.

I’m at CHEP 2009 – a conference which combines physics with computing. In the old days it was one of my favorites, but over the past 10 years it has been all about the Grid. Tomorrow is the conference summary – and I sometimes play a game where I try to guess what the person who has to summarize the whole conference is going to say – the trends, if you know what I mean.

Amazon, all the time (cloud computing). Everyone is trying out its EC2 service. Some physics analysis has even been done using it. People have tried putting large storage managers up on it. Cost comparisons seem to indicate it is almost the same cost as using the GRID – and so might make a lot of sense when used as overflow. As far as cost goes, it is not at all favorable for storing large amounts of data (actually, transferring it in and out of the service).

Virtualization. I’ve been saying this for years myself, and it looks like everyone else is saying it now too. This is great. It is driven partly by the cloud – which uses virtualization – and that was something I definitely did not foresee. Cloud computing and virtualization go hand-in-hand. CERNVM is my favorite project along these lines, but lots of people are playing with this. I’ve even seen calls to make the GRID more like the loud.

Multi-core. This is more wishful thinking on my part, but it should be more of a trend than it is. The basic problem is that as we head towards 100 cores on a single chip there is just no way to get 100 events’ worth of data onto the chip – the bandwidth just isn’t there. Thus we will have to change how we processes the data, spending multiple CPU’s on the same data – something no one in HEP does up to now. One talk mentioned that problems will occur at about 24 CPUs on a single core.

CMS has put together a bunch of virtual control rooms. Now everyone in CMS wants one (80% in a few years). These are supposed to be used for both outreach and also remote monitoring. This seems successful enough that I bet ATLAS will soon have its own program. I’m not convinced how useful it is. 🙂

It is all about the data! Everyone now says running jobs on the GRID isn’t hard, it is feeding them the data. Cynically, I might say that was only because we now know how to run those jobs – several years ago that was the problem. This is a tough nut to crack. To my mind, for the first time, I see all the bits in place to solve this problem; but nothing really works reliably yet.

And now a non-trend. I keep expecting HEP to gather around a single database. That hasn’t happened now, and so I don’t think it ever will! That is both good and bad. We have a mix of open source and also vendor supplied solutions in the mix.

Ok – there are probably other small ones, but these are the big ones I think Dario will mention in his final talk.