Life as a Physicist

The recent LHCC open meeting is a great place to see the current state of the Large Hadron Collider’s physics program. While watching the talks I had one of those moments. You know – where you suddenly realize that something you’d seen here and there isn’t just something you’d seen here and there, but that it is a trend. It was the LHCb talk that drove it home for me.

The trend: experiments are moving their offline-quality reconstruction code into the trigger. There are many reasons this is desirable, which I’ll get to in a second, but everyone is starting to do it because it is now possible. Moore’s law is at the root of this, along with the fact that we take software more seriously than we used to.

First, some context. Software in the trigger lives in a rather harsh environment. Take the LHC. Every 25 ns a new collision occurs. The trigger must decide if that collision is interesting enough to keep, or not. Interesting, of course, means cool physics like a collision that might contain a Higgs or perhaps some new exotic particle. We can only afford to save about 1000 events per second. Afford, by the way, is the right word here: each collision we wish to save must be written to disk and tape, and must be processed multiple times, spending CPU cycles. It turns out the cost of CPU cycles is the driver here.
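To put those numbers in perspective, here is a quick back-of-the-envelope calculation using the round figures quoted above (25 ns spacing, ~1000 events per second saved – illustrative numbers, not official specs):

```python
# Back-of-the-envelope trigger arithmetic, using the round numbers from the text.
bunch_spacing_ns = 25                       # one collision every 25 ns
crossing_rate_hz = 1e9 / bunch_spacing_ns   # 40 million collisions per second
save_rate_hz = 1000                         # roughly what we can afford to keep

rejection_factor = crossing_rate_hz / save_rate_hz
print(f"Collision rate: {crossing_rate_hz / 1e6:.0f} MHz")
print(f"Rejection needed: {rejection_factor:,.0f} to 1")
```

In other words, the trigger has to throw away all but one in every 40,000 collisions – which is why every CPU cycle spent deciding matters.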

Even with modern processors 25 ns isn’t a lot of time. As a result we tend to divide our trigger into levels. Traditionally the first level is hardware – fast and simple – and can make a decision in the first 25 ns. A second level is often a combination of specialized hardware and standard PCs. It can take a little longer to make its decision. And the third level is usually a farm of commodity PCs (think GRID or cloud computing). Each level gets a longer amount of time to make more careful calculations for its decision. Already Moore’s law has basically eliminated Level 2. At the Tevatron, DZERO had a hardware/PC Level 2; ATLAS had a PC-only Level 2 in the 2011–2012 run, and now even that is gone in the run that just started.
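The levels behave like a cascade of filters, each applying tighter cuts with more information than the last. Here is a toy sketch of that structure – the quantities, cuts, and event contents are invented for illustration, not any experiment’s real selection:

```python
# Toy multi-level trigger: each level is a filter that gets more time (and
# more detail) than the one before it. All names and cuts are made up.

def level1(event):
    # Hardware-style decision: one coarse quantity, computed essentially instantly.
    return event["coarse_energy"] > 20.0

def hlt(event):
    # High Level Trigger decision: more detailed reconstruction, more time.
    return event["coarse_energy"] > 20.0 and len(event["tracks"]) >= 2

def trigger(event):
    # An event must survive every level to be written out.
    return level1(event) and hlt(event)

events = [
    {"coarse_energy": 35.0, "tracks": ["t1", "t2", "t3"]},  # passes both levels
    {"coarse_energy": 5.0,  "tracks": []},                  # rejected at Level 1
    {"coarse_energy": 30.0, "tracks": ["t1"]},              # rejected at the HLT
]
kept = [e for e in events if trigger(e)]
print(f"kept {len(kept)} of {len(events)} events")
```

The point of the cascade is that the cheap early cut protects the expensive later one: the HLT only ever sees events that Level 1 already liked.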

Traditionally the software that ran in the 3rd level trigger (often called a High Level Trigger, or HLT for short) was carefully optimized, custom-designed algorithms. Often only a select part of the collaboration wrote these, and there were lots of coding rules involved to make sure extra CPU cycles (time) weren’t wasted. CPU is of utmost importance here, and every additional physics feature must be balanced against its CPU cost. The trigger code will find charged particle tracks, but perhaps only ones that can be found quickly (e.g. obvious ones). The ones that take a little more work get skipped in the trigger because they would take too much time!

Offline, on the other hand, was a different story. Offline refers to reconstruction code – this is code that runs after the data is recorded to tape. It can take its time – it can carefully reconstruct the data, looking for charged particle tracks anywhere in the detector, applying the latest calibrations, etc. This code is written with physics performance in mind, and traditionally, CPU and memory performance have been secondary (if that). Generally the best algorithms run here – if a charged particle track can be found by an algorithm, this is where that algorithm will reside. Who cares if it takes 5 seconds?

Traditionally, these two code bases have been exactly that: two code bases. But this does cause some physics problems. For example, you can have a situation where your offline code will find an object that your trigger code does not, or vice versa. And thus when it comes time to understand how much physics you’ve actually written to tape – a crucial step in measuring a particle like the Higgs, or searching for something new – the additional complication can be… painful (I speak from experience!).

Over time we’ve gotten much better at writing software. We now track performance in a way we never have before: physics, CPU, and memory are all measured on releases built every night. With modern tools we’ve discovered that… holy cow!… applying well known software practices means we can have our physics performance and CPU and memory performance too! And in the few places that just isn’t possible, there are usually easy knobs we can turn to reduce the CPU requirements. And even if we have to make a small CPU sacrifice, Moore’s law helps out and takes up the slack.
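The nightly tracking described above boils down to measuring each algorithm’s CPU and memory cost, release after release. As a miniature illustration (not the actual tooling any experiment uses), here is what such a per-algorithm measurement can look like using only the Python standard library:

```python
import time
import tracemalloc

def profile(algorithm, event):
    """Measure wall-clock time and peak memory of one algorithm call."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = algorithm(event)
    elapsed = time.perf_counter() - t0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes

# A stand-in "algorithm": sum up some fake energy deposits.
def sum_energies(event):
    return sum(event)

result, seconds, peak = profile(sum_energies, [1.0, 2.5, 3.5])
print(f"result={result}, took {seconds * 1e6:.0f} us, peak {peak} bytes")
```

Run something like this on every algorithm, every night, and regressions in CPU or memory show up long before they hit the trigger farm.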

In preparation for Run 2 at the LHC, ATLAS went through a major software re-design. One big effort was to move as many of the offline algorithms into the trigger as possible. This was a big job – the internal data structures had to be unified, and the offline algorithms’ CPU performance was examined in a way it had never been before. In the end ATLAS will have less software to maintain, and it will have (I hope) more understandable reconstruction performance when it comes to doing physics.

LHCb is doing the same thing. I’ve seen discussions about new experiments running their offline reconstruction in real time and writing out only its output. Air shower arrays searching for large cosmic-ray showers often do quite a bit of final processing in real time. All of this made me think these were not isolated occurrences. I don’t think anyone has labeled this a trend yet, but I’m ready to.

By the way, this does not mean offline code and algorithms will disappear. There will always be versions of the algorithms that use huge amounts of CPU power to get the last 10% of performance. And the offline code is not run until several days after the data is taken, in order to make sure the latest and greatest calibration data has been distributed. This calibration data is much more fine-grained (and recent) than what is available to the trigger. Though as Moore’s law and our ability to better engineer the software improve, perhaps even this will disappear over time.

Ok, here is a dumb lesson I’ve learned the hard way (and thanks to the many others who helped resolve it). Let’s say you design a distributed system – like the online trigger and data collection system at D0. This is a medium-sized system – perhaps 500 boxes and several thousand CPUs at this point. It is key to note that this is a heterogeneous system – many of those boxes are doing different things and have to be custom configured.

Now, since it is a heterogeneous but distributed system, and all the boxes have to communicate with each other, they have to have a way of finding each other. You definitely can’t use raw DNS and the machine name. Computers change. Sometimes you want to hot-swap to an experimental system. Your DNS is managed by a central facility, so the turn-around can be a day – and when the accelerator is delivering beam you need less than an hour.

So you have to decide on some sort of name service. Some service that can take a name and reply with a machine. If it is done right, this will disappear into the infrastructure and you’ll not even be aware it is there after a few years.
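The core idea is tiny: clients look up a logical name and get back whatever machine currently answers to it, so hardware can be swapped without touching the clients. A minimal sketch (the names and hosts are made up – this is the concept, not D0’s actual implementation):

```python
# Minimal name-service sketch: a registry from logical names to current hosts.
# Component and host names below are invented for illustration.

class NameService:
    def __init__(self):
        self._registry = {}

    def register(self, name, host):
        # A component (re)announces where it currently lives.
        self._registry[name] = host

    def lookup(self, name):
        return self._registry[name]

ns = NameService()
ns.register("l3-event-builder", "node042.example.org")
print(ns.lookup("l3-event-builder"))

# Hot-swap: re-register under the same name; clients never notice.
ns.register("l3-event-builder", "node077.example.org")
print(ns.lookup("l3-event-builder"))
```

Of course, the sketch hides the one thing this story turns on: every client still has to know where the name service itself lives.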

Oops!

Let’s see, we’ve been running since 2001. In about 2003 we started using the “sanctioned” name server for the Level 3 Trigger and DAQ part of the system. Of course, you have to make sure you know where that name server is for all this to work. We had an alias in DNS for that purpose.

And it turns out that our stuff is one of the few things left using that name server. Everyone else loads a python file on the command line. I’d originally designed our system so that you could change the location of a component on the fly without having to reboot anything – so the python approach was never considered. And the online system recently cleared out a bunch of machines.

The name server was moved. And that alias? Well, everyone forgot about it, and so it was never re-established. And then slowly, over time, parts of Level 3 started to fail. Thank goodness it was the monitoring code that failed first. But there were several hours of panic. All of us had forgotten how the system works – it has been that long.

Maintaining the same system for years is so weird. For almost all the code I write, I think “Ok — get it running, debugged, check it in, and move on.” Keeping some of it running for years, however, brings other considerations. I bet there are whole books on this. Too bad we HEP people never take the time to read that sort of thing before we do our software development…