The biggest thing I got back was that as the corrections become well known, they get automated – so there is no need for the two-step process I outlined before: running on MC and data, deriving a correction, and then running a third time to do the actual work with the correction taken into account. Instead, the ROOT files are centrally produced and the group applies the correction there, so the individual doesn’t have to worry. Sweet! That definitely improves life! However, the problem remains when you are trying to derive a new correction.

I made three attempts before finally finding an analysis framework that worked (well, four if you count the traditional approach of C++, python, bash, and duct tape!). As you can tell – what I wanted was something that would correctly glue several phases of the analysis together. The example from last time:

Correct the jet pT spectra in Monte Carlo (MC) to data

Run on the full dataset and get the jetPt spectra.

Do the same for MC

Divide the two to get the ratio/correction.

Run over the data and reweight my plot of jet variables by the above correction.

There are basically four steps in this: run on the data, run on the MC, divide the results, then run on the data again. Ding! This looks like workflow! My first two attempts were based around this idea.
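To make the structure concrete, here is a toy sketch of those four steps as small composable functions – plain C++ with no ROOT, where a “histogram” is just a std::vector<double> of bin counts and the datasets are invented stand-ins:

```cpp
#include <vector>

using Hist = std::vector<double>;  // toy histogram: one entry per bin

// Steps 1 and 2: "run" over a dataset and histogram a variable (toy version).
Hist fill_hist(const std::vector<double>& values, int nbins, double lo, double hi) {
    Hist h(nbins, 0.0);
    for (double v : values) {
        int bin = static_cast<int>((v - lo) / (hi - lo) * nbins);
        if (bin >= 0 && bin < nbins) h[bin] += 1.0;
    }
    return h;
}

// Step 3: divide two histograms bin-by-bin to get the correction.
Hist divide(const Hist& num, const Hist& den) {
    Hist r(num.size(), 0.0);
    for (std::size_t i = 0; i < num.size(); ++i)
        r[i] = den[i] != 0.0 ? num[i] / den[i] : 0.0;
    return r;
}

// Step 4: re-run, weighting each event by the correction for its bin.
Hist fill_weighted(const std::vector<double>& values, const Hist& w,
                   double lo, double hi) {
    Hist h(w.size(), 0.0);
    for (double v : values) {
        int bin = static_cast<int>((v - lo) / (hi - lo) * w.size());
        if (bin >= 0 && bin < static_cast<int>(w.size())) h[bin] += w[bin];
    }
    return h;
}
```

Each function maps cleanly onto one task, which is exactly why this smells like a workflow problem.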

Workflow has a long tradition in particle physics. Many of our computing tasks require multiple steps and careful accounting every step of the way. We have lots of workflow systems that allow you to assemble a task from smaller tasks and keep careful track of everything that you do along the way. Indeed, all of our data processing and MC generation has been controlled by home-rolled workflow systems at ATLAS and DZERO. I would assume at every other experiment as well – it is the only way.

This approach appealed to me: I can build all the steps out of small tasks. One task that runs on data and one that runs on MC. And then add the “plot the jet pT” sub-task to each of those two, take the outputs, and then have a small generic tasks that would calculate a rate, and then another task that would weight the events and finally make the plots. Easy peasy!

So, first I tried Trident, something that came out of Microsoft Research. An open-source system, it was designed for scientists with large datasets that required frequent processing (NOAA-related, I think). It had an attractive UI, arbitrary data could be passed between the tasks, and the code interface for writing the tasks was pretty simple.

I managed to get some small things working with it – but two big things caused it to fail. First, the way you pass around data was painful. I wanted to pass around a list of files to run on – and then from that I needed to pass around histograms. I wanted fine-grained tasks that would manipulate histograms (dividing the plots) while at the same time other tasks were manipulating whole files (making the plots). Ugh! It was a lot of work just to do something simple! The second thing that killed it was that this particular tool – at the time – didn’t have sub-jobs: you couldn’t build a workflow and then use it in other workflows. It was my fault that I missed that fact when I was choosing the tool.

So, I moved on to a second attempt. Since my biggest problem had been hooking everything up, I decided to write my own. Instead of a GUI interface, I had an XML interface. And I did what is known as “coding-by-convention”: I baked a number of defaults into the design so that it “just worked” as long as the individual components obeyed the conventions. Since this was my own private framework, there was no worry that this wouldn’t happen. The framework knew how to automatically combine similar histograms, for example, and if it was presented with multiple input datasets it knew how to combine those as well – something that would have required another step in the Trident solution.
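A minimal sketch of what I mean by the convention (toy types, not my actual framework): any two task outputs can be merged, and histograms sharing a name are summed bin-by-bin, so multiple input datasets combine for free:

```cpp
#include <map>
#include <string>
#include <vector>

using Hist = std::vector<double>;
using HistSet = std::map<std::string, Hist>;  // a task's output: histograms keyed by name

// The convention: histograms that share a name are summed bin-by-bin.
// The framework applies this blindly to any pair of task outputs, so no
// extra "combine the datasets" step ever needs to be authored.
HistSet merge(const HistSet& a, const HistSet& b) {
    HistSet out = a;
    for (const auto& kv : b) {
        auto it = out.find(kv.first);
        if (it == out.end()) {
            out[kv.first] = kv.second;          // new name: just copy it over
        } else {
            for (std::size_t i = 0; i < kv.second.size(); ++i)
                it->second[i] += kv.second[i];  // same name: sum the bins
        }
    }
    return out;
}
```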

This solution went much better – I was able to do more than just my demo. I moved beyond the reweighting example above and tried to do something more complex. And here is where, I think, I hit on the real reason that workflow doesn’t work for analysis (or at least for me): you have to switch between various environments too often. The framework was driven by XML. If I wanted a new task, I had to write C++ or C# (depending). Then there was the code that ran the framework itself, which I had to upgrade periodically.

Really, all I wanted to do was make a stupid plot on two datasets, divide it, and then make a third plot using the first as a weight. Why did I need different languages and files to do that – why couldn’t I write that in a few lines??

Those of you who are active in this biz, of course, know the answer: two different environments. One set of code deals with looping over, possibly, terabytes of data. That is the loop that makes the plot. Then you need some procedural code to do the histogram division. When that is done, you need another loop of code to do the final plots and reweighting. Take a step back. That is a lot of support code that I have to write! Loading up the MC and data files, running the loop over them, saving the resulting histogram. The number of lines I actually need to create the plot and put the data into the plot? Probably two or three. The number of lines I need to run that job start to finish and make that plot? Closer to 150 or so, spread over several files, some compiled and some interpreted. Too much ceremony: 150 lines of boilerplate for three or so lines of physics-interesting code.

So, I needed something better. More on that next week.

BTW, the best visual analysis workflow I’ve seen (but not used) is something called VISPA. Had I known about it when I started the above project I would have tried it first – it is cross-platform, has a batch manager integrated in, etc. (a fairly extensive list). In retrospect it looks like it could support most of what I need to do, though I say this having done only a quick scan of its documentation pages. I suspect I would have run into the same problem: having to move between different environments to code up something “simple”.

Last October (2010) my term running the ATLAS flavor-tagging group came to an end. It was time to get back to being a plot-making member of ATLAS. I don’t know how most people feel when they run a large group like this, but I start to feel separated from actually doing physics. You know a lot more about the physics, and your input affects a lot of people, but you are actually doing very little yourself.

But I had a problem. By the time I stepped down, in order to even show a plot in ATLAS you had to apply multiple corrections: the z distribution of the vertex was incorrect, the transverse momentum spectrum of the jets in the Monte Carlo didn’t match, etc. Each of these corrections had to first be derived and then applied before anyone would believe your plot.

To make your one really great plot, then, let’s look at what you have to do:

Run over the data to get the distributions of each thing you will be reweighting (jet pT, vertex z position, etc.).

Run over the Monte Carlo samples to get the same thing

Calculate the reweighting factors

Apply the reweighting factors

Make the plot you’d like to make.

If you are lucky, the various items you need to reweight are not correlated – so you can just run one job on the data and one job on the Monte Carlo in steps one and two. Otherwise you’ll have to run multiple times. These jobs are either batch jobs that run on the GRID, or local ROOT jobs you run on PROOF or something similar. The results of these jobs are typically small ROOT files.

In step three you have to author a small script that extracts the results from the jobs in steps 1 and 2 and creates the reweighting function. This is often no more difficult than dividing one histogram by another. One can do this at the start of the plotting job (the job you create for steps 4 and 5), or do it at the command line and save the result in another ROOT file that serves as one of the inputs to the next step.

Steps 4 and 5 can normally be combined into one job. Take the result of step 3, apply it as a weight to each event, and then plot whatever your variable of interest is with that weight. Save the result to another ROOT file and you are done!!
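As a hedged sketch (toy types, no ROOT), the heart of steps 3–5 is just this: divide the two spectra bin-by-bin, then make one pass over the events, looking up each event’s weight from the bin of its reweighting variable and filling the plot of interest with that weight:

```cpp
#include <vector>

// Toy event: the variable used for reweighting (e.g. jet pT) and the
// variable we actually want to plot. Histograms are plain vectors of bins.
struct Event { double reweightVar; double plotVar; };

// Which bin does a value fall into? Returns -1 for under/overflow.
int bin_of(double v, int nbins, double lo, double hi) {
    int b = static_cast<int>((v - lo) / (hi - lo) * nbins);
    return (b < 0 || b >= nbins) ? -1 : b;
}

// Step 3: the reweighting factors are just data/MC, bin by bin.
std::vector<double> ratio(const std::vector<double>& data,
                          const std::vector<double>& mc) {
    std::vector<double> r(data.size(), 0.0);
    for (std::size_t i = 0; i < data.size(); ++i)
        r[i] = mc[i] != 0.0 ? data[i] / mc[i] : 0.0;
    return r;
}

// Steps 4 + 5 in one pass: look up each event's weight from the reweighting
// histogram, then fill the plot of interest with that weight.
std::vector<double> plot_weighted(const std::vector<Event>& events,
                                  const std::vector<double>& weights,
                                  double wLo, double wHi,
                                  int nbins, double lo, double hi) {
    std::vector<double> h(nbins, 0.0);
    for (const Event& e : events) {
        int wb = bin_of(e.reweightVar, static_cast<int>(weights.size()), wLo, wHi);
        int pb = bin_of(e.plotVar, nbins, lo, hi);
        if (wb >= 0 && pb >= 0) h[pb] += weights[wb];
    }
    return h;
}
```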

Whew!

I don’t know about you, but this looked scary to me. I had several big issues with it. First, the LHC has been running gang-busters, which means constantly re-running all these steps. I’d better not be doing that by hand, especially as things get more complex, because I’m going to forget a step or accidentally reuse an old result. Next, I was going back to teaching a pretty difficult course, which meant I was going to be distracted. So whatever I did had to be able to survive me not looking at it for a week and then coming back to it… and me still being able to understand what I did! Mostly, the way I normally approach something like the above was going to lead to a mess of scripts and programs, etc., all floating around.

It took me three tries to come up with something that seems to work. It has some difficulties, and isn’t perfect in a number of respects, but it feels a lot better than what I’ve had to do in the past. Next post I’ll talk about my two failed attempts (it will be a week, but I promise it will be there!). After that I’ll discuss my 2011 Christmas project, which led to what I’m using this year.

I’m curious – what do others do to solve this? A mess of scripts and programs? Some sort of workflow? Makefiles?? What?? What I’ve outlined above doesn’t seem scalable!

Friday night. A truly spectacular day in Seattle. I had to take half of it off and was stuck outdoors hanging out with Julia. Paula is on a plane to Finland. I’ve got a beer by my side. A YouTube video of a fire in a fireplace. Hey. I’m up for anything.

ROOT is weird. It makes it very easy to do very simple things. For example, want to draw a previously made histogram? Just double-click and you’re done. Want to see what the data in one of your TTrees looks like? Just double-click on the leaf and it pops up! But the second you want to do something harder… well, it is much harder. I’d say doing something intermediate in ROOT is as hard as doing something advanced.

Plotting is an example.

Stacking the Plots

I have four plots, and I want to draw them on top of each other so I can compare them. If I do exactly what I learned when learning to plot a single histogram, I end up with the following:

Note all the lines are black, thin, and on top of each other. No legend. And that “stats” box in the upper right contains data relevant only to the first plot. The title strip is also only for the first plot. Grey background. Lousy font. It should probably have error bars, but that is for a later time.

THStack actually took care of a lot of stuff behind our backs. It matched up the axes, made sure the max and min of the plot were correct, removed the stats box, and killed off the title. So this is a big win for us! Thanks to the ROOT team. But we are not done. I don’t know about you, but I can’t tell what is what on there!

Color

There are two options for telling the plots apart: color the lines or make them different patterns (dots, dashes, etc.). I am, fortunately, not color blind, and tend to choose color as my primary differentiator. ROOT defines a number of nice colors for you in the EColor enumeration… but you can’t really use it out of the box. Charitably, I would say the colors were designed to look good on the printed page – some of them are a disaster on a CRT, LCD, or beamer.

First: under no circumstances, in no situation, never, EVER, use the color kYellow. It is almost like using white on a white background. Just never do it. If you want a yellowish color, use kOrange instead. At least, it looks yellow to me.

Second, try to avoid the default kGreen color. It is a fluorescent green. On a white or grey background it tends to bleed into the surrounding colors or backgrounds. Instead, use a darker green.

Do not use both kPink and kRed on the same plot – they are too close together. kCyan suffers the same problem as kGreen, so don’t use it. kSpring (yes, that is the name) is another color that is too bright a green to be useful – stay away if you can.

After playing around a bit I settled on these colors for my automatic color assignment: kBlack, kBlue, TColor::GetColorDark(kGreen), kRed, kViolet, kOrange, kMagenta. The TColor class has some nice palettes (right there in the docs, even). But one thing it doesn’t have – and really should – is a display of what the constituents of EColor look like. Those are the colors you are most likely to use.
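A sketch of the automatic color assignment (toy code: the enum values here are invented, not ROOT’s real EColor numbers, and DarkGreen stands in for TColor::GetColorDark(kGreen)) – just cycle through a fixed, screen-safe palette as each new histogram is added:

```cpp
#include <vector>

// Invented stand-ins for the screen-safe EColor choices discussed above.
enum PlotColor { Black = 1, Blue = 4, DarkGreen = 8, Red = 2,
                 Violet = 880, Orange = 800, Magenta = 6 };

const std::vector<PlotColor>& palette() {
    static const std::vector<PlotColor> p{Black, Blue, DarkGreen, Red,
                                          Violet, Orange, Magenta};
    return p;
}

// nth histogram -> its line color; wraps around if there are more
// histograms on the stack than palette entries.
PlotColor color_for(std::size_t index) {
    return palette()[index % palette().size()];
}
```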

Colors are tricky things. The thickness of the line can make a big difference, for example. The default 1 pixel line width isn’t enough in my opinion to really show off these colors (more on fixing that below).

After applying the colors I end up with a plot that looks like the following:

A Legend and Title

So the plot is starting to look ok… at least, I can tell the difference between the various things. But darned if I can tell what each one is! We need a legend. Now, ROOT comes with the TLegend object. So, we could do all the work of cycling through the histograms and putting up the proper titles, etc. However, it turns out there is a very nice short-cut provided by the ROOT folks: TPad::BuildLegend. So, just using the code:

And we now have something that is at least scientifically serviceable:

One thing to note here – there are no x-axis labels. If you add an x-axis label to your plot, THStack doesn’t copy it over. I’d call that a bug, I suppose.

Background And Lines And Fonts

We are getting close to what I think the plot should look like out of the box. The final bit is basically pretty-printing. Note the very ugly white-on-grey around the lines in the legend box. Or the font (it is pixelated, even when the plot is blown up). Or (to me, at least) the lines are too thin, etc. This plot wouldn’t even make it past first base if you tried to submit it to a journal.

ROOT has a fairly nice system for dealing with this. All plots and other graphing functions tend to take their cues from a TStyle object. This defines the background, etc. The default set in ROOT is what you get above. HOWEVER… it looks like that is about to change with the new version of ROOT.

Now, a TStyle is funny. A style is applied when you draw a histogram… but it is also applied when the histogram is created. So to really get it right you have to have the proper style applied both when you create and when you draw the histogram. In short: I have an awful time with TStyle! I’m left with the choice of either setting everything in code when I do the drawing, or applying a TStyle everywhere. I’ve gone with the latter. Here is my rootlogon.C file, which contains the TStyle definition. But even this isn’t perfect. After a bunch of work I basically gave up, I’m afraid, and ended up with this (note the #@*@ title box still has that funny background):

Conclusion

So, if you’ve made it this far, I’m impressed. As you can tell, getting ROOT to draw nice plots is not trivial. This should work out of the box (using the “SAME” option that I used in the first line, we should get behavior that looks a lot like this last plot).

Finally, a word on object ownership. ROOT is written in C++, which means it is very easy to delete an object that is still being referenced by some other bit of the system. As a result, code has to carefully keep track of who owns what and when. For example, if I don’t write out the canvas that I’ve generated right away, sometimes my canvases come out blank. This is because something has deleted the objects out from under me (it was my program, obviously, but I have no idea what did it). Reference counting would have been the right way to go, but ROOT was started too long ago. Perhaps it is time for someone to start again?
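For what it’s worth, the reference-counting style I have in mind is what std::shared_ptr now gives you. A toy sketch (invented types, not ROOT’s): if the canvas held counted handles, a drawn histogram would stay alive as long as anything still pointed at it, and the blank-canvas surprise couldn’t happen:

```cpp
#include <memory>
#include <vector>

// Toy types, no ROOT: a canvas that holds reference-counted handles to
// whatever has been drawn on it.
struct Histo { int entries = 0; };

struct Canvas {
    std::vector<std::shared_ptr<Histo>> drawn;
    void draw(std::shared_ptr<Histo> h) { drawn.push_back(std::move(h)); }
};

// Even after the caller's own handle goes away, the canvas keeps the
// histogram alive - no mystery deletions out from under you.
std::shared_ptr<Histo> surviving_histogram() {
    Canvas c;
    {
        auto h = std::make_shared<Histo>();
        h->entries = 42;
        c.draw(h);
    }  // caller's handle is gone; the canvas still owns one reference
    return c.drawn.front();
}
```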

The code I used to make the above appears below. My actual code does more (for example, it will take the legend entries and automatically turn them into “lightJets”, “charmJets”, etc., instead of the full-blown titles you see there). It is, obviously, not in C++, but the algorithm should be clear!

If you are even semi-conscious of the computing world you know this name: Bjarne Stroustrup. He is the father of C++. He started designing the language sometime in the very late 1970’s and continues to this day trying to keep it from getting too “weird” (his words).

He visited CERN this last week, invited by the ROOT team (I took a few pictures). I couldn’t see his big plenary talk due to a meeting conflict, but my friend Axel, on the ROOT team, was nice enough to invite me along to a smaller discussion. Presentations made at this discussion should be posted soon here. The big lecture is posted here, along with video (sadly, in Flash and WMV format – not quite the MP4 I’ve been discussing!). I see that Axel also has a blog and is posting a summary there too, in more detail than mine.

The C++ standard – which defines the language – is currently overseen by an ISO standards committee. Collectively they decide on the features and changes to the language. The members are compiler vendors, library vendors, library authors, large banking organizations, Intel, Microsoft, etc. – people who have a little $$ and make heavy use of C++. Even high energy physics is represented: Walter Brown from Fermilab. Apparently committee membership is basically open – it costs about $10K/year to send someone to all the meetings. That is it. Not very expensive. The committee is currently finishing off a new version of the C++ language, commonly referred to as C++0x.

The visit was fascinating. I’ve always known there is plenty of politics when a group of people get together and try to decide things. Heck, I’m in High Energy Physics! But I guess I’d never given much thought to a programming language! Part of the reason it was so fascinating was that several additions to the language that folks in HEP were interested in were taken out at the last minute, for a variety of reasons, so we were all curious as to what had happened.

I learned a whole bunch of things during this discussion (sorry for going technical on everyone here!):

Bjarne yelled at us multiple times: people like HEP are not well represented on the committee. So join the thing and get views like ours better represented (though he worried that if all 150 labs joined at once it might cause a problem).

In many ways HEP is now pushing several multi-core computing boundaries, both in the number of cores we wish to run on and in how we use memory. Memory, in particular, is becoming an acute problem. Some support in the standard would be very helpful. Minimal support is going into the new standard, but Bjarne said, amazingly enough, there are very few people on the committee who are willing to work on these aspects. Many have the attitude that one core is really all that is needed!!! Crazy!

In particle physics we leak memory like a sieve. Many times our jobs crash because of it. Most of the leaks are pretty simple, and a decent garbage collector could efficiently pick up everything and allow our programs to run longer. Apparently this almost made it into the standard until a coalition including authors of the Boost library killed it: if you need a garbage collector then you have a bug; just fix it. Which is all good and glorious in an ideal world, but give me a break! In a 50-million-line code base!? One thing Bjarne pointed out was that it takes 40 people to get something done on the committee, but only 10 to stop it. Sort of like health insurance. 🙂

Built-in support for memory pools would probably be quite helpful here too. The idea is that when you read in a particle physics event, you allocate all the data for that event in a special memory pool. The data from an event is pretty self-contained – you don’t need it once you are done processing that event and have moved on to the next one. If it is all in its own memory pool, then you can just wipe it out all at once – no need to carefully delete each object. As part of the discussion of why something like this isn’t in there (scoped allocators sound like they might be partway there), he mentioned that HP was “on our side”, Intel was “not”, and Microsoft was one of the most aggressive when it came to adding new features to the language.
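The memory-pool idea can be sketched in a few lines – a toy bump allocator, not any real proposal’s API: all of an event’s allocations come from one arena, and “deleting” the event is a single cheap reset:

```cpp
#include <cstddef>
#include <vector>

// Toy per-event arena: hand out slices of one buffer, then forget them all
// at once when the event is done. No per-object delete, no leaks to chase.
class EventArena {
public:
    explicit EventArena(std::size_t bytes) : buffer_(bytes), used_(0) {}

    void* allocate(std::size_t n) {
        if (used_ + n > buffer_.size()) return nullptr;  // toy: no growth
        void* p = buffer_.data() + used_;
        used_ += n;
        return p;
    }

    // Done with the event: wipe everything in one shot.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};
```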

I started a discussion of how the STL is used in HEP – pointing out that we make very heavy use of vector and map, and then very little else. Bjarne expressed general frustration that no one is really writing their own containers. In the ensuing discussion he dissed something that I often make use of: the for_each loop algorithm. His biggest complaint was how much stuff it adds – you have to create a whole new class, which involves lots of extra lines of code, and the code is no longer near where it is being used (non-locality can make source code hard to read). He is right that both are problems, but to him they are big enough to nix its use except in rare circumstances. Perhaps I’ll have to re-examine the way I use it.
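For reference, this is the pattern he was complaining about – a whole named functor class just to run a loop body, with the logic far from the call site:

```cpp
#include <algorithm>
#include <vector>

// The pre-C++0x std::for_each pattern: the loop body lives in its own named
// class, possibly far away from the loop that uses it.
struct SumSquares {
    double total;
    void operator()(double x) { total += x * x; }
};

double sum_squares(const std::vector<double>& v) {
    // for_each copies the functor in and returns the final copy,
    // so the accumulated total comes back out with it.
    return std::for_each(v.begin(), v.end(), SumSquares{0.0}).total;
}
```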

He is not a fan of OpenMP. I don’t like it either, but sometimes people trot it out as the only game in town. Surely we know enough to do better now. Task-based parallelism? By slots?

Bjarne is very uncomfortable with lambda functions – a shorthand way to write one-off functions. To me this is the single best thing being added to the language – it will now be possible to totally avoid having to write another mem_fun or bind2nd template. That is huge, because those things never worked anyway – you could spend hours trying to make the code build, and they added so much cruft to your code that you could never understand what you were trying to do in the first place! He is nervous that people will start putting large amounts of code directly into lambda functions – as he said, “if it is more than one line, it is important enough to be given a name!!” We’ll have to see how usage develops.
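And here is the kind of thing lambdas buy you. The old way needed std::bind2nd(std::greater<int>(), threshold) plus hours of fighting the compiler; the C++0x way says the same thing inline:

```cpp
#include <algorithm>
#include <vector>

// Old style: std::count_if(v.begin(), v.end(),
//                          std::bind2nd(std::greater<int>(), threshold));
// New style: the predicate is written right where it is used.
long count_above(const std::vector<int>& v, int threshold) {
    return std::count_if(v.begin(), v.end(),
                         [threshold](int x) { return x > threshold; });
}
```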

He was pretty dismissive of proprietary languages. Java and C# were both put in this category (though C#, at least, has an international standard behind it, just like C++), citing vendor lock-in. But the most venom I detected was when he discussed the LLVM open-source project – a compiler infrastructure with a JIT that can be used to build C++ interpreters. The project was loosely run but has now been taken over by Apple, presumably to be, among other things, packaged with their machines. His comment was basically: “I used to think it was very good, but now that it has been taken over by Apple I’d have to take a close look at it and see what direction they are taking it.”

Run-time type information. C++ came into its own around 1983 or so. No modern language is without the ability to inspect itself: given an object, you can usually determine what methods it has, what the arguments of those methods are, etc. – and, most importantly, build a call to a method without ever having seen the code in source form. C++ does not have this (its built-in RTTI stops at typeid and dynamic_cast). We all thought there was a big reason for that. The real reason: no one on the committee has pushed hard enough or is interested enough. For folks doing dynamic coding or writing interpreters this is crucial. We have to do it in our code, and adding the information after the fact is cumbersome and causes code bloat. Apparently we just need to pack the C++ committee!

Usually as someone rises in importance in their field they get more and more diplomatic – it is almost a necessity. If that is the case, Bjarne must have been pretty rough when he was younger! It was great to see someone who was attempting to steer-by-committee something he invented vent his frustrations, show his passion, name names, and at one point threaten to give out phone numbers (well, not really, but he almost gave out phone numbers). He can no longer steer the language exactly as he wants it, but he is clearly still very much guiding it.

You can find slides that were used to guide the informal discussion here. I think archived video from the plenary presentation will appear linked to here eventually if you are curious.

While the plane was bouncing all over the map, it occurred to me that ROOT is a lot like Microsoft Office (indeed, any product like Office).

Looking at the list of libraries/packages in ROOT you might be tempted to call it bloatware. You’d be right, of course, but, just like MS Word and other similar programs, the libraries you think are important are different from the ones that someone else thinks are important. And almost all the libraries are available for separate use as well! But you’ll notice that few people are using it that way.

The reason is obvious – simplicity.

The brilliance of large software packages like MS Office and ROOT is not that they innovate (though both certainly do); it is that they have taken tried-and-true ways of doing things and made them work together by packaging them up and building bridges between them. Take TMVA, for example, the package that lets you easily use various multivariate analysis techniques (like a boosted decision tree). You can get that kind of software separately – there are lots of packages out there. But TMVA is specifically designed to work with a TTree and other things in ROOT.

Once you get yourself into ROOT you have access to all these tools – and the potential barrier you have to get over to use them is minimal. You want to switch to something else? Of course you can do it – but it will be a lot of work! Just like MS Office. 😉

I think people complain about ROOT and how hard it is to use, but use it anyway for the same reasons they complain about MS Office and its daughter programs. Fortunately, for MS Office, there is real $$ involved so other companies are finally starting to compete – which I hope will make MS Office better. I don’t see anything like that coming along to challenge ROOT until after I retire (or there is a complete revolution in how we do data analysis in HEP).

There are only three weeks left to term! I can’t wait! And now for something a little bit different…

This weekend I took a few hours off and finished up some remaining work on a tool Axel and I created a year ago, ROOT Builder. This is a very small windows utility that is useful if you want to build ROOT from source on your computer.

When you hit the “Generate Project” button it will create a VS 2008, 2005, or 2003 project that you can then edit and access in the IDE. Unfortunately, this still uses the ROOT makefile and not a native VS IDE project (which means the build is a bit slow). However, if you build in debug mode, this allows you to hit “debug” and step into ROOT right away. Very handy for those odd I/O bugs that you can’t solve even with Philippe’s help! 😉

So, on Windows, there have been occasions when I’ve needed to build ROOT myself – in particular because the released Windows version of ROOT is not built against Python. I also find myself doing lots of building when bugs are found in ROOT (yeah, I know, ROOT doesn’t have bugs).

The standard way to do the build is to download and install Cygwin. It is a rather heavy install, however, as any of you who have installed it know. Axel, a member of the ROOT team, put together a minimal set of tools that will build ROOT using the MSVC compilers. I built a small UI on top of that, along with an installer, to make it easier to get onto your machine.

I found this pretty useful for getting my work done, so I thought I’d release it. Actually, it has been finished for months, I just never seemed to get around to finishing off the last bit. So, here it is. The UI is horrible, but it works…

I complain about ROOT a lot. However, one thing I have to say is that when you report a bug it is only a matter of days before it gets fixed in their source code repository. Philippe Canal just did a whole bunch of work after I found a few bugs in the MakeProxy code that means it will now be possible to use weird variables like vector<vector<double> > without having to think (these are produced by the EventView tool in ATLAS). Nice!

BTW, MakeProxy is a step back in time. Remember PAW (ok, I know, some of you are still using it)? One of the nice things about the FORTRAN macros back then was the auto-loading of ntuple columns: only what you referenced in your FORTRAN code was loaded. That isn’t the case in ROOT – it has no good way to scan compiled C++ code to determine what is used. Philippe wrote an automatic proxy generator that replaces each ntuple branch with a bit of code that makes sure the data is read in before it is accessed (I’m pretty sure Philippe was the progenitor of the idea). In the end this has the same effect: only what is needed is read. I’ve seen x10 speed-ups from this. The only thing it can’t do is look inside a custom compiled object, so it doesn’t buy you as much if you are running in that case.
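The proxy trick can be sketched generically (toy types – this is not Philippe’s generated code): wrap each branch in an object that only performs the read the first time the analysis touches it, so untouched branches cost nothing:

```cpp
#include <functional>
#include <vector>

// Toy branch proxy: the "reader" callback stands in for the actual branch
// read. It fires only on first access, and never fires for branches the
// analysis code does not touch.
class BranchProxy {
public:
    explicit BranchProxy(std::function<std::vector<double>()> reader)
        : reader_(std::move(reader)), loaded_(false) {}

    const std::vector<double>& get() {
        if (!loaded_) {            // first access: actually read the branch
            data_ = reader_();
            loaded_ = true;
        }
        return data_;              // later accesses are free
    }

    bool loaded() const { return loaded_; }

private:
    std::function<std::vector<double>()> reader_;
    std::vector<double> data_;
    bool loaded_;
};
```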