Monday, 25 October 2010

The die has been RECAST

RECAST is a proposal for a more efficient use of the data collected by particle physics experiments. A paper outlining the proposal appeared on arXiv two weeks ago. In order to explain what RECAST is and why it is a good idea, I need to make a small detour.

In the best of all worlds, all experimental data acquired by humanity would be stored in a convenient format and could be freely accessed by everyone. Believe it or not, the field of astrophysics is not so far from this utopia. The policy of the biggest sponsors in that field - NASA and ESA - is to require that more-or-less raw data (sometimes in a pre-processed form) be posted some time, typically 1-2 years, after the experiment starts. This policy is followed by such cutting-edge experiments as WMAP, FERMI, or, in the near future, Planck. And it is not a futile gesture: quite a few people from outside these collaborations have made good use of the publicly available data, and more than once maverick researchers have made important contributions to physics.

Although the above open-access approach appears successful, it is not being extended to other areas of fundamental research. There is a general consensus that in particle physics an open-access approach could not work because:

bla bla bla,

tra ta ta tra ta ta,

chirp chirp,

no way.

Consequently, data acquired by particle physics collaborations are classified and never become available outside the collaboration. However, our past experience suggests that some policy shift might be in order. Take for example the case of LEP. Back in the 90s the bulk of experimental analyses was narrowly focused on a limited set of models, and it is often difficult or impossible to deduce how these analyses constrain more general models. One disturbing consequence is that to this day we don't know for sure whether the Higgs boson was beyond LEP's reach or whether it was missed because it has unexpected properties. After LEP's shutdown, new theoretical developments suggested possible Higgs signatures that were never analyzed by the LEP collaborations. But now, after 10 years, accessing the old LEP data requires extensive archeological excavations that few are willing to undertake, and in consequence a wealth of valuable information is rotting in the CERN basements. The situation does not appear to be much better at the Tevatron, where the full potential of the collected data has not been explored, and may never be, either because of theoretical prejudices or simply because of a lack of manpower within the collaborations. Now, what will happen at the LHC? It may well be that new physics will hit us straight in the face, and there will never be any doubt what the underlying model is and which signals we should analyze. But it may not... Therefore, it would be wise to organize the data such that they could be easily accessed and tested against multiple theoretical interpretations. Since open access is not realistic at the moment, we would welcome another idea.

Enter RECAST, a semi-automated framework for recycling existing analyses so as to test for alternative signals. The idea goes as follows. Imagine that a collaboration performs a search for a fancy new physics model. In practice, what is searched for is a set of final-state particles: say, a pair of muons, jets with unbalanced transverse energy, etc. The same final state may arise in a large class of models, many of which the experimenters would not think of, or which might not even exist at the time the analysis is done. The idea of RECAST is to provide an interface via which theorists or other experimentalists could submit a new signal (simply at the partonic level, in some common Les Houches format). RECAST would run the new signal through the analysis chain, including hadronization, detector simulation, and exactly the same kinematical cuts as in the original analysis. Typically, most of the experimental effort goes into simulating the standard model background, which has already been done by the original analysis. Thus, simulating the new signal and producing limits on the production cross section of the new model would be a matter of seconds. At the same time, the impact of the original analyses could be tremendously expanded.
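To make the recycling step concrete, here is a toy numerical sketch (not the actual RECAST interface; the function name and all numbers are hypothetical). The expensive part - the background estimate encoded in the existing 95% CL limit on the number of signal events - is reused unchanged; only the new signal's efficiency under the original cuts must be recomputed:

```python
# Toy sketch of recasting an existing analysis for a new signal.
# In reality the pass/fail flags would come from running the
# submitted partonic events through hadronization, detector
# simulation, and the original kinematical cuts.

def recast_limit(n95_events, luminosity_pb, pass_flags):
    """Turn an existing 95% CL limit on signal *events* into a
    cross-section limit for a new model, given which simulated
    events of the new signal survive the original cuts."""
    efficiency = sum(pass_flags) / len(pass_flags)
    if efficiency == 0:
        return float("inf")  # the analysis is blind to this signal
    # sigma_95 = N_95 / (efficiency * integrated luminosity)
    return n95_events / (efficiency * luminosity_pb)

# Suppose the original search excludes more than 30 signal events
# in 1000 pb^-1, and 2500 of 10000 simulated new-signal events
# survive the original cuts (efficiency 0.25):
flags = [True] * 2500 + [False] * 7500
sigma95 = recast_limit(30.0, 1000.0, flags)
print(f"95% CL limit: {sigma95:.3f} pb")  # -> 0.120 pb
```

The point of the sketch is that once the efficiency is known, the new limit is a one-line rescaling: nothing about the background estimate has to be redone.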

There is some hope that RECAST may click with experimentalists. First of all, it does not put a lot of additional burden on collaborations. For a given analysis, it only requires a one-time effort of interfacing it into RECAST (and one could imagine that at some point this step could be automated too). The return on this additional work would be a higher exposure of the analysis, which means more citations, which means more fame, more job offers, more money, more women... At the same time, RECAST ensures that no infidel hands ever touch the raw data. Finally, RECAST is not designed as a discovery tool, so the collaborations would keep the monopoly on that most profitable part of the business. All in all, lots of profit for a small price. Will it be enough to overcome the inertia? For the moment the only analysis available in the RECAST format is the search for the Higgs decaying into 4 tau leptons performed recently by the ALEPH collaboration. For the program to kick off, more analyses have to be incorporated. That depends on you....

Come visit the RECAST web page and tell the authors what you think about their proposal. See also another report, more in a this-will-never-happen vein.

15 comments:

Anonymous
said...

You show a poor understanding of how data is released by some of the other experiments outside of HEP which you laud as "open access". There is no open access. Data may be made available only in very highly processed formats, subject to the review of the collaboration. The raw data would be neither useful to the outside community, nor would it advance the cause of science. It is far too prone to being badly analyzed.

You have also failed to understand (though I think the authors of RECAST have not) that this proposal does in fact require the collaborations to invest significant manpower. Perhaps they will ultimately decide it is worth it. We can hope.

I understand your style is to be 'edgy' and 'tell-it-like-it-is'. But you have crossed a line where I no longer think you are overstating for effect, but instead think that you are actually just not very well informed.

Moreover, you seem to misinterpret the paper by Kyle and Itay (I know Itay from Harvard).

It's not a call for open access, it's a somewhat more detailed - but not too detailed - recipe to make it easier to recycle some results so that some parts of the previous analyses don't have to be done again. In this sense, it is an attempt to make people spend *less* time, not more, reproducing the results of others.

If you make a sophisticated transition from the raw data to the final comprehensible results that actually imply something we care about, you have to do lots of work. And this work doesn't decrease much if someone gives you the codes etc. he used.

In order for the new "outsider" to be sure he's doing the right thing with the data, he will have to learn the details of the code anyway. It takes almost as much time as if he were doing it himself.

It's actually very questionable whether a universal "open access" would increase the percentage of correct results or decrease the time needed for a typical analysis.

Of course, another matter is if there are worries that someone is deliberately hiding some data because he's been doing something intentionally bad with them. In such cases it is desirable for the original data and all the methods to be accessible - but in such cases, it's hard to force the authors to make them accessible, too.

I hope that experimental particle physics is not yet in the state where the risk of deliberate misconduct is too high - primarily because "no one cares" about one result or another. ;-)

Anon, I think you misinterpret what I said. Concerning open access in astrophysics, for example, Fermi gives you the energy and direction of each reconstructed photon + some quality information. This is what I meant by "more-or-less raw data", and this is enough for outsiders to perform an independent analysis. "More-or-less raw data" does not mean "a full set of read-out voltages" - that would indeed be useless for outsiders. If HEP experiments could publish basic information about all reconstructed objects, that would be more than perfect. Secondly, whether RECAST will involve significant manpower depends a bit on the collaborations. If they put RECAST analyses on the same footing as the regular ones, with the same level of scrutiny, then indeed, by definition, RECAST will require similar manpower as the original analysis. However, I do hope that some fast-track procedures could be developed for RECAST. That is to say, a RECAST analysis would not have to carry the full authority of the collaboration on its back.

Lubos, I am aware that RECAST is not a call for open access - on the contrary. But for me it could play a similar role - lead to a more efficient use of experimental data. See the keywords "detour" and "another idea" that begin and end the parable. Whether open access would speed up progress in HEP is of course a matter of opinion, since there are no "experimental data" so far to support the pros and cons. But since it works fine in astro, why wouldn't it work in HEP?

Dear Jester, do you actually follow the relevant technicalities about data in HEP experiments?

Take the LHC as a not so minor example. ;-) It has made 2 trillion collisions at each detector. Each collision comes with 1 kilobyte of data.

So that's 2 petabytes - 2 million gigabytes - of data. You don't want to provide each fan or auditor of particle physics with a USB flash drive with all the raw data. In some sense, all the work of experimenters and phenomenologists is about selecting subsets of the data and projecting them onto some relevant axes.
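As a sanity check of the quoted figures (using the comment's round numbers):

```python
# Back-of-the-envelope check of the quoted data volume.
collisions = 2 * 10**12       # 2 trillion collisions per detector
kb_per_collision = 1          # ~1 kilobyte of data per collision
total_kb = collisions * kb_per_collision

petabytes = total_kb / 10**12   # 1 PB = 10^12 kB
gigabytes = total_kb / 10**6    # 1 GB = 10^6 kB
print(petabytes, gigabytes)     # -> 2.0 PB, i.e. 2 million GB
```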

The LHC and other experiments may have too many people in them. On the other hand, you couldn't do it with vastly smaller numbers of experts. CERN may hold something like 1/2 of the world's people who can interpret the experiments in some sensible way.

It's pretty self-evident that even if the rest of the world "teamed up", it wouldn't quite be able to do all the necessary work needed to derive something - because it would never team up so well. Some pre-selection and pre-manipulation of the data is always necessary, and outside individuals won't be able to check the manipulation of the data "at all levels".

Even if you look at less extreme cases than the LHC, HEP has much more extensive data, and the theoretically interesting features of the data are hidden in much more complex patterns than in astrophysics.

In astrophysics, you always measure things such that the amount of information can be rewritten as a histogram - or graph - in at most 3-4 coordinates (frequency, 2 coordinates on the sky, time). This space of possible data is replaced by the whole 100-particle phase space in particle collisions. ;-) The histograms are made out of many components corresponding to different combinations of particle species, and each component is a histogram in up to 300-dimensional space, to label the momenta of 100 particles. ;-)

You know, an LHC collision may produce up to hundreds of particles (like 100 charged particles in the recent claim about the quark-gluon plasma). Some "projection" always has to be done, and there are many possible projections. It's questionable whether the data at any level - of a manipulation done by a paper - are of much use to anyone other than those who are writing the paper.

Is that a joke, anonymous [above me]? I am a 100% theorist, so the maximum thing that my comments could have revealed was my extraordinary sensitivity for experiments and their subtleties. ;-)

It's the whole point of the explanation why the open access probably won't be too useful in a multi-layered field such as particle physics. Once the data get to the level that is interesting for the theorists, they have already been massaged and manipulated by many, many layers of algorithms and compression.

And for each theoretical purpose - e.g. for the verification of each individual theory - the optimal procedure to manipulate the raw data is a bit different.

The experimenters or others who write papers always face a trade-off. On one hand, they try to make a paper relevant to a maximum number of phenomenologists and theorists; on the other hand, they want to say something specific enough so that the theorists and phenomenologists actually care and understand.

This is a subtle trade-off. So there are various ways of talking about the data that are more experimental, phenomenological, or theoretical in nature. But you can't have the advantages of all the layers at the same moment.

If you publish some analysis that is very close to the raw data, it will only be useful for those phenomenologists who are very close to the experiment etc. If you manipulate the data more intensely so that they can actually be usefully read by the people who care about the effective Lagrangians, i.e. if you're an experimenter who already has a theory in mind and tries to reverse-engineer its features, your manipulation will inevitably contain some model-specific procedures, and another paper meant to present the experimental results in a way relevant for another model will have to do much of the work differently, i.e. from scratch.

In all cases, it's hard work, and the verification of the work takes nearly as much effort as doing the work in the first place.

I'm avoiding the issue of trying to characterize HEP or non-HEP data access in 10,000 words or less, but I think that most people do appreciate what Itay and I are proposing in the RECAST paper.

It will require an effort by the experimentalists to incorporate their analyses into something like RECAST. It may actually happen if there is community support for the approach. If you support the idea, then I would like to encourage you to leave a comment on the RECAST web page.

It's certainly going to require work, I don't want to give the impression that it is effortless. I do think that recasting is less effort than designing a new analysis from scratch, and I do think that with time the experiments could streamline the process significantly so that the marginal effort is not that large. But for that to happen, there will need to be a community effort. Most of that work will have to come from within the experiments themselves, but having a clear endorsement (or at least encouragement) from the theoretical community will help.

Even if the overall community's costs (and time required) were reduced by the extra formatting you propose, you simply haven't solved the financial problem, because people's and teams' budgets are separate, and for a good reason.

What you effectively say is that a team, X1, should do some extra work that costs C1 because it will save time for teams X2,X3 who won't have to pay C2 and C3.

But why would team X1 be doing that? It effectively means that they would donate C1 - or C2+C3, depending on your viewpoint - to the other teams. Why would they suddenly do it? Do you understand that this is equivalent to reducing all their grants by a significant percentage?

Scientists are responsible for - and credited with - finding results that can be seen to be valid and new, in any format. They're not paid for doing work for others.

And if you wanted to "close" the physics community so that it would only be accessible to the people who respect some Itay-Kyle format, then I apologize, but I am strongly against it, and surely I won't be the only one.

Science has to be open to anyone regardless of formats. Otherwise it's not science. Science is not format worship. It's both an ethical and a pragmatic principle. If you made scientific work in places A and B impossible without this extra hassle, scientists would surely start to move elsewhere.

I'm a little lost. As you wrote, RECAST is "not a call for open access", and it has nothing to do with data formats. The point is that if you archive analyses, you don't even need access to the data for a RECAST request to be processed.

I don't want to get into the debate about the economics of it all. I agree that the success of the system depends on the incentives working out correctly, but I think they do... or at least can. I don't think your simple zero-sum picture of the economics is a good model for what is going on for a number of reasons. First, groups have grants that cover some amount of time, and they try to do as much good physics in that time as possible. They continue getting funded if they make the case that they were doing enough interesting stuff.

An experimentalist will get involved in RECAST if they think it is worthwhile for them to do so. That is, if they think that their existing analysis is sensitive to new models, and if they think the benefit of saying something about those models is worth the effort it requires to 'recast' their old analysis to answer that question. That is why identifying concrete examples is important, and why it is important to try to streamline the system.

For an experimentalist, I think the decision is between starting a new analysis for a new signal (a big investment), or reusing an old analysis for a new signal (a smaller investment). The recast analysis might have less power, but if it's powerful enough then it will be a good choice.

Remember, this is not hypothetical, we considered several case studies where this was done... like the many papers that recast existing LEP Higgs searches into different final states.

If the archival system becomes automated, then it is really very little work for an experimentalist to get the new result. Of course, now the effort has been transferred to the computing professionals who set up such a system. But that is exactly what their job is: to provide computing tools that support and facilitate doing science efficiently. In the end, it's going to be up to the experimental collaborations, the labs, etc. to collectively decide if the benefit of such a system is worth the investment. I think it is.

I wouldn't say that one group is doing another group's work, because previously those questions just weren't being answered. That's why something like the LEP H->4τ analysis was left as an open question for so many years.

In short, I appreciate that the success or failure of such a system is going to be based on incentives and some fuzzy notion of economics. That's why it is important for those who think it would be useful to have such a system to speak up. Even better if there is a concrete example (e.g. Jester's h->lepton jets).

Remember, the proposal itself is actually very conservative... it is not a break from the way we experimentalists have been doing analyses in the past, it is not a call for open data access; it's just a way of extending the impact of those searches to new models in a streamlined way.

Thanks for your explanations. Note that they were mostly about the economics of the proposal: they had to be, because your whole proposal is about the economics of different management approaches. It would be strange if you "didn't want to talk about the economics", because that's what you're doing all the time.

Do you actually have a feeling that the community needs to reduce the work in this way? My impression is that there are, on the contrary, many people without work and it's OK if they often have to do some analyses from scratch.

Also, your recycling proposal would make mistakes spread much further from their original places. You really want to reduce the number of repetitions and checks - you want people to increasingly rely on what was written previously.

I think that just for the safety that the results are not spoiled by past errors, it's a very problematic approach. You paint the fact that some work is only done once in the history of physics as a clear advantage. I don't see it as a clear advantage. It doesn't hurt when similar things are computed again and again, from slightly different angles, by somewhat different people who are influenced by slightly different assumptions and philosophies. This is one mechanism that makes the scientific process more robust. It also makes the scientists more experienced. If the calculation of a signal is only done by one team and the results are recycled for 20 years, then you may even think that a whole inexperienced generation is being created.

Moreover, as I mentioned, the errors are being spread "exponentially". If you depend on N previous calculational steps and there is a probability p (near one) that each step is fine, the probability that all of them are fine is p^N. It decreases with N. If p^N is too far from one, the whole research may become too shaky.
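A minimal numerical illustration of this compounding (the per-step reliability 0.99 is an arbitrary choice):

```python
# How the probability that *all* steps are error-free falls with N,
# for a per-step reliability p close to one.
def chain_reliability(p, n):
    """Probability that all n independent steps are fine: p**n."""
    return p ** n

for n in (1, 10, 50, 100):
    print(n, round(chain_reliability(0.99, n), 3))
# With p = 0.99: N=10 gives ~0.904, N=100 only ~0.366.
```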

About Résonaances

Résonaances is a particle physics blog from Paris. It's about the latest news and gossips in particle physics and astrophysics. The posts are often spiced with sarcasm, irony, and a sick sense of humor. The goal is to make you laugh; if it makes you think too, that's entirely on your own responsibility...