Tuesday, May 20, 2014

ATLAS: find Higgs, win $7k

Particle physics meets the machine learning sport

Amateur and experienced programmers, you have a chance to win $7,000 (gold), $4,000 (silver), or $2,000 (bronze) if you succeed in a contest organized by the LHC's ATLAS Collaboration (via Tommaso Dorigo).

So far there are 180+ contestants (well, teams – a team may contain at most 4 people). Anyone who registers and sends her results by September 15th, 2014 may win, however.

What is the sport about?

You have to download 55 megabytes of data in four files and write a program – assuming that you won't be able to classify all the data off the top of your head, or on the back of an envelope – that is able to classify an event (a proton-proton collision) as "s[ignal]" (super-exciting) or "b[ackground]" (boring).

You and your computer may find the right way to label the event as "s" or "b" by looking at 250,000 events which already have the right "s" and "b" labels attached to them. Yes, the contestants who don't like computers will have to cut a forest to obtain a sufficient amount of paper for that. Send your results.

Each row starts with an ID such as 100,000 (the IDs from 100k up to but not including 350k are training events, which carry an "s" or "b" at the end; those from 350k up to but not including 900k are contest events in the "test" file) followed by 30 reasonable-accuracy real numbers and, in the case of the training events, also 1 insanely precise number (a "weight": close to zero for "s", greater than about one for "b") as well as an "s/b" classification (a "label"). Out of the 250,000 training events, 85,667 are "s" (signal), slightly more than 1/3. (This information, as well as some other comments below, reveals results of my preliminary "research" of the files.)
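For programmers who want to see the row layout in code, here is a minimal Python sketch of a parser for one training row. It assumes the fields are comma-separated and appear in the order described above (ID, 30 numbers, weight, label); the function name and the sample row are made up for illustration and are not taken from the contest files.

```python
EXPECTED_FIELDS = 33  # ID + 30 features + weight + label, per the description above

def parse_training_row(line):
    """Split one comma-separated training row into its four parts."""
    fields = line.strip().split(",")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return {
        "id": int(fields[0]),
        "features": [float(x) for x in fields[1:31]],
        "weight": float(fields[31]),
        "label": fields[32],
    }

# Synthetic row for illustration only (not taken from the contest files).
sample = ",".join(["100000"] + ["1.5"] * 30 + ["0.002653311", "s"])
row = parse_training_row(sample)
print(row["id"], len(row["features"]), row["weight"], row["label"])

# Signal fraction quoted in the text: 85,667 out of 250,000.
signal_fraction = 85667 / 250000
print(round(signal_fraction, 4))
```

The last two lines just redo the arithmetic behind "slightly more than 1/3".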

Your final submission (up to 5 submissions per day are OK) is a CSV file of this form:

The rank order is a number between 1 and 550,000 you calculate – 1 is the most background-like event according to you, 550,000 is the most signal-like event. I think that only the s/b answers are actually used to pick the winners.
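A rough Python sketch of how such a file could be produced from a list of per-event scores follows. The header names EventId, RankOrder and Class, the helper names, and the choice to label a fixed top fraction as "s" are all my assumptions for illustration; check the contest page for the exact required format.

```python
import csv
import io

def build_submission(event_ids, scores, signal_fraction=0.15):
    """Rank events by score and label the top fraction as 's'."""
    n = len(event_ids)
    # Indices sorted from most background-like (lowest score) to most signal-like.
    order = sorted(range(n), key=lambda i: scores[i])
    rank = [0] * n
    for position, i in enumerate(order):
        rank[i] = position + 1  # ranks run from 1 to n
    cutoff = n - int(round(signal_fraction * n))  # ranks above the cutoff become "s"
    return [(event_ids[i], rank[i], "s" if rank[i] > cutoff else "b")
            for i in range(n)]

def to_csv(rows):
    """Serialize the rows with an assumed three-column header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["EventId", "RankOrder", "Class"])  # assumed header names
    writer.writerows(rows)
    return buf.getvalue()

# Four made-up contest events with made-up scores.
rows = build_submission([350000, 350001, 350002, 350003],
                        [0.2, 0.9, 0.1, 0.5], signal_fraction=0.25)
print(to_csv(rows))
```

With 550,000 real events, the same function would assign rank 1 to the most background-like event and rank 550,000 to the most signal-like one.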

Once you train yourself or your programs to decide whether an event is an "s" or a "b", you should apply that skill to the 550,000 contest events that are not known to be "s" or "b" to any contestant. The closer you get to the right classification of events as "s" or "b", the greater your chance to win the money.

Each event is a collection of numbers – more precisely 30 parameters describing kinematical distributions – that captures some properties of the resulting products of a proton-proton collision. There are many patterns in the numbers and some combination of the patterns is useful for deciding whether the event is an "s" or a "b". You may view the events as points in a 30-dimensional space \(\RR^{30}\) and your task is essentially to develop a program drawing a map of this 30-dimensional world, i.e. dividing this space into "b" (Bundesland; sorry, I didn't find a better word) and "s" (seas). ;-) One would clearly need unrealistically huge resources to remember the "bitmap" of the 30-dimensional space. Instead, you have to design (and apply) a system to construct a compressed JPEG-like 30-dimensional image of this space.

If you want to know, the "s" events are those in which you create a Higgs boson \(h\) that decays to a pair of cousins of the electron (about 3,500 times heavier) known as the taus (a particle-antiparticle pair),\[
h\to \tau^+ \tau^-,
\] and the "b" events are those that just imperfectly imitate the "s" events. But in principle, it is said that even programmers who don't understand any particle physics should have a chance to win. It is actually an interesting question whether programmers may train their computer and/or their brain to "see" whether something is a decaying Higgs boson.

If they learn to do it, they have mastered a "practical skill" in particle physics – something that animals would be forced to learn to do in practice if their survival depended on the discrimination of decaying Higgs bosons. However, I think that the "theoretical, deep wisdom" about particle physics is much more than any practical skill of this kind, and that's also why particle physics cannot be automatized.

The contestants who send their answers may see how they're doing in a "preliminary leaderboard" of results. It only incorporates 18% of the events, so it is just an approximation of the final results, but it is likely to tell you a lot about whether you are doing well or badly anyway.

Incidentally, if I were picky, I would point out that in principle, one cannot sharply divide the events into "s" or "b", not only because the "b" events may parrot "s" events very closely but also because in each single real-world event, the intermediate histories with a Higgs boson and without a Higgs boson interfere with each other before you get the probability amplitude for a final state.

So the existence of the Higgs – a new particle in the intermediate histories – only modifies the probability for a particular final state but you can never quite 100% certainly (even in principle, if you know absolutely everything about the final state that can be known) attribute the Higgs-like character of an event to the Higgs. It is a disclaimer similar to the usual comments that you can't attribute a particular hurricane to man-made climate change. The only difference between the Higgs boson and the CO2-hurricane link is that the existence of the Higgs boson actually does observably modify the composition of (or the character of some) proton-proton collisions.

Non-physicists are unlikely to understand the meaning of the 30 parameters listed for each event. They are:

Lots of information about energies and momenta of all the leptons and all the jets (partons). PRI_jet_num – probably the number of jets – seems to be the only integer among the 30 numbers. If that number is zero, the jet-related numbers are written as -999.0 because the information about jets is "N/A". The numbers of training events with 0, 1, 2, 3 jets are about 100k, 78k, 50k, 22k, i.e. 250k in total, and no higher numbers are found.
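The jet counting and the -999.0 placeholder can be handled with a few lines of Python. This is only a sketch: PRI_jet_num is the name quoted above, while jet_quantity is a hypothetical stand-in for any jet-related column, and the three events are invented.

```python
from collections import Counter

MISSING = -999.0  # sentinel value the post reports for unavailable quantities

def jet_histogram(events):
    """Count how many events have 0, 1, 2, ... jets."""
    return Counter(int(e["PRI_jet_num"]) for e in events)

def mask_missing(event):
    """Replace the -999.0 sentinel with None so it is not mistaken for a measurement."""
    return {k: (None if v == MISSING else v) for k, v in event.items()}

# Invented events; "jet_quantity" is a hypothetical column name.
events = [
    {"PRI_jet_num": 0, "jet_quantity": -999.0},
    {"PRI_jet_num": 2, "jet_quantity": 46.2},
    {"PRI_jet_num": 0, "jet_quantity": -999.0},
]
print(jet_histogram(events))
print(mask_missing(events[0]))
```

Treating -999.0 as a real number would badly distort any average or histogram of the jet quantities, which is why the sketch replaces it before doing anything else.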

But even if you know the meaning of the 30 numbers, e.g. because you are a particle physicist, there is no straightforward way to reverse-engineer them and to decide which collections of 30 real numbers are "b" and which of them are "s". Even to a trained particle physicist who knows how the Higgs may decay and who understands these labels above, none of the collections of numbers seems to say "b" or "s". Instead, all of them say "BS". ;-) So not being a physicist might not be a severe disadvantage, after all.

Only ATLAS is going to pay the money to you; CMS pays the corresponding money to Tommaso Dorigo for him to have some fun with Cicciolina.

Update: Excellent, I submitted my own random permutation with 1/3 of "s" labels, their computer was satisfied with the format, and I verified that with such unrefined submissions, one remains at the bottom of the leaderboard, haha. I also have some real code discriminating the events but haven't submitted it yet. Then I posted two test submissions with a simple manual score read from the histogram. However, both of them were plagued by a severe bug in my Mathematica code: I thought that Ordering[...] produces the inverse of the permutation that it actually returns. Ordering[{2,3,1}] isn't {2,3,1} but its inverse, {3,1,2}. The bug was fixed by replacing it with Ordering[Ordering[...]] and my rank in the leaderboard jumped by six spots or so. However, the best score improved from 0.55 to 1.3 so it is clearly sensitive to such choices. ;-) I may apply the real, potentially competitive algorithm later.
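For readers without Mathematica, the same trap can be reproduced in a few lines of Python. The two functions below are my own stand-ins (with 1-based positions, mimicking Ordering), not a translation of the code I actually submitted.

```python
def ordering(values):
    """1-based positions that would sort `values`, mimicking Mathematica's Ordering."""
    return [i + 1 for i in sorted(range(len(values)), key=lambda i: values[i])]

def ranks(values):
    """Rank of each element (1 = smallest), obtained by applying ordering twice."""
    return ordering(ordering(values))

print(ordering([2, 3, 1]))  # [3, 1, 2]: where the sorted elements come from
print(ranks([2, 3, 1]))     # [2, 3, 1]: the rank of each element
```

The single application answers "which element comes first, second, third", while the contest wants "what is the rank of each event", and that is what the double application returns.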

26 comments:

Right. They're simulated but if they're correctly simulated, in agreement with Nature, they still agree with the claim that there can't be an unambiguous assignment of the signal-vs-background labels on an event-by-event basis. In practice, the "background" events in the contest file are hopefully events that significantly differ from the signal ones, in some way, so one may gain a high enough certainty that the contribution of the signal to the probability amplitude wasn't important for the event to occur. But this can never be a certainty.

A neural net was trained to ID photos of friendly versus enemy tanks. It went spectacularly well until it was applied in the real world. Friendlies were beautifully rendered; vampires were captured spy images. The neural net had been trained to ID contrast and focus.

Great, Lubos. So I think I more or less understand the main points of your article by now. As a good student I should be able to use that knowledge to take apart the "cogwheel model" (chapter 2.2. in 't Hooft's paper) which looks somewhat tempting to me. The main point I see is that I am only allowed to take values for t that are multiples of delta t, which is not a continuous time evolution at all and thus a major flaw which is swept under the rug. Other than that I didn't manage to identify any clear problems, though perhaps because on this level it is not even clear what the significance of the complex linear combinations of the ontological basis is, so maybe it is too early to attack those.

Hi Lubos, since we talk about the foundations of quantum mechanics I have another slightly off topic question: Have you ever heard about PT-Symmetric Quantum Theory and if so, what do you think about it? http://arxiv.org/pdf/quant-ph/0501052v1.pdf

Dear Lubos, I never imagined that they'd study Hebrew, but I've known of Cantor and aleph since I read a popular treatment of various math topics for young people when I was in middle school or high school, and I've seen the aleph in non-advanced articles many times since, not always in mathematical contexts (and no, I'm not Jewish). I would even say, in news magazines. The mere fact that it looks like 'N' makes it memorable; one wonders, why doesn't it look like alpha?

Something I've noticed a lot in life is that many people don't know things which would seem to me to be general knowledge, or at least general knowledge for people with inclinations in a certain direction, in this case, math & science. I've always found this puzzling.

It's not as if I were a math wizard. You yourself once alluded to this about me. I got very good scores in the math SAT: 720 math aptitude, 769 advanced math achievement. But I'm lazy, and I fluffed my way through courses. I squeezed through until, finally, I flunked calculus on manifolds. And but: A high-school classmate of mine in a small public high school - 100 in the senior class - could run rings around me in math, while another two could run the same rings :-) I could run. Two of those three became physicists but then gave it up.

You have to scan everything yourself - and I had to - and there are many other similar patterns that you (or your program) will have to learn, but your or your computer's ability to learn such things is probably necessary to succeed.

I think that the people who had accounts on the Kaggle server are generally de facto professionals in this mass evaluation of the data.

Sure, just register an account and then click on Leaderboard - on their website or this blog post. You will see Luboš Motl with the 1-line discrimination result near the bottom LOL. There is also Tommaso Dorigo in the middle of the table.

Dear Tom, the weight tells you how much the correct or wrong identification of the event contributes to your background and signal score.

Note that the "s" events have weights close to zero and "b" events have a big weight.

The final score is only calculated from the events that you label as "s". Among those, one calculates the real positives (indeed "s") and false positives ("b" in their correct database), probably not by counting "1" but by adding the weights. By a signal-vs-background-dependent formula, it's translated to the score.
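A hedged Python sketch of that procedure: sum the weights of the true and false positives among the events you labeled "s", then feed the two sums into a formula. The formula used here is the "approximate median significance" with a regularization term of 10, which is my assumption about what the organizers use; the function names and the four toy events are invented.

```python
import math

def weighted_counts(predicted, truth, weights):
    """Sum the weights of true positives (s) and false positives (b) among events labeled 's'."""
    s = sum(w for p, t, w in zip(predicted, truth, weights) if p == "s" and t == "s")
    b = sum(w for p, t, w in zip(predicted, truth, weights) if p == "s" and t == "b")
    return s, b

def ams(s, b, b_reg=10.0):
    """Approximate median significance; the formula and b_reg = 10 are assumed, not quoted."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Toy example: three events labeled "s", one of them wrongly.
predicted = ["s", "s", "b", "s"]
truth     = ["s", "b", "s", "s"]
weights   = [0.01, 2.0, 0.02, 0.03]
s, b = weighted_counts(predicted, truth, weights)
print(s, b, ams(s, b))
```

Events labeled "b" never enter the sums, which is why the score only depends on what you call "s".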

Well, right, but they check the answer against the simulated (but kept secret) weight score, which is a (quasi)continuous measure of how much contribution the Higgs boson has (if I understand correctly) so it is (or at least can be in principle) quite adequate.

Anyways, people might get the impression after reading your article that these are actual measurements, in which case the LHC folks either do or do not have a superior algorithm for deciding the question, both cases rendering the competition worthless. I meant to comment on that.

Building a visualizer for the events is likely required. If you look at 1000 or more training events you can build an idea of b vs s.

There are only 30 parameters, so with 3d, colour and weight, it is easy to play with lots of different views, some based on a traditional photo of the event, others with some abstract parameters as the x,y,z.

Dear Dilaton, I spent many hours installing Ubuntu Linux in a virtual box and many packages - nothing worked immediately - but now I have rank 44, score 3.6, ahead of Dorigo at rank 109 and score 3.1 or so. ;-) With some string-theory-knowledge-helped improvement, I may win this contest now.

If I've told you once, I must have told you a thousand times: Do not let this Motl fellow enter any more contests where he has a chance of winning. Because he will. It is bad for morale, bad for fairness and equality, bad for global harmony and peace.

I enclose a Kindle copy of the Vonnegut short story Harrison Bergeron. Be sure to get back to me with a 10-page report (minimum length) by this time tomorrow, detailing how you will apply the lessons learned from the short story to hobbling Motl so that he ceases to be an irritant.

Do not fail me this time, [redacted], or you will learn the meaning of "hardship assignment", the hard way. Our office in Mogadishu is severely shorthanded. I think you catch my drift?