Friday, May 23, 2014

Higgs contest, top ten

As you may have noticed (by looking at my reduced activity on this blog), I have spent hours with the ATLAS Higgs contest in the past two days. It's time to boast a little bit. Here is a screenshot of the current leaderboard:

Click to zoom in.

You see that there's someone new in the top-ten list, someone who jumped up by 179 places in the past two days. And his name is the same as mine! ;-)

Finally, I could apply some idiosyncratic improvements to the data manipulation. For comparison, Tommaso Dorigo of CMS is in 116th place. Maybe he should try to learn some string theory if he's not too good at evaluating the data from particle physics experiments. Or better not... ;-)

This has been my most intense wave of "contest-motivated" programming in the last 25 years – I won some Czechoslovakia-wide programming contests in 1988 and 1989, but back then an informal programming language was acceptable.

Needless to say, when one doesn't really evolve with that kind of business, he at least partially loses contact with the things that matter for the current state of the industry. So my "native languages" still include things like BASIC, Pascal, 8080 and 6502/6510 machine code, MS-DOS batch files, some Unix shell scripts, some JavaScript, and that's it. I have only programmed in C for 30 minutes in my life – it was exactly 20 years ago, when I converted my Pascal program for Penrose tilings to C with someone else's help – and I've largely forgotten it. Yet C is what programmers still use for most "general" programs.

Fortunately, I didn't need C, Java, etc. to do things like that. However, I have learned some Python, so I know how to deal with variables, local variables, imported functions, mathematical functions, two-dimensional arrays, for-loops, if statements, and a few other constructs.
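Those few constructs really do cover this kind of data crunching. A toy illustration in Python – the "events", the features, and the cut value below are all made up for the example:

```python
import math  # imported mathematical functions

# A two-dimensional array (a list of lists): toy "events" with two features each.
events = [[1.2, 0.5], [3.4, 2.1], [0.7, 4.4]]

# A for-loop with an if statement: classify each event by an arbitrary
# cut on the combined quantity sqrt(x^2 + y^2).
labels = []
for x, y in events:
    if math.sqrt(x * x + y * y) > 2.0:
        labels.append("s")   # signal-like
    else:
        labels.append("b")   # background-like

print(labels)  # → ['b', 's', 's']
```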

Whenever I needed to calculate or visualize something in the past decade, I would choose Mathematica, which Stephen Wolfram kindly donated to me, and it works very well – and easily. Still, some of the specialized software, like the software for machine learning, only works well under Linux. It's my impression that almost all the people doing these things – machine learning, huge data manipulation, pattern recognition, use whatever name you like for almost the same activity – work in Linux.

Finally, I decided to resuscitate a virtual box with a Linux OS. It was very painful. In particular, the installation of the packages kept freezing, crashing, or failing for about five independent reasons – especially the installation of SciPy, which forced me to install TeX even though the Czech repository didn't really have it, or something like that. I fixed some of the problems by switching to a U.S. repository. Of course, for an hour or so during this angry period, I was drawing up plans to legally exterminate everyone who has ever been connected with this Linux crap.

More scientifically: I roughly know how these "machine learning" packages work – I think – but I wanted to create my own. It would be a huge project. In the end, it would consume a gigantic amount of time, and if I were doing it in Mathematica, it couldn't possibly achieve the required speed, I am afraid.

When you try to use the existing packages and improve them with some "ingenious" ideas, you learn about some of the differences between theory and experiment. A theorist may fall in love with an idea and become convinced that it brutally improves the Higgs contest score. In the end, you may find that good enough existing algorithms don't really care about certain kinds of improvements, and they yield no improvement at all. I don't want to be too specific because the other contestants could be helped too much.

Even understanding how the score is calculated requires some deep enough knowledge of statistics. Naively, you could think that the contestants are rated according to the number of wrong answers ("signal" instead of "background" and vice versa) that they submit. However, the actual formula is more complicated, and this more complicated formula has certain nice properties. It took me some time to realize that the weights of the "b" events are really about 1,000 times higher than the weights of the "s" events – both in the training file and probably also in the secret weights for the test file. It follows that \(s\ll b\), and in this regime one may pretty much use the approximation for the AMS, \(s/\sqrt{b}\).
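A small numerical check of this claim – assuming the AMS formula published for the HiggsML challenge, with the regularization term \(b_{reg}=10\) added to the background:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance, assuming the published HiggsML
    formula: sqrt(2*((s + b + b_reg) * ln(1 + s/(b + b_reg)) - s))."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# In the regime s << b, expanding the logarithm shows the metric is
# close to the naive s/sqrt(b).
s, b = 500.0, 50000.0
print("exact AMS:   %.4f" % ams(s, b))
print("s/sqrt(b):   %.4f" % (s / math.sqrt(b)))
```

Expanding \(\ln(1+s/b)\approx s/b - s^2/2b^2\) inside the formula gives \(\mathrm{AMS}\approx\sqrt{s^2/b}=s/\sqrt{b}\), which is why the two printed numbers agree to a few parts per thousand.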

I have also learned to be more certain about the meaning of various quantities, such as the pseudorapidity, and about their properties and relationships. It still seems plausible to me that the contest will be won by someone who doesn't actually know any of this physics.

If you believe that you won't be devastated by the hurdles, you are encouraged to register and play with the algorithms yourself. At the Kaggle website, you just need to log in with your Google or Facebook account (there are other options, too).

1 comment:

Haha. What machine learning algorithm are you using, if that is not a secret? I have some experience with these algorithms since I use them to do things like this contest: http://www.kaggle.com/c/decoding-the-human-brain

I can tell you at least my own experience with these algorithms. First of all, there is no single best algorithm – that follows from the informal "no free lunch theorem". For each particular dataset, different algorithms may perform differently. Nevertheless, random forests are widely considered the best current algorithm: http://kawahara.ca/the-best-machine-learning-algorithm-for-classification-and-regression/ The random forest is the algorithm I have the most experience with. Second best would probably be deep neural networks: http://www.idsia.ch/~juergen/vision.html

You need no Linux to do machine learning. I am using MATLAB under Windows. Other good options are Python or R (a statistical package). For MATLAB there is a good implementation of random forests which is compiled from C and is very fast: https://code.google.com/p/randomforest-matlab/ Most data scientists, engineers, and AI experts use MATLAB. Mathematica is predominantly used by physicists and mathematicians. I do not know why it is so. I have no experience with Mathematica, but MATLAB can do everything I ever wanted. It has a very dedicated community of users, and you can find a lot of code online (unlike Mathematica).

If I competed in this contest, how would I do it? I would use maybe 5-fold cross-validation on the training data set. I would use a random forest. I would use the Gini index (the variable importance) of the random forest to identify the features that are the most discriminative. After that, I would tweak the parameters of the random forest to increase the weight of the more discriminative features.
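The recipe above can be sketched in a few lines of scikit-learn (Python rather than the commenter's MATLAB, and with synthetic data standing in for the actual challenge files):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the challenge's training set.
X, y = make_classification(n_samples=2000, n_features=30,
                           n_informative=10, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation, as the comment suggests.
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Gini-based variable importances point at the most discriminative features.
clf.fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:5]
print("Most discriminative features:", top)
```

From there, one would retrain on the full training file and tune the forest's parameters (number of trees, features per split, tree depth) around the features the importances single out.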

You programmed a machine learning algorithm from scratch? You are very good indeed.