How to enter a data contest - machine learning for newbies like me

I've not had much experience with machine learning; most of my work has been a struggle just to get data sets that are large enough to be interesting! That's a big reason why I turned to the Kaggle community when I needed a good prediction algorithm for my current project. I wasn't completely off the hook, though: I still needed to create an example of our current approach, limited as it is, to serve as a benchmark for the teams. While I was at it, it seemed worthwhile to open up the code too, so I've created a new Github project:

It actually produces very poor results, but it does demonstrate the basics of how to pull in the data and apply one of scikit-learn's great collection of algorithms. If you get the itch, there's lots of room for improvement, and the contest has another two weeks to run!

Installing scikit-learn

Before you can run the Python scripts, you'll need to install the scikit-learn machine-learning framework. Here are the instructions.

It's also worth checking out the tutorial and their other guides; they've written some great documentation.

Getting the code

To pull the latest copy of this code and enter the directory run these commands:

git clone git://github.com/petewarden/MLloWorld.git

cd MLloWorld/

Creating a model

Before you can predict unknown values, you need to train the algorithm with example data. I've packaged a set of 40,000 items as a CSV file, with each column representing an attribute of the original photo albums. You'll need to run these through the training script to build a model that can be used for prediction. Here's the command:

python train.py training_data.csv storedmodel

That may take ten or twenty minutes to run, but at the end you should have a file called storedmodel in the current directory.
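If you're curious what a training script like this does under the hood, here's a minimal sketch of the pattern: read labeled rows from a CSV, fit a scikit-learn classifier, and pickle the result to disk. The column layout (label in the last column) and the classifier choice (SGDClassifier) are illustrative assumptions, not necessarily what the actual train.py uses.

```python
import csv
import pickle
import sys

from sklearn.linear_model import SGDClassifier


def train(csv_path, model_path):
    features, labels = [], []
    with open(csv_path) as f:
        for row in csv.reader(f):
            # Assumed layout: numeric features first, label in the last column.
            features.append([float(value) for value in row[:-1]])
            labels.append(row[-1])
    # SGDClassifier is a placeholder choice; any scikit-learn estimator
    # with fit()/predict() slots in here.
    clf = SGDClassifier()
    clf.fit(features, labels)
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)


if __name__ == "__main__" and len(sys.argv) == 3:
    train(sys.argv[1], sys.argv[2])
```

The pickled file is the "storedmodel" artifact: everything the prediction step needs, saved in one place.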

Predicting results

Now that you have a model built, you can run the test set of data through it and predict a value for each item:

python predict.py test_data.csv storedmodel > results.csv

This will also take a few minutes, but at the end you'll have a CSV file containing a list of the album ids and a prediction for each one. It's in the right format to submit to Kaggle, and if you look for the 'Full scikit-learn example' in the benchmarks at the bottom of the leaderboard, you'll see how this simple approach scored.

As you can see, it's not that great! If you modify the code and think you've improved its predictions, you can create a team and submit your new results to find out how well you've done. There's already stiff competition from the current teams of course!
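The prediction side is the mirror image of training: load the pickled model, score each test row, and write id/prediction pairs as CSV. Again, a sketch under assumptions (album id in the first column, features after it), not the repo's exact code:

```python
import csv
import pickle
import sys


def predict(csv_path, model_path, out):
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    writer = csv.writer(out)
    with open(csv_path) as f:
        for row in csv.reader(f):
            # Assumed layout: album id first, numeric features after it.
            album_id = row[0]
            features = [float(value) for value in row[1:]]
            writer.writerow([album_id, clf.predict([features])[0]])


if __name__ == "__main__" and len(sys.argv) == 3:
    # Matches the command above: results go to stdout for redirection.
    predict(sys.argv[1], sys.argv[2], sys.stdout)
```

Writing to stdout is what makes the `> results.csv` redirection in the command above work.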

The trickiest part for me was getting the data into a format that scikit-learn's functions could understand. Because the CSV stores which words occurred for an album, the full row vector for each of them could be thousands of entries long, most of them zero. To speed up the training and save on memory, I used scipy's sparse matrix class, coo_matrix, to store the results. You can see the sort of unpacking I do in the expand_to_vectors() function in mlloutils.py.
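Here's a toy illustration of that idea (not the actual expand_to_vectors() code): since each row only lists the word ids that appeared for an album, we record just the nonzero coordinates instead of materializing a dense vector thousands of entries wide.

```python
from scipy.sparse import coo_matrix


def rows_to_sparse(word_ids_per_album, vocab_size):
    """Pack per-album word-occurrence lists into a sparse matrix.

    Each entry of word_ids_per_album is a list of word ids (column
    indices) that appeared for that album.
    """
    rows, cols, vals = [], [], []
    for album_index, word_ids in enumerate(word_ids_per_album):
        for word_id in word_ids:
            rows.append(album_index)  # which album (matrix row)
            cols.append(word_id)      # which word (matrix column)
            vals.append(1)            # occurrence flag
    # coo_matrix stores only these (row, col, value) triples, so memory
    # scales with the number of occurrences, not rows * vocab_size.
    return coo_matrix((vals, (rows, cols)),
                      shape=(len(word_ids_per_album), vocab_size))
```

scikit-learn's estimators accept sparse input directly, so a matrix built this way can go straight into fit() without ever expanding to a dense array.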

[Update - Big thanks to Olivier Grisel who vastly improved the results by fixing some errors in the CSV reader and picking a more accurate and much faster classifier. I've integrated his changes, and now see a score of 0.44, which still puts it at the bottom of the leaderboard but is at least respectable!]