Anil: I am a Technical Leader at Cisco Systems, where I work on building multimedia server software. I was introduced to machine learning when I participated in the Netflix Prize competition. Other than the Netflix Prize, where I was able to eke out an improvement of 7% in recommendation accuracy, I have no significant data mining experience to speak of.

Chris: I have an MS in electrical engineering, but I have no formal background in machine learning. My first data-mining contest was the Netflix Prize, and I learned a tremendous amount by being part of the team that came in 2nd place. Since then, I’ve been hooked on these competitions, and have entered several Kaggle contests in my spare time. During the day, though, I work on Voice-over-IP projects at AT&T Labs, where I’m a systems engineer.

Will: I studied physics at Cornell and am in the final stages of a PhD in biomedical engineering at Rutgers. Like Chris and Anil, my first data mining contest was the Netflix Prize, where I placed somewhere around 30,000th (the leaderboard doesn’t go past 1000, but what’s a few thousand places among friends?). Several years and many competitions later, I am lucky to rub elbows with the clever minds and talented folks on Kaggle.

Will and Chris formed a team from the start, while Anil climbed the leaderboard separately. In the closing days of the competition, the two teams agreed to merge to improve their chances at the top (and only) prize. Following a long weekend of furious model blending, they ended up in 4th place. All three participants wish to thank the Capital Markets Cooperative Research Centre and Kaggle.com for hosting this competition.

What made you decide to enter?

Anil: I had just completed the excellent online course on machine learning taught by Prof. Andrew Ng of Stanford and was looking for a challenge that goes beyond routine homework problems. This competition was a perfect fit. I have always found stock market data to be intriguing and this looked like a good opportunity to try my hand at analytics.

Will: This was my first contest with financial data and a nice opportunity to peek into the world of short-term market dynamics, spreads, order books, etc.

Chris: I have always been interested in “quant” finance topics. Also, I had some success in the INFORMS 2010 contest on Kaggle, which involved predicting short-term price movements of securities. I thought some of the lessons I learned in that contest might be helpful in this one.

What preprocessing and supervised learning methods did you use?

Will: I had initial success with a kNN model and spent the majority of the competition convinced I could improve this model. My initial feature set was picked by hand using feedback from the probe set (the last 50k trades of the train set). Most of the features were basic transformations of the spread just before the liquidity shock. Querying for neighbors within each security generally outperformed querying across all securities, but we did find that the combination of the two worked best. I spent many weeks attempting to implement a more rigorous way to pick the feature space. There are many published methods on how to learn a custom Mahalanobis distance metric using supervised labels (that is, to find S in the equation below such that trades with similar reactions would have similar distances in feature space).

(Note that a diagonal S is the same thing as weighting each feature separately in Euclidean space)
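
The note about a diagonal S can be checked numerically. The sketch below assumes the common textbook form of the Mahalanobis distance, d(x, y) = sqrt((x - y)^T S^-1 (x - y)); whether the matrix enters as S or as its inverse, a diagonal matrix reduces to one weight per feature. The function names and the sample weights are hypothetical and are not taken from Will's Matlab code.

```python
import math

def mahalanobis(x, y, s_inv):
    """Return sqrt((x - y)^T S^-1 (x - y)), with S^-1 given as nested lists."""
    d = [a - b for a, b in zip(x, y)]
    total = 0.0
    for i, di in enumerate(d):
        for j, dj in enumerate(d):
            total += di * s_inv[i][j] * dj
    return math.sqrt(total)

def weighted_euclidean(x, y, weights):
    """Euclidean distance with a separate weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Hypothetical feature vectors and per-feature weights (illustration only).
x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 3.5]
weights = [0.5, 2.0, 4.0]

# Diagonal S^-1 built from the same weights.
s_inv = [[weights[i] if i == j else 0.0 for j in range(3)] for i in range(3)]

d_mahal = mahalanobis(x, y, s_inv)
d_eucl = weighted_euclidean(x, y, weights)
```

With a diagonal matrix the two distances agree, so learning a diagonal metric amounts to learning feature weights; off-diagonal entries are what allow interactions between features.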

However, this contest was not a traditional supervised classification problem in the sense that we had a measure of dissimilarity (the RMSE between the bid/ask responses), as opposed to neat-and-tidy class labels. Despite numerous promising modifications and a last-minute multidimensional scaling idea, I ran out of time to find a suitable Mahalanobis matrix that beat the RMSE of the initial hand-picked feature set.

Chris: I tried to keep it simple, and stuck with creating multiple variations on basic linear regression models. I created more than 30 features derived from the original data (consisting mostly of the min/max/median/standard deviations of prices & spreads & the trade/quote arrival rate). I fed those features plus the original data to a LASSO regression, which selected 18 variables. Separate regressions were used for bids vs asks, and buys vs sells. I also had models that predicted prices for each time period individually, as well as other models that predicted time-invariant, average prices. Furthermore, the characteristics of the testing & training sets differed, so I tried a variety of ways to weight each row of data to correct for those differences. In the end, I just weighted each row by the ratio of how often each security appeared in the testing vs training sets.
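
The row-weighting idea in the last sentence can be sketched as follows. This is one plausible reading, not Chris's actual R code: it assumes the ratio is taken between each security's share of test rows and its share of training rows, and the function name and sample identifiers are hypothetical.

```python
from collections import Counter

def security_row_weights(train_ids, test_ids):
    """Weight each training row by (test share) / (train share) of its security."""
    train_counts = Counter(train_ids)
    test_counts = Counter(test_ids)
    n_train, n_test = len(train_ids), len(test_ids)
    weights = []
    for sec in train_ids:
        train_share = train_counts[sec] / n_train
        test_share = test_counts[sec] / n_test  # 0 if absent from the test set
        weights.append(test_share / train_share)
    return weights

# Hypothetical security identifiers, one per row.
train_ids = ["A", "A", "A", "B"]
test_ids = ["A", "B", "B", "B"]

weights = security_row_weights(train_ids, test_ids)
# Security A is over-represented in training, so its rows are down-weighted;
# security B is under-represented, so its single row is up-weighted.
```

Weights of this kind would then be passed to the regression as per-observation weights, so that the fitted model reflects the mix of securities expected at test time.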

Anil: The model that worked best for me was linear regression on various indirect predictors derived from the training data. I also tried Random Forest, k-NN, k-means and SVM regression techniques. As for preprocessing, I found it advantageous to set the base predictions to match the general trajectory of the prices and then model the residuals.
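
A minimal sketch of the "base prediction plus residuals" idea, under stated assumptions: the base prediction is taken to be the average price path across training trades, and the residuals are modelled with a one-feature least-squares fit through the origin at each time step. Anil's C++ model is not described in that detail, so every name and number below is hypothetical.

```python
def mean_trajectory(paths):
    """Average price change at each time step across all training trades."""
    n = len(paths)
    return [sum(p[t] for p in paths) / n for t in range(len(paths[0]))]

def residuals(paths, baseline):
    """What is left of each path after subtracting the base prediction."""
    return [[v - b for v, b in zip(p, baseline)] for p in paths]

def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical data: price changes after the trade for three trades,
# plus one centred feature per trade.
paths = [[0.10, 0.12, 0.14],
         [0.20, 0.22, 0.24],
         [0.30, 0.32, 0.34]]
feature = [-1.0, 0.0, 1.0]

baseline = mean_trajectory(paths)
resid = residuals(paths, baseline)

# One slope per time step, fitted on the residuals only.
slopes = [fit_slope(feature, [r[t] for r in resid]) for t in range(len(baseline))]

def predict(x):
    """Base trajectory plus the modelled residual for feature value x."""
    return [b + s * x for b, s in zip(baseline, slopes)]

prediction = predict(0.5)
```

Separating the two stages lets the regression spend its capacity on the deviations from the typical path rather than on rediscovering the path itself.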

What was your most important insight into the data?

Will: This data was very fussy, and the use of un-normalized RMSE to score the competition made for a very skewed error distribution. Chris and Anil did some great due diligence into the quirks of this dataset, so I defer to them for details.

Chris: The market open (at 8AM) was extremely unpredictable, and contributed a disproportionately large amount of error. For one model, I found that 12% of the squared error for the entire trading day occurred in the first minute of trading. To combat this, I trained some separate models for the market open, since it seemed so different (the naive benchmark model worked better than my regressions at the market open, for example).

Additionally, the farther away you got from the “liquidity shock” trade, the more unpredictable the prices were. Looking backward in time from that “liquidity shock” trade, my variable-selection algorithms dropped all historical bid/ask prices except those immediately before the trade, since those prices did not provide enough predictive value. As you moved forward in time from that trade, bid/ask prices got progressively harder to predict. Using time-averages & PCAs, though, you could see two common patterns in the noise: for buys, bid/ask prices jumped up sharply & then rose slowly; for sells, bid/ask prices jumped down sharply, and then fell slowly. Thus the “liquidity shock” trades seemed to have a permanent impact on prices, rather than a temporary, mean-reverting one.
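
The kind of check behind the market-open figure can be sketched like this. It assumes each prediction carries a timestamp in seconds since midnight and that the open is at 8AM (28,800 seconds); the record layout, function name, and sample values are hypothetical and do not reproduce the 12% result.

```python
def squared_error_share(records, open_seconds=8 * 3600, window=60):
    """Fraction of total squared error falling in [open, open + window)."""
    total = sum((pred - actual) ** 2 for _, pred, actual in records)
    in_window = sum(
        (pred - actual) ** 2
        for t, pred, actual in records
        if open_seconds <= t < open_seconds + window
    )
    return in_window / total

# Hypothetical (seconds_since_midnight, predicted, actual) records.
records = [
    (28810, 12.0, 10.0),  # ten seconds after the open, squared error 4
    (30000, 11.0, 10.0),  # later in the morning, squared error 1
    (40000, 10.0, 11.0),  # midday, squared error 1
]

share = squared_error_share(records)
```

Because the score is an un-normalized RMSE, a small number of rows with large errors can dominate the total, which is why isolating a window like this is informative.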

Anil: Categorizing and plotting the data clearly showed that the bid and ask prices followed separate paths and their trajectories differed depending on who initiated the trade - buyer or seller. Performing regression separately for each category led to dramatic improvement in prediction accuracy.

Were you surprised by any of your insights?

Will: I’m unconvinced that I had noteworthy insights outside of the usual techniques to gain ground in a data mining competition. RMSE falls by 3 methods: creating many models and blending them, better data/features, or better methods. Chris and Anil each brought a nice blender and several models to the table. I did what I could to make a better kNN method, but perhaps my time would have been better spent coming up with features or looking for outliers.

Anil: The conventional wisdom seems to suggest that an ensemble of reasonably good models performs better than a finely tuned individual model. As a team, we had a great variety of models, but looking back, I think we would have fared better if we had spent more effort tuning the individual models. The models that used SVM and k-means with mediocre prediction accuracy ended up contributing almost nothing to the final blended result.

Chris: I knew the market open & outliers would be important, but I was really surprised by how much of an impact they had on one’s RMSE. I was also surprised to find no mean-reversion in the bid/ask prices, since that differed from the examples that the contest organizers gave.

Which tools did you use?

Anil: I am a minimalist and typically use little more than gVim, gcc and gnuplot. For this competition, I picked up some R and was impressed by its capabilities. One could just grab an off-the-shelf package, let it loose on the data and end up with decent results. I still think lower level languages have a place in data analytics because of the flexibility that they offer. Knowing what happens under the hood can give you an edge. Sometimes small tweaks to the underlying mechanism can give you a big boost when you desperately need it. My best model was written entirely in C++ without using any 3rd party libraries. This made it possible to mold the model well enough to fit the quirks within the data.

Chris: I used R, mostly working with the glmnet package. I also used Python (with the numpy package) for blending prediction sets together.
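
As an illustration of blending prediction sets, here is a minimal sketch that picks the weight for a two-model linear blend by least squares on a held-out probe set. The team's actual blending procedure is not described in the interview, and this version uses plain Python rather than numpy; all names and data are hypothetical.

```python
import math

def blend_weight(pred_a, pred_b, actual):
    """Weight w minimising the squared error of w * a + (1 - w) * b."""
    num = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, actual))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    return num / den

def blend(pred_a, pred_b, w):
    """Linear blend of two prediction sets."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

def rmse(pred, actual):
    """Root mean squared error."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred, actual)) / len(actual))

# Hypothetical probe-set targets and two models with opposite biases.
actual = [1.0, 2.0, 3.0]
pred_a = [1.2, 2.2, 3.2]
pred_b = [0.8, 1.8, 2.8]

w = blend_weight(pred_a, pred_b, actual)
blended = blend(pred_a, pred_b, w)
```

In this toy case the two models err in opposite directions, so the blend beats either one alone; models with highly correlated errors gain far less from blending.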

Will: Matlab

What have you taken away from this competition?

Will: Collaborating with new teammates was a nice experience. Teammates bring different backgrounds, fresh ideas, and code in different ways. Working alone, it is easy to get stuck trying the same hammer on every nail, even if that nail happens to be a screw. That’s when a teammate can step in and tell you to stop smashing with the hammer and try a screwdriver. Witnessing another person dissect the same data problem is a great way to pick up new tools and skills.

Chris: One lesson I learned from this competition is that one should always identify outliers as specifically as possible & decide how to best deal with them. Also, it’s really helpful to have teammates to bounce ideas off of, especially when you’re stuck or losing motivation.

Anil: The discussions on the forum, especially post-contest, have been illuminating. I don't think any contestant individually had quite a handle on the data. The details that contestants have shared about their findings and methods have shed light on various aspects of the data. One can actually see the pieces coming together in the giant jigsaw puzzle. I am also thankful that I got to collaborate with Chris and Will, who are first-rate data scientists and fantastic people to work with.