Thursday, January 28, 2016

Recently, I participated in a Kaggle contest sponsored by Winton Capital. The contest provided various market-related data and asked participants to predict intraday returns and returns over the next two days on unseen future data. Prizes included $50,000 in monetary rewards and a chance to join Winton Capital's team of research scientists. I think these kinds of contests are a great way for modern data companies to scout hidden talent; in return, they offer great real-world problems for both aspiring and seasoned scientists to explore and validate ideas and methodologies with like-minded individuals. I decided to write a post on my approach, so that readers can repeat the methods and get some insight into the thought process.

The bad news is that I didn't fare very well in the contest (edit: looks like I made the top 25%); however, a lot of interesting issues surfaced as the contest progressed. I joined the contest fairly late and read some posts about how the data was altered midstream. To understand this better, it might help to describe the data in more detail. The original dataset comprised two subsets: a train dataset with 40,000 observations and 211 features, and a test dataset with 40,000 observations and 147 features. The features could be broken into an ID column, a mix of 25 unlabeled continuous and discrete features, and 183 ordered time-series returns. The time-series returns were further broken down into -D2, -D1, 179 intraday 1-minute returns, +D1, and +D2: the -D returns were the prior (overnight) daily returns, and the +D returns the subsequent daily returns. The goal was to predict the intraday and subsequent two-day returns beyond the original time-series data points. So we were given roughly 2/3 of the prior returns and needed to predict 1/3 of the out-of-sample returns. Contestants submitted predictions on the test data and received feedback on a reserved portion of the private dataset, displayed as public leaderboard results. Typically, contestants can use the public leaderboard as a gauge of how well their models are performing on hidden, out-of-sample data. The overall final performance on the private leaderboard is withheld until the contest ends.

Many posters quickly figured out that the task was rather daunting and suspected that a zero baseline (just predict all zeros) or the median of the time-series observations would be pretty difficult to beat. However, what changed in the contest was the possibility that some posters had gamed the data, reverse engineering it to obtain very good scores on the public leaderboard. In response, the contest administrators added more test data to the original private test dataset, upping it from 40,000 unseen observations to 120,000. This was done to deter any gaming of the system. It also made cross validation much more difficult to gauge, as we were only given leaderboard feedback on 20,000 of the observations. So we had to find the most robust model given only 25% of the data for training, make it robust to 75% of new hidden data, and do so with performance feedback on only 16.7% of the hidden data displayed on the public leaderboard. Additionally, some posters suspected that the structure of the new dataset had been distorted from its original characteristics to avoid any gaming. I'm hoping to get more feedback from the contest and providers in order to understand the data better.

My own approach was to use a gradient boosted model from the R gbm package. GBMs have historically performed well in Kaggle competitions, have been shown to be among the best performing models on a variety of datasets (see Kuhn, Applied Predictive Modeling), and are known to be a good choice for a robust out-of-the-box model. Among the benefits of GBMs are the ability to deal with NAs and to handle unscaled data, so we don't have to worry much about pre-processing. As I didn't have much time to explore other options, I (perhaps wrongly) spent most of my time on improving the gbm cross-validated performance. Like others, I found that my public leaderboard score was consistently at the high range of my CV performance results. Since there was so much discussion about data manipulation, I (again, likely wrongly) stubbornly believed in the CV results and assumed that the leaderboard set might have been biased towards the worse performing data. In retrospect, I should perhaps have spent more time on extremely simple models (like means and medians) rather than fitting a relatively small set of training data with a complex model. I also found many odd data characteristics: intraday data columns that were perfectly trending up in the training set and down in the test set; some data with very little noise, other data with lots of noise. My intuition was that not all of the time-series data came from true financial time series, and that the distributions were not stable from training to test sets. This made the problem more difficult, in my opinion, than dealing with real data and having knowledge of the underlying data. I attempted to filter for differences between training and test samples by using t-tests for comparisons and setting a threshold for removing unusually different time-series data columns.
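The filtering step can be sketched in a few lines. My contest code was in R, but here is a minimal Python version using Welch's t-test; the alpha threshold, function name, and toy data are illustrative assumptions, not the contest's actual code:

```python
# Keep only feature columns whose train/test distributions look stable
# under Welch's t-test; the alpha cutoff here is just an illustrative choice.
import numpy as np
from scipy import stats

def stable_columns(train, test, alpha=0.01):
    """Return indices of columns whose train/test means are not
    significantly different under Welch's t-test."""
    keep = []
    for j in range(train.shape[1]):
        t, p = stats.ttest_ind(train[:, j], test[:, j], equal_var=False)
        if p >= alpha:          # fail to reject "same mean" -> keep column
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 3))
test = rng.normal(0.0, 1.0, size=(500, 3))
test[:, 2] += 2.0               # make column 2 deliberately unstable

print(stable_columns(train, test))  # column 2 should be filtered out
```

In the contest I applied this idea to the time-series return columns, dropping the ones that differed most between samples.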

Some posters observed that Feature_7 was unstable between train and test samples. Unfortunately, that was also the feature highlighted as most important in both the D1 and D2 predictions from my GBM model. I left it in, as I didn't have enough time to change strategies at that point and figured that I still had reasonable CV estimates that would beat the zero and median estimates on average. I ended up just using median estimates of the training features for the out-of-sample 1-minute predictions, and used a GBM to predict the next two out-of-sample days. We were also given a vector of loss weights for the training data and were instructed to use weighted mean absolute error as our loss function. R's gbm package lets us set distribution = "laplace" to use an absolute loss. I also wrote out the loss function to record it manually in the cross-validation loops and get a better feel for the sensitivity of the validation folds. A loop was created to record performance on 5 validation folds across several tree lengths. The optimal value was then used to rebuild final training models on all the training data, which could then be used to predict values for the test fold.
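For reference, the weighted mean absolute error I recorded manually can be written out in a few lines. This Python sketch uses one common definition (the weighted sum of absolute errors divided by the sum of weights); the variable names are illustrative:

```python
# Weighted mean absolute error: sum(w * |y - yhat|) / sum(w)
import numpy as np

def wmae(y_true, y_pred, weights):
    """Weighted mean absolute error of predictions y_pred against y_true."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

print(wmae([1.0, 2.0, 3.0], [1.0, 1.0, 5.0], [1.0, 2.0, 1.0]))  # 1.0
```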

Fig. 3 below shows cross-validated results from a simple GBM model using 5-fold validation, with shrinkage = 0.1, n.minobsinnode = 50, and interaction.depth = 1. The red line shows the median value for the zero baseline as a comparison to the WMAE validation folds across different tree iterations. Notice that selecting, say, n.trees = 900 yields a range of results that were all superior to the baseline median. I would expect somewhere around 1765 for this model; however, as explained earlier, performance was difficult to match on the private test set because the feature distributions were different. I used the run below to illustrate, but the actual model I built had a lower shrinkage of 0.01 and much better performance, though it took a long time to run, as I had to use about 8,000 trees.

In conclusion, I was able to use R to build a GBM model with good cross-validated performance against the zero baseline metric, but the model did not perform very well on the hidden private data. I believe two big factors were differences between the train and test datasets, as well as a much larger hold-out dataset that was different enough that the true results were much worse than 5-fold CV would predict. It's equally possible that the model was too complex for the very noisy dataset.

Some posters wondered why WMAE was used as opposed to RMSE. In financial time series, MAE is often used as a loss function because it is more robust to outliers. You can imagine that large outliers will not only skew the data and the fit, but their contributions will effectively be squared under RMSE.
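A toy example makes the point concrete; the error values below are made up for illustration:

```python
# A single large error contributes linearly to MAE but quadratically
# (before the square root) to RMSE, so RMSE is pulled much harder by outliers.
import numpy as np

errors = np.array([0.1, -0.2, 0.1, 0.0, 10.0])  # one large outlier

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(round(mae, 3))   # 2.08
print(round(rmse, 3))  # 4.473
```

Without the outlier, the two metrics would be nearly identical; with it, RMSE is more than double the MAE.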

Lastly, this was a great exercise in machine learning and real-world modeling, although, in my experience, targeting continuous variables with a weighted mean absolute error loss function may not be the best objective. What it illustrates well is the practical difficulty of predicting financial time series. Thanks to Winton and Kaggle for a great experience!

#####################################################
Code is below. Note: I used Pretty R to embed the code, but there are color clashes with the background. You can copy and paste into a local browser to view it more clearly. If anyone has suggestions on how to change the Pretty R font colors, please let me know.

Friday, April 3, 2015

I just wanted to briefly share some initial impressions of the 2nd edition of Stephen Marsland's very hands-on text, "Machine Learning: An Algorithmic Perspective." Having been a big fan of the first, I requested a review copy from Dr. Marsland, and with his help, the publishers were kind enough to send me one. I spent the last few months going over most of the newer topics and testing many of the newer scripts in Python. With that, I'll dive into some of my impressions of the text.

I've stated before that I thought the 1st edition was, hands down, one of the best texts covering applied machine learning from a Python perspective. I still consider this to be the case. The text, already extremely broad in scope, has been expanded to cover some very relevant modern topics, including:

Particle Filtering (expanded coverage with working implementation in Python).

Deep Belief Networks

Gaussian Processes

Support Vector Machines. Now includes working implementation with cvxopt optimization wrapper.

Those topics alone should generate a significant amount of interest from readers. Several things separate this text's approach from many other texts covering machine learning. One is that it covers a very wide range of useful topics and algorithms. You rarely find a machine learning text with coverage of areas like evolutionary learning (genetic programming) or sampling methods (SIR, Metropolis-Hastings, etc.). This is one reason I recommend the text highly to students of MOOC courses like Andrew Ng's excellent 'Machine Learning' or Hastie and Tibshirani's 'An Introduction to Statistical Learning'. Many of these students are looking to expand their set of machine learning skills, with a desire to access working, concrete code that they can build and run.

While the book does not overly focus on mathematical proofs and derivations, there is sufficient mathematical coverage to enable the student to follow along and understand the topics. Some knowledge of linear algebra and its notation is always useful in any machine learning course. Also, the text is written in such a way that if you simply want to cover a topic, such as particle filtering, you don't necessarily need to read all of the prior chapters to follow. This is useful for those readers looking to refresh their knowledge of more modern topics.

I did occasionally have to translate some of the Python code to work with Python 3.4; however, the editing was very minimal. For example, print statements in earlier versions did not require parentheses around their arguments, so you can simply change print 'this' to print('this').

I found the coverage of particle filters and sampling highly relevant to financial time series, as we have seen that such distributions often require models that depart from normality assumptions. I may add a tutorial on this sometime in the future.

In summary, I highly recommend this text to anyone who wants to learn machine learning and finds that the best way to augment learning is having access to working, concrete code examples. In addition, I particularly recommend it to those students who have come from more of a statistical learning perspective (Ng, Hastie, Tibshirani) and are looking to broaden their knowledge of applications. The updated text is very timely, covering topics that are very popular right now and have little coverage in existing texts in this area.

*Anyone wishing to have a deeper look into topics covered and code, can find additional information and code on the author's website.

Monday, April 29, 2013

A colleague and I were recently discussing ways to get intuition about the IBS classification method for reversion systems. I thought I'd share a violin plot I generated that might help to get some visual intuition about it. We can download and process next day returns for an asset like SPY and group classes into LOW (IBS < 0.2), HIGH(IBS > 0.8), and MID(all others). One thing you can see in the plots is the pronounced right skew in the LOW class, and left skew in the HIGH class (they sort of resemble opposing stingrays -- stingray plots might be a more apt term for the reversion phenomena); while the MID class tends to be more symmetrical. The nice thing about the vioplot visualization is that it includes the density shape of the return distribution, which adds intuition over more common box and whisker plots.
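For readers who want to reproduce the grouping, here is a minimal sketch (in Python rather than R, with made-up price bars rather than real SPY data) using the standard IBS definition, IBS = (close - low) / (high - low):

```python
# Classify a daily bar into the LOW / MID / HIGH IBS buckets described above.
def ibs_class(high, low, close):
    """IBS = (close - low) / (high - low); LOW < 0.2, HIGH > 0.8, else MID."""
    ibs = (close - low) / (high - low)
    if ibs < 0.2:
        return "LOW"
    if ibs > 0.8:
        return "HIGH"
    return "MID"

print(ibs_class(high=102.0, low=100.0, close=100.2))  # LOW  (IBS = 0.1)
print(ibs_class(high=102.0, low=100.0, close=101.9))  # HIGH (IBS = 0.95)
print(ibs_class(high=102.0, low=100.0, close=101.0))  # MID  (IBS = 0.5)
```

Grouping each day's next-day return by this label and feeding the three groups to a violin plot reproduces the figure.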

About Me

I've been trading full time for over 10 years and wish to share some of the knowledge I've acquired with others.
I have a particular interest in machine learning and how the current research in this area holds much unexplored potential towards the area of systematic trading development.
Although I've studied many different texts on machine learning, I've often found sparse practical examples related to trading. My goal is to share some concrete examples for the layman to be able to build and replicate.
intelligenttradingtech@yahoo.com