Stephen Smith's Blog

Musings on Machine Learning…

Playing the Kaggle Two Sigma Challenge – Part 5



This post concludes my coverage of Kaggle’s Two Sigma Financial Modeling Challenge. I introduced the challenge here and then blogged on what I did in December, January and February. I had planned to write this article after the challenge was fully finished, but Kaggle is still reviewing the final entries and I’m not sure how long that will take. Perhaps the delay is caused by Google’s purchase of Kaggle, or perhaps it’s just the nature of judging a competition where code is submitted rather than data uploaded. I’ll go by what is known at this point, and if there are any further surprises I’ll post an update.

Public vs Private Leaderboards

During the competition you are given a dataset which you can download or run against in the Kaggle cloud. You could train on all of this data offline, or if you did a test run in the cloud it would train on half the data and test on the other half. When you submit an entry, it trains on the full dataset and then tests against a hidden dataset that competitors never see.

This hidden dataset was divided into two parts: ⅓ of the data was used to calculate our public score, which was revealed to us and determined our position on the leaderboard during the competition. The other ⅔ was reserved for the final evaluation at the end of the competition, called the private leaderboard score. The private leaderboard score was kept secret, so we could only make decisions based on the public leaderboard.

My Final Submissions

Your last step in the competition is to choose two submissions as your official entries. These are then judged on their private leaderboard scores (which you don’t know). If you don’t choose two entries, Kaggle selects your two highest-scoring entries for you. I just chose my two highest entries (based on the public leaderboard).

The Results Revealed

So after the competition closed on March 1, the private leaderboard was made public. This was quite a shock, since it didn’t resemble the public leaderboard at all. Kaggle hasn’t said much about the internal mechanisms of the competition, so some of the following is speculation, either on my part or from the public forums. It appears that the ⅓–⅔ split of the data wasn’t a random sample but a time-based split, and that market conditions were quite different in the final two-thirds than in the first third. This led to quite different scores. I dropped from 71st place to 1086th place (out of 2071).
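To illustrate why the split method matters, here is a small sketch (my own example, not Kaggle's actual mechanism) of the difference between a time-based and a random ⅓–⅔ split. With the time-based split the public third is contiguous in time, so a regime change after the cut makes the two leaderboards diverge; a random split would expose both leaderboards to the same market conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
timestamps = np.arange(900)  # hypothetical ordered timestamps

# Time-based split: first third public, final two-thirds private.
cut = len(timestamps) // 3
public_time, private_time = timestamps[:cut], timestamps[cut:]

# Random split: each timestamp has a 1/3 chance of landing in the public set.
shuffled = rng.permutation(timestamps)
public_rand = np.sort(shuffled[:cut])
private_rand = np.sort(shuffled[cut:])

# In the time-based split every public timestamp precedes every private one,
# so the private set can sit entirely in a different market regime.
print(public_time.max(), private_time.min())
```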

What Went Wrong

At first I didn’t know what had gone wrong. I had lots of theories but no way to test them. A few days later Kaggle revealed the private leaderboard scores for all our submissions, so I could get a better idea of what worked and what didn’t. The thing that really killed me was using Orthogonal Matching Pursuit in my ensemble of algorithms: any submission that included it had a much worse private leaderboard score than public score. Conversely, any submission that used Ridge regression did better on the private leaderboard than on the public one.
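One plausible reading of this, sketched below on synthetic data (not the competition's), is that Orthogonal Matching Pursuit commits hard to a small subset of features, while Ridge shrinks all coefficients smoothly; if the few features OMP latches onto stop being predictive in a new market regime, its out-of-sample error can degrade sharply.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit, Ridge

rng = np.random.default_rng(42)
n, p = 500, 20
X = rng.normal(size=(n, p))
# Only the first two features carry signal in this toy setup.
y = 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=n)

# OMP keeps exactly the requested number of nonzero coefficients.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(X, y)

# Ridge keeps all twenty coefficients, each shrunk toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.count_nonzero(omp.coef_), np.count_nonzero(ridge.coef_))
```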

Since I chose my two best entries, they were based on the same algorithm and got the same bad private leaderboard score. I should have chosen something quite different as my second entry; with luck, I would have gotten a much better overall score.

There is a tendency to blame overfitting when a solution does well on the public leaderboard but badly on the private one. But with only two submissions a day, and given the size of the data, I don’t think that was the case here. I think it’s more a matter of the public third not being a representative test set, especially given how big a shake-up there was.

What Worked

Before I knew the private leaderboard scores for all my submissions, I worried the problem was caused by either offline training or my use of reinforcement learning, but these turned out to be fine. So here is a list of what worked for me:

  • Training offline was fine. It produced good results and fit within the constraints of this Kaggle competition.
  • RANSAC did provide better results.
  • My reinforcement learning results gave equivalent improvements on both the public and private leaderboards.
  • Lasso and ElasticNet worked about the same for both leaderboards.
  • ExtraTreesRegressor worked about the same for both leaderboards.
  • Using current and old time series data worked about the same for both leaderboards.

My best private leaderboard submission was an ExtraTreesRegressor with added columns holding the previous timestamp’s data for a few select columns. Several of my ensembles also scored well, as long as they didn’t include Orthogonal Matching Pursuit.
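The lag-column idea can be sketched as follows. This is a minimal illustration on synthetic data: the column name `technical_20` is invented for the example (the real competition columns were anonymized), and the hyperparameters are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
# Toy panel: 10 ids, 50 timestamps each, one feature and one target.
df = pd.DataFrame({
    "id": np.repeat(np.arange(10), 50),
    "timestamp": np.tile(np.arange(50), 10),
    "technical_20": rng.normal(size=500),
    "y": rng.normal(size=500),
}).sort_values(["id", "timestamp"])

# Lag feature: each id's value of the selected column at the previous timestamp.
df["technical_20_prev"] = df.groupby("id")["technical_20"].shift(1)
df = df.dropna()  # each id loses its first row, which has no previous value

features = ["technical_20", "technical_20_prev"]
model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(df[features], df["y"])
preds = model.predict(df[features])
```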

How the Winners Won

Several people who scored in the top ten revealed their winning strategies. A key idea from those with stock market experience was to partition the data based on an estimate of market volatility and then use a different model for each volatility range: for instance, one model when things are calm (small deltas) and a different model when volatile. This was a common theme. One competitor who did quite well divided the data into three volatility ranges and used Ridge regression on each. Another added the volatility as a calculated column and then used offline gradient boosting to generate the model. The very top finishers have so far kept their solutions secret, probably waiting for the next stock market competition to come along.
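The regime idea could look something like the sketch below. The details (the volatility proxy, the tercile cut points, the Ridge alpha) are my assumptions for illustration, not the winners' actual code.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 900
X = rng.normal(size=(n, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.3, size=n)

# Crude volatility proxy: rolling standard deviation of the target
# over the last 20 samples.
vol = np.array([y[max(0, i - 20):i + 1].std() for i in range(n)])

# Split into three regimes by volatility tercile; fit one Ridge per regime.
edges = np.quantile(vol, [1 / 3, 2 / 3])
regime = np.digitize(vol, edges)  # 0 = calm, 1 = medium, 2 = volatile
models = {r: Ridge(alpha=1.0).fit(X[regime == r], y[regime == r])
          for r in (0, 1, 2)}

# At prediction time, route each row to the model for its regime.
preds = np.empty(n)
for r, m in models.items():
    preds[regime == r] = m.predict(X[regime == r])
```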

Advice for New Competitors

Here is some advice based on my experience so far:

  • Leverage the expertise on the public forum. Great code and ideas get posted there. Perhaps even wait a bit before starting, so you can build on what others have shared.
  • Research a bit of domain knowledge in the topic. Usually a bit of domain knowledge goes a long way.
  • Deal with outliers, either by removing or clipping them or using an algorithm like RANSAC to reduce their effect.
  • Don’t spend a lot of time fine-tuning parameters. The results will be different on the private leaderboard anyway, so you don’t really know whether the tuning is helping.
  • Know a range of algorithms. Usually an ensemble wins. Know gradient boosting, it seems to always be a contender.
  • Don’t get too wrapped up in the competition; there is a lot of luck involved. Remember this is usually one specific dataset, and there is a large amount of data you will be judged against that you never get to test against.
  • Select your two entries to win based on completely different algorithms so you have a better chance with the private leaderboard.
  • Only enter a challenge if you are actually interested in the topic. Don’t just enter for the sake of entering.
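The outlier advice above can be made concrete with a small sketch (synthetic data, my own example): either clip extreme target values before fitting, or use RANSAC so the outliers don't dominate the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # true slope is 2.0
y[:10] += 50  # inject a few gross outliers

# Option 1: clip the target to a percentile range before fitting.
lo, hi = np.percentile(y, [1, 99])
y_clipped = np.clip(y, lo, hi)

# Option 2: RANSAC repeatedly fits on random subsets and keeps the
# consensus of inliers, so the injected outliers are ignored.
ransac = RANSACRegressor(LinearRegression(), random_state=0).fit(X, y)
plain = LinearRegression().fit(X, y)

# RANSAC's slope stays close to the true 2.0; plain OLS is pulled off by
# the outliers.
print(ransac.estimator_.coef_[0], plain.coef_[0])
```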

Some Suggestions for Improvement

Based on my experience with this challenge, here are some frustrations that I think could be addressed by changes to the competition format.

  • Don’t make all the data anonymous. Include a one line description of each column. To me the competition would have been way better (and fairer) if we knew better what we were dealing with.
  • Given the processing restrictions of the competition, provide a decent dataset that isn’t full of missing values. It would have been better if reasonable values had been filled in or some other scheme used; given the size of the dataset and the limited compute, there wasn’t much competitors could do about them.
  • A more rectangular structure to the data would have helped processing within the limited resources and improved accuracy. For instance, stocks that enter the portfolio still have prices from before that point, and these could have been included. This would have made treating the dataset as a time series easier.
  • Including columns for the overall market (like the Dow 30 or S&P 500) would have been great. Most stock models heavily follow the market, so this would have been a big help.
  • Be more upfront on the rules. For instance is offline training allowed? Be explicit.
  • Provide information on how the public/private leaderboard data is split. Personally I think it should have been a random sample rather than a time based split.
  • Give the VMs a bit more oomph. Perhaps now that Google owns Kaggle, these will get more power in the Google cloud. But keep them free, unlike the YouTube challenge, where you get a $300 credit that is used up very quickly.


This wraps up my coverage of the Kaggle Two Sigma Financial Challenge. Participating was very interesting and educational. I was disappointed in the final result, but that is part of the learning curve. I will enter future challenges (assuming Google keeps running them) and hopefully can apply what I’ve learned along the way.

Written by smist08

March 13, 2017 at 7:36 pm

Posted in Artificial Intelligence


2 Responses


  1. Hi Stephen, impressive blog.

    I wrote one of the scripts you referred to. I only used Ridge on high volatility and used SGDRegressor with a different loss function for low and medium. This post clarifies:

    Overall, I thought the Two Sigma competition was not very well thought out. One would think that with all the PhDs doing quantitative research at that company, they could have thought up a better and fairer competition. Instead they came up with random winners who accidentally overfit on the solution 🙂


    April 13, 2017 at 1:44 am

  2. […] Two Sigma for stock market prediction. I blogged about this in part 1, part 2, part 3, part 4 and part 5. The upshot of this was that although I put in a lot of work, I performed quite poorly in the final […]
