Stephen Smith's Blog

Musings on Machine Learning…

Playing the Kaggle Two Sigma Challenge 2018/2019




A couple of years ago, I entered a Kaggle data science competition sponsored by Two Sigma for stock market prediction. I blogged about this in part 1, part 2, part 3, part 4 and part 5. The upshot was that although I put in a lot of work, I performed quite poorly in the final stages. I learned a lot about machine learning and data science along the way, so I was keen to have another go when another Two Sigma sponsored competition rolled around last year.

In this competition we had three months in the fall of 2018 to create our models, then the organizers used those models to predict the stock market through the first half of 2019. With what I learned from the last competition, I was able to do much better this time around.

Don’t Overfit

My big lesson from the first competition was to not overfit the model to the training data. The classic example of overfitting is having ten data points and fitting them perfectly with a 9th degree polynomial. There is no error in predicting the ten points, but the model is useless at predicting anything else, and can give wildly wrong predictions between and beyond those points.

A more subtle form of overfitting is trying hundreds of models and fiddling with their parameters until they work really well with the training data. This is a lot of work and won’t help you on any data outside of the training data. I did this for the first competition. It was a lot of work, and the result performed badly.
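The polynomial example is easy to reproduce. The following is a minimal sketch, not anything from the competition: the sine-shaped "true" function, the noise level and the random seed are all assumptions chosen for illustration. It fits ten noisy points with a degree 9 and a degree 3 polynomial and compares the error on the training points with the error on points in between.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_fn(x):
    # Hypothetical underlying signal; any smooth function would do.
    return np.sin(2 * np.pi * x)


# Ten noisy training points.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.size)

# Dense grid of unseen points, scored against the noise-free signal.
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_fn(x_test)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# Degree 9 passes through all ten points; degree 3 cannot.
overfit = Polynomial.fit(x_train, y_train, deg=9)
simple = Polynomial.fit(x_train, y_train, deg=3)

train_mse_overfit = mse(overfit, x_train, y_train)
test_mse_overfit = mse(overfit, x_test, y_test)
train_mse_simple = mse(simple, x_train, y_train)
test_mse_simple = mse(simple, x_test, y_test)

print("degree 9: train", train_mse_overfit, "test", test_mse_overfit)
print("degree 3: train", train_mse_simple, "test", test_mse_simple)
```

The degree 9 fit reports a training error of essentially zero, which says nothing about how it behaves on the unseen points. The simpler fit typically has a worse training error and a better test error.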

Avoiding overfitting means doing less work, which is good. I spent very little time on this competition and got quite good results.

Kaggle Virtual Environment

One of the things I like about this competition is that you play it in Google/Kaggle’s virtual environment. You have a fixed set of computer resources and everyone plays in the same environment. This levels the playing field with people who have access to very high powered equipment at corporations or universities. This year the environment included a GPU and we could run for six hours on a high powered server.

This does limit the models you can use. I wasn’t successful at using neural networks, probably because historical stock market data is very flat and it is hard to get these models to converge. I ended up using an Extra Trees model in scikit-learn.
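For readers who have not used Extra Trees, here is a minimal sketch of what fitting one in scikit-learn looks like. Everything specific in it is an assumption for illustration: the synthetic features, the choice of `ExtraTreesRegressor` over the classifier variant, the hyperparameter values, and the clipping of predictions to a confidence between -1 and 1. It is not the model from the competition.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for market features: 500 rows, 5 columns.
# Only the first column carries a weak signal; the rest is noise.
X = rng.normal(size=(500, 5))
y = 0.1 * X[:, 0] + rng.normal(scale=1.0, size=500)

# Keep the earlier rows for training and the later rows for checking,
# which mimics predicting forward in time.
X_train, y_train = X[:400], y[:400]
X_check = X[400:]

# Large leaves and a modest number of trees keep the model coarse,
# which is one way to resist fitting noise (values are illustrative).
model = ExtraTreesRegressor(
    n_estimators=50,
    min_samples_leaf=20,
    random_state=0,
)
model.fit(X_train, y_train)

# Assumed convention: a confidence in [-1, 1] per row.
confidence = np.clip(model.predict(X_check), -1.0, 1.0)
print(confidence[:5])
```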

Make the Program Robust

Usually in this sort of competition, when you run your model, behind the scenes Kaggle runs it on the secret test data as well, so if it runs successfully on the provided test data, you know it will also run on the secret data you are going to be scored against. In this case the secret test data didn’t exist yet. This led to the worry that sometime in the six months that they would be running our program, something unexpected would appear in the data and cause my program to crash, knocking me out of the competition.

I was careful to put in try/except blocks and added extra checks to try to keep my program running. The other thing with Python programs is that sometimes they work, but raise a memory error when they shut down. I spent some time tracking down a number of these bugs and made sure my program could exit gracefully without any errors.
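The general idea of guarding a prediction step can be sketched as follows. This is a hypothetical helper, not code from the competition: the name `safe_predict`, the neutral fallback of 0.0 and the specific checks are all assumptions. The point is that a bad day of data produces a neutral answer instead of a crash.

```python
import logging
import math

logger = logging.getLogger("robust_predict")


def _clean(value, fallback):
    """Convert one prediction to a finite float, or return the fallback."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return fallback
    return number if math.isfinite(number) else fallback


def safe_predict(predict_fn, rows, fallback=0.0):
    """Return one prediction per row, never raising.

    predict_fn is any callable that takes the rows and returns an
    iterable of predictions. If it raises, or returns the wrong number
    of values, every row gets the neutral fallback instead.
    """
    try:
        preds = list(predict_fn(rows))
        if len(preds) != len(rows):
            raise ValueError("prediction count does not match row count")
        return [_clean(p, fallback) for p in preds]
    except Exception:
        # Log the problem and keep going rather than crash the run.
        logger.exception("prediction failed; using fallback values")
        return [fallback] * len(rows)


def broken_model(rows):
    # Simulates unexpected data blowing up inside the model.
    raise KeyError("unexpected column")


print(safe_predict(broken_model, [1, 2, 3]))
print(safe_predict(lambda rows: [0.5, float("nan")], [1, 2]))
```

Catching every exception is normally poor style, but when a single crash means disqualification, a neutral prediction for one day is the cheaper failure.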

From the message board for the competition, it appears quite a few competitors were knocked out during the run on the new data.

Don’t Cheat

Some people spend all their time trying to cheat, trying to hack the system to gain access to the secret test data. For the purposes of the leaderboard, there was secret test data to give people a score during the model building phase, but this data wouldn’t be used in the real competition. There was no real protection on this data, since using it to cheat would be useless. However, quite a few people did cheat to move to the top of the leaderboard before the real competition started. These programs crashed when the real competition started.

There are rumours that people have succeeded in other Kaggle competitions by cheating, but in this one, since it was based on stock market data generated after the models were frozen, it wasn’t going to work.

My Model

The intent of the competition was to try to use news data to enhance your stock prediction algorithm. I don’t think this worked well. I used an Extra Trees variation on a Random Forest, and the news data never seemed to contribute much. Many other competitors didn’t even use it.

The competition metric was a Sharpe Ratio. I have a couple of observations about the Sharpe Ratio. First, it has the standard deviation of the estimates in the denominator. This means volatile stocks will hurt you even if you predict them accurately. Second, you enter a confidence value on whether you think the stock will do well, and this confidence can be negative if you think it will go down. If you get the sign wrong it will doubly hurt you.

Given the Sharpe Ratio as the metric, I decided to sort the stocks by the standard deviation of their returns and then assign a confidence of zero to any with a high standard deviation, which would exclude them from the model. This meant I was building a portfolio of a subset of all the stocks present. I wanted stocks I could predict accurately and that had a lower volatility.
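To see why the sign matters so much, here is a small sketch of a Sharpe-style score. The exact formula is an assumption for illustration: each day's value is the sum of confidence times return over all stocks, and the score is the mean of those daily values divided by their (population) standard deviation. The function name and the toy numbers are made up.

```python
import numpy as np


def sharpe_like_score(confidence, returns):
    """Mean of daily confidence-weighted returns over their std.

    Both arguments are arrays of shape (days, stocks). A confidence is
    positive for "will go up" and negative for "will go down".
    """
    confidence = np.asarray(confidence, dtype=float)
    returns = np.asarray(returns, dtype=float)
    daily = (confidence * returns).sum(axis=1)  # one value per day
    spread = daily.std()
    # Guard against dividing by zero when every day is identical.
    return float(daily.mean() / spread) if spread > 0 else 0.0


# Two days, two stocks: the first rises, the second falls.
returns = np.array([[0.01, -0.02],
                    [0.02, -0.01]])

# Correct signs on day one, and only the first stock on day two.
good = np.array([[1.0, -1.0],
                 [1.0, 0.0]])

# Same as above but with the wrong sign on one falling stock.
bad = good.copy()
bad[0, 1] = 1.0

score_good = sharpe_like_score(good, returns)
score_bad = sharpe_like_score(bad, returns)
print(score_good, score_bad)
```

A single wrong sign lowers the mean in the numerator and also widens the spread in the denominator, so it is penalized twice.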
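The filtering step can be sketched in a few lines. The post does not say where the cutoff was, so the `keep_fraction` parameter, the function name and the toy data below are assumptions: the sketch measures each stock's volatility from past returns and sets the confidence to zero for the more volatile ones.

```python
import numpy as np


def zero_out_volatile(confidence, past_returns, keep_fraction=0.5):
    """Zero the confidence for stocks whose returns are too volatile.

    confidence:    array of shape (stocks,), one value per stock.
    past_returns:  array of shape (days, stocks) of historical returns.
    keep_fraction: hypothetical cutoff; stocks whose volatility is at or
                   below this quantile keep their confidence.
    """
    confidence = np.asarray(confidence, dtype=float)
    volatility = np.asarray(past_returns, dtype=float).std(axis=0)
    cutoff = np.quantile(volatility, keep_fraction)
    keep = volatility <= cutoff
    return np.where(keep, confidence, 0.0), keep


# Four days of returns for four stocks; stocks 1 and 3 swing the most.
past_returns = np.array([
    [0.01, 0.10, 0.02, 0.20],
    [-0.01, -0.10, -0.02, -0.20],
    [0.01, 0.10, 0.02, 0.20],
    [-0.01, -0.10, -0.02, -0.20],
])
raw_confidence = np.array([0.8, -0.5, 0.3, 0.9])

filtered, kept = zero_out_volatile(raw_confidence, past_returns)
print(filtered, kept)
```

A stock with zero confidence contributes nothing to the daily total, so it can neither help the mean nor inflate the standard deviation.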

In reading the discussion board after the competition, it appears many of the top performers took this approach.


I enjoyed this competition. I liked working with the high powered servers in the competition’s virtual environment. The lesson not to overfit greatly reduced the amount of work. Whenever I was tempted to tune my model, I just said no. I ended up receiving a silver medal, coming in 145th out of 2,927 competitors. With each competition I learn a bit more and I look forward to the next one.

Written by smist08

August 16, 2019 at 5:43 pm
