Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘missing values

Playing the Kaggle Two Sigma Challenge – Part 3

with 4 comments

Introduction

Previously I introduced the Kaggle Two Sigma Financial Modeling Challenge which ran from December 1, 2016 to March 1, 2017. Then last time I covered what I did in December. This time I’ll cover what I worked on in January.

Update 2017/03/07: I added the source code for adding the previous value of a few columns to the data in RegressionLab4.py posted here.

Time Series Data

Usually when you predict the stock market you use a time series of previous values to predict the next value. With this challenge the data was presented in a different way, namely you were given a lot of data on a given stock at a point of time and then asked to predict the next value. Of course you don’t have to use the data exactly as given, you can manipulate it into a better format. So there is nothing stopping you reformatting the data so for a given timestamp you also have a number of historical data points, you just need to remember them and add them to the data table. Sounds easy.

Generally computers are good at this sort of thing, however for this challenge we had 20 minutes of running time and 8Gig of RAM to do it. For testing training runs there were about 800,000 rows of training data and then 500,000 of test rows. These all needed to be reformatted and historical values held in memory. Further you couldn’t just do this as an array because the stock symbols changed from timestamp to timestamp. Ie symbols were inserted and removed meaning that you had to stay indexed by the symbol to ensure you were shifting data properly. The pandas library has good support for this sort of thing, but even with pandas it tended to be expensive for processing time and memory usage.

My first attempt was why not just keep the 10 last values of everything and feed that into scikit learn algorithms to see what they liked. Basically this got nowhere since my runs were just aborted as they hit the memory limit. Next I tried adding a few time series looking columns to the data table and feeded that into ExtraTreesRegressor, this worked quite well but I couldn’t had much more data without running out of memory or slowing things down so I couldn’t use many trees in the algorithm.

From this experience, I tried just selecting a few rows and presented different and tried keeping different numbers of historical data. Experimenting I found I got the best results using 36 columns and keeping 2 timestamps of history. This wasn’t much work on my part bart took quite a few days since you only get two submissions per day to test against the larger submission set.

Strictly speaking this isn’t a time series since I don’t really have historical values of the variable I’m predicting, however it is theorized (but not confirmed) that some of the input variables include weighted averages of historical data, so it might not be that far off.

Ensemble Averaging

Ensemble averaging is the technique of taking a number of machine learning algorithms that solve a particular problem and taking the average of their results as the final result. If the algorithms are truly independent then there are theorems in probability that support this. But even if they aren’t fully independent, practical results show that this does provide surprisingly good results. Further typically most Kaggle competitions are won by some sort of weighted average of a good number of algorithms. Basically this approach averages out errors and biases that an individual algorithm might introduce.

Note that the Christmas surprise solution from the previous blog article was really an ensemble of three algorithms where the average was quite a bit better than any of the individual components.

I now suspected I had enough ways to slice the data and had tried quite a few algorithms that towards the end of January I could start combining them into ensembles to get better results. I started by combining a number of regression algorithms since these were fairly fast to train and execute (especially on a reduced set of columns). I found that the regressions that gave the best results were ones that eliminated a lot of variables and just had 8 or so non-zero coefficients. This surprised me a bit, since I would have expected better results out of Ridge regression, but didn’t seem to be able to get them.

This moved me up the leaderboard a bit, but generally through January I dropped down the leaderboard and found it a bit of a struggle to stay in the top 100.

Missing Values

I also spent a bit of time trying to get better values for the missing values. Usually with stock data you can use the pandas fillna using back or forward filling (or even interpolating). However these didn’t work so well because the data wasn’t strictly rectangular due to the stocks being added and removed. Most things I tried just used too much processing time to be practical. In fact just doing a fillna on the mean (or median) values on the training data was a pretty big time user. So I tried this offline running locally on my laptop to see if I could get anywhere. I figured if I got better results then I could try to optimize them and get it going in the Kaggle VM. But even with quite a bit of running it seemed that I didn’t get any better results this way, so I gave up on it. I suspect practically speaking the ML algorithms just ignored most of the columns with a lot of missing values anyway.

Summary

This was a quick overview of my progress in January. Next up the final month of February. One thing that was good about the two submission per day limit was that it limited the amount of work I did on the competition since it could be kind of addictive.

Advertisements

Written by smist08

March 6, 2017 at 5:43 pm