Stephen Smith's Blog

Musings on Machine Learning…


Playing the Kaggle Two Sigma Challenge – Part 4



The Kaggle Two Sigma Financial Modeling Challenge ran from December 1, 2016 through March 1, 2017. In previous blog posts I introduced the challenge, covered what I did in December and then what I did in January. In this posting I'll continue with what I did in February: refining my earlier work, improving the methods I was using, and getting more done during the Kaggle VM runs.

The source code for these articles is located here. One file is the code I used to train offline; you can see how I comment/uncomment code to try different things. The other file shows how to use these results for three regression models and one random forest model. The offline file uses the data file train.h5, which comes from the Kaggle competition. I can't redistribute it, but you can get it from Kaggle by acknowledging the terms of use.

Training Offline

Usually training was the slowest part of running these solutions. It was quite hard to set up a solution with ensemble averaging when you only had time to train one algorithm. Within the Kaggle community there are a number of people who rely religiously on gradient boosting for their solutions, and gradient boosting has provided the key components in previous winning solutions. Unfortunately, in this competition it was very hard to get gradient boosting to converge within the runtime provided. Some of the participants took to training gradient boosting offline on their own computers, then taking the trained model and inserting it into the source code to run in the Kaggle VM. This was quite painful, since the trained model is a binary Python object, so they pickled it to a string and then output the string as an ASCII representation of the hex digits, which they could cut and paste into the Kaggle source code. The problem here was that the Kaggle source file is limited to 1 MB in size, which limited the size of the model they could use. However, a number of people got this to work.
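As a sketch of that pickle-to-hex trick (using a stand-in dictionary in place of a real trained model, so everything here is illustrative), the round trip might look like this:

```python
import pickle

# Stand-in for a trained model object; a real one would be, e.g., a
# fitted scikit-learn estimator.
model = {"coef": [0.1, -0.2], "intercept": 0.05}

# Serialize the binary object to an ASCII hex string that can be pasted
# into a Kaggle script as a plain string literal.
hex_string = pickle.dumps(model).hex()

# In the Kaggle VM, reverse the process to recover the model.
restored = pickle.loads(bytes.fromhex(hex_string))

print(restored == model)        # the round trip preserves the object
print(len(hex_string))          # must stay under the ~1 MB file limit
```

The hex encoding doubles the pickle's size, which is part of why the 1 MB source-file limit was so constraining for large boosted models.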

I thought about this and realized that for linear regression this was much easier. In linear regression the model only requires the coefficient array, which is the size of the number of variables, plus the intercept. So generating these and cutting/pasting them into the Kaggle solution is quite easy. I was a bit worried that the final run would use different training data, which would cause this method to fail, but in the end it turned out to be OK. A few people questioned whether this was against the rules of the competition, but no one could quote an exact rule to prevent it, just that you might need to provide the code that produced the numbers. Kaggle never gave a definitive answer to this question when asked.
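A minimal sketch of this idea on synthetic data (the data and coefficients here are made up, not the competition's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Offline: train on synthetic data with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1

model = LinearRegression().fit(X, y)
print(list(model.coef_), model.intercept_)   # copy these numbers out

# In the Kaggle VM: predictions need only a dot product with the
# pasted-in coefficients, no training and no pickled object.
coef = model.coef_
intercept = model.intercept_
pred = X @ coef + intercept
print(np.allclose(pred, model.predict(X)))   # same result as the model
```

Since the coefficient array is just one number per input column plus an intercept, even a model over a hundred features pastes in as a few lines of literals.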

Bigger Ensembles

With all this in mind, I trained my regression models offline. Some algorithms are quite slow, so this opened up quite a few possibilities. I basically ran through all the regression algorithms in scikit-learn and then used a collection of those that gave the best scores individually. Scikit-learn has a lot of regression algorithms and many of them didn't perform very well. The best results I got were from Lasso, ElasticNet (with L1 ratios bigger than 0.4) and Orthogonal Matching Pursuit. Generally I found that the algorithms that eliminated a lot of variables (setting their coefficients to zero) worked the best. I was a bit surprised that Ridge regression worked quite badly for me (more on that next time). I also tried adding some polynomial components using the scikit-learn PolynomialFeatures function, but I couldn't find anything useful there.

I trained these models using cross-validation (i.e., the CV versions of the functions). Cross-validation divides the data up and trains/tests on different folds to find the best results. To some degree this avoids overfitting and provides more robustness against bad data.
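Here's a rough sketch of fitting the CV variants of those three algorithms on synthetic sparse data and counting the non-zero coefficients each one keeps. The data, sizes, and parameters are illustrative only:

```python
import numpy as np
from sklearn.linear_model import (LassoCV, ElasticNetCV,
                                  OrthogonalMatchingPursuitCV)

# Synthetic data: 20 columns, only the first 4 actually matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
true_coef = np.zeros(20)
true_coef[:4] = [2.0, -1.0, 0.5, 1.5]
y = X @ true_coef + 0.1 * rng.normal(size=500)

# The CV versions pick their regularization strength by cross-validation.
results = {}
for model in (LassoCV(cv=5),
              ElasticNetCV(l1_ratio=0.6, cv=5),
              OrthogonalMatchingPursuitCV(cv=5)):
    model.fit(X, y)
    results[type(model).__name__] = int(np.count_nonzero(model.coef_))

print(results)   # sparse models zero out most of the 20 coefficients
```

All three tend to drive irrelevant coefficients to exactly zero, which matches the observation above that the variable-eliminating algorithms did best on this noisy data.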

Further, I ran these regressions on two views of the data: one using the previous and current values of a selection of columns, and the other using the whole dataset but just for the current timestamp. Once this was done for one regression, adding more regressions didn't slow down processing much, and the overall time I was using was modest. So I had enough processing time left over to add an ExtraTreesRegressor, which was trained during the runs.

It took quite a few submissions to figure out a good balance of solutions. Perhaps with more time a better optimum could have been obtained, but hard time limits are often good.


A number of people in the competition with more of a data background spent quite a bit of time cleaning the data, which seemed quite noisy and contained a number of bad outliers. I wasn't really keen on this and wanted my ML algorithms to do it for me. This is when I discovered the scikit-learn functions for dealing with outliers and modeling errors. The one I found useful was RANSAC (RANdom SAmple Consensus). I thought this was quite a clever algorithm: it uses subsets of the data to identify the outliers (by how far they are from various predictions) and to find a good subset of the data without outliers to train on. You pass a linear model into RANSAC to use for estimating, and then you can get the coefficients out at the end to use. The downside is that running RANSAC is very slow; to get good results it took me about 8 hours to train a single linear model.
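A small illustrative sketch of RANSAC wrapping a linear model, on synthetic one-dimensional data with injected outliers (not the competition data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data: y = 3x plus small noise, with heavy outliers injected
# into the first 30 rows.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 1))
y = 3.0 * X[:, 0] + 0.05 * rng.normal(size=300)
y[:30] += 20.0 * rng.normal(size=30)

# Pass the linear model into RANSAC; it repeatedly fits on random
# subsets and keeps the consensus set of inliers.
ransac = RANSACRegressor(LinearRegression(), random_state=0)
ransac.fit(X, y)

# The fitted inner model's coefficients can be pulled out afterwards,
# e.g. to paste into the Kaggle script as with plain regression.
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)
print("inliers kept:", int(ransac.inlier_mask_.sum()), "of", len(y))
```

The repeated subset fitting is also why it is so slow on large data: each RANSAC trial is a full fit of the inner model.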

The good news is that by using RANSAC rather than cross-validation I improved my score quite a bit, and as a result was in about 70th place before the competition ended. You can pass the cross-validation version of the function into RANSAC to perhaps get even better results, but I found this too slow (i.e., it was still running after a day or two).


This wraps up what I did in February and basically the RANSAC version of my best Ensemble is what I submitted as my final result for the competition. Next time I’ll discuss the final results of the competition and how I did on the final test dataset.

Written by smist08

March 7, 2017 at 9:25 pm

Playing the Kaggle Two Sigma Challenge – Part 3



Previously I introduced the Kaggle Two Sigma Financial Modeling Challenge which ran from December 1, 2016 to March 1, 2017. Then last time I covered what I did in December. This time I’ll cover what I worked on in January.

Update 2017/03/07: I added the source code for adding the previous values of a few columns to the data, posted here.

Time Series Data

Usually when you predict the stock market you use a time series of previous values to predict the next value. With this challenge the data was presented in a different way: you were given a lot of data on a given stock at a point in time and asked to predict the next value. Of course you don't have to use the data exactly as given; you can manipulate it into a better format. So there is nothing stopping you from reformatting the data so that for a given timestamp you also have a number of historical data points; you just need to remember them and add them to the data table. Sounds easy.

Generally computers are good at this sort of thing; however, for this challenge we had 20 minutes of running time and 8 GB of RAM to do it in. For testing training runs there were about 800,000 rows of training data and then 500,000 test rows. These all needed to be reformatted and the historical values held in memory. Further, you couldn't just do this as an array, because the stock symbols changed from timestamp to timestamp; i.e., symbols were inserted and removed, meaning you had to stay indexed by the symbol to ensure you were shifting data properly. The pandas library has good support for this sort of thing, but even with pandas it tended to be expensive in processing time and memory usage.
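A toy sketch of the per-symbol shifting with pandas; the column names ('id', 'timestamp', 'f0') mirror the competition's general layout, but the data is made up:

```python
import pandas as pd

# Toy data: symbol 11 disappears after timestamp 0 and symbol 12
# appears at timestamp 1, so a plain positional shift would misalign.
df = pd.DataFrame({
    "timestamp": [0, 0, 1, 1, 2],
    "id":        [10, 11, 10, 12, 10],
    "f0":        [1.0, 2.0, 3.0, 4.0, 5.0],
})

# groupby('id') keeps the shift aligned per symbol even when symbols
# are inserted or removed between timestamps; new symbols get NaN.
df["f0_prev"] = df.groupby("id")["f0"].shift(1)
print(df)
```

Each historical timestamp kept means another shifted copy of the chosen columns held in memory, which is why the 8 GB limit capped how much history was practical.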

My first attempt was to keep the last 10 values of everything and feed that into scikit-learn algorithms to see what they liked. This got nowhere, since my runs were aborted as they hit the memory limit. Next I tried adding a few time-series-looking columns to the data table and feeding that into ExtraTreesRegressor. This worked quite well, but I couldn't add much more data without running out of memory or slowing things down, so I couldn't use many trees in the algorithm.

From this experience, I tried selecting just a few columns and keeping different amounts of historical data. Experimenting, I found I got the best results using 36 columns and keeping 2 timestamps of history. This wasn't much work on my part but took quite a few days, since you only get two submissions per day to test against the larger submission set.

Strictly speaking this isn’t a time series since I don’t really have historical values of the variable I’m predicting, however it is theorized (but not confirmed) that some of the input variables include weighted averages of historical data, so it might not be that far off.

Ensemble Averaging

Ensemble averaging is the technique of taking a number of machine learning algorithms that solve a particular problem and taking the average of their results as the final result. If the algorithms are truly independent, there are theorems in probability that support this; but even if they aren't fully independent, practical results show that it provides surprisingly good results. Further, most Kaggle competitions are typically won by some sort of weighted average of a good number of algorithms. Basically this approach averages out the errors and biases that an individual algorithm might introduce.
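A minimal sketch of the technique with scikit-learn models on synthetic data; the particular model choices and parameters here are illustrative, and a simple unweighted mean is used:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with a sparse linear signal plus noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.0]) + 0.1 * rng.normal(size=400)

# Train each model independently, then average their predictions.
models = [Lasso(alpha=0.01),
          Ridge(alpha=1.0),
          ExtraTreesRegressor(n_estimators=20, random_state=0)]
preds = np.column_stack([m.fit(X, y).predict(X) for m in models])
ensemble = preds.mean(axis=1)   # one averaged prediction per row

print(ensemble.shape)
```

A weighted average (np.average with a weights argument) is the usual next step once you know which models deserve more trust.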

Note that the Christmas surprise solution from the previous blog article was really an ensemble of three algorithms where the average was quite a bit better than any of the individual components.

By now I had enough ways to slice the data and had tried enough algorithms that towards the end of January I could start combining them into ensembles to get better results. I started by combining a number of regression algorithms, since these were fairly fast to train and execute (especially on a reduced set of columns). I found that the regressions that gave the best results were ones that eliminated a lot of variables and had just 8 or so non-zero coefficients. This surprised me a bit, since I would have expected better results from Ridge regression, but I didn't seem to be able to get them.

This moved me up the leaderboard a bit, but generally through January I drifted down and found it a bit of a struggle to stay in the top 100.

Missing Values

I also spent a bit of time trying to get better values for the missing data. Usually with stock data you can use pandas' fillna with backward or forward filling (or even interpolation). However, these didn't work so well here because the data wasn't strictly rectangular, due to stocks being added and removed. Most things I tried just used too much processing time to be practical; in fact, just doing a fillna with the mean (or median) values of the training data consumed a lot of time. So I tried this offline, running locally on my laptop, to see if I could get anywhere. I figured if I got better results then I could try to optimize it and get it going in the Kaggle VM. But even after quite a bit of running it seemed I didn't get any better results this way, so I gave up on it. I suspect the ML algorithms mostly ignored the columns with a lot of missing values anyway.
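A tiny sketch of the mean-based fill on made-up data, computing the means from the training frame only:

```python
import numpy as np
import pandas as pd

# Made-up training data with gaps.
train = pd.DataFrame({"f0": [1.0, np.nan, 3.0],
                      "f1": [np.nan, 4.0, 6.0]})

# Column means are computed on the training data and reused to fill
# gaps; the same means would also be applied to test rows at runtime.
means = train.mean()
filled = train.fillna(means)
print(filled)
```

Per-symbol forward/backward filling would need a groupby on the symbol column first, which is exactly the kind of extra pass that blew the time budget in the VM.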


This was a quick overview of my progress in January. Next up: the final month, February. One good thing about the two-submissions-per-day limit was that it capped the amount of work I did on the competition, since it could be kind of addictive.

Written by smist08

March 6, 2017 at 5:43 pm