Posts Tagged ‘random forest’
The Kaggle Two Sigma Financial Modeling Challenge ran from December 1, 2016 through March 1, 2017. In previous blog posts I introduced the challenge, covered what I did in December then what I did in January. In this posting I’ll continue on with what I did in February. This consisted of refining my work from before, finding ways to refine the methods I was using and getting more done during the Kaggle VM runs.
Usually training was the slowest part of running these solution. It was quite hard to setup a solution with ensemble averaging when you only had time to train one algorithm. Within the Kaggle community there are a number of people that religiously rely on gradient boosting for their solutions and gradient boosting has provided the key components in previous winning solutions. Unfortunately in this competition it was very hard to get gradient boosting to converge within the runtime provided. Some of the participants took to training gradient boosting offline locally on their computers and then taking the trained model and inserting it into the source code to run in the Kaggle VM. This was quite painful since the trained model is a binary Python object. So they pickled it to a string and then output the string as an ascii representation of the hex digits that they could cut and paste into the Kaggle source code. The problem here was that the Kaggle source file is limited to 1meg in size, so it limited the size of the model they could use. However a number of people got this to work.
I thought about this and realized that for linear regression, this was much easier. In linear regression the model only requires the coefficient array which is the size of the number of variables and the intercept. So generating these and cut/pasting them into the Kaggle solution is quite easy. I was a bit worried that the final test data would have different training data, which would cause this method to fail, but in the end it turned out to be ok. A few people questioned whether this was against the rules of the competition, but no one could quote an exact rule to prevent it, just that you might need to provide the code that produced the numbers. Kaggle never gave a definitive answer to this question when asked.
With all this in mind, I trained my regression models offline. Some algorithms are quite slow so this opened up quite a few possibilities. I basically ran through all the regression algorithms in scikit-learn and then used a collection of them that gave the best scores individually. Scikit-learn has a lot of regression algorithms and many of them didn’t perform very well. The best results I got were for Lasso, ElasticNet (with L1 ratios bigger than 0.4) and Orthogonal Matching Pursuit. Generally I found the algorithms that eliminated a lot of variables (setting their coefficients to zero) worked the best. I was a bit surprised that Ridge regression worked quite badly for me (more on that next time). I also tried some adding some polynomial components using the scikit-learn PolynomialFeatures function, but I couldn’t find anything useful here.
I trained these models using cross-validation (ie the CV versions of the functions). Cross-validation divides the data up and does various training/testing on different folds to find the best results. To some degree this avoids overfitting and provides more robustness to bad data.
Further I ran these regressions on two views of the data, one on my last data/current data on a bunch of columns and the other on the whole dataset but just for the current time stamp. Once doing this for one regression, adding more regressions didn’t seem to slow down processing much and the overall time I was using wasn’t much. So I had enough processing time leftover to add an ExtraTreesRegressor which was trained during the runs.
It took quite a few submissions to figure out a good balance of solutions. Perhaps with more time a better optimum could have been obtained, but hard time limits are often good.
A number of people in the competition with more of a data background spent quite a bit of time cleaning the data which seemed quite noisy with quite a few bad outliers. I wasn’t really keen on this and really wanted my ML algorithms to do this for me. This is when I discovered the the scikit-learn functions for dealing with outliers and modeling errors. The one I found useful was RANSAC (RANdom SAmple Consensus). I thought this was quite a clever algorithm to use subsets of the data to figure out the outliers (by how far they were from various prediction) and to find a good subset of the data without outliers to train on. You pass a linear model into RANSAC to use for estimating and then you can get the coefficients out at the end to use. The downside is that running RANSAC is very slow and to get good results it would take me about 8 hours to train a single linear model.
The good news here is that using RANSAC rather than cross-validation, I improved my score quite a bit and as a result ended up in about 70th place before the competition ended. You can pass the cross-validation version of the function into RANSAC to perhaps get even better results, but I found this too slow (ie to was still running after a day or two).
This wraps up what I did in February and basically the RANSAC version of my best Ensemble is what I submitted as my final result for the competition. Next time I’ll discuss the final results of the competition and how I did on the final test dataset.
Last time I introduced the Kaggle Two Sigma Challenge and this time I’ll start describing what I did at the beginning of the competition. The competition started at the beginning of December, 2016 and completed on March 1, 2017. This blog covers what I did in December.
Update 2017/03/07: I uploaded the Python source code for the code discussed here to my Google Drive. You can access them here. The files are TensorFlow1.py for the first (wide) TensorFlow attempt, TFNarrow1.py for the second narrow TensorFlow attempt, RegressionLab1.py for my regression one with reinforcement learning and then TreeReg1.py for the Christmas surprise with reinforcement learning added.
Since I spent quite a bit of time playing and blogging about predicting the stock market with TensorFlow, this is where I started. The data was all numeric, so it was quite easy to get started, no one hot encoding and really the only pre-processing was to fill in missing values with the pandas fillna function (where I just used the mean since this was easiest). I’ll talk more about these missing values later, but to get started they were easy to fill in and ignore.
I started by just feeding all the data into TensorFlow trying some simple 2, 3 and 4 level neural networks. However my results were quite bad. Either the model couldn’t converge or even if it did, the results were much worse than just submitting zeros for everything.
With all the data the model was quite large, so I thought I should simplify it a bit. The Kaggle competition has a public forum which includes people publishing public Python notebooks and early in every competition there are some very generous people that published detailed statistical analysis and visualizations of all the data. Using this I could select a small subset of data columns which had higher correlations with the results and just use these instead. This then let me run the training longer, but still didn’t produce any useful results.
At this point I decided that given the computing resource limitations of the Kaggle playgrounds, I wouldn’t be able to do a serious neural network, or perhaps doing so just wouldn’t work. I did think of doing the training on my laptop, say running overnight and then copy/pasting the weight/bias arrays into my Python code in the playground to just run. But I never pursued this.
Penalized Linear Regression
My next thought was to use linear regression since this tends to be good for extrapolation problems since it doesn’t suffer from non-linearities going wild outside of the training data. Generally regular least squares regression can suffer from overfitting, especially when there are a large number of variables and they aren’t particularly linearly independent. Also least squares regression can be thrown off by bad errant data. The general consensus from the forums was that this training set had a lot of outliers for some reason. In machine learning there are a large family of Penalized Linear Regression algorithms that all try to address these problems via one means or another. Generally they do things like start with the most correlated column and then add the next most correlated column and only keep doing this as long as they have a positive effect on the results. They also penalize large weights borrowing the technique we described here. Then there are various methods to filter out outliers or to change their effect by using different metrics than sum of squares. Two popular methods are Lasso regression that uses the taxi-cab metric (sum of difference of absolute values rather than sum of square differences) and Ridge regression which uses sum of squares regression. Then both penalize large coefficients and bring in variables one at a time. Then there is a combined algorithm called Elastic Net Regression that uses a ratio of each and you choose the coefficient.
Playing around with this a bit, I found the scikit-learn algorithm ElasticNetCV worked quite well for me. ElasticNetCV breaks up the training data and then run iteratively testing the value of how many variables to include to find the best result. Choosing the l1 ratio of 0.45 actually put me in the top ten of the submissions. This was a very simple submission, but I was pretty happy to get such a good result.
One thing that seemed a bit strange to me about the way the Kaggle Gym worked was that you submitted your results for a given time step and then got a reward for that. However you didn’t get the correct results for the previous timestep. Normally for stock market prediction you predict the next day, then get the correct results at the end of the day and predict the next day. Here you only get a reward which is the R2 score for you submission. The idea is to have an algorithm like the following diagram. But incorporating the R2 score is quite tricky.
I spent a bit of time thinking about this and had the idea that you could sort of calculate the variance from the R2 score and then if you made an assumption about the underlying probability distribution you could then make an estimate of the mean. Then I could introduce a bias to the mean to compensate for cumulative errors as the time gets farther and farther from the training data.
Now there are quite a few problems with this, namely the variance doesn’t give you the sign of the error which is worrying. I tried a few different relationships of mean to variance and found one that improved my score quite bit. But again this was all rather ad-hoc.
Anyway, every ten timesteps I didn’t apply the bias so I could get a new bias and then used the bias on the other timesteps.
The competition moves fairly quickly so a week or two after my first good score, I was well down in the standings. Adding the my mean bias from the reward to my ElasticNetCV regression put me back into the top 10 again.
A Christmas Present
I went to bed on Christmas eve in 6th place on the competition leaderboard. I was pretty happy about that. When I checked in on Christmas Day I was down to 80th place on the leaderboard. As a Christmas present to all the competitors one of the then current top people above me made his solution public, which then meant lots of other folks forked his solution, submitted it and got his score.
This solution used a Random Forest algorithm ExtraTreesRegressor from scikit-learn combined with a simple mean based estimate and a simple regression on one variable. The random forest part was interesting because it let the algorithm know which were missing values so it could learn to act appropriately.
At first I was really upset about this, but when I had time I realized I could take that public solution, add my mean bias and improve upon it. I did this and got back into the top ten. So it wasn’t that bad.
Well this covered the first month of the competition, two more to go. I think getting into the top ten on the leaderboard a few times gave me the motivation to keep plugging away at the competition and finding some more innovative solutions. Next up January.