## Posts Tagged ‘**random forest**’

## Playing the Kaggle Two Sigma Challenge 2018/2019

# Introduction

A couple of years ago, I entered a Kaggle data science competition sponsored by Two Sigma for stock market prediction. I blogged about this in part 1, part 2, part 3, part 4 and part 5. The upshot of this was that although I put in a lot of work, I performed quite poorly in the final stages. I learned a lot about machine learning and data science along the way and was keen to have another go, when another Two Sigma sponsored competition rolled around starting last year.

In this competition we had three months to create our models in the fall of 2018, then they used our models to predict the stock market through the first half of 2019. With my learnings from the last competition, I was able to do much better this time around.

# Don’t Overfit

My big lesson from the first competition was to not overfit the model to the training data. This is equivalent to having ten data points and fitting them perfectly with a 9th degree polynomial. There is no error in predicting the ten points, but the model is useless at predicting anything else, and in fact gives about the worst predictions possible.

A more subtle form of overfitting is trying hundreds of models and fiddling with their parameters until they work really well with the training data. This is a lot of work and won’t help you on any data outside of the training data. I did this for the first competition, it was a lot of work, and it performed badly.

Avoiding overfitting means doing less work, which is good. I spent very little time on this competition and got quite good results.

# Kaggle Virtual Environment

One of the things I like about this competition is that you play it in Google/Kaggle’s virtual environment. You have a fixed set of computer resources and everyone plays in the same environment. This levels the playing field with people who have access to very high powered equipment at corporations or Universities. This year the environment included a GPU and we could run for six hours on a high powered server.

This does limit the models you can use, I wasn’t successful at using a neural networks, probably because historical stock market data is very flat and it is hard to get these models to converge. I ended up using an Extra Trees model in SciKit Learn.

# Make the Program Robust

Usually in this sort of competition, when you run your model, behind the scenes Kaggle runs it on the secret test data as well, so if you run successfully on the provided test data, you know you also run on the secret data you are going to be scored against. In this case the secret test data didn’t exist yet. This led to the worry that sometime in the six months that they would be running our program, something unexpected would appear in the data and cause my program to crash, knocking me out of the competition.

I was careful to put in try/catch statements and added extra checks to try and keep my program running. The other thing with Python programs, is that sometimes they work, but throw a memory exception when they shutdown. I spent some time tracking down a number of these bugs and made sure my program could exit gracefully without any errors.

From the message board for the competition, it appears quite a few competitors were knocked out during the run on the new data.

# Don’t Cheat

Some people spend all their time trying to cheat. Trying to hack the system to gain access to the secret test data. For the purposes of the leaderboard, there was secret test data to give people a score during the model building phase, but this data wouldn’t be used in the real competition. There was no real protection on this data, since using it to cheat would be useless. However quite a few people did cheat to move to the top of the leaderboard before the real competition started. These programs crashed when the real competition started.

There are rumours that people have succeeded in other Kaggle competitions by cheating, but in this one, since it was based on stock market data generated after the models were frozen, it wasn’t going to work.

# My Model

The intent of the competition was to try to use news data to enhance your stock prediction algorithm. I don’t think this worked well, I used an Extra Trees variation on a Random Forest, and the news data never seemed to contribute much. Many other competitors didn’t even use it. The competition metric was a Sharpe Ratio. A couple of observations about the Sharpe Ratio, one is that it has the standard deviation of the estimates in the denominator. This means volatile stocks will hurt you even if you predict them accurately. Second, you enter a confidence value on whether you think the stock will do well and this confidence can be negative if you think it will go down. If you get the sign wrong it will doubly hurt you.

Given the Sharpe Ratio as the metric, I decided to sort the stocks by standard deviation of their returns and then rate any with a high standard deviation as zero, which would exclude them from the model. This meant I was building a portfolio of a subset of all the stocks present. I wanted stocks I could predict accurately and that had a lower volatility.

In reading the discussion board after the competition, it appears many of the top performers took this approach.

# Summary

I enjoyed this competition. I liked working with the high powered servers in the competition’s virtual environment. The lesson learned to not overfit, greatly reduced the amount of work. Whenever I was tempted to tune my model, I just said no. I ended up receiving a silver medal, coming in 145 out of 2927 competitors. With each competition I learn a bit more and I look forward to the next one.

## A Crack in the TensorFlow Platform

# Introduction

Last time we looked at how some tunable parameters through off a TensorFlow solution of a linear regression problem. This time we are going to look at a few more topics around TensorFlow and linear regression. Then we’ll look at how Google is implementing Linear Regression and some problems with their approach.

# TensorFlow Graphs

Last time we looked at calculating the solution to a linear regression problem directly using TensorFlow. That bit of code was:

# Now lets calculated the least squares fit exactly using TensorFlow X = tf.constant(data[:,0], name="X") Y = tf.constant(data[:,1], name="Y") Xavg = tf.reduce_mean(X, name="Xavg") Yavg = tf.reduce_mean(Y, name="Yavg") num = (X - Xavg) * (Y - Yavg) denom = (X - Xavg) ** 2 rednum = tf.reduce_sum(num, name="numerator") reddenom = tf.reduce_sum(denom, name="denominator") m = rednum / reddenom b = Yavg - m * Xavg with tf.Session() as sess: writer = tf.summary.FileWriter('./graphs', sess.graph) mm, bb = sess.run([m, b])

TensorFlow does all its calculations based on a graph where the various operators and constants are nodes that then get connected together to show dependencies. We can use TensorBoard to show the graph for the snippet of code we just reviewed here:

Notice that TensorFlow overloads the standard Python numerical operators, so when we get a line of code like: “denom = (X – Xavg) ** 2”, since X and Xavg are Tensors then we actually generate TensorFlow nodes as if we had called things like tf.subtract and tf.pow. This is much easier code to write, the only downside being that there isn’t a name parameter to label the nodes to get a better graph out of TensorBoard.

With TensorFlow you perform calculations in two steps, first you build the graph (everything before the with statement) and then you execute a calculation by specifying what you want. To do this you create a session and call run. In run we specify the variables we want calculated. TensorFlow then goes through the graph calculating anything it needs to, to get the variables we asked for. This means it may not calculate everything in the graph.

So why does TensorFlow follow this model? It seems overly complicated to perform numerical calculations. The reason is that there are algorithms to separate graphs into separate independent components that can be calculated in parallel. Then TensorFlow can delegate separate parts of the graph to separate GPUs to perform the calculation and then combine the results. In this example this power isn’t needed, but once you are calculating a very complicated large Neural Network then this becomes a real selling point. However since TensorFlow is a general tool, you can use it to do any calculation you wish on a set of GPUs.

# TensorFlow’s New LinearRegressor Estimator

Google has been trying to turn TensorFlow into a platform for all sorts of Machine Learning algorithms, not just Neural Networks. They have added estimators for Random Forests and for Linear Regression. However they did this by using the optimizers they created for Neural Nets rather than using the standard algorithms used in other libraries, like those implemented in SciKit Learn. The reasoning behind this is that they have a lot of support for really really big models with lots of support for one-hot encoding, sparse matrices and so on. However the algorithms that solve the problem seem to be exceedingly slow and resource hungry. Anything implemented in TensorFlow will run on a GPU, and similarly any Machine Learning algorithm can be implemented in TensorFlow. The goal here is to have TensorFlow running the Google AI Cloud where all the virtual machines have Google designed GPU like AI accelerator hardware. But I think unless they implement the standard algorithms, so they can solve things like a simple least squares regression quickly hand accurately then its usefulness will be limited.

Here is how you solve our fire versus theft linear regression this way in TensorFlow:

features = [tf.contrib.layers.real_valued_column("x", dimension=1)] estimator = tf.contrib.learn.LinearRegressor(feature_columns=features, model_dir='./linear_estimator')

# Input builders input_fn = tf.contrib.learn.io.numpy_input_fn({"x":x}, y, num_epochs=10000) estimator.fit(input_fn=input_fn, steps=2000) mm = estimator.get_variable_value('linear/x/weight') bb = estimator.get_variable_value('linear/bias_weight') print(mm, bb)

This solves the problem and returns a slope of 1.50674927 and intercept of 13.47268105 (the correct numbers from last post are 1.31345600492 and 16.9951572327). By increasing the steps in the fit statement I can get closer to the correct answer, but it is very time consuming.

The documentation for these new estimators is very limited, so I’m not 100% sure it’s solving least squares, but I tried getting the L1 solution using SciKit Learn and it was very close to least squares, so whatever this new estimator is estimating (which might be least squares), it is very slow and quite inaccurate. It is also strange that we now have a couple of tunable parameters added to make a fairly simple calculation problematic. The graph for this solution isn’t too bad, but still since we know the exact solution it is a bit disappointing.

Incidentally I was planning to compare the new TensorFlow RandomForest estimator to the Scikit Learn implementation. Although the SciKit Learn one is quite fast, it uses a huge amount of memory so I kind of would like a better solution. But when I compared the two I found the TensorFlow one so bad (both slow and resource intensive) that I didn’t bother blogging it. I hope that by the time this solution becomes more mainstream in TensorFlow that it improves a lot.

# Summary

TensorFlow is a very powerful engine for performing calculations that can be automatically parallelized and distributed over multiple GPUs for amazing computational speeds. This really does make it possible to spend a few thousand dollars and build quite a powerful supercomputer.

The downside is that Google appears to have the hammer of their neural network optimizers that they really want to use. As a result they are treating everything else as a nail and hitting it with this hammer. The results are quite sub-optimal. I think they do need to spend the time to implement a few of the standard non-Neural Network algorithms properly in TensorFlow if they really want to unleash the power of this platform.

## Playing the Kaggle Two Sigma Challenge – Part 4

# Introduction

The Kaggle Two Sigma Financial Modeling Challenge ran from December 1, 2016 through March 1, 2017. In previous blog posts I introduced the challenge, covered what I did in December then what I did in January. In this posting I’ll continue on with what I did in February. This consisted of refining my work from before, finding ways to refine the methods I was using and getting more done during the Kaggle VM runs.

The source code for these articles is located here. The file use2.py is the code I used to train offline. You can see how I comment/uncomment code to try different things. The file multimodelmultitime.py shows how to use these results for 3 regression models and 1 random forest model. The offline file use2.py uses the datafile train.h5 which is obtained from the Kaggle competition, I can’t redistribute this, but you can get it from Kaggle by acknowledging the terms of use.

# Training Offline

Usually training was the slowest part of running these solution. It was quite hard to setup a solution with ensemble averaging when you only had time to train one algorithm. Within the Kaggle community there are a number of people that religiously rely on gradient boosting for their solutions and gradient boosting has provided the key components in previous winning solutions. Unfortunately in this competition it was very hard to get gradient boosting to converge within the runtime provided. Some of the participants took to training gradient boosting offline locally on their computers and then taking the trained model and inserting it into the source code to run in the Kaggle VM. This was quite painful since the trained model is a binary Python object. So they pickled it to a string and then output the string as an ascii representation of the hex digits that they could cut and paste into the Kaggle source code. The problem here was that the Kaggle source file is limited to 1meg in size, so it limited the size of the model they could use. However a number of people got this to work.

I thought about this and realized that for linear regression, this was much easier. In linear regression the model only requires the coefficient array which is the size of the number of variables and the intercept. So generating these and cut/pasting them into the Kaggle solution is quite easy. I was a bit worried that the final test data would have different training data, which would cause this method to fail, but in the end it turned out to be ok. A few people questioned whether this was against the rules of the competition, but no one could quote an exact rule to prevent it, just that you might need to provide the code that produced the numbers. Kaggle never gave a definitive answer to this question when asked.

# Bigger Ensembles

With all this in mind, I trained my regression models offline. Some algorithms are quite slow so this opened up quite a few possibilities. I basically ran through all the regression algorithms in scikit-learn and then used a collection of them that gave the best scores individually. Scikit-learn has a lot of regression algorithms and many of them didn’t perform very well. The best results I got were for Lasso, ElasticNet (with L1 ratios bigger than 0.4) and Orthogonal Matching Pursuit. Generally I found the algorithms that eliminated a lot of variables (setting their coefficients to zero) worked the best. I was a bit surprised that Ridge regression worked quite badly for me (more on that next time). I also tried some adding some polynomial components using the scikit-learn PolynomialFeatures function, but I couldn’t find anything useful here.

I trained these models using cross-validation (ie the CV versions of the functions). Cross-validation divides the data up and does various training/testing on different folds to find the best results. To some degree this avoids overfitting and provides more robustness to bad data.

Further I ran these regressions on two views of the data, one on my last data/current data on a bunch of columns and the other on the whole dataset but just for the current time stamp. Once doing this for one regression, adding more regressions didn’t seem to slow down processing much and the overall time I was using wasn’t much. So I had enough processing time leftover to add an ExtraTreesRegressor which was trained during the runs.

It took quite a few submissions to figure out a good balance of solutions. Perhaps with more time a better optimum could have been obtained, but hard time limits are often good.

# RANSAC

A number of people in the competition with more of a data background spent quite a bit of time cleaning the data which seemed quite noisy with quite a few bad outliers. I wasn’t really keen on this and really wanted my ML algorithms to do this for me. This is when I discovered the the scikit-learn functions for dealing with outliers and modeling errors. The one I found useful was RANSAC (RANdom SAmple Consensus). I thought this was quite a clever algorithm to use subsets of the data to figure out the outliers (by how far they were from various prediction) and to find a good subset of the data without outliers to train on. You pass a linear model into RANSAC to use for estimating and then you can get the coefficients out at the end to use. The downside is that running RANSAC is very slow and to get good results it would take me about 8 hours to train a single linear model.

The good news here is that using RANSAC rather than cross-validation, I improved my score quite a bit and as a result ended up in about 70th place before the competition ended. You can pass the cross-validation version of the function into RANSAC to perhaps get even better results, but I found this too slow (ie to was still running after a day or two).

# Summary

This wraps up what I did in February and basically the RANSAC version of my best Ensemble is what I submitted as my final result for the competition. Next time I’ll discuss the final results of the competition and how I did on the final test dataset.

## Playing the Kaggle Two Sigma Challenge – Part 2

# Introduction

Last time I introduced the Kaggle Two Sigma Challenge and this time I’ll start describing what I did at the beginning of the competition. The competition started at the beginning of December, 2016 and completed on March 1, 2017. This blog covers what I did in December.

**Update 2017/03/07:** I uploaded the Python source code for the code discussed here to my Google Drive. You can access them here. The files are TensorFlow1.py for the first (wide) TensorFlow attempt, TFNarrow1.py for the second narrow TensorFlow attempt, RegressionLab1.py for my regression one with reinforcement learning and then TreeReg1.py for the Christmas surprise with reinforcement learning added.

# TensorFlow

Since I spent quite a bit of time playing and blogging about predicting the stock market with TensorFlow, this is where I started. The data was all numeric, so it was quite easy to get started, no one hot encoding and really the only pre-processing was to fill in missing values with the pandas fillna function (where I just used the mean since this was easiest). I’ll talk more about these missing values later, but to get started they were easy to fill in and ignore.

I started by just feeding all the data into TensorFlow trying some simple 2, 3 and 4 level neural networks. However my results were quite bad. Either the model couldn’t converge or even if it did, the results were much worse than just submitting zeros for everything.

With all the data the model was quite large, so I thought I should simplify it a bit. The Kaggle competition has a public forum which includes people publishing public Python notebooks and early in every competition there are some very generous people that published detailed statistical analysis and visualizations of all the data. Using this I could select a small subset of data columns which had higher correlations with the results and just use these instead. This then let me run the training longer, but still didn’t produce any useful results.

At this point I decided that given the computing resource limitations of the Kaggle playgrounds, I wouldn’t be able to do a serious neural network, or perhaps doing so just wouldn’t work. I did think of doing the training on my laptop, say running overnight and then copy/pasting the weight/bias arrays into my Python code in the playground to just run. But I never pursued this.

# Penalized Linear Regression

My next thought was to use linear regression since this tends to be good for extrapolation problems since it doesn’t suffer from non-linearities going wild outside of the training data. Generally regular least squares regression can suffer from overfitting, especially when there are a large number of variables and they aren’t particularly linearly independent. Also least squares regression can be thrown off by bad errant data. The general consensus from the forums was that this training set had a lot of outliers for some reason. In machine learning there are a large family of Penalized Linear Regression algorithms that all try to address these problems via one means or another. Generally they do things like start with the most correlated column and then add the next most correlated column and only keep doing this as long as they have a positive effect on the results. They also penalize large weights borrowing the technique we described here. Then there are various methods to filter out outliers or to change their effect by using different metrics than sum of squares. Two popular methods are Lasso regression that uses the taxi-cab metric (sum of difference of absolute values rather than sum of square differences) and Ridge regression which uses sum of squares regression. Then both penalize large coefficients and bring in variables one at a time. Then there is a combined algorithm called Elastic Net Regression that uses a ratio of each and you choose the coefficient.

# First Victory

Playing around with this a bit, I found the scikit-learn algorithm ElasticNetCV worked quite well for me. ElasticNetCV breaks up the training data and then run iteratively testing the value of how many variables to include to find the best result. Choosing the l1 ratio of 0.45 actually put me in the top ten of the submissions. This was a very simple submission, but I was pretty happy to get such a good result.

# Reinforcement Learning

One thing that seemed a bit strange to me about the way the Kaggle Gym worked was that you submitted your results for a given time step and then got a reward for that. However you didn’t get the correct results for the previous timestep. Normally for stock market prediction you predict the next day, then get the correct results at the end of the day and predict the next day. Here you only get a reward which is the R2 score for you submission. The idea is to have an algorithm like the following diagram. But incorporating the R2 score is quite tricky.

I spent a bit of time thinking about this and had the idea that you could sort of calculate the variance from the R2 score and then if you made an assumption about the underlying probability distribution you could then make an estimate of the mean. Then I could introduce a bias to the mean to compensate for cumulative errors as the time gets farther and farther from the training data.

Now there are quite a few problems with this, namely the variance doesn’t give you the sign of the error which is worrying. I tried a few different relationships of mean to variance and found one that improved my score quite bit. But again this was all rather ad-hoc.

Anyway, every ten timesteps I didn’t apply the bias so I could get a new bias and then used the bias on the other timesteps.

# Second Victory

The competition moves fairly quickly so a week or two after my first good score, I was well down in the standings. Adding the my mean bias from the reward to my ElasticNetCV regression put me back into the top 10 again.

# A Christmas Present

I went to bed on Christmas eve in 6th place on the competition leaderboard. I was pretty happy about that. When I checked in on Christmas Day I was down to 80th place on the leaderboard. As a Christmas present to all the competitors one of the then current top people above me made his solution public, which then meant lots of other folks forked his solution, submitted it and got his score.

This solution used a Random Forest algorithm ExtraTreesRegressor from scikit-learn combined with a simple mean based estimate and a simple regression on one variable. The random forest part was interesting because it let the algorithm know which were missing values so it could learn to act appropriately.

At first I was really upset about this, but when I had time I realized I could take that public solution, add my mean bias and improve upon it. I did this and got back into the top ten. So it wasn’t that bad.

# Summary

Well this covered the first month of the competition, two more to go. I think getting into the top ten on the leaderboard a few times gave me the motivation to keep plugging away at the competition and finding some more innovative solutions. Next up January.