This posting will conclude my coverage of Kaggle’s Two Sigma Financial Modeling Challenge. I introduced the challenge here and then blogged on what I did in December, January and February. I had planned to write this article after the challenge is fully finished, however Kaggle is still reviewing the final entries and I’m not too sure how long that is going to take. I’m not sure if this delay is being caused by Google purchasing Kaggle or just the nature of judging a competition where code is entered rather than just uploading data. I’ll just go by what is known at this point and then if there are any further surprises I’ll post an update.
Public vs Private Leaderboards
During the competition you are given a data set which you can download or run in the Kaggle Cloud. You could train on all this data offline or if you did a test run in the cloud it would train on half the data and then test on the other half. When you submit an entry it would train on all of this data set and then test against a hidden dataset that us competitors never see.
This hidden dataset was divided into two parts, ⅓ of the data was used to calculate our public score which was revealed to us and provided our score on the leaderboard during the competition. The other ⅔ was reserved for the final evaluation at the end of the competition which was called the private leaderboard score. The private leaderboard score was secret to us, so we could only make decisions based on the public leaderboard scores.
My Final Submissions
You last step into the competition is to choose two submissions as your official entries. These will then be judged based on their private leaderboard scores (which you don’t know). If you don’t choose two entries, it will select your two highest entries. I just chose my two highest entries (based on the public leaderboard).
The Results Revealed
So after the competition closed on March 1, the private leaderboard was made public. This was quite a shock since it didn’t resemble the public leaderboard at all. Kaggle hasn’t said much about the internal mechanisms of the competition so some of the following is speculation either on my part of from the public forms. It appears that the ⅓ – ⅔ split of the data wasn’t a random sample but was a time based split and it appears the market conditions were quite different in the second two thirds than the first third. This then led to quite different scores. For myself I dropped from 71st place to 1086th place (out of 2071).
What Went Wrong
At first I didn’t know what had gone wrong. I had lots of theories but had no way to test them. A few days later they revealed the private leaderboard scores for all our submissions so I could get a better idea of what worked and what didn’t. It appears that the thing that really killed me was using Orthogonal Matching Pursuit in my ensemble of algorithms. Any submission that included this had a much worse private leaderboard score than the public leaderboard score. On the reverse side any submission that used Ridge regression did better on the private leaderboard score than the public leaderboard.
Since I chose my two best entries, they were based on the same algorithm and got the same bad private leaderboard score. I should have chosen something quite different as my second entry and with luck would have gotten a much better overall score.
There is a tendency to blame overfitting on solutions that did well on the public leaderboard but badly on the private leaderboard. But with only two submissions a day and given the size of the data, I don’t think this was the case. I think it’s more a matter of not having a good representative test set. Especially given how big a shake up there was.
Before I knew the private leaderboard scores for all my submissions I was worried the problem was caused by either offline training or my use of reinforcement learning, but these turned out to be ok. So here is list of what worked for me:
- Training offline was fine. It provided good results and fit the current Kaggle competition.
- RANSAC did provide better results.
- My reinforcement learning results gave equivalent improvements on both the public and private leaderboards.
- Lasso and ElasticNet worked about the same for both leaderboards.
- ExtraTreesRegressor worked about the same for both leaderboards.
- Using current and old time series data worked about the same for both leaderboards.
My best private leaderboard submission was an ExtraTreesRegressor where it had added columns for last timestamp data on a few select columns. I then had several ensembles score well as long as they didn’t include Orthogonal Matching Pursuit.
How the Winners Won
Several people that scored in the top ten revealed their winning strategies. A key idea from the people with stock market experience was to partition the data based on an estimate of market volatility and then use a different model for each volatility range. So for instance use one model when things are calm (small deltas) and and a different model when volatile. These seemed to be a common theme. One that did quite well divided the data into three volatile ranges and then used Ridge Regression on each. Another added the volatility as a calculated column and then used offline gradient boosting to generate the model. The very top people have so far kept their solutions secret, probably waiting for the next stock market competition to come along.
Advice for New Competitors
Here is some advice I have based on my current experience:
- Leverage the expertise on the public forum. Great code and ideas get posted here. Perhaps even wait a bit before starting, so you can get a jump start.
- Research a bit of domain knowledge in the topic. Usually a bit of domain knowledge goes a long way.
- Deal with outliers, either by removing or clipping them or using an algorithm like RANSAC to reduce their effect.
- Don’t spend a lot of time fine tuning tunable parameters. The results will be different on the private leaderboard anyway so you don’t really know if this is helping or not.
- Know a range of algorithms. Usually an ensemble wins. Know gradient boosting, it seems to always be a contender.
- Don’t get too wrapped up in the competition, there is a lot of luck involved. Remember this is only one specific dataset (usually) and there is a large amount of data that you will be judged against that you don’t get to test against.
- Select your two entries to win based on completely different algorithms so you have a better chance with the private leaderboard.
- Only enter a challenge if you are actually interested in the topic. Don’t just enter for the sake of entering.
Some Suggestions for Improvement
Based on my experience with this challenge, here are some of my frustrations that I think could be eliminated by some changes to the competition.
- Don’t make all the data anonymous. Include a one line description of each column. To me the competition would have been way better (and fairer) if we knew better what we were dealing with.
- Given the processing restrictions on the competition, provide a decent dataset which isn’t full of missing values. I think it would have been better if reasonable values were filled in or some other system was used. Given the size of the dataset, not much could be done about these.
- A more rectangular structure to the data would have helped processing in the limited resources and enhanced accuracy. For instance stocks that enter the portfolio still have prices from before that point and these could be included. This would have helped treating the dataset as a time series easier.
- Including columns for the market would have be great (like the Dow30 or S&P500). Most stock models heavily follow the market so this would have been a big help.
- Be more upfront on the rules. For instance is offline training allowed? Be explicit.
- Provide information on how the public/private leaderboard data is split. Personally I think it should have been a random sample rather than a time based split.
- Give the VMs a bit more oomph. Perhaps now that Google owns Kaggle these will get more power in the Google cloud. But keep them free, unlike the YouTube challenge where you get a $300 credit which is used up very quickly.
This wraps up my coverage of the Kaggle Two Sigma Financial Challenge. It was very interesting and educational participating in the challenge. I was disappointed in the final result, but that is part of the learning curve. I will enter future challenges (assuming Google keeps doing these) and hopefully can apply what I’ve learned along the way.
The Kaggle Two Sigma Financial Modeling Challenge ran from December 1, 2016 through March 1, 2017. In previous blog posts I introduced the challenge, covered what I did in December then what I did in January. In this posting I’ll continue on with what I did in February. This consisted of refining my work from before, finding ways to refine the methods I was using and getting more done during the Kaggle VM runs.
Usually training was the slowest part of running these solution. It was quite hard to setup a solution with ensemble averaging when you only had time to train one algorithm. Within the Kaggle community there are a number of people that religiously rely on gradient boosting for their solutions and gradient boosting has provided the key components in previous winning solutions. Unfortunately in this competition it was very hard to get gradient boosting to converge within the runtime provided. Some of the participants took to training gradient boosting offline locally on their computers and then taking the trained model and inserting it into the source code to run in the Kaggle VM. This was quite painful since the trained model is a binary Python object. So they pickled it to a string and then output the string as an ascii representation of the hex digits that they could cut and paste into the Kaggle source code. The problem here was that the Kaggle source file is limited to 1meg in size, so it limited the size of the model they could use. However a number of people got this to work.
I thought about this and realized that for linear regression, this was much easier. In linear regression the model only requires the coefficient array which is the size of the number of variables and the intercept. So generating these and cut/pasting them into the Kaggle solution is quite easy. I was a bit worried that the final test data would have different training data, which would cause this method to fail, but in the end it turned out to be ok. A few people questioned whether this was against the rules of the competition, but no one could quote an exact rule to prevent it, just that you might need to provide the code that produced the numbers. Kaggle never gave a definitive answer to this question when asked.
With all this in mind, I trained my regression models offline. Some algorithms are quite slow so this opened up quite a few possibilities. I basically ran through all the regression algorithms in scikit-learn and then used a collection of them that gave the best scores individually. Scikit-learn has a lot of regression algorithms and many of them didn’t perform very well. The best results I got were for Lasso, ElasticNet (with L1 ratios bigger than 0.4) and Orthogonal Matching Pursuit. Generally I found the algorithms that eliminated a lot of variables (setting their coefficients to zero) worked the best. I was a bit surprised that Ridge regression worked quite badly for me (more on that next time). I also tried some adding some polynomial components using the scikit-learn PolynomialFeatures function, but I couldn’t find anything useful here.
I trained these models using cross-validation (ie the CV versions of the functions). Cross-validation divides the data up and does various training/testing on different folds to find the best results. To some degree this avoids overfitting and provides more robustness to bad data.
Further I ran these regressions on two views of the data, one on my last data/current data on a bunch of columns and the other on the whole dataset but just for the current time stamp. Once doing this for one regression, adding more regressions didn’t seem to slow down processing much and the overall time I was using wasn’t much. So I had enough processing time leftover to add an ExtraTreesRegressor which was trained during the runs.
It took quite a few submissions to figure out a good balance of solutions. Perhaps with more time a better optimum could have been obtained, but hard time limits are often good.
A number of people in the competition with more of a data background spent quite a bit of time cleaning the data which seemed quite noisy with quite a few bad outliers. I wasn’t really keen on this and really wanted my ML algorithms to do this for me. This is when I discovered the the scikit-learn functions for dealing with outliers and modeling errors. The one I found useful was RANSAC (RANdom SAmple Consensus). I thought this was quite a clever algorithm to use subsets of the data to figure out the outliers (by how far they were from various prediction) and to find a good subset of the data without outliers to train on. You pass a linear model into RANSAC to use for estimating and then you can get the coefficients out at the end to use. The downside is that running RANSAC is very slow and to get good results it would take me about 8 hours to train a single linear model.
The good news here is that using RANSAC rather than cross-validation, I improved my score quite a bit and as a result ended up in about 70th place before the competition ended. You can pass the cross-validation version of the function into RANSAC to perhaps get even better results, but I found this too slow (ie to was still running after a day or two).
This wraps up what I did in February and basically the RANSAC version of my best Ensemble is what I submitted as my final result for the competition. Next time I’ll discuss the final results of the competition and how I did on the final test dataset.
Previously I introduced the Kaggle Two Sigma Financial Modeling Challenge which ran from December 1, 2016 to March 1, 2017. Then last time I covered what I did in December. This time I’ll cover what I worked on in January.
Update 2017/03/07: I added the source code for adding the previous value of a few columns to the data in RegressionLab4.py posted here.
Time Series Data
Usually when you predict the stock market you use a time series of previous values to predict the next value. With this challenge the data was presented in a different way, namely you were given a lot of data on a given stock at a point of time and then asked to predict the next value. Of course you don’t have to use the data exactly as given, you can manipulate it into a better format. So there is nothing stopping you reformatting the data so for a given timestamp you also have a number of historical data points, you just need to remember them and add them to the data table. Sounds easy.
Generally computers are good at this sort of thing, however for this challenge we had 20 minutes of running time and 8Gig of RAM to do it. For testing training runs there were about 800,000 rows of training data and then 500,000 of test rows. These all needed to be reformatted and historical values held in memory. Further you couldn’t just do this as an array because the stock symbols changed from timestamp to timestamp. Ie symbols were inserted and removed meaning that you had to stay indexed by the symbol to ensure you were shifting data properly. The pandas library has good support for this sort of thing, but even with pandas it tended to be expensive for processing time and memory usage.
My first attempt was why not just keep the 10 last values of everything and feed that into scikit learn algorithms to see what they liked. Basically this got nowhere since my runs were just aborted as they hit the memory limit. Next I tried adding a few time series looking columns to the data table and feeded that into ExtraTreesRegressor, this worked quite well but I couldn’t had much more data without running out of memory or slowing things down so I couldn’t use many trees in the algorithm.
From this experience, I tried just selecting a few rows and presented different and tried keeping different numbers of historical data. Experimenting I found I got the best results using 36 columns and keeping 2 timestamps of history. This wasn’t much work on my part bart took quite a few days since you only get two submissions per day to test against the larger submission set.
Strictly speaking this isn’t a time series since I don’t really have historical values of the variable I’m predicting, however it is theorized (but not confirmed) that some of the input variables include weighted averages of historical data, so it might not be that far off.
Ensemble averaging is the technique of taking a number of machine learning algorithms that solve a particular problem and taking the average of their results as the final result. If the algorithms are truly independent then there are theorems in probability that support this. But even if they aren’t fully independent, practical results show that this does provide surprisingly good results. Further typically most Kaggle competitions are won by some sort of weighted average of a good number of algorithms. Basically this approach averages out errors and biases that an individual algorithm might introduce.
Note that the Christmas surprise solution from the previous blog article was really an ensemble of three algorithms where the average was quite a bit better than any of the individual components.
I now suspected I had enough ways to slice the data and had tried quite a few algorithms that towards the end of January I could start combining them into ensembles to get better results. I started by combining a number of regression algorithms since these were fairly fast to train and execute (especially on a reduced set of columns). I found that the regressions that gave the best results were ones that eliminated a lot of variables and just had 8 or so non-zero coefficients. This surprised me a bit, since I would have expected better results out of Ridge regression, but didn’t seem to be able to get them.
This moved me up the leaderboard a bit, but generally through January I dropped down the leaderboard and found it a bit of a struggle to stay in the top 100.
I also spent a bit of time trying to get better values for the missing values. Usually with stock data you can use the pandas fillna using back or forward filling (or even interpolating). However these didn’t work so well because the data wasn’t strictly rectangular due to the stocks being added and removed. Most things I tried just used too much processing time to be practical. In fact just doing a fillna on the mean (or median) values on the training data was a pretty big time user. So I tried this offline running locally on my laptop to see if I could get anywhere. I figured if I got better results then I could try to optimize them and get it going in the Kaggle VM. But even with quite a bit of running it seemed that I didn’t get any better results this way, so I gave up on it. I suspect practically speaking the ML algorithms just ignored most of the columns with a lot of missing values anyway.
This was a quick overview of my progress in January. Next up the final month of February. One thing that was good about the two submission per day limit was that it limited the amount of work I did on the competition since it could be kind of addictive.
Last time I introduced the Kaggle Two Sigma Challenge and this time I’ll start describing what I did at the beginning of the competition. The competition started at the beginning of December, 2016 and completed on March 1, 2017. This blog covers what I did in December.
Update 2017/03/07: I uploaded the Python source code for the code discussed here to my Google Drive. You can access them here. The files are TensorFlow1.py for the first (wide) TensorFlow attempt, TFNarrow1.py for the second narrow TensorFlow attempt, RegressionLab1.py for my regression one with reinforcement learning and then TreeReg1.py for the Christmas surprise with reinforcement learning added.
Since I spent quite a bit of time playing and blogging about predicting the stock market with TensorFlow, this is where I started. The data was all numeric, so it was quite easy to get started, no one hot encoding and really the only pre-processing was to fill in missing values with the pandas fillna function (where I just used the mean since this was easiest). I’ll talk more about these missing values later, but to get started they were easy to fill in and ignore.
I started by just feeding all the data into TensorFlow trying some simple 2, 3 and 4 level neural networks. However my results were quite bad. Either the model couldn’t converge or even if it did, the results were much worse than just submitting zeros for everything.
With all the data the model was quite large, so I thought I should simplify it a bit. The Kaggle competition has a public forum which includes people publishing public Python notebooks and early in every competition there are some very generous people that published detailed statistical analysis and visualizations of all the data. Using this I could select a small subset of data columns which had higher correlations with the results and just use these instead. This then let me run the training longer, but still didn’t produce any useful results.
At this point I decided that given the computing resource limitations of the Kaggle playgrounds, I wouldn’t be able to do a serious neural network, or perhaps doing so just wouldn’t work. I did think of doing the training on my laptop, say running overnight and then copy/pasting the weight/bias arrays into my Python code in the playground to just run. But I never pursued this.
Penalized Linear Regression
My next thought was to use linear regression since this tends to be good for extrapolation problems since it doesn’t suffer from non-linearities going wild outside of the training data. Generally regular least squares regression can suffer from overfitting, especially when there are a large number of variables and they aren’t particularly linearly independent. Also least squares regression can be thrown off by bad errant data. The general consensus from the forums was that this training set had a lot of outliers for some reason. In machine learning there are a large family of Penalized Linear Regression algorithms that all try to address these problems via one means or another. Generally they do things like start with the most correlated column and then add the next most correlated column and only keep doing this as long as they have a positive effect on the results. They also penalize large weights borrowing the technique we described here. Then there are various methods to filter out outliers or to change their effect by using different metrics than sum of squares. Two popular methods are Lasso regression that uses the taxi-cab metric (sum of difference of absolute values rather than sum of square differences) and Ridge regression which uses sum of squares regression. Then both penalize large coefficients and bring in variables one at a time. Then there is a combined algorithm called Elastic Net Regression that uses a ratio of each and you choose the coefficient.
Playing around with this a bit, I found the scikit-learn algorithm ElasticNetCV worked quite well for me. ElasticNetCV breaks up the training data and then run iteratively testing the value of how many variables to include to find the best result. Choosing the l1 ratio of 0.45 actually put me in the top ten of the submissions. This was a very simple submission, but I was pretty happy to get such a good result.
One thing that seemed a bit strange to me about the way the Kaggle Gym worked was that you submitted your results for a given time step and then got a reward for that. However you didn’t get the correct results for the previous timestep. Normally for stock market prediction you predict the next day, then get the correct results at the end of the day and predict the next day. Here you only get a reward which is the R2 score for you submission. The idea is to have an algorithm like the following diagram. But incorporating the R2 score is quite tricky.
I spent a bit of time thinking about this and had the idea that you could sort of calculate the variance from the R2 score and then if you made an assumption about the underlying probability distribution you could then make an estimate of the mean. Then I could introduce a bias to the mean to compensate for cumulative errors as the time gets farther and farther from the training data.
Now there are quite a few problems with this, namely the variance doesn’t give you the sign of the error which is worrying. I tried a few different relationships of mean to variance and found one that improved my score quite bit. But again this was all rather ad-hoc.
Anyway, every ten timesteps I didn’t apply the bias so I could get a new bias and then used the bias on the other timesteps.
The competition moves fairly quickly so a week or two after my first good score, I was well down in the standings. Adding the my mean bias from the reward to my ElasticNetCV regression put me back into the top 10 again.
A Christmas Present
I went to bed on Christmas eve in 6th place on the competition leaderboard. I was pretty happy about that. When I checked in on Christmas Day I was down to 80th place on the leaderboard. As a Christmas present to all the competitors one of the then current top people above me made his solution public, which then meant lots of other folks forked his solution, submitted it and got his score.
This solution used a Random Forest algorithm ExtraTreesRegressor from scikit-learn combined with a simple mean based estimate and a simple regression on one variable. The random forest part was interesting because it let the algorithm know which were missing values so it could learn to act appropriately.
At first I was really upset about this, but when I had time I realized I could take that public solution, add my mean bias and improve upon it. I did this and got back into the top ten. So it wasn’t that bad.
Well this covered the first month of the competition, two more to go. I think getting into the top ten on the leaderboard a few times gave me the motivation to keep plugging away at the competition and finding some more innovative solutions. Next up January.
As I started learning about machine learning and playing with simple problems, I wasn’t really satisfied with the standard datasets everyone starts out with like MNINST. So I went to playing with stock market predictions which was fun, but there was really no metric on how well I was doing (especially since I wasn’t going to use real money). Then as I was reading various books on machine learning and AI, I often ran into references to Kaggle competitions.
Kaggle is a company that hosts machine learning competitions. It also facilitates hosting data sets, mentoring people and promoting machine learning education and research. The competitions are very popular in the machine learning community and often have quite large cash prizes, though a lot of people just do it to get Kaggle competition badges.
Kaggle appealed to me because there were quite a few interesting data sets and you could compare how your algorithms were doing against the other people playing there. The competitions usually run for 3 or 4 months and I wanted to start one at the beginning rather than jump into the middle of one or play with the dataset of an already completed competition so I waited for the next one to start.
The next competition to start was the AllState Claims Severity challenge where you predicted how much money someone was going to cost the insurance company. There was no prize money with this one and I wasn’t really keen on the problem. However the dataset was well suited to using TensorFlow and Neural Networks, so I started on that. I only played with it for a month before abandoning it for the next competition, but my last score ended up being 1126th out of 3055 teams.
Then came the Outbrain Click prediction competition. Outbrain is that annoying company that places ads at the bottom of news articles and they wanted help better predicting what you might be interested in. This competition had $25,000 in prizes, but besides not really wanting to help Outbrain, I quickly realized that the problem was way beyond what I could work on with my little MacBook Air. So I abandoned that competition very quickly.
Next came the Santander Bank Product Recommendation competition where you predicted other banking products to recommend to customers. There was $60,000 in prize money for this one. I played with this one a bit, but didn’t really get anywhere. I think the problem was largely dealing with bad data which although important, isn’t really what I’m interested in.
Then along came the Two Sigma Financial Modelling Challenge. This one had $100,000 in prize money and was actually a problem I’m interested in. I played this competition to completion and my plan is to detail my journey over a few blog posts. Only the top seven entries receive any prize money and since these competitions are entered by University research teams, corporate AI departments and many other people, it is very hard to actually win any money. The top 14 win a gold medal, the top 5% get a silver medal and the top 10% a bronze. 2071 teams entered this competition.
The Two Sigma Financial Modeling Challenge
One of the problems I had with the first set of challenges was that you ran your models on your own computer and then just submitted the results to the competition for evaluation. This put me at a disadvantage to people with access to much more powerful computers or access to large cloud computing infrastructure. I didn’t really want to spend money on the cloud or buy specialized hardware for a hobby. With the Two Sigma challenge this changed completely. With this challenge you run your code in a Kaggle hosted docker image and rather than submit your results, you submit your code and it has to run in the resources of the Kaggle hosted image. This then leveled the playing field for all the competitors. This restricted you to programming in Python (which I like, but many R programmers objected to) and using the standard Python machine learning libraries, but they seemed to include anything I’d ever heard of.
The provided dataset consisted of 60 or so fundamental metrics, 40 or so technical indicators and 5 derived indicators. Plus in the test set the value of what you are trying to predict. No further explanation of what anything was was given, it was up to you to make what you could of the data and predict the target variable. The data was grouped by id and timestamp so you appeared to be tracking a portfolio of about 1000 stocks or commodities through time. The stocks in the portfolio changed over time and when you were predicting values you weren’t explicitly given the previous values of what you were predicting. There was some feedback in that you predicted the results for the portfolio one timestamp at a time and received a score for each submitted group for a timestamp.
We were given about 40% of the data to train our models and do test runs, which we could either do locally or in the Kaggle cloud. Then when you submitted your code for evaluation it ran on the full dataset, training on that 40% and then predicting the rest that we never saw. You could only submit two entries a day for evaluation, so you had to make each one count. This was to stop various algorithms of simply being able to overfit the evaluation data to win the competition (ie limit cheating).
We could run against the test data all we liked in the Kaggle cloud. But there were restrictions on processing time and memory usage. When running we were allowed 20 minutes processing and 8Gig of RAM. When submitting we were allowed 1 hour or processing and 16Gig or RAM. This tended to work out due to the size difference in the data sets used. Given the size of the dataset, this meant your training and evaluation had to be very efficient to run under the given constraints. We’ll discuss the tradeoffs I had to make to run in this environment quite a bit over the coming blog articles, this especially meant you had to use pandas and numpy very efficiently and avoided Python loops at all costs.
Well I seem to have filled up this blog post without even starting the competition. In future posts I’ll cover the various algorithms that I tried. What worked and what didn’t. And the progression of results that eventually combined to give me my final score. I’ll also discuss what is revealed by the winners and how what they did differed from what I did, what I missed and what I got right. Anyway I think this going to take quite a few posts to cover everything.
I’ve been using Google’s TensorFlow machine learning platform for some time now starting with version 0.8, going onto 0.9 and now playing with 1.0 which was released last week. There are some really good videos from the release summit posted on YouTube here. This blog article looks at the evolution of TensorFlow and what 1.0 brings to the table.
Installing the new TensorFlow 1.0 on MacOS was fairly painless, I chose to install it natively rather than using a VM type solution since I don’t try to run multiple versions of Python, just stick to the latest. They recommend using Docker or other VM technology to avoid having to install at all, but I didn’t have any problems.
More Than Neural Networks
TensorFlow has always been built on a low level compute engine that executes graphs of operations on matrices and vectors (tensors). However the main tutorials and higher level functions were always oriented to performing Neural Network calculations. It contains very good algorithms for training Neural Networks and had all the supporting functions you needed to create very powerful Neural Network models. It contained a Linear Regression function, but this was mainly used as a simple tutorial rather than anything real.
With 1.0 TensorFlow is adding a large number of other popular machine learning algorithms out of the box so you can use Random Forests, Support Vector Machines, and many other standard libraries that you find in more complete libraries like scikit-learn. The list of standard algorithms isn’t as full as scikit-learn yet, and a very notable omission is the ensemble method of gradient boosting (which is promised sometime soon).
I’ve been entering some Kaggle competitions where penalized regression, random forests and gradient boosting are often the algorithms that produce the best results. However TensorFlow under Keras has been doing quite well. Often the winning solution is a combination of several of these, since an average of independent techniques will give better results.
The good thing about this is that TensorFlow provides very good GPU and other hardware accelerator support, so now all these algorithms can benefit from this. In addition Google is now offering (in beta) a machine learning cloud service which runs TensorFlow on optimized accelerated hardware. In the past if this only had TensorFlow the usage would have been limited since most full applications use a combination of algorithms in the final deployment.
As TensorFlow went through the 0.x versions, there were quite a few API changes that caused you to be frequently updating your programs. With version 1.0 the claim is that for the part of TensorFlow that is in the core library, API compatibility will now be maintained.
A lot of the changes for 1.0 were to make the naming conventions more standard, including following the lead of Python’s Numpy library (so the same function didn’t have a different name in NumPy vs TensorFlow). All this should make coding a bit more straightforward and reduce always having to look everything up continuously.
However beware that a lot of the new advertised features in TensorFlow 1.0 are not in the core library yet, and so their API may change until they are moved there.
The good thing is that Google provided a Python script to convert previous TensorFlow Python programs up to the new API level. This worked fine for my programs, so as to make the process rather painless.
Higher Level APIs
A criticism of TensorFlow was that although it was a great low level framework, it was difficult or tedious to do a number of standard operations, like for instance setting up a simple multi-level neural network. Due to this omission sevel developers created competing high level abstractions to run on various lower level libraries. Probably the most successful of these is Keras which runs on top of both TensorFlow and Theano.
With 1.0 TensorFlow is adding a higher level API which works with all the various algorithms it contains as well as adding a Keras compatible library as a nod to the heavy adoption that Keras has enjoyed.
The non-neural network algorithms follow the API conventions in scikit-learn, which are very efficient. The whole thing is also oriented so you can feed one component into another so you can easily build a compound model consisting of several algorithms and then easily train and deploy the whole thing.
Generally this is a good thing for people looking to just use TensorFlow since the amount of code you need to write becomes much smaller and it embodies all the TensorFlow best practices so it works properly with TensorBoard, deploys flexibly, etc.
The TensorFlow documentation has been greatly improved. The tutorials are way better and it’s much easier to get a basic understanding of TensorFlow from the introductory material. There are also many more videos available as well as training courses.
Although this is all a huge step forward, one annoying side effect is that all the external links, say from Stack Overflow articles (or even Google searches) are now broken.
Some of the other notable additions include a new experimental TensorFlow compiler XLA, APIs for Go and Java, addition of a command line debugger, improvements to TensorBoard for better visualizations and lots of additional hardware support.
Windows support was added in version 0.10 which is new since my original blogs. There is support to use Qualcomm DSP chips for co-processing which should greatly enhance the capabilities of Android phones containing this chip.
TensorFlow has come a long way over the last year from a rather specialized Neural Network tool, evolving into a complete machine learning platform. The open source community around TensorFlow is extremely vibrant and extends quite far beyond just Google employees. Looking at what is scheduled for the next couple of point releases looks very exciting and I’m finding this tool becoming more powerful in leaps and bounds.
With sophisticated Neural Networks, you are dealing with a quite complicated nonlinear function. When fitting a high degree polynomial to a few data points, the polynomial can go through all the points, but have such steep slopes that it is useless for predicting points between the training points, we get this same sort of behaviour in Neural Networks. In a way you are training the Neural Network to exactly memorize all the training data exactly rather than figuring out the trends and patterns that you can use to predict other values.
One solution is to perhaps gather more training data, however this may be impossible or quite expensive. It also might be that the training data is missing some representative samples. Here we’ll concentrate on what we can do with the algorithm rather than trying to improve the data.
Interpolation and Extrapolation
Here we refer to generalization as wanting to get answers to data that isn’t in the training data. We refer to overfitting as the case where the model works really well for the training data but doesn’t do nearly as well for anything else.
There are two distinct cases we want to worry about. One is interpolation, this is trying to estimate values where the inputs are surrounded by data in the training set. Extrapolation is the process of trying to predict what happens beyond the training data. Our stock market data is an example of extrapolation. Recognizing handwriting is an example of interpolation (assuming you have a good sample of training data)
Extrapolation tends to be a much harder problem than interpolation, but both a strongly affected by overfitting.
What we often do is divide our training data into three groups. The largest of these we call the training data and use for training. Another is the test data which we run after training to see how well the algorithm works on data that hasn’t been seen by training. To help with detecting overfitting we create a third group which we run after a certain number of steps during training. The following screenshot shows the results for the training and validation sets (this is for a Kaggle competition so the test set needs to be submitted to Kaggle to get the answer). Here smaller values are better. Notice that the training data gets better starting at 3209.5 and going down to 712.8 which indicates training is working. However the validation data starts at 3014.3 goes down to the 1160s and then starts increasing. This indicates we are overfitting the data.
The approach here is really simple, let’s just stop once the validation data starts increasing. So let’s just stop at this point and say we’re done. This is actually a pretty simple and effective way to prevent overfitting. As an added bonus this is a rare technique that leads to faster training.
Penalizing Large Weights
A sign of overfitting is that the slope of our function is high at the points in the training data. Since the slope is approximated by the appropriate weights in our matrix, we would want to keep the weights in our weight matrices low. The way we accomplish this is to add a penalty to the loss function based on the size of the weights.
loss = (tf.nn.l2_loss( tf.sub(logits, tf_train_labels)) + tf.nn.l2_loss(layer1_weights)*beta + tf.nn.l2_loss(layer2_weights)*beta + tf.nn.l2_loss(layer3_weights)*beta + tf.nn.l2_loss(layer4_weights)*beta)
Here we add the sum of the squares of the weights to our loss function. The factor beta is there to let us scale this value to be in the same magnitude as the main loss function. I’ve found that in some problems making the loss due to the weights about equal to the main loss works quite well. In another problem I found choosing beta so that the weights are 10% of the main loss worked quite well.
I have found that combining this with early stopping works quite well. The weight penalty lets us train longer before we start overfitting, which leads to a better overall result.
One property of the Neural Networks in our brain is that brain cells die, but our brain seems to mostly keep on working. In this sense the brain is far more resilient to damage than a computer. The idea behind dropout is to try to add rules to train the Neural Network to be resilient to Neurons being removed. This means the Neural Network can’t be completely reliant on any given Neuron since it could die (be removed from the model).
The way we accomplish this is we add a dropout activation function at some point:
if dropout: hidden = tf.nn.dropout(hidden, 0.5)
This activation function will remove 50% of the neurons at this layer and scale up its outputs by a matching amount. This is so the sum stays the same which means you can use the same weights whether dropout is present or not.
The reason for the if statement is that you only want to do dropout during training and not during validation, testing or production.
You would do this on each hidden layer. It’s rather surprising that the Neural Network still works as well as it does with this much dropout.
I find dropout doesn’t always help, but when it does you can combine it with penalizing the weights and then you can train longer before you need to stop during overfitting. This can sometimes help a network find finer details without overfitting.
When you do dropout, you do have to train for a longer time, so if this is too time prohibitive you might not want to use it.
I think it’s a good sign that Neural Networks can exhibit the same resilience to damage that the brain shows. Perhaps a bit of biological evidence that we are on the correct track.
These are a few techniques you can use to avoid overfitting your model. I generally use all three so I can train a bit longer without overfitting. If you can get more good training data that can also help quite a bit. Using a simpler model (with fewer hidden nodes) can also help with overfitting, but perhaps not provide as good a functional approximation as the more complicated model. As with all things in computer science you are always trading off complexity, overfitting and performance.