## The Road to TensorFlow – Part 11: Generalization and Overfitting

# Introduction

With sophisticated Neural Networks, you are dealing with a quite complicated nonlinear function. When you fit a high-degree polynomial to a few data points, the polynomial can pass through every point yet have such steep slopes that it is useless for predicting values between the training points. We get this same sort of behaviour in Neural Networks. In a way, you are training the Neural Network to memorize the training data exactly rather than to figure out the trends and patterns that you can use to predict other values.
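To make the polynomial analogy concrete, here is a minimal sketch in plain Python. The target function, the 11 evenly spaced sample points and the helper name `lagrange_predict` are all assumptions chosen for illustration; nothing here comes from the networks in this series. It fits the unique degree-10 polynomial through the samples and compares the error at the training points with the error between them.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def target(x):
    # Smooth "true" function the training points are sampled from.
    return 1.0 / (1.0 + 25.0 * x * x)


# 11 evenly spaced training points on [-1, 1] give a degree-10 polynomial.
train_x = [-1.0 + 0.2 * i for i in range(11)]
train_y = [target(x) for x in train_x]

# Error at the training points: the fit "memorizes" them.
train_error = max(abs(lagrange_predict(train_x, train_y, x) - y)
                  for x, y in zip(train_x, train_y))

# Error on a fine grid between the training points: much larger.
grid = [-1.0 + 0.01 * i for i in range(201)]
between_error = max(abs(lagrange_predict(train_x, train_y, x) - target(x))
                    for x in grid)

print(f"max error at training points:      {train_error:.2e}")
print(f"max error between training points: {between_error:.2f}")
```

The polynomial reproduces every training point, yet near the ends of the interval it swings far away from the smooth function it was sampled from. That is the same pattern as a network with a low training loss and a poor validation loss.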

We’ve touched upon this problem in other articles, but glossed over what we do about it. In this article we’ll explore the remedies in more detail.

One solution is to gather more training data; however, this may be impossible or quite expensive. It might also be that the training data is missing some representative samples. Here we’ll concentrate on what we can do with the algorithm rather than trying to improve the data.

# Interpolation and Extrapolation

Here generalization means getting good answers for inputs that aren’t in the training data. Overfitting is the case where the model works really well on the training data but doesn’t do nearly as well on anything else.

There are two distinct cases we want to worry about. One is interpolation: estimating values where the inputs are surrounded by data in the training set. The other is extrapolation: predicting what happens beyond the range of the training data. Our stock market data is an example of extrapolation. Recognizing handwriting is an example of interpolation (assuming you have a good sample of training data).

Extrapolation tends to be a much harder problem than interpolation, but both are strongly affected by overfitting.
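As a toy illustration of the difference, the sketch below uses a piecewise-linear model in plain Python. The target `sin(x)`, the training range and the `predict` helper are assumptions made up for this example. The model is queried once inside the range covered by the training points and once well beyond it.

```python
import math

# Training inputs cover only [0, pi/2]; the hypothetical target is sin(x).
train_x = [i * (math.pi / 2) / 10 for i in range(11)]
train_y = [math.sin(x) for x in train_x]


def predict(x):
    """Piecewise-linear model: interpolate inside the training range,
    and extend the nearest end segment outside it."""
    # Pick the segment containing x, clamped to the first/last segment.
    k = 0
    while k < len(train_x) - 2 and x > train_x[k + 1]:
        k += 1
    x0, x1 = train_x[k], train_x[k + 1]
    y0, y1 = train_y[k], train_y[k + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


inside = 0.7        # surrounded by training points: interpolation
outside = math.pi   # beyond the training range: extrapolation

interp_error = abs(predict(inside) - math.sin(inside))
extrap_error = abs(predict(outside) - math.sin(outside))
print(f"interpolation error at x={inside:.2f}: {interp_error:.4f}")
print(f"extrapolation error at x={outside:.2f}: {extrap_error:.4f}")
```

Inside the training range the neighbouring points pin the prediction down. Outside it the model can only continue the last trend it saw, and nothing in the training data says whether that trend holds.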

# Early Stopping

What we often do is divide our data into three groups. The largest of these we call the training data and use for training. Another is the test data, which we run after training to see how well the algorithm works on data it hasn’t seen during training. To help detect overfitting, we create a third group, the validation data, which we evaluate after every certain number of steps during training. The following screenshot shows the results for the training and validation sets (this is for a Kaggle competition, so the test set needs to be submitted to Kaggle to get the answer). Here smaller values are better. Notice that the training result improves from 3209.5 down to 712.8, which indicates training is working. However, the validation result starts at 3014.3, goes down to the 1160s and then starts increasing. This indicates we are overfitting the data.

The approach here is really simple: stop training once the validation result starts getting worse (here, increasing) and say we’re done. This is a simple and effective way to prevent overfitting. As an added bonus, it is one of the rare techniques that leads to faster training.
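The stopping rule can be sketched in a few lines of plain Python. The function name `best_checkpoint`, the `patience` parameter and the loss values are assumptions for illustration (the numbers are made up, not taken from the screenshot). With `patience=1` it stops at the first check where the validation loss fails to improve, which is the rule described above; a larger value tolerates a few noisy checks before giving up.

```python
def best_checkpoint(validation_losses, patience=1):
    """Return the index of the validation check to keep.

    Scans the losses in order and stops once the loss has failed to
    improve for `patience` consecutive checks. The returned index is
    the check with the lowest loss seen up to that point.
    """
    best_index = 0
    checks_without_improvement = 0
    for i, loss in enumerate(validation_losses):
        if loss < validation_losses[best_index]:
            best_index = i
            checks_without_improvement = 0
        elif i > 0:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # validation loss is rising: stop training here
    return best_index


# Made-up validation losses: falling at first, then rising again.
history = [30.0, 21.0, 15.0, 11.7, 11.6, 11.9, 12.4]
keep = best_checkpoint(history)
print(f"keep the weights from check {keep} (loss {history[keep]})")
```

In a real training loop you would save the weights at each validation check so that the best ones can be restored after stopping.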

# Penalizing Large Weights

A sign of overfitting is that the slope of our function is high at the points in the training data. Since the slope is approximated by the appropriate weights in our matrix, we would want to keep the weights in our weight matrices low. The way we accomplish this is to add a penalty to the loss function based on the size of the weights.

    # beta scales the weight penalty relative to the main (data) loss.
    loss = (tf.nn.l2_loss(tf.subtract(logits, tf_train_labels))
            + beta * tf.nn.l2_loss(layer1_weights)
            + beta * tf.nn.l2_loss(layer2_weights)
            + beta * tf.nn.l2_loss(layer3_weights)
            + beta * tf.nn.l2_loss(layer4_weights))

Here we add a penalty based on the sum of the squares of the weights to our loss function (tf.nn.l2_loss returns half the sum of the squares of its input). The factor beta lets us scale this penalty to be of the same magnitude as the main loss. I’ve found that in some problems making the loss due to the weights about equal to the main loss works quite well. In another problem I found that choosing beta so that the weight penalty is about 10% of the main loss worked quite well.
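To show how beta relates the two parts of the loss, here is a small plain-Python sketch. The helper `beta_for_ratio` and the numbers are hypothetical, and `l2_loss` is a stand-in that mirrors what tf.nn.l2_loss computes rather than the TensorFlow function itself. It picks beta so that the weight penalty is a chosen fraction of the main loss at one point in training.

```python
def l2_loss(values):
    """Half the sum of squares, the same quantity tf.nn.l2_loss computes."""
    return sum(v * v for v in values) / 2.0


def beta_for_ratio(main_loss, weights, ratio):
    """Pick beta so that beta * l2_loss(weights) == ratio * main_loss."""
    return ratio * main_loss / l2_loss(weights)


# Hypothetical numbers: four weights and a main loss measured at some step.
weights = [3.0, -4.0, 1.0, 2.0]
main_loss = 6.0

# Aim for a weight penalty of about 10% of the main loss.
beta = beta_for_ratio(main_loss, weights, ratio=0.10)
penalty = beta * l2_loss(weights)
total_loss = main_loss + penalty
print(f"beta={beta:.4f} penalty={penalty:.3f} total={total_loss:.3f}")
```

Both the main loss and the weights change as training proceeds, so a beta chosen this way is only a starting point. In practice it is a value you tune by watching the validation results.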

I have found that combining this with early stopping works quite well. The weight penalty lets us train longer before we start overfitting, which leads to a better overall result.

# Dropout

One property of the Neural Networks in our brain is that brain cells die, but our brain seems to mostly keep on working. In this sense the brain is far more resilient to damage than a computer. The idea behind dropout is to try to add rules to train the Neural Network to be resilient to Neurons being removed. This means the Neural Network can’t be completely reliant on any given Neuron since it could die (be removed from the model).

The way we accomplish this is to add a dropout operation at some point:

if dropout: hidden = tf.nn.dropout(hidden, 0.5)

This operation zeroes each neuron’s output at this layer with probability 0.5, so roughly half of them are removed on each training step, and scales up the surviving outputs by a matching amount (a factor of 2 here). This keeps the expected sum the same, which means you can use the same weights whether dropout is present or not.

The reason for the if statement is that you only want to do dropout during training and not during validation, testing or production.
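To see why the scaling lets you reuse the same weights, here is a plain-Python sketch of the idea. It is a stand-in written for illustration, not TensorFlow’s implementation; the `dropout` function, its `training` flag and the `rng` argument are assumptions of this example.

```python
import random


def dropout(values, keep_prob, training, rng=random):
    """Dropout on a list of activations.

    During training each value is kept with probability keep_prob and
    scaled by 1 / keep_prob; otherwise it is zeroed. Outside training
    the values pass through unchanged.
    """
    if not training:
        return list(values)
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


rng = random.Random(0)      # fixed seed so the run is repeatable
hidden = [1.0] * 10000      # stand-in for one layer's activations

train_out = dropout(hidden, 0.5, training=True, rng=rng)
eval_out = dropout(hidden, 0.5, training=False)

print(sum(hidden), sum(eval_out))   # identical: dropout is switched off
print(sum(train_out))               # close to the original sum, not exact
```

About half of the training outputs are zero and the rest are doubled, so the layer’s total stays roughly where it was. That is why the next layer’s weights still make sense when dropout is switched off for validation, testing or production.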

You would do this on each hidden layer. It’s rather surprising that the Neural Network still works as well as it does with this much dropout.

I find dropout doesn’t always help, but when it does you can combine it with penalizing the weights and train longer before overfitting forces you to stop. This can sometimes help a network find finer details without overfitting.

When you use dropout, you do have to train for longer, so if training time is a constraint you might not want to use it.

I think it’s a good sign that Neural Networks can exhibit the same resilience to damage that the brain shows. Perhaps a bit of biological evidence that we are on the correct track.

# Summary

These are a few techniques you can use to avoid overfitting your model. I generally use all three so I can train a bit longer without overfitting. If you can get more good training data that can also help quite a bit. Using a simpler model (with fewer hidden nodes) can also help with overfitting, but perhaps not provide as good a functional approximation as the more complicated model. As with all things in computer science you are always trading off complexity, overfitting and performance.


In my opinion, it is difficult to expect that the machine learning approach can improve on the basic unpredictability of stock price movements. If you analyze a single stock’s returns over a longer period of time, you find an almost normal distribution, which is shifted a bit to the right if the stock is financially and otherwise a healthy one. This implies that the probabilities of both price ups and downs are very near 50%, and this cannot be improved by any machine learning algorithm relying only on past price or returns data. If you are using daily data and forecasting a day in advance, there is also the problem of events after the market closes, which can influence the stock price for the next day and would further decrease the probability of correct predictions. Your algorithm can be improved a bit by including stocks that are sensitive to various market events (for example, bank stocks for modelling the impact of interest rate changes, or gold stocks for modelling the impact of uncertainty), but the above unpredictability would still persist. So it is difficult to expect probabilities much greater than a little over 50%, which is more or less useless for short-term predictive purposes.

Vital Sever, April 2, 2017 at 9:57 am

As I mentioned in the comment above, it would be difficult to use your modelling approach for short-term forecasting of the stock market, even with an improved stock structure. However, you can use this or a similar modelling approach in a contrarian way: to search for market inconsistencies that could eventually be exploited. One such inconsistency would be, for example, a stock for which the relative size of the prediction error consistently stays very low.

Vital Sever, April 3, 2017 at 1:06 am
