Posts Tagged ‘Gradient Descent’
We’ve been playing with TensorFlow for a while now and we have a working model for predicting the stock market. I’m not too sure if we’re beating the stocking picking cat yet, but at least we have a good model where we can experiment and learn about Neural Networks. In this article we’re going to look at the optimization methods available in TensorFlow. There are quite a few of these built into the standard toolkit and since TensorFlow is open source you could create your own optimizer. This article follows on from our previous article on optimization and training.
Weaknesses in Gradient Descent
Gradient Descent has worked for us pretty well so far. Basically it calculates the gradients of the loss function (the partial derivatives of loss by each weight) and moves the weights in the direction of lowering the loss function. However finding the minimums of a complicated nonlinear function is a non-trivial exercise and compound this with the fact that a lot of the data we are feeding in during training is very noisy. In our case the stock market historical data is probably quite contradictory and is probably presenting a good challenge to the training algorithm. Here are some weaknesses these other algorithms attempt to address:
- Learning rate. We have one fixed learning rate (how far we move in the direction of the sign of the gradient). We added an optimization to reduce this learning rate as we proceed, but we use the same learning rate for everything at each step. But some parts of our weight matrix may be changing quickly and other parts remaining close to constant. So perhaps use a different learning rate for each weight/bias and vary it by how fast it’s moving and whether it’s moving consistently in the same direction.
- Getting stuck in local minimums or wandering around plateaus. Are we getting stuck in a local minimum which is much worse than the global minimum we would like to find? How can we power past global minimums and continue to the real goal?
The optimizers included with TensorFlow are all variations on Gradient Descent. There are many other optimizers that people use like simulated annealing, conjugate gradient and ant colony optimization but these tend to either not work well with multi-layer Neural Networks or don’t parallelize well to run on GPUs or a distributed network or are far too computationally intensive for large matrices. We added to the code all the optimizers and you just uncomment the one that you want to use.
# optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step) # optimizer = tf.train.AdadeltaOptimizer(starter_learning_rate).minimize(loss) # optimizer = tf.train.AdagradOptimizer(starter_learning_rate).minimize(loss) # promising # optimizer = tf.train.AdamOptimizer(starter_learning_rate).minimize(loss) # promising # optimizer = tf.train.MomentumOptimizer(starter_learning_rate, 0.001).minimize(loss) # diverges # optimizer = tf.train.FtrlOptimizer(starter_learning_rate).minimize(loss) # promising optimizer = tf.train.RMSPropOptimizer(starter_learning_rate).minimize(loss) # promising
Perhaps it would be less hacky to make this a parameter to the program, but we’ll leave that till we need it.
Let’s quickly summarize what each optimizer tries to accomplish:
- MomentumOptimizer: If gradient descent is navigating down a valley with steep sides, it tends to madly oscillate from one valley wall to the other without making much progress down the valley. This is because the largest gradients point up and down the valley walls whereas the gradient along the floor of the valley is quite small. Momentum Optimization attempts to remedy this by keeping track of the prior gradients and if they keep changing direction then damp them, and if the gradients stay in the same direction then reward them. This way the valley wall gradients get reduced and the valley floor gradient enhanced. Unfortunately this particular optimizer diverges for the stock market data.
- AdagradOptimizer: Adagrad is optimized to finding needles in haystacks and for dealing with large sparse matrices. It keeps track of the previous changes and will amplify the changes for weights that change infrequently and suppress the changes for weights that change frequently. This algorithm seemed promising for the stock market data.
- AdadeltaOptimizer: Adadelta is an extension of Adagrad that only remembers a fixed size window of previous changes. This tends to make the algorithm less aggressive than pure Adagrad. Adadelta seemed to not work as well as Adagrad for the stock market data.
- AdamOptimizer: Adaptive Moment Estimation (Adam) keeps separate learning rates for each weight as well as an exponentially decaying average of previous gradients. This combines elements of Momentum and Adagrad together and is fairly memory efficient since it doesn’t keep a history of anything (just the rolling averages). It is reputed to work well for both sparse matrices and noisy data. Adam seems promising for the stock market data.
- FtrlOptimizer: Ftrl-Proximal was developed for ad-click prediction where they had billions of dimensions and hence huge matrices of weights that were very sparse. The main feature here is to keep near zero weights at zero, so calculations can be skipped and optimized. This algorithm was promising on our stock market data.
- RMSPropOptimizer: RMSprop is similar to Adam it just uses different moving averages but has the same goals.
Neural networks can be quite different and the best algorithm for the job may depend a lot on the data you are trying to train the network with. Each of these optimizers has several tunable parameters. Besides initial learning rate, I’ve left all the others at the default. We could write a meta-trainer that tries to find an optimal solution for which optimizer to use and with which values of its tunable parameters. You would want a quite powerful distributed set of computers to run this on.
Optimization is a tricky subject with Neural Networks, a lot depends on the quality and quantity of your data. It also depends on the size of your model and the contents of the weight matrices. A lot of these optimizers are tuned for rather specific problems like image recognition or ad click-through prediction; however, if you have a unique problem them largely you are left to trial and error (whether automated or manual) to determine the best solution.
Note that a lot of practitioners stick with basic gradient descent since they know it quite well, rather than relying on the newer algorithms. Often massaging your data or altering the random starting point can be a better area to focus on.
Output of Layer = ActivationFunction( A x (Input of Layer) + b )
We also specified that our input vector would be 900 elements large (the 30 Dow stocks times the last 30 price changes) and the output vector would be 30 elements (then next price change for each of the Dow 30 stocks). This means that if we have just one hidden layer of say 100 Neurons then we need a 900×100 matrix and a 100×30 matrix plus a 100 element bias vector and a 30 element bias vector. This means we need 900×100 + 100×30 + 100 + 30 = 93,130 values. Where do these all come from? In this article we’ll look at where we get these.
What we want to do is use some sort of known or historical data to train the Neural Network. When Neural Networks were first proposed, Computer Scientists tuned these by hand which resulted in taking a long time to get a very small Neural Network that didn’t work well. Later on many methods were developed to calculate these from databases of known cases, however until recently these databases were too small to be effective and led to extreme over-fitting. With the advent of big data, shared cloud resources and automated data collection, a large number of high quality extremely large databases are available to train Neural Networks for well know problems like hand writing recognition or shape identification. Notice that in the introduction to find 93,130 values requires far more than 93,130 bits of data, since this will lead to over-fitting (which we’ll talk a lot about in a future article).
If you remember back to basic statistics and linear regression, we found the best fit for a straight line through a number of data points by minimizing the squares of the distance from the line to each data point. This is why its often called least squares regression. Basically we are formulating the problem as an optimization problem where we are trying to minimize an error function. The linear regression problem is then easily solvable by first year linear algebra. For the Neural Network case it’s a little bit more complicated, but the basic idea is the same.
To train our Neural Network we will use historical data where we provide 30 days of price changes for the Dow 30 stocks and then we know the next change so we can provide the error for our error function. To define an error function, we are going to start by just using the square of the difference, so basically just doing least square minimization just like least squares regression. In TensorFlow we can define our loss function as:
loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))
Now that we have the data and an error function how do we go about training our network. First we start by seeding the matrix weights with normally distributed random numbers. TensorFlow provides some help here
layer1_weights = tf.Variable(tf.truncated_normal(
[NHistData * num_stocks, num_hidden], stddev=0.05))
to define our matrix and initialize it with normalized random numbers.
There are a number of optimization algorithms that can be used to solve this problem. The one we are going to use is called Gradient Descent which is a form of Back Propagation. The key property of back propagation algorithms is that they can be applied to Neural Networks with multiple hidden layers. The basic idea is that you take the partial derivative of the loss function with respect to each weight. This gives you a gradient with respect to each weight and then based on whether the gradient is positive or negative you can increase or decrease the weight by a little bit. This little bit is the learning rate which is a parameter to the algorithm (or can be changed dynamically by another algorithm). You then run the training data through this algorithm and hopefully observe your error function decreasing as you go.
This is then the basis of training your network. Once you have the weights you can calculate all the values as you like.
A big danger here is that you are overfitting. You reduce the error function to next than nothing and the network works well for all your training data. You then try it on something else and it produces very bad results. This is similar to fitting a 10th degree polynomial through 11 data points. It fits all those points exactly, but has no predictive value outside of those exact points.
A common technique is to divide the training data into three buckets: actual training data, testing data and final validation data. You use the training data to train and as you train, you use the testing data to see how you are doing. Then when everything is finished you use the final validation data to do a final test (where the training process has never seen this data). This then gives you an idea of how well the network will fare out in the real world. We will make the size of these three buckets configurable.
Local Versus Global Minimums
During the training process a few different things could happen. The solution could diverge, the error could just keep getting larger and larger. The solution could get stuck in a valley and just orbit a minimum value without converging to it. The solution could converge, but to a local minimum rather than the global minimum. These are all things that need to be watched out for.
Since the initial values are random, re-running the training can lead to quite different solutions. For some problems you want to train repeatedly to get the best solution. Or perhaps compare different optimization algorithms to see which gives the best result. Another idea is to use a combination of algorithms, perhaps start with one that gets into the correct neighborhood, and then another that can zero in on it.
There are quite a few tricks to get out of local minimums and to escape valleys using various random numbers. One is to change the learning rate to occasionally take a bigger jump. Others are to try some random perturbations to see if you can start converging to another solution.
Batch Versus Single
A lot of time we process the training data in batches where we take the average of the partial derivatives to adjust the weights. This can greatly speed up training and avoids the problem of one bad data point sending us in the wrong direction. Again the batch size is a meta-parameter to the training algorithm that we can tune to get the best results.
This was a really quick introduction to training a Neural Network. There are many optimization algorithms that can be applied to solve this problem, but we are starting with gradient descent. A number of the algorithms chosen, are done so to facilitate using a GPU or distributed network to parallelize and hence speed up the training process.
Next time we’ll start looking at the TensorFlow code for a simple Neural Network model, then we will start enhancing it to get better results.