In the last part of this series we presented a complete Python program to demonstrate how to create a simple feed forward Neural Network to predict the price changes in the thirty stocks that comprise the Dow Jones Index. The program worked, but its predictions tended to be very close to zero. Perhaps this isn’t too hard to understand: if the data is fairly random, then predicting close to zero might actually be what minimizes the loss. Or perhaps there is something wrong with our model. In this article we’ll start to look at how to improve a Neural Network model.
Remember that humans can’t predict the stock market, so this might not be possible. However, Neural Networks are very good at finding and recognizing patterns, so if the technical analysts are correct and such patterns really do exist, then Neural Networks should be able to find them.
The new source code for this is located here.
You might notice that there are quite a few rather arbitrary looking constants in the sample program. How do we know the values chosen are optimal? How can we go about tuning them? Some of these parameters include:
- The learning rate passed to the GradientDescentOptimizer function (currently 0.01).
- The number of steps we train the model with the data.
- The batch size.
- The number of hidden nodes.
- The number of hidden layers.
- The mean and standard deviation of the random numbers we seed the weights with.
- We use the tanh activation function; would another activation function be better?
- Are we providing too much data to the model, or too little data?
- We are providing the last 30 price changes. Should we try different values?
- We are using the gradient descent optimizer; would another optimizer be better?
- We are using least squares for our loss function; would another loss function be better?
All these parameters affect the results. How do we know we’ve chosen good values? Some of these are constants, some require code changes but all are fairly configurable. Since TensorFlow is a toolkit, there are lots of options to choose from for all of these.
For this article we are going to look at a number of best practices for these parameters and try to improve the model by following them.
Another approach is to treat this as an optimization problem and run it through an optimizer like gradient descent to find the best values for these parameters. Yet another is to use evolutionary algorithms to evolve a better solution. Perhaps we’ll have a look at these in a future article.
The learning rate affects how far we move on each step as we zero in on the minimum. If you use a bigger value it moves further. If you use too big a value it will overshoot the minimum and perhaps miss it. If you use too small a value it can get stuck in valleys or take an extremely long time to converge. I found that for this problem if I set the learning rate to 0.1 then the program doesn’t converge and the weights actually diverge to infinity. So we can’t make this much larger than the current 0.01.
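To make the overshoot behaviour concrete, here is a minimal sketch on a toy function, f(x) = x squared, rather than on the stock model. The function name and the three learning rates are made up for illustration, and the point at which this toy diverges has nothing to do with the 0.1 threshold seen in the real network.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x             # derivative of x**2
        x -= learning_rate * gradient  # step against the gradient
    return x


too_small = gradient_descent(0.01)  # creeps toward 0, still far away after 20 steps
reasonable = gradient_descent(0.4)  # lands very close to the minimum at 0
too_big = gradient_descent(1.1)     # every step overshoots, so |x| keeps growing

print(too_small, reasonable, too_big)
```

Each update multiplies x by (1 - 2 * learning_rate), so once that factor is larger than 1 in magnitude the iterates grow instead of shrinking, which is the same kind of divergence described above.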
What we can do is make the learning rate variable so it starts out larger and then gets smaller as we zero in on a solution. So let’s look at how to do that in our Python code.
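# Setup Learning Rate Decay
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.01
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
    5000, 0.96, staircase=True)
# Optimizer. Pass in the decayed learning_rate rather than a constant,
# otherwise the decay has no effect. minimize() increments global_step
# on every training step.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)
Basically TensorFlow gives us a way to do this with the exponential_decay function, with assistance from the optimizer, which increments global_step on every training step.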
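As a rough illustration of what that schedule does, the sketch below reproduces the documented exponential_decay formula in plain Python: the starting rate multiplied by the decay rate raised to global_step divided by decay_steps, with integer division when staircase is True. The helper name is mine, not part of TensorFlow.

```python
def decayed_learning_rate(starter_rate, global_step, decay_steps, decay_rate,
                          staircase=True):
    """Plain-Python version of the exponential decay schedule."""
    if staircase:
        exponent = global_step // decay_steps  # drops in discrete steps
    else:
        exponent = global_step / decay_steps   # decays smoothly
    return starter_rate * decay_rate ** exponent


# Same settings as the exponential_decay call: 0.01, 5000 steps, 0.96.
for step in (0, 4999, 5000, 10000, 50000):
    print(step, decayed_learning_rate(0.01, step, 5000, 0.96))
```

With these settings the rate stays at 0.01 for the first 5000 steps, drops to about 0.0096 for the next 5000, and so on, shrinking by 4% each time.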
The other thing we’ve done is increase the batch size to 10. Increasing the batch size past 10 leads to the training process diverging, so we are limited to this. We have also dramatically increased the number of steps in the training process. Previously we kept this short to speed up processing time, but now that we are improving the model we need to give it time to train. At the same time we don’t want it to over-train and overfit the data. There are two ways of running the program. One is with no test or validation data, so you can train with all the data to get the best result on the next prediction. The other is with test and validation steps, where you can see when the test and validation results stop improving and hence when to stop training.
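One way to decide when to stop is to watch the validation loss and give up once it has failed to improve for a few checks in a row. The sketch below is not taken from the program; the function name, the patience value and the loss numbers are all invented for illustration.

```python
def best_stopping_step(validation_losses, patience=3):
    """Return the index of the best validation loss seen before `patience`
    consecutive checks fail to improve on it."""
    best_loss = float("inf")
    best_step = 0
    checks_without_improvement = 0
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_step = step
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # validation has stalled, so stop training here
    return best_step


# Hypothetical validation losses recorded at regular intervals during training.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.40]
print(best_stopping_step(losses))  # 3
```

Note that a fixed patience can stop too early: in this made-up series a later check would have improved again, so the patience value is itself one more parameter to tune.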
We are currently providing the previous 30 price changes for each of the Dow components. Perhaps this isn’t sufficient? Perhaps there is different data we could add? One common practice in stock technical analysis is to consider stock price moves when the trading volume is high as more important than when the trading volume is low. So perhaps if we add the trading volumes to the input data it will help give better results.
The way we do this is to modify the sliding windows so they now provide 60 values for each of the Dow stocks: each price change followed by the matching volume. So our input vector now alternates price changes and volumes and has grown to 1800 in length (30 stocks times 60 values). Trading volumes of Dow stocks are quite large numbers, so we normalize them by dividing by the first volume for each stock. This way they tend to be in the range 0 to 10. In this case the volume isn’t an output anywhere, so we don’t need it in the range of the activation functions, but we do want to keep the weights in the matrix from getting too large.
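The layout of the new input vector can be sketched as follows. This is not the code from the source file; the function name, constants and sample numbers are invented, and the same made-up window is reused for every stock just to show the resulting length.

```python
def interleave_window(price_changes, volumes, first_volume):
    """Return [change, volume/first_volume, change, volume/first_volume, ...]."""
    window = []
    for change, volume in zip(price_changes, volumes):
        window.append(change)
        window.append(volume / first_volume)  # normalised volume
    return window


NUM_STOCKS = 30  # Dow components
NUM_DAYS = 30    # price changes per stock in each sliding window

# Made-up data for one stock.
price_changes = [0.001 * day for day in range(NUM_DAYS)]
volumes = [2_000_000 + 50_000 * day for day in range(NUM_DAYS)]
first_volume = volumes[0]  # assumed here to be the stock's first recorded volume

one_stock = interleave_window(price_changes, volumes, first_volume)

# The full input vector is the per-stock windows laid end to end.
input_vector = []
for _ in range(NUM_STOCKS):
    input_vector.extend(one_stock)

print(len(one_stock), len(input_vector))  # 60 1800
```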
I won’t list all the code changes to do that here, but you can download the source file from my Google Drive.
We could also consider adding more data, perhaps by varying the amount of historical data. We could add calculated values that technical analysts find useful, like weighted averages, Bollinger bands etc. We could get other metrics from Web services, such as fundamental data on the stock, perhaps P/E ratio or market capitalization. We could construct metrics like the number of news feed stories on a stock. All these are possibilities to include. Perhaps later on we can look at methods to tell whether these help. After training the model you can see which inputs don’t lead anywhere, i.e. all their connections have zero weights. By analysing these we can see which values the training process has considered important and which ones can be removed to simplify our model.
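A minimal sketch of that check, assuming the first layer's trained weights have been pulled out as a list of rows with one row per input and one column per hidden node. The function name, the tolerance and the sample matrix are made up; in practice weights rarely land on exactly zero, so a small tolerance is used here.

```python
def unused_inputs(first_layer_weights, tolerance=1e-6):
    """Return indices of inputs whose outgoing weights are all within
    `tolerance` of zero. Rows are inputs, columns are hidden nodes."""
    return [
        index
        for index, row in enumerate(first_layer_weights)
        if all(abs(weight) <= tolerance for weight in row)
    ]


# Hypothetical trained weights for 4 inputs feeding 3 hidden nodes.
weights = [
    [0.42, -0.13, 0.07],
    [0.0, 0.0, 0.0],
    [1e-9, -2e-8, 0.0],
    [0.0, 0.31, 0.0],
]
print(unused_inputs(weights))  # [1, 2]
```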
Number of Hidden Nodes
In the initial implementation we just set the number of hidden nodes to 16 in each layer. Partly we used a small number to make the calculation quicker so we could iterate on our coding faster. But what values should we use?
Neural Networks are very good at pattern recognition. They accomplish this in a similar way to the brain. They take heavily pixelated data and then discover structure in that data as they go from layer to layer. We want to encourage that, to have the network summarize the data somewhat as it goes from layer to layer. A common way to accomplish that is to have a large number of nodes adjacent to the input layer and then reduce it with each following layer.
We will start with a larger number, namely 64, and then halve it with each following layer.
# Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
    [NHistData * num_stocks, num_hidden], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([num_hidden]))
layer2_weights = tf.Variable(tf.truncated_normal(
    [num_hidden, int(num_hidden / 2)], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(1.0, shape=[int(num_hidden / 2)]))
layer3_weights = tf.Variable(tf.truncated_normal(
    [int(num_hidden / 2), int(num_hidden / 4)], stddev=0.1))
layer3_biases = tf.Variable(tf.constant(1.0, shape=[int(num_hidden / 4)]))
layer4_weights = tf.Variable(tf.truncated_normal(
    [int(num_hidden / 4), num_labels], stddev=0.1))
layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
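To see what those shapes work out to, here is a small sketch that computes each weight matrix’s dimensions. It assumes 1800 inputs (as described for the price and volume data) and 30 outputs, one per Dow stock; the value of num_labels is my assumption, and the helper name is made up.

```python
def layer_shapes(num_inputs, num_hidden, num_labels):
    """Return the (rows, columns) shape of each weight matrix when the
    hidden layer width is halved at every layer."""
    sizes = [num_inputs, num_hidden, num_hidden // 2, num_hidden // 4, num_labels]
    return list(zip(sizes[:-1], sizes[1:]))


shapes = layer_shapes(num_inputs=1800, num_hidden=64, num_labels=30)
print(shapes)  # [(1800, 64), (64, 32), (32, 16), (16, 30)]

# Total number of weights, not counting biases.
total_weights = sum(rows * columns for rows, columns in shapes)
print(total_weights)  # 118240
```

Almost all of the weights sit in the first matrix, which is why the width of the first hidden layer has such a large effect on training time.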
I would have liked to use a larger number closer to the number of inputs, but it appears the model diverges for 128 or more hidden nodes in the first layer. I’ll have to investigate why some time in the future.
Below are the results for Sept. 26. Notice that the results are much better. It predicted the generally down day on the Dow and it isn’t being so conservative in its estimates anymore. For some stocks, like MMM and MCD, it got really close. There were others, like PG and GS, that it got quite wrong. Overall though we seem to be starting to get something useful. However, to reiterate, don’t bet your real hard earned money based on this simple model.
This article covered a few improvements that could easily be made to our model. This is still a very simple basic model, so please don’t use it to trade stocks. Even if you have a really good model, the hedge funds will still eat you alive.
Generally, this is how developing Neural Networks goes. You start with a simple small model and then iteratively enhance it to develop it into something good.