## The Road to TensorFlow – Part 11: Generalization and Overfitting

# Introduction

With sophisticated Neural Networks, you are dealing with a quite complicated nonlinear function. When fitting a high degree polynomial to a few data points, the polynomial can go through all the points, but have such steep slopes that it is useless for predicting points between the training points. We get this same sort of behaviour in Neural Networks. In a way you are training the Neural Network to memorize all the training data exactly rather than figuring out the trends and patterns that you can use to predict other values.
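The polynomial analogy is easy to reproduce. The following is a minimal sketch in plain Python (no TensorFlow needed). The target function, the 11 equally spaced sample points and the helper name `lagrange_interpolate` are illustrative choices, not something taken from the stock market program. It fits a degree 10 polynomial exactly through the samples and then measures how far the polynomial strays between them.

```python
def target(x):
    """A smooth function we pretend to know only at a few sample points."""
    return 1.0 / (1.0 + 25.0 * x * x)

def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# 11 equally spaced "training" points on [-1, 1] give a degree 10 polynomial.
train_x = [-1.0 + 0.2 * i for i in range(11)]
train_y = [target(x) for x in train_x]

# Error at the training points: essentially zero, the data is "memorized".
train_error = max(abs(lagrange_interpolate(train_x, train_y, x) - y)
                  for x, y in zip(train_x, train_y))

# Error between the training points: sample a fine grid over the same range.
grid = [-1.0 + 0.002 * i for i in range(1001)]
between_error = max(abs(lagrange_interpolate(train_x, train_y, x) - target(x))
                    for x in grid)

print('max error at training points: %.2e' % train_error)
print('max error between points:     %.2f' % between_error)
```

The target function never leaves the range 0 to 1, yet the error between the sample points is larger than that whole range. A perfect fit on the training data says very little about the points in between.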

We’ve touched upon this problem in other articles like here and here, but glossed over what we are doing about this problem. In this article we’ll explore what we can do about this in more detail.

One solution is to gather more training data, but this may be impossible or quite expensive. It also might be that the training data is missing some representative samples. Here we’ll concentrate on what we can do with the algorithm rather than trying to improve the data.

# Interpolation and Extrapolation

Here we refer to generalization as wanting to get answers to data that isn’t in the training data. We refer to overfitting as the case where the model works really well for the training data but doesn’t do nearly as well for anything else.

There are two distinct cases we want to worry about. One is interpolation: trying to estimate values where the inputs are surrounded by data in the training set. Extrapolation is the process of trying to predict what happens beyond the training data. Our stock market data is an example of extrapolation. Recognizing handwriting is an example of interpolation (assuming you have a good sample of training data).

Extrapolation tends to be a much harder problem than interpolation, but both are strongly affected by overfitting.

# Early Stopping

What we often do is divide our data into three groups. The largest of these we call the training data and use for training. Another is the test data, which we run after training to see how well the algorithm works on data that hasn’t been seen by training. To help with detecting overfitting we create a third group, the validation data, which we run after a certain number of steps during training. The following screenshot shows the results for the training and validation sets (this is for a Kaggle competition so the test set needs to be submitted to Kaggle to get the answer). Here smaller values are better. Notice that the training result keeps improving, starting at 3209.5 and going down to 712.8, which indicates training is working. However the validation result starts at 3014.3, goes down to the 1160s and then starts increasing. This indicates we are overfitting the data.

The approach here is really simple: stop training once the validation result starts getting worse and say we’re done. This is a simple and effective way to prevent overfitting. As an added bonus this is a rare technique that leads to faster training.
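The stopping rule can be sketched in a few lines of plain Python. This is only an illustration: the function name `early_stopping_index`, the `patience` parameter (how many bad checks to tolerate before giving up) and all the loss values after the first one are made up, and a real training loop would also save the weights at the best check so they can be restored.

```python
def early_stopping_index(valid_losses, patience=2):
    """Return (index, loss) of the best validation check, giving up once the
    loss has failed to improve for `patience` consecutive checks."""
    best_loss = float('inf')
    best_index = -1
    checks_without_improvement = 0
    for index, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss = loss
            best_index = index
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # validation loss keeps rising: stop training here
    return best_index, best_loss

# Hypothetical validation losses, one per validation check.
valid_losses = [3014.3, 2100.0, 1500.0, 1200.0, 1165.0, 1180.0, 1250.0, 1400.0]
best_index, best_loss = early_stopping_index(valid_losses)
print('best check: %d with validation loss %.1f' % (best_index, best_loss))
```

Using a patience greater than one is a design choice: validation results are noisy, so a single worse reading doesn't necessarily mean overfitting has started.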

# Penalizing Large Weights

A sign of overfitting is that the slope of our function is high at the points in the training data. Since the slope is approximated by the appropriate weights in our matrix, we would want to keep the weights in our weight matrices low. The way we accomplish this is to add a penalty to the loss function based on the size of the weights.

```python
loss = (tf.nn.l2_loss(tf.sub(logits, tf_train_labels)) +
        tf.nn.l2_loss(layer1_weights) * beta +
        tf.nn.l2_loss(layer2_weights) * beta +
        tf.nn.l2_loss(layer3_weights) * beta +
        tf.nn.l2_loss(layer4_weights) * beta)
```

Here we add the sum of the squares of the weights to our loss function (strictly, tf.nn.l2_loss returns half of that sum). The factor beta is there to let us scale this value to be of the same order of magnitude as the main loss function. I’ve found that in some problems making the loss due to the weights about equal to the main loss works quite well. In another problem I found choosing beta so that the weight penalty is about 10% of the main loss worked quite well.
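One way to pick beta is to measure the main loss and the raw weight penalty once and solve for the ratio you want. The sketch below does that in plain Python. The helper names and the numbers are hypothetical, and since the weights keep changing during training the ratio will only hold approximately afterwards.

```python
def l2_loss(values):
    """Half the sum of squares, which is what tf.nn.l2_loss computes."""
    return sum(v * v for v in values) / 2.0

def beta_for_target_ratio(main_loss, weights, target_ratio):
    """Pick beta so that beta * l2_loss(weights) == target_ratio * main_loss."""
    penalty = l2_loss(weights)
    if penalty == 0.0:
        raise ValueError('all weights are zero, so the penalty cannot be scaled')
    return target_ratio * main_loss / penalty

# Hypothetical numbers: a main loss of 50.0 and a handful of weights.
weights = [0.5, -1.5, 2.0, -0.5, 1.0]
main_loss = 50.0

beta = beta_for_target_ratio(main_loss, weights, target_ratio=0.10)
total_loss = main_loss + beta * l2_loss(weights)
print('beta = %.4f, total loss = %.2f' % (beta, total_loss))
```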

I have found that combining this with early stopping works quite well. The weight penalty lets us train longer before we start overfitting, which leads to a better overall result.

# Dropout

One property of the Neural Networks in our brain is that brain cells die, but our brain seems to mostly keep on working. In this sense the brain is far more resilient to damage than a computer. The idea behind dropout is to try to add rules to train the Neural Network to be resilient to Neurons being removed. This means the Neural Network can’t be completely reliant on any given Neuron since it could die (be removed from the model).

The way we accomplish this is we add a dropout activation function at some point:

```python
if dropout:
    hidden = tf.nn.dropout(hidden, 0.5)
```

This activation function will remove 50% of the neurons at this layer and scale up the remaining outputs by a matching amount (a factor of 2 here). This is so the expected sum stays the same, which means you can use the same weights whether dropout is present or not.

The reason for the if statement is that you only want to do dropout during training and not during validation, testing or production.
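To make the scaling concrete, here is a plain Python sketch of the same idea. It is not TensorFlow's implementation. The function, the `training` flag and the layer values are assumptions for illustration, and the seed is fixed only to make the run repeatable.

```python
import random

def dropout(values, keep_prob, training, rng):
    """Zero each value with probability 1 - keep_prob and scale the
    survivors by 1 / keep_prob. Does nothing when not training."""
    if not training:
        return list(values)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(42)   # fixed seed so the run is repeatable
hidden = [1.0] * 10000    # hypothetical layer outputs

dropped = dropout(hidden, keep_prob=0.5, training=True, rng=rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
print('fraction kept: %.3f' % kept_fraction)
print('sum before: %.1f, sum after: %.1f' % (sum(hidden), sum(dropped)))

# At validation, test or production time the layer is left untouched.
untouched = dropout(hidden, keep_prob=0.5, training=False, rng=rng)
```

Roughly half the values survive and each survivor is doubled, so the sum after dropout stays close to the sum before it.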

You would do this on each hidden layer. It’s rather surprising that the Neural Network still works as well as it does with this much dropout.

I find dropout doesn’t always help, but when it does you can combine it with penalizing the weights and then you can train longer before you need to stop due to overfitting. This can sometimes help a network find finer details without overfitting.

When you do dropout, you do have to train for a longer time, so if the extra training time is prohibitive you might not want to use it.

I think it’s a good sign that Neural Networks can exhibit the same resilience to damage that the brain shows. Perhaps a bit of biological evidence that we are on the correct track.

# Summary

These are a few techniques you can use to avoid overfitting your model. I generally use all three so I can train a bit longer without overfitting. If you can get more good training data that can also help quite a bit. Using a simpler model (with fewer hidden nodes) can also help with overfitting, but perhaps not provide as good a functional approximation as the more complicated model. As with all things in computer science you are always trading off complexity, overfitting and performance.

## The Road to TensorFlow – Part 10: More on Optimization

# Introduction

We’ve been playing with TensorFlow for a while now and we have a working model for predicting the stock market. I’m not too sure if we’re beating the stock picking cat yet, but at least we have a good model where we can experiment and learn about Neural Networks. In this article we’re going to look at the optimization methods available in TensorFlow. There are quite a few of these built into the standard toolkit and since TensorFlow is open source you could create your own optimizer. This article follows on from our previous article on optimization and training.

# Weaknesses in Gradient Descent

Gradient Descent has worked for us pretty well so far. Basically it calculates the gradients of the loss function (the partial derivatives of loss by each weight) and moves the weights in the direction of lowering the loss function. However, finding the minimums of a complicated nonlinear function is a non-trivial exercise. Compound this with the fact that a lot of the data we are feeding in during training is very noisy. In our case the stock market historical data is probably quite contradictory and is probably presenting a good challenge to the training algorithm. Here are some weaknesses these other algorithms attempt to address:

- Learning rate. We have one fixed learning rate (how far we move against the gradient at each step). We added an optimization to reduce this learning rate as we proceed, but we use the same learning rate for everything at each step. But some parts of our weight matrix may be changing quickly and other parts remaining close to constant. So perhaps use a different learning rate for each weight/bias and vary it by how fast it’s moving and whether it’s moving consistently in the same direction.
- Getting stuck in local minimums or wandering around plateaus. Are we getting stuck in a local minimum which is much worse than the global minimum we would like to find? How can we power past local minimums and continue to the real goal?
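The learning rate reduction mentioned in the first point can be written as a formula. This sketch assumes an exponential schedule of the kind tf.train.exponential_decay provides. The text above doesn't say which schedule the program actually uses, and the decay_steps and decay_rate values here are made up for illustration.

```python
def decayed_learning_rate(starter_learning_rate, global_step,
                          decay_steps, decay_rate, staircase=False):
    """Exponential decay: starter * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps  # drop in discrete steps instead
    return starter_learning_rate * decay_rate ** exponent

# Hypothetical settings: start at 0.01 and multiply by 0.96 every 1000 steps.
for step in (0, 1000, 5000, 20000):
    rate = decayed_learning_rate(0.01, step, decay_steps=1000, decay_rate=0.96)
    print('step %6d: learning rate %.6f' % (step, rate))
```

Note that this is still a single learning rate shared by every weight at a given step, which is exactly the limitation the adaptive optimizers try to remove.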

# TensorFlow Optimizers

The optimizers included with TensorFlow are all variations on Gradient Descent. There are many other optimizers that people use like simulated annealing, conjugate gradient and ant colony optimization but these tend to either not work well with multi-layer Neural Networks or don’t parallelize well to run on GPUs or a distributed network or are far too computationally intensive for large matrices. We added to the code all the optimizers and you just uncomment the one that you want to use.

```python
# optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
# optimizer = tf.train.AdadeltaOptimizer(starter_learning_rate).minimize(loss)
# optimizer = tf.train.AdagradOptimizer(starter_learning_rate).minimize(loss) # promising
# optimizer = tf.train.AdamOptimizer(starter_learning_rate).minimize(loss) # promising
# optimizer = tf.train.MomentumOptimizer(starter_learning_rate, 0.001).minimize(loss) # diverges
# optimizer = tf.train.FtrlOptimizer(starter_learning_rate).minimize(loss) # promising
optimizer = tf.train.RMSPropOptimizer(starter_learning_rate).minimize(loss) # promising
```

Perhaps it would be less hacky to make this a parameter to the program, but we’ll leave that till we need it.

Let’s quickly summarize what each optimizer tries to accomplish:

- MomentumOptimizer: If gradient descent is navigating down a valley with steep sides, it tends to madly oscillate from one valley wall to the other without making much progress down the valley. This is because the largest gradients point up and down the valley walls whereas the gradient along the floor of the valley is quite small. Momentum Optimization attempts to remedy this by keeping track of the prior gradients and if they keep changing direction then damp them, and if the gradients stay in the same direction then reward them. This way the valley wall gradients get reduced and the valley floor gradient enhanced. Unfortunately this particular optimizer diverges for the stock market data.
- AdagradOptimizer: Adagrad is well suited to finding needles in haystacks and to dealing with large sparse matrices. It keeps track of the previous changes and will amplify the changes for weights that change infrequently and suppress the changes for weights that change frequently. This algorithm seemed promising for the stock market data.
- AdadeltaOptimizer: Adadelta is an extension of Adagrad that only remembers a fixed size window of previous changes. This makes the decay of the learning rate less aggressive than in pure Adagrad. Adadelta seemed to not work as well as Adagrad for the stock market data.
- AdamOptimizer: Adaptive Moment Estimation (Adam) keeps separate learning rates for each weight as well as an exponentially decaying average of previous gradients. This combines elements of Momentum and Adagrad together and is fairly memory efficient since it doesn’t keep a history of anything (just the rolling averages). It is reputed to work well for both sparse matrices and noisy data. Adam seems promising for the stock market data.
- FtrlOptimizer: Ftrl-Proximal was developed for ad-click prediction where they had billions of dimensions and hence huge matrices of weights that were very sparse. The main feature here is to keep near zero weights at zero, so calculations can be skipped and optimized. This algorithm was promising on our stock market data.
- RMSPropOptimizer: RMSprop is similar to Adam; it just uses different moving averages but has the same goals.
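The per-weight learning rate idea behind these optimizers is easiest to see with Adagrad. The sketch below is the textbook Adagrad rule in plain Python, not TensorFlow's implementation, and the gradient pattern is invented: the first weight gets a gradient on every step, the second only on every tenth step.

```python
import math

def adagrad_step(weights, grads, accum, learning_rate=0.1, eps=1e-8):
    """Apply one Adagrad update in place and return the per-weight step sizes."""
    steps = []
    for i, g in enumerate(grads):
        accum[i] += g * g                      # running sum of squared gradients
        step = learning_rate / (math.sqrt(accum[i]) + eps)
        weights[i] -= step * g
        steps.append(step)
    return steps

weights = [0.0, 0.0]
accum = [0.0, 0.0]
for t in range(100):
    # Weight 0 is updated on every step, weight 1 only on every tenth step.
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    steps = adagrad_step(weights, grads, accum)

print('accumulated squared gradients:', accum)
print('effective learning rates: %.4f (frequent) vs %.4f (rare)' % tuple(steps))
```

The rarely updated weight ends up with an effective learning rate about three times larger than the frequently updated one. Because the accumulator only ever grows, the rates can only shrink, which is the behaviour Adadelta and RMSprop soften by using a limited or decaying history instead.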

Neural networks can be quite different and the best algorithm for the job may depend a lot on the data you are trying to train the network with. Each of these optimizers has several tunable parameters. Besides initial learning rate, I’ve left all the others at the default. We could write a meta-trainer that tries to find an optimal solution for which optimizer to use and with which values of its tunable parameters. You would want a quite powerful distributed set of computers to run this on.

# Summary

Optimization is a tricky subject with Neural Networks; a lot depends on the quality and quantity of your data. It also depends on the size of your model and the contents of the weight matrices. A lot of these optimizers are tuned for rather specific problems like image recognition or ad click-through prediction; however, if you have a unique problem then you are largely left to trial and error (whether automated or manual) to determine the best solution.

Note that a lot of practitioners stick with basic gradient descent since they know it quite well, rather than relying on the newer algorithms. Often massaging your data or altering the random starting point can be a better area to focus on.

## The Road to TensorFlow – Part 9: TensorBoard

# Introduction

We’ve spent some time developing a Neural Network model for predicting the stock market. TensorFlow has produced a fairly black box implementation that is trained by historical data and then can output predictions for tomorrow’s prices.

But what confidence do we have that this model is really doing what we want? Last time we discussed some of the meta-parameters that configure the model. How do we know these are vaguely correct? How do we know if the weights we are training are converging? If we want to step through the model, how do we do that?

TensorFlow comes with a tool called TensorBoard which you can use to get some insight into what is happening. You can’t easily just print variables since they are all internal to the TensorFlow engine and only have values when required as a session is running. There is also the problem of how to visualize the variables. The weights matrix is very large and is constantly changing as you train it. You certainly don’t want to print this out repeatedly, let alone try to read through it.

To use TensorBoard you instrument your program. You tell it what you want to track and assign useful names to those items. This data is then written to log files as your model runs. You then run the TensorBoard program to process these log files and view the results in your Web Browser.

# Something Went Wrong

Due to household logistics I moved my TensorFlow work over to my MacBook Air from running in an Ubuntu VM image on our Windows 10 laptop. Installing Python 3, TensorFlow and the various other libraries I’m using was quite simple and straightforward. Just install Python from Python.org and then use pip3 to install any other libraries. That all worked fine. But when I started running the program from last time, I was getting NaN results quite often. I wondered if TensorFlow wasn’t working right on my Mac. Anyway I went to debug the program and that led me to TensorBoard. As it turns out there was quite a bad bug in the program presented last time due to un-initialized variables.

You tend to get complacent programming in Python about un-initialized variables (and array subscript range errors) because usually Python will raise an exception if you try to use a variable that hasn’t been initialized. The problem is NumPy, which is a library written in C for efficiency. When you create a NumPy array, it is returned to Python, telling Python it’s good to go. But since it’s managed by C code you don’t get the usual Python error checking. So when I changed the program to add the volumes to the price changes, I had a bug that left some of the data arrays uninitialized. I suspect on the Windows 10 laptop that these were initialized to zero, but that all depends on which exact C runtime is being used. On the Mac these values were just random memory and that immediately led to program errors.
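Here is a small sketch of the pitfall and two ways to guard against it. It uses only standard NumPy calls; the array shape and the "forgotten" column are made up to stand in for the real bug.

```python
import numpy as np

shape = (3, 4)

# np.ndarray(shape=...) and np.empty(...) only allocate memory. The contents
# are whatever happened to be there, so any element you forget to assign
# is garbage (and may differ between machines and runs).
uninitialized = np.ndarray(shape=shape, dtype=np.float32)

# np.zeros gives a well defined starting point.
zeros = np.zeros(shape, dtype=np.float32)

# Filling with NaN makes forgotten elements easy to detect afterwards.
checked = np.full(shape, np.nan, dtype=np.float32)
checked[:, :3] = 1.0  # simulate a loop that forgets the last column
missing = np.isnan(checked)
print('unassigned elements:', int(missing.sum()))
```

Starting from zeros hides this kind of bug behind plausible looking data, whereas starting from NaN makes it show up as soon as you check, so the NaN fill is the more defensive choice while debugging.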

Adding the TensorBoard instrumentation showed the problem was originating with the data and then it was fairly straightforward to zero in on the problem and fix it.

As a result, for this article, I’m just going to overwrite the Python file from last time with a newer one (tfstocksdiff2.py) which is posted here. This version includes TensorBoard instrumentation and a couple of other improvements that I’ll talk about next time.

# TensorBoard

First we’ll start with some of the things that TensorBoard shows you. If you read an overview of TensorFlow it’s a bit confusing about what are Tensors and what flows. If you’ve looked at the program so far, it shows quite a few algebraic matrix equations, but where are the Tensors? What TensorFlow does is break these equations down into nodes where each node is a function execution and the data flows along the edges. This is a fairly common way to evaluate algebraic expressions and not unique to TensorFlow. TensorFlow then supports executing these on GPUs and in distributed environments as well as providing all the node types you need to create Neural Networks. TensorBoard gives you a way to visualize these graphs. The names of the nodes are from the program instrumentation.

When the program was instrumented it grouped things together. Here is an expansion of the trainingmodel box where you can see the operations that make up our model.

This gives us some confidence that we have constructed our TensorFlow graph correctly, but doesn’t show any data.

We can track various statistics of all our TensorFlow variables over time. This graph is showing a track of the means of the various weight and bias matrices.

TensorBoard also lets us look at the distribution of the matrix values over time.

TensorBoard also lets us look at histograms of the data and how those histograms evolve over time.

You can see how the layer 1 weights start as their seeded normal distribution of random numbers and then progress to their new values as training progresses. If you look at all these graphs you can see that the values are still progressing when training stops. This is because TensorBoard instrumentation really slows down processing, so I shortened the training steps while using TensorBoard. I could let it run much longer over night to ensure that I am providing sufficient training for all the values to settle down.

# Program Instrumentation

Rather than include all the code here, check out the Google Drive for the Python source file. But quickly we added a function to get all the statistics on a variable:

```python
def variable_summaries(var, name):
    """Attach a lot of summaries to a Tensor."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.scalar_summary('mean/' + name, mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.scalar_summary('stddev/' + name, stddev)
        tf.scalar_summary('max/' + name, tf.reduce_max(var))
        tf.scalar_summary('min/' + name, tf.reduce_min(var))
        tf.histogram_summary(name, var)
```

We define names in the various sections and indicate the data we want to collect:

```python
with tf.name_scope('Layer1'):
    with tf.name_scope('weights'):
        layer1_weights = tf.Variable(tf.truncated_normal(
            [NHistData * num_stocks * 2, num_hidden], stddev=0.1))
        variable_summaries(layer1_weights, 'Layer1' + '/weights')
    with tf.name_scope('biases'):
        layer1_biases = tf.Variable(tf.zeros([num_hidden]))
        variable_summaries(layer1_biases, 'Layer1' + '/biases')
```

Before the call to initialize_all_variables we need to call:

```python
merged = tf.merge_all_summaries()
test_writer = tf.train.SummaryWriter('/tmp/tf/test', session.graph)
```

And then during training:

```python
summary, _, l, predictions = session.run(
    [merged, optimizer, loss, train_prediction], feed_dict=feed_dict)
test_writer.add_summary(summary, i)
```

# Summary

TensorBoard is quite a good tool to give you insight into what is going on in your model: whether the program is correctly doing what you think and whether there is any sanity to the data. It also lets you tune the various parameters to ensure you are getting the best results.

## The Road to TensorFlow – Part 7: Finally Some Code

# Introduction

Well after a long journey through Linux, Python, Python Libraries, the Stock Market, an Introduction to Neural Networks and training Neural Networks we are now ready to look at a complete Python example to predict the stock market.

I placed the full source code listing on my Google Drive here. As described in the previous articles you will need to run this on a Mac or on Linux (could be a virtual image) with Python and TensorFlow installed. You will also need to have the various libraries that are imported at the top of the source file installed or you will get an error when you go to run it. I would suggest getting the source file to play with, since Python is very fussy about indentation and copy/paste from the article may introduce indentation errors caused by the blog formatting.

The Neural Network we are running here is a simple feed forward network with four layers: three hidden layers that use the hyperbolic tangent as the activation function, followed by a linear output layer. This is a very simple model so don’t use it to invest with real money. Hopefully this article gives a flavour for how to create and train a Neural Network using TensorFlow. Then in future articles we can discuss the limitations of this model and how to improve it.

# Import Libraries

First we import all the various libraries we will be using. Note that tensorflow and numpy are particularly important.

```python
# Copyright 2016 Stephen Smith

import time
import math
import os
from datetime import date
from datetime import timedelta
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import pandas_datareader as pdr
from pandas_datareader import data, wb
from six.moves import cPickle as pickle
from yahoo_finance import Share
```

# Get Stock Market Data

Next we get the stock market data. If the file stocks.pickle exists we assume we’ve previously saved this file and use it. Otherwise we get the data from Yahoo Finance using a Web Service call, made via the Pandas DataReader. We only keep the adjusted close column and we fill in any NaN’s with the first value we saw (this really only applies to Visa in this case). The data will all be in a standard Pandas data frame after this.

```python
# Choose amount of historical data to use NHistData
NHistData = 30
TrainDataSetSize = 3000

# Load the Dow 30 stocks from Yahoo into a Pandas datasheet
dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM', 'GE', 'GS',
         'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM', 'MCD', 'MMM', 'MRK', 'MSFT',
         'NKE', 'PFE', 'PG', 'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']
num_stocks = len(dow30)

trainData = None
loadNew = False

# If stocks.pickle exists then this contains saved stock data, so use this,
# else use the Pandas DataReader to get the stock data and then pickle it.
stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
    try:
        with open(stock_filename, 'rb') as f:
            trainData = pickle.load(f)
    except Exception as e:
        print('Unable to process data from', stock_filename, ':', e)
        raise
    print('%s already present - Skipping requesting/pickling.' % stock_filename)
else:
    # Get the historical data. Make the date range quite a bit bigger than
    # TrainDataSetSize since there are no quotes for weekends and holidays. This
    # ensures we have enough data.
    f = pdr.data.DataReader(dow30, 'yahoo',
                            date.today()-timedelta(days=TrainDataSetSize*2+5),
                            date.today())
    cleanData = f.ix['Adj Close']
    trainData = pd.DataFrame(cleanData)
    trainData.fillna(method='backfill', inplace=True)
    loadNew = True
    print('Pickling %s.' % stock_filename)
    try:
        with open(stock_filename, 'wb') as f:
            pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', stock_filename, ':', e)
```

# Normalize the Data

We then normalize the data and remember the factor we used so we can de-normalize the results at the end.

```python
# Normalize the data by dividing each price by the first price for a stock.
# This way all the prices start together at 1.
# Remember the normalizing factors so we can go back to real stock prices
# for our final predictions.
factors = np.ndarray(shape=( num_stocks ), dtype=np.float32)
i = 0
for symbol in dow30:
    factors[i] = trainData[symbol][0]
    trainData[symbol] = trainData[symbol]/trainData[symbol][0]
    i = i + 1
```
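The round trip is worth seeing with numbers. This is a plain Python sketch for a single stock with made-up prices. The `normalize` helper is hypothetical, and the "prediction" is just a stand-in value rather than network output.

```python
def normalize(prices):
    """Divide every price by the first one; return the series and the factor."""
    factor = prices[0]
    return [p / factor for p in prices], factor

# Hypothetical closing prices for one stock.
prices = [50.0, 52.0, 51.0, 55.0]
normalized, factor = normalize(prices)
print(normalized)

# A predicted change in normalized units converts back by multiplying by
# the same factor.
predicted_change = normalized[-1] - normalized[-2]  # stand-in for a prediction
predicted_change_in_dollars = predicted_change * factor
print('%.2f' % predicted_change_in_dollars)
```

Because the network works with price changes rather than prices, multiplying by the factor is all that is needed to get back to dollars.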

# Re-arrange the Data for TensorFlow

Now we need to build up our training data, test data and validation data. We need to format this as input arrays for the Neural Network. Looking at this code, I think true Python programmers will accuse me of being a C programmer (which I am), since I do this all with loops. I’m sure a more experienced Python programmer could accomplish this quicker with more array operations. This part of the code is quite slow so we pickle it, so if we re-run with the saved stock data, we can also use saved training data.

```python
# Configure how much of the data to use for training, testing and validation.
usableData = len(trainData.index) - NHistData + 1
#numTrainData = int(0.6 * usableData)
#numValidData = int(0.2 * usableData)
#numTestData = usableData - numTrainData - numValidData - 1
numTrainData = usableData - 1
numValidData = 0
numTestData = 0

train_dataset = np.ndarray(shape=(numTrainData - 1, num_stocks * NHistData), dtype=np.float32)
train_labels = np.ndarray(shape=(numTrainData - 1, num_stocks), dtype=np.float32)
valid_dataset = np.ndarray(shape=(max(0, numValidData - 1), num_stocks * NHistData), dtype=np.float32)
valid_labels = np.ndarray(shape=(max(0, numValidData - 1), num_stocks), dtype=np.float32)
test_dataset = np.ndarray(shape=(max(0, numTestData - 1), num_stocks * NHistData), dtype=np.float32)
test_labels = np.ndarray(shape=(max(0, numTestData - 1), num_stocks), dtype=np.float32)
final_row = np.ndarray(shape=(1, num_stocks * NHistData), dtype=np.float32)
final_row_prices = np.ndarray(shape=(1, num_stocks * NHistData), dtype=np.float32)

# Build the training datasets in the correct format with the matching labels.
# So if calculate based on last 30 stock prices then the desired
# result is the 31st. So note that the first 29 data points can't be used.
# Rather than use the stock price, use the pricing deltas.
pickle_file = "traindata.pickle"
if loadNew == True or not os.path.exists(pickle_file):
    for i in range(1, numTrainData):
        for j in range(num_stocks):
            for k in range(NHistData):
                train_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k]
                    - trainData[dow30[j]][i + k - 1])
            train_labels[i-1][j] = (trainData[dow30[j]][i + NHistData]
                - trainData[dow30[j]][i + NHistData - 1])

    for i in range(1, numValidData):
        for j in range(num_stocks):
            for k in range(NHistData):
                valid_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData]
                    - trainData[dow30[j]][i + k + numTrainData - 1])
            valid_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData]
                - trainData[dow30[j]][i + NHistData + numTrainData - 1])

    for i in range(1, numTestData):
        for j in range(num_stocks):
            for k in range(NHistData):
                test_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData + numValidData]
                    - trainData[dow30[j]][i + k + numTrainData + numValidData - 1])
            test_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData + numValidData]
                - trainData[dow30[j]][i + NHistData + numTrainData + numValidData - 1])

    try:
        f = open(pickle_file, 'wb')
        save = {
            'train_dataset': train_dataset,
            'train_labels': train_labels,
            'valid_dataset': valid_dataset,
            'valid_labels': valid_labels,
            'test_dataset': test_dataset,
            'test_labels': test_labels,
        }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
        f.close()
    except Exception as e:
        print('Unable to save data to', pickle_file, ':', e)
        raise
else:
    with open(pickle_file, 'rb') as f:
        save = pickle.load(f)
        train_dataset = save['train_dataset']
        train_labels = save['train_labels']
        valid_dataset = save['valid_dataset']
        valid_labels = save['valid_labels']
        test_dataset = save['test_dataset']
        test_labels = save['test_labels']
        del save  # hint to help gc free up memory

for j in range(num_stocks):
    for k in range(NHistData):
        final_row_prices[0][j * NHistData + k] = trainData[dow30[j]][k + len(trainData.index) - NHistData]
        final_row[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index) - NHistData]
            - trainData[dow30[j]][k + len(trainData.index) - NHistData - 1])

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
```
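For comparison, here is one way the training loops could be written with array operations. Treat it as a sketch rather than a drop-in replacement: the function name `build_windows` is hypothetical, it takes a plain (days, stocks) array instead of the Pandas data frame, and it has only been checked against the tiny made-up example below, not the real stock data. It is meant to produce the same layout as train_dataset and train_labels above.

```python
import numpy as np

def build_windows(prices, n_hist):
    """prices has shape (days, stocks). Returns (dataset, labels) where each
    dataset row holds n_hist consecutive price changes per stock, laid out as
    [stock0 changes..., stock1 changes..., ...], and the matching labels row
    holds the next price change for every stock."""
    diffs = np.diff(prices, axis=0)                  # (days - 1, stocks)
    n_rows = diffs.shape[0] - n_hist
    # index[r, k] = r + k selects the k-th change of window r.
    index = np.arange(n_rows)[:, None] + np.arange(n_hist)[None, :]
    windows = diffs[index]                           # (rows, n_hist, stocks)
    dataset = windows.transpose(0, 2, 1).reshape(n_rows, -1)
    labels = diffs[n_hist:]                          # (rows, stocks)
    return dataset, labels

# Tiny made-up example: 6 days, 2 stocks, 3 days of history.
prices = np.array([[1, 10], [2, 10], [4, 11], [7, 11], [11, 12], [16, 12]],
                  dtype=np.float32)
dataset, labels = build_windows(prices, n_hist=3)
print(dataset)
print(labels)
```

The first row of the dataset holds the changes 1, 2, 3 for the first stock followed by 0, 1, 0 for the second, and its label row holds the next change for each stock, 4 and 1.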

# Accuracy

We now set up an accuracy function that is only used to report how we are doing during training. This isn’t used by the training algorithm. It roughly shows what percentage of predictions are within some tolerance.

```python
# This accuracy function is used for reporting progress during training, it isn't actually
# used for training.
def accuracy(predictions, labels):
    err = np.sum( np.isclose(predictions, labels, 0.0, 0.005) ) / (predictions.shape[0] * predictions.shape[1])
    return (100.0 * err)
```
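The two positional numbers passed to np.isclose are the relative tolerance (0.0) and the absolute tolerance (0.005), so a prediction counts as correct when it is within 0.005 of the label in normalized units. The plain Python sketch below spells out the same test on a few made-up values; the function name `accuracy_within` is hypothetical.

```python
def accuracy_within(predictions, labels, tolerance=0.005):
    """Percentage of predictions whose absolute error is at most tolerance."""
    total = 0
    hits = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            total += 1
            if abs(pred - label) <= tolerance:
                hits += 1
    return 100.0 * hits / total

# Made-up predictions and labels: two rows of two stocks each.
predictions = [[0.010, 0.020], [0.000, -0.010]]
labels = [[0.012, 0.030], [0.004, -0.011]]
print('%.1f%%' % accuracy_within(predictions, labels))
```

Three of the four predictions are within the tolerance, so this reports 75.0%.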

# TensorFlow Variables

We now start setting up TensorFlow by creating our graph and defining our datasets and variables.

```python
batch_size = 4
num_hidden = 16
num_labels = num_stocks

graph = tf.Graph()

# input is 30 days of dow 30 prices normalized to be between 0 and 1.
# output is 30 values for normalized next day price change of dow stocks
# use a 4 level neural network to compute this.
with graph.as_default():
    # Input data.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, num_stocks * NHistData))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    tf_final_dataset = tf.constant(final_row)

    # Variables.
    layer1_weights = tf.Variable(tf.truncated_normal(
        [NHistData * num_stocks, num_hidden], stddev=0.05))
    layer1_biases = tf.Variable(tf.zeros([num_hidden]))
    layer2_weights = tf.Variable(tf.truncated_normal(
        [num_hidden, num_hidden], stddev=0.05))
    layer2_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    layer3_weights = tf.Variable(tf.truncated_normal(
        [num_hidden, num_hidden], stddev=0.05))
    layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    layer4_weights = tf.Variable(tf.truncated_normal(
        [num_hidden, num_labels], stddev=0.05))
    layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
```

# TensorFlow Model

We now define our Neural Network model. Hyperbolic Tangent is our activation function and the rest is matrix algebra as we described in previous articles.

```python
    # (This continues inside the graph.as_default() block.)
    # Model.
    def model(data):
        hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
        hidden = tf.tanh(tf.matmul(hidden, layer2_weights) + layer2_biases)
        hidden = tf.tanh(tf.matmul(hidden, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases
```

# Training Model

Now we set up the training model and the optimizer to use, namely gradient descent. We also define the correct answers to compare against.

```python
    # (This continues inside the graph.as_default() block.)
    # Training computation.
    logits = model(tf_train_dataset)
    loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = logits
    valid_prediction = model(tf_valid_dataset)
    test_prediction = model(tf_test_dataset)
    next_prices = model(tf_final_dataset)
```

# Run the Model

So far we have set up TensorFlow ready to go, but we haven’t calculated anything. This next set of code executes the training run. It will use the data we’ve provided in the configured batch size to train our network while printing out some intermediate information.

```python
num_steps = 2052

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        acc = accuracy(predictions, batch_labels)
        if (step % 100 == 0):
            print('Minibatch loss at step %d: %f' % (step, l))
            print('Minibatch accuracy: %.1f%%' % acc)
            if numValidData > 0:
                print('Validation accuracy: %.1f%%' % accuracy(
                    valid_prediction.eval(), valid_labels))
    if numTestData > 0:
        print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
```
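The offset calculation is what walks the minibatches through the training data and wraps around when it reaches the end. This small plain Python sketch applies the same formula to made-up sizes so you can see the pattern; the helper name `batch_offsets` is hypothetical.

```python
def batch_offsets(num_rows, batch_size, num_steps):
    """Start index of each minibatch, using the same formula as the loop above."""
    return [(step * batch_size) % (num_rows - batch_size)
            for step in range(num_steps)]

# Hypothetical sizes: 10 training rows, batches of 4.
offsets = batch_offsets(num_rows=10, batch_size=4, num_steps=6)
print(offsets)
```

Taking the remainder by (rows - batch size) guarantees every slice is a full batch, which matters because the placeholders have a fixed batch_size. A side effect is that the offset plus the batch size is always less than the number of rows, so the very last training row is never used.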

# Make a Prediction

The final bit of code uses our trained model to make a prediction based on the last set of data we have (where we don’t know the right answer). If you get fresh stock market data for today, then the prediction will be for tomorrow’s price changes. If you run this late enough that Yahoo has updated its prices for the day, then you will get some real errors for comparison. Note that Yahoo is very slow and erratic about doing this, so be careful when reading this table.

    predictions = next_prices.eval() * factors
    print("Stock Last Close Predict Chg Predict Next Current Current Chg Error")
    i = 0
    for x in dow30:
        yhfeed = Share(x)
        currentPrice = float(yhfeed.get_price())
        print("%-6s %9.2f %9.2f %9.2f %9.2f %9.2f %9.2f" % (
            x,
            final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
            predictions[0][i],
            final_row_prices[0][i * NHistData + NHistData - 1] * factors[i] + predictions[0][i],
            currentPrice,
            currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
            abs(predictions[0][i] - (currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i]))))
        i = i + 1

# Results

Below is a screenshot of one run predicting the stock changes for Sept. 22. Basically it didn’t do very well. We’ll talk about why and what to do about this in a future article. As you can see it is very conservative in its predictions.

# Summary

This article shows the code for training and executing a very simple Neural Network using TensorFlow. Definitely don’t bet on the stock market based on this model; it is very simple at this point. We still need to add a number of elements to make this into a useful model, which we’ll look at in future articles.

## The Road to TensorFlow – Part 6: Optimization and Training

# Introduction

Last time we looked at the matrix equation that would be our Neural Network which is:

Output of Layer = ActivationFunction( A x (Input of Layer) + b )

We also specified that our input vector would be 900 elements large (the 30 Dow stocks times the last 30 price changes) and the output vector would be 30 elements (the next price change for each of the Dow 30 stocks). This means that if we have just one hidden layer of, say, 100 Neurons then we need a 900×100 matrix and a 100×30 matrix plus a 100 element bias vector and a 30 element bias vector. This means we need 900×100 + 100×30 + 100 + 30 = 93,130 values. Where do these all come from? In this article we’ll look at where we get these.
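The count above can be checked with a few lines of Python. This is only a sketch: `count_parameters` is a hypothetical helper written for this illustration, and it assumes every layer is fully connected with one bias per neuron.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of values at each stage,
    e.g. [inputs, hidden neurons, outputs].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per input/output pair
        total += n_out         # one bias per neuron in the layer
    return total


# 900 inputs, one hidden layer of 100 neurons, 30 outputs.
n_params = count_parameters([900, 100, 30])
print(n_params)
```

Adding more hidden layers to the list shows how quickly the number of values to find grows.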

# Training

What we want to do is use some sort of known or historical data to train the Neural Network. When Neural Networks were first proposed, Computer Scientists tuned these by hand, which took a long time and produced very small Neural Networks that didn’t work well. Later on many methods were developed to calculate these values from databases of known cases, however until recently these databases were too small to be effective and led to extreme over-fitting. With the advent of big data, shared cloud resources and automated data collection, a large number of high quality, extremely large databases are available to train Neural Networks for well-known problems like handwriting recognition or shape identification. Notice that finding the 93,130 values from the introduction requires far more than 93,130 pieces of data, since anything less will lead to over-fitting (which we’ll talk a lot about in a future article).

If you remember back to basic statistics and linear regression, we found the best fit for a straight line through a number of data points by minimizing the sum of the squares of the distances from the line to each data point. This is why it’s often called least squares regression. Basically we are formulating the problem as an optimization problem where we are trying to minimize an error function. The linear regression problem is then easily solvable by first year linear algebra. For the Neural Network case it’s a little bit more complicated, but the basic idea is the same.

To train our Neural Network we will use historical data where we provide 30 days of price changes for the Dow 30 stocks and then we know the next change so we can provide the error for our error function. To define an error function, we are going to start by just using the square of the difference, so basically just doing least square minimization just like least squares regression. In TensorFlow we can define our loss function as:

loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))
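To see what this loss measures, here is a small NumPy sketch of the same calculation. `tf.nn.l2_loss` returns half the sum of the squared elements of its argument; the `l2_loss` function and the sample numbers below are made up for illustration.

```python
import numpy as np


def l2_loss(t):
    """Half the sum of squared elements, the quantity tf.nn.l2_loss computes."""
    t = np.asarray(t, dtype=float)
    return 0.5 * np.sum(t ** 2)


# Made-up predicted and actual price changes for three stocks.
predicted = np.array([0.010, -0.020, 0.005])
actual = np.array([0.012, -0.015, 0.000])

loss = l2_loss(predicted - actual)
print(loss)
```

The factor of one half doesn’t change where the minimum is; it just makes the derivative a little tidier.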

Now that we have the data and an error function, how do we go about training our network? First we start by seeding the matrix weights with normally distributed random numbers. TensorFlow provides some help here:

layer1_weights = tf.Variable(tf.truncated_normal([NHistData * num_stocks, num_hidden], stddev=0.05))

This defines our matrix and initializes it with random numbers drawn from a truncated normal distribution.

There are a number of optimization algorithms that can be used to solve this problem. The one we are going to use is called Gradient Descent, which relies on Back Propagation to compute the gradients it needs. The key property of back propagation is that it can be applied to Neural Networks with multiple hidden layers. The basic idea is that you take the partial derivative of the loss function with respect to each weight. Together these form the gradient, and based on whether each partial derivative is positive or negative you decrease or increase the corresponding weight by a little bit. The size of this little bit is controlled by the learning rate, which is a parameter to the algorithm (or can be changed dynamically by another algorithm). You then run the training data through this algorithm and hopefully observe your error function decreasing as you go.
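As a rough illustration of the update rule, here is a minimal NumPy sketch of gradient descent on a plain linear model rather than a Neural Network. With only one layer the gradient can be written down directly, so no back propagation is needed. The data, the learning rate and the number of steps are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples with 3 inputs and a known set of weights.
X = rng.normal(size=(200, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w

# Seed the weights with small normally distributed random numbers.
w = rng.normal(scale=0.05, size=3)
learning_rate = 0.1

losses = []
for step in range(200):
    error = X @ w - y
    losses.append(0.5 * np.mean(error ** 2))
    # Partial derivative of the loss with respect to each weight.
    gradient = X.T @ error / len(y)
    # Move each weight a little bit against its gradient.
    w -= learning_rate * gradient

print(losses[0], losses[-1])
print(w)
```

The error shrinks at every step and the weights end up close to `true_w`. A learning rate that is too large makes the error grow instead.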

This is then the basis of training your network. Once you have the weights you can calculate all the values as you like.

# Testing

A big danger here is that you are overfitting. You reduce the error function to next to nothing and the network works well for all your training data. You then try it on something else and it produces very bad results. This is similar to fitting a 10th degree polynomial through 11 data points. It fits all those points exactly, but has no predictive value outside of those exact points.
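The polynomial analogy is easy to try with NumPy. The sketch below uses made-up data, a straight line plus a little noise, and fits both a 10th degree polynomial and a straight line to the same 11 points.

```python
import numpy as np

rng = np.random.default_rng(1)

# 11 noisy samples of an underlying straight line (made-up data).
x_train = np.linspace(-1.0, 1.0, 11)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)

# A 10th degree polynomial has 11 coefficients, so it can hit every point.
coeffs10 = np.polyfit(x_train, y_train, deg=10)
# A straight line only has 2 coefficients, so it has to follow the trend.
coeffs1 = np.polyfit(x_train, y_train, deg=1)

train_error10 = np.max(np.abs(np.polyval(coeffs10, x_train) - y_train))
train_error1 = np.max(np.abs(np.polyval(coeffs1, x_train) - y_train))

# Compare both fits against the underlying line between the training points.
x_new = np.linspace(-1.0, 1.0, 201)
new_error10 = np.max(np.abs(np.polyval(coeffs10, x_new) - 2.0 * x_new))
new_error1 = np.max(np.abs(np.polyval(coeffs1, x_new) - 2.0 * x_new))

print("training error:", train_error10, train_error1)
print("error between points:", new_error10, new_error1)
```

The high degree fit has essentially zero training error, but between the points it typically swings well away from the line, especially near the ends of the range.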

A common technique is to divide the training data into three buckets: actual training data, validation data and final test data. You use the training data to train and, as you train, you use the validation data to see how you are doing. Then when everything is finished you use the test data to do a final check (where the training process has never seen this data). This then gives you an idea of how well the network will fare out in the real world. We will make the size of these three buckets configurable.
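One simple way to build the three buckets is sketched below. The `split_dataset` helper is hypothetical, and cutting the rows in order (oldest first) is an assumption that suits time series like stock prices. The parameter names follow `numValidData` and `numTestData` from the training code, where the validation set is checked during training and the test set is held back until the end.

```python
import numpy as np


def split_dataset(data, labels, num_valid, num_test):
    """Split rows, in order, into training, validation and test buckets."""
    num_train = len(data) - num_valid - num_test
    if num_train <= 0:
        raise ValueError("not enough rows left over for training")
    valid_end = num_train + num_valid
    train = (data[:num_train], labels[:num_train])
    valid = (data[num_train:valid_end], labels[num_train:valid_end])
    test = (data[valid_end:], labels[valid_end:])
    return train, valid, test


# 100 made-up rows with 2 inputs and 1 label each.
data = np.arange(200).reshape(100, 2)
labels = np.arange(100).reshape(100, 1)

train, valid, test = split_dataset(data, labels, num_valid=15, num_test=10)
print(len(train[0]), len(valid[0]), len(test[0]))
```

Setting either size to zero just leaves that bucket empty, which matches the checks for a positive size in the training loop.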

# Local Versus Global Minimums

During the training process a few different things could happen. The solution could diverge, with the error just getting larger and larger. The solution could get stuck in a valley and just orbit a minimum value without converging to it. The solution could converge, but to a local minimum rather than the global minimum. These are all things that need to be watched out for.

Since the initial values are random, re-running the training can lead to quite different solutions. For some problems you want to train repeatedly to get the best solution. Or perhaps compare different optimization algorithms to see which gives the best result. Another idea is to use a combination of algorithms, perhaps start with one that gets into the correct neighborhood, and then another that can zero in on it.

There are quite a few tricks to get out of local minimums and to escape valleys using various random numbers. One is to change the learning rate to occasionally take a bigger jump. Others are to try some random perturbations to see if you can start converging to another solution.

# Batch Versus Single

A lot of the time we process the training data in batches, where we take the average of the partial derivatives over the batch to adjust the weights. This can greatly speed up training and avoids the problem of one bad data point sending us in the wrong direction. Again the batch size is a meta-parameter to the training algorithm that we can tune to get the best results.
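Here is a rough NumPy sketch of batch training on a plain linear model. The data, batch size and learning rate are made up, and the way the batch window steps through the data is borrowed from the training loop shown with the model code.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data for a linear model with known weights.
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

batch_size = 10
learning_rate = 0.1
w = np.zeros(3)

for step in range(500):
    # Step a window of batch_size rows through the training data.
    offset = (step * batch_size) % (len(y) - batch_size)
    batch_X = X[offset:offset + batch_size]
    batch_y = y[offset:offset + batch_size]

    error = batch_X @ w - batch_y
    # One gradient per data point in the batch...
    per_example_gradients = batch_X * error[:, None]
    # ...averaged so no single point dominates the update.
    w -= learning_rate * per_example_gradients.mean(axis=0)

print(w)
```

Each update only touches 10 rows, so it is much cheaper than a pass over all 1000, yet the weights still settle on `true_w`.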

# Summary

This was a really quick introduction to training a Neural Network. There are many optimization algorithms that can be applied to solve this problem, but we are starting with gradient descent. A number of these algorithms are chosen because they make it easier to use a GPU or distributed network to parallelize, and hence speed up, the training process.

Next time we’ll start looking at the TensorFlow code for a simple Neural Network model, then we will start enhancing it to get better results.

## The Road to TensorFlow – Part 5: An Introduction to Neural Networks

# Introduction

We’ve now quickly covered a number of preliminary topics including Linux, Python, Python Libraries and some Stock Market theory. Now we are ready to start talking about Neural Networks and TensorFlow.

TensorFlow is Google’s open source platform for performing the types of numerical computations required by Neural Networks. It isn’t specific to Neural Networks, but has a lot of supporting functions to help with their development. If you had another application that required lots of matrix algebra, then perhaps TensorFlow would also work for you. TensorFlow supports optimized mathematical operations that can either run on your native CPU or be offloaded to a GPU. Google has even developed a custom processor chip to run TensorFlow operations in their data centers.

TensorFlow now powers quite a few Google products for things like speech recognition, photo recognition, and is even giving back some Google search results.

# Biological Versus the Mechanical

A lot of AI researchers like to distance themselves from exactly how biological neurons work and prefer to just take certain ideas. They point out that achieving manned flight required taking some ideas from birds, like wing design, while throwing away others, like wings flapping. Similarly, for neural networks they take some ideas and throw others away.

If you are interested in a more precise simulation of the brain, check out the University of Waterloo’s Nengo project. This is a very interesting simulation of the brain that has been able to solve a number of problems. In this discussion we’ll be looking at what is more typically done these days in neural networks, which tend to take the ideas where the math works easiest and skip the rest.

# From Neurons to Matrix Equations

Consider a bunch of neurons in the brain as depicted in the following diagram.

Inputs come into each neuron and then if a weighted sum of the signals it receives is high enough then its outputs will fire (with a certain strength) which will then feed into another layer of neurons. This rather simplistic model of neurons and the brain is what we will model for our initial neural networks.

We will take some sort of vector of inputs and feed them into an input layer of neurons which based on the weighted sums of these inputs will fire with some strength into the next layer of neurons. In neural networks any layers of neurons that aren’t externally connected to inputs or outputs are called hidden layers. The following diagram shows this model.

Notice that every input connects to every neuron in the next layer. In a biological brain, there won’t be that many connections, but here when we train this model to determine the weights, some weights will be zero (or very small), corresponding to there not really being a connection. Having a fixed complete set of connections is really just a convenience to make the math easier and more uniform.

If you work out the math of doing all these weighted sums you quickly realize that you are just doing matrix algebra, and you can get the input to the next layer by multiplying the inputs to this layer by a matrix. So:

Output of Layer = A x (Input of Layer)

Where A is the matrix of weights. That’s simple and easy to calculate (just ignoring for now where the elements of the matrix A come from).

If you remember your matrix algebra you will realize that if you do this to each layer, since this is just linear, you can multiply all the matrices together and reduce the multiple layer problem to a single layer problem. So in this simple view there is no value in multiple layers. Additionally, linear models are overly simple and can be constructed and solved quite easily. Also, the output of this model is unbounded: it can come out at any magnitude, which the output of a real neuron clearly can’t.
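The collapse of several linear layers into one can be checked numerically. The sketch below uses small random matrices (the sizes are arbitrary) and follows the "A x (Input of Layer)" form of the equation. It also shows that putting a non-linear function such as tanh between the layers breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two layers of weights: 6 inputs -> 4 hidden values -> 3 outputs.
A1 = rng.normal(size=(4, 6))
A2 = rng.normal(size=(3, 4))
x = rng.normal(size=6)

# Applying the two linear layers one after the other...
two_layers = A2 @ (A1 @ x)
# ...is the same as applying the single combined matrix A2 x A1.
one_layer = (A2 @ A1) @ x
same = np.allclose(two_layers, one_layer)

# With a non-linear function in between, the shortcut no longer works.
with_tanh = A2 @ np.tanh(A1 @ x)
still_same = np.allclose(with_tanh, one_layer)

print(same, still_same)
```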

What most neural networks do is add a non-linear activation function to this equation. The activation function maps the output value back into a valid range, adds a non-linearity so the whole equation doesn’t just collapse back to one layer, and adds flexibility in how the model can produce values. The new form of the equation then becomes:

Output of Layer = ActivationFunction( A x (Input of Layer) + b )

Where b is a bias vector that allows the output to be shifted into the range of the activation function. The simplest activation function is the rectifier function defined as f(x) = max( 0, x ). This returns x if x is positive and 0 if x is negative. This is good if we only want positive values as output, it is really simple, and it does behave like some biological networks. On the downside, it isn’t invertible so we can’t run the network backwards (which is useful for sanity checking), it isn’t differentiable everywhere (differentiability helps with solving for the weights) and it doesn’t provide an upper bound on the output. All that being said, ReLU (Rectified Linear Unit) neural networks are currently the most popular. A smooth version of ReLU is the softplus function f(x) = ln(1 + e^x). Other choices of activation function include logistic sigmoid (from probability theory) and hyperbolic tangent (tanh), which we will use.
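To get a feel for these functions, here is a short NumPy sketch of each one evaluated at a few sample points. The function names are just labels chosen for this example; `np.tanh` is NumPy’s own hyperbolic tangent.

```python
import numpy as np


def relu(x):
    """Rectifier: x if x is positive, otherwise 0."""
    return np.maximum(0.0, x)


def softplus(x):
    """Smooth version of ReLU: ln(1 + e^x)."""
    return np.log1p(np.exp(x))


def sigmoid(x):
    """Logistic sigmoid: output always lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-2.0, 0.0, 2.0])

print("relu    ", relu(x))
print("softplus", softplus(x))
print("sigmoid ", sigmoid(x))
print("tanh    ", np.tanh(x))  # output always lies between -1 and 1
```

ReLU and softplus have no upper bound, while sigmoid and tanh squash any input into a fixed range.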

We’re still a bit theoretical at this point, but once we consider what the inputs look like and what we want for an output then we can start to solve for the bits in the middle. If we have good values for the various A matrices and b vectors then we can see that with some matrix multiplication, addition and simple function evaluation we can get solutions, and as it turns out both modern CPUs and especially GPUs are really good at this.

# Stock Market Example

We’ll now start looking at this with a simple stock market example to get an idea how this all works. Suppose we want to feed in the last 30 adjusted closing prices for the 30 stocks that compose the Dow Jones index and we want our neural network to output the next day closing prices for these 30 stocks. We will be starting simple to give the basic ideas then we’ll look at making this model more sophisticated. Let’s see how we can go about this.

# Our Input Vector

For any Neural Network we have to feed in a vector of floating point numbers. So let’s consider feeding in a vector consisting of the last 30 adjusted closing prices of the first Dow component followed by the last 30 adjusted closes of the next component and so on. This means our input vector will contain 900 elements containing the last 30 adjusted closes of each of the 30 Dow stocks.

You can do this but it causes problems because the activation function we are going to use returns values between -1 and 1. Typically neural networks work best with values in this range (or maybe 0 to 1 if only positive values are required). So to make this work you need to normalize the input data to something that works better. We are going to do three things:

1. Divide each stock’s price by the first price we have in its history so it starts at 1.
2. Rather than use the actual stock price, we’ll use the stock price change (of the price normalized by step 1).
3. If NaN is returned in the historical data, we will back fill it from the next good value. Fortunately Pandas provides a function to do this:

trainData.fillna(method='backfill', inplace=True)

This then puts all the values nicely in range and makes them fairly uniform. The reason for step 3 is that when we go to train the neural network we want to train it with lots of historical data, and if we don’t do this we can’t go back very far. Visa, in its current corporate incarnation, only went public in 2008 and then was added to the Dow in 2013 (replacing Bank of America). So there is no Visa historical data from before 2008. I actually chose tanh as the activation function after switching to price changes. Originally I used ReLU with real prices, but it tended to be rather unstable.
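The three steps can be sketched with Pandas on a tiny made-up price table. The ticker names and prices are invented, `DataFrame.bfill()` is used as the equivalent of a backfilling `fillna`, and doing the back fill first (so every stock has a first price to divide by) is an assumption about the order. The name `factors` for the first prices follows the prediction code.

```python
import numpy as np
import pandas as pd

# Made-up adjusted closes; BBB has no data for the first two days.
prices = pd.DataFrame({
    "AAA": [50.0, 51.0, 49.5, 50.5],
    "BBB": [np.nan, np.nan, 20.0, 21.0],
})

# Step 3: back fill missing values from the next good value.
filled = prices.bfill()

# Step 1: divide each stock by its first price so it starts at 1.
factors = filled.iloc[0]
normalized = filled / factors

# Step 2: use the day-to-day change of the normalized price.
changes = normalized.diff().dropna()
print(changes)

# Multiplying by the factors undoes step 1 and gives changes in real prices.
price_changes = changes * factors
print(price_changes)
```

The normalized changes are small numbers around zero, which sit comfortably inside the range that tanh produces.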

# Our Output Vector

Our output vector will be the next price changes for the 30 Dow component stocks. Then we just need to undo the first normalization above in order to use them.

# Summary

This article was a quick introduction to the equations we are going to solve with TensorFlow and what motivates them. We started to look at how we input data into the model and we will continue next time with finding all the various matrix components by framing it as an optimization problem.