Stephen Smith's Blog

Musings on Machine Learning…

Archive for September 2016

The Road to TensorFlow – Part 8: Improving the Model



In the last part of this series we presented a complete Python program to demonstrate how to create a simple feed forward Neural Network to predict the price changes in the thirty stocks that comprise the Dow Jones Index. The program worked, but its predictions tend to be very close to zero. Perhaps this isn’t too hard to understand since if the data is rather random then this might actually be the minimum. Or perhaps there is something wrong with our model. In this article we’ll start to look at how to improve a Neural Network model.

Remember that humans can’t predict the stock market so this might not be possible. However Neural Networks are very good at finding and recognizing patterns, so if the technical analysts are correct and they really do exist then Neural Networks will be able to find them.

The new source code for this is located here.

Meta Parameters

You might notice that there are quite a few rather arbitrary looking constants in the sample program. How do we know the values chosen are optimal? How can we go about tuning them? Some of these parameters include:

  1. The learning rate passed to the GradientDescentOptimizer function (currently 0.01).
  2. The number of steps we train the model with the data.
  3. The batch size.
  4. The number of hidden nodes.
  5. The number of hidden layers.
  6. The mean and standard deviation of the random numbers we seed the weights with.
  7. We use the tanh activation function; would another activation function be better?
  8. Are we providing too much data to the model, or too little data?
  9. We are providing the last 30 price changes. Should we try different values?
  10. We are using the gradient descent optimizer; would another optimizer be better?
  11. We are using least squares for our loss function; would another loss function be better?

All these parameters affect the results. How do we know we’ve chosen good values? Some of these are constants, some require code changes but all are fairly configurable. Since TensorFlow is a toolkit, there are lots of options to choose from for all of these.

For this article we are going to look at a number of best practices for these parameters and look to improve these by following the best practices.

Another approach is to treat this as an optimization problem and run it through an optimizer like gradient descent to find the best values for these parameters. Yet another approach is to use evolutionary algorithms to evolve a better solution. Perhaps we’ll have a look at these in a future article.

Learning Rate

The learning rate affects how far we move as we zero in on the minimum. If you use a bigger value it moves further. If you use too big a value it will overshoot the minimum and perhaps miss it. If you use too small a value it can get stuck in valleys or take an extremely long time to converge. I found that for this problem if I set the learning rate to 0.1 then the program doesn’t converge and the weights actually diverge to infinity. So we can’t make this much larger.
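To make the effect of the step size concrete, here is a minimal sketch of plain gradient descent on the toy function f(x) = x², not on the stock model. The function name `descend` and the particular step sizes are illustrative assumptions.

```python
def descend(learning_rate, start=1.0, steps=20):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x              # derivative of x**2
        x = x - learning_rate * gradient
    return x

tiny = descend(0.0001)   # barely moves away from the starting point
small = descend(0.1)     # each step multiplies x by 0.8, so x shrinks toward 0
large = descend(1.1)     # each step multiplies x by -1.2, so |x| keeps growing
print(tiny, small, large)
```

For this toy function any learning rate above 1.0 makes each step overshoot by more than it gained, which is the same kind of divergence described above.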

What we can do is make the learning rate variable so it starts out larger and then gets smaller as we zero in on a solution. So let’s look at how to do that in our Python code.

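 # Setup Learning Rate Decay
 global_step = tf.Variable(0, trainable=False)
 starter_learning_rate = 0.01
 # Multiply the rate by 0.96 after every 5000 training steps.
 learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
     5000, 0.96, staircase=True)
 # Optimizer. Pass the decaying learning_rate rather than a constant, and pass
 # global_step so that minimize() increments it on every training step.
 optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)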


Basically TensorFlow gives us a way to do this with the exponential_decay function and assistance from the optimizer.
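As a rough illustration of what the decay schedule works out to, the sketch below reproduces the formula given in the TensorFlow documentation for exponential_decay in plain Python. The helper name `decayed_rate` is an assumption made for this example and is not part of TensorFlow.

```python
def decayed_rate(start_rate, step, decay_steps, decay_rate, staircase=True):
    """Learning rate after `step` training steps under exponential decay."""
    if staircase:
        exponent = step // decay_steps         # drops in discrete jumps
    else:
        exponent = step / float(decay_steps)   # decays smoothly
    return start_rate * decay_rate ** exponent

# Same settings as the listing above: start at 0.01, decay by 0.96 every 5000 steps.
for step in (0, 4999, 5000, 10000):
    print(step, decayed_rate(0.01, step, 5000, 0.96))
```

With staircase=True the rate stays at 0.01 for the first 5000 steps and only then drops, so a short training run never sees any decay at all.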

The other thing we’ve done is increase the batch size to 10. Increasing the batch size past 10 leads to the training process diverging, so we are limited to this. We have also dramatically increased the number of steps in the training process. Previously we kept this short to speed up processing time, but now that we are improving the model we need to give it time to train. At the same time we don’t want it to over-train and overfit the data. There are two ways of running the program. One is with no test or validation data, so you can train with all the data to get the best result on the next prediction. The other is with test and validation steps, where you can see when the test and validation results stop improving and hence when to stop training.

More Data

We are currently providing the previous 30 price changes for each of the Dow components. Perhaps this isn’t sufficient? Perhaps there is different data we could add? One common practice in stock technical analysis is to consider stock price moves when the trading volume is high as more important than when the trading volume is low. So perhaps if we add the trading volumes to the input data it will help give better results.

The way we do this is to modify the sliding windows we are providing so that they now give 60 values for each of the Dow stocks. We now provide each price change followed by the matching volume. So our input vector is now alternating price changes and volumes and has grown to 1800 in length. Trading volumes of Dow stocks are quite large numbers, so we normalize them by dividing by the first volume for a stock. This way they tend to be in the range 0 to 10. In this case the volume isn’t an output anywhere so we don’t need it in the range of the activation functions, but we do want to keep the weights in the matrix from getting too large.
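The interleaving described above can be sketched for a single stock as follows. This is a simplified illustration rather than the code from the downloadable source file; the function name `interleave` and the sample numbers are assumptions.

```python
def interleave(price_changes, volumes, first_volume):
    """Alternate each price change with its volume, normalized by the
    first volume recorded for the stock."""
    row = []
    for change, volume in zip(price_changes, volumes):
        row.append(change)
        row.append(volume / float(first_volume))
    return row

window = interleave([0.01, -0.02], [2000000, 5000000], 1000000)
print(window)  # [0.01, 2.0, -0.02, 5.0]
```

With 30 price changes per stock this gives 60 values per stock, and concatenating the rows for all 30 stocks gives the 1800 element input vector.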

I won’t list all the code changes to do that here, but you can download the source file from my Google Drive.

We could also consider adding more data. Perhaps varying the amount of historical data. We could add calculated values that technical analysts find useful like weighted averages, Bollinger bands etc. We could get other metrics from Web services such as fundamental data on the stock, perhaps P/E ratio or market capitalization. We could construct metrics like number of news feed stories on a stock. All these are possibilities to include. Perhaps later on we can see methods to see if these help. After training the model you can see which inputs don’t lead anywhere, i.e. all their connections have zero weights. By analysing these we can see which values the training process has considered important and which ones can be removed to simplify our model.

Number of Hidden Nodes

In the initial implementation we just made the number of hidden nodes to be 16 in each layer. Partly we used a small number to make calculation quicker so we could iterate our coding quicker. But what values should we use?

Neural Networks are very good at pattern recognition. They accomplish this in a similar way to the brain. They take heavily pixelated data and then discover structure in that data as they go from layer to layer. We want to encourage that, to have the network summarize the data somewhat to go from layer to layer. A common way to accomplish that is to have a large number of nodes adjacent to the input layer and then reduce it with each following layer.

We will start with a larger number, namely 64, and then halve it with each following layer.

 # Variables.
 layer1_weights = tf.Variable(tf.truncated_normal(
     [NHistData * num_stocks, num_hidden], stddev=0.1))
 layer1_biases = tf.Variable(tf.zeros([num_hidden]))
 layer2_weights = tf.Variable(tf.truncated_normal(
     [num_hidden, int(num_hidden / 2)], stddev=0.1))
 layer2_biases = tf.Variable(tf.constant(1.0, shape=[int(num_hidden / 2)]))
 layer3_weights = tf.Variable(tf.truncated_normal(
     [int(num_hidden / 2), int(num_hidden / 4)], stddev=0.1))
 layer3_biases = tf.Variable(tf.constant(1.0, shape=[int(num_hidden / 4)]))
 layer4_weights = tf.Variable(tf.truncated_normal(
     [int(num_hidden / 4), num_labels], stddev=0.1))
 layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))


I would have liked to use a larger number closer to the number of inputs, but it appears the model diverges for 128 or more hidden nodes in the first layer. I’ll have to investigate why some time in the future.
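As a quick check on the halving scheme, this sketch lists the weight matrix shapes it produces. The helper `halving_shapes` is a hypothetical name used only for illustration, and the input size is passed in so you can try both 900 and 1800 inputs.

```python
def halving_shapes(num_inputs, num_hidden, num_labels, num_hidden_layers=3):
    """Return the (rows, columns) shape of each weight matrix when every
    hidden layer is half as wide as the one before it."""
    shapes = []
    rows = num_inputs
    width = num_hidden
    for _ in range(num_hidden_layers):
        shapes.append((rows, width))
        rows = width
        width = width // 2
    shapes.append((rows, num_labels))   # final layer maps to the outputs
    return shapes

print(halving_shapes(900, 64, 30))
# [(900, 64), (64, 32), (32, 16), (16, 30)]
```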


Below are the results for Sept. 26. Notice that the results are much better. It predicted the generally down day on the Dow and it isn’t being so conservative in its estimates anymore. Some stocks like MMM and MCD it got really close. There were others like PG and GS that it got quite wrong. Overall though we seem to be starting to get something useful. However, to re-iterate, don’t bet your real hard earned money based on this simple model.



This article covered a few improvements that could easily be made to our model. This is still a very simple basic model, so please don’t use it to trade stocks. Even if you have a really good model, the hedge funds will still eat you alive.

Generally, this is how developing Neural Networks goes. You start with a simple small model and then iteratively enhance it to develop it into something good.

Written by smist08

September 27, 2016 at 3:48 pm

The Road to TensorFlow – Part 7: Finally Some Code



Well after a long journey through Linux, Python, Python Libraries, the Stock Market, an Introduction to Neural Networks and training Neural Networks we are now ready to look at a complete Python example to predict the stock market.

I placed the full source code listing on my Google Drive here. As described in the previous articles you will need to run this on a Mac or on Linux (could be a virtual image) with Python and TensorFlow installed. You will also need to have the various libraries that are imported at the top of the source file installed or you will get an error when you go to run it. I would suggest getting the source file to play with, since Python is very fussy about indentation and copy/paste from the article may introduce indentation errors caused by the blog formatting.

The Neural Network we are running here is a simple feed forward network with four layers (three hidden layers followed by an output layer) and uses the hyperbolic tangent as the activation function in each hidden layer. This is a very simple model so don’t use it to invest with real money. Hopefully this article gives a flavour for how to create and train a Neural Network using TensorFlow. Then in future articles we can discuss the limitations of this model and how to improve it.

Import Libraries

First we import all the various libraries we will be using; note that tensorflow and numpy are particularly important.

# Copyright 2016 Stephen Smith

import time
import math
import os
from datetime import date
from datetime import timedelta
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import pandas_datareader as pdr
from pandas_datareader import data, wb
from six.moves import cPickle as pickle
from yahoo_finance import Share

Get Stock Market Data

Next we get the stock market data. If the file stocks.pickle exists we assume we’ve previously saved this file and use it. Otherwise we get the data from Yahoo Finance using a Web Service call, made via the Pandas DataReader. We only keep the adjusted close column and we fill in any NaN’s with the first value we saw (this really only applies to Visa in this case). The data will all be in a standard Pandas data frame after this.

# Choose amount of historical data to use NHistData
NHistData = 30
TrainDataSetSize = 3000

# Load the Dow 30 stocks from Yahoo into a Pandas datasheet

dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
         'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
         'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG',
         'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']

num_stocks = len(dow30)

trainData = None
loadNew = False

# If stocks.pickle exists then this contains saved stock data, so use this,
# else use the Pandas DataReader to get the stock data and then pickle it.
stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
    try:
        with open(stock_filename, 'rb') as f:
            trainData = pickle.load(f)
    except Exception as e:
        print('Unable to process data from', stock_filename, ':', e)
    print('%s already present - Skipping requesting/pickling.' % stock_filename)
else:
    # Get the historical data. Make the date range quite a bit bigger than
    # TrainDataSetSize since there are no quotes for weekends and holidays. This
    # ensures we have enough data.
    # NOTE: the DataReader call was garbled in this listing. The arguments below
    # are reconstructed from the imports and the comment above (an assumption).
    f = data.DataReader(dow30, 'yahoo',
                        date.today() - timedelta(days=TrainDataSetSize*2+5),
                        date.today())
    cleanData = f.ix['Adj Close']
    trainData = pd.DataFrame(cleanData)
    trainData.fillna(method='backfill', inplace=True)
    loadNew = True
    print('Pickling %s.' % stock_filename)
    try:
        with open(stock_filename, 'wb') as f:
            pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', stock_filename, ':', e)

Normalize the Data

We then normalize the data and remember the factor we used so we can de-normalize the results at the end.

# Normalize the data by dividing each price by the first price for a stock.
# This way all the prices start together at 1.
# Remember the normalizing factors so we can go back to real stock prices
# for our final predictions.
factors = np.ndarray(shape=( num_stocks ), dtype=np.float32)
i = 0
for symbol in dow30:
    factors[i] = trainData[symbol][0]
    trainData[symbol] = trainData[symbol]/trainData[symbol][0]
    i = i + 1
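To show why we keep the factors around, here is a small round trip on a plain Python list rather than a Pandas data frame. The helper names `normalize` and `denormalize` and the sample prices are assumptions for this sketch.

```python
def normalize(prices):
    """Divide every price by the first one and return the factor used."""
    factor = prices[0]
    return [p / factor for p in prices], factor

def denormalize(values, factor):
    """Scale normalized prices or price changes back to real dollars."""
    return [v * factor for v in values]

normalized, factor = normalize([50.0, 55.0, 45.0])
print(normalized, factor)
# A predicted normalized change of 0.02 on this stock is a 1 dollar move.
print(denormalize([0.02], factor))
```

Because the normalization is a plain division, price changes scale by the same factor as the prices themselves, which is why the final prediction can simply be multiplied by the factors.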

Re-arrange the Data for TensorFlow

Now we need to build up our training data, test data and validation data. We need to format this as input arrays for the Neural Network. Looking at this code, I think true Python programmers will accuse me of being a C programmer (which I am), since I do this all with loops. I’m sure a more experienced Python programmer could accomplish this quicker with more array operations. This part of the code is quite slow so we pickle the result; that way, if we re-run with the saved stock data, we can also use the saved training data.

# Configure how much of the data to use for training, testing and validation.

usableData = len(trainData.index) - NHistData + 1
#numTrainData =  int(0.6 * usableData)
#numValidData =  int(0.2 * usableData)
#numTestData = usableData - numTrainData - numValidData - 1
numTrainData = usableData - 1
numValidData = 0
numTestData = 0

train_dataset = np.ndarray(shape=(numTrainData - 1,
    num_stocks * NHistData), dtype=np.float32)
train_labels = np.ndarray(shape=(numTrainData - 1, num_stocks),
    dtype=np.float32)
valid_dataset = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks * NHistData), dtype=np.float32)
valid_labels = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks), dtype=np.float32)
test_dataset = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks * NHistData), dtype=np.float32)
test_labels = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks), dtype=np.float32)
final_row = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)
final_row_prices = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)

# Build the training datasets in the correct format with the matching labels.
# So if we calculate based on the last 30 stock prices then the desired
# result is the 31st. So note that the first 29 data points can't be used.
# Rather than use the stock price, use the pricing deltas.
pickle_file = "traindata.pickle"
if loadNew == True or not os.path.exists(pickle_file):
    for i in range(1, numTrainData):
        for j in range(num_stocks):
            for k in range(NHistData):
                train_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k]
                    - trainData[dow30[j]][i + k - 1])
            train_labels[i-1][j] = (trainData[dow30[j]][i + NHistData]
                - trainData[dow30[j]][i + NHistData - 1])  

    for i in range(1, numValidData):
        for j in range(num_stocks):
            for k in range(NHistData):
                valid_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData]
                    - trainData[dow30[j]][i + k + numTrainData - 1])
            valid_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData]
                - trainData[dow30[j]][i + NHistData + numTrainData - 1])

    for i in range(1, numTestData):
        for j in range(num_stocks):
            for k in range(NHistData):
                test_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData + numValidData]
                    - trainData[dow30[j]][i + k + numTrainData + numValidData - 1])
            test_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData + numValidData]
                - trainData[dow30[j]][i + NHistData + numTrainData + numValidData - 1])

    try:
      with open(pickle_file, 'wb') as f:
        save = {
          'train_dataset': train_dataset,
          'train_labels': train_labels,
          'valid_dataset': valid_dataset,
          'valid_labels': valid_labels,
          'test_dataset': test_dataset,
          'test_labels': test_labels,
          }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
      print('Unable to save data to', pickle_file, ':', e)
else:
    # The windows were built on a previous run, so just load them.
    with open(pickle_file, 'rb') as f:
      save = pickle.load(f)
      train_dataset = save['train_dataset']
      train_labels = save['train_labels']
      valid_dataset = save['valid_dataset']
      valid_labels = save['valid_labels']
      test_dataset = save['test_dataset']
      test_labels = save['test_labels']
      del save  # hint to help gc free up memory   

for j in range(num_stocks):
    for k in range(NHistData):
            final_row_prices[0][j * NHistData + k] = trainData[dow30[j]][k + len(trainData.index) - NHistData]
            final_row[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index) - NHistData]
                - trainData[dow30[j]][k + len(trainData.index) - NHistData - 1])

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
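Stripped of the bookkeeping, the windowing idea in the loops above looks like this for a single stock. This is a simplified sketch and not a drop-in replacement for the listing: the function name `make_windows` is an assumption, and the real code concatenates the windows of all 30 stocks into one row.

```python
def make_windows(prices, history):
    """Return (inputs, labels). Each input is `history` consecutive price
    changes and its label is the price change that follows them."""
    deltas = [prices[i] - prices[i - 1] for i in range(1, len(prices))]
    inputs, labels = [], []
    for start in range(len(deltas) - history):
        inputs.append(deltas[start:start + history])
        labels.append(deltas[start + history])
    return inputs, labels

inputs, labels = make_windows([1, 2, 4, 7, 11], history=2)
print(inputs)  # [[1, 2], [2, 3]]
print(labels)  # [3, 4]
```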


We now set up an accuracy function that is only used to report how we are doing during training. This isn’t used by the training algorithm. It roughly shows what percentage of predictions are within some tolerance.

# This accuracy function is used for reporting progress during training, it isn't actually
# used for training.
def accuracy(predictions, labels):
  err = np.sum( np.isclose(predictions, labels, 0.0, 0.005) ) / (predictions.shape[0] * predictions.shape[1])
  return (100.0 * err)
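If the np.isclose call looks cryptic, the following pure Python sketch computes the same kind of percentage by hand. With a relative tolerance of 0.0, np.isclose reduces to checking that the absolute difference is at most the absolute tolerance of 0.005. The function name `within_tolerance_pct` and the sample numbers are assumptions.

```python
def within_tolerance_pct(predictions, labels, tolerance=0.005):
    """Percentage of predictions whose absolute error is at most `tolerance`."""
    hits = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        for p, t in zip(pred_row, label_row):
            total += 1
            if abs(p - t) <= tolerance:
                hits += 1
    return 100.0 * hits / total

predictions = [[0.0, 0.01], [0.002, -0.02]]
labels = [[0.001, 0.0], [0.0, -0.021]]
print(within_tolerance_pct(predictions, labels))  # 75.0
```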

TensorFlow Variables

We now start setting up TensorFlow by creating our graph and defining our datasets and variables.

batch_size = 4
num_hidden = 16
num_labels = num_stocks

graph = tf.Graph()

# input is 30 days of dow 30 prices normalized to be between 0 and 1.
# output is 30 values for normalized next day price change of dow stocks
# use a 4 level neural network to compute this.

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, num_stocks * NHistData))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  tf_final_dataset = tf.constant(final_row)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [NHistData * num_stocks, num_hidden], stddev=0.05))
  layer1_biases = tf.Variable(tf.zeros([num_hidden]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.05))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

TensorFlow Model

We now define our Neural Network model. Hyperbolic Tangent is our activation function and the rest is matrix algebra as we described in previous articles.

  # Model.
  def model(data):
    hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer2_weights) + layer2_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
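To see what those few lines of matrix algebra do, here is the same forward pass written out with plain Python lists instead of TensorFlow tensors. It is a sketch for a single input row; the names `dense` and `forward` and the tiny example network are assumptions.

```python
import math

def dense(inputs, weights, biases, activation=None):
    """One layer: inputs (length n) times weights (n x m) plus biases (length m)."""
    outputs = []
    for col in range(len(biases)):
        total = biases[col]
        for row, value in enumerate(inputs):
            total += value * weights[row][col]
        outputs.append(activation(total) if activation else total)
    return outputs

def forward(inputs, layers):
    """Apply tanh on every layer except the last, which stays linear."""
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, math.tanh)
    weights, biases = layers[-1]
    return dense(hidden, weights, biases)

# Two inputs -> one hidden neuron -> one output.
layers = [([[1.0], [1.0]], [0.0]), ([[2.0]], [0.5])]
print(forward([0.5, 0.5], layers))
```

Leaving the last layer linear, as the model above does, lets the outputs take any value rather than being squeezed into the -1 to 1 range of tanh.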

Training Model

Now we set up the training model and the optimizer to use, namely gradient descent. We also define what the correct answers to compare against are.

  # Training computation.
  logits = model(tf_train_dataset)
  loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))

  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
  # Predictions for the training, validation, and test data.
  train_prediction = logits
  valid_prediction = model(tf_valid_dataset)
  test_prediction = model(tf_test_dataset)
  next_prices = model(tf_final_dataset)

Run the Model

So far we have set up TensorFlow ready to go, but we haven’t calculated anything. This next set of code executes the training run. It will use the data we’ve provided in the configured batch size to train our network while printing out some intermediate information.

num_steps = 2052

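with tf.Session(graph=graph) as session:
  # Variables must be initialized before the first training step. This call is
  # missing from the listing; initialize_all_variables is the TensorFlow 0.x name.
  tf.initialize_all_variables().run()
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)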
    acc = accuracy(predictions, batch_labels)
    if (step % 100 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % acc)
      if numValidData > 0:
          print('Validation accuracy: %.1f%%' % accuracy(
              valid_prediction.eval(), valid_labels))
  if numTestData > 0:        
      print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
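The offset calculation in the training loop is easier to follow with small numbers. The sketch below just evaluates the same expression for a made-up dataset of 10 rows; the function name `batch_offsets` is an assumption.

```python
def batch_offsets(num_rows, batch_size, num_steps):
    """Starting row of each mini-batch, wrapping around the training data."""
    return [(step * batch_size) % (num_rows - batch_size)
            for step in range(num_steps)]

print(batch_offsets(10, 4, 4))  # [0, 4, 2, 0]
```

The modulo keeps every slice of batch_size rows inside the training data, so the loop can run for more steps than there are batches and simply cycles through the data again.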

Make a Prediction

The final bit of code uses our trained model to make a prediction based on the last set of data we have (where we don’t know the right answer). If you get fresh stock market data for today, then the prediction will be for tomorrow’s price changes. If you run this late enough that Yahoo has updated its prices for the day, then you will get some real errors for comparison. Note that Yahoo is very slow and erratic about doing this, so be careful when reading this table.

  predictions = next_prices.eval() * factors
  print("Stock    Last Close  Predict Chg   Predict Next      Current     Current Chg       Error")
  i = 0
  for x in dow30:
      yhfeed = Share(x)
      currentPrice = float(yhfeed.get_price())
      # De-normalize the last close we trained on back to a real price.
      lastClose = final_row_prices[0][i * NHistData + NHistData - 1] * factors[i]
      print( "%-6s  %9.2f  %9.2f       %9.2f       %9.2f     %9.2f     %9.2f" % (x,
             lastClose,
             predictions[0][i],
             lastClose + predictions[0][i],
             currentPrice,
             currentPrice - lastClose,
             abs(predictions[0][i] - (currentPrice - lastClose))) )
      i = i + 1


Below is a screenshot of one run predicting the stock changes for Sept. 22. Basically it didn’t do very well. We’ll talk about why and what to do about this in a future article. As you can see it is very conservative in its predictions.



This article shows the code for training and executing a very simple Neural Network using TensorFlow. Definitely don’t bet on the stock market based on this model, as it is very simple at this point. We still need to add a number of elements to start making this into a useful model, which we’ll look at in future articles.

Written by smist08

September 23, 2016 at 4:17 pm

The Road to TensorFlow – Part 6: Optimization and Training



Last time we looked at the matrix equation that would be our Neural Network which is:

Output of Layer = ActivationFunction( A x (Input of Layer) + b )

We also specified that our input vector would be 900 elements large (the 30 Dow stocks times the last 30 price changes) and the output vector would be 30 elements (the next price change for each of the Dow 30 stocks). This means that if we have just one hidden layer of say 100 Neurons then we need a 900×100 matrix and a 100×30 matrix plus a 100 element bias vector and a 30 element bias vector. This means we need 900×100 + 100×30 + 100 + 30 = 93,130 values. Where do these all come from? In this article we’ll look at where we get these.
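The arithmetic above generalizes to any stack of fully connected layers: each layer needs one weight per input/output pair plus one bias per output. A small sketch, with `count_parameters` as an assumed helper name:

```python
def count_parameters(layer_sizes):
    """Total weights and biases for fully connected layers of the given sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

print(count_parameters([900, 100, 30]))  # 93130
```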


What we want to do is use some sort of known or historical data to train the Neural Network. When Neural Networks were first proposed, Computer Scientists tuned these values by hand, which took a long time and produced very small Neural Networks that didn’t work well. Later on many methods were developed to calculate these from databases of known cases; however, until recently these databases were too small to be effective and led to extreme over-fitting. With the advent of big data, shared cloud resources and automated data collection, a large number of high quality, extremely large databases are available to train Neural Networks for well known problems like handwriting recognition or shape identification. Notice that finding the 93,130 values from the introduction requires far more than 93,130 data points, since with less data we get over-fitting (which we’ll talk a lot about in a future article).

If you remember back to basic statistics and linear regression, we found the best fit for a straight line through a number of data points by minimizing the squares of the distance from the line to each data point. This is why it’s often called least squares regression. Basically we are formulating the problem as an optimization problem where we are trying to minimize an error function. The linear regression problem is then easily solvable by first year linear algebra. For the Neural Network case it’s a little bit more complicated, but the basic idea is the same.

To train our Neural Network we will use historical data where we provide 30 days of price changes for the Dow 30 stocks and then we know the next change so we can provide the error for our error function. To define an error function, we are going to start by just using the square of the difference, so basically just doing least square minimization just like least squares regression. In TensorFlow we can define our loss function as:

loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))
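For reference, tf.nn.l2_loss returns half the sum of the squared elements of its argument, so the loss above is half the summed squared error over the batch. In plain Python that works out to the following sketch, where the function name and sample values are assumptions:

```python
def l2_loss(predictions, labels):
    """Half the sum of squared differences between predictions and labels."""
    total = 0.0
    for pred_row, label_row in zip(predictions, labels):
        for p, t in zip(pred_row, label_row):
            total += (p - t) ** 2
    return total / 2.0

print(l2_loss([[1.0, 2.0]], [[0.0, 0.0]]))  # 2.5
```

Note that this is a sum rather than a mean, so the loss value grows with the batch size.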

Now that we have the data and an error function, how do we go about training our network? First we start by seeding the matrix weights with normally distributed random numbers. TensorFlow provides some help here:

layer1_weights = tf.Variable(tf.truncated_normal(
[NHistData * num_stocks, num_hidden], stddev=0.05))

to define our matrix and initialize it with normally distributed random numbers.

There are a number of optimization algorithms that can be used to solve this problem. The one we are going to use is called Gradient Descent, which relies on Back Propagation to compute the gradients. The key property of back propagation is that it can be applied to Neural Networks with multiple hidden layers. The basic idea is that you take the partial derivative of the loss function with respect to each weight. This gives you a gradient with respect to each weight, and then based on whether the gradient is positive or negative you decrease or increase the weight by a little bit. The size of this little bit is controlled by the learning rate, which is a parameter to the algorithm (or can be changed dynamically by another algorithm). You then run the training data through this algorithm and hopefully observe your error function decreasing as you go.
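Here is a minimal sketch of that loop for the simplest possible model, a straight line y = w·x + b fitted with a mean squared error loss. The partial derivatives are written out by hand, which is the work back propagation automates for deeper networks. The function name `fit_line`, the data and the settings are assumptions chosen for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = float(len(xs))
    for _ in range(steps):
        # Partial derivatives of the mean squared error w.r.t. w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step each weight against its gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # close to 2 and 1, since the data lies on y = 2x + 1
```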

This is then the basis of training your network. Once you have the weights you can calculate all the values as you like.

Example of Gradient Descent where it could lead to 2 different minimums



A big danger here is overfitting. You reduce the error function to next to nothing and the network works well for all your training data. You then try it on something else and it produces very bad results. This is similar to fitting a 10th degree polynomial through 11 data points. It fits all those points exactly, but has no predictive value outside of those exact points.
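The polynomial analogy is easy to try out. The sketch below passes a 10th degree polynomial through 11 evenly spaced samples of a smooth curve using Lagrange interpolation and then compares the error on the sample points with the error in between them. The choice of curve (the Runge function, a standard textbook example) and all names are assumptions for this illustration.

```python
def lagrange(xs, ys, x):
    """Evaluate at x the unique polynomial through the points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)

# 11 evenly spaced "training" points on [-1, 1].
xs = [-1.0 + 0.2 * i for i in range(11)]
ys = [runge(x) for x in xs]

train_error = max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys))
grid = [-1.0 + 0.01 * i for i in range(201)]
off_sample_error = max(abs(lagrange(xs, ys, x) - runge(x)) for x in grid)
print(train_error, off_sample_error)
```

The error on the 11 training points is zero, while between them the polynomial swings away from a curve whose values never leave the range 0 to 1 by more than a whole unit.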

A common technique is to divide the training data into three buckets: actual training data, testing data and final validation data. You use the training data to train and as you train, you use the testing data to see how you are doing. Then when everything is finished you use the final validation data to do a final test (where the training process has never seen this data). This then gives you an idea of how well the network will fare out in the real world. We will make the size of these three buckets configurable.
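A minimal sketch of sizing the three buckets, assuming a 60/20/20 split like the one in the commented-out lines of the Part 7 listing. The function name `split_sizes` is an assumption, and the exact off-by-one handling in the real program differs slightly.

```python
def split_sizes(usable, train_frac=0.6, valid_frac=0.2):
    """Return how many rows go to training, validation and test."""
    num_train = int(train_frac * usable)
    num_valid = int(valid_frac * usable)
    num_test = usable - num_train - num_valid   # whatever is left over
    return num_train, num_valid, num_test

print(split_sizes(10))  # (6, 2, 2)
```

For time series like stock prices it is usual to take the buckets in chronological order, as the listing does, rather than shuffling rows between them.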

Local Versus Global Minimums

During the training process a few different things could happen. The solution could diverge, the error could just keep getting larger and larger. The solution could get stuck in a valley and just orbit a minimum value without converging to it. The solution could converge, but to a local minimum rather than the global minimum. These are all things that need to be watched out for.

Since the initial values are random, re-running the training can lead to quite different solutions. For some problems you want to train repeatedly to get the best solution. Or perhaps compare different optimization algorithms to see which gives the best result. Another idea is to use a combination of algorithms, perhaps start with one that gets into the correct neighborhood, and then another that can zero in on it.
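To illustrate how the starting point decides which minimum you end up in, here is gradient descent on a one dimensional function with two valleys, f(x) = x⁴ − 2x² + 0.5x. The function, the names and the settings are all assumptions picked to make the effect visible.

```python
def f(x):
    return x ** 4 - 2.0 * x ** 2 + 0.5 * x

def f_prime(x):
    return 4.0 * x ** 3 - 4.0 * x + 0.5

def descend_from(start, learning_rate=0.01, steps=2000):
    """Follow the negative gradient of f from `start` and return where we stop."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

left = descend_from(-1.5)    # settles in the deeper valley near x = -1.06
right = descend_from(1.5)    # settles in the shallower valley near x = 0.93
print(left, f(left))
print(right, f(right))
```

Both runs converge, but only the one that started on the left finds the lower value of f, which is why re-running training from different random starting weights can be worthwhile.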

There are quite a few tricks to get out of local minimums and to escape valleys using various random numbers. One is to change the learning rate to occasionally take a bigger jump. Others are to try some random perturbations to see if you can start converging to another solution.

Batch Versus Single

A lot of the time we process the training data in batches where we take the average of the partial derivatives to adjust the weights. This can greatly speed up training and avoids the problem of one bad data point sending us in the wrong direction. Again the batch size is a meta-parameter to the training algorithm that we can tune to get the best results.
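A small sketch of why averaging helps, using a one weight model y = w·x and made-up data in which one point is an outlier. The function names are assumptions.

```python
def example_gradient(w, x, y):
    """Gradient of (w*x - y)**2 with respect to w for a single data point."""
    return 2.0 * (w * x - y) * x

def batch_gradient(w, xs, ys):
    """Average of the per-example gradients over one batch."""
    grads = [example_gradient(w, x, y) for x, y in zip(xs, ys)]
    return sum(grads) / len(grads)

xs = [1.0, 1.0, 1.0, 1.0]
ys = [2.0, 2.0, 2.0, -6.0]          # the last label is a bad data point
print(example_gradient(2.0, xs[3], ys[3]))  # 16.0, a big push on its own
print(batch_gradient(2.0, xs, ys))          # 4.0, damped by the good points
```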


This was a really quick introduction to training a Neural Network. There are many optimization algorithms that can be applied to solve this problem, but we are starting with gradient descent. A number of these algorithms are chosen because they make it easier to use a GPU or distributed network to parallelize and hence speed up the training process.

Next time we’ll start looking at the TensorFlow code for a simple Neural Network model, then we will start enhancing it to get better results.

Written by smist08

September 21, 2016 at 11:32 pm

The Road to TensorFlow – Part 5: An Introduction to Neural Networks



We’ve now quickly covered a number of preliminary topics including Linux, Python, Python Libraries and some Stock Market theory. Now we are ready to start talking about Neural Networks and TensorFlow.


TensorFlow is Google’s open source platform for performing the types of numerical computations required by Neural Networks. It isn’t specific to Neural Networks, but has a lot of supporting functions to help with their development. If you had another application that required lots of matrix algebra, then perhaps TensorFlow would also work for you. TensorFlow supports optimized mathematical operations that can either run on your native CPU or be offloaded to a GPU. Google has even developed a custom processor chip to run TensorFlow operations in their data centers.

TensorFlow now powers quite a few Google products for things like speech recognition and photo recognition, and is even used to produce some Google search results.

Biological Versus the Mechanical

A lot of AI researchers like to distance themselves from exactly copying how biological neurons work, and prefer to just borrow certain ideas. They point out that achieving manned flight required taking some ideas from birds, like wing design, while throwing away others, like flapping wings. Similarly, for neural networks they take some ideas and throw others away.

If you are interested in a more precise simulation of the brain, check out the University of Waterloo’s Nengo project. This is a very interesting simulation of the brain that has been able to solve a number of problems. In this discussion we’ll be looking at what is more typically done these days in neural networks, which tend to take the ideas where the math works easiest and skip the rest.

From Neurons to Matrix Equations

Consider a bunch of neurons in the brain as depicted in the following diagram.


Inputs come into each neuron and then if a weighted sum of the signals it receives is high enough then its outputs will fire (with a certain strength) which will then feed into another layer of neurons. This rather simplistic model of neurons and the brain is what we will model for our initial neural networks.

We will take some sort of vector of inputs and feed them into an input layer of neurons which based on the weighted sums of these inputs will fire with some strength into the next layer of neurons. In neural networks any layers of neurons that aren’t externally connected to inputs or outputs are called hidden layers. The following diagram shows this model.

Notice that all the inputs connect to all the next layer of neurons. In a biological brain, there won’t be that many connections, but here when we train this model to determine the weights, some weights will be zero (or very small) corresponding to there not really being a connection. But having a fixed complete set of connections really is just convenience to make the math easier and more uniform.

If you work out the math of doing all these weighted sums, you quickly realize you are just doing matrix algebra, and you can get the input to the next layer by multiplying the inputs to this layer by a matrix. So:

Output of Layer = A x (Input of Layer)

Where A is the matrix of weights. That’s simple and easy to calculate (just ignoring for now where the elements of the matrix A come from).

If you remember your matrix algebra you will realize that if you do this to each layer, since this is just linear, you can multiply all the matrices together and reduce the multiple layer problem to a single layer problem. So in this simple view there is no value in multiple layers. Additionally, linear models are overly simple and can be constructed and solved quite easily. Also, with this model the output is unbounded: it can come out at any magnitude, which clearly isn’t true of real neurons.
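To see the collapse concretely, here is a small NumPy sketch. The layer sizes and the random weights are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.normal(size=(5, 4))   # weights of a first layer: 4 inputs -> 5 neurons
A2 = rng.normal(size=(3, 5))   # weights of a second layer: 5 neurons -> 3 outputs
x = rng.normal(size=4)         # an arbitrary input vector

# Feed the input through the two linear layers one after the other.
two_layer = A2 @ (A1 @ x)

# Multiply the weight matrices together first, giving one equivalent layer.
combined = A2 @ A1
one_layer = combined @ x

print(np.allclose(two_layer, one_layer))
```

The two results agree up to floating point rounding, so without something non-linear between the layers the extra layer adds nothing.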

What most neural networks do is add a non-linear activation function to this equation. The activation function maps the output value back into a valid range, adds a non-linearity so the whole equation doesn’t just transform back to one layer as well as adds flexibility in how the model can produce values. The new form of the equation then becomes:

Output of Layer = ActivationFunction( A x (Input of Layer) + b )

Where b is a bias vector that allows the output to be shifted into the range of the activation function. The simplest activation function is the rectifier function defined as f(x) = max( 0, x ). This basically returns x if x is positive and 0 if x is negative. This is good if we only want positive values as output, it is really simple and it does behave like some biological networks. On the downside, it isn’t invertible so we can’t run the network backwards (which is useful for sanity checking), it isn’t differentiable everywhere (differentiability helps with solving for the weights) and it doesn’t provide an upper bound on the output. All that being said, ReLU (Rectified Linear Unit) neural networks are currently the most popular. A smooth version of ReLU is the softplus function f(x) = ln(1 + e^x). Other choices of activation function include logistic sigmoid (from probability theory) and hyperbolic tangent (tanh) which we will use.
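The activation functions mentioned above are easy to write down with NumPy. This is only a sketch to show their shapes; the layer function and the sample weights are hypothetical names and values for illustration, and TensorFlow has its own built-in versions of these functions.

```python
import numpy as np

def relu(x):
    """Rectifier: x where x is positive, otherwise 0."""
    return np.maximum(0.0, x)

def softplus(x):
    """Smooth version of ReLU: ln(1 + e^x)."""
    return np.log1p(np.exp(x))

def sigmoid(x):
    """Logistic sigmoid: squashes any value into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-x))

def layer(A, b, inputs, activation=np.tanh):
    """Output of Layer = ActivationFunction( A x (Input of Layer) + b )."""
    return activation(A @ inputs + b)

# Arbitrary example weights, biases and inputs.
A = np.array([[0.5, -1.0], [2.0, 0.25], [10.0, 10.0]])
b = np.array([0.1, -0.2, 0.0])
inputs = np.array([1.0, 3.0])

layer_output = layer(A, b, inputs)   # tanh keeps every output between -1 and 1
print(layer_output)
```

Passing activation=relu to the same function shows the unbounded behaviour described above, since the third neuron's weighted sum of 40 comes through unchanged.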

We’re still a bit theoretical at this point, but once we consider what the inputs look like and what we want for an output then we can start to solve for the bits in the middle. If we have good values for the various A matrices and b vectors then we can see that with some matrix multiplication, addition and simple function evaluation we can get solutions, and as it turns out both modern CPUs and especially GPUs are really good at this.

Stock Market Example

We’ll now start looking at this with a simple stock market example to get an idea how this all works. Suppose we want to feed in the last 30 adjusted closing prices for the 30 stocks that compose the Dow Jones index and we want our neural network to output the next day closing prices for these 30 stocks. We will be starting simple to give the basic ideas then we’ll look at making this model more sophisticated. Let’s see how we can go about this.

Our Input Vector

For any Neural Network we have to feed in a vector of floating point numbers. So let’s consider feeding in a vector consisting of the last 30 adjusted closing prices of the first Dow component followed by the last 30 adjusted closes of the next component and so on. This means our input vector will contain 900 elements: the last 30 adjusted closes of each of the 30 Dow stocks.
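Here is one way the 900 element vector could be laid out with NumPy. The history array is fake data, and the assumption that rows are days and columns are stocks is made only for this sketch.

```python
import numpy as np

num_stocks = 30   # Dow components
window = 30       # how many past values we feed in per stock

# Fake history (assumption): 100 days of values, one row per day, one column per stock.
history = np.arange(100 * num_stocks, dtype=float).reshape(100, num_stocks)

# Take the most recent 30 days, giving a (30 days, 30 stocks) block.
last_window = history[-window:]

# Transpose so each row is one stock, then flatten: the first stock's 30 values,
# followed by the second stock's 30 values, and so on.
input_vector = last_window.T.reshape(-1)

print(input_vector.shape)
```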

You can do this but it causes problems because the activation function we are going to use returns values between -1 and 1. Typically neural networks work best with values in this range (or maybe 0 to 1 if only positive values are required). So to make this work you need to normalize the input data to something that works better. We are going to do three things:

  1. Divide each stock’s price by the first price we have in its history so it starts at 1.
  2. Rather than use the actual stock price, we’ll use the stock price change (of the price normalized by #1).
  3. If NaN is returned in the historical data, we will back fill it from the next good value. Fortunately Pandas provides a function to do this:
    trainData.fillna(method='backfill', inplace=True)

This then puts all the values nicely in range and makes them fairly uniform. The reason for step 3 is that when we go to train the neural network we want to train it with lots of historical data and if we don’t do this we can’t go back very far. Visa, in its current corporate incarnation, only went public in 2008 and then was added to the Dow in 2013 (replacing Bank of America). So there is no Visa historical data from before 2008. I actually chose tanh as the activation function after switching to price changes. Originally I used ReLU with real prices, but it tended to be rather unstable.
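The three steps can be sketched on a tiny made-up price table. The tickers and prices below are invented, and the sketch does the back fill first so that every column has a first price to divide by. It uses DataFrame.bfill(), which does the same back fill as the fillna call above and is the spelling newer Pandas versions prefer.

```python
import numpy as np
import pandas as pd

# Invented prices; "AAA" has no data on the first day, like a stock that listed later.
prices = pd.DataFrame({
    "AAA": [np.nan, 50.0, 55.0, 52.25],
    "BBB": [200.0, 210.0, 199.5, 199.5],
})

filled = prices.bfill()                  # step 3: back fill NaN from the next good value
normalized = filled / filled.iloc[0]     # step 1: every stock now starts at 1
changes = normalized.diff().dropna()     # step 2: day to day change of the normalized price

print(changes)
```

The resulting changes are small numbers centred near zero, which suits an activation function like tanh much better than raw prices in the tens or hundreds.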

Our Output Vector

Our output vector will be the next price changes for the 30 Dow component stocks. Then we just need to undo the first normalization above in order to use them.


This article was a quick introduction to the equations we are going to solve with TensorFlow and what motivates them. We started to look at how we input data into the model and we will continue next time with finding all the various matrix components by framing it as an optimization problem.

Written by smist08

September 8, 2016 at 3:49 pm

The Road to TensorFlow – Part 4: The Stock Market

with 3 comments


This is the fourth article in my series on Google TensorFlow and we still won’t get to TensorFlow in this article. We’ve covered Linux, Python and various Python libraries so far. Last time we started to use Python libraries to load stock market data ready to feed into some sort of Neural Network model constructed using TensorFlow. In this article we’re going to take a bit of a side trip into looking at a number of issues, theory and logistics around playing with the stock market.

One thing to remember is that this discussion isn’t pure Mathematics. These are all theories that provide some guidance, they might be based on a lot of historical study, but that doesn’t mean they will be true tomorrow, or even that everyone believes them today. One good reference for this stuff is the Udacity course “Machine Learning for Trading”.


Is This a Suitable Problem for AI?

The first question to ask is whether trading stocks is a suitable problem. After all, people can’t predict what the stock market will do tomorrow, so why would we think a computer can? For most AI problems, like image recognition or machine translation, researchers know the problem is solvable since people solve these problems. So they know that if they can successfully model what people are doing, then they should be able to get similar results. In this case we are attempting a problem people can’t solve (but some are better at guessing than others), and hoping that fancy algorithms and big data will perhaps give us an edge. This idea could well be a fantasy since predicting the future in general is impossible. It would be nice if we could do as well as a stock picking cat, but that cat did beat a team of professionals.

Hedge Funds

Hedge funds are typically high risk funds that perform risky trading strategies for small select clienteles. There are many types of these funds that trade in all sorts of things using all sorts of strategies. However, the ones we are interested in, in this article, are the ones that perform high volume computer trading of stocks. Typically, these are driven by algorithms with little or no human oversight, and the Hedge fund has an extremely favorable arrangement with a given stock market that allows its computers in the stock market’s data center and gives it extremely low transaction fees. Being in the stock market’s data center means they see everything first since they have no latency. They don’t have to wait 10ms or whatever, as you do, for information to make it to your location over the internet. Using very high powered computers they can profit by trading during these latencies (possibly taking your profit).

If these advantages weren’t enough, some Hedge funds have negotiated the right to filter all stock market transactions before they happen and optionally execute the trade themselves again allowing them room to make small profits by inserting themselves into other people’s transactions.

The main takeaway from this is that unless you are such a Hedge fund, you are at a considerable disadvantage. This is one of the main reasons that day traders have all but disappeared. Hedge funds were able to manipulate them and generally profit from the day traders.

The other thing to remember is that Hedge funds are large and capable of manipulating the market. Often they will play against known trading strategies by over selling or buying to make it look like something is happening and then tricking people into doing things that are a bad idea and profiting from it.


The Efficient Market Hypothesis

The Efficient Market Hypothesis (EMH) states that asset prices fully reflect all available information. There are weaker and stronger forms of this hypothesis, but the basic premise is that you can’t beat the market and you may as well put all your money in an index fund that just matches market performance. Basically that it is futile to try and find undervalued stocks to buy or overvalued stocks to sell.

One claim is that Hedge funds contribute to making the markets efficient. Since they trade so quickly any new information is incorporated into the prices of stocks instantly as far as you can tell. Maybe so, but it does rub the wrong way that someone is profiting this way.

Not everyone believes the EMH, but at the same time it has been proven out time and time again especially in the large heavily traded world markets.


The Capital Asset Pricing Model (CAPM) is often used in portfolio management to manage risk, but it’s also often used in stock trading. A simple form of this equation is:

r_i(t) = β_i * r_m(t) + α_i

This says that the return r_i(t) for a given stock i at time t is equal to a constant β_i times the market return r_m(t) at time t, plus a constant α_i whose expected value is zero. There is usually another term for the base interest rate, but that is effectively zero these days.

The upshot of this is that stocks move with the market and not individually. Each stock’s beta can be determined from the stock’s history and then this gives a pretty good model for stock returns. This is bad if you have some special insight into a stock, for instance if you are an expert in its industry or perhaps have a good idea of the future trend. For instance if you know something bad is going to happen, you want to short the stock, but if the market goes up that day, it could overwhelm the individual stock’s bad news and you lose on your position.
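One common way to determine beta from history is a straight line fit of the stock’s returns against the market’s returns. The sketch below does this with NumPy on simulated returns; the "true" beta of 1.3, the noise levels and the 250 day sample are made-up values for illustration, not real market data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily returns (assumption): the stock follows the market with
# beta 1.3 and alpha 0, plus a little noise of its own.
market_returns = rng.normal(0.0, 0.01, size=250)
true_beta, true_alpha = 1.3, 0.0
stock_noise = rng.normal(0.0, 0.002, size=250)
stock_returns = true_beta * market_returns + true_alpha + stock_noise

# Least squares line fit: the slope is the beta estimate, the intercept is alpha.
beta, alpha = np.polyfit(market_returns, stock_returns, deg=1)

print(beta, alpha)
```

With real data you would replace the simulated arrays with the daily returns of a stock and of an index, and the fit would recover that stock’s historical beta.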

If you believe the EMH then alpha will always be zero or go to zero before you can capitalize on it.

The first Hedge funds came up with a clever scheme to avoid this. If you have two stocks, one you think is going to go up (positive alpha) and another that you think will go down (negative alpha) then you can buy/sell these stocks in pairs by choosing weights of the positions that cause the two beta components to cancel out. This way you eliminate the market from the equation and can concentrate on just the stocks. This is in fact where Hedge funds got their name, using two stocks to hedge their market exposure. This worked for a while and then others figured out ways to exploit this and it caused a market crash and bailout for a number of funds when it failed. Now this buy/sell pair strategy doesn’t work, as most strategies seem to stop working once they are widely enough known.
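The beta cancelling arithmetic can be worked through with the simple CAPM form above. This is a sketch of the idea only: the function name and the betas and alphas are hypothetical, and it ignores fees, borrowing costs and everything else that matters in practice.

```python
import math

def hedged_pair_return(beta_long, alpha_long, beta_short, alpha_short,
                       market_return, long_weight=1.0):
    """Return of a long/short pair whose weights are chosen to cancel beta."""
    # Choose the short weight so long_weight*beta_long == short_weight*beta_short.
    short_weight = long_weight * beta_long / beta_short
    long_return = beta_long * market_return + alpha_long
    short_return = beta_short * market_return + alpha_short
    # We gain when the long stock rises and when the short stock falls.
    return long_weight * long_return - short_weight * short_return

# Hypothetical stocks: we expect the first to outperform and the second to lag.
up_market = hedged_pair_return(1.2, 0.01, 0.8, -0.01, market_return=0.05)
down_market = hedged_pair_return(1.2, 0.01, 0.8, -0.01, market_return=-0.05)

print(up_market, down_market)
```

Both calls give the same result (about 0.025 here), because the market term drops out and only the two alphas are left.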

Finding alpha is an interesting pursuit. For Hedge funds it could be via illegal insider information as dramatized in the TV series “Billions”. Or it could be via semi-legal methods like hiring a guy in China to sit by the road and count the number of trucks that come out of a factory. Certainly studying Apple’s suppliers and factories is a huge industry in trying to gather information on secretive Apple.

The Fundamental Law of Active Management

The fundamental law of active management is the following:

Performance = skill * square root(breadth)

This basically says that the performance of a portfolio manager is equal to his skill times the square root of the number of trades he makes. This law basically says that a poor portfolio manager can make up for his stupidity via volume.

For instance, Warren Buffett is really smart (high skill) and gets a really great return. His breadth is really small, he just buys 120 stocks and holds them. So his breadth is 120. Suppose a Hedge fund has developed a computer algorithm for stock trading that is 1/1000 as smart as Warren Buffett. Then if you do the math with this formula it comes out that the Hedge fund needs to trade 120,000,000 times a year to match Warren Buffett’s performance. The scary part is that there are lots of Hedge funds that employ this strategy. They have low grade (not very smart) algorithms that can get the same return as Warren Buffett by doing huge numbers of trades.
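The arithmetic behind that number is short enough to check in a few lines of Python. The skill value of 1.0 is an arbitrary unit chosen for this sketch; only the ratio between the two skills matters.

```python
import math

def performance(skill, breadth):
    """Performance = skill * square root(breadth)."""
    return skill * math.sqrt(breadth)

def breadth_needed(target_performance, skill):
    """Solve the same formula for breadth: (performance / skill) squared."""
    return (target_performance / skill) ** 2

buffett_skill = 1.0                      # arbitrary unit of skill
buffett_performance = performance(buffett_skill, 120)

fund_skill = buffett_skill / 1000        # 1/1000 as smart
trades_needed = breadth_needed(buffett_performance, fund_skill)

print(round(trades_needed))
```

Because breadth sits under a square root, being 1000 times less skilled has to be made up with 1000 squared, or a million, times as many trades.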

Adjusted Close

In the previous blog posting we read in the history of adjusted closes for all thirty components of the Dow Jones index. There was also a close price returned, so why did we use the adjusted close rather than the real close? If a stock does well, its price goes up and the stock gets too expensive. To help with this, every now and then a company will split its stock. They will issue say 2 new shares for each old share. Everyone gets these, so they now have twice the shares at half the price. So from people’s point of view they still have the same value and nothing has changed. Stocks also issue dividends. Whenever a stock does this its price goes up by the value of the dividend before payment and then goes back down right after payment. Again to stock owners this is all well and good and understood. But these two things cause havoc when developing stock market pricing models and algorithms. Without knowing anything else a stock split looks catastrophic. So to help with this, stock markets provide the adjusted close, which adjusts historical data for stock splits and dividends so they don’t mess up charts and algorithms. Generally, this is quite a nice feature of stock feeds. If you compare adjusted close and close they will be the same back to the last event of this nature and at that point will diverge.
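To get a feel for what the adjustment does, here is a simplified sketch that handles a single stock split only. The function name and prices are invented, and real data feeds also adjust for dividends, which this ignores.

```python
import numpy as np

def adjust_for_split(closes, split_index, ratio):
    """Scale the closes before a split so they are comparable to later prices."""
    adjusted = np.asarray(closes, dtype=float).copy()
    adjusted[:split_index] /= ratio   # only prices before the split change
    return adjusted

# Invented closes with a 2-for-1 split taking effect at index 2.
closes = [100.0, 102.0, 51.5, 52.0]
adjusted = adjust_for_split(closes, split_index=2, ratio=2.0)

print(adjusted)
```

In the raw closes the price appears to halve overnight, while in the adjusted series it moves smoothly from 51 to 51.5, and the two series match from the split onwards.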

Stock Prices

Stock prices don’t by themselves tell you anything about a company and can’t be used to directly compare companies. A company’s value is the stock price times the number of shares. But all companies have issued different numbers of shares and have completely different histories of stock splits, additional share offerings, etc. One way to deal with this is to normalize the stock market data, for instance you could divide all the share prices in a history by the first price. This will cause the stock price to start at 1 and then evolve from there. This does provide one way to compare performance graphically. When doing AI we tend to have to normalize data since the algorithms we are going to use generally don’t like working on large ranges of numbers. We’ll talk more about that later.

Testing with Real Money

We’re not going to test anything with real money. However, most algorithms need real testing in the real market. What we are going to look at doesn’t worry about transaction fees. It also doesn’t worry about some market logistics, since we are only looking at closing prices. You can’t get the previous close price at the next day’s open due to after hours trading and in general how stock order books work. Also if you are a big Hedge fund then actually performing your trades may affect the market. I might have a brilliant algorithm that makes me lots of money in a simulator, but if I run it in the real market, the market may react and counter what I’m doing. Worse, sometimes Hedge funds have caused market crashes, or caused the stock market circuit breakers to kick in as a result of their actions.


This was a really quick introduction to the stock market concepts we’ll be talking about. If you are interested, you can follow the links in the article to learn more.

Written by smist08

September 2, 2016 at 11:32 pm