Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘Dow Jones Index

The Road to TensorFlow – Part 7: Finally Some Code

with 9 comments

Introduction

Well after a long journey through Linux, Python, Python Libraries, the Stock Market, an Introduction to Neural Networks and training Neural Networks we are now ready to look at a complete Python example to predict the stock market.

I placed the full source code listing on my Google Drive here. As described in the previous articles you will need to run this on a Mac or on Linux (could be a virtual image) with Python and TensorFlow installed. You will also need to have the various libraries that are imported at the top of the source file installed or you will get an error when you go to run it. I would suggest getting the source file to play with, Python is very fussy about indentation, so copy/paste from the article may introduce indentation errors caused by the blog formatting.

The Neural Network we are running here is a simple feed forward network with four hidden layers and uses the hyperbolic tangent as the activation function in each case. This is a very simple model so don’t use it to invest with real money. Hopefully this article gives a flavour for how to create and train a Neural Network using TensorFlow. Then in future articles we can discuss the limitation of this model and how to improve it.

Import Libraries

First we import all the various libraries we will be using, note tensorflow and numpy as being particularly important.

# Copyright 2016 Stephen Smith

import time
import math
import os
from datetime import date
from datetime import timedelta
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import pandas_datareader as pdr
from pandas_datareader import data, wb
from six.moves import cPickle as pickle
from yahoo_finance import Share

Get Stock Market Data

Next we get the stock market data. If the file stocks.pickle exists we assume we’ve previously saved this file and use it. Otherwise we get the data from Yahoo Finance using a Web Service call, made via the Pandas DataReader. We only keep the adjusted close column and we fill in any NaN’s with the first value we saw (this really only applies to Visa in this case). The data will all be in a standard Pandas data frame after this.

# Choose amount of historical data to use NHistData
NHistData = 30
TrainDataSetSize = 3000

# Load the Dow 30 stocks from Yahoo into a Pandas datasheet

dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
         'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
         'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG',
         'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']

num_stocks = len(dow30)

trainData = None
loadNew = False

# If stocks.pickle exists then this contains saved stock data, so use this,
# else use the Pandas DataReader to get the stock data and then pickle it.
stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
    try:
        with open(stock_filename, 'rb') as f:
            trainData = pickle.load(f)
    except Exception as e:
      print('Unable to process data from', stock_filename, ':', e)
      raise           
    print('%s already present - Skipping requesting/pickling.' % stock_filename)
else:
    # Get the historical data. Make the date range quite a bit bigger than
    # TrainDataSetSize since there are no quotes for weekends and holidays. This
    # ensures we have enough data.

    f = pdr.data.DataReader(dow30, 'yahoo',
        date.today()-timedelta(days=TrainDataSetSize*2+5), date.today())
    cleanData = f.ix['Adj Close']
    trainData = pd.DataFrame(cleanData)
    trainData.fillna(method='backfill', inplace=True)
    loadNew = True
    print('Pickling %s.' % stock_filename)
    try:
        with open(stock_filename, 'wb') as f:
          pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', stock_filename, ':', e)

Normalize the Data

We then normalize the data and remember the factor we used so we can de-normalize the results at the end.

# Normalize the data by dividing each price by the first price for a stock.
# This way all the prices start together at 1.
# Remember the normalizing factors so we can go back to real stock prices
# for our final predictions.
factors = np.ndarray(shape=( num_stocks ), dtype=np.float32)
i = 0
for symbol in dow30:
    factors[i] = trainData[symbol][0]
    trainData[symbol] = trainData[symbol]/trainData[symbol][0]
    i = i + 1

Re-arrange the Data for TensorFlow

Now we need to build up our training data, test data and validation data. We need to format this as input arrays for the Neural Network. Looking at this code, I think true Python programmers will accuse me of being a C programmer (which I am), since I do this all with loops. I’m sure a more experience Python programmer could accomplish this quicker with more array operations. This part of the code is quite slow so we pickle it, so if we re-run with the saved stock data, we can also use saved training data.

# Configure how much of the data to use for training, testing and validation.

usableData = len(trainData.index) - NHistData + 1
#numTrainData =  int(0.6 * usableData)
#numValidData =  int(0.2 * usableData
#numTestData = usableData - numTrainData - numValidData - 1
numTrainData = usableData - 1
numValidData = 0
numTestData = 0

train_dataset = np.ndarray(shape=(numTrainData - 1,
    num_stocks * NHistData), dtype=np.float32)
train_labels = np.ndarray(shape=(numTrainData - 1, num_stocks),
    dtype=np.float32)
valid_dataset = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks * NHistData), dtype=np.float32)
valid_labels = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks), dtype=np.float32)
test_dataset = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks * NHistData), dtype=np.float32)
test_labels = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks), dtype=np.float32)
final_row = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)
final_row_prices = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)

# Build the taining datasets in the correct format with the matching labels.
# So if calculate based on last 30 stock prices then the desired
# result is the 31st. So note that the first 29 data points can't be used.
# Rather than use the stock price, use the pricing deltas.
pickle_file = "traindata.pickle"
if loadNew == True or not os.path.exists(pickle_file):
    for i in range(1, numTrainData):
        for j in range(num_stocks):
            for k in range(NHistData):
                train_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k]
                    - trainData[dow30[j]][i + k - 1])
            train_labels[i-1][j] = (trainData[dow30[j]][i + NHistData]
                - trainData[dow30[j]][i + NHistData - 1])  

    for i in range(1, numValidData):
        for j in range(num_stocks):
            for k in range(NHistData):
                valid_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData]
                    - trainData[dow30[j]][i + k + numTrainData - 1])
            valid_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData]
                - trainData[dow30[j]][i + NHistData + numTrainData - 1])

    for i in range(1, numTestData):
        for j in range(num_stocks):
            for k in range(NHistData):
                test_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData + numValidData]
                    - trainData[dow30[j]][i + k + numTrainData + numValidData - 1])
            test_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData + numValidData]
                - trainData[dow30[j]][i + NHistData + numTrainData + numValidData - 1])

    try:
      f = open(pickle_file, 'wb')
      save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
        }
      pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
      f.close()
    except Exception as e:
      print('Unable to save data to', pickle_file, ':', e)
      raise

else:
    with open(pickle_file, 'rb') as f:
      save = pickle.load(f)
      train_dataset = save['train_dataset']
      train_labels = save['train_labels']
      valid_dataset = save['valid_dataset']
      valid_labels = save['valid_labels']
      test_dataset = save['test_dataset']
      test_labels = save['test_labels']
      del save  # hint to help gc free up memory   

for j in range(num_stocks):
    for k in range(NHistData):
            final_row_prices[0][j * NHistData + k] = trainData[dow30[j]][k + len(trainData.index - NHistData]
            final_row[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index) - NHistData]
                - trainData[dow30[j]][k + len(trainData.index) - NHistData - 1])

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Accuracy

We now setup an accuracy function that is only used to report how we are doing during training. This isn’t used by the training algorithm. It roughly shows what percentage of predictions are within some tolerance.

# This accuracy function is used for reporting progress during training, it isn't actually
# used for training.
def accuracy(predictions, labels):
  err = np.sum( np.isclose(predictions, labels, 0.0, 0.005) ) / (predictions.shape[0] * predictions.shape[1])
  return (100.0 * err)

TensorFlow Variables

We now start setting up TensorFlow by creating our graph and defining our datasets and variables.

batch_size = 4
num_hidden = 16
num_labels = num_stocks

graph = tf.Graph()

# input is 30 days of dow 30 prices normalized to be between 0 and 1.
# output is 30 values for normalized next day price change of dow stocks
# use a 4 level neural network to compute this.

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, num_stocks * NHistData))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  tf_final_dataset = tf.constant(final_row)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [NHistData * num_stocks, num_hidden], stddev=0.05))
  layer1_biases = tf.Variable(tf.zeros([num_hidden]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.05))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

TensorFlow Model

We now define our Neural Network model. Hyperbolic Tangent is our activation function and rest is matrix algebra as we described in previous articles.

  # Model.
  def model(data):
    hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer2_weights) + layer2_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

Training Model

Now we setup the training model and the optimizer to use, namely gradient descent. We also define what are the correct answers to compare against.

  # Training computation.
  logits = model(tf_train_dataset)
  loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))

  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
  # Predictions for the training, validation, and test data.
  train_prediction = logits
  valid_prediction = model(tf_valid_dataset)
  test_prediction = model(tf_test_dataset)
  next_prices = model(tf_final_dataset)

Run the Model

So far we have setup TensorFlow ready to go, but we haven’t calculated anything. This next set of code executes the training run. It will use the data we’ve provided in the configured batch size to train our network while printing out some intermediate information.

num_steps = 2052

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    acc = accuracy(predictions, batch_labels)
    if (step % 100 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % acc)
      if numValidData > 0:
          print('Validation accuracy: %.1f%%' % accuracy(
              valid_prediction.eval(), valid_labels))
  if numTestData > 0:        
      print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Make a Prediction

The final bit of code uses our trained model to make a prediction based on the last set of data we have (where we don’t know the right answer). If you get fresh stock market data for today, then the prediction will be for tomorrow’s price changes. If you run this late enough that Yahoo has updated its prices for the day, then you will get some real errors for comparison. Note that Yahoo is very slow and erratic about doing this, so be careful when reading this table.

predictions = next_prices.eval() * factors
  print("Stock    Last Close  Predict Chg   Predict Next      Current     Current Chg       Error")
  i = 0
  for x in dow30:
      yhfeed = Share(x)
      currentPrice = float(yhfeed.get_price())
      print( "%-6s  %9.2f  %9.2f       %9.2f       %9.2f     %9.2f     %9.2f" % (x,
             final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
             predictions[0][i],
             final_row_prices[0][i * NHistData + NHistData - 1] * factors[i] + predictions[0][i],
             currentPrice,
             currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
             abs(predictions[0][i] - (currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i]))) )
      i = i + 1

Results

Below is a screenshot of one run predicting the stock changes for Sept. 22. Basically it didn’t do very well. We’ll talk about why and what to do about this in a future article. As you can see it is very conservative in its predictions.

results

Summary

This article shows the code for training and executing a very simple Neural Network using TensorFlow. Definitely don’t bet on the stock market based on this model, it is very simple at this point. We still need to add a number of elements to start making this into a useful model which we’ll look at in future articles.

Written by smist08

September 23, 2016 at 4:17 pm

The Road to TensorFlow – Part 5: An Introduction to Neural Networks

with 2 comments

Introduction

We’ve now quickly covered a number of preliminary topics including Linux, Python, Python Libraries and some Stock Market theory. Now we are ready to start talking about Neural Networks and TensorFlow.

tensorflow

TensorFlow is Google’s open source platform for performing the types of numerical computations required by Neural Networks. It isn’t specific to Neural Networks, but has a lot of supporting functions to help with their development. If you had another application that required lots of matrix algebra, then perhaps TensorFlow would also work for you. TensorFlow supports optimized mathematical operations that can either run on your native CPU or be offloaded to a GPU. Google has even developed a custom processor chip to run TensorFlow operations in their data centers.

TensorFlow now powers quite a few Google products for things like speech recognition, photo recognition, and is even giving back some Google search results.

Biological Versus the Mechanical

A lot of AI researchers like to distance themselves from taking how biological neurons exactly work and rather to just take certain ideas. They point out that to achieve manned flight required taking ideas from birds like wing design while throwing away other ideas like wings flapping. Similarly, for neural networks they take some ideas and throw others away.

If you are interested in a more precise simulation of the brain, check out Waterloo University’s Nengo project. This is a very interesting simulation of the brain that has been able to solve a number of problems. In this discussion we’ll be looking at what is more typically done these days in neural networks which tend to take the ideas where the math works easiest and skipping the rest.

From Neurons to Matrix Equations

Consider a bunch of neurons in the brain as depicted in the following diagram.

brain_neural_net

Inputs come into each neuron and then if a weighted sum of the signals it receives is high enough then its outputs will fire (with a certain strength) which will then feed into another layer of neurons. This rather simplistic model of neurons and the brain is what we will model for our initial neural networks.

We will take some sort of vector of inputs and feed them into an input layer of neurons which based on the weighted sums of these inputs will fire with some strength into the next layer of neurons. In neural networks any layers of neurons that aren’t externally connected to inputs or outputs are called hidden layers. The following diagram shows this model.

neural_net2
Notice that all the inputs connect to all the next layer of neurons. In a biological brain, there won’t be that many connections, but here when we train this model to determine the weights, some weights will be zero (or very small) corresponding to there not really being a connection. But having a fixed complete set of connections really is just convenience to make the math easier and more uniform.

If you work out the math of doing all these weighted sums you quickly realize, you are just doing matrix algebra and you can get the input to the next layer by multiplying the inputs to this layer by a matrix. So:

Output of Layer = A x (Input of Layer)

Where A is the matrix of weights. That’s simple and easy to calculate (just ignoring for now where the elements of the matrix A come from).

If you remember your matrix algebra you will realize that if you do this to each layer, since this is just linear, you can multiply all the matrixes together and reduce the multiple layer problem to a single layer problem. So in this simple view there is no value in multiple layers. Additionally, linear models are overly simple and can be constructed and solved quite easily. Also with this the output is unbounded, it can come out at any magnitude, which clearly real neurons can’t.

What most neural networks do is add a non-linear activation function to this equation. The activation function maps the output value back into a valid range, adds a non-linearity so the whole equation doesn’t just transform back to one layer as well as adds flexibility in how the model can produce values. The new form of the equation then becomes:

Output of Layer = ActivationFunction( A x (Input of Layer) + b )

Where b is a scalar vector that allows the output to be shifted into range of the activation function. The simplest activation function is the rectifier function defined as f(x) = max( 0, x ). This basically returns x if x is positive and 0 if x is negative. This is good if we only want positive values as output, it is really simple and it does behave like some biological networks. On the downside, it isn’t invertible so we can’t run the network backwards (useful for sanity checking), it isn’t differentiable everywhere (helps with solving for the weights) and it doesn’t provide an upper bound on the output. All that being said, ReLU (Rectified Linear Unit) neural networks are currently the most popular. A smooth version of ReLU is the softplus function f(x) = ln(1+ex). Other choices of activation function include logistic sigmoid (from probability theory) and hyperbolic tangent (tanh) which we will use.

We’re still a bit theoretical at this point, but once we consider what the inputs look like and what we want for an output then we can start to solve for the bits in the middle. If we have good values for the various A matrixes and b vectors then we can see that with some matrix multiplication, addition and simple function evaluation we can get solutions and as it turns out both modern CPUs and especially GPUs are really good at this.

Stock Market Example

We’ll now start looking at this with a simple stock market example to get an idea how this all works. Suppose we want to feed in the last 30 adjusted closing prices for the 30 stocks that compose the Dow Jones index and we want our neural network to output the next day closing prices for these 30 stocks. We will be starting simple to give the basic ideas then we’ll look at making this model more sophisticated. Let’s see how we can go about this.

Our Input Vector

For any Neural Network we have to feed a vector of floating point numbers. So let’s consider feeding in a vector consisting of the last 30 adjusted closing prices of the first Dow component followed by the last 30 adjusted closes of the next component and so on. This means out input vector will contain 900 elements containing the last 30 adjusted closes of each of the 30 Dow stocks.

You can do this but it causes problems because the activation function we are going to use returns values between -1 and 1. Typically neural networks work best with values in this range (or maybe 0 to 1 if only positive values are required). So to make this work you need to normalize the input data to something that works better. We are going to do three things:

  1. Divide each stocks price by the first price we have in its history so it starts at 1.
  2. Rather than use the actual stock price, we’ll use the stock price change (of the price normalized by #1).
  3. If NaN is returned in the historical data, we will back fill it from the next good value. Fortuneately Pandas provides a function to do this:
    trainData.fillna(method=’backfill’, inplace=True)

This then puts all the values nicely in range and makes them fairly uniform. The reason for step 3 is that when we go to train the neural network we want to train it with lots of historical data and if we don’t do this we can’t go back very far. Visa, in its current corporate incarnation, only went public in 2008 and then was added to the Dow in 2013 (replacing Bank of America). So there is no Visa historical data from before 2008. Actually I chose tanh as the activation function after switching to price changes, originally I used ReLU with real prices but it tended to be rather unstable.

Our Output Vector

Out output vector will be the next price changes for the 30 Dow component stocks. Then we just need to undo the first normalization above in order to use them.

Summary

This article was a quick introduction to the equations we are going to solve with TensorFlow and what motivates them. We started to look at how we input data into the model and we will continue next time with finding all the various matrix components by framing it as an optimization problem.

Written by smist08

September 8, 2016 at 3:49 pm