## The Road to TensorFlow – Part 7: Finally Some Code

# Introduction

Well, after a long journey through Linux, Python, Python Libraries, the Stock Market, an Introduction to Neural Networks and training Neural Networks, we are now ready to look at a complete Python example to predict the stock market.

I placed the full source code listing on my Google Drive here. As described in the previous articles, you will need to run this on a Mac or on Linux (could be a virtual image) with Python and TensorFlow installed. You will also need to have installed the various libraries that are imported at the top of the source file, or you will get an error when you go to run it. I would suggest getting the source file to play with. Python is very fussy about indentation, so copy/paste from the article may introduce indentation errors caused by the blog formatting.

The Neural Network we are running here is a simple feed forward network with four layers: three hidden layers that use the hyperbolic tangent as the activation function, followed by a linear output layer. This is a very simple model, so don’t use it to invest with real money. Hopefully this article gives a flavour for how to create and train a Neural Network using TensorFlow. Then in future articles we can discuss the limitations of this model and how to improve it.

# Import Libraries

First we import all the various libraries we will be using. Note that tensorflow and numpy are particularly important.

    # Copyright 2016 Stephen Smith

    import time
    import math
    import os
    from datetime import date
    from datetime import timedelta
    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf
    import pandas as pd
    import pandas_datareader as pdr
    from pandas_datareader import data, wb
    from six.moves import cPickle as pickle
    from yahoo_finance import Share

# Get Stock Market Data

Next we get the stock market data. If the file stocks.pickle exists, we assume we’ve previously saved this file and use it. Otherwise we get the data from Yahoo Finance using a Web Service call, made via the Pandas DataReader. We only keep the adjusted close column, and we fill in any NaNs by back-filling them with the first valid value that follows (this really only applies to Visa in this case). The data will all be in a standard Pandas data frame after this.

    # Choose amount of historical data to use NHistData
    NHistData = 30
    TrainDataSetSize = 3000

    # Load the Dow 30 stocks from Yahoo into a Pandas datasheet
    dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
             'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
             'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG', 'TRV',
             'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']
    num_stocks = len(dow30)

    trainData = None
    loadNew = False

    # If stocks.pickle exists then this contains saved stock data, so use this,
    # else use the Pandas DataReader to get the stock data and then pickle it.
    stock_filename = 'stocks.pickle'
    if os.path.exists(stock_filename):
        try:
            with open(stock_filename, 'rb') as f:
                trainData = pickle.load(f)
        except Exception as e:
            print('Unable to process data from', stock_filename, ':', e)
            raise
        print('%s already present - Skipping requesting/pickling.' % stock_filename)
    else:
        # Get the historical data. Make the date range quite a bit bigger than
        # TrainDataSetSize since there are no quotes for weekends and holidays. This
        # ensures we have enough data.
        f = pdr.data.DataReader(dow30, 'yahoo',
                                date.today()-timedelta(days=TrainDataSetSize*2+5),
                                date.today())
        cleanData = f.ix['Adj Close']
        trainData = pd.DataFrame(cleanData)
        trainData.fillna(method='backfill', inplace=True)
        loadNew = True
        print('Pickling %s.' % stock_filename)
        try:
            with open(stock_filename, 'wb') as f:
                pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save data to', stock_filename, ':', e)
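The load-or-download logic above is a general caching pattern, and it can be sketched on its own with only the standard library. In this minimal sketch the helper `load_or_build` and the stand-in `fake_download` are hypothetical names invented for illustration; they are not part of the article's source file, and the real script downloads from Yahoo rather than returning a small dictionary.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return (object, was_built): load the pickle at `path` if it exists,
    otherwise call `build()`, pickle its result to `path`, and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f), False
    obj = build()
    with open(path, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj, True

# Hypothetical stand-in for the slow web service call.
calls = []
def fake_download():
    calls.append(1)
    return {'AAPL': [1.0, 2.0]}

cache_path = os.path.join(tempfile.mkdtemp(), 'stocks.pickle')
first, was_built_first = load_or_build(cache_path, fake_download)
second, was_built_second = load_or_build(cache_path, fake_download)
print(was_built_first, was_built_second, len(calls))
```

The second call finds the pickle on disk and never invokes the download function, which is the same role the `loadNew` flag plays in the script.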

# Normalize the Data

We then normalize the data and remember the factor we used so we can de-normalize the results at the end.

    # Normalize the data by dividing each price by the first price for a stock.
    # This way all the prices start together at 1.
    # Remember the normalizing factors so we can go back to real stock prices
    # for our final predictions.
    factors = np.ndarray(shape=( num_stocks ), dtype=np.float32)
    i = 0
    for symbol in dow30:
        factors[i] = trainData[symbol][0]
        trainData[symbol] = trainData[symbol]/trainData[symbol][0]
        i = i + 1
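To see what the normalization does, here is a small NumPy sketch on made-up prices for two stocks. The price values are invented for illustration, and the array form is an alternative to the loop above rather than the code the script uses.

```python
import numpy as np

# Rows are days, columns are stocks (made-up prices).
prices = np.array([[10.0, 200.0],
                   [11.0, 190.0],
                   [12.0, 210.0]])

# The normalizing factor for each stock is its first price.
factors = prices[0].copy()

# Broadcasting divides every row by the per-stock factors,
# so every stock starts at 1.0.
normalized = prices / factors

# Multiplying by the same factors recovers the real prices.
restored = normalized * factors

print(normalized)
```

Because every series starts at 1, a cheap stock and an expensive stock end up on a comparable scale before they are fed to the network.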

# Re-arrange the Data for TensorFlow

Now we need to build up our training data, test data and validation data. We need to format this as input arrays for the Neural Network. Looking at this code, I think true Python programmers will accuse me of being a C programmer (which I am), since I do this all with loops. I’m sure a more experienced Python programmer could accomplish this quicker with more array operations. This part of the code is quite slow, so we pickle the result. That way, if we re-run with the saved stock data, we can also use the saved training data.

    # Configure how much of the data to use for training, testing and validation.
    usableData = len(trainData.index) - NHistData + 1
    #numTrainData = int(0.6 * usableData)
    #numValidData = int(0.2 * usableData)
    #numTestData = usableData - numTrainData - numValidData - 1
    numTrainData = usableData - 1
    numValidData = 0
    numTestData = 0

    train_dataset = np.ndarray(shape=(numTrainData - 1, num_stocks * NHistData), dtype=np.float32)
    train_labels = np.ndarray(shape=(numTrainData - 1, num_stocks), dtype=np.float32)
    valid_dataset = np.ndarray(shape=(max(0, numValidData - 1), num_stocks * NHistData), dtype=np.float32)
    valid_labels = np.ndarray(shape=(max(0, numValidData - 1), num_stocks), dtype=np.float32)
    test_dataset = np.ndarray(shape=(max(0, numTestData - 1), num_stocks * NHistData), dtype=np.float32)
    test_labels = np.ndarray(shape=(max(0, numTestData - 1), num_stocks), dtype=np.float32)
    final_row = np.ndarray(shape=(1, num_stocks * NHistData), dtype=np.float32)
    final_row_prices = np.ndarray(shape=(1, num_stocks * NHistData), dtype=np.float32)

    # Build the training datasets in the correct format with the matching labels.
    # So if calculate based on last 30 stock prices then the desired
    # result is the 31st. So note that the first 29 data points can't be used.
    # Rather than use the stock price, use the pricing deltas.
    pickle_file = "traindata.pickle"
    if loadNew == True or not os.path.exists(pickle_file):
        for i in range(1, numTrainData):
            for j in range(num_stocks):
                for k in range(NHistData):
                    train_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k]
                        - trainData[dow30[j]][i + k - 1])
                train_labels[i-1][j] = (trainData[dow30[j]][i + NHistData]
                    - trainData[dow30[j]][i + NHistData - 1])

        for i in range(1, numValidData):
            for j in range(num_stocks):
                for k in range(NHistData):
                    valid_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData]
                        - trainData[dow30[j]][i + k + numTrainData - 1])
                valid_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData]
                    - trainData[dow30[j]][i + NHistData + numTrainData - 1])

        for i in range(1, numTestData):
            for j in range(num_stocks):
                for k in range(NHistData):
                    test_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData + numValidData]
                        - trainData[dow30[j]][i + k + numTrainData + numValidData - 1])
                test_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData + numValidData]
                    - trainData[dow30[j]][i + NHistData + numTrainData + numValidData - 1])

        try:
            f = open(pickle_file, 'wb')
            save = {
                'train_dataset': train_dataset,
                'train_labels': train_labels,
                'valid_dataset': valid_dataset,
                'valid_labels': valid_labels,
                'test_dataset': test_dataset,
                'test_labels': test_labels,
            }
            pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
            f.close()
        except Exception as e:
            print('Unable to save data to', pickle_file, ':', e)
            raise
    else:
        with open(pickle_file, 'rb') as f:
            save = pickle.load(f)
            train_dataset = save['train_dataset']
            train_labels = save['train_labels']
            valid_dataset = save['valid_dataset']
            valid_labels = save['valid_labels']
            test_dataset = save['test_dataset']
            test_labels = save['test_labels']
            del save  # hint to help gc free up memory

    # The final row holds the most recent NHistData days, for which we
    # don't yet know the answer. Keep both the prices and the deltas.
    for j in range(num_stocks):
        for k in range(NHistData):
            final_row_prices[0][j * NHistData + k] = trainData[dow30[j]][k + len(trainData.index) - NHistData]
            final_row[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index) - NHistData]
                - trainData[dow30[j]][k + len(trainData.index) - NHistData - 1])

    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)
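For readers wondering what the "more array operations" version of the training loop might look like, here is one possible NumPy sketch. The function `build_windows` is a hypothetical helper, not part of the source file, and it assumes the prices are already in a plain array with one row per day and one column per stock. The script below checks it against the same triple loop on a tiny made-up price table.

```python
import numpy as np

def build_windows(prices, n_hist):
    """prices: (num_days, num_stocks) array.
    Returns (dataset, labels) built from day-over-day price deltas."""
    deltas = np.diff(prices, axis=0)            # (num_days - 1, num_stocks)
    n_rows = deltas.shape[0] - n_hist
    # idx[r, k] = r + k: the n_hist consecutive deltas that feed row r.
    idx = np.arange(n_rows)[:, None] + np.arange(n_hist)[None, :]
    windows = deltas[idx]                       # (n_rows, n_hist, num_stocks)
    # Reorder to stock-major so column j * n_hist + k is stock j, day k.
    dataset = windows.transpose(0, 2, 1).reshape(n_rows, -1)
    labels = deltas[n_hist:]                    # the delta right after each window
    return dataset.astype(np.float32), labels.astype(np.float32)

# Tiny made-up price table: 12 days, 3 stocks, 4 days of history.
rng = np.random.default_rng(0)
prices = rng.random((12, 3))
n_hist, n_stocks = 4, 3
dataset, labels = build_windows(prices, n_hist)

# The same computation written as the article's triple loop.
num_train = prices.shape[0] - n_hist
loop_dataset = np.zeros((num_train - 1, n_stocks * n_hist), dtype=np.float32)
loop_labels = np.zeros((num_train - 1, n_stocks), dtype=np.float32)
for i in range(1, num_train):
    for j in range(n_stocks):
        for k in range(n_hist):
            loop_dataset[i-1][j * n_hist + k] = prices[i + k][j] - prices[i + k - 1][j]
        loop_labels[i-1][j] = prices[i + n_hist][j] - prices[i + n_hist - 1][j]

print(dataset.shape, labels.shape)
```

The key observation is that every entry in the dataset is a one-day delta, so computing all the deltas once with `np.diff` and then gathering windows of them avoids the repeated per-element lookups that make the loop slow.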

# Accuracy

We now set up an accuracy function that is only used to report how we are doing during training. This isn’t used by the training algorithm. It roughly shows what percentage of predictions are within some tolerance.

    # This accuracy function is used for reporting progress during training, it isn't actually
    # used for training.
    def accuracy(predictions, labels):
        err = np.sum( np.isclose(predictions, labels, 0.0, 0.005) ) / (predictions.shape[0] * predictions.shape[1])
        return (100.0 * err)
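As a quick worked example of the tolerance, the two positional arguments passed to `np.isclose` are the relative tolerance (0.0) and the absolute tolerance (0.005), so a prediction counts as correct when it is within 0.005 of the label. The numbers below are made up for illustration.

```python
import numpy as np

def accuracy(predictions, labels):
    # rtol=0.0, atol=0.005: a prediction is "close" if |prediction - label| <= 0.005
    err = np.sum(np.isclose(predictions, labels, 0.0, 0.005)) / (
        predictions.shape[0] * predictions.shape[1])
    return 100.0 * err

predictions = np.array([[0.001, 0.020],
                        [-0.003, 0.000]])
labels = np.array([[0.000, 0.000],
                   [0.000, 0.004]])

# Absolute errors are 0.001, 0.020, 0.003 and 0.004, so three of four are within tolerance.
result = accuracy(predictions, labels)
print(result)  # 75.0
```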

# TensorFlow Variables

We now start setting up TensorFlow by creating our graph and defining our datasets and variables.

    batch_size = 4
    num_hidden = 16
    num_labels = num_stocks

    graph = tf.Graph()

    # input is 30 days of dow 30 prices normalized to be between 0 and 1.
    # output is 30 values for normalized next day price change of dow stocks
    # use a 4 level neural network to compute this.
    with graph.as_default():
        # Input data.
        tf_train_dataset = tf.placeholder(
            tf.float32, shape=(batch_size, num_stocks * NHistData))
        tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
        tf_valid_dataset = tf.constant(valid_dataset)
        tf_test_dataset = tf.constant(test_dataset)
        tf_final_dataset = tf.constant(final_row)

        # Variables.
        layer1_weights = tf.Variable(tf.truncated_normal(
            [NHistData * num_stocks, num_hidden], stddev=0.05))
        layer1_biases = tf.Variable(tf.zeros([num_hidden]))
        layer2_weights = tf.Variable(tf.truncated_normal(
            [num_hidden, num_hidden], stddev=0.05))
        layer2_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
        layer3_weights = tf.Variable(tf.truncated_normal(
            [num_hidden, num_hidden], stddev=0.05))
        layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
        layer4_weights = tf.Variable(tf.truncated_normal(
            [num_hidden, num_labels], stddev=0.05))
        layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

# TensorFlow Model

We now define our Neural Network model. Hyperbolic Tangent is our activation function and the rest is matrix algebra as we described in previous articles.

        # (This continues inside the `with graph.as_default():` block.)
        # Model.
        def model(data):
            hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
            hidden = tf.tanh(tf.matmul(hidden, layer2_weights) + layer2_biases)
            hidden = tf.tanh(tf.matmul(hidden, layer3_weights) + layer3_biases)
            return tf.matmul(hidden, layer4_weights) + layer4_biases
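To make the matrix algebra concrete without needing TensorFlow, here is a rough NumPy sketch of the same forward pass with the same layer sizes. It is an illustration only: plain normal draws stand in for `tf.truncated_normal` (an assumption made for brevity), the input batch is random made-up data, and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs, batch = 30 * 30, 16, 30, 4

# Weights: plain normal draws as a stand-in for tf.truncated_normal (assumption).
# Biases follow the article: zeros for layer 1, ones for the others.
w1 = rng.normal(0.0, 0.05, (n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(0.0, 0.05, (n_hidden, n_hidden))
b2 = np.ones(n_hidden)
w3 = rng.normal(0.0, 0.05, (n_hidden, n_hidden))
b3 = np.ones(n_hidden)
w4 = rng.normal(0.0, 0.05, (n_hidden, n_outputs))
b4 = np.ones(n_outputs)

def forward(x):
    # Three tanh hidden layers followed by a linear output layer.
    h = np.tanh(x @ w1 + b1)
    h = np.tanh(h @ w2 + b2)
    h = np.tanh(h @ w3 + b3)
    return h @ w4 + b4

# A made-up batch of 4 rows, each with 30 days of deltas for 30 stocks.
x = rng.normal(0.0, 0.01, (batch, n_inputs))
out = forward(x)
print(out.shape)
```

Each row of the output holds one predicted price change per stock, which is why the final layer has `num_labels` columns and no activation function.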

# Training Model

Now we set up the training model and the optimizer to use, namely gradient descent. The loss we minimize is the L2 loss between the model’s output and the correct answers (the labels).

        # (This continues inside the `with graph.as_default():` block.)
        # Training computation.
        # Note: tf.sub was renamed to tf.subtract in TensorFlow 1.0.
        logits = model(tf_train_dataset)
        loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))

        # Optimizer.
        optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

        # Predictions for the training, validation, and test data.
        train_prediction = logits
        valid_prediction = model(tf_valid_dataset)
        test_prediction = model(tf_test_dataset)
        next_prices = model(tf_final_dataset)
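For intuition about what the loss and optimizer are doing, here is a small NumPy sketch of one gradient descent step. `tf.nn.l2_loss` computes half the sum of the squared elements, which is what `l2_loss` below reproduces. The single linear layer and the tiny data set are made up for illustration; the real model has four layers and TensorFlow works out the gradients automatically.

```python
import numpy as np

def l2_loss(t):
    # Same quantity as tf.nn.l2_loss: sum(t ** 2) / 2
    return np.sum(t ** 2) / 2.0

# Made-up inputs (4 samples, 3 features) and labels (4 samples, 2 outputs).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.0]])
W = np.zeros((3, 2))
learning_rate = 0.01   # same rate the article passes to the optimizer

loss_before = l2_loss(X @ W - y)

# For loss = 0.5 * sum((XW - y)^2), the gradient with respect to W is X^T (XW - y).
grad = X.T @ (X @ W - y)
W = W - learning_rate * grad

loss_after = l2_loss(X @ W - y)
print(loss_before, loss_after)
```

One step only nudges the weights a little in the downhill direction, which is why training needs many steps over the data.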

# Run the Model

So far we have set up TensorFlow ready to go, but we haven’t calculated anything. This next set of code executes the training run. It will use the data we’ve provided in the configured batch size to train our network while printing out some intermediate information.

    num_steps = 2052

    with tf.Session(graph=graph) as session:
        tf.initialize_all_variables().run()
        print('Initialized')
        for step in range(num_steps):
            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
            batch_data = train_dataset[offset:(offset + batch_size), :]
            batch_labels = train_labels[offset:(offset + batch_size), :]
            feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            _, l, predictions = session.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
            acc = accuracy(predictions, batch_labels)
            if (step % 100 == 0):
                print('Minibatch loss at step %d: %f' % (step, l))
                print('Minibatch accuracy: %.1f%%' % acc)
                if numValidData > 0:
                    print('Validation accuracy: %.1f%%' % accuracy(
                        valid_prediction.eval(), valid_labels))
        if numTestData > 0:
            print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
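The offset arithmetic in the loop decides which rows each minibatch sees, and it is easy to explore in plain Python. The row count of 4103 below is just an example value (it will differ depending on how much stock data you downloaded); `num_steps` and `batch_size` are the values from the article.

```python
num_steps = 2052
batch_size = 4
num_rows = 4103   # example value for train_labels.shape[0]

# Same formula as the training loop: wrap around before running off the end.
offsets = [(step * batch_size) % (num_rows - batch_size)
           for step in range(num_steps)]

print(offsets[:4])          # start of the first pass
print(offsets[1023:1027])   # where the first pass wraps into the second
print(offsets[-2:])         # the last two steps
```

With these numbers the loop makes roughly two passes over the training data, and because the modulus is not a multiple of the batch size, the second pass starts at row 1 instead of row 0, so it sees slightly shifted batches.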

# Make a Prediction

The final bit of code uses our trained model to make a prediction based on the last set of data we have (where we don’t know the right answer). If you get fresh stock market data for today, then the prediction will be for tomorrow’s price changes. If you run this late enough that Yahoo has updated its prices for the day, then you will get some real errors for comparison. Note that Yahoo is very slow and erratic about doing this, so be careful when reading this table.

        # (This continues inside the `with tf.Session(...)` block.)
        # De-normalize the predicted changes back to real prices.
        predictions = next_prices.eval() * factors
        print("Stock Last Close Predict Chg Predict Next Current Current Chg Error")
        i = 0
        for x in dow30:
            yhfeed = Share(x)
            currentPrice = float(yhfeed.get_price())
            print( "%-6s %9.2f %9.2f %9.2f %9.2f %9.2f %9.2f" % (x,
                final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
                predictions[0][i],
                final_row_prices[0][i * NHistData + NHistData - 1] * factors[i] + predictions[0][i],
                currentPrice,
                currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
                abs(predictions[0][i] - (currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i]))) )
            i = i + 1
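The arithmetic behind the table columns can be shown with a couple of made-up stocks. All the numbers here are invented for illustration; in the script, the normalized values come from the trained network and the stored `factors` array.

```python
import numpy as np

# Made-up values for two stocks.
factors = np.array([20.0, 100.0])                 # first price of each stock
last_close_normalized = np.array([1.5, 0.8])      # last known normalized price
predicted_change_normalized = np.array([0.01, -0.02])

# Multiply by the factor to get back to real dollars.
last_close = last_close_normalized * factors
predicted_change = predicted_change_normalized * factors
predicted_next = last_close + predicted_change

for name, close, chg, nxt in zip(['AAA', 'BBB'], last_close,
                                 predicted_change, predicted_next):
    print("%-6s %9.2f %9.2f %9.2f" % (name, close, chg, nxt))
```

Because the network predicts changes in normalized units, both the last close and the predicted change have to be scaled by the same per-stock factor before they can be added together.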

# Results

Below is a screenshot of one run predicting the stock changes for Sept. 22. Basically it didn’t do very well. We’ll talk about why and what to do about this in a future article. As you can see it is very conservative in its predictions.

# Summary

This article shows the code for training and executing a very simple Neural Network using TensorFlow. Definitely don’t bet on the stock market based on this model, since it is very simple at this point. We still need to add a number of elements to start making this into a useful model, which we’ll look at in future articles.

I’ve copy pasted all and run through tensorflow, and have this error:

    File "tf.py", line 154
    final_row_prices[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index - NHistData]

Vingt Cent (@fongo360), September 26, 2016 at 6:08 pm

You’ll do better to get the actual Python source file from my Google drive, the link is near the top of the article. Sorry about the error, I needed to add some line breaks in the blog articles, so I added some parentheses to try to keep the Python syntax correct, but as you point out I added one in the wrong place. I’ve fixed that in the article.

smist08, September 26, 2016 at 6:19 pm

thanks very much 🙂

Vingt Cent (@fongo360), September 26, 2016 at 6:32 pm

Error message while trying to run “tfstocksdiff.py”:

    stocks.pickle already present - Skipping requesting/pickling.
    Training set (4104, 900) (4104, 30)
    Validation set (0, 900) (0, 30)
    Test set (0, 900) (0, 30)
    Traceback (most recent call last):
      File "…/tfstocksdiff.py", line 211, in <module>
        loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))
    AttributeError: module 'tensorflow' has no attribute 'sub'

P Blue, April 8, 2017 at 8:14 am

This is part of the API changes between TensorFlow 0.8/0.9 and TensorFlow 1.0. It’s now tf.subtract. The easiest way to fix things up is to run the script through the conversion/upgrade script that Google provides. Plus see my blog article on TensorFlow 1.0: https://smist08.wordpress.com/2017/02/19/tensorflow-goes-1-0/.

smist08, April 8, 2017 at 5:42 pm

Hi, great post!

When you are running the model with below part;

    num_steps = 2052
    with tf.Session(graph=graph) as session:
        tf.initialize_all_variables().run()
        print('Initialized')
        for step in range(num_steps):
            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

Here, we have train_labels.shape[0] equal to 4103, and if you give “num_steps” as argument to range function, it’s iterating twice, so “offset” will be something like

    0, 4, 8, 12, 16, 20, 24, ..., 4080, 4084, 4088, 4092, 4096,
    1, 5, 9, 13, 17, 21, ..., 4089, 4093, 4097,
    2, 6

Shouldn’t that iterate the whole index just once?

Best,

umit, May 11, 2017 at 1:53 pm

Since this is an iterative algorithm, you often need to run it over the data a great many times before it converges. Also since there is a lot of noise in the data, it can move around quite a bit along the way. If you do happen to reach a good minimum early, you certainly should stop. See the mention of early stopping in this article: https://smist08.wordpress.com/2016/10/16/the-road-to-tensorflow-part-11-generalization-and-overfitting/. There is often a tradeoff to find the perfect min versus over fitting to the training data.

smist08, May 11, 2017 at 6:54 pm