Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘Neural Network’

Components Leading to Strong AI

Introduction

There have been a lot of advances in AI in the past couple of years. A lot of these advances are better simulating the various functions of the brain. These include the convolutional neural networks which are very good at image recognition and new techniques to incorporate memory into neural networks.

Very Deep Neural Networks

In the early days of Neural Networks, finding the weights for the connections was very difficult and often performed by hand. Then the gradient descent algorithm came along and allowed bigger Neural Networks to be trained. Then in 1986 a groundbreaking paper by D. E. Rumelhart, G. E. Hinton and R. J. Williams showed how to use back propagation to train a multi-level Neural Network with Gradient Descent. However the shape of the surface that is being optimized is often very ill suited to this algorithm, containing many local minima, or more usually being very flat and so not indicating which direction to take. Plus, depending on the problem, the training data may contain lots of errors that can mislead the training process.

With recent tweaks to the training algorithms, researchers have managed to train very deep Neural Networks. For instance the Oxford Visual Geometry Group (VGG) has released a pre-trained 19-layer Neural Network for image recognition.

This is a great building block for other image manipulation projects like Image Style Transfer that we looked at previously.

Now these Neural Networks are starting to resemble the architecture and structure of biological Neurons in the human brain such as the following from the human cortex.

This shows that we are starting to accurately simulate the computational engine in our brains.

The Road to Memory

Although the deep neural networks in the last section are very large and powerful at some problems, they fail at other problems primarily due to a lack of memory or context. For instance if you are translating text word by word, you need to remember the previous words in the sentence to get a correct translation based on the context. Or you need to do a first pass word by word and then, knowing the whole, correct mistakes based on now knowing more generally what is being said. Similarly as an algorithm deals with the world, it should learn about the world as it explores and gathers more information. Just retraining the whole Neural Network for each bit of new information is very inefficient.

For language translation and speech recognition the use of Recurrent Neural Networks (RNNs) has proven quite effective. In these the outputs from Neurons can feed into the inputs of the same layer or into the inputs of previous layers. The networks of the previous section were all feed forward Neural Networks since the output of a layer only feeds the input of the next layer. RNNs aren’t true non feed forward networks since they don’t iterate to find a solution with everything stabilized. Rather the outputs from use n go into the inputs of use n+1. In this way these act as a sort of memory from use to use, allowing the network to preserve some context from, say, word to word in translation.
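
A minimal sketch of that feedback in plain Python, assuming a single recurrent unit with a tanh activation. The names `rnn_step`, `run_sequence`, `w_in` and `w_rec`, and the weight values, are made up for illustration and don't come from any library:

```python
import math

def rnn_step(x, h_prev, w_in=0.5, w_rec=0.8, bias=0.0):
    """One use of a single recurrent unit: mixes the new input with the previous state."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)

def run_sequence(inputs, h0=0.0):
    """Feed a sequence through the unit, carrying the state from one use to the next."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)   # the output of use n becomes an input of use n+1
        states.append(h)
    return states

# The same final input (0.0) gives a different output depending on what came before it.
after_positive = run_sequence([1.0, 0.0])[-1]
after_negative = run_sequence([-1.0, 0.0])[-1]
```

Here `after_positive` comes out positive and `after_negative` negative, even though the last input is identical, which is the "context from word to word" effect in miniature.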

More recent research has led to Neural Networks that can actually have memory banks. These include Long Short-Term Memory Cells (LSTM Cells) and Gated Recurrent Unit (GRU) Cells.

These artificial neurons have the ability to store memory values (as well as forget memory values). The key difficulty in adding memory to Neural Networks was in how to train them. Gradient Descent and all its variations require that the function being optimized is differentiable or very nearly so. Putting things in memory, reading memory and erasing memory are very discrete functions. These sorts of functions are not differentiable at the points where they jump, and this can’t be patched since they are flat with zero derivative everywhere else. Something with a zero derivative doesn’t give any information to Gradient Descent as to which direction to go. The solution to this was to replace the discrete functions with smooth, probability-like functions that are differentiable. So rather than making a hard decision to put something in memory, the function gives a value between 0 and 1 that says how strongly the value should be stored. In LSTM and GRU cells this gate value scales what is written, kept or read, rather than being compared against a hard cut-off such as 50%.
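
As a rough illustration of a differentiable memory update, here is a much simplified sketch in plain Python. It is not the full LSTM or GRU equations, and the function and parameter names (`gated_memory_update`, `write_score`, `forget_score`) are assumptions made for this example:

```python
import math

def sigmoid(z):
    """Smooth squashing function with values strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def gated_memory_update(memory, candidate, write_score, forget_score):
    """Soft version of "erase" and "store".

    Both gates vary smoothly with their scores, so a small change in a score
    changes the result slightly, which is what Gradient Descent needs.
    """
    write_gate = sigmoid(write_score)
    keep_gate = 1.0 - sigmoid(forget_score)
    return keep_gate * memory + write_gate * candidate

# Strongly positive write score and strongly negative forget score:
# keep nearly all of the old memory and add nearly all of the candidate.
updated = gated_memory_update(memory=1.0, candidate=0.5,
                              write_score=10.0, forget_score=-10.0)

# With both scores at zero each gate is exactly one half.
halfway = gated_memory_update(memory=1.0, candidate=0.5,
                              write_score=0.0, forget_score=0.0)
```

Because nothing here is an if statement on a threshold, the derivative with respect to either score is never exactly zero.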

Learning

I think the current tools for training Neural Networks work quite well for deep feedforward Neural Networks. I think they do a good job of training the weights to use in the various network layers. However I don’t think they provide a good solution for training systems with memory. The brain probably uses some process similar to what we do to train the input weights and outputs to biological Neurons, such as Hebbian Learning. However I don’t think this is what is used to decide whether to remember something or not. I think we still have a long way to go before effectively using memory in our Neural Networks even though just a little bit of memory is greatly improving our translators, speech and text recognition programs.

Summary

The field of Neural Networks is making great progress. This is due to advances in refining the training process of deep Neural Networks along with advances in making artificial Neurons more sophisticated by adding elements like memory banks. Combine this with the fast pace of development of GPUs allowing essentially low cost supercomputers for training and running these networks and the large amount of venture capital that is flowing into anything AI related and we are seeing a true renaissance in the AI field.

Does someone have a true deep AI running in their lab already? Perhaps, but if they don’t, I think we are starting to get quite close.

Written by smist08

September 29, 2017 at 9:12 pm

Playing with Image Style Transfer

Introduction

Last time we introduced Image Style Transfer, an AI algorithm that combines the contents of one image with the style of another image. In this article we are going to look at some ways to play with this process in more advanced ways. We are going to play with Anish Athalye’s implementation, which is on GitHub here. This implementation is really good at allowing lots of tuning and playing.

Playing around this way is quite time consuming since you have to run Gradient Descent to find the solution, rather than just applying canned solutions. Since I ran all these on an older MacBook Air with no GPU, I had to use a lower resolution in the interest of time. At lower resolution (the MacOS’s small size) it took about an hour for each image. At medium resolution, it took about six hours to generate an image. This is ok for running overnight but doesn’t allow a lot of play. Makes me wonder if I should get a beefy desktop computer with a good NVidia GPU?

I found a really good YouTube video explaining Image Style Transfer here which is well worth a watch.

Playing with Algorithms

We’ve seen in previous articles how we can play with the tunable parameters in AI algorithms to get quite different results. Here we’ll look at the effects of playing with some parameters as well as fiddling with the algorithm itself.

The basic observation that led to Image Style Transfer was that a deep image recognition neural network separates out content and style: the activations in the deeper layers capture the content of an image, while the correlations between the features within a layer, taken across several layers, capture its style. Interestingly the human brain’s image recognition neurons appear to be structured in the same sort of way and it is believed there is a fair bit of similarity between how an advanced image recognition algorithm works and how the brain works. This separation of content from style is then the basis for merging and manipulating these.

The Image Style Transfer algorithm works by starting with an image of white noise and then iterating it using gradient descent to minimize both the difference between its content and the content of one image and the difference between its style and the style of the other. This is the loss function we often talk about in AI. The interesting part of the algorithm is that we aren’t training the neural network matrix weights, since these are pre-trained by the VGG group, but we are training the input image. So we have a loss function like:

Total Loss = Loss of content from first image + Loss of style from the second image

We can then play with this Loss function in various ways which we’ll experiment with in the rest of this article.
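
To make the "train the image, not the weights" idea concrete, here is a toy sketch in plain Python. It is only a stand-in: the real algorithm measures content and style through VGG features, whereas here both losses are assumed to be simple squared pixel differences, and every name (`optimize_image`, `total_loss`, `loss_gradient`) is made up for the example:

```python
import random

def total_loss(image, content, style):
    """Toy total loss: squared distance to the content plus squared distance to the style."""
    content_loss = sum((p - c) ** 2 for p, c in zip(image, content))
    style_loss = sum((p - s) ** 2 for p, s in zip(image, style))
    return content_loss + style_loss

def loss_gradient(image, content, style):
    """Derivative of total_loss with respect to each pixel of the image."""
    return [2.0 * (p - c) + 2.0 * (p - s) for p, c, s in zip(image, content, style)]

def optimize_image(content, style, steps=200, learning_rate=0.1, seed=0):
    """Gradient descent on the pixels themselves, starting from noise."""
    rng = random.Random(seed)
    image = [rng.random() for _ in content]
    for _ in range(steps):
        grad = loss_gradient(image, content, style)
        image = [p - learning_rate * g for p, g in zip(image, grad)]
    return image

content = [0.0, 0.2, 0.9, 1.0]   # made-up "content image" pixels
style = [1.0, 0.8, 0.1, 0.0]     # made-up "style image" pixels
result = optimize_image(content, style)
```

With equal weighting the toy problem settles halfway between the two targets, so every pixel of `result` ends up at 0.5.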

Apply Some Weights

Usually in Machine Learning algorithms we apply weights everywhere that we can use to tune things. The same applies here. We can weight the contributions from content versus style in the total loss formula to achieve more of a contribution from style or content.

First we take a picture of Tetrahedron Peak and combine it with Vincent van Gogh’s Starry Night using the default settings of the algorithm:

Now we can try playing with the weight of the content contribution. Lower means more style, higher means more content. In the image above the content weight was the default of 5.

Notice the image on the left is much more abstract with the large stars appearing all over.

Using Multiple Styles

Last time we used one style at a time to get our result. But you can actually use the algorithm to incorporate multiple styles at once. In this case we just generalize the Loss function above as:

Total Loss = Loss of content from first image + Loss of style from style image 1 +
                 Loss of style from style image 2

Of course we can then further generalize this to any number of style images.
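
A small sketch of that generalized loss in plain Python, assuming the individual content and style losses have already been computed as numbers. The function name `combined_loss` and its parameters are hypothetical, and normalizing the default blend weights so they sum to 1 is a choice made for this sketch rather than something taken from the implementation:

```python
def combined_loss(content_loss, style_losses, content_weight=1.0, style_blend_weights=None):
    """Total loss = weighted content loss + weighted sum of the per-style losses."""
    if style_blend_weights is None:
        # Default: every style image contributes equally.
        style_blend_weights = [1.0 / len(style_losses)] * len(style_losses)
    if len(style_blend_weights) != len(style_losses):
        raise ValueError("need one blend weight per style loss")
    style_term = sum(w * loss for w, loss in zip(style_blend_weights, style_losses))
    return content_weight * content_loss + style_term

# Made-up loss values for one content image and two style images.
equal_blend = combined_loss(2.0, [4.0, 8.0])
mostly_first = combined_loss(2.0, [4.0, 8.0], style_blend_weights=[0.75, 0.25])
```

Passing a longer list of style losses (and matching weights) covers any number of style images.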

We’ll use our Starry Night combination and also use Picasso’s Dora Maar:

Now we will use both pictures for the style and see what we get:

This weights the styles of Starry Night and Dora Maar equally. However you can see from the Loss formula that we can easily weight the components and get say 75% Starry Night and 25% Dora Maar:

 

Now if we reverse the weights and do Starry Night at 25% and Dora Maar at 75%:

Playing with the Neural Network

We can also play with the Neural Network used. We can change a number of parameters in the Neural Network as well as introduce various scaling and weight factors.

Pooling Type

For instance there is something called Pooling Layers in the network. These reduce the resolution of the image and help with moving from fine level details to higher level abstractions. There are two commonly used types of pooling layers, namely average pooling and max pooling. We can try either of these to see what effect that might have on the image style transfer.
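
For readers who haven't met pooling before, this is a minimal plain Python sketch of 2x2 pooling on a tiny made-up grid of numbers. The helper names `pool_2x2` and `average` are invented for the example:

```python
def pool_2x2(image, reduce_fn):
    """Downsample a 2D list by applying reduce_fn to each non-overlapping 2x2 block."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[r]) - 1, 2):
            block = [image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1]]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

image = [
    [1, 3, 0, 0],
    [5, 7, 0, 8],
    [2, 2, 4, 4],
    [2, 2, 4, 4],
]
max_pooled = pool_2x2(image, max)        # keeps the strongest value in each block
avg_pooled = pool_2x2(image, average)    # blends all four values in each block
```

Both results are half the width and height of the input; they differ in whether a single strong value dominates a block (max) or is smoothed out (average).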

Here we see that average pooling favoured fine details and preserved more of the content image, whereas max pooling used more of the style image and is a bit more abstract.

Exponential Style Layer Weight

Another thing we can do is magnify some layers over others. For instance we can magnify each style layer over the last one as follows:

weight(layer<n+1>) = weight_exp*weight(layer<n>)

The default is 1 (i.e. none). Here is Tetrahedron Peak using 0.2 and 2.0.

A factor less than one means more original content since some style layers are suppressed, and a factor greater than one magnifies some style layer contributions. Since the style layers aren’t all weighted the same, this is a bit different from just changing the weighting factor between content and style.
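
The recurrence above can be sketched in a few lines of plain Python. The layer names are placeholders, the function name `style_layer_weights` is made up, and the final normalization step is an assumption of this sketch:

```python
def style_layer_weights(layer_names, weight_exp=1.0):
    """Give each successive style layer weight_exp times the weight of the previous one."""
    weights = {}
    weight = 1.0
    for name in layer_names:
        weights[name] = weight
        weight *= weight_exp   # weight(layer n+1) = weight_exp * weight(layer n)
    # Normalize so the weights sum to 1 (an assumption of this sketch), which keeps the
    # overall style contribution comparable across different values of weight_exp.
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

layers = ["layer1", "layer2", "layer3"]           # placeholder layer names
uniform = style_layer_weights(layers)             # weight_exp = 1: all layers equal
later_heavy = style_layer_weights(layers, 2.0)    # 1 : 2 : 4 before normalization
```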

Iterations

Another parameter that is fun to play with is the number of iterations that Gradient Descent runs for. Below we can see a sequence of images as the number of iterations is increased. We can see the content and style of the image forming out of the initial white noise.

At this resolution we are pretty much converged at 500 iterations, however for higher resolution and more complicated images more iterations might be necessary. We could also use a stopping criterion like when the loss function stops changing by some delta, rather than using a fixed number of iterations.
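
A stopping criterion of that kind might look like the following plain Python sketch, shown on a toy one-dimensional problem rather than on an image. The names `minimize`, `delta` and `max_iterations` are assumptions for the example:

```python
def minimize(loss_fn, grad_fn, x, learning_rate=0.1, delta=1e-8, max_iterations=10000):
    """Gradient descent that stops once the loss changes by less than delta."""
    previous = loss_fn(x)
    for iteration in range(1, max_iterations + 1):
        x = x - learning_rate * grad_fn(x)
        current = loss_fn(x)
        if abs(previous - current) < delta:
            return x, iteration      # converged: the loss has stopped changing
        previous = current
    return x, max_iterations         # safety cap in case we never converge

# Toy problem: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
solution, iterations_used = minimize(lambda x: (x - 3.0) ** 2,
                                     lambda x: 2.0 * (x - 3.0),
                                     x=0.0)
```

The fixed cap is still kept as a fallback so a badly behaved problem can't loop forever.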

This problem converges quite well since it is mathematically well defined. Often in AI, we don’t get this good behaviour because the training data has lots of errors and/or lots of noise. Here we are just training against a content picture and one or more style pictures, so by definition there isn’t any erroneous data. These challenges would have been faced and solved by the team developing the VGG image recognition neural network that we get to just use and don’t have to worry about training.

Summary

As we can see we can get quite a few different effects by tuning the algorithm using the same style picture as a reference. Simple tools like Prisma or deepart.io don’t let you play with all these parameters. As a photographer who is trying to get a specific effect, you want the power and flexibility to tune your style transfer exactly. Right now the only way to do this is to run the AI algorithms on your computer and play with them, which is very time consuming. I suspect once this technology is incorporated in more advanced tools then various degrees of tuning will be possible. Adobe has been demonstrating Image Style Transfer in their labs, and it will be interesting to see if they incorporate it into Photoshop and then how much tuning is possible. Also if it runs in the Adobe Creative Cloud, it will be interesting to see whether it’s quicker running that way than running on your own computer.

 

Written by smist08

August 21, 2017 at 4:29 pm

The Road to TensorFlow – Part 11: Generalization and Overfitting

Introduction

With sophisticated Neural Networks, you are dealing with a quite complicated nonlinear function. When fitting a high degree polynomial to a few data points, the polynomial can go through all the points but have such steep slopes that it is useless for predicting points between the training points. We get this same sort of behaviour in Neural Networks. In a way you are training the Neural Network to memorize all the training data exactly rather than figuring out the trends and patterns that you can use to predict other values.

We’ve touched upon this problem in other articles like here and here, but glossed over what we are doing about this problem. In this article we’ll explore what we can do about this in more detail.

One solution is to perhaps gather more training data, however this may be impossible or quite expensive. It also might be that the training data is missing some representative samples. Here we’ll concentrate on what we can do with the algorithm rather than trying to improve the data.

Interpolation and Extrapolation

Here we refer to generalization as wanting to get answers to data that isn’t in the training data. We refer to overfitting as the case where the model works really well for the training data but doesn’t do nearly as well for anything else.

There are two distinct cases we want to worry about. One is interpolation, which is trying to estimate values where the inputs are surrounded by data in the training set. The other is extrapolation, which is the process of trying to predict what happens beyond the training data. Our stock market data is an example of extrapolation. Recognizing handwriting is an example of interpolation (assuming you have a good sample of training data).

Extrapolation tends to be a much harder problem than interpolation, but both are strongly affected by overfitting.

Early Stopping

What we often do is divide our data into three groups. The largest of these we call the training data and use for training. Another is the test data, which we run after training to see how well the algorithm works on data that hasn’t been seen by training. To help with detecting overfitting we create a third group, the validation data, which we run after a certain number of steps during training. The following screenshot shows the results for the training and validation sets (this is for a Kaggle competition so the test set needs to be submitted to Kaggle to get the answer). Here smaller values are better. Notice that the training data gets better, starting at 3209.5 and going down to 712.8, which indicates training is working. However the validation data starts at 3014.3, goes down to the 1160s and then starts increasing. This indicates we are overfitting the data.

[Screenshot: training and validation results printed at intervals during training]

The approach here is really simple: just stop training once the validation error starts increasing and say we’re done. This is actually a pretty simple and effective way to prevent overfitting. As an added bonus this is a rare technique that leads to faster training.
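
As a sketch of how this could be automated, the plain Python function below scans a list of validation results and reports where to stop. The `patience` parameter (how many checks without improvement to tolerate) is an addition of this sketch, not something from the article, and apart from the starting value the numbers are made up:

```python
def early_stopping_step(validation_losses, patience=1):
    """Return the index of the best validation loss seen before giving up.

    Stops scanning after `patience` consecutive checks without improvement.
    """
    best_index = 0
    best_loss = float("inf")
    checks_without_improvement = 0
    for index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = index, loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break   # validation loss has stopped improving
    return best_index

# Illustrative validation losses that fall and then start rising.
losses = [3014.3, 1900.0, 1300.0, 1165.0, 1180.0, 1250.0, 1100.0]
stop_immediately = early_stopping_step(losses)          # stops at the first rise
stop_patiently = early_stopping_step(losses, patience=3)
```

With `patience=1` this matches the rule in the text. A larger patience rides out a temporary bump at the cost of some extra training time.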

Penalizing Large Weights

A sign of overfitting is that the slope of our function is high at the points in the training data. Since the slope is approximated by the appropriate weights in our matrix, we would want to keep the weights in our weight matrices low. The way we accomplish this is to add a penalty to the loss function based on the size of the weights.

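     # logits - tf_train_labels is the same element-wise subtraction as tf.sub(...),
     # and also works on TensorFlow versions where tf.sub was renamed to tf.subtract.
     loss = (tf.nn.l2_loss(logits - tf_train_labels)
         + tf.nn.l2_loss(layer1_weights)*beta
         + tf.nn.l2_loss(layer2_weights)*beta
         + tf.nn.l2_loss(layer3_weights)*beta
         + tf.nn.l2_loss(layer4_weights)*beta)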

 

Here we add the sum of the squares of the weights to our loss function (tf.nn.l2_loss returns half the sum of squares). The factor beta is there to let us scale this value to be in the same magnitude as the main loss function. I’ve found that in some problems making the loss due to the weights about equal to the main loss works quite well. In another problem I found choosing beta so that the loss due to the weights is about 10% of the main loss worked quite well.

I have found that combining this with early stopping works quite well. The weight penalty lets us train longer before we start overfitting, which leads to a better overall result.
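
To show how beta might be chosen for a target ratio, here is a small plain Python sketch that works on ordinary lists instead of TensorFlow tensors. The helper `beta_for_target_ratio` and all the numbers are hypothetical; only the definition of `l2_loss` (half the sum of squares) mirrors `tf.nn.l2_loss`:

```python
def l2_loss(values):
    """Half the sum of squares, the same quantity tf.nn.l2_loss computes."""
    return sum(v * v for v in values) / 2.0

def beta_for_target_ratio(main_loss, weight_layers, target_ratio):
    """Pick beta so that beta * (total weight penalty) equals target_ratio * main_loss."""
    penalty = sum(l2_loss(layer) for layer in weight_layers)
    if penalty == 0.0:
        raise ValueError("all weights are zero, so any beta gives a zero penalty")
    return target_ratio * main_loss / penalty

layers = [[3.0, 4.0], [1.0, 2.0, 2.0]]   # made-up flattened weights of two layers
main_loss = 50.0                          # made-up value of the main loss
beta = beta_for_target_ratio(main_loss, layers, target_ratio=0.10)
total_loss = main_loss + beta * sum(l2_loss(layer) for layer in layers)
```

In practice the main loss and the weights both change during training, so a beta chosen this way is only a starting point for experimentation.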

Dropout

One property of the Neural Networks in our brain is that brain cells die, but our brain seems to mostly keep on working. In this sense the brain is far more resilient to damage than a computer. The idea behind dropout is to try to add rules to train the Neural Network to be resilient to Neurons being removed. This means the Neural Network can’t be completely reliant on any given Neuron since it could die (be removed from the model).

[Figure: dropout]

The way we accomplish this is we add a dropout activation function at some point:

            if dropout:
                hidden = tf.nn.dropout(hidden, 0.5)

 

This activation function will randomly zero about 50% of the neuron outputs at this layer and scale up the remaining outputs by a matching amount (a factor of 2 here). This is so the expected sum stays the same, which means you can use the same weights whether dropout is present or not.

The reason for the if statement is that you only want to do dropout during training and not during validation, testing or production.
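
What the dropout step does can be sketched in plain Python as follows. This is an illustration of the idea rather than TensorFlow's implementation, it assumes the 0.5 above is the probability of keeping a value, and the names `dropout`, `keep_prob` and `training` are chosen for the example:

```python
import random

def dropout(values, keep_prob, training, rng=random):
    """Zero each value with probability 1 - keep_prob and scale the survivors
    by 1 / keep_prob so the expected sum is unchanged."""
    if not training:
        return list(values)    # validation, testing, production: no dropout
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(42)        # fixed seed so the example is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.5, training=True, rng=rng)
unchanged = dropout(activations, keep_prob=0.5, training=False)
```

Each value in `dropped` is either 0.0 or 2.0, and their average stays close to the original 1.0, which is why the same weights can be used with dropout switched off.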

You would do this on each hidden layer. It’s rather surprising that the Neural Network still works as well as it does with this much dropout.

I find dropout doesn’t always help, but when it does you can combine it with penalizing the weights and then you can train longer before you need to stop due to overfitting. This can sometimes help a network find finer details without overfitting.

When you do dropout, you do have to train for a longer time, so if this is too time consuming you might not want to use it.

I think it’s a good sign that Neural Networks can exhibit the same resilience to damage that the brain shows. Perhaps a bit of biological evidence that we are on the correct track.

Summary

These are a few techniques you can use to avoid overfitting your model. I generally use all three so I can train a bit longer without overfitting. If you can get more good training data that can also help quite a bit. Using a simpler model (with fewer hidden nodes) can also help with overfitting, but perhaps not provide as good a functional approximation as the more complicated model. As with all things in computer science you are always trading off complexity, overfitting and performance.

Written by smist08

October 16, 2016 at 6:49 pm

The Road to TensorFlow – Part 7: Finally Some Code

Introduction

Well after a long journey through Linux, Python, Python Libraries, the Stock Market, an Introduction to Neural Networks and training Neural Networks we are now ready to look at a complete Python example to predict the stock market.

I placed the full source code listing on my Google Drive here. As described in the previous articles you will need to run this on a Mac or on Linux (could be a virtual image) with Python and TensorFlow installed. You will also need to have the various libraries that are imported at the top of the source file installed or you will get an error when you go to run it. I would suggest getting the source file to play with. Python is very fussy about indentation, so copy/paste from the article may introduce indentation errors caused by the blog formatting.

The Neural Network we are running here is a simple feed forward network with four layers: three hidden layers that use the hyperbolic tangent as the activation function, followed by a linear output layer. This is a very simple model so don’t use it to invest with real money. Hopefully this article gives a flavour for how to create and train a Neural Network using TensorFlow. Then in future articles we can discuss the limitations of this model and how to improve it.

Import Libraries

First we import all the various libraries we will be using. Note that tensorflow and numpy are particularly important.

# Copyright 2016 Stephen Smith

import time
import math
import os
from datetime import date
from datetime import timedelta
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import pandas_datareader as pdr
from pandas_datareader import data, wb
from six.moves import cPickle as pickle
from yahoo_finance import Share

Get Stock Market Data

Next we get the stock market data. If the file stocks.pickle exists we assume we’ve previously saved this file and use it. Otherwise we get the data from Yahoo Finance using a Web Service call, made via the Pandas DataReader. We only keep the adjusted close column and we fill in any NaN’s with the first value we saw (this really only applies to Visa in this case). The data will all be in a standard Pandas data frame after this.

# Choose amount of historical data to use NHistData
NHistData = 30
TrainDataSetSize = 3000

# Load the Dow 30 stocks from Yahoo into a Pandas DataFrame

dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
         'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
         'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG',
         'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']

num_stocks = len(dow30)

trainData = None
loadNew = False

# If stocks.pickle exists then this contains saved stock data, so use this,
# else use the Pandas DataReader to get the stock data and then pickle it.
stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
    try:
        with open(stock_filename, 'rb') as f:
            trainData = pickle.load(f)
    except Exception as e:
        print('Unable to process data from', stock_filename, ':', e)
        raise
    print('%s already present - Skipping requesting/pickling.' % stock_filename)
else:
    # Get the historical data. Make the date range quite a bit bigger than
    # TrainDataSetSize since there are no quotes for weekends and holidays. This
    # ensures we have enough data.

    f = pdr.data.DataReader(dow30, 'yahoo',
        date.today()-timedelta(days=TrainDataSetSize*2+5), date.today())
    cleanData = f.ix['Adj Close']
    trainData = pd.DataFrame(cleanData)
    trainData.fillna(method='backfill', inplace=True)
    loadNew = True
    print('Pickling %s.' % stock_filename)
    try:
        with open(stock_filename, 'wb') as f:
          pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', stock_filename, ':', e)

Normalize the Data

We then normalize the data and remember the factor we used so we can de-normalize the results at the end.

# Normalize the data by dividing each price by the first price for a stock.
# This way all the prices start together at 1.
# Remember the normalizing factors so we can go back to real stock prices
# for our final predictions.
factors = np.ndarray(shape=( num_stocks ), dtype=np.float32)
for i, symbol in enumerate(dow30):
    factors[i] = trainData[symbol][0]
    trainData[symbol] = trainData[symbol]/trainData[symbol][0]

Re-arrange the Data for TensorFlow

Now we need to build up our training data, test data and validation data. We need to format this as input arrays for the Neural Network. Looking at this code, I think true Python programmers will accuse me of being a C programmer (which I am), since I do this all with loops. I’m sure a more experienced Python programmer could accomplish this quicker with more array operations. This part of the code is quite slow so we pickle the result; if we re-run with the saved stock data, we can also use the saved training data.

# Configure how much of the data to use for training, testing and validation.

usableData = len(trainData.index) - NHistData + 1
#numTrainData =  int(0.6 * usableData)
#numValidData =  int(0.2 * usableData)
#numTestData = usableData - numTrainData - numValidData - 1
numTrainData = usableData - 1
numValidData = 0
numTestData = 0

train_dataset = np.ndarray(shape=(numTrainData - 1,
    num_stocks * NHistData), dtype=np.float32)
train_labels = np.ndarray(shape=(numTrainData - 1, num_stocks),
    dtype=np.float32)
valid_dataset = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks * NHistData), dtype=np.float32)
valid_labels = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks), dtype=np.float32)
test_dataset = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks * NHistData), dtype=np.float32)
test_labels = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks), dtype=np.float32)
final_row = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)
final_row_prices = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)

# Build the training datasets in the correct format with the matching labels.
# So if we calculate based on the last 30 stock prices then the desired
# result is the 31st. So note that the first 29 data points can't be used.
# Rather than use the stock price, use the pricing deltas.
pickle_file = "traindata.pickle"
if loadNew or not os.path.exists(pickle_file):
    for i in range(1, numTrainData):
        for j in range(num_stocks):
            for k in range(NHistData):
                train_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k]
                    - trainData[dow30[j]][i + k - 1])
            train_labels[i-1][j] = (trainData[dow30[j]][i + NHistData]
                - trainData[dow30[j]][i + NHistData - 1])  

    for i in range(1, numValidData):
        for j in range(num_stocks):
            for k in range(NHistData):
                valid_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData]
                    - trainData[dow30[j]][i + k + numTrainData - 1])
            valid_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData]
                - trainData[dow30[j]][i + NHistData + numTrainData - 1])

    for i in range(1, numTestData):
        for j in range(num_stocks):
            for k in range(NHistData):
                test_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData + numValidData]
                    - trainData[dow30[j]][i + k + numTrainData + numValidData - 1])
            test_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData + numValidData]
                - trainData[dow30[j]][i + NHistData + numTrainData + numValidData - 1])

    try:
      f = open(pickle_file, 'wb')
      save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
        }
      pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
      f.close()
    except Exception as e:
      print('Unable to save data to', pickle_file, ':', e)
      raise

else:
    with open(pickle_file, 'rb') as f:
      save = pickle.load(f)
      train_dataset = save['train_dataset']
      train_labels = save['train_labels']
      valid_dataset = save['valid_dataset']
      valid_labels = save['valid_labels']
      test_dataset = save['test_dataset']
      test_labels = save['test_labels']
      del save  # hint to help gc free up memory   

for j in range(num_stocks):
    for k in range(NHistData):
        final_row_prices[0][j * NHistData + k] = trainData[dow30[j]][k + len(trainData.index) - NHistData]
        final_row[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index) - NHistData]
            - trainData[dow30[j]][k + len(trainData.index) - NHistData - 1])

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Accuracy

We now set up an accuracy function that is only used to report how we are doing during training. This isn’t used by the training algorithm. It roughly shows what percentage of predictions are within some tolerance.

# This accuracy function is used for reporting progress during training, it isn't actually
# used for training.
def accuracy(predictions, labels):
  # Percentage of predictions within an absolute tolerance of 0.005 of the label.
  within_tolerance = np.isclose(predictions, labels, rtol=0.0, atol=0.005)
  return 100.0 * np.mean(within_tolerance)

TensorFlow Variables

We now start setting up TensorFlow by creating our graph and defining our datasets and variables.

batch_size = 4
num_hidden = 16
num_labels = num_stocks

graph = tf.Graph()

# input is 30 days of daily price changes for the dow 30 stocks (prices normalized to start at 1).
# output is 30 values for the normalized next day price change of the dow stocks
# use a 4 layer neural network to compute this.

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, num_stocks * NHistData))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  tf_final_dataset = tf.constant(final_row)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [NHistData * num_stocks, num_hidden], stddev=0.05))
  layer1_biases = tf.Variable(tf.zeros([num_hidden]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.05))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

TensorFlow Model

We now define our Neural Network model. Hyperbolic Tangent is our activation function and the rest is matrix algebra as we described in previous articles.

  # Model.
  def model(data):
    hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer2_weights) + layer2_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

Training Model

Now we set up the training model and the optimizer to use, namely gradient descent. We also define the predictions for the training, validation and test data, which we compare against the correct answers.

  # Training computation.
  logits = model(tf_train_dataset)
  loss = tf.nn.l2_loss(logits - tf_train_labels)

  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
  # Predictions for the training, validation, and test data.
  train_prediction = logits
  valid_prediction = model(tf_valid_dataset)
  test_prediction = model(tf_test_dataset)
  next_prices = model(tf_final_dataset)

Run the Model

So far we have set up TensorFlow ready to go, but we haven’t calculated anything. This next set of code executes the training run. It will use the data we’ve provided in the configured batch size to train our network while printing out some intermediate information.

num_steps = 2052

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    acc = accuracy(predictions, batch_labels)
    if (step % 100 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % acc)
      if numValidData > 0:
          print('Validation accuracy: %.1f%%' % accuracy(
              valid_prediction.eval(), valid_labels))
  if numTestData > 0:
      print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
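
The offset arithmetic in the loop above slides a window through the training data one batch at a time and wraps around before it would run off the end. Here is a small sketch of just that calculation; the helper name `batch_offsets` and the tiny sizes are made up for illustration.

```python
def batch_offsets(num_rows, batch_size, num_steps):
    """Offsets used by the training loop: advance one batch per step, wrapping around."""
    return [(step * batch_size) % (num_rows - batch_size)
            for step in range(num_steps)]

offsets = batch_offsets(num_rows=10, batch_size=3, num_steps=6)
print(offsets)  # [0, 3, 6, 2, 5, 1]
```

Because the modulus is the number of rows minus the batch size, every offset leaves room for a full batch.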

Make a Prediction

The final bit of code uses our trained model to make a prediction based on the last set of data we have (where we don’t know the right answer). If you get fresh stock market data for today, then the prediction will be for tomorrow’s price changes. If you run this late enough that Yahoo has updated its prices for the day, then you will get some real errors for comparison. Note that Yahoo is very slow and erratic about doing this, so be careful when reading this table.

  predictions = next_prices.eval() * factors
  print("Stock    Last Close  Predict Chg   Predict Next      Current     Current Chg       Error")
  i = 0
  for x in dow30:
      yhfeed = Share(x)
      currentPrice = float(yhfeed.get_price())
      print( "%-6s  %9.2f  %9.2f       %9.2f       %9.2f     %9.2f     %9.2f" % (x,
             final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
             predictions[0][i],
             final_row_prices[0][i * NHistData + NHistData - 1] * factors[i] + predictions[0][i],
             currentPrice,
             currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
             abs(predictions[0][i] - (currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i]))) )
      i = i + 1

Results

Below is a screenshot of one run predicting the stock changes for Sept. 22. Basically it didn’t do very well. We’ll talk about why and what to do about this in a future article. As you can see it is very conservative in its predictions.

results

Summary

This article shows the code for training and executing a very simple Neural Network using TensorFlow. Definitely don’t bet on the stock market based on this model, as it is very simple at this point. We still need to add a number of elements to start making this into a useful model, which we’ll look at in future articles.

Written by smist08

September 23, 2016 at 4:17 pm

The Road to TensorFlow – Part 6: Optimization and Training

with 3 comments

Introduction

Last time we looked at the matrix equation that would be our Neural Network which is:

Output of Layer = ActivationFunction( A x (Input of Layer) + b )

We also specified that our input vector would be 900 elements long (the 30 Dow stocks times the last 30 price changes) and the output vector would be 30 elements (the next price change for each of the Dow 30 stocks). This means that if we have just one hidden layer of say 100 Neurons then we need a 900×100 matrix and a 100×30 matrix plus a 100 element bias vector and a 30 element bias vector. This means we need 900×100 + 100×30 + 100 + 30 = 93,130 values. Where do these all come from? In this article we’ll look at where we get these.
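
As a quick check on that arithmetic, here is a short sketch that counts the weights and biases for any list of layer sizes. The function name `count_parameters` is just a name picked for this example.

```python
def count_parameters(layer_sizes):
    """Weights (m x n per layer) plus biases (n per layer) for a dense network."""
    return sum(m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([900, 100, 30]))  # 93130
```

Adding more hidden layers just means adding more entries to the list.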

Training

What we want to do is use some sort of known or historical data to train the Neural Network. When Neural Networks were first proposed, Computer Scientists tuned the weights by hand, which took a long time and produced very small Neural Networks that didn’t work well. Later on many methods were developed to calculate the weights from databases of known cases, however until recently these databases were too small to be effective and led to extreme over-fitting. With the advent of big data, shared cloud resources and automated data collection, a large number of high quality, extremely large databases are available to train Neural Networks for well-known problems like handwriting recognition or shape identification. Notice that finding the 93,130 values from the introduction requires far more than 93,130 pieces of data, since anything less will lead to over-fitting (which we’ll talk a lot about in a future article).

If you remember back to basic statistics and linear regression, we found the best fit for a straight line through a number of data points by minimizing the sum of the squares of the distances from the line to each data point. This is why it’s often called least squares regression. Basically we are formulating the problem as an optimization problem where we are trying to minimize an error function. The linear regression problem is then easily solvable by first year linear algebra. For the Neural Network case it’s a little bit more complicated, but the basic idea is the same.
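
For comparison, here is roughly what the linear regression case looks like in NumPy. The data points are made up and lie exactly on the line y = 2x + 1, so the fit should recover that slope and intercept.

```python
import numpy as np

# Made-up data points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Solve the least squares problem: minimize the sum of squared errors.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)
```

There is no iteration here: linear algebra gives the answer in one step, which is the luxury we lose once the model becomes a Neural Network.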

To train our Neural Network we will use historical data where we provide 30 days of price changes for the Dow 30 stocks and then, since we know the next change, we can compute the error for our error function. To define an error function, we are going to start by just using the square of the difference, so basically just doing least squares minimization, as in least squares regression. In TensorFlow we can define our loss function as:

loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))
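
For reference, `tf.nn.l2_loss` computes half the sum of the squared elements of its argument. Here is a NumPy sketch of the same calculation on some made-up predictions and labels.

```python
import numpy as np

def l2_loss(t):
    """NumPy equivalent of tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(np.square(t)) / 2.0

predicted = np.array([[0.5, -1.0], [2.0, 0.0]])   # made-up network outputs
actual = np.array([[0.0, -1.0], [1.0, 1.0]])      # made-up labels
loss = l2_loss(predicted - actual)
print(loss)  # 1.125
```

Note that this is a sum over the whole batch rather than an average, so the loss grows with the batch size.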

Now that we have the data and an error function, how do we go about training our network? First we seed the matrix weights with normally distributed random numbers. TensorFlow provides some help here:

layer1_weights = tf.Variable(tf.truncated_normal(
    [NHistData * num_stocks, num_hidden], stddev=0.05))

This defines our matrix and initializes it with random numbers drawn from a truncated normal distribution (values more than two standard deviations from the mean are re-drawn).
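
To show what the truncated normal initializer is doing, here is a NumPy sketch that re-draws any value falling more than two standard deviations from zero. The helper name and the 900×100 shape (100 hidden Neurons, as in the introduction's example) are assumptions for illustration.

```python
import numpy as np

def truncated_normal(shape, stddev, rng):
    """Normal samples, re-drawing any value more than 2 stddev from zero."""
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2.0 * stddev
    while outside.any():
        values[outside] = rng.normal(0.0, stddev, size=outside.sum())
        outside = np.abs(values) > 2.0 * stddev
    return values

weights = truncated_normal((900, 100), stddev=0.05, rng=np.random.default_rng(0))
print(weights.shape, np.abs(weights).max() <= 0.1)
```

Keeping the starting weights small and bounded like this helps stop the tanh activations from saturating right at the start of training.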

There are a number of optimization algorithms that can be used to solve this problem. The one we are going to use is called Gradient Descent, which relies on Back Propagation to compute the gradients it needs. The key property of Back Propagation is that it can be applied to Neural Networks with multiple hidden layers. The basic idea is that you take the partial derivative of the loss function with respect to each weight. This gives you a gradient with respect to each weight, and you then move each weight a little bit in the direction that reduces the loss: decrease it if its gradient is positive, increase it if its gradient is negative. The size of this little bit is controlled by the learning rate, which is a parameter to the algorithm (or can be changed dynamically by another algorithm). You then run the training data through this algorithm and hopefully observe your error function decreasing as you go.
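
Here is a minimal sketch of gradient descent on the simplest possible model, a single linear layer with no activation function, where the gradient can be written down by hand. The data, the target weights and the learning rate of 0.1 are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))            # made-up inputs
true_w = np.array([1.5, -2.0, 0.5])      # weights we hope to recover
y = x @ true_w                           # noise-free targets

w = np.zeros(3)
learning_rate = 0.1
losses = []
for step in range(200):
    error = x @ w - y
    losses.append(0.5 * np.mean(error ** 2))
    gradient = x.T @ error / len(y)      # dLoss/dw, averaged over the data
    w -= learning_rate * gradient        # step against the gradient

print(losses[0], losses[-1], w)
```

With more layers the update rule stays the same; Back Propagation is what supplies the gradient for the weights in the earlier layers.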

This is then the basis of training your network. Once you have the weights you can run any input through the network to calculate its output.

Example of Gradient Descent where it could lead to 2 different minimums

Testing

A big danger here is that you are overfitting. You reduce the error function to next to nothing and the network works well for all your training data. You then try it on something else and it produces very bad results. This is similar to fitting a 10th degree polynomial through 11 data points. It fits all those points exactly, but has no predictive value outside of those exact points.
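
The polynomial analogy is easy to try out. The sketch below fits both a straight line and a 10th degree polynomial to 11 made-up noisy points that really come from the line y = 2x, then compares the two fits at a point outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 11)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)   # noisy straight line (made up)

line = np.polynomial.Polynomial.fit(x, y, deg=1)
wiggly = np.polynomial.Polynomial.fit(x, y, deg=10)

# Largest error on the points the fits were trained on.
train_error_line = np.max(np.abs(line(x) - y))
train_error_wiggly = np.max(np.abs(wiggly(x) - y))
print(train_error_line, train_error_wiggly)

# Compare both fits at a point outside the training range.
print(line(1.3), wiggly(1.3), "true value:", 2.0 * 1.3)
```

The 10th degree fit wins on the training points because it has memorized the noise, which is exactly what makes it unreliable anywhere else.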

A common technique is to divide the data into three buckets: training data, validation data and test data. You use the training data to train and, as you train, you use the validation data to see how you are doing. Then when everything is finished you use the test data to do a final check (where the training process has never seen this data). This then gives you an idea of how well the network will fare out in the real world. We will make the size of these three buckets configurable.
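
One simple way to carve up the data is sketched below, assuming the rows are kept in their original order and the last two buckets are taken from the end. The helper name `split_rows` is made up; the `valid` and `test` names follow the `numValidData` and `numTestData` settings used in the program.

```python
import numpy as np

def split_rows(data, num_valid, num_test):
    """Split rows in order into training, validation and test buckets."""
    num_train = len(data) - num_valid - num_test
    train = data[:num_train]                         # used to fit the weights
    valid = data[num_train:num_train + num_valid]    # checked while training
    test = data[num_train + num_valid:]              # held back for the final check
    return train, valid, test

rows = np.arange(20).reshape(10, 2)      # ten made-up rows of data
train, valid, test = split_rows(rows, num_valid=2, num_test=3)
print(len(train), len(valid), len(test))  # 5 2 3
```

For time ordered data like stock prices, splitting in order keeps the held back rows later in time than anything the network trained on.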

Local Versus Global Minimums

During the training process a few different things could happen. The solution could diverge, with the error just getting larger and larger. The solution could get stuck in a valley and just orbit a minimum value without converging to it. The solution could converge, but to a local minimum rather than the global minimum. These are all things that need to be watched out for.

Since the initial values are random, re-running the training can lead to quite different solutions. For some problems you want to train repeatedly to get the best solution. Or perhaps compare different optimization algorithms to see which gives the best result. Another idea is to use a combination of algorithms, perhaps start with one that gets into the correct neighborhood, and then another that can zero in on it.
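
A one dimensional example shows how the starting point decides where you end up. The curve below is made up and has two valleys; plain gradient descent started on the left finds the deeper one, while the same algorithm started on the right settles in the shallower one.

```python
def f(x):
    return x**4 - 3 * x**2 + x       # made-up curve with two valleys

def grad(x):
    return 4 * x**3 - 6 * x + 1      # derivative of f

def descend(x, learning_rate=0.01, steps=500):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

left, right = descend(-2.0), descend(2.0)
print(left, f(left))
print(right, f(right))
```

Running from several starting points and keeping the result with the lowest loss is the simplest form of the repeated training described above.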

There are quite a few tricks to get out of local minimums and to escape valleys, many of which use random numbers. One is to change the learning rate to occasionally take a bigger jump. Another is to try some random perturbations to see if you can start converging to another solution.

Batch Versus Single

A lot of the time we process the training data in batches, where we take the average of the partial derivatives over the batch to adjust the weights. This can greatly speed up training and reduces the risk of one bad data point sending us in the wrong direction. Again, the batch size is a meta-parameter to the training algorithm that we can tune to get the best results.
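
The sketch below checks this averaging on a made-up batch, using a single linear layer so the gradient can be written by hand. It computes the gradient for each example separately, averages them, and compares that with the gradient computed for the whole batch in one matrix operation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))      # one made-up batch of 8 examples
y = rng.normal(size=8)
w = np.zeros(3)

# Gradient of 0.5 * (x_i . w - y_i)^2 for each example separately.
per_example = np.array([(xi @ w - yi) * xi for xi, yi in zip(x, y)])

# Gradient of the batch's mean loss, computed in one matrix operation.
batch_gradient = x.T @ (x @ w - y) / len(y)

matches = np.allclose(per_example.mean(axis=0), batch_gradient)
print(matches)  # True
```

The single matrix operation is the form that GPUs are good at, which is a large part of why batches speed up training.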

Summary

This was a really quick introduction to training a Neural Network. There are many optimization algorithms that can be applied to solve this problem, but we are starting with gradient descent. Many of these algorithms are chosen because they make it easy to use a GPU or a distributed network to parallelize, and hence speed up, the training process.

Next time we’ll start looking at the TensorFlow code for a simple Neural Network model, then we will start enhancing it to get better results.

Written by smist08

September 21, 2016 at 11:32 pm