Stephen Smith's Blog

Musings on Machine Learning…


Playing the Kaggle Two Sigma Challenge – Part 2



Last time I introduced the Kaggle Two Sigma Challenge and this time I’ll start describing what I did at the beginning of the competition. The competition started at the beginning of December, 2016 and completed on March 1, 2017.  This blog covers what I did in December.

Update 2017/03/07: I uploaded the Python source code for the code discussed here to my Google Drive. You can access them here. The files are for the first (wide) TensorFlow attempt, for the second narrow TensorFlow attempt, for my regression one with reinforcement learning and then for the Christmas surprise with reinforcement learning added.



Since I had spent quite a bit of time playing and blogging about predicting the stock market with TensorFlow, this is where I started. The data was all numeric, so it was quite easy to get started: no one-hot encoding was needed, and really the only pre-processing was to fill in missing values with the pandas fillna function (where I just used the mean, since this was easiest). I'll talk more about these missing values later, but to get started they were easy to fill in and ignore.
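For illustration, mean imputation with fillna looks something like this (the tiny DataFrame is made up for the example; the competition data of course had many more columns):

```python
import numpy as np
import pandas as pd

# A toy frame with some missing values.
df = pd.DataFrame({"f1": [1.0, np.nan, 3.0],
                   "f2": [np.nan, 2.0, 4.0]})

# Fill each column's NaNs with that column's mean.
df_filled = df.fillna(df.mean())
```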

I started by just feeding all the data into TensorFlow, trying some simple 2, 3 and 4 layer neural networks. However, my results were quite bad: either the model couldn't converge, or even when it did, the results were much worse than just submitting zeros for everything.

With all the data the model was quite large, so I thought I should simplify it a bit. The Kaggle competition has a public forum which includes people publishing public Python notebooks, and early in every competition there are some very generous people who publish detailed statistical analyses and visualizations of all the data. Using these I could select a small subset of data columns that had higher correlations with the results and just use those instead. This let me run the training longer, but still didn't produce any useful results.
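Column selection by correlation can be sketched like this (the synthetic data and the cutoff of two columns are just for illustration; in the competition the rankings came from the public notebooks):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(4)
df = pd.DataFrame(rng.randn(500, 6), columns=[f"f{i}" for i in range(6)])
df["y"] = df["f0"] * 0.8 + rng.randn(500) * 0.5  # only f0 carries signal

# Rank the feature columns by absolute correlation with the target
# and keep just the most correlated ones.
corrs = df.drop(columns="y").corrwith(df["y"]).abs().sort_values(ascending=False)
top_cols = corrs.index[:2].tolist()
```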

At this point I decided that given the computing resource limitations of the Kaggle playgrounds, I wouldn’t be able to do a serious neural network, or perhaps doing so just wouldn’t work. I did think of doing the training on my laptop, say running overnight and then copy/pasting the weight/bias arrays into my Python code in the playground to just run. But I never pursued this.

Penalized Linear Regression

My next thought was to use linear regression, since it tends to be good for extrapolation problems: it doesn't suffer from non-linearities going wild outside of the training data. However, regular least squares regression can suffer from overfitting, especially when there are a large number of variables and they aren't particularly linearly independent. Least squares regression can also be thrown off by errant data, and the general consensus from the forums was that this training set had a lot of outliers for some reason.

In machine learning there is a large family of penalized linear regression algorithms that try to address these problems by one means or another. Generally they do things like start with the most correlated column, then add the next most correlated column, and only keep doing this as long as it has a positive effect on the results. They also penalize large weights, borrowing the technique we described here. Then there are various methods to filter out outliers, or to reduce their effect by using different metrics than the sum of squares. Two popular methods are Lasso regression, which penalizes the sum of the absolute values of the coefficients (the L1 or taxi-cab norm), and Ridge regression, which penalizes the sum of their squares (the L2 norm). Both shrink the coefficients and bring variables in gradually. Then there is a combined algorithm called Elastic Net regression that uses a mix of the two penalties, where you choose the ratio.
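To make the penalties concrete, here is a small scikit-learn comparison on made-up data (the alpha values are arbitrary): the L1 penalty in Lasso zeroes out the irrelevant coefficients entirely, while the L2 penalty in Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = X[:, 0] * 2.0 + rng.randn(200) * 0.1  # only the first column matters

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty on the coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty on the coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # a mix of both

# Lasso drives the nine noise coefficients to exactly zero;
# Ridge leaves them small but nonzero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```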

First Victory

Playing around with this a bit, I found the scikit-learn algorithm ElasticNetCV worked quite well for me. ElasticNetCV splits up the training data and then iteratively cross-validates to find the regularization strength that gives the best result. Choosing an l1_ratio of 0.45 actually put me in the top ten of the submissions. This was a very simple submission, but I was pretty happy to get such a good result.
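The call itself is quite short; a minimal sketch on synthetic data (the toy data and cv value are mine, only the l1_ratio of 0.45 comes from my actual submission):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(1)
X = rng.randn(300, 8)
y = X[:, 0] - 0.5 * X[:, 1] + rng.randn(300) * 0.1

# cv=5 splits the training data five ways; the CV loop then picks the
# regularization strength (alpha) that gives the best held-out score.
model = ElasticNetCV(l1_ratio=0.45, cv=5).fit(X, y)
pred = model.predict(X[:5])
```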

Reinforcement Learning

One thing that seemed a bit strange to me about the way the Kaggle Gym worked was that you submitted your results for a given timestep and then got a reward for that, but you didn't get the correct results for the previous timestep. Normally for stock market prediction you predict the next day, then get the correct results at the end of the day and predict the following day. Here you only get a reward, which is the R2 score for your submission. The idea is to have an algorithm like the following diagram, but incorporating the R2 score is quite tricky.


I spent a bit of time thinking about this and had the idea that you could sort of calculate the variance from the R2 score and then if you made an assumption about the underlying probability distribution you could then make an estimate of the mean. Then I could introduce a bias to the mean to compensate for cumulative errors as the time gets farther and farther from the training data.

Now there are quite a few problems with this; namely, the variance doesn't give you the sign of the error, which is worrying. I tried a few different relationships of mean to variance and found one that improved my score quite a bit. But again, this was all rather ad hoc.
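The first step of the idea can be sketched directly: R2 is defined as 1 − SS_res/SS_tot, so given an estimate of the targets' spread you can back out the size (though, as noted, not the sign) of the errors. A minimal sketch; the function name and the assumption of a known target standard deviation are mine:

```python
import numpy as np

def residual_std_from_r2(r2, target_std):
    """Estimate the RMS prediction error implied by an R2 reward.

    R2 = 1 - SS_res / SS_tot, so SS_res = (1 - R2) * SS_tot;
    dividing by n and taking the square root gives the residual RMS.
    This only recovers the magnitude of the error, not its sign.
    """
    return target_std * np.sqrt(max(0.0, 1.0 - r2))

# A reward of 1.0 implies perfect predictions; a reward of 0.0 implies
# errors about as large as the spread of the targets themselves.
perfect = residual_std_from_r2(1.0, 2.0)
poor = residual_std_from_r2(0.0, 2.0)
```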

Anyway, every ten timesteps I didn't apply the bias, so I could compute a fresh bias estimate, and then used that bias on the other timesteps.

Second Victory

The competition moves fairly quickly, so a week or two after my first good score I was well down in the standings. Adding my mean bias from the reward to my ElasticNetCV regression put me back into the top 10 again.

A Christmas Present

I went to bed on Christmas eve in 6th place on the competition leaderboard. I was pretty happy about that. When I checked in on Christmas Day I was down to 80th place on the leaderboard. As a Christmas present to all the competitors one of the then current top people above me made his solution public, which then meant lots of other folks forked his solution, submitted it and got his score.

This solution used the Random Forest algorithm ExtraTreesRegressor from scikit-learn, combined with a simple mean based estimate and a simple regression on one variable. The random forest part was interesting because it let the algorithm know which values were missing, so it could learn to act appropriately.
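The missing-value trick can be sketched as follows (the -999 sentinel, toy data and tree count here are illustrative, not the actual public kernel's values):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(2)
X = rng.randn(200, 5)
X[rng.rand(200, 5) < 0.1] = np.nan  # knock out ~10% of the values
y = rng.randn(200)

# Replace NaNs with an out-of-range sentinel so the trees can split on
# "was this value missing" and learn to act appropriately.
X_filled = np.where(np.isnan(X), -999.0, X)
model = ExtraTreesRegressor(n_estimators=30, random_state=0).fit(X_filled, y)
```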

At first I was really upset about this, but when I had time I realized I could take that public solution, add my mean bias and improve upon it. I did this and got back into the top ten. So it wasn’t that bad.


Well this covered the first month of the competition, two more to go. I think getting into the top ten on the leaderboard a few times gave me the motivation to keep plugging away at the competition and finding some more innovative solutions. Next up January.


Written by smist08

March 3, 2017 at 11:51 pm

TensorFlow Goes 1.0



I’ve been using Google’s TensorFlow machine learning platform for some time now starting with version 0.8, going onto 0.9 and now playing with 1.0 which was released last week. There are some really good videos from the release summit posted on YouTube here. This blog article looks at the evolution of TensorFlow and what 1.0 brings to the table.

Installing the new TensorFlow 1.0 on MacOS was fairly painless. I chose to install it natively rather than using a VM type solution, since I don't run multiple versions of Python and just stick to the latest. They recommend using Docker or other VM technology to avoid having to install at all, but I didn't have any problems.


More Than Neural Networks

TensorFlow has always been built on a low level compute engine that executes graphs of operations on matrices and vectors (tensors). However the main tutorials and higher level functions were always oriented to performing Neural Network calculations. It contains very good algorithms for training Neural Networks and had all the supporting functions you needed to create very powerful Neural Network models. It contained a Linear Regression function, but this was mainly used as a simple tutorial rather than anything real.

With 1.0 TensorFlow is adding a large number of other popular machine learning algorithms out of the box so you can use Random Forests, Support Vector Machines, and many other standard libraries that you find in more complete libraries like scikit-learn. The list of standard algorithms isn’t as full as scikit-learn yet, and a very notable omission is the ensemble method of gradient boosting (which is promised sometime soon).

I've been entering some Kaggle competitions where penalized regression, random forests and gradient boosting are often the algorithms that produce the best results, though Keras running on top of TensorFlow has been doing quite well too. Often the winning solution is a combination of several of these, since an average of independent techniques will give better results.

The good thing about this is that TensorFlow provides very good GPU and other hardware accelerator support, so now all these algorithms can benefit from it. In addition Google is now offering (in beta) a machine learning cloud service which runs TensorFlow on optimized accelerated hardware. If this service had only supported TensorFlow's neural networks, its usage would have been limited, since most full applications use a combination of algorithms in the final deployment.

API Stability

As TensorFlow went through the 0.x versions, there were quite a few API changes that caused you to be frequently updating your programs. With version 1.0 the claim is that for the part of TensorFlow that is in the core library, API compatibility will now be maintained.

A lot of the changes for 1.0 were to make the naming conventions more standard, including following the lead of Python's NumPy library (so the same function doesn't have a different name in NumPy vs TensorFlow). All this should make coding a bit more straightforward and reduce the need to constantly look things up.

However beware that a lot of the new advertised features in TensorFlow 1.0 are not in the core library yet, and so their API may change until they are moved there.

The good thing is that Google provides a Python script to convert previous TensorFlow Python programs up to the new API level. This worked fine for my programs, making the process rather painless.

Higher Level APIs

A criticism of TensorFlow was that although it was a great low level framework, it was difficult or tedious to do a number of standard operations, like, for instance, setting up a simple multi-level neural network. Due to this omission several developers created competing high level abstractions to run on various lower level libraries. Probably the most successful of these is Keras, which runs on top of both TensorFlow and Theano.

With 1.0 TensorFlow is adding a higher level API which works with all the various algorithms it contains as well as adding a Keras compatible library as a nod to the heavy adoption that Keras has enjoyed.

The non-neural network algorithms follow the API conventions of scikit-learn, which are very convenient. The whole thing is also designed so you can feed one component into another, letting you easily build a compound model consisting of several algorithms and then train and deploy the whole thing.
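As a point of comparison, the scikit-learn convention being followed looks like this, with components chained into one trainable object (toy data; note this is scikit-learn itself, shown only to illustrate the style the new TensorFlow API mimics):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(3)
X, y = rng.randn(100, 4), rng.randn(100)

# Chain a preprocessor and an estimator, then fit and predict on the
# compound model as a single unit.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
pred = model.predict(X[:3])
```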

Generally this is a good thing for people looking to just use TensorFlow since the amount of code you need to write becomes much smaller and it embodies all the TensorFlow best practices so it works properly with TensorBoard, deploys flexibly, etc.


Documentation

The TensorFlow documentation has been greatly improved. The tutorials are way better and it's much easier to get a basic understanding of TensorFlow from the introductory material. There are also many more videos available, as well as training courses.

Although this is all a huge step forward, one annoying side effect is that all the external links, say from Stack Overflow articles (or even Google searches) are now broken.

Lots More

Some of the other notable additions include a new experimental TensorFlow compiler XLA, APIs for Go and Java, addition of a command line debugger, improvements to TensorBoard for better visualizations and lots of additional hardware support.

Windows support was added in version 0.10 which is new since my original blogs. There is support to use Qualcomm DSP chips for co-processing which should greatly enhance the capabilities of Android phones containing this chip.



TensorFlow has come a long way over the last year from a rather specialized Neural Network tool, evolving into a complete machine learning platform. The open source community around TensorFlow is extremely vibrant and extends quite far beyond just Google employees. Looking at what is scheduled for the next couple of point releases looks very exciting and I’m finding this tool becoming more powerful in leaps and bounds.

Written by smist08

February 19, 2017 at 9:32 pm

The Road to TensorFlow – Part 11: Generalization and Overfitting



With sophisticated Neural Networks you are dealing with a quite complicated nonlinear function. When fitting a high degree polynomial to a few data points, the polynomial can go through all the points but have such steep slopes that it is useless for predicting points between the training points. We get the same sort of behaviour in Neural Networks: in a way you are training the Neural Network to memorize all the training data exactly, rather than figuring out the trends and patterns that you could use to predict other values.
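The polynomial analogy is easy to demonstrate with NumPy (the data here is made up): a degree-7 polynomial through 8 noisy points fits them essentially exactly, yet misses badly between them.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 8)
y = x + rng.randn(8) * 0.05  # a noisy straight line

# Degree 7 through 8 points: the fit passes through every training point.
coeffs = np.polyfit(x, y, 7)
fit_err = np.abs(np.polyval(coeffs, x) - y).max()

# But between the training points it strays from the true line y = x.
mid = (x[:-1] + x[1:]) / 2
gap_err = np.abs(np.polyval(coeffs, mid) - mid).max()
```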

We’ve touched upon this problem in other articles like here and here, but glossed over what we are doing about this problem. In this article we’ll explore what we can do about this in more detail.

One solution is to perhaps gather more training data, however this may be impossible or quite expensive. It also might be that the training data is missing some representative samples. Here we’ll concentrate on what we can do with the algorithm rather than trying to improve the data.

Interpolation and Extrapolation

Here we refer to generalization as wanting to get answers to data that isn’t in the training data. We refer to overfitting as the case where the model works really well for the training data but doesn’t do nearly as well for anything else.

There are two distinct cases we want to worry about. One is interpolation: trying to estimate values where the inputs are surrounded by data in the training set. Extrapolation is the process of trying to predict what happens beyond the training data. Our stock market data is an example of extrapolation. Recognizing handwriting is an example of interpolation (assuming you have a good sample of training data).

Extrapolation tends to be a much harder problem than interpolation, but both are strongly affected by overfitting.

Early Stopping

What we often do is divide our data into three groups. The largest of these we call the training data and use for training. Another is the test data, which we run after training to see how well the algorithm works on data that hasn't been seen during training. To help with detecting overfitting we create a third group, the validation data, which we run every certain number of steps during training. The following screenshot shows the results for the training and validation sets (this is for a Kaggle competition, so the test set needs to be submitted to Kaggle to get the answer). Here smaller values are better. Notice that the training loss gets better, starting at 3209.5 and going down to 712.8, which indicates training is working. However the validation loss starts at 3014.3, goes down to the 1160s, and then starts increasing. This indicates we are overfitting the data.


The approach here is really simple: once the validation loss starts increasing, we stop and say we're done. This is actually a pretty simple and effective way to prevent overfitting. As an added bonus, it is a rare technique that leads to faster training.
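The loop structure is simple; a sketch in plain Python (the callback interface and patience parameter are my own framing, not from the original program):

```python
def train_with_early_stopping(train_step, validate, max_steps=1000, patience=3):
    """Run train_step() until validate() has gotten worse `patience`
    checks in a row, then stop and report the best validation loss."""
    best = float("inf")
    bad = 0
    for step in range(max_steps):
        train_step()
        val_loss = validate()
        if val_loss < best:
            best, bad = val_loss, 0  # still improving
        else:
            bad += 1
            if bad >= patience:
                break  # validation loss is rising: we are overfitting
    return best
```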

Penalizing Large Weights

A sign of overfitting is that the slope of our function is high at the points in the training data. Since the slope is approximated by the appropriate weights in our matrix, we would want to keep the weights in our weight matrices low. The way we accomplish this is to add a penalty to the loss function based on the size of the weights.

    loss = (tf.nn.l2_loss(tf.sub(logits, tf_train_labels))
            + tf.nn.l2_loss(layer1_weights) * beta
            + tf.nn.l2_loss(layer2_weights) * beta
            + tf.nn.l2_loss(layer3_weights) * beta
            + tf.nn.l2_loss(layer4_weights) * beta)


Here we add the sum of the squares of the weights to our loss function. The factor beta is there to let us scale this value to be in the same magnitude as the main loss function. I’ve found that in some problems making the loss due to the weights about equal to the main loss works quite well. In another problem I found choosing beta so that the weights are 10% of the main loss worked quite well.
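The same computation written out in NumPy terms may make the scaling clearer (a sketch: tf.nn.l2_loss computes half the sum of squares, which the helper below mirrors):

```python
import numpy as np

def l2_loss(x):
    # Mirrors tf.nn.l2_loss: half the sum of the squared entries.
    return np.sum(x ** 2) / 2.0

def total_loss(logits, labels, weight_mats, beta):
    """Data loss plus a beta-scaled L2 penalty on every weight matrix."""
    loss = l2_loss(logits - labels)
    for w in weight_mats:
        loss += l2_loss(w) * beta
    return loss
```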

I have found that combining this with early stopping works quite well. The weight penalty lets us train longer before we start overfitting, which leads to a better overall result.


Dropout

One property of the Neural Networks in our brain is that brain cells die, but our brain seems to mostly keep on working. In this sense the brain is far more resilient to damage than a computer. The idea behind dropout is to train the Neural Network to be resilient to neurons being removed. This means the Neural Network can't be completely reliant on any given neuron, since it could die (be removed from the model).


The way we accomplish this is we add a dropout activation function at some point:

            if dropout:
                hidden = tf.nn.dropout(hidden, 0.5)


This activation function randomly removes 50% of the neurons at this layer and scales the surviving outputs up by a matching factor (here 1/0.5 = 2). This keeps the expected sum the same, which means you can use the same weights whether dropout is present or not.

The reason for the if statement is that you only want to do dropout during training and not during validation, testing or production.

You would do this on each hidden layer. It’s rather surprising that the Neural Network still works as well as it does with this much dropout.
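Dropout itself is only a couple of lines if you write it out in NumPy (a sketch of the idea, not TensorFlow's implementation):

```python
import numpy as np

def dropout(hidden, keep_prob=0.5, rng=np.random):
    """Zero each activation with probability 1 - keep_prob and scale the
    survivors by 1 / keep_prob so the expected sum is unchanged."""
    mask = rng.rand(*hidden.shape) < keep_prob
    return np.where(mask, hidden / keep_prob, 0.0)
```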

I find dropout doesn't always help, but when it does you can combine it with penalizing the weights, and then you can train longer before you need to stop due to overfitting. This can sometimes help a network find finer details without overfitting.

When you do dropout, you do have to train for a longer time, so if this is too time prohibitive you might not want to use it.

I think it’s a good sign that Neural Networks can exhibit the same resilience to damage that the brain shows. Perhaps a bit of biological evidence that we are on the correct track.


These are a few techniques you can use to avoid overfitting your model. I generally use all three so I can train a bit longer without overfitting. If you can get more good training data that can also help quite a bit. Using a simpler model (with fewer hidden nodes) can also help with overfitting, but perhaps not provide as good a functional approximation as the more complicated model. As with all things in computer science you are always trading off complexity, overfitting and performance.

Written by smist08

October 16, 2016 at 6:49 pm

The Road to TensorFlow – Part 10: More on Optimization



We've been playing with TensorFlow for a while now and we have a working model for predicting the stock market. I'm not too sure if we're beating the stock picking cat yet, but at least we have a good model where we can experiment and learn about Neural Networks. In this article we're going to look at the optimization methods available in TensorFlow. There are quite a few of these built into the standard toolkit, and since TensorFlow is open source you could create your own optimizer. This article follows on from our previous article on optimization and training.

Weaknesses in Gradient Descent

Gradient Descent has worked pretty well for us so far. Basically it calculates the gradients of the loss function (the partial derivatives of the loss by each weight) and moves the weights in the direction that lowers the loss function. However, finding the minima of a complicated nonlinear function is a non-trivial exercise, and this is compounded by the fact that a lot of the data we are feeding in during training is very noisy. In our case the stock market historical data is probably quite contradictory and presents a good challenge to the training algorithm. Here are some weaknesses that these other algorithms attempt to address:

  • Learning rate. We have one fixed learning rate (how far we move in the direction of the sign of the gradient). We added an optimization to reduce this learning rate as we proceed, but we use the same learning rate for everything at each step. But some parts of our weight matrix may be changing quickly and other parts remaining close to constant. So perhaps use a different learning rate for each weight/bias and vary it by how fast it’s moving and whether it’s moving consistently in the same direction.
  • Getting stuck in local minimums or wandering around plateaus. Are we getting stuck in a local minimum which is much worse than the global minimum we would like to find? How can we power past local minimums and plateaus and continue on to the global minimum?
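To make the per-weight learning rate idea from the first bullet concrete, here is the core of the Adagrad-style update written out in NumPy (a sketch; the parameter values are arbitrary):

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """Weights whose gradients have been large and frequent get smaller
    steps; rarely-updated weights keep a comparatively large step."""
    accum = accum + grad ** 2          # running sum of squared gradients
    return w - lr * grad / (np.sqrt(accum) + eps), accum
```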

TensorFlow Optimizers

The optimizers included with TensorFlow are all variations on Gradient Descent. There are many other optimizers that people use like simulated annealing, conjugate gradient and ant colony optimization but these tend to either not work well with multi-layer Neural Networks or don’t parallelize well to run on GPUs or a distributed network or are far too computationally intensive for large matrices. We added to the code all the optimizers and you just uncomment the one that you want to use.

    # optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    # optimizer = tf.train.AdadeltaOptimizer(starter_learning_rate).minimize(loss)
    # optimizer = tf.train.AdagradOptimizer(starter_learning_rate).minimize(loss)      # promising
    # optimizer = tf.train.AdamOptimizer(starter_learning_rate).minimize(loss)         # promising
    # optimizer = tf.train.MomentumOptimizer(starter_learning_rate, 0.001).minimize(loss)  # diverges
    # optimizer = tf.train.FtrlOptimizer(starter_learning_rate).minimize(loss)         # promising
    optimizer = tf.train.RMSPropOptimizer(starter_learning_rate).minimize(loss)        # promising


Perhaps it would be less hacky to make this a parameter to the program, but we’ll leave that till we need it.

Let’s quickly summarize what each optimizer tries to accomplish:

  • MomentumOptimizer: If gradient descent is navigating down a valley with steep sides, it tends to madly oscillate from one valley wall to the other without making much progress down the valley. This is because the largest gradients point up and down the valley walls whereas the gradient along the floor of the valley is quite small. Momentum Optimization attempts to remedy this by keeping track of the prior gradients and if they keep changing direction then damp them, and if the gradients stay in the same direction then reward them. This way the valley wall gradients get reduced and the valley floor gradient enhanced. Unfortunately this particular optimizer diverges for the stock market data.
  • AdagradOptimizer: Adagrad is optimized to finding needles in haystacks and for dealing with large sparse matrices. It keeps track of the previous changes and will amplify the changes for weights that change infrequently and suppress the changes for weights that change frequently. This algorithm seemed promising for the stock market data.
  • AdadeltaOptimizer: Adadelta is an extension of Adagrad that only remembers a fixed size window of previous changes. This tends to make the algorithm less aggressive than pure Adagrad. Adadelta seemed to not work as well as Adagrad for the stock market data.
  • AdamOptimizer: Adaptive Moment Estimation (Adam) keeps separate learning rates for each weight as well as an exponentially decaying average of previous gradients. This combines elements of Momentum and Adagrad together and is fairly memory efficient since it doesn’t keep a history of anything (just the rolling averages). It is reputed to work well for both sparse matrices and noisy data. Adam seems promising for the stock market data.
  • FtrlOptimizer: Ftrl-Proximal was developed for ad-click prediction where they had billions of dimensions and hence huge matrices of weights that were very sparse. The main feature here is to keep near zero weights at zero, so calculations can be skipped and optimized. This algorithm was promising on our stock market data.
  • RMSPropOptimizer: RMSprop is similar to Adam; it just uses different moving averages but has the same goals.
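The momentum idea from the first bullet is easy to sketch in a few lines of plain Python (illustrative parameter values): gradients that keep pointing the same way build up speed, while oscillating ones largely cancel.

```python
def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One momentum update on a single weight."""
    velocity = momentum * velocity - lr * grad  # remember the prior direction
    return w + velocity, velocity

# Two steps with the same gradient: the second step is larger.
w, v = 0.0, 0.0
w, v = momentum_step(w, 1.0, v)   # moves by -0.01
w, v = momentum_step(w, 1.0, v)   # moves by a further -0.019
```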

Neural networks can be quite different and the best algorithm for the job may depend a lot on the data you are trying to train the network with. Each of these optimizers has several tunable parameters. Besides initial learning rate, I’ve left all the others at the default. We could write a meta-trainer that tries to find an optimal solution for which optimizer to use and with which values of its tunable parameters. You would want a quite powerful distributed set of computers to run this on.



Optimization is a tricky subject with Neural Networks; a lot depends on the quality and quantity of your data. It also depends on the size of your model and the contents of the weight matrices. A lot of these optimizers are tuned for rather specific problems like image recognition or ad click-through prediction; however, if you have a unique problem then largely you are left to trial and error (whether automated or manual) to determine the best solution.

Note that a lot of practitioners stick with basic gradient descent since they know it quite well, rather than relying on the newer algorithms. Often massaging your data or altering the random starting point can be a better area to focus on.

Written by smist08

October 4, 2016 at 8:58 pm

The Road to TensorFlow – Part 9: TensorBoard



We’ve spent some time developing a Neural Network model for predicting the stock market. TensorFlow has produced a fairly black box implementation that is trained by historical data and then can output predictions for tomorrow’s prices.

But what confidence do we have that this model is really doing what we want? Last time we discussed some of the meta-parameters that configure the model. How do we know these are vaguely correct? How do we know if the weights we are training are converging? If we want to step through the model, how do we do that?

TensorFlow comes with a tool called TensorBoard which you can use to get some insight into what is happening. You can't easily just print variables, since they are all internal to the TensorFlow engine and only have values while a session is running. There is also the problem of how to visualize the variables. The weights matrix is very large and is constantly changing as you train it; you certainly don't want to print it out repeatedly, let alone try to read through it.

To use TensorBoard you instrument your program. You tell it what you want to track and assign useful names to those items. This data is then written to log files as your model runs. You then run the TensorBoard program to process these log files and view the results in your Web Browser.

Something Went Wrong

Due to household logistics I moved my TensorFlow work over to my MacBook Air from running in an Ubuntu VM image on our Windows 10 laptop. Installing Python 3, TensorFlow and the various other libraries I'm using was quite simple and straightforward: just install Python and then use pip3 to install any other libraries. That all worked fine. But when I started running the program from last time, I was getting NaN results quite often. I wondered if TensorFlow wasn't working right on my Mac. Anyway, I went to debug the program and that led me to TensorBoard. As it turns out, there was quite a bad bug in the program presented last time due to uninitialized variables.

You tend to get complacent programming in Python about uninitialized variables (and array subscript range errors) because usually Python will raise an exception if you try to use a variable that hasn't been initialized. The problem is NumPy, which is a library written in C for efficiency. When you create a NumPy array, it is returned to Python, telling Python it's good to go. But since it's managed by C code you don't get the usual Python error checking. So when I changed the program to add the volumes to the price changes, I had a bug that left some of the data arrays uninitialized. I suspect on the Windows 10 laptop these were initialized to zero, but that all depends on which exact C runtime is being used. On the Mac these values were just random memory, and that immediately led to program errors.

Adding the TensorBoard initialization showed the problem was originating with the data and then it was fairly straight forward to zero in on the problem and fix it.

As a result, for this article I'm just going to overwrite the Python file from last time with a newer one, which is posted here. This version includes TensorBoard instrumentation and a couple of other improvements that I'll talk about next time.


First we'll start with some of the things that TensorBoard shows you. If you read an overview of TensorFlow, it's a bit confusing as to what the Tensors are and what flows. If you've looked at the program so far, it shows quite a few algebraic matrix equations, but where are the Tensors? What TensorFlow does is break these equations down into a graph where each node is a function execution and the data flows along the edges. This is a fairly common way to evaluate algebraic expressions and not unique to TensorFlow. TensorFlow then supports executing these graphs on GPUs and in distributed environments, as well as providing all the node types you need to create Neural Networks. TensorBoard gives you a way to visualize these graphs. The names of the nodes come from the program instrumentation.


When the program was instrumented it grouped things together. Here is an expansion of the trainingmodel box where you can see the operations that make up our model.


This gives us some confidence that we have constructed our TensorFlow graph correctly, but doesn’t show any data.

We can track various statistics of all our TensorFlow variables over time. This graph is showing a track of the means of the various weight and bias matrixes.


TensorBoard also lets us look at the distribution of the matrix values over time.


TensorBoard also lets us look at histograms of the data and how those histograms evolve over time.


You can see how the layer 1 weights start as their seeded normal distribution of random numbers and then progress to their new values as training progresses. If you look at all these graphs you can see that the values are still progressing when training stops. This is because TensorBoard instrumentation really slows down processing, so I shortened the training steps while using TensorBoard. I could let it run much longer over night to ensure that I am providing sufficient training for all the values to settle down.

Program Instrumentation

Rather than include all the code here, check out the Google Drive for the Python source file. Briefly, we added a function to get all the statistics on a variable:

def variable_summaries(var, name):
  """Attach a lot of summaries to a Tensor."""
  with tf.name_scope('summaries'):
    mean = tf.reduce_mean(var)
    tf.scalar_summary('mean/' + name, mean)
    with tf.name_scope('stddev'):
      stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
    tf.scalar_summary('stddev/' + name, stddev)
    tf.scalar_summary('max/' + name, tf.reduce_max(var))
    tf.scalar_summary('min/' + name, tf.reduce_min(var))
    tf.histogram_summary(name, var)

We define names in the various sections and indicate the data we want to collect:

with tf.name_scope('Layer1'):
    with tf.name_scope('weights'):
        layer1_weights = tf.Variable(tf.truncated_normal(
            [NHistData * num_stocks * 2, num_hidden], stddev=0.1))
        variable_summaries(layer1_weights, 'Layer1' + '/weights')
    with tf.name_scope('biases'):
        layer1_biases = tf.Variable(tf.zeros([num_hidden]))
        variable_summaries(layer1_biases, 'Layer1' + '/biases')

Before the call to initialize_all_variables we need to call:

merged = tf.merge_all_summaries()
test_writer = tf.train.SummaryWriter('/tmp/tf/test',
    session.graph )

And then during training:

summary, _, l, predictions =
    [merged, optimizer, loss, train_prediction], feed_dict=feed_dict)
test_writer.add_summary(summary, i)


TensorBoard is quite a good tool to give you insight into what is going on in your model: whether the program is correctly doing what you think, and whether there is any sanity to the data. It also lets you tune the various parameters to ensure you are getting the best results.

Written by smist08

October 1, 2016 at 4:49 pm

The Road to TensorFlow – Part 8: Improving the Model



In the last part of this series we presented a complete Python program to demonstrate how to create a simple feed forward Neural Network to predict the price changes in the thirty stocks that comprise the Dow Jones Index. The program worked, but its predictions tended to be very close to zero. Perhaps this isn’t too hard to understand: if the data is rather random, then predicting zero change might actually minimize the loss. Or perhaps there is something wrong with our model. In this article we’ll start to look at how to improve a Neural Network model.

Remember that humans can’t predict the stock market, so this might not be possible. However, Neural Networks are very good at finding and recognizing patterns, so if the technical analysts are correct and such patterns really do exist, then Neural Networks should be able to find them.

The new source code for this is located here.

Meta Parameters

You might notice that there are quite a few rather arbitrary looking constants in the sample program. How do we know the values chosen are optimal? How can we go about tuning these? Some of these parameters include:

  1. The learning rate passed to the GradientDescentOptimizer function (currently 0.01).
  2. The number of steps we train the model with the data.
  3. The batch size.
  4. The number of hidden nodes.
  5. The number of hidden layers.
  6. The mean and standard deviation of the random numbers we seed the weights with.
  7. We use the tanh activation function; would another activation function be better?
  8. Are we providing too much data to the model, or too little data?
  9. We are providing the last 30 price changes. Should we try different values?
  10. We are using the gradient descent optimizer; would another optimizer be better?
  11. We are using least squares for our loss function; would another loss function be better?

All these parameters affect the results. How do we know we’ve chosen good values? Some of these are constants, some require code changes but all are fairly configurable. Since TensorFlow is a toolkit, there are lots of options to choose from for all of these.

For this article we are going to look at a number of best practices for these parameters and look to improve these by following the best practices.

Another approach is to treat this as an optimization problem and run it through an optimizer like gradient descent to find the best values for these parameters. Another approach is to use evolution algorithms to evolve a better solution. Perhaps we’ll have a look at these in a future article.
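As a taste of that optimization approach, here is a minimal random-search sketch. The train_and_score function is a hypothetical stand-in: in practice it would train the network with the given hyperparameters and return a validation loss, but here it fakes a smooth loss surface so the example is self-contained.

```python
import random

def train_and_score(params):
    # Hypothetical stand-in for training the network and returning a
    # validation loss. The fake surface has its minimum near
    # lr=0.02, hidden=64.
    return (params['lr'] - 0.02) ** 2 + ((params['hidden'] - 64) / 64.0) ** 2

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float('inf')
    for _ in range(n_trials):
        params = {
            'lr': 10 ** rng.uniform(-3, -1),          # sample the learning rate on a log scale
            'hidden': rng.choice([16, 32, 64, 128]),  # sample the hidden layer width
        }
        loss = train_and_score(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best, loss = random_search(50)
print(best)
```

Random search is a common baseline for this kind of tuning because, unlike a full grid, it spends its budget exploring many values of each parameter.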

Learning Rate

The learning rate affects how far we move as we zero in on the minimum. If you use a bigger value it moves further. If you use too big a value it will overshoot the minimum and perhaps miss it. If you use too small a value it can get stuck in valleys or take an extremely long time to converge. I found that for this problem if I set the learning rate to 0.1 then the program doesn’t converge and the weights actually diverge to infinity. So we can’t make this much larger.

What we can do is make the learning rate variable so it starts out larger and then gets smaller as we zero in on a solution. So let’s look at how to do that in our Python code.

 # Setup Learning Rate Decay
 global_step = tf.Variable(0, trainable=False)
 starter_learning_rate = 0.01
 learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
     5000, 0.96, staircase=True)
 # Optimizer.
 optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
     loss, global_step=global_step)


Basically TensorFlow gives us a way to do this with the exponential_decay function and assistance from the optimizer.
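The schedule that exponential_decay computes is easy to reproduce in plain Python, which helps when picking the decay parameters. With staircase=True the exponent is truncated to an integer, so the rate drops in discrete jumps every decay_steps steps rather than continuously.

```python
import math

def exponential_decay(starter_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    # Mirrors the documented formula:
    # decayed = starter_rate * decay_rate ^ (global_step / decay_steps)
    exponent = global_step / decay_steps
    if staircase:
        exponent = math.floor(exponent)  # drop the rate in discrete jumps
    return starter_rate * decay_rate ** exponent

print(exponential_decay(0.01, 0, 5000, 0.96, staircase=True))      # 0.01
print(exponential_decay(0.01, 10000, 5000, 0.96, staircase=True))  # 0.01 * 0.96**2
```

With the values used above, the learning rate only drops by a factor of 0.96 every 5000 training steps, which is why we also need to run many more steps than before.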

The other thing we’ve done is increase the batch size to 10. Increasing the batch size past 10 leads to the training process diverging, so we are limited to this. We have also dramatically increased the number of steps in the training process. Previously we kept this short to speed up processing time, but now that we are improving the model we need to give it time to train. At the same time we don’t want it to over-train and overfit the data. The program can run in two ways: one with no test or validation data, so you can train with all the data to get the best result on the next prediction; the other with test and validation steps, where you can see when improvement on the test and validation sets stops and hence when to stop training.

More Data

We are currently providing the previous 30 price changes for each of the Dow components. Perhaps this isn’t sufficient? Perhaps there is different data we could add? One common practice in stock technical analysis is to consider stock price moves when the trading volume is high as more important than when the trading volume is low. So perhaps if we add the trading volumes to the input data it will help give better results.

The way we do this is we modify the sliding windows we are providing to now give 60 bits of data for each of the Dow stocks. We now provide the price change followed by the matching volume. So our input vector is now alternating price changes and volumes and has grown to 1800 in length. Trading volumes of Dow stocks are quite large numbers, so we normalize these by dividing these by the first volume for a stock. This way they tend to be in the range 0 to 10. In this case the volume isn’t an output anywhere so we don’t need it in range of the activation functions, but we do want to keep the weights in the matrix from getting too large.
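Here is a hedged numpy sketch of that interleaving for a single stock. The prices and volumes are made-up illustrative numbers, and in the real program this is done across all thirty Dow stocks at once.

```python
import numpy as np

# Hypothetical closing prices and daily volumes for one stock over 5 days.
prices = np.array([100.0, 101.0, 100.5, 102.0, 101.5])
volumes = np.array([2.0e6, 3.0e6, 1.5e6, 2.5e6, 2.0e6])

# Day-over-day price changes (the model actually uses normalized prices).
changes = np.diff(prices)

# Normalize volumes by the first volume so they land roughly in 0..10.
norm_volumes = volumes[1:] / volumes[0]

# Interleave: change, volume, change, volume, ... doubling the vector length.
features = np.empty(changes.size * 2)
features[0::2] = changes
features[1::2] = norm_volumes

print(features)  # alternating change, volume pairs
```

Each stock now contributes 2 * NHistData entries instead of NHistData, which is how the full input vector grows from 900 to 1800.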

I won’t list all the code changes to do that here, but you can download the source file from my Google Drive.

We could also consider adding more data. Perhaps varying the amount of historical data. We could add calculated values that technical analysts find useful like weighted averages, Bollinger bands etc. We could get other metrics from Web services such as fundamental data on the stock, perhaps P/E ratio or market capitalization. We could construct metrics like the number of news feed stories on a stock. All these are possibilities to include. Perhaps later on we can look at methods to test whether these help. After training the model you can see which inputs don’t lead anywhere, i.e. all their connections have zero weights. By analysing these we can see which values the training process has considered important and which ones can be removed to simplify our model.
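That last idea can be sketched as follows, assuming a trained first-layer weight matrix with one row per input. In practice trained weights are rarely exactly zero, so the tolerance below is an arbitrary illustrative choice.

```python
import numpy as np

def unused_inputs(layer1_weights, tol=1e-4):
    # An input feeds row i of the first weight matrix; if every weight in
    # that row is near zero, training has effectively ignored the input.
    return [i for i, row in enumerate(layer1_weights)
            if np.all(np.abs(row) < tol)]

# Toy "trained" weights: 4 inputs x 3 hidden nodes; input 2 went unused.
weights = np.array([[0.30, -0.10, 0.20],
                    [0.05, 0.40, -0.20],
                    [0.00, 0.00001, 0.00],
                    [-0.30, 0.10, 0.25]])

print(unused_inputs(weights))  # [2]
```

Running this on the real layer1_weights after training would give a first cut at which input columns could be dropped.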

Number of Hidden Nodes

In the initial implementation we just set the number of hidden nodes to 16 in each layer. Partly we used a small number to make calculation quicker so we could iterate on our code faster. But what values should we use?

Neural Networks are very good at pattern recognition. They accomplish this in a similar way to the brain. They take heavily pixelated data and then discover structure in that data as they go from layer to layer. We want to encourage that, to have the network summarize the data somewhat to go from layer to layer. A common way to accomplish that is to have a large number of nodes adjacent to the input layer and then reduce it with each following layer.

We will start with a larger number, namely 64, and then halve it with each following layer.

# Variables.
layer1_weights = tf.Variable(tf.truncated_normal(
    [NHistData * num_stocks, num_hidden], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([num_hidden]))
layer2_weights = tf.Variable(tf.truncated_normal(
    [num_hidden, int(num_hidden / 2)], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(1.0, shape=[int(num_hidden / 2)]))
layer3_weights = tf.Variable(tf.truncated_normal(
    [int(num_hidden / 2), int(num_hidden / 4)], stddev=0.1))
layer3_biases = tf.Variable(tf.constant(1.0, shape=[int(num_hidden / 4)]))
layer4_weights = tf.Variable(tf.truncated_normal(
    [int(num_hidden / 4), num_labels], stddev=0.1))
layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))


I would have liked to use a larger number closer to the number of inputs, but it appears the model diverges for 128 or more hidden nodes in the first layer. I’ll have to investigate why some time in the future.
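As a rough sanity check on why the first layer dominates the model's size, we can count the weights and biases in the funnel shape. This back-of-the-envelope assumes the 900 price-change inputs (30 days times 30 stocks), hidden layers of 64, 32 and 16 nodes, and 30 outputs, matching the code above.

```python
# Layer sizes: 900 inputs -> 64 -> 32 -> 16 -> 30 outputs.
layer_sizes = [900, 64, 32, 16, 30]

def parameter_count(sizes):
    # Each layer has an (n_in x n_out) weight matrix plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes, sizes[1:]))

print(parameter_count(layer_sizes))  # 60782
```

Almost all of those parameters (57664 of them) sit in the first layer, so widening it to 128 roughly doubles the model, which may be part of why training becomes unstable.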


Below are the results for Sept. 26. Notice that the results are much better. It predicted the generally down day on the Dow and it isn’t being so conservative in its estimates anymore. It got some stocks, like MMM and MCD, really close. There were others, like PG and GS, that it got quite wrong. Overall though we seem to be starting to get something useful. However, to re-iterate, don’t bet your hard-earned money based on this simple model.



This article covered a few improvements that could easily be made to our model. This is still a very simple basic model, so please don’t use it to trade stocks. Even if you have a really good model, the hedge funds will still eat you alive.

Generally, this is how developing Neural Networks goes. You start with a simple small model and then iteratively enhance it to develop it into something good.

Written by smist08

September 27, 2016 at 3:48 pm

The Road to TensorFlow – Part 7: Finally Some Code



Well after a long journey through Linux, Python, Python Libraries, the Stock Market, an Introduction to Neural Networks and training Neural Networks we are now ready to look at a complete Python example to predict the stock market.

I placed the full source code listing on my Google Drive here. As described in the previous articles you will need to run this on a Mac or on Linux (could be a virtual image) with Python and TensorFlow installed. You will also need to have the various libraries that are imported at the top of the source file installed, or you will get an error when you go to run it. I would suggest getting the source file to play with; Python is very fussy about indentation, and copy/paste from the article may introduce indentation errors caused by the blog formatting.

The Neural Network we are running here is a simple feed forward network with four hidden layers and uses the hyperbolic tangent as the activation function in each case. This is a very simple model so don’t use it to invest with real money. Hopefully this article gives a flavour for how to create and train a Neural Network using TensorFlow. Then in future articles we can discuss the limitation of this model and how to improve it.

Import Libraries

First we import all the various libraries we will be using, note tensorflow and numpy as being particularly important.

# Copyright 2016 Stephen Smith

import time
import math
import os
from datetime import date
from datetime import timedelta
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import pandas_datareader as pdr
from pandas_datareader import data, wb
from six.moves import cPickle as pickle
from yahoo_finance import Share

Get Stock Market Data

Next we get the stock market data. If the file stocks.pickle exists we assume we’ve previously saved this file and use it. Otherwise we get the data from Yahoo Finance using a Web Service call, made via the Pandas DataReader. We only keep the adjusted close column and we fill in any NaNs with the first value we saw (this really only applies to Visa in this case). The data will all be in a standard Pandas data frame after this.

# Choose amount of historical data to use NHistData
NHistData = 30
TrainDataSetSize = 3000

# Load the Dow 30 stocks from Yahoo into a Pandas datasheet

dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
         'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
         'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG',
         'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']

num_stocks = len(dow30)

trainData = None
loadNew = False

# If stocks.pickle exists then this contains saved stock data, so use this,
# else use the Pandas DataReader to get the stock data and then pickle it.
stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
    try:
        with open(stock_filename, 'rb') as f:
            trainData = pickle.load(f)
    except Exception as e:
        print('Unable to process data from', stock_filename, ':', e)
    print('%s already present - Skipping requesting/pickling.' % stock_filename)
else:
    # Get the historical data. Make the date range quite a bit bigger than
    # TrainDataSetSize since there are no quotes for weekends and holidays. This
    # ensures we have enough data.

    f =, 'yahoo',
               - timedelta(days=TrainDataSetSize * 2 + 5),
    cleanData = f.ix['Adj Close']
    trainData = pd.DataFrame(cleanData)
    trainData.fillna(method='backfill', inplace=True)
    loadNew = True
    print('Pickling %s.' % stock_filename)
    try:
        with open(stock_filename, 'wb') as f:
            pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', stock_filename, ':', e)
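The backfill that fillna does for us is simple to illustrate in plain Python: each missing value is replaced by the next valid value. The prices below are made-up numbers, not real Visa quotes.

```python
import math

def backfill(values):
    # Replace each NaN with the next non-NaN value, scanning from the end.
    filled = list(values)
    next_valid = math.nan
    for i in range(len(filled) - 1, -1, -1):
        if math.isnan(filled[i]):
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled

# A stock with no quotes before its IPO has leading NaNs; backfill gives
# those rows the first traded price.
nan = math.nan
print(backfill([nan, nan, 44.0, 45.2, nan, 46.1]))
# [44.0, 44.0, 44.0, 45.2, 46.1, 46.1]
```

This is why the backfilled stretch of a late-listing stock like Visa contributes zero price changes to the early training windows.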

Normalize the Data

We then normalize the data and remember the factor we used so we can de-normalize the results at the end.

# Normalize the data by dividing each price by the first price for a stock.
# This way all the prices start together at 1.
# Remember the normalizing factors so we can go back to real stock prices
# for our final predictions.
factors = np.ndarray(shape=( num_stocks ), dtype=np.float32)
i = 0
for symbol in dow30:
    factors[i] = trainData[symbol][0]
    trainData[symbol] = trainData[symbol]/trainData[symbol][0]
    i = i + 1

Re-arrange the Data for TensorFlow

Now we need to build up our training data, test data and validation data. We need to format this as input arrays for the Neural Network. Looking at this code, I think true Python programmers will accuse me of being a C programmer (which I am), since I do this all with loops. I’m sure a more experienced Python programmer could accomplish this quicker with more array operations. This part of the code is quite slow, so we pickle the result; then if we re-run with the saved stock data, we can also use the saved training data.
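For the curious, the window building really can be done mostly with array operations. Here is a hedged numpy sketch for a single stock: turn normalized prices into day-over-day deltas, then take every overlapping run of `window` deltas as one row of input.

```python
import numpy as np

def delta_windows(prices, window):
    # Day-over-day price changes, then every overlapping run of `window`
    # consecutive changes as one row (one model input per row).
    deltas = np.diff(prices)
    n = deltas.size - window + 1
    return np.stack([deltas[i:i + window] for i in range(n)])

# Toy normalized prices for one stock.
prices = np.array([1.0, 1.1, 1.05, 1.2, 1.15, 1.3])
windows = delta_windows(prices, 3)
print(windows.shape)  # (3, 3)
```

The real program does this per stock and concatenates the per-stock windows side by side; the matching label for each row is the delta immediately after the window.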

# Configure how much of the data to use for training, testing and validation.

usableData = len(trainData.index) - NHistData + 1
#numTrainData =  int(0.6 * usableData)
#numValidData =  int(0.2 * usableData)
#numTestData = usableData - numTrainData - numValidData - 1
numTrainData = usableData - 1
numValidData = 0
numTestData = 0

train_dataset = np.ndarray(shape=(numTrainData - 1,
    num_stocks * NHistData), dtype=np.float32)
train_labels = np.ndarray(shape=(numTrainData - 1, num_stocks),
    dtype=np.float32)
valid_dataset = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks * NHistData), dtype=np.float32)
valid_labels = np.ndarray(shape=(max(0, numValidData - 1),
    num_stocks), dtype=np.float32)
test_dataset = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks * NHistData), dtype=np.float32)
test_labels = np.ndarray(shape=(max(0, numTestData - 1),
    num_stocks), dtype=np.float32)
final_row = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)
final_row_prices = np.ndarray(shape=(1, num_stocks * NHistData),
    dtype=np.float32)

# Build the training datasets in the correct format with the matching labels.
# So if we calculate based on the last 30 stock prices then the desired
# result is the 31st. So note that the first 29 data points can't be used.
# Rather than use the stock price, use the pricing deltas.
pickle_file = "traindata.pickle"
if loadNew == True or not os.path.exists(pickle_file):
    for i in range(1, numTrainData):
        for j in range(num_stocks):
            for k in range(NHistData):
                train_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k]
                    - trainData[dow30[j]][i + k - 1])
            train_labels[i-1][j] = (trainData[dow30[j]][i + NHistData]
                - trainData[dow30[j]][i + NHistData - 1])  

    for i in range(1, numValidData):
        for j in range(num_stocks):
            for k in range(NHistData):
                valid_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData]
                    - trainData[dow30[j]][i + k + numTrainData - 1])
            valid_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData]
                - trainData[dow30[j]][i + NHistData + numTrainData - 1])

    for i in range(1, numTestData):
        for j in range(num_stocks):
            for k in range(NHistData):
                test_dataset[i-1][j * NHistData + k] = (trainData[dow30[j]][i + k + numTrainData + numValidData]
                    - trainData[dow30[j]][i + k + numTrainData + numValidData - 1])
            test_labels[i-1][j] = (trainData[dow30[j]][i + NHistData + numTrainData + numValidData]
                - trainData[dow30[j]][i + NHistData + numTrainData + numValidData - 1])

    try:
      f = open(pickle_file, 'wb')
      save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
        }
      pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
      f.close()
    except Exception as e:
      print('Unable to save data to', pickle_file, ':', e)

else:
    with open(pickle_file, 'rb') as f:
      save = pickle.load(f)
      train_dataset = save['train_dataset']
      train_labels = save['train_labels']
      valid_dataset = save['valid_dataset']
      valid_labels = save['valid_labels']
      test_dataset = save['test_dataset']
      test_labels = save['test_labels']
      del save  # hint to help gc free up memory   

for j in range(num_stocks):
    for k in range(NHistData):
            final_row_prices[0][j * NHistData + k] = trainData[dow30[j]][k + len(trainData.index) - NHistData]
            final_row[0][j * NHistData + k] = (trainData[dow30[j]][k + len(trainData.index) - NHistData]
                - trainData[dow30[j]][k + len(trainData.index) - NHistData - 1])

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)


We now setup an accuracy function that is only used to report how we are doing during training. This isn’t used by the training algorithm. It roughly shows what percentage of predictions are within some tolerance.

# This accuracy function is used for reporting progress during training, it isn't actually
# used for training.
def accuracy(predictions, labels):
  err = np.sum( np.isclose(predictions, labels, 0.0, 0.005) ) / (predictions.shape[0] * predictions.shape[1])
  return (100.0 * err)

TensorFlow Variables

We now start setting up TensorFlow by creating our graph and defining our datasets and variables.

batch_size = 4
num_hidden = 16
num_labels = num_stocks

graph = tf.Graph()

# input is 30 days of dow 30 prices normalized to be between 0 and 1.
# output is 30 values for normalized next day price change of dow stocks
# use a 4 level neural network to compute this.

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, num_stocks * NHistData))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  tf_final_dataset = tf.constant(final_row)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [NHistData * num_stocks, num_hidden], stddev=0.05))
  layer1_biases = tf.Variable(tf.zeros([num_hidden]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_hidden], stddev=0.05))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.05))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

TensorFlow Model

We now define our Neural Network model. Hyperbolic Tangent is our activation function and the rest is matrix algebra as we described in previous articles.

  # Model.
  def model(data):
    hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer2_weights) + layer2_biases)
    hidden = tf.tanh(tf.matmul(hidden, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

Training Model

Now we setup the training model and the optimizer to use, namely gradient descent. We also define what are the correct answers to compare against.

  # Training computation.
  logits = model(tf_train_dataset)
  loss = tf.nn.l2_loss( tf.sub(logits, tf_train_labels))

  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
  # Predictions for the training, validation, and test data.
  train_prediction = logits
  valid_prediction = model(tf_valid_dataset)
  test_prediction = model(tf_test_dataset)
  next_prices = model(tf_final_dataset)

Run the Model

So far we have setup TensorFlow ready to go, but we haven’t calculated anything. This next set of code executes the training run. It will use the data we’ve provided in the configured batch size to train our network while printing out some intermediate information.

num_steps = 2052

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions =
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    acc = accuracy(predictions, batch_labels)
    if (step % 100 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % acc)
      if numValidData > 0:
          print('Validation accuracy: %.1f%%' % accuracy(
              valid_prediction.eval(), valid_labels))
  if numTestData > 0:        
      print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Make a Prediction

The final bit of code uses our trained model to make a prediction based on the last set of data we have (where we don’t know the right answer). If you get fresh stock market data for today, then the prediction will be for tomorrow’s price changes. If you run this late enough that Yahoo has updated its prices for the day, then you will get some real errors for comparison. Note that Yahoo is very slow and erratic about doing this, so be careful when reading this table.

  predictions = next_prices.eval() * factors
  print("Stock    Last Close  Predict Chg   Predict Next      Current     Current Chg       Error")
  i = 0
  for x in dow30:
      yhfeed = Share(x)
      currentPrice = float(yhfeed.get_price())
      print( "%-6s  %9.2f  %9.2f       %9.2f       %9.2f     %9.2f     %9.2f" % (x,
             final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
             predictions[0][i],
             final_row_prices[0][i * NHistData + NHistData - 1] * factors[i] + predictions[0][i],
             currentPrice,
             currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i],
             abs(predictions[0][i] - (currentPrice - final_row_prices[0][i * NHistData + NHistData - 1] * factors[i]))) )
      i = i + 1


Below is a screenshot of one run predicting the stock changes for Sept. 22. Basically it didn’t do very well. We’ll talk about why and what to do about this in a future article. As you can see it is very conservative in its predictions.



This article shows the code for training and executing a very simple Neural Network using TensorFlow. Definitely don’t bet on the stock market based on this model, it is very simple at this point. We still need to add a number of elements to start making this into a useful model which we’ll look at in future articles.

Written by smist08

September 23, 2016 at 4:17 pm