Stephen Smith's Blog

Musings on Machine Learning…

Updates to the TensorFlow API

leave a comment »


Last year I published a series of posts on getting up and running on TensorFlow and creating a simple model to make stock market predictions. The series starts here, however the coding articles are here, here and here. We are now a year later and TensorFlow has advanced by quite a few versions (1.3 as of this writing). In this article I’m going to rework that original Python code to use some simpler more powerful APIs from TensorFlow as well as adopt some best practices that weren’t well known last year (at least by me).

This is the same basic model we used last year, which I plan to improve on going forwards. I changed the data set to record the actual stock prices rather than differences. This doesn’t work so well since most of these stocks increase over time and since we go around and around on the training data, it tends to make the predictions quite low. I plan to fix this in a future articles where I handle this time series data correctly. But first I wanted to address a few other things before proceeding.

I’ve placed the updated source code on my Google Drive here.

Higher Level API

In the original code to create a layer in our Neural Network, we needed to define the weight and bias Tensors:

layer1_weights = tf.Variable(tf.truncated_normal(
      [NHistData * num_stocks * 2, num_hidden], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([num_hidden]))

And then define the layer with a complicated mathematical expression:

hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)

This code is then repeated with mild variations for every layer in the Neural Network. In the original code this was quite a large block of code.

In TensorFlow 1.3 there is now an API to do this:

hidden = tf.layers.dense(data, num_hidden, activation=tf.nn.elu,
        name=name + "model" + "hidden1")

This eliminates a lot of repetitive variable definitions and error prone mathematics.

Also notice the kernel_regularizer=tf.contrib.layers.l1_l2_regularizer() parameter. Previously we had to process the weights ourselves to add regularization penalties to the loss function, now TensorFlow will do this for you, but you still need to extract the values and add them to your loss function.

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([tf.nn.l2_loss( tf.subtract(logits, tf_train_labels))] + reg_losses)

You can get at the actual weights and biases if you need them in a similar manner as well.

Better Initialization

Previously we initialized the weights using a truncated normal distribution. Back then the recommendation was to use random values to get the initial weights away from zero. However since 2010 (quite a long time ago) there have been better suggestions and the new tf.layers.dense() API supports these. The original paper was “Understanding the difficulty of training deep feedforward neural networks” by Xavier Glorot and Yoshua Bengio. If you ran the previous example you would have gotten an uninitialized variable on he_init. Here is its definition:

he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")

The idea is that these initializers vary based on the number of inputs and outputs for the neuron layer. There is also tf.contrib.layers.xavier_initializer() and tf.contrib.layers.xavier_initializer_conv2d(). For this example with only two hidden layers it doesn’t matter so much, but if you have a much deeper Neural Network, using these initializers can greatly speed up training and avoid having the gradients either go to zero or explode early on.

Vanishing Gradients and Activation Functions

You might also notice I changed the activation function from tanh to elu. This is due to the problem of vanishing gradients. Since we are using Gradient Descent to train our system then any zero gradients will stop training improvement in that dimension. If you get large values out of the neuron then the gradient of the tanh function will be near zero and this causes training to get stalled. The relu function also has similar problems if the value ever goes negative then the gradient is zero and again training will likely stall and get stuck there. On solution to this is to use the elu function or a “leaky” relu function. Below are the graphs of elu, leaky relu and relu.

Leaky relu has a low sloped linear function for negative values. Elu uses an exponential type function to flatten out a bit to the left of zero so if things go a bit negative they can recover. Although if things go more negative with elu, they will get stuck again. Elu has the advantage that its is rigged to be differentiable at 0 to avoid special cases. Practically speaking both of these activation functions have given very good results in very deep Neural Networks which would otherwise get stuck during training with tanh, sigmoid or relu.

Scaling the Input Data

Neural networks work best if all the data is between zero and one. Previously we didn’t scale our data properly and just did an approximation by dividing by the first value. All that code has been deleted and we now use SciKit Learn’s MinMaxScaler object instead. You fit the data using the training data and then transform any data we process with the result. The code for us is:

# Scale all the training data to the range [0,1].
scaler = MinMaxScaler(copy=False)

The copy=False parameter basically says to do the conversion in place rather than producing a new copy of the data.

SciKit Learn has a lot of useful utility functions that can greatly help with using TensorFlow and well worth looking at even though you aren’t using a SciKit Learn Machine Learning function.


The field of Neural Networks is evolving rapidly and the best practices keep getting better. TensorFlow is a very dynamic and quickly evolving tool set which can sometimes be a challenge to keep up with.

The main learnings I wanted to share here are:

  • TensorFlow’s high level APIs
  • More sophisticated initialization like He Initialization
  • Avoiding vanishing gradients with elu or leaky ReLU
  • Scaling the input data to between zero and one

These are just a few things of the new things that I could incorporate. In the future I’ll address how to handle time series data in a better manner.


Written by smist08

October 16, 2017 at 9:42 pm

Components Leading to Strong AI

leave a comment »


There have been a lot of advances in AI in the past couple of years. A lot of these advances are better simulating the various functions of the brain. These include the convolutional neural networks which are very good at image recognition and new techniques to incorporate memory into neural networks.

Very Deep Neural Networks

In the early days of Neural Networks, finding the weights for the connections was very difficult and often performed by hand. Then the gradient descent algorithm came along and allowed bigger Neural Networks to be trained. Then in 1986 a groundbreaking paper by D. E. Rumelhart showed how to use back propagation to train a multi-level Neural Network with Gradient Descent. However the shape of the surface that is being optimized is often very ill suited to this algorithm, containing many local minimums, or more usually being very flat not indicating the direction to take. Plus depending on the problem the training data may contain lots of errors that can mislead the training process.

With recent tweaks to the training algorithms, researchers have managed to train very deep Neural Networks. For instance the Oxford Visual Geometry Group (VGG) has released a pre-trained 19 layer Neural Network for image recognition.

This is a great building block for other image manipulation projects like Image Style Transfer that we looked at previously.

Now these Neural Networks are starting to resemble the architecture and structure of biological Neurons in the human brain such as the following from the human cortex.

This shows that we are starting to accurately simulate the computational engine in our brains.

The Road to Memory

Although the deep neural networks in the last section are very large and powerful at some problems, other problems they fail at primarily due to a lack or memory or context. For instance if you are translating text word by word, you need to remember the previous words in the sentence to get a correct translation based on the context. Or you need to do a first pass word by word and then knowing the whole, correct mistakes based on now knowing more generally what is being said. Similarly as an algorithm deals with the world, it should learn about the world as it explores and gathers more information. Just retraining the whole Neural Network for each bit of new information is very inefficient.

For language translation and speech recognition the use of Recurrent Neural Networks (RNNs) have proven quite effective. In these the outputs from Neurons can feed into the inputs of the same layer or into the inputs of previous layers. The networks of the previous section were all feed forward Neural Networks since the output of a layer only feeds the input of the next layer. RNNs aren’t true non feed forward networks since they don’t iterate to find a solution with everything stabilized, Rather these outputs from use n go into the inputs of usage n+1. In this way these act as a sort of memory from usage to usage allowing the network to preserve some context from say word to word in translation.

More recent research has led to Neural Networks that can actually have memory banks. These include Long Short-Term Memory Cells (LSTM Cells) and Gated Recurrent Unit (GRU) Cells.

These artificial neurons have the ability to store memory values (as well as forget memory values). The key difficulty in adding memory to Neural Networks was in how to train them. Gradient Descent and all its variations require that the function being optimized is differentiable or very nearly so. Putting things in memory, reading memory and erasing memory are very discrete functions. These sort of functions are not differentiable and can’t be patched since they are flat with zero derivative elsewhere. Something with a zero derivative doesn’t give any information to Gradient Descent as to which direction to go. The solution to this was to replace the discrete functions with probability distributions that are differentiable. So rather than say put something in memory, the function gives you a probability that you should put the value in memory and then you do so if say the probability is greater than 50%.


I think the current tools for training Neural Networks work quite well for deep feedforward Neural Networks. I think they do a good job of training the weights to use in the various network layers. However I don’t think they provide a good solution for training systems with memory. The brain probably uses some process similar to what we do to train the input weights and outputs to biological Neurons, such as Hebbian Learning. However I don’t think this is what is used to decide whether to remember something or not. I think we still have a long way to go before effectively using memory in our Neural Networks even though just a little bit of memory is greatly improving our translators, speech and text recognition programs.


The field of Neural Networks is making great progress. This is due to advances in refining the training process of deep Neural Networks along with advances in making artificial Neurons more sophisticated by adding elements like memory banks. Combine this with the fast pace of development of GPUs allowing essentially low cost supercomputers for training and running these networks and the large amount of venture capital that is flowing into anything AI related and we are seeing a true renaissance in the AI field.

Does someone have a true deep AI running in their lab already? Perhaps; but, if they don’t I think we are starting to get quite close.

Written by smist08

September 29, 2017 at 9:12 pm

Playing with Image Style Transfer

leave a comment »


Last time we introduced Image Style Transfer, an AI algorithm that combines the contents of one image with the style of another image. In this article we are going to look at some ways to play with this process in more advanced ways. We are going to play with Anish Athalye’s implementation which is on GitHub here, this implementation is really good at allowing lots of tuning and playing.

Playing around this way is quite time consuming since you have to run Gradient Descent to find the solution, rather than just applying canned solutions. Since I ran all these on an older MacBook Air with no GPU, I had to use a lower resolution in the interest of time. At lower resolution (the MacOS’s small size) it took about an hour for each image. At medium resolution, it took about six hours to generate an image. This is ok for running over night but doesn’t allow a lot of play. Makes me wonder if I should get a beefy desktop computer with a good NVidia GPU?

I found a really good YouTube video explaining Image Style Transfer here which is well worth a watch.

Playing with Algorithms

We’ve seen in previous articles how we can play with the tunable parameters in AI algorithms to get quite different results. Here we’ll look at the effects of playing with some parameters as well as fiddling with the algorithm itself.

The basic observation that lead to Image Style Transfer was that a deep image recognition neural network extracts the features related to content in the lower layers and the features related to style in the higher layers. Interestingly the human brain’s image recognition neurons appear to be structured in the same sort of way and it is believed there is a fair bit of similarity between how an advanced image recognition algorithm works and how the brain works. This separation of content from style is then the basis for merging and manipulating these.

The Image Style Transfer algorithm works by starting with an image of white noise and then iterating it using stochastic gradient descent to minimize the difference between the content in one image and the style in the other. This is the loss function we often talk about in AI. The interesting part of the algorithm is that we aren’t training the neural network matrix weights, since these are pre-done by the VGG group, but we are training the input image. So we have a loss function like:

Total Loss = Loss of content from first image + Loss of style from the second image

We can then play with this Loss function in various ways which we’ll experiment with in the rest of this article.

Apply Some Weights

Usually in Machine Learning algorithms we apply weights everywhere that we can use to tune things. The same applies here. We can weight the contributions from content versus style in the total loss formula to achieve more of a contribution from style or content.

First we take a picture of Tetrahedron Peak and combine it with Vincent van Gogh’s Starry Night using the default settings of the algorithm:

Now we can try playing with the weight of the content contribution. Lower means more style, higher means more content. In the image above the content weight was the default of 5.

Notice the image on the left is much more abstract with the large stars appearing all over.

Using Multiple Styles

Last time we used one style at a time to get our result. But you can actually use the algorithm to incorporate multiple styles at once. In this case we just generalize the Loss function above as:

Total Loss = Loss of content form first image + Loss of style from style image 1 +
                 Loss of style form style image 2

Of course we can then further generalize this to any number of style images.

We’ll use our Starry Night combination and also use Picasso’s Dora Maar:

Now we will use both pictures for the style and see what we get:

This weights the styles of Starry Night and Dora Maar equally. However you can see from the Loss formula that we can easily weight the components and get say 75% Starry Night and 25% Dora Maar:


Now if we reverse the weights and do Starry Night at 25% and Dora Maar at 75%:

Playing with the Neural Network

We can also play with the Neural Network used. We  can change a number of parameters in the Neural Network as well as introduce various scaling and weight factors.

Pooling Type

For instance there are something called Pooling Layers in the network. These reduce the resolution of the image and help with reducing the abstraction from fine level details to higher level abstractions. There are two commonly used types of pooling layers namely average pooling and max pooling. We can try either of these to see what affect that might have on the image style transfer.

Here we see that average pooling favoured fine details and preserved more of the content image. Whereas max pooling used more of the style image and is a bit more abstract.

Exponential Style Layer Weight

Another thing we can do is magnify some layers over others. For instance we can magnify each style layer over the last one as follows:

weight(layer<n+1>) = weight_exp*weight(layer<n>)

The default is 1 (ie none). Here is Tetrahedron Peak using 0.2 and 2.0.

A factor less than one means more original content since some style layers are suppressed, and a factor greater than one magnifies some style layer contributions. Since the style layers aren’t all weighted the same  this is a bit different than just changing the weighting factor between content and style.


Another parameter that is fun to play with is the number of iterations that Gradient Descent runs for. Below we can see a sequence of images as the number of iterations is increased. We can see the content and style of the image forming out of the initial white noise.

At this resolution we are pretty much converged at 500 iterations, however for higher resolution and more complicated images more iterations might be necessary. We could also use a stopping criterion like when the loss function stops changing by some delta, rather than using a fixed number of iterations.

This problem converges quite well since it is mathematically well defined. Often in AI, we don’t get this good behaviour because the training data has lots of errors and/or lots of noise. Here we are just training against a content picture and one or more style pictures, so by definition there isn’t any erroneous data. These challenges would have been faced and solved by the team developing the VGG image recognition neural network that we get to just use and don’t have to worry about training.


As we can see we can get quite a few different effects by tuning the algorithm using the same style picture as a reference. Simple tools like Prisma or don’t let you play with all these parameters. As a photographer who is trying to get a specific effect, you want the power and flexibility to tune your style transfer exactly. Right now the only way to do this is to run the AI algorithms on your computer and play with them which is very time consuming. I suspect once this technology is incorporated in more advanced tools then various degrees of tuning will be possible. Adobe has been demonstrating Image Transfer Style in their labs, and it will be interesting to see if they incorporate it into Photoshop and then how much tuning is possible. Also if it runs in the Adobe Creative Cloud, it will be interesting to see whether it’s quicker running that way than running on your own computer.


Written by smist08

August 21, 2017 at 4:29 pm

An Introduction to Image Style Transfer

with 2 comments


Image Style Transfer is an AI technique that is becoming quite popular for enhancing or stylizing photos. It takes one picture (often a classical painting) and then applies the style of that picture to another picture. For example I could take this photo of the Queen of Surrey passing Hopkins Landing:

Combined with the style of Vincent van Gogh’s Starry Night:

To then feed these through the AI algorithm to get:

In this article, we’ll be look at some of the ways you can accomplish this yourself either through using online services or running your own Neural Network with TensorFlow.

Playing with Image Style Transfer

There are lots of services that let you play with this. Generally to apply a canned style to your own picture is quite fast (a few seconds). To provide your own photo as the style photo is more involved, since it involves “training” the style and this can take 30 minutes (or more).

Probably the most popular program is the Prisma app for either iPhone or Android. This app has a large number of pre-trained styles and can apply any of them to any photo on your phone. This app works quite well and gives plenty of variety to play with. Plus its free. Here is the ferry in Prisma’s comic theme:

If you want to provide your own photo as the style reference then is a good choice. This is available as a web app as well as either an iPhone or Android app. The good part about this for photographers is that you can copy photos from your good camera to your computer and then use this program’s website, no phone required. This site has some pre-programmed styles based on Vincent van Gogh which work really quickly and produce good results. Then it has the ability to upload a style photo. Processing a style is more work and typically takes 25 minutes (you can pay to have it processed quicker, but not that much quicker). If you don’t mind the wait this site is free and works quite well. Here is an example of the ferry picture above van Gogh’ized by (sorry they don’t label the styles so I don’t know which painting this is styled from):

Playing More Directly

These programs are great fun, but I like to tinker with things myself on my computer. So can I run these programs myself? Can I get the source code? Fortunately the answer to both is yes. This turns out to be a bit easier than you first might think, largely due to a project out of the Visual Geometry Group (VGG) at the University of Oxford. They created an exceptional image recognition neural network that they trained and won several competitions with. It turns out that the backbone to doing Image Style Transfer is to have a good image recognition Neural Network. This Neural Net is 19 layers deep and Oxford released the fully trained network for anyone to use. Several people have then taken this network, figured out how to load it into TensorFlow and created some really good Image Style Transfer programs based on this. The first program I played with was Anish Athalye’s program posted on GitHub here. This program uses VGG and can train a neural network for a given style picture. Anish has quite a good write up on his blog here.

Then I played with a program that expanded on Anish’s by Shafeen Tejani which is on GitHub here along with a blog post here. This program lets you keep the trained network so you can perform the transformation quickly on any picture you like. This is similar to how Prisma works. The example up in the introduction was created with this picture. To train the network you require a training set of image like the Microsoft COCO collection.

Running these programs isn’t for everyone. You have to be used to running Python programs and have TensorFlow installed and working on your system. You need a few other dependent Python libraries and of course you need the VGG saved Neural Network. But if you already have Python and TensorFlow, I found both of these programs just ran and I could play with them quite easily.

The writeups on all these programs highly recommend having a good GPU to speed up the calculations. I’m playing on an older MacBook Air with no GPU and was able to get quite good results. One trick I found that helped is to play with reduced resolution images to help speed up the process, then run the algorithm on a higher resolution version when you have things right. I found I couldn’t use the full resolution from my DLSR (12meg), but had to use the Apple’s “large” size (286KB).


This was a quick introduction to Image Style Transfer. We are seeing this in more and more places. There are applications that can apply this same technique to videos. I expect this will become a standard part of all image processing software like PhotoShop or Gimp. It also might remain the domain of specialty programs like HDR has, since it is quite technical and resource intensive. In the meantime projects like VGG have made this technology quite accessible for anyone to play with.

Written by smist08

August 14, 2017 at 6:48 pm

A Crack in the TensorFlow Platform

leave a comment »


Last time we looked at how some tunable parameters through off a TensorFlow solution of a linear regression problem. This time we are going to look at a few more topics around TensorFlow and linear regression. Then we’ll look at how Google is implementing Linear Regression and some problems with their approach.

TensorFlow Graphs

Last time we looked at calculating the solution to a linear regression problem directly using TensorFlow. That bit of code was:

# Now lets calculated the least squares fit exactly using TensorFlow
X = tf.constant(data[:,0], name="X")
Y = tf.constant(data[:,1], name="Y")

Xavg = tf.reduce_mean(X, name="Xavg")
Yavg = tf.reduce_mean(Y, name="Yavg")
num = (X - Xavg) * (Y - Yavg)
denom = (X - Xavg) ** 2
rednum = tf.reduce_sum(num, name="numerator")
reddenom = tf.reduce_sum(denom, name="denominator")
m = rednum / reddenom
b = Yavg - m * Xavg
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    mm, bb =[m, b])


TensorFlow does all its calculations based on a graph where the various operators and constants are nodes that then get connected together to show dependencies. We can use TensorBoard to show the graph for the snippet of code we just reviewed here:

Notice that TensorFlow overloads the standard Python numerical operators, so when we get a line of code like: “denom = (X – Xavg) ** 2”, since X and Xavg are Tensors then we actually generate TensorFlow nodes as if we had called things like tf.subtract and tf.pow. This is much easier code to write, the only downside being that there isn’t a name parameter to label the nodes to get a better graph out of TensorBoard.

With TensorFlow you perform calculations in two steps, first you build the graph (everything before the with statement) and then you execute a calculation by specifying what you want. To do this you create a session and call run. In run we specify the variables we want calculated. TensorFlow then goes through the graph calculating anything it needs to, to get the variables we asked for. This means it may not calculate everything in the graph.

So why does TensorFlow follow this model? It seems overly complicated to perform numerical calculations. The reason is that there are algorithms to separate graphs into separate independent components that can be calculated in parallel. Then TensorFlow can delegate separate parts of the graph to separate GPUs to perform the calculation and then combine the results. In this example this power isn’t needed, but once you are calculating a very complicated large Neural Network then this becomes a real selling point. However since TensorFlow is a general tool, you can use it to do any calculation you wish on a set of GPUs.

TensorFlow’s New LinearRegressor Estimator

Google has been trying to turn TensorFlow into a platform for all sorts of Machine Learning algorithms, not just Neural Networks. They have added estimators for Random Forests and for Linear Regression. However they did this by using the optimizers they created for Neural Nets rather than using the standard algorithms used in other libraries, like those implemented in SciKit Learn. The reasoning behind this is that they have a lot of support for really really big models with lots of support for one-hot encoding, sparse matrices and so on. However the algorithms that solve the problem seem to be exceedingly slow and resource hungry. Anything implemented in TensorFlow will run on a GPU, and similarly any Machine Learning algorithm can be implemented in TensorFlow. The goal here is to have TensorFlow running the Google AI Cloud where all the virtual machines have Google designed GPU like AI accelerator hardware. But I think unless they implement the standard algorithms, so they can solve things like a simple least squares regression quickly hand accurately then its usefulness will be limited.

Here is how you solve our fire versus theft linear regression this way in TensorFlow:


features = [tf.contrib.layers.real_valued_column("x", dimension=1)]
estimator = tf.contrib.learn.LinearRegressor(feature_columns=features,
# Input builders
input_fn ={"x":x}, y,
     num_epochs=10000), steps=2000)

mm = estimator.get_variable_value('linear/x/weight')
bb = estimator.get_variable_value('linear/bias_weight')
print(mm, bb)


This solves the problem and returns a slope of 1.50674927 and intercept of 13.47268105 (the correct numbers from last post are 1.31345600492 and 16.9951572327). By increasing the steps in the fit statement I can get closer to the correct answer, but it is very time consuming.

The documentation for these new estimators is very limited, so I’m not 100% sure it’s solving least squares, but I tried getting the L1 solution using SciKit Learn and it was very close to least squares, so whatever this new estimator is estimating (which might be least squares), it is very slow and quite inaccurate. It is also strange that we now have a couple of tunable parameters added to make a fairly simple calculation problematic. The graph for this solution isn’t too bad, but still since we know the exact solution it is a bit disappointing.

Incidentally I was planning to compare the new TensorFlow RandomForest estimator to the Scikit Learn implementation. Although the SciKit Learn one is quite fast, it uses a huge amount of memory so I kind of would like a better solution. But when I compared the two I found the TensorFlow one so bad (both slow and resource intensive) that I didn’t bother blogging it. I hope that by the time this solution becomes more mainstream in TensorFlow that it improves a lot.


TensorFlow is a very powerful engine for performing calculations that can be automatically parallelized and distributed over multiple GPUs for amazing computational speeds. This really does make it possible to spend a few thousand dollars and build quite a powerful supercomputer.

The downside is that Google appears to have the hammer of their neural network optimizers that they really want to use. As a result they are treating everything else as a nail and hitting it with this hammer. The results are quite sub-optimal. I think they do need to spend the time to implement a few of the standard non-Neural Network algorithms properly in TensorFlow if they really want to unleash the power of this platform.

Written by smist08

August 8, 2017 at 10:09 pm

Dangers of Tunable Parameters in TensorFlow

with 2 comments


One of the great benefits of the Internet era has been the democratization of knowledge. A great contributor to this is the number of great Universities releasing a large number of high quality online courses that anyone can access for free. I was going through one of these, namely Stanford’s CS 20SI: Tensorflow for Deep Learning Research and playing with TensorFlow to follow along. This is an excellent course and the course notes could be put together into a nice book on TensorFlow. I was going through “Lecture note 3: Linear and Logistic Regression in TensorFlow”, which starts with a simple example of using TensorFlow to perform a linear regression. This example demonstrates how to use TensorFlow to solve this problem iteratively using Gradient Descent. This approach will then be turned to much harder problems where this is necessary, however for linear regression we can actually solve the problem exactly. I did this and got very different results than the lesson. So I investigated and figured I’d blog a bit on why this is the case as well as provide some code for different approaches to this problem. Note that a lot of the code in this article comes directly from the Stanford course notes.

The Example Problem

The sample data they used was fire and theft data in Chicago to see if there is a relation between the number of fires in a neighborhood to the number of thefts. The data is available here. If we download the Excel version of the file then we can read it with Python XLRD package.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import xlrd

DATA_FILE = "data/fire_theft.xls"

# Step 1: read in data from the .xls file
book = xlrd.open_workbook(DATA_FILE, encoding_override="utf-8")
sheet = book.sheet_by_index(0)
data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])
n_samples = sheet.nrows - 1

With the data loaded in we can now try linear regression on it.

Solving With Gradient Descent

This is the code from the course notes which solve the problem by minimizing the loss function which is defined as the square of the difference (ie least squares). I’ve blogged a bit about using TensorFlow this way in my Road to TensorFlow series of posts like this one. Its uses the GadientDecentOptimizer and iterates through the data a few times to arrive at a solution.

# Step 2: create placeholders for input X (number of fire) and label Y (number of theft)
X = tf.placeholder(tf.float32, name="X")
Y = tf.placeholder(tf.float32, name="Y")

# Step 3: create weight and bias, initialized to 0
w = tf.Variable(0.0, name="weights")
b = tf.Variable(0.0, name="bias")

# Step 4: construct model to predict Y (number of theft) from the number of fire
Y_predicted = X * w + b

# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name="loss")

# Step 6: using gradient descent with learning rate of 0.01 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:

    # Step 7: initialize the necessary variables, in this case, w and b

    # Step 8: train the model
    for i in range(100): # run 100 epochs
        for xx, yy in data:

            # Session runs train_op to minimize loss
  , feed_dict={X: xx, Y:yy})

    # Step 9: output the values of w and b
    w_value, b_value =[w, b])

Running this results in w (the slope) as 1.71838 and b (the intercept) as 15.7892.

Solving Exactly with TensorFlow

We can solve the problem exactly with TensorFlow. You can find the formula for this here, or a complete derivation of the formula here.

# Now lets calculated the least squares fit exactly using TensorFlow
X = tf.constant(data[:,0], name="X")
Y = tf.constant(data[:,1], name="Y")

Xavg = tf.reduce_mean(X, name="Xavg")
Yavg = tf.reduce_mean(Y, name="Yavg")
num = (X - Xavg) * (Y - Yavg)
denom = (X - Xavg) ** 2
rednum = tf.reduce_sum(num, name="numerator")
reddenom = tf.reduce_sum(denom, name="denominator")
m = rednum / reddenom
b = Yavg - m * Xavg
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    mm, bb =[m, b])

This results in a slope of 1.31345600492 and intercept of 16.9951572327.

Solving with NumPy

My first thought was that I did something wrong in TensorFlow, so I thought why not just solve it with NumPy. NumPy has a linear algebra subpackage which easily solves this.

# Calculate least squares fit exactly using numpy's linear algebra package.
x = data[:, 0]
y = data[:, 1]
m, c = np.linalg.lstsq(np.vstack([x, np.ones(len(x))]).T, y)[0]

There is a little extra complexity since it handles n dimensions, so you need to reformulate the data from a vector to a matrix for it to be happy. This then returns the same result as the exact TensorFlow, so I guess my code was somewhat correct.

Visualize the Results

You can easily visualize the results with matplotlib.

# Plot the calculated line against the data to see how it looks.
plt.plot(x, y, "o")
plt.plot([0, 40], [bb, mm * 40 + bb], 'k-', lw=2)

This leads to the following pictures. First we have the plot of the bad result from GradientDecent.

This course instructor looked at this and decided it wasn’t very good (which it isn’t) and that the solution was to fit the data with a parabola instead. The parabola gives a better result as far as the least squares error because it nearly goes through the point on the upper right. But I don’t think that leads to a better predictor because if you remove that one point the picture is completely different. My feeling is that the parabola is already overfitting the problem.

Here is the result with the exact correct solution:

To me this is a better solution because it represents the lower right data better. Looking at this gives much less impetus to replace it with a concave up parabola. The course then looks at some correct solutions, but built on the parabola model rather than a linear model.

What Went Wrong?

So what went wrong with the Gradient Descent solution? My first thought was that it didn’t iterate the data enough, just doing 100 iterations wasn’t enough. So I increased the number of iterations but this didn’t greatly improve the result. I know that theoretically Gradient Descent should converge for least squares since the derivatives are easy and well behaved. Next I tried making the learning rate smaller, this improved the result, and then also doing more iterations solved the problems. I found to get a reasonable result I needed to reduce the learning rate by a factor of 100 to 0.00001 and increase the iterations by 100 to 10,000. This then took about 5 minutes to solve on my computer, as opposed to the exact solution which was instantaneous.

The lesson here is that too high a learning rate leads to the result circling the solution without being able to converge to it. Once the learning rate is reduced so small, it takes a long time for the solution to move from the initial guess to the correct solution which is why we need so many iterations.

This highlights why many algorithms build in adaptable learning rates where they are higher when moving quickly and then they dynamically reduce to zero in on a solution.


Most Machine Learning algorithms can’t be double checked by comparing them to the exact solution. But this example highlights how a simple algorithm can return a wrong result, but a result that is close enough to fool a Stanford researcher and make them (in my opinion) go in a wrong direction. It shows the danger we have in all these tunable parameters to Machine Learning algorithms, how getting things like the learning rate or number of iterations incorrect can lead to quite misleading results.


Written by smist08

August 4, 2017 at 6:25 pm

Your New AI Accountant

leave a comment »


We live in a complex world where we’ve accumulated huge amounts of knowledge. Knowing everything for a profession is getting harder and harder. We have better and better knowledge retrieval programs that let us look up information at our fingertips using natural language queries and Google like searches. But in fields like medicine and law, sifting through all the results and sorting out what is relevant and important is getting harder and harder. Especially in medicine there is a lot of bogus and misleading information that can lead to disastrous results. This is a prime application area where Artificial Intelligence and Machine Learning are starting show some real promise. We have applications like IBM’s Watson successfully diagnosing some quite rare conditions that stumped doctors. We have systems like ROSS that provide AI solutions for law firms.

How about AIs supplementing Accountants? Accountants are very busy and in demand.  All the baby boomers are retiring now, and far more Accountants are retiring than are being replaced by young people entering the profession. For many businesses getting professional business advice from Accountants is getting to be a major problem. This affects them properly meeting financial reporting requirements, legal regulatory compliance and generally having a firm complete understanding on how their business is doing. This article is going to look at how AI can help with this problem. We’ll look at the sort of things that AIs can be trained to do to help perform some of these functions. Of course you will still need an Accountant to provide human oversight and to provide a sanity check, but if things are setup correctly to start with, it will save you a lot of time and money.


If you have an AI with Accounting knowledge, how can it help you? In this sections we’ll look at a few ways that the AI system could interact with both the employees of the business as well as the Business Applications the business uses like their Accounting or CRM systems.


Chatbots are becoming more common, here you either type natural language queries to the AI, or it has a voice recognition component that you can talk to. The query processor is connected to the AI and the AI is then connected to your company’s databases as well as a wealth of professional information on the Internet. These AIs usually have multiple components for voice input, natural language processing, various business areas of expertise, and multiple ways of presenting results.

There have been some notable chatbot failures like Microsoft’s Twitter Chatbot which quickly became a racist asshole. But we are starting to see the start of some more successful implementations like Sage’s Pegg or KLM’s Messenger Bot. Plus the general purpose bots like Alexa, Siri and Allo are getting rather good. There are also some really good toolkits, like Amazon Lex, available to develop chatbots so this becomes easier for more and more developers.

In-program Advice

There have been some terrible examples of in-product advice such as the best forgotten Microsoft Clippy. But with advances in User Centered Design, much less intrusive and subtle ways of helping users have emerged. Generally these require good content so what they present is actually useful, plus they have to be unobtrusive so they never interfere with someone doing their work unless they want to pay attention to them. Then when they are used they can offer to make changes automatically, provide more information or keep things to a simple one line tip.

If these help technologies are combined with an AI engine then they can monitor what the user is doing and present application and context based help. For instance suggesting that perhaps a different G/L account should be used here for better Financial Reporting. Perhaps suggesting that the sales taxes on an invoice should be different due to some local regulations. Making suggestions on additional items that should be added to an Accounting document.

These technologies allow the system to learn from how a company uses the product and to make more useful suggestions. As well as having access to industry standards that can be incorporated to assist.

Offline Monitoring

In most larger businesses, the person using the Business Application isn’t the one that needs or can use an Accountant’s advice. Most data entry personnel have to follow corporate procedures and would get fired if they changed what they’ve been told to do, even if it’s wrong. Usually this has to be the CFO or someone senior in the Accounting department. In these cases an AI can monitor what is going on in the business and make recommendations to the right person. Perhaps seeing how G/L Accounts are being used and sending a recommendation for some changes to facilitate better Financial Reporting or regulatory compliance.

Care has to be taken to keep this functionality clear of other unpopular productivity monitoring software that does things like record people’s keystrokes to monitor when they are working and how fast. Generally this functionality has to stick to improving the business rather than be perceived as big brother snitching on everyone.


Most small business owners consider Accounting as a necessary evil that they are required to do to submit their corporate income tax. They do the minimum required and don’t pay much attention to the results. But as their company grows their Accounting data can give them great insights to how their business is running. Managing Inventory, A/R and A/P make huge differences to a company’s cash flow and profitability. Correctly and proactively handling regulatory compliance can be a huge time saver and huge cost saver in fines and lawsuits.

It used to be that sophisticated programs to handle these things required huge IT departments, millions of dollars invested in software and really was only available to large corporations. With the current advances in AI and Machine Learning, many of these sophisticated functionalities can be integrated into the Business Applications used by all small and medium sized businesses. In fact in a few years this will be a mandatory feature that users expect in all the software they use.

Written by smist08

July 29, 2017 at 8:42 pm