## Posts Tagged ‘**TensorFlow**’

## An Introduction to Image Style Transfer

# Introduction

Image Style Transfer is an AI technique that is becoming quite popular for enhancing or stylizing photos. It takes one picture (often a classical painting) and then applies the style of that picture to another picture. For example I could take this photo of the Queen of Surrey passing Hopkins Landing:

Combined with the style of Vincent van Gogh’s Starry Night:

To then feed these through the AI algorithm to get:

In this article, we’ll be look at some of the ways you can accomplish this yourself either through using online services or running your own Neural Network with TensorFlow.

# Playing with Image Style Transfer

There are lots of services that let you play with this. Generally to apply a canned style to your own picture is quite fast (a few seconds). To provide your own photo as the style photo is more involved, since it involves “training” the style and this can take 30 minutes (or more).

Probably the most popular program is the Prisma app for either iPhone or Android. This app has a large number of pre-trained styles and can apply any of them to any photo on your phone. This app works quite well and gives plenty of variety to play with. Plus its free. Here is the ferry in Prisma’s comic theme:

If you want to provide your own photo as the style reference then deepart.io is a good choice. This is available as a web app as well as either an iPhone or Android app. The good part about this for photographers is that you can copy photos from your good camera to your computer and then use this program’s website, no phone required. This site has some pre-programmed styles based on Vincent van Gogh which work really quickly and produce good results. Then it has the ability to upload a style photo. Processing a style is more work and typically takes 25 minutes (you can pay to have it processed quicker, but not that much quicker). If you don’t mind the wait this site is free and works quite well. Here is an example of the ferry picture above van Gogh’ized by deepart.io (sorry they don’t label the styles so I don’t know which painting this is styled from):

# Playing More Directly

These programs are great fun, but I like to tinker with things myself on my computer. So can I run these programs myself? Can I get the source code? Fortunately the answer to both is yes. This turns out to be a bit easier than you first might think, largely due to a project out of the Visual Geometry Group (VGG) at the University of Oxford. They created an exceptional image recognition neural network that they trained and won several competitions with. It turns out that the backbone to doing Image Style Transfer is to have a good image recognition Neural Network. This Neural Net is 19 layers deep and Oxford released the fully trained network for anyone to use. Several people have then taken this network, figured out how to load it into TensorFlow and created some really good Image Style Transfer programs based on this. The first program I played with was Anish Athalye’s program posted on GitHub here. This program uses VGG and can train a neural network for a given style picture. Anish has quite a good write up on his blog here.

Then I played with a program that expanded on Anish’s by Shafeen Tejani which is on GitHub here along with a blog post here. This program lets you keep the trained network so you can perform the transformation quickly on any picture you like. This is similar to how Prisma works. The example up in the introduction was created with this picture. To train the network you require a training set of image like the Microsoft COCO collection.

Running these programs isn’t for everyone. You have to be used to running Python programs and have TensorFlow installed and working on your system. You need a few other dependent Python libraries and of course you need the VGG saved Neural Network. But if you already have Python and TensorFlow, I found both of these programs just ran and I could play with them quite easily.

The writeups on all these programs highly recommend having a good GPU to speed up the calculations. I’m playing on an older MacBook Air with no GPU and was able to get quite good results. One trick I found that helped is to play with reduced resolution images to help speed up the process, then run the algorithm on a higher resolution version when you have things right. I found I couldn’t use the full resolution from my DLSR (12meg), but had to use the Apple’s “large” size (286KB).

# Summary

This was a quick introduction to Image Style Transfer. We are seeing this in more and more places. There are applications that can apply this same technique to videos. I expect this will become a standard part of all image processing software like PhotoShop or Gimp. It also might remain the domain of specialty programs like HDR has, since it is quite technical and resource intensive. In the meantime projects like VGG have made this technology quite accessible for anyone to play with.

## A Crack in the TensorFlow Platform

# Introduction

Last time we looked at how some tunable parameters through off a TensorFlow solution of a linear regression problem. This time we are going to look at a few more topics around TensorFlow and linear regression. Then we’ll look at how Google is implementing Linear Regression and some problems with their approach.

# TensorFlow Graphs

Last time we looked at calculating the solution to a linear regression problem directly using TensorFlow. That bit of code was:

# Now lets calculated the least squares fit exactly using TensorFlow X = tf.constant(data[:,0], name="X") Y = tf.constant(data[:,1], name="Y") Xavg = tf.reduce_mean(X, name="Xavg") Yavg = tf.reduce_mean(Y, name="Yavg") num = (X - Xavg) * (Y - Yavg) denom = (X - Xavg) ** 2 rednum = tf.reduce_sum(num, name="numerator") reddenom = tf.reduce_sum(denom, name="denominator") m = rednum / reddenom b = Yavg - m * Xavg with tf.Session() as sess: writer = tf.summary.FileWriter('./graphs', sess.graph) mm, bb = sess.run([m, b])

TensorFlow does all its calculations based on a graph where the various operators and constants are nodes that then get connected together to show dependencies. We can use TensorBoard to show the graph for the snippet of code we just reviewed here:

Notice that TensorFlow overloads the standard Python numerical operators, so when we get a line of code like: “denom = (X – Xavg) ** 2”, since X and Xavg are Tensors then we actually generate TensorFlow nodes as if we had called things like tf.subtract and tf.pow. This is much easier code to write, the only downside being that there isn’t a name parameter to label the nodes to get a better graph out of TensorBoard.

With TensorFlow you perform calculations in two steps, first you build the graph (everything before the with statement) and then you execute a calculation by specifying what you want. To do this you create a session and call run. In run we specify the variables we want calculated. TensorFlow then goes through the graph calculating anything it needs to, to get the variables we asked for. This means it may not calculate everything in the graph.

So why does TensorFlow follow this model? It seems overly complicated to perform numerical calculations. The reason is that there are algorithms to separate graphs into separate independent components that can be calculated in parallel. Then TensorFlow can delegate separate parts of the graph to separate GPUs to perform the calculation and then combine the results. In this example this power isn’t needed, but once you are calculating a very complicated large Neural Network then this becomes a real selling point. However since TensorFlow is a general tool, you can use it to do any calculation you wish on a set of GPUs.

# TensorFlow’s New LinearRegressor Estimator

Google has been trying to turn TensorFlow into a platform for all sorts of Machine Learning algorithms, not just Neural Networks. They have added estimators for Random Forests and for Linear Regression. However they did this by using the optimizers they created for Neural Nets rather than using the standard algorithms used in other libraries, like those implemented in SciKit Learn. The reasoning behind this is that they have a lot of support for really really big models with lots of support for one-hot encoding, sparse matrices and so on. However the algorithms that solve the problem seem to be exceedingly slow and resource hungry. Anything implemented in TensorFlow will run on a GPU, and similarly any Machine Learning algorithm can be implemented in TensorFlow. The goal here is to have TensorFlow running the Google AI Cloud where all the virtual machines have Google designed GPU like AI accelerator hardware. But I think unless they implement the standard algorithms, so they can solve things like a simple least squares regression quickly hand accurately then its usefulness will be limited.

Here is how you solve our fire versus theft linear regression this way in TensorFlow:

features = [tf.contrib.layers.real_valued_column("x", dimension=1)] estimator = tf.contrib.learn.LinearRegressor(feature_columns=features, model_dir='./linear_estimator')

# Input builders input_fn = tf.contrib.learn.io.numpy_input_fn({"x":x}, y, num_epochs=10000) estimator.fit(input_fn=input_fn, steps=2000) mm = estimator.get_variable_value('linear/x/weight') bb = estimator.get_variable_value('linear/bias_weight') print(mm, bb)

This solves the problem and returns a slope of 1.50674927 and intercept of 13.47268105 (the correct numbers from last post are 1.31345600492 and 16.9951572327). By increasing the steps in the fit statement I can get closer to the correct answer, but it is very time consuming.

The documentation for these new estimators is very limited, so I’m not 100% sure it’s solving least squares, but I tried getting the L1 solution using SciKit Learn and it was very close to least squares, so whatever this new estimator is estimating (which might be least squares), it is very slow and quite inaccurate. It is also strange that we now have a couple of tunable parameters added to make a fairly simple calculation problematic. The graph for this solution isn’t too bad, but still since we know the exact solution it is a bit disappointing.

Incidentally I was planning to compare the new TensorFlow RandomForest estimator to the Scikit Learn implementation. Although the SciKit Learn one is quite fast, it uses a huge amount of memory so I kind of would like a better solution. But when I compared the two I found the TensorFlow one so bad (both slow and resource intensive) that I didn’t bother blogging it. I hope that by the time this solution becomes more mainstream in TensorFlow that it improves a lot.

# Summary

TensorFlow is a very powerful engine for performing calculations that can be automatically parallelized and distributed over multiple GPUs for amazing computational speeds. This really does make it possible to spend a few thousand dollars and build quite a powerful supercomputer.

The downside is that Google appears to have the hammer of their neural network optimizers that they really want to use. As a result they are treating everything else as a nail and hitting it with this hammer. The results are quite sub-optimal. I think they do need to spend the time to implement a few of the standard non-Neural Network algorithms properly in TensorFlow if they really want to unleash the power of this platform.

## Dangers of Tunable Parameters in TensorFlow

# Introduction

One of the great benefits of the Internet era has been the democratization of knowledge. A great contributor to this is the number of great Universities releasing a large number of high quality online courses that anyone can access for free. I was going through one of these, namely Stanford’s CS 20SI: Tensorflow for Deep Learning Research and playing with TensorFlow to follow along. This is an excellent course and the course notes could be put together into a nice book on TensorFlow. I was going through “Lecture note 3: Linear and Logistic Regression in TensorFlow”, which starts with a simple example of using TensorFlow to perform a linear regression. This example demonstrates how to use TensorFlow to solve this problem iteratively using Gradient Descent. This approach will then be turned to much harder problems where this is necessary, however for linear regression we can actually solve the problem exactly. I did this and got very different results than the lesson. So I investigated and figured I’d blog a bit on why this is the case as well as provide some code for different approaches to this problem. Note that a lot of the code in this article comes directly from the Stanford course notes.

# The Example Problem

The sample data they used was fire and theft data in Chicago to see if there is a relation between the number of fires in a neighborhood to the number of thefts. The data is available here. If we download the Excel version of the file then we can read it with Python XLRD package.

import numpy as np import matplotlib.pyplot as plt import tensorflow as tf import xlrd DATA_FILE = "data/fire_theft.xls" # Step 1: read in data from the .xls file book = xlrd.open_workbook(DATA_FILE, encoding_override="utf-8") sheet = book.sheet_by_index(0) data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)]) n_samples = sheet.nrows - 1

With the data loaded in we can now try linear regression on it.

# Solving With Gradient Descent

This is the code from the course notes which solve the problem by minimizing the loss function which is defined as the square of the difference (ie least squares). I’ve blogged a bit about using TensorFlow this way in my Road to TensorFlow series of posts like this one. Its uses the GadientDecentOptimizer and iterates through the data a few times to arrive at a solution.

# Step 2: create placeholders for input X (number of fire) and label Y (number of theft) X = tf.placeholder(tf.float32, name="X") Y = tf.placeholder(tf.float32, name="Y") # Step 3: create weight and bias, initialized to 0 w = tf.Variable(0.0, name="weights") b = tf.Variable(0.0, name="bias") # Step 4: construct model to predict Y (number of theft) from the number of fire Y_predicted = X * w + b # Step 5: use the square error as the loss function loss = tf.square(Y - Y_predicted, name="loss") # Step 6: using gradient descent with learning rate of 0.01 to minimize loss optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) with tf.Session() as sess: # Step 7: initialize the necessary variables, in this case, w and b sess.run(tf.global_variables_initializer()) # Step 8: train the model for i in range(100): # run 100 epochs for xx, yy in data: # Session runs train_op to minimize loss sess.run(optimizer, feed_dict={X: xx, Y:yy}) # Step 9: output the values of w and b w_value, b_value = sess.run([w, b])

Running this results in w (the slope) as 1.71838 and b (the intercept) as 15.7892.

# Solving Exactly with TensorFlow

We can solve the problem exactly with TensorFlow. You can find the formula for this here, or a complete derivation of the formula here.

# Now lets calculated the least squares fit exactly using TensorFlow X = tf.constant(data[:,0], name="X") Y = tf.constant(data[:,1], name="Y") Xavg = tf.reduce_mean(X, name="Xavg") Yavg = tf.reduce_mean(Y, name="Yavg") num = (X - Xavg) * (Y - Yavg) denom = (X - Xavg) ** 2 rednum = tf.reduce_sum(num, name="numerator") reddenom = tf.reduce_sum(denom, name="denominator") m = rednum / reddenom b = Yavg - m * Xavg with tf.Session() as sess: writer = tf.summary.FileWriter('./graphs', sess.graph) mm, bb = sess.run([m, b])

This results in a slope of 1.31345600492 and intercept of 16.9951572327.

# Solving with NumPy

My first thought was that I did something wrong in TensorFlow, so I thought why not just solve it with NumPy. NumPy has a linear algebra subpackage which easily solves this.

# Calculate least squares fit exactly using numpy's linear algebra package. x = data[:, 0] y = data[:, 1] m, c = np.linalg.lstsq(np.vstack([x, np.ones(len(x))]).T, y)[0]

There is a little extra complexity since it handles n dimensions, so you need to reformulate the data from a vector to a matrix for it to be happy. This then returns the same result as the exact TensorFlow, so I guess my code was somewhat correct.

# Visualize the Results

You can easily visualize the results with matplotlib.

# Plot the calculated line against the data to see how it looks. plt.plot(x, y, "o") plt.plot([0, 40], [bb, mm * 40 + bb], 'k-', lw=2) plt.show()

This leads to the following pictures. First we have the plot of the bad result from GradientDecent.

This course instructor looked at this and decided it wasn’t very good (which it isn’t) and that the solution was to fit the data with a parabola instead. The parabola gives a better result as far as the least squares error because it nearly goes through the point on the upper right. But I don’t think that leads to a better predictor because if you remove that one point the picture is completely different. My feeling is that the parabola is already overfitting the problem.

Here is the result with the exact correct solution:

To me this is a better solution because it represents the lower right data better. Looking at this gives much less impetus to replace it with a concave up parabola. The course then looks at some correct solutions, but built on the parabola model rather than a linear model.

# What Went Wrong?

So what went wrong with the Gradient Descent solution? My first thought was that it didn’t iterate the data enough, just doing 100 iterations wasn’t enough. So I increased the number of iterations but this didn’t greatly improve the result. I know that theoretically Gradient Descent should converge for least squares since the derivatives are easy and well behaved. Next I tried making the learning rate smaller, this improved the result, and then also doing more iterations solved the problems. I found to get a reasonable result I needed to reduce the learning rate by a factor of 100 to 0.00001 and increase the iterations by 100 to 10,000. This then took about 5 minutes to solve on my computer, as opposed to the exact solution which was instantaneous.

The lesson here is that too high a learning rate leads to the result circling the solution without being able to converge to it. Once the learning rate is reduced so small, it takes a long time for the solution to move from the initial guess to the correct solution which is why we need so many iterations.

This highlights why many algorithms build in adaptable learning rates where they are higher when moving quickly and then they dynamically reduce to zero in on a solution.

# Summary

Most Machine Learning algorithms can’t be double checked by comparing them to the exact solution. But this example highlights how a simple algorithm can return a wrong result, but a result that is close enough to fool a Stanford researcher and make them (in my opinion) go in a wrong direction. It shows the danger we have in all these tunable parameters to Machine Learning algorithms, how getting things like the learning rate or number of iterations incorrect can lead to quite misleading results.

## Playing the Kaggle Two Sigma Challenge – Part 2

# Introduction

Last time I introduced the Kaggle Two Sigma Challenge and this time I’ll start describing what I did at the beginning of the competition. The competition started at the beginning of December, 2016 and completed on March 1, 2017. This blog covers what I did in December.

**Update 2017/03/07:** I uploaded the Python source code for the code discussed here to my Google Drive. You can access them here. The files are TensorFlow1.py for the first (wide) TensorFlow attempt, TFNarrow1.py for the second narrow TensorFlow attempt, RegressionLab1.py for my regression one with reinforcement learning and then TreeReg1.py for the Christmas surprise with reinforcement learning added.

# TensorFlow

Since I spent quite a bit of time playing and blogging about predicting the stock market with TensorFlow, this is where I started. The data was all numeric, so it was quite easy to get started, no one hot encoding and really the only pre-processing was to fill in missing values with the pandas fillna function (where I just used the mean since this was easiest). I’ll talk more about these missing values later, but to get started they were easy to fill in and ignore.

I started by just feeding all the data into TensorFlow trying some simple 2, 3 and 4 level neural networks. However my results were quite bad. Either the model couldn’t converge or even if it did, the results were much worse than just submitting zeros for everything.

With all the data the model was quite large, so I thought I should simplify it a bit. The Kaggle competition has a public forum which includes people publishing public Python notebooks and early in every competition there are some very generous people that published detailed statistical analysis and visualizations of all the data. Using this I could select a small subset of data columns which had higher correlations with the results and just use these instead. This then let me run the training longer, but still didn’t produce any useful results.

At this point I decided that given the computing resource limitations of the Kaggle playgrounds, I wouldn’t be able to do a serious neural network, or perhaps doing so just wouldn’t work. I did think of doing the training on my laptop, say running overnight and then copy/pasting the weight/bias arrays into my Python code in the playground to just run. But I never pursued this.

# Penalized Linear Regression

My next thought was to use linear regression since this tends to be good for extrapolation problems since it doesn’t suffer from non-linearities going wild outside of the training data. Generally regular least squares regression can suffer from overfitting, especially when there are a large number of variables and they aren’t particularly linearly independent. Also least squares regression can be thrown off by bad errant data. The general consensus from the forums was that this training set had a lot of outliers for some reason. In machine learning there are a large family of Penalized Linear Regression algorithms that all try to address these problems via one means or another. Generally they do things like start with the most correlated column and then add the next most correlated column and only keep doing this as long as they have a positive effect on the results. They also penalize large weights borrowing the technique we described here. Then there are various methods to filter out outliers or to change their effect by using different metrics than sum of squares. Two popular methods are Lasso regression that uses the taxi-cab metric (sum of difference of absolute values rather than sum of square differences) and Ridge regression which uses sum of squares regression. Then both penalize large coefficients and bring in variables one at a time. Then there is a combined algorithm called Elastic Net Regression that uses a ratio of each and you choose the coefficient.

# First Victory

Playing around with this a bit, I found the scikit-learn algorithm ElasticNetCV worked quite well for me. ElasticNetCV breaks up the training data and then run iteratively testing the value of how many variables to include to find the best result. Choosing the l1 ratio of 0.45 actually put me in the top ten of the submissions. This was a very simple submission, but I was pretty happy to get such a good result.

# Reinforcement Learning

One thing that seemed a bit strange to me about the way the Kaggle Gym worked was that you submitted your results for a given time step and then got a reward for that. However you didn’t get the correct results for the previous timestep. Normally for stock market prediction you predict the next day, then get the correct results at the end of the day and predict the next day. Here you only get a reward which is the R2 score for you submission. The idea is to have an algorithm like the following diagram. But incorporating the R2 score is quite tricky.

I spent a bit of time thinking about this and had the idea that you could sort of calculate the variance from the R2 score and then if you made an assumption about the underlying probability distribution you could then make an estimate of the mean. Then I could introduce a bias to the mean to compensate for cumulative errors as the time gets farther and farther from the training data.

Now there are quite a few problems with this, namely the variance doesn’t give you the sign of the error which is worrying. I tried a few different relationships of mean to variance and found one that improved my score quite bit. But again this was all rather ad-hoc.

Anyway, every ten timesteps I didn’t apply the bias so I could get a new bias and then used the bias on the other timesteps.

# Second Victory

The competition moves fairly quickly so a week or two after my first good score, I was well down in the standings. Adding the my mean bias from the reward to my ElasticNetCV regression put me back into the top 10 again.

# A Christmas Present

I went to bed on Christmas eve in 6th place on the competition leaderboard. I was pretty happy about that. When I checked in on Christmas Day I was down to 80th place on the leaderboard. As a Christmas present to all the competitors one of the then current top people above me made his solution public, which then meant lots of other folks forked his solution, submitted it and got his score.

This solution used a Random Forest algorithm ExtraTreesRegressor from scikit-learn combined with a simple mean based estimate and a simple regression on one variable. The random forest part was interesting because it let the algorithm know which were missing values so it could learn to act appropriately.

At first I was really upset about this, but when I had time I realized I could take that public solution, add my mean bias and improve upon it. I did this and got back into the top ten. So it wasn’t that bad.

# Summary

Well this covered the first month of the competition, two more to go. I think getting into the top ten on the leaderboard a few times gave me the motivation to keep plugging away at the competition and finding some more innovative solutions. Next up January.

## TensorFlow Goes 1.0

# Introduction

I’ve been using Google’s TensorFlow machine learning platform for some time now starting with version 0.8, going onto 0.9 and now playing with 1.0 which was released last week. There are some really good videos from the release summit posted on YouTube here. This blog article looks at the evolution of TensorFlow and what 1.0 brings to the table.

Installing the new TensorFlow 1.0 on MacOS was fairly painless, I chose to install it natively rather than using a VM type solution since I don’t try to run multiple versions of Python, just stick to the latest. They recommend using Docker or other VM technology to avoid having to install at all, but I didn’t have any problems.

# More Than Neural Networks

TensorFlow has always been built on a low level compute engine that executes graphs of operations on matrices and vectors (tensors). However the main tutorials and higher level functions were always oriented to performing Neural Network calculations. It contains very good algorithms for training Neural Networks and had all the supporting functions you needed to create very powerful Neural Network models. It contained a Linear Regression function, but this was mainly used as a simple tutorial rather than anything real.

With 1.0 TensorFlow is adding a large number of other popular machine learning algorithms out of the box so you can use Random Forests, Support Vector Machines, and many other standard libraries that you find in more complete libraries like scikit-learn. The list of standard algorithms isn’t as full as scikit-learn yet, and a very notable omission is the ensemble method of gradient boosting (which is promised sometime soon).

I’ve been entering some Kaggle competitions where penalized regression, random forests and gradient boosting are often the algorithms that produce the best results. However TensorFlow under Keras has been doing quite well. Often the winning solution is a combination of several of these, since an average of independent techniques will give better results.

The good thing about this is that TensorFlow provides very good GPU and other hardware accelerator support, so now all these algorithms can benefit from this. In addition Google is now offering (in beta) a machine learning cloud service which runs TensorFlow on optimized accelerated hardware. In the past if this only had TensorFlow the usage would have been limited since most full applications use a combination of algorithms in the final deployment.

# API Stability

As TensorFlow went through the 0.x versions, there were quite a few API changes that caused you to be frequently updating your programs. With version 1.0 the claim is that for the part of TensorFlow that is in the core library, API compatibility will now be maintained.

A lot of the changes for 1.0 were to make the naming conventions more standard, including following the lead of Python’s Numpy library (so the same function didn’t have a different name in NumPy vs TensorFlow). All this should make coding a bit more straightforward and reduce always having to look everything up continuously.

However beware that a lot of the new advertised features in TensorFlow 1.0 are not in the core library yet, and so their API may change until they are moved there.

The good thing is that Google provided a Python script to convert previous TensorFlow Python programs up to the new API level. This worked fine for my programs, so as to make the process rather painless.

# Higher Level APIs

A criticism of TensorFlow was that although it was a great low level framework, it was difficult or tedious to do a number of standard operations, like for instance setting up a simple multi-level neural network. Due to this omission sevel developers created competing high level abstractions to run on various lower level libraries. Probably the most successful of these is Keras which runs on top of both TensorFlow and Theano.

With 1.0 TensorFlow is adding a higher level API which works with all the various algorithms it contains as well as adding a Keras compatible library as a nod to the heavy adoption that Keras has enjoyed.

The non-neural network algorithms follow the API conventions in scikit-learn, which are very efficient. The whole thing is also oriented so you can feed one component into another so you can easily build a compound model consisting of several algorithms and then easily train and deploy the whole thing.

Generally this is a good thing for people looking to just use TensorFlow since the amount of code you need to write becomes much smaller and it embodies all the TensorFlow best practices so it works properly with TensorBoard, deploys flexibly, etc.

# Documentation

The TensorFlow documentation has been greatly improved. The tutorials are way better and it’s much easier to get a basic understanding of TensorFlow from the introductory material. There are also many more videos available as well as training courses.

Although this is all a huge step forward, one annoying side effect is that all the external links, say from Stack Overflow articles (or even Google searches) are now broken.

# Lots More

Some of the other notable additions include a new experimental TensorFlow compiler XLA, APIs for Go and Java, addition of a command line debugger, improvements to TensorBoard for better visualizations and lots of additional hardware support.

Windows support was added in version 0.10 which is new since my original blogs. There is support to use Qualcomm DSP chips for co-processing which should greatly enhance the capabilities of Android phones containing this chip.

# Summary

TensorFlow has come a long way over the last year from a rather specialized Neural Network tool, evolving into a complete machine learning platform. The open source community around TensorFlow is extremely vibrant and extends quite far beyond just Google employees. Looking at what is scheduled for the next couple of point releases looks very exciting and I’m finding this tool becoming more powerful in leaps and bounds.

## The Road to TensorFlow – Part 11: Generalization and Overfitting

# Introduction

With sophisticated Neural Networks, you are dealing with a quite complicated nonlinear function. When fitting a high degree polynomial to a few data points, the polynomial can go through all the points, but have such steep slopes that it is useless for predicting points between the training points, we get this same sort of behaviour in Neural Networks. In a way you are training the Neural Network to exactly memorize all the training data exactly rather than figuring out the trends and patterns that you can use to predict other values.

We’ve touched upon this problem in other articles like here and here, but glossed over what we are doing about this problem. In this article we’ll explore what we can do about this in more detail.

One solution is to perhaps gather more training data, however this may be impossible or quite expensive. It also might be that the training data is missing some representative samples. Here we’ll concentrate on what we can do with the algorithm rather than trying to improve the data.

# Interpolation and Extrapolation

Here we refer to generalization as wanting to get answers to data that isn’t in the training data. We refer to overfitting as the case where the model works really well for the training data but doesn’t do nearly as well for anything else.

There are two distinct cases we want to worry about. One is interpolation, this is trying to estimate values where the inputs are surrounded by data in the training set. Extrapolation is the process of trying to predict what happens beyond the training data. Our stock market data is an example of extrapolation. Recognizing handwriting is an example of interpolation (assuming you have a good sample of training data)

Extrapolation tends to be a much harder problem than interpolation, but both a strongly affected by overfitting.

# Early Stopping

What we often do is divide our training data into three groups. The largest of these we call the training data and use for training. Another is the test data which we run after training to see how well the algorithm works on data that hasn’t been seen by training. To help with detecting overfitting we create a third group which we run after a certain number of steps during training. The following screenshot shows the results for the training and validation sets (this is for a Kaggle competition so the test set needs to be submitted to Kaggle to get the answer). Here smaller values are better. Notice that the training data gets better starting at 3209.5 and going down to 712.8 which indicates training is working. However the validation data starts at 3014.3 goes down to the 1160s and then starts increasing. This indicates we are overfitting the data.

The approach here is really simple, let’s just stop once the validation data starts increasing. So let’s just stop at this point and say we’re done. This is actually a pretty simple and effective way to prevent overfitting. As an added bonus this is a rare technique that leads to faster training.

# Penalizing Large Weights

A sign of overfitting is that the slope of our function is high at the points in the training data. Since the slope is approximated by the appropriate weights in our matrix, we would want to keep the weights in our weight matrices low. The way we accomplish this is to add a penalty to the loss function based on the size of the weights.

loss = (tf.nn.l2_loss( tf.sub(logits, tf_train_labels)) + tf.nn.l2_loss(layer1_weights)*beta + tf.nn.l2_loss(layer2_weights)*beta + tf.nn.l2_loss(layer3_weights)*beta + tf.nn.l2_loss(layer4_weights)*beta)

Here we add the sum of the squares of the weights to our loss function. The factor beta is there to let us scale this value to be in the same magnitude as the main loss function. I’ve found that in some problems making the loss due to the weights about equal to the main loss works quite well. In another problem I found choosing beta so that the weights are 10% of the main loss worked quite well.

I have found that combining this with early stopping works quite well. The weight penalty lets us train longer before we start overfitting, which leads to a better overall result.

# Dropout

One property of the Neural Networks in our brain is that brain cells die, but our brain seems to mostly keep on working. In this sense the brain is far more resilient to damage than a computer. The idea behind dropout is to try to add rules to train the Neural Network to be resilient to Neurons being removed. This means the Neural Network can’t be completely reliant on any given Neuron since it could die (be removed from the model).

The way we accomplish this is we add a dropout activation function at some point:

if dropout: hidden = tf.nn.dropout(hidden, 0.5)

This activation function will remove 50% of the neurons at this layer and scale up its outputs by a matching amount. This is so the sum stays the same which means you can use the same weights whether dropout is present or not.

The reason for the if statement is that you only want to do dropout during training and not during validation, testing or production.

You would do this on each hidden layer. It’s rather surprising that the Neural Network still works as well as it does with this much dropout.

I find dropout doesn’t always help, but when it does you can combine it with penalizing the weights and then you can train longer before you need to stop during overfitting. This can sometimes help a network find finer details without overfitting.

When you do dropout, you do have to train for a longer time, so if this is too time prohibitive you might not want to use it.

I think it’s a good sign that Neural Networks can exhibit the same resilience to damage that the brain shows. Perhaps a bit of biological evidence that we are on the correct track.

# Summary

These are a few techniques you can use to avoid overfitting your model. I generally use all three so I can train a bit longer without overfitting. If you can get more good training data that can also help quite a bit. Using a simpler model (with fewer hidden nodes) can also help with overfitting, but perhaps not provide as good a functional approximation as the more complicated model. As with all things in computer science you are always trading off complexity, overfitting and performance.

## The Road to TensorFlow – Part 10: More on Optimization

# Introduction

We’ve been playing with TensorFlow for a while now and we have a working model for predicting the stock market. I’m not too sure if we’re beating the stocking picking cat yet, but at least we have a good model where we can experiment and learn about Neural Networks. In this article we’re going to look at the optimization methods available in TensorFlow. There are quite a few of these built into the standard toolkit and since TensorFlow is open source you could create your own optimizer. This article follows on from our previous article on optimization and training.

# Weaknesses in Gradient Descent

Gradient Descent has worked for us pretty well so far. Basically it calculates the gradients of the loss function (the partial derivatives of loss by each weight) and moves the weights in the direction of lowering the loss function. However finding the minimums of a complicated nonlinear function is a non-trivial exercise and compound this with the fact that a lot of the data we are feeding in during training is very noisy. In our case the stock market historical data is probably quite contradictory and is probably presenting a good challenge to the training algorithm. Here are some weaknesses these other algorithms attempt to address:

- Learning rate. We have one fixed learning rate (how far we move in the direction of the sign of the gradient). We added an optimization to reduce this learning rate as we proceed, but we use the same learning rate for everything at each step. But some parts of our weight matrix may be changing quickly and other parts remaining close to constant. So perhaps use a different learning rate for each weight/bias and vary it by how fast it’s moving and whether it’s moving consistently in the same direction.
- Getting stuck in local minimums or wandering around plateaus. Are we getting stuck in a local minimum which is much worse than the global minimum we would like to find? How can we power past global minimums and continue to the real goal?

# TensorFlow Optimizers

The optimizers included with TensorFlow are all variations on Gradient Descent. There are many other optimizers that people use like simulated annealing, conjugate gradient and ant colony optimization but these tend to either not work well with multi-layer Neural Networks or don’t parallelize well to run on GPUs or a distributed network or are far too computationally intensive for large matrices. We added to the code all the optimizers and you just uncomment the one that you want to use.

# optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step) # optimizer = tf.train.AdadeltaOptimizer(starter_learning_rate).minimize(loss) # optimizer = tf.train.AdagradOptimizer(starter_learning_rate).minimize(loss) # promising # optimizer = tf.train.AdamOptimizer(starter_learning_rate).minimize(loss) # promising # optimizer = tf.train.MomentumOptimizer(starter_learning_rate, 0.001).minimize(loss) # diverges # optimizer = tf.train.FtrlOptimizer(starter_learning_rate).minimize(loss) # promising optimizer = tf.train.RMSPropOptimizer(starter_learning_rate).minimize(loss) # promising

Perhaps it would be less hacky to make this a parameter to the program, but we’ll leave that till we need it.

Let’s quickly summarize what each optimizer tries to accomplish:

- MomentumOptimizer: If gradient descent is navigating down a valley with steep sides, it tends to madly oscillate from one valley wall to the other without making much progress down the valley. This is because the largest gradients point up and down the valley walls whereas the gradient along the floor of the valley is quite small. Momentum Optimization attempts to remedy this by keeping track of the prior gradients and if they keep changing direction then damp them, and if the gradients stay in the same direction then reward them. This way the valley wall gradients get reduced and the valley floor gradient enhanced. Unfortunately this particular optimizer diverges for the stock market data.
- AdagradOptimizer: Adagrad is optimized to finding needles in haystacks and for dealing with large sparse matrices. It keeps track of the previous changes and will amplify the changes for weights that change infrequently and suppress the changes for weights that change frequently. This algorithm seemed promising for the stock market data.
- AdadeltaOptimizer: Adadelta is an extension of Adagrad that only remembers a fixed size window of previous changes. This tends to make the algorithm less aggressive than pure Adagrad. Adadelta seemed to not work as well as Adagrad for the stock market data.
- AdamOptimizer: Adaptive Moment Estimation (Adam) keeps separate learning rates for each weight as well as an exponentially decaying average of previous gradients. This combines elements of Momentum and Adagrad together and is fairly memory efficient since it doesn’t keep a history of anything (just the rolling averages). It is reputed to work well for both sparse matrices and noisy data. Adam seems promising for the stock market data.
- FtrlOptimizer: Ftrl-Proximal was developed for ad-click prediction where they had billions of dimensions and hence huge matrices of weights that were very sparse. The main feature here is to keep near zero weights at zero, so calculations can be skipped and optimized. This algorithm was promising on our stock market data.
- RMSPropOptimizer: RMSprop is similar to Adam it just uses different moving averages but has the same goals.

Neural networks can be quite different and the best algorithm for the job may depend a lot on the data you are trying to train the network with. Each of these optimizers has several tunable parameters. Besides initial learning rate, I’ve left all the others at the default. We could write a meta-trainer that tries to find an optimal solution for which optimizer to use and with which values of its tunable parameters. You would want a quite powerful distributed set of computers to run this on.

# Summary

Optimization is a tricky subject with Neural Networks, a lot depends on the quality and quantity of your data. It also depends on the size of your model and the contents of the weight matrices. A lot of these optimizers are tuned for rather specific problems like image recognition or ad click-through prediction; however, if you have a unique problem them largely you are left to trial and error (whether automated or manual) to determine the best solution.

Note that a lot of practitioners stick with basic gradient descent since they know it quite well, rather than relying on the newer algorithms. Often massaging your data or altering the random starting point can be a better area to focus on.