## Posts Tagged ‘**TensorFlow**’

## Julia Flux for Machine Learning

# Introduction

Flux is a Neural Network Machine Learning library for the Julia programming language. It is entirely written in Julia and relies on Julia’s built-in support for running on GPUs and providing distributed processing. It makes writing Neural Networks easy and leverages the power and expressiveness of the Julia language to make creating your Neural Network just the same as writing any other Julia expressions.

My last article pointed out some problems with using TensorFlow from Julia, due to many of the newer features being implemented in Python rather than being implemented in the core shared library. One recommendation from the TensorFlow folks is that if you want eager execution then use Flux rather than TensorFlow. The Flux folks claim a real benefit of Flux over TensorFlow is that you only need to know one language to do ML. Whereas for TensorFlow you need to know TensorFlow (its graph language) plus the host language like Python. Then it’s confusing because there is a lot of duplication and it isn’t always clear in which system to do things or whether to use a TensorFlow of Python data type. Flux then simplifies all this.

Although this all sounds wonderful remember that Julia just hit version 1.0 and Flux just hit version 0.67. The main problem I found was excessive memory usage, which I’ll benchmark and discuss later on.

Also note that Flux isn’t a giant compilation of algorithms like SciKit Learn. It is rather specific to Neural Networks. There are other libraries available in Julia for things like Random Forests, but you need to find the correct package and install it. Then each of these may or may not fully support Julia 1.0 yet.

# MNIST in Flux

To give a flavour for using Julia and Flux here are a couple of examples from the FluxML model zoo. You can see it’s very simple to setup the Neural Network layers, perform the training and test the accuracy.

using Flux, Flux.Data.MNIST, Statistics using Flux: onehotbatch, onecold, crossentropy, throttle using Base.Iterators: repeated # using CuArrays # Classify MNIST digits with a simple multi-layer-perceptron imgs = MNIST.images() # Stack images into one large batch X = hcat(float.(reshape.(imgs, :))...) |> gpu labels = MNIST.labels() # One-hot-encode the labels Y = onehotbatch(labels, 0:9) |> gpu m = Chain( Dense(28^2, 32, relu), Dense(32, 10), softmax) |> gpu loss(x, y) = crossentropy(m(x), y) accuracy(x, y) = mean(onecold(m(x)) .== onecold(y)) dataset = repeated((X, Y), 200) evalcb = () -> @show(loss(X, Y)) opt = ADAM(params(m)) Flux.train!(loss, dataset, opt, cb = throttle(evalcb, 10)) println("acc X,Y ", accuracy(X, Y)) # Test set accuracy tX = hcat(float.(reshape.(MNIST.images(:test), :))...) |> gpu tY = onehotbatch(MNIST.labels(:test), 0:9) |> gpu println("acc tX, tY ", accuracy(tX, tY))

Here is a more sophisticated model which uses a convolutional Neural Network.

using Flux, Flux.Data.MNIST, Statistics using Flux: onehotbatch, onecold, crossentropy, throttle using Base.Iterators: repeated, partition # using CuArrays # Classify MNIST digits with a convolutional network imgs = MNIST.images() labels = onehotbatch(MNIST.labels(), 0:9) # Partition into batches of size 1,000 train = [(cat(float.(imgs[i])..., dims = 4), labels[:,i]) for i in partition(1:60_000, 1000)] train = gpu.(train) # Prepare test set (first 1,000 images) tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> gpu tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> gpu m = Chain( Conv((2,2), 1=>16, relu), x -> maxpool(x, (2,2)), Conv((2,2), 16=>8, relu), x -> maxpool(x, (2,2)), x -> reshape(x, :, size(x, 4)), Dense(288, 10), softmax) |> gpu m(train[1][1]) loss(x, y) = crossentropy(m(x), y) accuracy(x, y) = mean(onecold(m(x)) .== onecold(y)) evalcb = throttle(() -> @show(accuracy(tX, tY)), 10) opt = ADAM(params(m)) Flux.train!(loss, train, opt, cb = evalcb)

# Performance

One of Julia’s promises is the ease of use of a scripting language like Python with the speed of a compiled language like C. As it stands Flux isn’t there yet. There seem to be some points where Flux goes away for a long time. These might be the garbage collector kicking in, or something else. I find the speed is about the same order of magnitude as other systems (modulo the pauses), but the big problem is memory usage.

To solve MNIST using a convolutional Neural Network from Python using the TensorFlow tutorial runs quite well and uses 400Meg of memory. Running the similar model using Julia and TensorFlow uses 600Meg of memory. Running the simple model above using Julia and Flux takes 2Gig or memory. Running the convolutional model above uses 2.6Gig. This laptop that I’m using has 4Gig of RAM and is running Ubuntu Linux. This is why I think the big stalls in performance is garbage collection.

The problem with this is that MNIST is a nice small dataset and the model used to solve it isn’t very large as Neural Networks go. If Flux is using six times as much memory as Python then it really diminishes its usefulness as an ML toolkit.

I spent a bit of time looking at the Julia Differential Equations tutorial. They were pointing out that using matrix operations in the Julia expression evaluator would lead to lots of unnecessary temporary storage for instance to evaluate:

D = A + B + C

Where these are all large matrices has to create a temporary matrix to hold the sum A + B which is then added to C. This temporary matrix has to be allocated from the heap and then later garbage collected. This process seems to be rather inefficient in Julia, at least by going by all the workarounds they have to avoid this situation. They have SVectors which are for small vectors that can be allocated on the stack rather than the heap. They recommend using the +. operator which does things element by element and is smart enough to not create lots of temporary values on the heap. I wonder if Flux needs some optimisations like they spent so much time putting into the Differential Equations library.

# Summary

Julia and Flux make a nice system for Machine Learning in theory. I think until the technology matures a bit and some problems like memory management are better addressed, that using this for large projects is a bit problematic. A lot of the current ML systems being worked on with Flux are by PhD candidates who are developing Flux as part of their thesis work. Hopefully they improve the memory usage and allow Flux and Julia to live up to their full potential.

## TensorFlow from Julia

# Introduction

Last time, I gave a quick introduction to the Julia programming language which has just reached the 1.0 release mark after ten years of development. Julia is touted as the next great thing for scientific computing, machine learning, data science and artificial intelligence. Its hope is to supplant Python which is currently the goto language in these fields. The goal is a more unified language, since it was developed well after Python and learned from a lot of its mistakes. It also claims to have the flexibility of Python but with the speed of a true compiled language like C.

I saw that in the list of packages there was support for using Google’s TensorFlow AI system natively from Julia so I thought I would give this a try. Although it worked, it did reveal some challenges that Julia is going to face in its battle to become a true equal with Python.

# Using TensorFlow in Julia

The TensorFlow wrapper/interface for Julia is in a package created by a PhD candidate at MIT, Jon Malmaud. You can add it to Julia using Pkg.add(“TensorFlow”) as well as view the source code on GitHub. Since I wrote an article recently comparing TensorFlow running on a Raspberry Pi to running on my laptop, I thought I’d use the same example and compare Julia to those cases. I cut/pasted the code into the Julia IDE Juno and made some code syntax changes and gave it a go. It came back that the Keras object was undefined.

I then noticed that in the Tensorflow.jl github there were a couple of examples doing predictions on the MNIST dataset, so at least these were solving the same problem as my article, just using different models. I fired these up, but they failed with syntax errors in the code to load the MNIST dataset. Right now this is a bit of a problem in Julia that not all libraries have been updated to the Julia 1.0 syntax. I had a look at the library to load MNIST and noticed that no one had contributed to it in three years. It appeared to be abandoned with no plans to continue it. After a bit more research I found another Julia package called MLDatasets that was maintained and would load MNIST along with several other popular datasets.

I logged an issue with the Tensorflow.jl repository that they should fix this. They replied that they didn’t have time but if I wanted to fix it, to go ahead. So I fixed this and checked it in to the Tensorflow.jl Github. So now these MNIST examples work with Julia 1.0. I was then happy to have given my small contribution back to this community.

I then thought, why not be ambitious and add the Keras layer to Tensorflow.jl? Well this led to some interesting revelations to how Tensorflow is architected.

# Problems with the Tensorflow Architecture

Looking at some of the issues in the Tensorflow.jl library there were requests for things like TensorFlow’s eager execution and the TensorFlow layers interface. The answer to these issues was that the Julia interface only talked to the DLL/SO interface to Tensorflow and that these modules didn’t exist there and were in fact written in Python rather than C++. I had a look inside the TensorFlow Github and found that their Keras layer is also written in Python.

Originally Tensorflow.jl talked to the Tensorflow Python interface. Julia is really good at interoperability and can easily talk to both Python libraries as well as C/C++ DLL/SOs. The problem with talking to Python libraries is that it involves running a Python process and then doing process to process communications to execute the code. This tends to be way slower than talking to DLLs or SOs. So early on the TensorFlow.jl library was changed to just talk to the DLL/SO interface for Tensorflow and eliminated all Python dependencies. This then lets Julia use the really performant part of TensorFlow and perform all the core operations very quickly.

Now the problem seems to be that Google is doing a lot of the new Tensorflow development in Python and not putting the code into the core shared library. Google is also spending a lot of time promoting these new interfaces as the way to go. This means if you aren’t programming in Python you are definitely a second class citizen.

OK, so is this just bad for the newbie language Julia? Should Julia programmers just use the Jula native Flux AI library? Well, the other thing Google is promoting is running TensorFlow on things like mobile devices, but then you are accessing TensorFlow from Swift on iOS or from Java on Android. Now you have the same problems as the Julia programmer. You only have efficient access to the core low level APIs for TensorFlow and all the new fancy high level access is denied to you. Google’s API block diagram below highlights this.

To me this is a big architectural problem with TensorFlow. Its great to use from Python, but is really limited in other environments. The videos and blogs starting to surface on TensorFlow 2.0 are promoting eager execution and the Keras layer will be the default and primary ways to program with TensorFlow. This then begs the question as to whether these will be moved into the core shared library or will remain as Python code? At this point I haven’t seen this explained, but as we get closer to the 2.0 preview later this year, I’ll be watching this keenly.

It would certainly be nice if they move this Python code into C++ in the shared library so everyone can use it. At that point I think TensorFlow would be much more usable from Julia, Swift, Java, C++, etc. Here’s hoping that is a major upgrade in the 2.0 release.

# Julia TensorFlow Code

Just for interest here is the simplest Julia MNIST example just to give a flavour for the code. This is a simple linear model, so doesn’t give great results. There is a more complicated example that uses a convolutional neural network and gives far superior results.

using TensorFlow include("mnist_loader.jl") loader = DataLoader() sess = Session(Graph()) x = placeholder(Float32) y_ = placeholder(Float32) W = Variable(zeros(Float32, 784, 10)) b = Variable(zeros(Float32, 10)) run(sess, global_variables_initializer()) y = nn.softmax(x*W + b) cross_entropy = reduce_mean(-reduce_sum(y_ .* log(y), axis=[2])) train_step = train.minimize(train.GradientDescentOptimizer(.00001), cross_entropy) correct_prediction = argmax(y, 2) .== argmax(y_, 2) accuracy=reduce_mean(cast(correct_prediction, Float32)) for i in 1:1000 batch = next_batch(loader, 100) run(sess, train_step, Dict(x=>batch[1], y_=>batch[2])) end testx, testy = load_test_set() println(run(sess, accuracy, Dict(x=>testx, y_=>testy)))

# Summary

You can certainly use TensorFlow from Julia. Just beware that you are limited to the lower level APIs, so anything TensorFlow has implemented in Python isn’t available to you. This means you set up the graph and then execute it, really like you always did in the earlier versions of TensorFlow. It would certainly be nice if Google fixes this problem for TensorFlow 2.0.

## Playing with Julia 1.0 on the Raspberry Pi

# Introduction

A couple of weeks ago I saw the press release about the release of version 1.0 of the Julia programming language and thought I’d check it out. I saw it was available for the Raspberry Pi, so I booted up my Pi and installed it. Julia has been in development since 2012, it was created by four MIT professors as an open source project for mathematical computing.

# Why Julia?

Most people doing data science and numerical computing use the Python or R languages. Both of these are open source languages with huge followings. All new machine learning projects need to integrate to these to get anywhere. Both are very productive environments, so why do we need a new one? The main complaint about Python and R is that these are interpreted languages and as a result are very slow when compared to compiled languages like C. They both get around this by supporting large libraries of optimized code written in C, C++, Assembler and Fortran to give highly optimized off the shelf algorithms. These work great, but if one of these doesn’t apply and you need to write Python loops to process a large data set then it can get really frustrating. Another frustration with Python is that it doesn’t have a built in array data type and relies on the numpy and pandas libraries. Between these you can do a lot, but there are holes and strange differences between the two systems.

Julia has a powerful builtin array type and most of the array manipulation features of numpy and pandas are built in to the core language. Further Julia was created from scratch around powerful new just in time (JIT) compiler technology to provide both the speed of development of an interpreted language combined with the speed of a compiled language. You don’t get the full speed of C, but it’s close and a lot better than Python.

The Julia language borrows a lot of features from Python and I find programming in it quite similar. There are tuples, sets, dictionaries and comprehensions. Functions can return multiple values. For loops work very similarly to Python with ranges (using the : built into the language rather than the range() function).

Julia can call C functions directly (meaning you can get pointers to objects), and this allows many wrapper objects to have been created for other systems such as TensorFlow. This is why Julia is very precise about the physical representation of data types and the ability to get a pointer to any data.

Julia uses the end keyword to terminate blocks of code, rather than Pythons forced indentation or C’s semicolons. You can use semicolons to have multiple statements on one line, but don’t need them at the end of a line unless you want it to return null.

Julia has native built in support of most numeric data types including complex numbers and rational numbers. It has types for all the common hardware supported ints and floats. Then it also has arbitrary precision types build around GNU’s bignum library.

There are currently 1906 registered Julia packages and you can see the emphasis on scientific computing, along with machine learning and data science.

The creators of Julia always keep performance at the top of mind. As a result the parallelization support is exceptional along with the ability to run Julia code on CUDA NVidia graphics cards and easily setup clusters.

# Is Julia Ready for Prime Time?

As of the time of this writing, the core Julia 1.0 language has been released and looks quite good. Many companies have produced impressive working systems with the 0.x versions of Julia. However right now there are a few problems.

- Although Julia 1.0 has been released, most of the add on packages haven’t been upgraded to this version yet. In the first release you need to add the Pkg package to add other packages to discourage people using them yet. For instance the library with GPIO support for the Pi is still at version 0.6 and if you add it to 1.0 you get a syntax error in the include file.
- They have released the binaries for all the versions of Julia, but these haven’t made them into the various package management systems yet. So for instance if you do “sudo apt install julia” on a Raspberry Pi, you still get version 0.6.

Hopefully these problems will be sorted out fairly quickly and are just a result of being too close to the bleeding edge.

I was able to get Julia 1.0 going on my Raspberry Pi by downloading the ARM32 files from Julia’s website and then manually copying them over the 0.6 release. Certainly 1.0 works much better than 0.6 (which segmentation faults pretty much every time you have a syntax error). Hopefully they update Raspbian’s apt repository shortly.

# Julia for Machine Learning

There is a TensorFlow.jl wrapper to use Google’s TensorFlow. However the Julia group put out a white paper dissing the TensorFlow approach. Essentially TensorFlow is a separate programming language that you use from another programming language like Python. This results in a lot of duplication and forces the programmer to operate in two different paradigms at once. To solve this problem, Julia has the Flux machine learning system built natively in Julia. This is a fairly powerful machine learning system that is really easy to use, reducing the learning curve to getting working models. Hopefully I’ll write a bit more about Flux in a future article.

# Summary

Julia 1.0 looks really promising. I think in a month or so all the add-on packages should be updated to the 1.0 level and all the binaries should make it out to the various package distribution repositories. In the meantime, it’s a good time to learn Julia and you can accomplish a lot with the core language.

I was planning to publish a version of my LED flashing light program in Julia, but with the PiGPIO package not updated to 1.0 yet, this will have to wait for a future article.

## Updates to the TensorFlow API

# Introduction

Last year I published a series of posts on getting up and running on TensorFlow and creating a simple model to make stock market predictions. The series starts here, however the coding articles are here, here and here. We are now a year later and TensorFlow has advanced by quite a few versions (1.3 as of this writing). In this article I’m going to rework that original Python code to use some simpler more powerful APIs from TensorFlow as well as adopt some best practices that weren’t well known last year (at least by me).

This is the same basic model we used last year, which I plan to improve on going forwards. I changed the data set to record the actual stock prices rather than differences. This doesn’t work so well since most of these stocks increase over time and since we go around and around on the training data, it tends to make the predictions quite low. I plan to fix this in a future articles where I handle this time series data correctly. But first I wanted to address a few other things before proceeding.

I’ve placed the updated source code tfstocksdiff13.py on my Google Drive here.

# Higher Level API

In the original code to create a layer in our Neural Network, we needed to define the weight and bias Tensors:

layer1_weights = tf.Variable(tf.truncated_normal( [NHistData * num_stocks * 2, num_hidden], stddev=0.1)) layer1_biases = tf.Variable(tf.zeros([num_hidden]))

And then define the layer with a complicated mathematical expression:

```
hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
```

This code is then repeated with mild variations for every layer in the Neural Network. In the original code this was quite a large block of code.

In TensorFlow 1.3 there is now an API to do this:

hidden = tf.layers.dense(data, num_hidden, activation=tf.nn.elu, kernel_initializer=he_init, kernel_regularizer=tf.contrib.layers.l1_l2_regularizer(), name=name + "model" + "hidden1")

This eliminates a lot of repetitive variable definitions and error prone mathematics.

Also notice the kernel_regularizer=tf.contrib.layers.l1_l2_regularizer() parameter. Previously we had to process the weights ourselves to add regularization penalties to the loss function, now TensorFlow will do this for you, but you still need to extract the values and add them to your loss function.

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) loss = tf.add_n([tf.nn.l2_loss( tf.subtract(logits, tf_train_labels))] + reg_losses)

You can get at the actual weights and biases if you need them in a similar manner as well.

# Better Initialization

Previously we initialized the weights using a truncated normal distribution. Back then the recommendation was to use random values to get the initial weights away from zero. However since 2010 (quite a long time ago) there have been better suggestions and the new tf.layers.dense() API supports these. The original paper was “Understanding the difficulty of training deep feedforward neural networks” by Xavier Glorot and Yoshua Bengio. If you ran the previous example you would have gotten an uninitialized variable on he_init. Here is its definition:

```
he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")
```

The idea is that these initializers vary based on the number of inputs and outputs for the neuron layer. There is also tf.contrib.layers.xavier_initializer() and tf.contrib.layers.xavier_initializer_conv2d(). For this example with only two hidden layers it doesn’t matter so much, but if you have a much deeper Neural Network, using these initializers can greatly speed up training and avoid having the gradients either go to zero or explode early on.

# Vanishing Gradients and Activation Functions

You might also notice I changed the activation function from tanh to elu. This is due to the problem of vanishing gradients. Since we are using Gradient Descent to train our system then any zero gradients will stop training improvement in that dimension. If you get large values out of the neuron then the gradient of the tanh function will be near zero and this causes training to get stalled. The relu function also has similar problems if the value ever goes negative then the gradient is zero and again training will likely stall and get stuck there. On solution to this is to use the elu function or a “leaky” relu function. Below are the graphs of elu, leaky relu and relu.

Leaky relu has a low sloped linear function for negative values. Elu uses an exponential type function to flatten out a bit to the left of zero so if things go a bit negative they can recover. Although if things go more negative with elu, they will get stuck again. Elu has the advantage that its is rigged to be differentiable at 0 to avoid special cases. Practically speaking both of these activation functions have given very good results in very deep Neural Networks which would otherwise get stuck during training with tanh, sigmoid or relu.

# Scaling the Input Data

Neural networks work best if all the data is between zero and one. Previously we didn’t scale our data properly and just did an approximation by dividing by the first value. All that code has been deleted and we now use SciKit Learn’s MinMaxScaler object instead. You fit the data using the training data and then transform any data we process with the result. The code for us is:

# Scale all the training data to the range [0,1]. scaler = MinMaxScaler(copy=False) scaler.fit(train_dataset) scaler.transform(train_dataset) scaler.transform(valid_dataset) scaler.transform(test_dataset) scaler.transform(final_row)

The copy=False parameter basically says to do the conversion in place rather than producing a new copy of the data.

SciKit Learn has a lot of useful utility functions that can greatly help with using TensorFlow and well worth looking at even though you aren’t using a SciKit Learn Machine Learning function.

# Summary

The field of Neural Networks is evolving rapidly and the best practices keep getting better. TensorFlow is a very dynamic and quickly evolving tool set which can sometimes be a challenge to keep up with.

The main learnings I wanted to share here are:

- TensorFlow’s high level APIs
- More sophisticated initialization like He Initialization
- Avoiding vanishing gradients with elu or leaky ReLU
- Scaling the input data to between zero and one

These are just a few things of the new things that I could incorporate. In the future I’ll address how to handle time series data in a better manner.

## An Introduction to Image Style Transfer

# Introduction

Image Style Transfer is an AI technique that is becoming quite popular for enhancing or stylizing photos. It takes one picture (often a classical painting) and then applies the style of that picture to another picture. For example I could take this photo of the Queen of Surrey passing Hopkins Landing:

Combined with the style of Vincent van Gogh’s Starry Night:

To then feed these through the AI algorithm to get:

In this article, we’ll be look at some of the ways you can accomplish this yourself either through using online services or running your own Neural Network with TensorFlow.

# Playing with Image Style Transfer

There are lots of services that let you play with this. Generally to apply a canned style to your own picture is quite fast (a few seconds). To provide your own photo as the style photo is more involved, since it involves “training” the style and this can take 30 minutes (or more).

Probably the most popular program is the Prisma app for either iPhone or Android. This app has a large number of pre-trained styles and can apply any of them to any photo on your phone. This app works quite well and gives plenty of variety to play with. Plus its free. Here is the ferry in Prisma’s comic theme:

If you want to provide your own photo as the style reference then deepart.io is a good choice. This is available as a web app as well as either an iPhone or Android app. The good part about this for photographers is that you can copy photos from your good camera to your computer and then use this program’s website, no phone required. This site has some pre-programmed styles based on Vincent van Gogh which work really quickly and produce good results. Then it has the ability to upload a style photo. Processing a style is more work and typically takes 25 minutes (you can pay to have it processed quicker, but not that much quicker). If you don’t mind the wait this site is free and works quite well. Here is an example of the ferry picture above van Gogh’ized by deepart.io (sorry they don’t label the styles so I don’t know which painting this is styled from):

# Playing More Directly

These programs are great fun, but I like to tinker with things myself on my computer. So can I run these programs myself? Can I get the source code? Fortunately the answer to both is yes. This turns out to be a bit easier than you first might think, largely due to a project out of the Visual Geometry Group (VGG) at the University of Oxford. They created an exceptional image recognition neural network that they trained and won several competitions with. It turns out that the backbone to doing Image Style Transfer is to have a good image recognition Neural Network. This Neural Net is 19 layers deep and Oxford released the fully trained network for anyone to use. Several people have then taken this network, figured out how to load it into TensorFlow and created some really good Image Style Transfer programs based on this. The first program I played with was Anish Athalye’s program posted on GitHub here. This program uses VGG and can train a neural network for a given style picture. Anish has quite a good write up on his blog here.

Then I played with a program that expanded on Anish’s by Shafeen Tejani which is on GitHub here along with a blog post here. This program lets you keep the trained network so you can perform the transformation quickly on any picture you like. This is similar to how Prisma works. The example up in the introduction was created with this picture. To train the network you require a training set of image like the Microsoft COCO collection.

Running these programs isn’t for everyone. You have to be used to running Python programs and have TensorFlow installed and working on your system. You need a few other dependent Python libraries and of course you need the VGG saved Neural Network. But if you already have Python and TensorFlow, I found both of these programs just ran and I could play with them quite easily.

The writeups on all these programs highly recommend having a good GPU to speed up the calculations. I’m playing on an older MacBook Air with no GPU and was able to get quite good results. One trick I found that helped is to play with reduced resolution images to help speed up the process, then run the algorithm on a higher resolution version when you have things right. I found I couldn’t use the full resolution from my DLSR (12meg), but had to use the Apple’s “large” size (286KB).

# Summary

This was a quick introduction to Image Style Transfer. We are seeing this in more and more places. There are applications that can apply this same technique to videos. I expect this will become a standard part of all image processing software like PhotoShop or Gimp. It also might remain the domain of specialty programs like HDR has, since it is quite technical and resource intensive. In the meantime projects like VGG have made this technology quite accessible for anyone to play with.

## A Crack in the TensorFlow Platform

# Introduction

Last time we looked at how some tunable parameters through off a TensorFlow solution of a linear regression problem. This time we are going to look at a few more topics around TensorFlow and linear regression. Then we’ll look at how Google is implementing Linear Regression and some problems with their approach.

# TensorFlow Graphs

Last time we looked at calculating the solution to a linear regression problem directly using TensorFlow. That bit of code was:

# Now lets calculated the least squares fit exactly using TensorFlow X = tf.constant(data[:,0], name="X") Y = tf.constant(data[:,1], name="Y") Xavg = tf.reduce_mean(X, name="Xavg") Yavg = tf.reduce_mean(Y, name="Yavg") num = (X - Xavg) * (Y - Yavg) denom = (X - Xavg) ** 2 rednum = tf.reduce_sum(num, name="numerator") reddenom = tf.reduce_sum(denom, name="denominator") m = rednum / reddenom b = Yavg - m * Xavg with tf.Session() as sess: writer = tf.summary.FileWriter('./graphs', sess.graph) mm, bb = sess.run([m, b])

TensorFlow does all its calculations based on a graph where the various operators and constants are nodes that then get connected together to show dependencies. We can use TensorBoard to show the graph for the snippet of code we just reviewed here:

Notice that TensorFlow overloads the standard Python numerical operators, so when we get a line of code like: “denom = (X – Xavg) ** 2”, since X and Xavg are Tensors then we actually generate TensorFlow nodes as if we had called things like tf.subtract and tf.pow. This is much easier code to write, the only downside being that there isn’t a name parameter to label the nodes to get a better graph out of TensorBoard.

With TensorFlow you perform calculations in two steps, first you build the graph (everything before the with statement) and then you execute a calculation by specifying what you want. To do this you create a session and call run. In run we specify the variables we want calculated. TensorFlow then goes through the graph calculating anything it needs to, to get the variables we asked for. This means it may not calculate everything in the graph.

So why does TensorFlow follow this model? It seems overly complicated to perform numerical calculations. The reason is that there are algorithms to separate graphs into separate independent components that can be calculated in parallel. Then TensorFlow can delegate separate parts of the graph to separate GPUs to perform the calculation and then combine the results. In this example this power isn’t needed, but once you are calculating a very complicated large Neural Network then this becomes a real selling point. However since TensorFlow is a general tool, you can use it to do any calculation you wish on a set of GPUs.

# TensorFlow’s New LinearRegressor Estimator

Google has been trying to turn TensorFlow into a platform for all sorts of Machine Learning algorithms, not just Neural Networks. They have added estimators for Random Forests and for Linear Regression. However they did this by using the optimizers they created for Neural Nets rather than using the standard algorithms used in other libraries, like those implemented in SciKit Learn. The reasoning behind this is that they have a lot of support for really really big models with lots of support for one-hot encoding, sparse matrices and so on. However the algorithms that solve the problem seem to be exceedingly slow and resource hungry. Anything implemented in TensorFlow will run on a GPU, and similarly any Machine Learning algorithm can be implemented in TensorFlow. The goal here is to have TensorFlow running the Google AI Cloud where all the virtual machines have Google designed GPU like AI accelerator hardware. But I think unless they implement the standard algorithms, so they can solve things like a simple least squares regression quickly hand accurately then its usefulness will be limited.

Here is how you solve our fire versus theft linear regression this way in TensorFlow:

features = [tf.contrib.layers.real_valued_column("x", dimension=1)] estimator = tf.contrib.learn.LinearRegressor(feature_columns=features, model_dir='./linear_estimator')

# Input builders input_fn = tf.contrib.learn.io.numpy_input_fn({"x":x}, y, num_epochs=10000) estimator.fit(input_fn=input_fn, steps=2000) mm = estimator.get_variable_value('linear/x/weight') bb = estimator.get_variable_value('linear/bias_weight') print(mm, bb)

This solves the problem and returns a slope of 1.50674927 and intercept of 13.47268105 (the correct numbers from last post are 1.31345600492 and 16.9951572327). By increasing the steps in the fit statement I can get closer to the correct answer, but it is very time consuming.

The documentation for these new estimators is very limited, so I’m not 100% sure it’s solving least squares, but I tried getting the L1 solution using SciKit Learn and it was very close to least squares, so whatever this new estimator is estimating (which might be least squares), it is very slow and quite inaccurate. It is also strange that we now have a couple of tunable parameters added to make a fairly simple calculation problematic. The graph for this solution isn’t too bad, but still since we know the exact solution it is a bit disappointing.

Incidentally I was planning to compare the new TensorFlow RandomForest estimator to the Scikit Learn implementation. Although the SciKit Learn one is quite fast, it uses a huge amount of memory so I kind of would like a better solution. But when I compared the two I found the TensorFlow one so bad (both slow and resource intensive) that I didn’t bother blogging it. I hope that by the time this solution becomes more mainstream in TensorFlow that it improves a lot.

# Summary

TensorFlow is a very powerful engine for performing calculations that can be automatically parallelized and distributed over multiple GPUs for amazing computational speeds. This really does make it possible to spend a few thousand dollars and build quite a powerful supercomputer.

The downside is that Google appears to have the hammer of their neural network optimizers that they really want to use. As a result they are treating everything else as a nail and hitting it with this hammer. The results are quite sub-optimal. I think they do need to spend the time to implement a few of the standard non-Neural Network algorithms properly in TensorFlow if they really want to unleash the power of this platform.

## Dangers of Tunable Parameters in TensorFlow

# Introduction

One of the great benefits of the Internet era has been the democratization of knowledge. A great contributor to this is the number of great Universities releasing a large number of high quality online courses that anyone can access for free. I was going through one of these, namely Stanford’s CS 20SI: Tensorflow for Deep Learning Research and playing with TensorFlow to follow along. This is an excellent course and the course notes could be put together into a nice book on TensorFlow. I was going through “Lecture note 3: Linear and Logistic Regression in TensorFlow”, which starts with a simple example of using TensorFlow to perform a linear regression. This example demonstrates how to use TensorFlow to solve this problem iteratively using Gradient Descent. This approach will then be turned to much harder problems where this is necessary, however for linear regression we can actually solve the problem exactly. I did this and got very different results than the lesson. So I investigated and figured I’d blog a bit on why this is the case as well as provide some code for different approaches to this problem. Note that a lot of the code in this article comes directly from the Stanford course notes.

# The Example Problem

The sample data they used was fire and theft data in Chicago to see if there is a relation between the number of fires in a neighborhood to the number of thefts. The data is available here. If we download the Excel version of the file then we can read it with Python XLRD package.

import numpy as np import matplotlib.pyplot as plt import tensorflow as tf import xlrd DATA_FILE = "data/fire_theft.xls" # Step 1: read in data from the .xls file book = xlrd.open_workbook(DATA_FILE, encoding_override="utf-8") sheet = book.sheet_by_index(0) data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)]) n_samples = sheet.nrows - 1

With the data loaded in we can now try linear regression on it.

# Solving With Gradient Descent

This is the code from the course notes which solve the problem by minimizing the loss function which is defined as the square of the difference (ie least squares). I’ve blogged a bit about using TensorFlow this way in my Road to TensorFlow series of posts like this one. Its uses the GadientDecentOptimizer and iterates through the data a few times to arrive at a solution.

# Step 2: create placeholders for input X (number of fire) and label Y (number of theft) X = tf.placeholder(tf.float32, name="X") Y = tf.placeholder(tf.float32, name="Y") # Step 3: create weight and bias, initialized to 0 w = tf.Variable(0.0, name="weights") b = tf.Variable(0.0, name="bias") # Step 4: construct model to predict Y (number of theft) from the number of fire Y_predicted = X * w + b # Step 5: use the square error as the loss function loss = tf.square(Y - Y_predicted, name="loss") # Step 6: using gradient descent with learning rate of 0.01 to minimize loss optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) with tf.Session() as sess: # Step 7: initialize the necessary variables, in this case, w and b sess.run(tf.global_variables_initializer()) # Step 8: train the model for i in range(100): # run 100 epochs for xx, yy in data: # Session runs train_op to minimize loss sess.run(optimizer, feed_dict={X: xx, Y:yy}) # Step 9: output the values of w and b w_value, b_value = sess.run([w, b])

Running this results in w (the slope) as 1.71838 and b (the intercept) as 15.7892.

# Solving Exactly with TensorFlow

We can solve the problem exactly with TensorFlow. You can find the formula for this here, or a complete derivation of the formula here.

# Now lets calculated the least squares fit exactly using TensorFlow X = tf.constant(data[:,0], name="X") Y = tf.constant(data[:,1], name="Y") Xavg = tf.reduce_mean(X, name="Xavg") Yavg = tf.reduce_mean(Y, name="Yavg") num = (X - Xavg) * (Y - Yavg) denom = (X - Xavg) ** 2 rednum = tf.reduce_sum(num, name="numerator") reddenom = tf.reduce_sum(denom, name="denominator") m = rednum / reddenom b = Yavg - m * Xavg with tf.Session() as sess: writer = tf.summary.FileWriter('./graphs', sess.graph) mm, bb = sess.run([m, b])

This results in a slope of 1.31345600492 and intercept of 16.9951572327.

# Solving with NumPy

My first thought was that I did something wrong in TensorFlow, so I thought why not just solve it with NumPy. NumPy has a linear algebra subpackage which easily solves this.

# Calculate least squares fit exactly using numpy's linear algebra package. x = data[:, 0] y = data[:, 1] m, c = np.linalg.lstsq(np.vstack([x, np.ones(len(x))]).T, y)[0]

There is a little extra complexity since it handles n dimensions, so you need to reformulate the data from a vector to a matrix for it to be happy. This then returns the same result as the exact TensorFlow, so I guess my code was somewhat correct.

# Visualize the Results

You can easily visualize the results with matplotlib.

# Plot the calculated line against the data to see how it looks. plt.plot(x, y, "o") plt.plot([0, 40], [bb, mm * 40 + bb], 'k-', lw=2) plt.show()

This leads to the following pictures. First we have the plot of the bad result from GradientDecent.

This course instructor looked at this and decided it wasn’t very good (which it isn’t) and that the solution was to fit the data with a parabola instead. The parabola gives a better result as far as the least squares error because it nearly goes through the point on the upper right. But I don’t think that leads to a better predictor because if you remove that one point the picture is completely different. My feeling is that the parabola is already overfitting the problem.

Here is the result with the exact correct solution:

To me this is a better solution because it represents the lower right data better. Looking at this gives much less impetus to replace it with a concave up parabola. The course then looks at some correct solutions, but built on the parabola model rather than a linear model.

# What Went Wrong?

So what went wrong with the Gradient Descent solution? My first thought was that it didn’t iterate the data enough, just doing 100 iterations wasn’t enough. So I increased the number of iterations but this didn’t greatly improve the result. I know that theoretically Gradient Descent should converge for least squares since the derivatives are easy and well behaved. Next I tried making the learning rate smaller, this improved the result, and then also doing more iterations solved the problems. I found to get a reasonable result I needed to reduce the learning rate by a factor of 100 to 0.00001 and increase the iterations by 100 to 10,000. This then took about 5 minutes to solve on my computer, as opposed to the exact solution which was instantaneous.

The lesson here is that too high a learning rate leads to the result circling the solution without being able to converge to it. Once the learning rate is reduced so small, it takes a long time for the solution to move from the initial guess to the correct solution which is why we need so many iterations.

This highlights why many algorithms build in adaptable learning rates where they are higher when moving quickly and then they dynamically reduce to zero in on a solution.

# Summary

Most Machine Learning algorithms can’t be double checked by comparing them to the exact solution. But this example highlights how a simple algorithm can return a wrong result, but a result that is close enough to fool a Stanford researcher and make them (in my opinion) go in a wrong direction. It shows the danger we have in all these tunable parameters to Machine Learning algorithms, how getting things like the learning rate or number of iterations incorrect can lead to quite misleading results.