Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘stanford

Dangers of Tunable Parameters in TensorFlow

with one comment

Introduction

One of the great benefits of the Internet era has been the democratization of knowledge. A great contributor to this is the number of great Universities releasing a large number of high quality online courses that anyone can access for free. I was going through one of these, namely Stanford’s CS 20SI: Tensorflow for Deep Learning Research and playing with TensorFlow to follow along. This is an excellent course and the course notes could be put together into a nice book on TensorFlow. I was going through “Lecture note 3: Linear and Logistic Regression in TensorFlow”, which starts with a simple example of using TensorFlow to perform a linear regression. This example demonstrates how to use TensorFlow to solve this problem iteratively using Gradient Descent. This approach will then be turned to much harder problems where this is necessary, however for linear regression we can actually solve the problem exactly. I did this and got very different results than the lesson. So I investigated and figured I’d blog a bit on why this is the case as well as provide some code for different approaches to this problem. Note that a lot of the code in this article comes directly from the Stanford course notes.

The Example Problem

The sample data they used was fire and theft data in Chicago to see if there is a relation between the number of fires in a neighborhood to the number of thefts. The data is available here. If we download the Excel version of the file then we can read it with Python XLRD package.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import xlrd

DATA_FILE = "data/fire_theft.xls"

# Step 1: read in data from the .xls file
book = xlrd.open_workbook(DATA_FILE, encoding_override="utf-8")
sheet = book.sheet_by_index(0)
data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])
n_samples = sheet.nrows - 1

With the data loaded in we can now try linear regression on it.

Solving With Gradient Descent

This is the code from the course notes which solve the problem by minimizing the loss function which is defined as the square of the difference (ie least squares). I’ve blogged a bit about using TensorFlow this way in my Road to TensorFlow series of posts like this one. Its uses the GadientDecentOptimizer and iterates through the data a few times to arrive at a solution.

# Step 2: create placeholders for input X (number of fire) and label Y (number of theft)
X = tf.placeholder(tf.float32, name="X")
Y = tf.placeholder(tf.float32, name="Y")

# Step 3: create weight and bias, initialized to 0
w = tf.Variable(0.0, name="weights")
b = tf.Variable(0.0, name="bias")

# Step 4: construct model to predict Y (number of theft) from the number of fire
Y_predicted = X * w + b

# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name="loss")

# Step 6: using gradient descent with learning rate of 0.01 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:

    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())

    # Step 8: train the model
    for i in range(100): # run 100 epochs
        for xx, yy in data:

            # Session runs train_op to minimize loss
            sess.run(optimizer, feed_dict={X: xx, Y:yy})

    # Step 9: output the values of w and b
    w_value, b_value = sess.run([w, b])

Running this results in w (the slope) as 1.71838 and b (the intercept) as 15.7892.

Solving Exactly with TensorFlow

We can solve the problem exactly with TensorFlow. You can find the formula for this here, or a complete derivation of the formula here.

# Now lets calculated the least squares fit exactly using TensorFlow
X = tf.constant(data[:,0], name="X")
Y = tf.constant(data[:,1], name="Y")

Xavg = tf.reduce_mean(X, name="Xavg")
Yavg = tf.reduce_mean(Y, name="Yavg")
num = (X - Xavg) * (Y - Yavg)
denom = (X - Xavg) ** 2
rednum = tf.reduce_sum(num, name="numerator")
reddenom = tf.reduce_sum(denom, name="denominator")
m = rednum / reddenom
b = Yavg - m * Xavg
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    mm, bb = sess.run([m, b])

This results in a slope of 1.31345600492 and intercept of 16.9951572327.

Solving with NumPy

My first thought was that I did something wrong in TensorFlow, so I thought why not just solve it with NumPy. NumPy has a linear algebra subpackage which easily solves this.

# Calculate least squares fit exactly using numpy's linear algebra package.
x = data[:, 0]
y = data[:, 1]
m, c = np.linalg.lstsq(np.vstack([x, np.ones(len(x))]).T, y)[0]

There is a little extra complexity since it handles n dimensions, so you need to reformulate the data from a vector to a matrix for it to be happy. This then returns the same result as the exact TensorFlow, so I guess my code was somewhat correct.

Visualize the Results

You can easily visualize the results with matplotlib.

# Plot the calculated line against the data to see how it looks.
plt.plot(x, y, "o")
plt.plot([0, 40], [bb, mm * 40 + bb], 'k-', lw=2)
plt.show()

This leads to the following pictures. First we have the plot of the bad result from GradientDecent.

This course instructor looked at this and decided it wasn’t very good (which it isn’t) and that the solution was to fit the data with a parabola instead. The parabola gives a better result as far as the least squares error because it nearly goes through the point on the upper right. But I don’t think that leads to a better predictor because if you remove that one point the picture is completely different. My feeling is that the parabola is already overfitting the problem.

Here is the result with the exact correct solution:

To me this is a better solution because it represents the lower right data better. Looking at this gives much less impetus to replace it with a concave up parabola. The course then looks at some correct solutions, but built on the parabola model rather than a linear model.

What Went Wrong?

So what went wrong with the Gradient Descent solution? My first thought was that it didn’t iterate the data enough, just doing 100 iterations wasn’t enough. So I increased the number of iterations but this didn’t greatly improve the result. I know that theoretically Gradient Descent should converge for least squares since the derivatives are easy and well behaved. Next I tried making the learning rate smaller, this improved the result, and then also doing more iterations solved the problems. I found to get a reasonable result I needed to reduce the learning rate by a factor of 100 to 0.00001 and increase the iterations by 100 to 10,000. This then took about 5 minutes to solve on my computer, as opposed to the exact solution which was instantaneous.

The lesson here is that too high a learning rate leads to the result circling the solution without being able to converge to it. Once the learning rate is reduced so small, it takes a long time for the solution to move from the initial guess to the correct solution which is why we need so many iterations.

This highlights why many algorithms build in adaptable learning rates where they are higher when moving quickly and then they dynamically reduce to zero in on a solution.

Summary

Most Machine Learning algorithms can’t be double checked by comparing them to the exact solution. But this example highlights how a simple algorithm can return a wrong result, but a result that is close enough to fool a Stanford researcher and make them (in my opinion) go in a wrong direction. It shows the danger we have in all these tunable parameters to Machine Learning algorithms, how getting things like the learning rate or number of iterations incorrect can lead to quite misleading results.

 

Written by smist08

August 4, 2017 at 6:25 pm