## Posts Tagged ‘**vanishing gradients**’

## Updates to the TensorFlow API

# Introduction

Last year I published a series of posts on getting up and running on TensorFlow and creating a simple model to make stock market predictions. The series starts here, however the coding articles are here, here and here. We are now a year later and TensorFlow has advanced by quite a few versions (1.3 as of this writing). In this article I’m going to rework that original Python code to use some simpler more powerful APIs from TensorFlow as well as adopt some best practices that weren’t well known last year (at least by me).

This is the same basic model we used last year, which I plan to improve on going forwards. I changed the data set to record the actual stock prices rather than differences. This doesn’t work so well since most of these stocks increase over time and since we go around and around on the training data, it tends to make the predictions quite low. I plan to fix this in a future articles where I handle this time series data correctly. But first I wanted to address a few other things before proceeding.

I’ve placed the updated source code tfstocksdiff13.py on my Google Drive here.

# Higher Level API

In the original code to create a layer in our Neural Network, we needed to define the weight and bias Tensors:

layer1_weights = tf.Variable(tf.truncated_normal( [NHistData * num_stocks * 2, num_hidden], stddev=0.1)) layer1_biases = tf.Variable(tf.zeros([num_hidden]))

And then define the layer with a complicated mathematical expression:

```
hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)
```

This code is then repeated with mild variations for every layer in the Neural Network. In the original code this was quite a large block of code.

In TensorFlow 1.3 there is now an API to do this:

hidden = tf.layers.dense(data, num_hidden, activation=tf.nn.elu, kernel_initializer=he_init, kernel_regularizer=tf.contrib.layers.l1_l2_regularizer(), name=name + "model" + "hidden1")

This eliminates a lot of repetitive variable definitions and error prone mathematics.

Also notice the kernel_regularizer=tf.contrib.layers.l1_l2_regularizer() parameter. Previously we had to process the weights ourselves to add regularization penalties to the loss function, now TensorFlow will do this for you, but you still need to extract the values and add them to your loss function.

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) loss = tf.add_n([tf.nn.l2_loss( tf.subtract(logits, tf_train_labels))] + reg_losses)

You can get at the actual weights and biases if you need them in a similar manner as well.

# Better Initialization

Previously we initialized the weights using a truncated normal distribution. Back then the recommendation was to use random values to get the initial weights away from zero. However since 2010 (quite a long time ago) there have been better suggestions and the new tf.layers.dense() API supports these. The original paper was “Understanding the difficulty of training deep feedforward neural networks” by Xavier Glorot and Yoshua Bengio. If you ran the previous example you would have gotten an uninitialized variable on he_init. Here is its definition:

```
he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")
```

The idea is that these initializers vary based on the number of inputs and outputs for the neuron layer. There is also tf.contrib.layers.xavier_initializer() and tf.contrib.layers.xavier_initializer_conv2d(). For this example with only two hidden layers it doesn’t matter so much, but if you have a much deeper Neural Network, using these initializers can greatly speed up training and avoid having the gradients either go to zero or explode early on.

# Vanishing Gradients and Activation Functions

You might also notice I changed the activation function from tanh to elu. This is due to the problem of vanishing gradients. Since we are using Gradient Descent to train our system then any zero gradients will stop training improvement in that dimension. If you get large values out of the neuron then the gradient of the tanh function will be near zero and this causes training to get stalled. The relu function also has similar problems if the value ever goes negative then the gradient is zero and again training will likely stall and get stuck there. On solution to this is to use the elu function or a “leaky” relu function. Below are the graphs of elu, leaky relu and relu.

Leaky relu has a low sloped linear function for negative values. Elu uses an exponential type function to flatten out a bit to the left of zero so if things go a bit negative they can recover. Although if things go more negative with elu, they will get stuck again. Elu has the advantage that its is rigged to be differentiable at 0 to avoid special cases. Practically speaking both of these activation functions have given very good results in very deep Neural Networks which would otherwise get stuck during training with tanh, sigmoid or relu.

# Scaling the Input Data

Neural networks work best if all the data is between zero and one. Previously we didn’t scale our data properly and just did an approximation by dividing by the first value. All that code has been deleted and we now use SciKit Learn’s MinMaxScaler object instead. You fit the data using the training data and then transform any data we process with the result. The code for us is:

# Scale all the training data to the range [0,1]. scaler = MinMaxScaler(copy=False) scaler.fit(train_dataset) scaler.transform(train_dataset) scaler.transform(valid_dataset) scaler.transform(test_dataset) scaler.transform(final_row)

The copy=False parameter basically says to do the conversion in place rather than producing a new copy of the data.

SciKit Learn has a lot of useful utility functions that can greatly help with using TensorFlow and well worth looking at even though you aren’t using a SciKit Learn Machine Learning function.

# Summary

The field of Neural Networks is evolving rapidly and the best practices keep getting better. TensorFlow is a very dynamic and quickly evolving tool set which can sometimes be a challenge to keep up with.

The main learnings I wanted to share here are:

- TensorFlow’s high level APIs
- More sophisticated initialization like He Initialization
- Avoiding vanishing gradients with elu or leaky ReLU
- Scaling the input data to between zero and one

These are just a few things of the new things that I could incorporate. In the future I’ll address how to handle time series data in a better manner.