Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘pandas

The Road to TensorFlow – Part 3: Python Libraries

with 3 comments


Continuing on with my long and winding journey to learn TensorFlow, we started with Linux then went on to Python. Today we will be looking at a number of necessary Python libraries.

My background is Mathematics and I’ve always had an interest in Numerical Analysis and Scientific Computing. But I mostly left these behind when I left University. As I learned Python and started to play with it, among the attendant libraries, I was very pleasantly surprised to find that all my favorite numerical algorithms (and many more). These were now all part of the Python fairly standard libraries. Many of these core libraries are still written in their original Fortran or C code, but are tailored to fit very well into the Python ecosystem. All of this is all open source software and to a certain degree made possible by the good work of the GNU Fortran and C compilers.

These libraries led to quite a few diversions from my primary task of learning TensorFlow, but I found this to be quite a wonderful world to become conversant in.

As I completed the TensorFlow tutorials and an Udacity course, I wanted a different problem to play with rather than the standard image recognition and speech analysis projects that seem pretty standard. To use these, you need quite a bit of data to train your algorithms with, so I thought why not do something with stock market data? After all you can easily get gobs of stock market data via web service calls fairly easily (and freely).

Some Useful Libraries

Here are a few of the libraries that I found useful to help with machine learning and TensorFlow.

Numpy – this is the fundamental Python numerical package that most other libraries are built over. It includes a powerful N dimensional array object, useful linear algebra, Fourier transform, random number capabilities and much more.

Scipy – is built on numpy and includes most numerical algorithms you’ve ever heard of including numerical integration, ODE solvers, optimization, interpolation, special functions and signal processing.

Matplotlib – is a very powerful 2D plotting library that is very useful to use to visualize your results.n

Pandas – was originally written as a library to manipulate stock market data and perform the standard things market technical analysts like to do, but now it markets itself as a general purpose data analysis library.

Sympy – is a library for performing symbolic mathematics. Although I’m not using this in relation to TensorFlow (currently), it is a fascinating tool for performing symbolic algebra and calculus.

IPython – is interactive Python when you program in interactive web based notebooks. A useful tool to play with, but I tend to do my real programming in an IDE. Still if you want to quickly play with something, this is lots of fun.

Pickle – although this is a standard library, I thought I’d highlight it since we are about to use it. This library lets you easily save and load Pythons objects to disk files.

Scikit-learn – is a collection of machine learning algorithms for things like clustering, classification and regression. I.e. neural networks aren’t the only way to accomplish these tasks.

There are many more Python libraries for things like writing GUI programs, performing web requests, processing web data, accessing databases, etc. We’ll talk about those as we need them. Since Python has such a large community of users and contributors there are tons of good web pages, blogs, books courses and forums on all of these. Google is your friend.

Some Code Finally

So let’s use all of this to load some stock market data which will then be ready for our TensorFlow model. We are going to use Pandas to load some recent prices for the Dow 30 stocks and we’ll use matplotlib to display a graph of their values. This graph is a bit too busy since 30 stocks is also really too many to display at once. Also we haven’t normalized the data at all, so this doesn’t give any real way to compare them. It really only shows we’ve loaded a bunch of data which is hopefully correct.

In this snippet we only load a small bit of history, so its reasonably quick but when we want large amounts of data we will want to cache this. So when we do the web services call to get the data, we pickle it to a file (Python speak for serializing our data object and saving it to a file). If the file exists we just read it from the file and skip the web service call. To refresh the data from the web service, just delete the stocks.pickle file.

We get the data from Yahoo Finance. We could use Yahoo’s Python library directly, but I thought I might use the Pandas DataReader general purpose API to make it easy to switch to Google if Verizon shuts down (or strangles) this service now that they own Yahoo. The Web Services call returns the open, high, low, volume, close and adjusted close which is why we have the couple of lines to clean up the data and only keep the adjusted close. I’ll talk more about the stock market and what the adjusted close is next time.

The program wants to get TrainDataSetSize prices for each stock which is set to 50 below. But due to weekends and holidays, you can’t just subtract 50 from today’s date to get that. So I use a simple heuristic to ensure I get more data than that (which massively overestimates).

import time
import math
import os
from datetime import date
from datetime import timedelta
import numpy as np
import matplotlib
import pandas as pd
import pandas_datareader as pdr
from pandas_datareader import data, wb
from six.moves import cPickle as pickle

TrainDataSetSize = 50

# Load the Dow 30 stocks from Yahoo into a Pandas datasheet

dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
          'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
          'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG',
          'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']

stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
         with open(stock_filename, 'rb') as f:
             trainData = pickle.load(f)
     except Exception as e:
       print('Unable to process data from', stock_filename, ':', e)
     print('%s already present - Skipping requesting/pickling.' %
     f =, 'yahoo',
     cleanData = f.ix['Adj Close']
     trainData = pd.DataFrame(cleanData)
     print('Pickling %s.' % stock_filename)
         with open(stock_filename, 'wb') as f:
           pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
     except Exception as e:
         print('Unable to save data to', stock_filename, ':', e)




Generally, I think this is a fairly short bit of code that accomplishes all this. This is one of the beauties of Python that it is so compact.



This was a quick introduction the Python libraries we’ll be using in addition to TensorFlow. Hopefully the quick sample program gave a taste of how we will be using them and is in fact how we will be getting training data for our TensorFlow model.



Written by smist08

August 30, 2016 at 10:49 pm