Stephen Smith's Blog

Musings on Machine Learning…

Archive for August 2016

The Road to TensorFlow – Part 3: Python Libraries

with 3 comments

Introduction

Continuing my long and winding journey to learn TensorFlow: we started with Linux, then went on to Python. Today we will look at a number of necessary Python libraries.

My background is Mathematics and I’ve always had an interest in Numerical Analysis and Scientific Computing, but I mostly left these behind when I left University. As I learned Python and started to play with its attendant libraries, I was very pleasantly surprised to find all my favorite numerical algorithms (and many more) among them. These are now all part of the fairly standard Python libraries. Many of these core libraries are still written in their original Fortran or C code, but are tailored to fit very well into the Python ecosystem. All of this is open source software and to a certain degree made possible by the good work of the GNU Fortran and C compilers.

These libraries led to quite a few diversions from my primary task of learning TensorFlow, but I found this to be quite a wonderful world to become conversant in.

As I completed the TensorFlow tutorials and a Udacity course, I wanted a different problem to play with than the usual image recognition and speech analysis projects. Any such project needs quite a bit of data to train your algorithms with, so I thought why not do something with stock market data? After all, you can get gobs of stock market data easily (and freely) via web service calls.

Some Useful Libraries

Here are a few of the libraries that I found useful to help with machine learning and TensorFlow.

Numpy – this is the fundamental Python numerical package that most other libraries are built on. It includes a powerful N-dimensional array object, useful linear algebra, Fourier transform, random number capabilities and much more.

Scipy – is built on numpy and includes most numerical algorithms you’ve ever heard of including numerical integration, ODE solvers, optimization, interpolation, special functions and signal processing.

Matplotlib – is a very powerful 2D plotting library that is very useful for visualizing your results.

Pandas – was originally written as a library to manipulate stock market data and perform the standard things market technical analysts like to do, but now it markets itself as a general purpose data analysis library.

Sympy – is a library for performing symbolic mathematics. Although I’m not using this in relation to TensorFlow (currently), it is a fascinating tool for performing symbolic algebra and calculus.

IPython – is interactive Python, which lets you program in interactive web based notebooks. A useful tool to play with, but I tend to do my real programming in an IDE. Still, if you want to quickly play with something, this is lots of fun.

Pickle – although this is a standard library, I thought I’d highlight it since we are about to use it. This library lets you easily save Python objects to disk files and load them back.

Scikit-learn – is a collection of machine learning algorithms for things like clustering, classification and regression. I.e. neural networks aren’t the only way to accomplish these tasks.
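
As a small taste of the first item in that list, here is a minimal sketch of NumPy's N-dimensional array and its linear algebra routines. The matrix and vector values are made up purely for illustration.

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector); the values are arbitrary.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve the linear system A x = b.
x = np.linalg.solve(A, b)

# Arithmetic on arrays is element-wise and needs no explicit loop.
doubled = A * 2

print(A.shape)   # dimensions of the array
print(x)         # solution vector
print(doubled)
```

The same array object is what the other libraries in the list pass around, which is a large part of why they fit together so well.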

There are many more Python libraries for things like writing GUI programs, performing web requests, processing web data, accessing databases, etc. We’ll talk about those as we need them. Since Python has such a large community of users and contributors, there are tons of good web pages, blogs, books, courses and forums on all of these. Google is your friend.

Some Code Finally

So let’s use all of this to load some stock market data which will then be ready for our TensorFlow model. We are going to use Pandas to load some recent prices for the Dow 30 stocks and we’ll use matplotlib to display a graph of their values. This graph is a bit too busy, since 30 stocks is really too many to display at once. Also, we haven’t normalized the data at all, so it doesn’t give any real way to compare them. It really only shows we’ve loaded a bunch of data which is hopefully correct.

In this snippet we only load a small bit of history, so it’s reasonably quick, but when we want large amounts of data we will want to cache this. So when we do the web services call to get the data, we pickle it to a file (Python speak for serializing our data object and saving it to a file). If the file exists we just read it from the file and skip the web service call. To refresh the data from the web service, just delete the stocks.pickle file.
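
The caching idea can be sketched on its own, separate from the stock data. This is only an illustration using the standard pickle module: load_or_fetch and fake_fetch are hypothetical names, and fake_fetch stands in for the slow web service call with made-up numbers.

```python
import os
import pickle
import tempfile

calls = []  # records how many times the "web service" was really called


def fake_fetch():
    """Stand-in for a slow web service call; returns made-up prices."""
    calls.append(1)
    return {'AAPL': [106.0, 106.1], 'IBM': [159.4, 159.7]}


def load_or_fetch(cache_filename, fetch):
    """Return the cached data if the file exists, else fetch and cache it."""
    if os.path.exists(cache_filename):
        with open(cache_filename, 'rb') as f:
            return pickle.load(f)
    data = fetch()
    with open(cache_filename, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data


# Use a fresh temporary directory so the example starts with no cache file.
cache_file = os.path.join(tempfile.mkdtemp(), 'example.pickle')

first = load_or_fetch(cache_file, fake_fetch)    # fetches and writes the file
second = load_or_fetch(cache_file, fake_fetch)   # reads the file instead
print(first == second, len(calls))
```

Deleting the cache file forces the next call to fetch again, which is the same refresh trick as deleting stocks.pickle.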

We get the data from Yahoo Finance. We could use Yahoo’s Python library directly, but I thought I might use the Pandas DataReader general purpose API to make it easy to switch to Google if Verizon shuts down (or strangles) this service now that they own Yahoo. The Web Services call returns the open, high, low, volume, close and adjusted close which is why we have the couple of lines to clean up the data and only keep the adjusted close. I’ll talk more about the stock market and what the adjusted close is next time.

The program wants to get TrainDataSetSize prices for each stock (TrainDataSetSize is set to 50 below). But due to weekends and holidays, you can’t just subtract 50 days from today’s date to get that. So I use a simple heuristic, requesting a bit more than twice as many calendar days, to ensure I get more data than that (which massively overestimates).

import os
import pickle
from datetime import date, timedelta

import matplotlib.pyplot as plt
import pandas as pd
import pandas_datareader as pdr

TrainDataSetSize = 50

# Load the Dow 30 stocks from Yahoo into a Pandas DataFrame

dow30 = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'DD', 'XOM',
         'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM',
         'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG',
         'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WMT', 'DIS']

stock_filename = 'stocks.pickle'
if os.path.exists(stock_filename):
    # A cached copy exists: read it and skip the web service call
    try:
        with open(stock_filename, 'rb') as f:
            trainData = pickle.load(f)
    except Exception as e:
        print('Unable to process data from', stock_filename, ':', e)
        raise
    print('%s already present - Skipping requesting/pickling.' %
          stock_filename)
else:
    # Ask for roughly twice as many calendar days as prices needed,
    # to allow for weekends and holidays
    start = date.today() - timedelta(days=TrainDataSetSize*2+5)
    rawData = pdr.data.DataReader(dow30, 'yahoo', start, date.today())
    # Keep only the adjusted close for each stock
    cleanData = rawData['Adj Close']
    trainData = pd.DataFrame(cleanData)
    print('Pickling %s.' % stock_filename)
    try:
        with open(stock_filename, 'wb') as f:
            pickle.dump(trainData, f, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', stock_filename, ':', e)

print(trainData)

trainData.plot()
plt.show()
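
The look-back window in the listing above is TrainDataSetSize*2+5 calendar days. As a rough check of why that is more than enough, the sketch below counts the weekdays in such a window. It uses a fixed end date so the result is repeatable, count_weekdays is a hypothetical helper, and market holidays are not modelled, so the true number of trading days would be a little lower.

```python
from datetime import date, timedelta

TrainDataSetSize = 50


def count_weekdays(start, end):
    """Count the Monday-to-Friday dates from start to end, inclusive."""
    days = (end - start).days + 1
    return sum(1 for n in range(days)
               if (start + timedelta(days=n)).weekday() < 5)


end = date(2016, 8, 30)  # fixed date, chosen only for repeatability
start = end - timedelta(days=TrainDataSetSize * 2 + 5)
weekdays = count_weekdays(start, end)

print(start, end, weekdays)
print('Enough for the training set:', weekdays >= TrainDataSetSize)
```

Since the download returns more rows than needed, one option (not used in the listing) would be to keep only the most recent ones with something like trainData.tail(TrainDataSetSize).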

 

Generally, I think this is a fairly short bit of code for everything it accomplishes. This compactness is one of the beauties of Python.

[Figure: stocks1, the plot of the loaded stock prices]

Summary

This was a quick introduction to the Python libraries we’ll be using in addition to TensorFlow. Hopefully the quick sample program gave a taste of how we will be using them; it is in fact how we will be getting training data for our TensorFlow model.

 

 


Written by smist08

August 30, 2016 at 10:49 pm

The Road to TensorFlow – Part 2: Python

with 6 comments

Introduction

This is part 2 of my blog series on playing with TensorFlow. Last time I blogged on getting Linux going in a VM. This time we will be talking about the Python programming language. The API for TensorFlow is primarily aimed at Python and in fact much of the research in AI, scientific computing, numerical computing and data research takes place in Python. There is a C++ API as well, but it seems like a good chance to give Python a try.

Python is an interpreted language that is very rich in supporting various programming paradigms like object oriented, procedural and functional. Python is open source and runs on many platforms. Most Linux distributions and MacOS come with some version of Python pre-installed. Python is very interoperable and can work with most other programming systems, and there are a huge number of libraries of functionality available to the Python programmer. Python is oriented to getting things done quickly with a minimum of code and a minimum of fuss. The name Python is a tribute to the comedy troupe Monty Python and there are many references to Monty Python throughout the documentation.

[Image: Monty Python's Flying Circus title card]

Installation and Versions

Although I generally like Python, it has one really big problem that is a pain in the ass when setting up new systems and browsing documentation. The newest version of Python as of this writing is 3.5.2, which is the one I wanted to use along with all the attendant libraries. However, if you type python in a terminal window you get 2.7.12. This is because when Python went to version 3 it broke source code compatibility, so the decision was made to keep maintaining version 2 while everyone updated their programs and scripts to version 3. Version 3.0 was released in 2008 and this mess is still going on eight years later. The latest Python 2.x, namely 2.7.12, was just released in June 2016 and seems to be quite actively developed by a good sized community.

So generally, to get anything for Python 3.x you need to add a 3 to the end. To run Python 3.5.2 in a terminal window you type python3. Similarly, the IDE is IDLE3 and the package installer is pip3. This makes it very easy to make a mistake and get the wrong thing. Worse, the naming isn’t entirely consistent across all packages: there are several that I’ve run into where you add a 2 for the 2.x version and the version 3 one is just the name. As a result, I always get a certain amount of Python 2.x stuff installed by mistake (which doesn’t hurt anything, just wastes time and disk space). This also leads to a bit of confusion when you Google for information, in that you have to be careful to get 3.x info rather than 2.x info, as the wrong one may or may not work and may or may not be a best practice.
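
When it isn't obvious which interpreter a script ended up running under, a quick check along these lines can help. It is just a sketch using the standard sys module; the division lines show one of the visible differences between the two worlds.

```python
import sys

# sys.version_info describes the interpreter that is running this script.
major, minor = sys.version_info[0], sys.version_info[1]
print('Running Python %d.%d' % (major, minor))

if major < 3:
    print('This is Python 2; try the python3 command instead.')

# In Python 3, / on two integers is true division.
true_div = 7 / 2
# Floor division with // gives the same answer in both versions.
floor_div = 7 // 2
print(true_div, floor_div)
```

Under Python 2 the first division would quietly give a whole number instead, which is exactly the kind of difference that makes mixing up the versions annoying.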

On Ubuntu Linux I just used apt-get to install the various packages I needed. I’ll talk about these a bit more in the next posting. Another option for installing Python and all the scientific libraries is to use the Anaconda distribution, which is quite a good way to get everything in Python installed all at once. I used Anaconda to install Python on Windows 10 and it worked really well. You just don’t get fine control over what it does, and it creates a separate installation to keep everything apart from anything already installed.

Python the Language

Python is a very large language; it has everything from object orientation to functional programming to huge built in libraries. It does have a number of quirks though. For instance, the way you define blocks is via indentation rather than using curly brackets or perhaps end block statements. So indentation isn’t just a style guideline, it’s fundamental to how the program works. In the following bit of code:

for i in range(10):
    a = i * 8
    print( i, a )
a = 8

the two indented statements are part of the for loop and the out-dented assignment is outside the loop. You don’t declare variables, they are defined when first assigned to, and you can’t use a variable without assigning it first (or an exception will be thrown). There are a lot of built in types including dictionaries and lists, but no array type (though the numpy library does add these). Notice how a basic for loop uses in with a range, rather than counting from one value to another.
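
A few of those points can be tried out in a short sketch. The names and prices here are made up for illustration.

```python
# Variables come into existence on first assignment; no declaration needed.
count = 3

# Using a name that was never assigned raises a NameError exception.
try:
    print(undefined_name)
except NameError as e:
    message = str(e)
    print('Caught:', message)

# Lists and dictionaries are built in types.
squares = [n * n for n in range(5)]
prices = {'IBM': 159.4, 'GE': 31.2}   # made-up values
prices['KO'] = 43.5                   # add a new key by assigning to it

# for ... in works over any sequence, not just range().
for symbol in sorted(prices):
    print(symbol, prices[symbol])
print(count, squares)
```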

I don’t want to get too much into the language since it is quite large. If you are interested there are many good sites on the web to teach Python and the O’Reilly book “Learning Python” is recommended (but quite long).

Since Python is interpreted, you don’t need to wait for any compile steps, so the coding, testing, debugging cycle is quite quick. Writing tight loops in Python will be slower than C, but generally Python gives you quite good libraries to do most of what you want, and the libraries tend to be written in C or Fortran and are very fast. So far I haven’t found speed to be an issue. TensorFlow is also written in C++ for speed, plus it has the ability to run on NVIDIA graphics cards for an extra boost.
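
To get a feel for the tight loop point, here is a small sketch that adds up the same numbers two ways: with a hand written Python loop and with the built in sum(), which in the standard CPython interpreter is implemented in C. The list size and repeat count are arbitrary choices, and the timings will vary from machine to machine.

```python
import timeit

numbers = list(range(100000))


def loop_sum(values):
    """Add the values one at a time in interpreted Python."""
    total = 0
    for v in values:
        total += v
    return total


# Both approaches must agree on the answer.
assert loop_sum(numbers) == sum(numbers)

# Time 20 repetitions of each approach.
loop_time = timeit.timeit(lambda: loop_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print('Python loop:  %.4f s' % loop_time)
print('built-in sum: %.4f s' % builtin_time)
```

The same reasoning is why you hand whole arrays to the numerical libraries rather than looping over the elements yourself.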

Summary

This was my quick intro to Python. I’ll talk more about relevant parts of Python as I go along in this series. I generally like Python and so far my only big complaint is the confusion between the version 2 world and the version 3 world.

 

Written by smist08

August 26, 2016 at 11:10 pm

Posted in Artificial Intelligence


The Road to TensorFlow – Part 1 Linux

with 10 comments

Introduction

There have been some remarkable advancements in Artificial Intelligence type algorithms lately. I blogged on this a little while ago here. Whether it’s computers reading hand-writing, understanding speech, driving cars or winning at games like Go, there seems to be a continual flood of stories of new amazing accomplishments. I thought I’d spend a bit of time getting to know how this was all coming about by doing a bit of reading and playing with the various technologies.

I wanted to play with Neural Network technology, so thought the Google TensorFlow open source toolkit would be a good place to start. This led me down the road to quite a few new (to me) technologies. So I thought I’d write a few blog posts on my road to getting some working TensorFlow programs. This might take quite a few articles covering Linux, Python, Python libraries like Pandas, Stock Market technical analysis, and then TensorFlow.

Linux

The first obstacle I ran into was that TensorFlow had no install image for Windows. After a bit of Googling, I found you need to run it on MacOS or Linux. I hadn’t played with Linux in a few years and I’d been meaning to give it a try.

I happened to have just read about a web site, osboxes.org, that provides VirtualBox and VMWare images of all sorts of versions of Linux all ready to go, so I thought I’d give this a try. I downloaded and installed VirtualBox and downloaded a copy of 64-bit Ubuntu Linux. Since I didn’t choose anything special I got Canonical’s Unity Desktop. Since I was trying new things, I figured oh well, let’s get going.

Things went pretty well at first. I figured out how to install things on Ubuntu using APT (Advanced Packaging Tool), its command line utility for installing packages. This worked pretty well and the only problems I had were particular to installing Python, which I’ll talk about when I get to Python. I got TensorFlow installed and was able to complete the tutorial, I got the IDLE3 IDE for Python going, and I felt I was making good progress.

Then Ubuntu installed an update for me (which, like Windows, runs automatically by default). This updated many packages on my virtual image and in the process broke the Unity desktop. Now the desktop wouldn’t come up and all I could do was run a single terminal window, so at least I could get my work off the machine. I Googled the problem and many people had it, but none of the solutions worked for me and I couldn’t resolve it. I don’t know if it’s just that Unity is finicky and buggy or if it’s a problem with running in a VirtualBox VM. Perhaps something with video drivers, who knows.

Anyway, I figured to heck with Ubuntu and switched to Red Hat’s Fedora Linux. I chose a standard simple Gnome desktop and swore never to touch Unity again. I also realized that now I’m retired, I’m not a commercial user, so I can freely use VMWare; I switched to VMWare since I wondered if my previous problem was caused by VirtualBox. However, installing TensorFlow on Fedora seemed to be quite difficult. The dependencies in the TensorFlow install assume the packages that Ubuntu installs by default, and apparently these are quite different from Fedora’s. So after madly installing things that I didn’t really think were necessary (like the GNU Fortran compiler), I gave up on Fedora.

So I went back to osboxes.org and downloaded an Ubuntu image with the Gnome desktop. This has been working great. I got everything re-installed quite quickly and was back to being productive. I like Gnome much better than Unity and I haven’t had any problems. Similarly, I think VMWare works a bit better than VirtualBox and I think I get a bit better performance in this configuration.

I have Python along with all the Python scientific and numerical computing libraries working. I have TensorFlow working. I spend most of my time in Terminal windows and the IDLE3 IDE, but occasionally use Firefox and some of the other programs pre-installed with the distribution.

[Image: gnome, the Gnome desktop]

I’m greatly enjoying working with Linux again, and I’m considering replacing my currently broken desktop computer with something inexpensive natively running Linux. I haven’t really enjoyed the direction Windows has taken after Windows 7 and I’m thinking of perhaps doing most of my computing on Linux and MacOS.

Summary

I am enjoying using Linux again, in spite of my initial problems with Ubuntu’s Unity Desktop and then with Fedora (running TensorFlow). Now that I have a good system that seems to be stable and working well, I’m pretty happy with it. I’m also glad to be free of things like App stores, and it’s nice to feel in control of my environment when running Linux. Anyway, this was the small first step to TensorFlow.

Written by smist08

August 23, 2016 at 11:40 pm