Stephen Smith's Blog

Musings on Machine Learning…

Archive for the ‘Artificial Intelligence’ Category

Playing the Kaggle Two Sigma Challenge 2018/2019

Introduction

A couple of years ago, I entered a Kaggle data science competition sponsored by Two Sigma for stock market prediction. I blogged about this in part 1, part 2, part 3, part 4 and part 5. The upshot was that although I put in a lot of work, I performed quite poorly in the final stages. I learned a lot about machine learning and data science along the way and was keen to have another go when another Two Sigma sponsored competition rolled around last year.

In this competition we had three months to create our models in the fall of 2018, then they used our models to predict the stock market through the first half of 2019. With my learnings from the last competition, I was able to do much better this time around.

Don’t Overfit

My big lesson from the first competition was to not overfit the model to the training data. This is equivalent to having ten data points and fitting them perfectly with a 9th degree polynomial. There is no error in predicting the ten points, but the model is useless at predicting anything else, and in fact gives about the worst predictions possible.
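To make the polynomial example concrete, here is a minimal sketch in plain Python. The data (the trend y = x with alternating noise of plus or minus 0.5) and the helper name lagrange_predict are made up for illustration; nothing here comes from the competition itself.

```python
# Ten made-up training points: the trend y = x plus alternating noise of +/-0.5.
xs = list(range(10))
ys = [x + (0.5 if x % 2 == 0 else -0.5) for x in xs]


def lagrange_predict(x, xs, ys):
    """Evaluate the unique degree-9 polynomial through all ten points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


# Error on the training points is zero (up to rounding).
train_error = max(abs(lagrange_predict(x, xs, ys) - y) for x, y in zip(xs, ys))

# One step past the training data, the underlying trend says roughly 10.
prediction_at_10 = lagrange_predict(10, xs, ys)

print("largest training error:", train_error)
print("prediction at x = 10:", prediction_at_10)
```

With this particular data the polynomial reproduces all ten training points, yet its prediction at x = 10 lands near -501.5 when the trend suggests a value of about 10.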

A more subtle form of overfitting is trying hundreds of models and fiddling with their parameters until they work really well with the training data. This is a lot of work and won’t help you on any data outside of the training data. I did this for the first competition; it was a lot of work and it performed badly.

Avoiding overfitting means doing less work, which is good. I spent very little time on this competition and got quite good results.

Kaggle Virtual Environment

One of the things I like about this competition is that you play it in Google/Kaggle’s virtual environment. You have a fixed set of computer resources and everyone plays in the same environment. This levels the playing field with people who have access to very high powered equipment at corporations or Universities. This year the environment included a GPU and we could run for six hours on a high powered server.

This does limit the models you can use. I wasn’t successful at using neural networks, probably because historical stock market data is very flat and it is hard to get these models to converge. I ended up using an Extra Trees model from scikit-learn.

Make the Program Robust

Usually in this sort of competition, when you run your model, behind the scenes Kaggle runs it on the secret test data as well, so if you run successfully on the provided test data, you know you also run on the secret data you are going to be scored against. In this case the secret test data didn’t exist yet. This led to the worry that sometime in the six months that they would be running our program, something unexpected would appear in the data and cause my program to crash, knocking me out of the competition.

I was careful to put in try/except statements and added extra checks to try to keep my program running. The other thing with Python programs is that sometimes they work, but throw a memory exception when they shut down. I spent some time tracking down a number of these bugs and made sure my program could exit gracefully without any errors.
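As an illustration of this kind of defensive wrapper, here is a small sketch. The names safe_predictions and toy_model, the row format, and the assumption that a confidence must lie between -1 and 1 are all hypothetical; this is not the code I submitted.

```python
def safe_predictions(rows, predict_one):
    """Return one confidence per row, falling back to 0.0 when a row fails."""
    confidences = []
    for row in rows:
        try:
            value = float(predict_one(row))
            # Treat NaN and out-of-range values as "no opinion".
            if value != value or not -1.0 <= value <= 1.0:
                value = 0.0
        except Exception:
            # Missing fields, bad types and so on: stay neutral, keep running.
            value = 0.0
        confidences.append(value)
    return confidences


def toy_model(row):
    # Hypothetical stand-in for a trained model.
    return row["momentum"] / 10.0


rows = [{"momentum": 5.0}, {"momentum": None}, {}, {"momentum": 50.0}]
results = safe_predictions(rows, toy_model)
print(results)
```

The idea is that one strange row costs you a single neutral prediction rather than the whole run.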

From the message board for the competition, it appears quite a few competitors were knocked out during the run on the new data.

Don’t Cheat

Some people spend all their time trying to cheat, trying to hack the system to gain access to the secret test data. For the purposes of the leaderboard, there was secret test data to give people a score during the model building phase, but this data wouldn’t be used in the real competition. There was no real protection on this data, since using it to cheat would be useless. However, quite a few people did cheat to move to the top of the leaderboard before the real competition started. These programs crashed when the real competition started.

There are rumours that people have succeeded in other Kaggle competitions by cheating, but in this one, since it was based on stock market data generated after the models were frozen, it wasn’t going to work.

My Model

The intent of the competition was to try to use news data to enhance your stock prediction algorithm. I don’t think this worked well. I used an Extra Trees variation on a Random Forest, and the news data never seemed to contribute much. Many other competitors didn’t even use it. The competition metric was a Sharpe Ratio. I have a couple of observations about the Sharpe Ratio. First, it has the standard deviation of the estimates in the denominator. This means volatile stocks will hurt you even if you predict them accurately. Second, you enter a confidence value on whether you think the stock will do well, and this confidence can be negative if you think it will go down. If you get the sign wrong, it will doubly hurt you.
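A rough sketch of a metric with this shape is below. It assumes each day's value is the sum of confidence times actual return over the stocks, and that the score is the mean of those daily values divided by their standard deviation; the exact competition formula may differ in its details, and the numbers are made up.

```python
from statistics import mean, stdev


def daily_value(confidences, returns):
    """Sum of confidence times actual return over the stocks for one day."""
    return sum(c * r for c, r in zip(confidences, returns))


def sharpe_like_score(daily_values):
    """Mean of the daily values divided by their standard deviation."""
    spread = stdev(daily_values)
    return mean(daily_values) / spread if spread > 0 else 0.0


# Three made-up days with two stocks each: (confidences, actual returns).
days = [
    ([1.0, -0.5], [0.02, -0.01]),
    ([0.5, 0.5], [0.01, 0.03]),
    ([1.0, 0.0], [0.03, -0.08]),
]
right_sign = [daily_value(c, r) for c, r in days]
wrong_sign = [daily_value([-x for x in c], r) for c, r in days]

score_right = sharpe_like_score(right_sign)
score_wrong = sharpe_like_score(wrong_sign)
print(score_right, score_wrong)
```

Flipping the sign of every confidence flips the sign of the score, so a wrong call not only forfeits the gain but counts against you.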

Given the Sharpe Ratio as the metric, I decided to sort the stocks by standard deviation of their returns and then rate any with a high standard deviation as zero, which would exclude them from the model. This meant I was building a portfolio of a subset of all the stocks present. I wanted stocks I could predict accurately and that had a lower volatility.
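The filtering step can be sketched as follows. The tickers, the return histories and the threshold max_std are invented for illustration, and the function name zero_out_volatile is hypothetical.

```python
from statistics import pstdev


def zero_out_volatile(confidences, return_history, max_std):
    """Set the confidence to 0.0 for any stock whose returns are too volatile."""
    filtered = {}
    for ticker, confidence in confidences.items():
        volatility = pstdev(return_history[ticker])
        filtered[ticker] = confidence if volatility <= max_std else 0.0
    return filtered


# Made-up daily returns for two hypothetical tickers.
return_history = {
    "CALM": [0.01, -0.01, 0.02, 0.00],
    "WILD": [0.20, -0.25, 0.30, -0.15],
}
confidences = {"CALM": 0.8, "WILD": 0.9}

filtered = zero_out_volatile(confidences, return_history, max_std=0.05)
print(filtered)
```

A confidence of zero contributes nothing to the daily value, which is what effectively removes the stock from the portfolio.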

In reading the discussion board after the competition, it appears many of the top performers took this approach.

Summary

I enjoyed this competition. I liked working with the high powered servers in the competition’s virtual environment. The lesson learned, to not overfit, greatly reduced the amount of work. Whenever I was tempted to tune my model, I just said no. I ended up receiving a silver medal, coming in 145 out of 2927 competitors. With each competition I learn a bit more and I look forward to the next one.

Written by smist08

August 16, 2019 at 5:43 pm

Playing with CUDA on My NVIDIA Jetson Nano

Introduction

I reported last time about my new toy, an NVIDIA Jetson Nano Development Kit. I’m pretty familiar with Linux and ARM processors. I even wrote a couple of articles on Assembler programming, here and here. The thing that intrigued me about the Jetson Nano is its 128 Maxwell GPU cores. What can I do with these? Sure, I can speed up TensorFlow since it uses these automatically. I could probably do the same with OpenGL programs. But what can I do directly?

So I downloaded the CUDA C Programming Guide from NVIDIA’s website to have a look at what is involved.

Setup

The claim is that the microSD image of 64-bit Ubuntu Linux that NVIDIA provides for this computer has all the NVIDIA libraries and utilities you need pre-installed. The programming guide made it clear that you need to use the NVIDIA C compiler, nvcc, to compile your work. But if I typed nvcc at a command prompt, I just got an error that this command wasn’t found. A bit of Googling revealed that everything is installed, but this happened before the installation created your user, so you need to add the locations to a couple of path variables. Adding:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

to my .bashrc file got everything working. It also shows where CUDA is installed. This is handy since it includes a large collection of samples.

Compiling and running the deviceQuery sample produced the following output on my Nano:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1

Result = PASS

This is all good information and what all this data means is explained in NVIDIA’s developer documentation (which is actually pretty good). The deviceQuery sample exercises various information APIs in the CUDA library to tell you all it can about what you are running. If you can compile and run deviceQuery in the samples/1_Utilities folder then you should be good to go.

CUDA Hello World

The 128 NVidia Maxwell cores basically consist of a SIMD computer (Single Instruction Multiple Data). This means you have one instruction that they all execute, but on different data. For instance, if you want to add two arrays of 128 floating point numbers, you have one instruction, add, and then each processor core adds a different element of the array. NVidia actually calls their processors SIMT, meaning Single Instruction Multiple Threads, since you can partition the processors among different threads, with each thread’s collection of processors doing its SIMD thing at the same time as the others.

When you write a CUDA program, you have two parts, one is the part that runs on the host CPU and the other is the part that runs on the NVidia GPUs. The NVidia C compiler, NVCC adds a number of extensions to the C language to specify what runs where and provide some more convenient syntaxes for the common things you need to do. For the host parts, NVCC translates its custom syntax into CUDA library calls and then passes the result onto GCC to compile regularly. For the GPU parts, NVCC compiles to an intermediate format called PTX. The reason it does this is to support all the various NVidia GPU models. When the NVidia device driver goes to load this code, it does a just in time compile (which it then caches), where the PTX code is compiled to the correct binary code for your particular set of GPUs.

Here is the skeleton of a simple CUDA program:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

 

The __global__ identifier specifies that the VecAdd routine is to run on the GPU. One instance of this routine will be downloaded to run on N processors. Notice there is no loop to add these vectors. Each processor will run a different thread, and the x member of the thread’s threadIdx will be used to choose which array element to add.

Then in the main program we call VecAdd using the VecAdd<<<1, N>>> syntax, which indicates we are calling a GPU function with these three arrays, using one block of N threads.

This little example skips the extra steps of copying the arrays to GPU memory or copying the result out of GPU memory. There are quite a few different memory types, and various trade offs for using them.

The complete program for adding two vectors from the samples is at the end of this article.

This example also doesn’t explain how to handle larger arrays or how to do error processing. For these extra levels of complexity, refer to the CUDA C Programming Guide.
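For larger arrays, the complete sample at the end of this article splits the work into blocks of 256 threads and has each thread check its index against the array length. The arithmetic can be checked on the host with a small Python emulation. This is not CUDA code, and the function names are invented for illustration; only the formulas mirror the sample.

```python
def launch_shape(num_elements, threads_per_block=256):
    """Blocks needed so that blocks * threads_per_block covers every element."""
    blocks = (num_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def covered_indices(num_elements, blocks, threads_per_block):
    """Emulate each thread computing its global index and bounds check."""
    hits = [0] * num_elements
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            i = threads_per_block * block_idx + thread_idx
            if i < num_elements:  # same guard as in the vectorAdd kernel
                hits[i] += 1
    return hits


blocks, threads = launch_shape(50000)
hits = covered_indices(50000, blocks, threads)
idle_threads = blocks * threads - 50000
print(blocks, idle_threads)
```

For 50000 elements this gives 196 blocks, with the last 176 threads doing nothing, which is why the kernel needs the bounds check.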

The CUDA program here is very short, just doing an addition. If you wanted to, say, multiply two 10×10 matrices, you would have your CUDA code do the dot product of a row in the first matrix by a column in the second matrix. Then you would have 100 cores execute this code, so in the ideal case (ignoring the cost of copying data to and from the GPU) the multiplication could finish up to 100 times faster than computing the 100 dot products one after another. There are a lot of samples on how to do matrix multiplication in the samples and documentation.
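The division of labour can be sketched in plain Python, with an ordinary loop standing in for the 100 parallel threads. This is a host-side emulation under that assumption, not CUDA code, and dot_product_thread is a hypothetical name.

```python
N = 10


def dot_product_thread(a, b, row, col):
    """Work done by one hypothetical thread: one element of the product."""
    return sum(a[row][k] * b[k][col] for k in range(N))


def multiply(a, b):
    """Run the per-thread routine once for each of the N * N output elements."""
    c = [[0.0] * N for _ in range(N)]
    for thread_id in range(N * N):  # on a GPU these would run in parallel
        row, col = divmod(thread_id, N)
        c[row][col] = dot_product_thread(a, b, row, col)
    return c


identity = [[1.0 if r == c else 0.0 for c in range(N)] for r in range(N)]
a = [[float(r * N + c) for c in range(N)] for r in range(N)]
product = multiply(a, identity)
print(product == a)
```

Each output element depends only on one row and one column of the inputs, so no thread has to wait for another.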

Newer CUDA Technologies

The Maxwell GPUs in the Jetson Nano are a bit old and reading and playing with the CUDA libraries revealed a few interesting tidbits on things they are missing. We all know how NVidia has been enhancing their products for gaming and graphics with the introduction of things like real time ray tracing, but the thing of more interest to me is how they’ve been adding features specific to Machine Learning and AI. Even though Google produces their own hardware for accelerating their TensorFlow product in their data centers, NVidia has added specific features that greatly help TensorFlow and other Neural Network programs.

One thing the Maxwell GPU lacks is direct matrix multiplication support. Newer GPUs with Tensor Cores can just do A * B + C as a single instruction, where these are all matrices.

Another thing that NVidia just added is direct support for executing computation graphs. If you worked with the early versions of TensorFlow, then you know that you construct your model by building a computational graph and then training and executing it. The newest NVidia GPUs can now execute these graphs directly. NVidia has a TensorRT library to move parts of TensorFlow to the GPU. This library does work for the Maxwell GPUs in the Jetson Nano, but is probably far more efficient on the newest, bright and shiny GPUs. Even just using TensorFlow without TensorRT is a great improvement and handles moving the matrix calculations to the GPU even on the Nano; it just means the libraries have more work to do.

Summary

The GPU cores in a product like the Jetson Nano can be easily utilized using products that support them like TensorFlow or OpenGL, but it’s fun to explore the lower level programming models to see how things are working under the covers. If you are interested in parallel programming on a SIMD type machine, then this is a good way to go.

 

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */

int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}






Written by smist08

April 3, 2019 at 6:01 pm

Can NVidia Bake a Better Pi Than Raspberry?

Introduction

I love my Raspberry Pi, but I find its limited 1Gig of RAM can be quite restricting. It is still pretty amazing what you can do with these $35 computers. I was disappointed when the Raspberry Pi Foundation announced that the Raspberry Pi 4 is still over a year away, so I started to look at Raspberry Pi alternatives. I wanted something with 4Gig of RAM and a faster ARM processor. I was considering purchasing an Odroid N2 when I saw the press release from NVidia’s Developer Conference that they had just released their NVidia Jetson Nano Developer Kit. This board has a faster ARM A57 quad core processor and 4Gig of RAM, plus the bonus of a 128 core Maxwell GPU. The claim is that this is an ideal DIY computer for those interested in AI and machine learning (i.e. me). It showed up for sale on arrow.com, so I bought one and received it via FedEx in 2 days.

Setup

If you already have a Raspberry Pi, setup is easy, since you can unplug things from the Pi and plug them into the Nano, namely the power supply, keyboard, monitor and mouse. Like the Pi, the Nano runs from a microSD card, so I reformatted one of my Pi cards and burned onto it a download of the variant of Ubuntu Linux that NVidia provides for these. Once the operating system was burned to the microSD card, I plugged it into the Nano and away I went.

One difference from the Pi is that the Nano does not have built-in Wifi or Bluetooth. Fortunately the room I’m setting this up in has a wired Ethernet port, so I went into the garage and found a long Ethernet cable in my box of random cables, plugged it in and was all connected to the Internet. You can plug a USB Wifi dongle in if you need Wifi, or there is an M.2 Key E slot (which is hard to access) for an M.2 Wifi card. Just be careful of compatibility, since the drivers need to be compiled for ARM64 Linux.

The board doesn’t come with a case, but the box folds into a stand to hold the board. For now that is how I’m running. If they sell enough of these, I’m sure cases will appear, but you will need to ensure there is enough ventilation for the huge heat sink.

Initial Impressions

The Jetson Nano certainly feels faster than the Raspberry Pi. This is helped by the faster ARM processor, the quadrupled memory, the use of the GPU cores for graphics acceleration, and the fact that the version of Linux is 64-bit (unlike Raspbian, which is 32-bit). It ran the pre-installed Chromium Browser quite well.

As I installed more software, I found that writing large amounts of data to the microSD card can be a real bottleneck, and I would often have to wait for it to catch up. This is more noticeable than on the Pi, probably because on the Pi other things are quite slow as well. It would be nice if there was an M.2 Key M interface for an NVMe SSD drive, but there isn’t. I ordered a faster microSD card (over three times faster than what I have) and hope that helps. I can also try putting some things on a USB SSD, but again this isn’t the fastest.

I tried running the TensorFlow MNIST tutorial program. The version of TensorFlow for this is 1.11. If I want to try TensorFlow 2.0, I’ll have to compile it myself for ARM64, which I haven’t attempted yet. Anyway, TensorFlow automatically used the GPU and executed the tutorial orders of magnitude faster than the Pi (a few minutes versus several hours). So I was impressed with that.

This showed up another gotcha. The GPU cores and CPU share the same memory. So when TensorFlow used the GPU, that took a lot of memory away from the CPU. I was running the tutorial in a Jupyter notebook running locally, so that meant I was running a web server, Chromium, Python, and then TensorFlow with bits on the CPU and GPU. This tended to use up all memory and then things would grind to a halt until garbage collection sorted things out. Running from scratch was fine, but running iteratively felt like it kept hitting a wall. I think the lesson here is that to do machine learning training on this board, I really have to use a lighter Python environment than Jupyter.

The documentation mentions a utility to control the processor speeds of the ARM cores and GPU cores, so you can tune the heat produced. I think this is aimed more at people who embed the board inside something, but beware, this sucker can run hot if you keep all the various processors busy.

How is it so Cheap?

The NVidia Jetson Nano costs $99 USD. The Odroid is $79, so the Nano is fairly competitive with other boards trying to be super-Pis. It is also cheaper than pretty much any NVidia graphics card and even their Nano compute board (which has no ports and costs $129 in quantities of 1000).

The obvious cost saving is no Wifi and no Bluetooth. Another is the lack of a SATA or M.2 Key M interface. It does have a camera interface, a serial interface and a Pi-like GPIO block.

The Nano has 128 Maxwell GPU cores. That sounds impressive, but remember most graphics cards have 700 to 4000 cores. Further, Maxwell is the oldest supported platform (version 5), whereas the newest is the version 7 Volta core.

I think NVidia is keeping the cost low to get the DIY crowd using their technologies. They’ve seen the success of the Raspberry Pi community and want to duplicate it for their various processor boards. I also think they want to be in the ARM board game, so that as better ARM processors come out, they might hope to supplant Intel in producing motherboards for desktop and laptop computers.

Summary

If the Raspberry Pi 4 team can produce something like this for $35 they will have a real winner. I’m enjoying playing with the board and learning what it can do. So far I’ve been pretty impressed. There are some limitations, but given the $100 price tag, I don’t think you can lose. You can play with parallel processing with the GPU cores, you can interface to robots with the GPIO pins, or play with object recognition via the camera interface.

For a DIY board, there are a lot of projects you can take on.

 

Avoiding Airline Collisions with Julia

Introduction

I was just watching an old episode of “Mayday: Air Crash Investigations”, on the crash of a Russian passenger jet with a DHL cargo plane over Switzerland. In this episode, both planes had onboard collision avoidance systems, but one plane listened to air traffic control rather than the collision avoidance system and went down rather than up, resulting in the collision. In reading about the programming language Julia recently, I had noticed several presentations on the development of the next generation of collision avoidance systems, in Julia. This, along with the fact that my wife is currently getting her pilot’s license, piqued my interest enough to have a slightly deeper look into this.

Modern airliners have employed an onboard Traffic Collision Avoidance System (TCAS) since the 1980s. TCAS is required on any passenger airplane that carries more than 19 passengers. These systems work by monitoring the transponders of nearby aircraft and determining when a collision is imminent. At this point the system provides a warning to the plane’s pilot along with a course of action. The TCAS units on the two aircraft communicate so one plane is ordered to go up and the other to descend.

Generally there are three layers to collision avoidance that operate on different timescales. At the coarsest level, planes travelling in one direction are required to be at a different altitude than planes in the reverse direction. Usually one direction gets even altitudes like 30,000 feet and the reverse gets odd altitudes like 31,000 feet. At a finer level, air traffic control is responsible for keeping the planes apart at medium distances. Then close up (minutes apart) it is TCAS’s job to avoid the collisions. This is partly due to the aftermath of the Russian/DHL crash and partly due to a realization that the latency in communications with air traffic control is too great when things get too close for comfort.

Interestingly, it was the collision of two passenger planes over the Grand Canyon in 1956 that caused Congress to create the FAA and started the development of the current TCAS system. It took thirty years to develop and deploy since it required computers to get much smaller and faster first.

Why Julia

The FAA has funded the development of the next generation of traffic avoidance which has been dubbed ACAS X. This started in 2008 and after quite a bit of study, it was decided to use Julia extensively in its development. Reading the reasons for why Julia was selected is rather scary when you consider what it highlights about the current TCAS system.

Problem 1 – Specifications

A big problem with TCAS was that the people who defined the system wrote the specification first as English-like pseudo-code and then rewrote that as a more programmy pseudo-code with variables and such. Then others would take this code and implement it in MATLAB to test the algorithms. Then the people who actually made the hardware would take this and re-implement it in C++ or Assembler. When people had a recent look at all this code, they found it to be a big mess, where the different specs and code bases had been maintained separately and didn’t match. There was no automation and very little validation. The first idea, fixing this code base, was rejected because the code was judged completely unreliable and impossible to add new features to.

They wanted the new system to take advantage of modern technologies like satellite navigation systems, GPS, and on-board radar systems. This means the new system will work with other planes that don’t have collision avoidance or perhaps don’t even have a transponder. In fact they wanted the new system to be easily extensible as new sensor inputs are added. Below is a small example of the reams of pseudo-code that make up TCAS.

The hope with Julia is to unify these different code bases into one. The variable pseudo-code would actually be true Julia code and the English code would be incorporated into JavaDoc-like comments in the code (actually using LaTeX). This would then eliminate the need to use MATLAB to test the pseudo-code. The consensus is that Julia code is easily as readable as the above pseudo-code but with the advantage of being runnable and testable.

The FAA doesn’t have the authority to mandate that avionics hardware companies run Julia on their ACAS X systems, but the hope is that the performance of Julia is good enough that they won’t bother reimplementing the system in C++ and that everything will be the same Julia code. Current estimates have the Julia code taking about 1.5 times as long to run as equivalent C code, and the thought is that with newer computer chips, this should be sufficient. The hope then is that the new system will not have the translation errors that dog TCAS.

Now that the specification is true computer code, many other tools can be written or used to help check correctness, such as the tool below which generates a flowchart from the Julia code/specification.

Problem 2 – Testing/Validation

Certainly with TCAS, implementing the system in MATLAB was hard. On top of that, MATLAB is quite slow, and that greatly restricts the number of test cases that can effectively be automated. The TCAS system is based on a huge number of giant decision trees and billions of test cases. A number of test/validation frameworks have been developed to test the new ACAS X system, including theorem proving, probabilistic model checking, adaptive stress testing, simulations and weakest precondition code analysis.

Now if the Avionics hardware manufacturers run the actual Julia code, there will have only been one code base from specification to deployment and it will have all been very thoroughly developed, tested and validated.

Summary

The new ACAS X system is currently being flight tested and is projected to start being deployed in regular commercial aircraft starting in 2020. Looking at the work that has gone into this system, it looks like it will make flying much safer. Hopefully it also sets the stage for how future large safety-critical systems will be developed. Further it looks like the Julia programming language will play a central part in this.

Written by smist08

October 7, 2018 at 10:28 pm

Julia Flux for Machine Learning

leave a comment »

Introduction

Flux is a Neural Network Machine Learning library for the Julia programming language. It is entirely written in Julia and relies on Julia’s built-in support for running on GPUs and providing distributed processing. It makes writing Neural Networks easy and leverages the power and expressiveness of the Julia language to make creating your Neural Network just the same as writing any other Julia expressions.

My last article pointed out some problems with using TensorFlow from Julia, due to many of the newer features being implemented in Python rather than being implemented in the core shared library. One recommendation from the TensorFlow folks is that if you want eager execution then use Flux rather than TensorFlow. The Flux folks claim a real benefit of Flux over TensorFlow is that you only need to know one language to do ML, whereas for TensorFlow you need to know TensorFlow (its graph language) plus the host language like Python. That gets confusing because there is a lot of duplication and it isn’t always clear in which system to do things or whether to use a TensorFlow or Python data type. Flux simplifies all this.

Although this all sounds wonderful, remember that Julia just hit version 1.0 and Flux just hit version 0.6.7. The main problem I found was excessive memory usage, which I’ll benchmark and discuss later on.

Also note that Flux isn’t a giant compilation of algorithms like SciKit Learn. It is rather specific to Neural Networks. There are other libraries available in Julia for things like Random Forests, but you need to find the correct package and install it. Then each of these may or may not fully support Julia 1.0 yet.

MNIST in Flux

To give a flavour for using Julia and Flux, here are a couple of examples from the FluxML model zoo. You can see it’s very simple to set up the Neural Network layers, perform the training and test the accuracy.

using Flux, Flux.Data.MNIST, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated
# using CuArrays

# Classify MNIST digits with a simple multi-layer-perceptron
imgs = MNIST.images()

# Stack images into one large batch
X = hcat(float.(reshape.(imgs, :))...) |> gpu

labels = MNIST.labels()
# One-hot-encode the labels
Y = onehotbatch(labels, 0:9) |> gpu

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu

loss(x, y) = crossentropy(m(x), y)

accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

dataset = repeated((X, Y), 200)
evalcb = () -> @show(loss(X, Y))
opt = ADAM(params(m))

Flux.train!(loss, dataset, opt, cb = throttle(evalcb, 10))

println("acc X,Y ", accuracy(X, Y))

# Test set accuracy
tX = hcat(float.(reshape.(MNIST.images(:test), :))...) |> gpu
tY = onehotbatch(MNIST.labels(:test), 0:9) |> gpu

println("acc tX, tY ", accuracy(tX, tY))

Here is a more sophisticated model which uses a convolutional Neural Network.

using Flux, Flux.Data.MNIST, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated, partition
# using CuArrays

# Classify MNIST digits with a convolutional network
imgs = MNIST.images()

labels = onehotbatch(MNIST.labels(), 0:9)

# Partition into batches of size 1,000
train = [(cat(float.(imgs[i])..., dims = 4), labels[:,i])
         for i in partition(1:60_000, 1000)]

train = gpu.(train)

# Prepare test set (first 1,000 images)
tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> gpu
tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> gpu

m = Chain(
  Conv((2,2), 1=>16, relu),
  x -> maxpool(x, (2,2)),
  Conv((2,2), 16=>8, relu),
  x -> maxpool(x, (2,2)),
  x -> reshape(x, :, size(x, 4)),
  Dense(288, 10), softmax) |> gpu

m(train[1][1])

loss(x, y) = crossentropy(m(x), y)

accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)
opt = ADAM(params(m))

Flux.train!(loss, train, opt, cb = evalcb)

Performance

One of Julia’s promises is the ease of use of a scripting language like Python with the speed of a compiled language like C. As it stands Flux isn’t there yet. There seem to be some points where Flux goes away for a long time. These might be the garbage collector kicking in, or something else. I find the speed is about the same order of magnitude as other systems (modulo the pauses), but the big problem is memory usage.

Solving MNIST with a convolutional Neural Network from Python using the TensorFlow tutorial runs quite well and uses 400Meg of memory. Running a similar model using Julia and TensorFlow uses 600Meg of memory. Running the simple model above using Julia and Flux takes 2Gig of memory. Running the convolutional model above uses 2.6Gig. This laptop that I’m using has 4Gig of RAM and is running Ubuntu Linux. This is why I think the big stalls in performance are garbage collection.

The problem with this is that MNIST is a nice small dataset and the model used to solve it isn’t very large as Neural Networks go. If Flux is using six times as much memory as Python then it really diminishes its usefulness as an ML toolkit.

I spent a bit of time looking at the Julia Differential Equations tutorial. They were pointing out that using matrix operations in the Julia expression evaluator would lead to lots of unnecessary temporary storage for instance to evaluate:

D = A + B + C

Where these are all large matrices, Julia has to create a temporary matrix to hold the sum A + B, which is then added to C. This temporary matrix has to be allocated from the heap and then later garbage collected. This process seems to be rather inefficient in Julia, at least going by all the workarounds they have to avoid this situation. They have SVectors which are for small vectors that can be allocated on the stack rather than the heap. They recommend using the .+ operator which does things element by element and is smart enough to not create lots of temporary values on the heap. I wonder if Flux needs some optimisations like the ones they spent so much time putting into the Differential Equations library.
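To make the temporary storage issue concrete, here is a minimal sketch of the same idea. It is a NumPy analogy in Python rather than Julia code, and the matrix size and variable names are made up for illustration. The naive expression allocates an intermediate array for A + B, while the second form writes every partial sum into one preallocated buffer. In Julia the corresponding idiom is the fused broadcast D .= A .+ B .+ C.

```python
import numpy as np

n = 500  # illustrative size only
A = np.ones((n, n))
B = np.full((n, n), 2.0)
C = np.full((n, n), 3.0)

# Naive form: A + B is materialised as a temporary array, then C is added
# to it, producing a second freshly allocated array for the result.
D_naive = A + B + C

# In-place form: allocate the result once and write each partial sum into it.
D = np.empty_like(A)
np.add(A, B, out=D)   # D = A + B, written straight into D
np.add(D, C, out=D)   # D = D + C, reusing the same buffer

print(np.array_equal(D, D_naive))
```

Both forms give the same answer. The difference is only in how many large arrays have to be allocated and later garbage collected along the way.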

Summary

Julia and Flux make a nice system for Machine Learning in theory. Until the technology matures a bit and some problems like memory management are better addressed, I think using this for large projects is a bit problematic. A lot of the current ML systems being worked on with Flux are by PhD candidates who are developing Flux as part of their thesis work. Hopefully they improve the memory usage and allow Flux and Julia to live up to their full potential.

 

Written by smist08

September 24, 2018 at 9:02 pm

TensorFlow from Julia

with 2 comments

Introduction

Last time, I gave a quick introduction to the Julia programming language which has just reached the 1.0 release mark after ten years of development. Julia is touted as the next great thing for scientific computing, machine learning, data science and artificial intelligence. Its hope is to supplant Python which is currently the go-to language in these fields. The goal is a more unified language, since it was developed well after Python and learned from a lot of its mistakes. It also claims to have the flexibility of Python but with the speed of a true compiled language like C.

I saw that in the list of packages there was support for using Google’s TensorFlow AI system natively from Julia so I thought I would give this a try. Although it worked, it did reveal some challenges that Julia is going to face in its battle to become a true equal with Python.

Using TensorFlow in Julia

The TensorFlow wrapper/interface for Julia is in a package created by a PhD candidate at MIT, Jon Malmaud. You can add it to Julia using Pkg.add("TensorFlow") as well as view the source code on GitHub. Since I wrote an article recently comparing TensorFlow running on a Raspberry Pi to running on my laptop, I thought I’d use the same example and compare Julia to those cases. I cut/pasted the code into the Julia IDE Juno and made some code syntax changes and gave it a go. It came back that the Keras object was undefined.

I then noticed that in the TensorFlow.jl GitHub there were a couple of examples doing predictions on the MNIST dataset, so at least these were solving the same problem as my article, just using different models. I fired these up, but they failed with syntax errors in the code to load the MNIST dataset. Right now this is a bit of a problem in Julia: not all libraries have been updated to the Julia 1.0 syntax. I had a look at the library to load MNIST and noticed that no one had contributed to it in three years. It appeared to be abandoned with no plans to continue it. After a bit more research I found another Julia package called MLDatasets that was maintained and would load MNIST along with several other popular datasets.

I logged an issue with the TensorFlow.jl repository that they should fix this. They replied that they didn’t have time but if I wanted to fix it, to go ahead. So I fixed this and checked it in to the TensorFlow.jl GitHub. So now these MNIST examples work with Julia 1.0. I was then happy to have given my small contribution back to this community.

I then thought, why not be ambitious and add the Keras layer to TensorFlow.jl? Well this led to some interesting revelations about how TensorFlow is architected.

Problems with the TensorFlow Architecture

Looking at some of the issues in the TensorFlow.jl library there were requests for things like TensorFlow’s eager execution and the TensorFlow layers interface. The answer to these issues was that the Julia interface only talked to the DLL/SO interface to TensorFlow and that these modules didn’t exist there and were in fact written in Python rather than C++. I had a look inside the TensorFlow GitHub and found that their Keras layer is also written in Python.

Originally TensorFlow.jl talked to the TensorFlow Python interface. Julia is really good at interoperability and can easily talk to both Python libraries as well as C/C++ DLL/SOs. The problem with talking to Python libraries is that it involves running a Python process and then doing process to process communications to execute the code. This tends to be way slower than talking to DLLs or SOs. So early on the TensorFlow.jl library was changed to just talk to the DLL/SO interface for TensorFlow and eliminated all Python dependencies. This then lets Julia use the really performant part of TensorFlow and perform all the core operations very quickly.

Now the problem seems to be that Google is doing a lot of the new TensorFlow development in Python and not putting the code into the core shared library. Google is also spending a lot of time promoting these new interfaces as the way to go. This means if you aren’t programming in Python you are definitely a second class citizen.

OK, so is this just bad for the newbie language Julia? Should Julia programmers just use the Julia native Flux AI library? Well, the other thing Google is promoting is running TensorFlow on things like mobile devices, but then you are accessing TensorFlow from Swift on iOS or from Java on Android. Now you have the same problems as the Julia programmer. You only have efficient access to the core low level APIs for TensorFlow and all the new fancy high level access is denied to you. Google’s API block diagram below highlights this.

To me this is a big architectural problem with TensorFlow. It’s great to use from Python, but is really limited in other environments. The videos and blogs starting to surface on TensorFlow 2.0 are promoting eager execution and the Keras layer as the default and primary ways to program with TensorFlow. This then raises the question of whether these will be moved into the core shared library or will remain as Python code. At this point I haven’t seen this explained, but as we get closer to the 2.0 preview later this year, I’ll be watching this keenly.

It would certainly be nice if they move this Python code into C++ in the shared library so everyone can use it. At that point I think TensorFlow would be much more usable from Julia, Swift, Java, C++, etc. Here’s hoping that is a major upgrade in the 2.0 release.

Julia TensorFlow Code

Just for interest here is the simplest Julia MNIST example just to give a flavour for the code. This is a simple linear model, so doesn’t give great results. There is a more complicated example that uses a convolutional neural network and gives far superior results.

using TensorFlow
include("mnist_loader.jl")

loader = DataLoader()

sess = Session(Graph())

x = placeholder(Float32)
y_ = placeholder(Float32)

W = Variable(zeros(Float32, 784, 10))
b = Variable(zeros(Float32, 10))

run(sess, global_variables_initializer())

y = nn.softmax(x*W + b)

cross_entropy = reduce_mean(-reduce_sum(y_ .* log(y), axis=[2]))
train_step = train.minimize(train.GradientDescentOptimizer(.00001), cross_entropy)

correct_prediction = argmax(y, 2) .== argmax(y_, 2)
accuracy=reduce_mean(cast(correct_prediction, Float32))

for i in 1:1000
    batch = next_batch(loader, 100)
    run(sess, train_step, Dict(x=>batch[1], y_=>batch[2]))
end

testx, testy = load_test_set()

println(run(sess, accuracy, Dict(x=>testx, y_=>testy)))

Summary

You can certainly use TensorFlow from Julia. Just beware that you are limited to the lower level APIs, so anything TensorFlow has implemented in Python isn’t available to you. This means you set up the graph and then execute it, really like you always did in the earlier versions of TensorFlow. It would certainly be nice if Google fixes this problem for TensorFlow 2.0.

Written by smist08

September 22, 2018 at 5:43 pm

Updates to the TensorFlow API

leave a comment »

Introduction

Last year I published a series of posts on getting up and running on TensorFlow and creating a simple model to make stock market predictions. The series starts here, however the coding articles are here, here and here. We are now a year later and TensorFlow has advanced by quite a few versions (1.3 as of this writing). In this article I’m going to rework that original Python code to use some simpler more powerful APIs from TensorFlow as well as adopt some best practices that weren’t well known last year (at least by me).

This is the same basic model we used last year, which I plan to improve on going forwards. I changed the data set to record the actual stock prices rather than differences. This doesn’t work so well since most of these stocks increase over time and since we go around and around on the training data, it tends to make the predictions quite low. I plan to fix this in a future article where I handle this time series data correctly. But first I wanted to address a few other things before proceeding.

I’ve placed the updated source code tfstocksdiff13.py on my Google Drive here.

Higher Level API

In the original code to create a layer in our Neural Network, we needed to define the weight and bias Tensors:

layer1_weights = tf.Variable(tf.truncated_normal(
      [NHistData * num_stocks * 2, num_hidden], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([num_hidden]))

And then define the layer with a complicated mathematical expression:

hidden = tf.tanh(tf.matmul(data, layer1_weights) + layer1_biases)

This code is then repeated with mild variations for every layer in the Neural Network. In the original code this was quite a large block of code.

In TensorFlow 1.3 there is now an API to do this:

hidden = tf.layers.dense(data, num_hidden, activation=tf.nn.elu,
        kernel_initializer=he_init,
        kernel_regularizer=tf.contrib.layers.l1_l2_regularizer(),
        name=name + "model" + "hidden1")

This eliminates a lot of repetitive variable definitions and error prone mathematics.

Also notice the kernel_regularizer=tf.contrib.layers.l1_l2_regularizer() parameter. Previously we had to process the weights ourselves to add regularization penalties to the loss function. Now TensorFlow computes the penalties for you, but you still need to extract the values and add them to your loss function.

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([tf.nn.l2_loss( tf.subtract(logits, tf_train_labels))] + reg_losses)

You can get at the actual weights and biases if you need them in a similar manner as well.

Better Initialization

Previously we initialized the weights using a truncated normal distribution. Back then the recommendation was to use random values to get the initial weights away from zero. However since 2010 (quite a long time ago) there have been better suggestions and the new tf.layers.dense() API supports these. The original paper was “Understanding the difficulty of training deep feedforward neural networks” by Xavier Glorot and Yoshua Bengio. If you ran the previous example you would have gotten an undefined variable error on he_init. Here is its definition:

he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")

The idea is that these initializers vary based on the number of inputs and outputs for the neuron layer. There is also tf.contrib.layers.xavier_initializer() and tf.contrib.layers.xavier_initializer_conv2d(). For this example with only two hidden layers it doesn’t matter so much, but if you have a much deeper Neural Network, using these initializers can greatly speed up training and avoid having the gradients either go to zero or explode early on.
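To show what “vary based on the number of inputs and outputs” means in practice, here is a minimal sketch of the variance scaling idea in plain Python. It is not TensorFlow’s implementation: the function name, the layer sizes and the simple formula stddev = sqrt(factor / n) are assumptions for illustration. With factor 2 and the input count it gives the He style of scaling, and with factor 1 and the average of inputs and outputs it gives the Glorot (Xavier) style.

```python
import math

def variance_scaling_stddev(fan_in, fan_out, factor=2.0, mode="FAN_IN"):
    """Standard deviation for a zero-mean weight initialiser (sketch only)."""
    if mode == "FAN_IN":
        n = fan_in
    elif mode == "FAN_OUT":
        n = fan_out
    elif mode == "FAN_AVG":
        n = (fan_in + fan_out) / 2.0
    else:
        raise ValueError("unknown mode: " + mode)
    return math.sqrt(factor / n)

# Hypothetical layer with 400 inputs and 100 outputs.
he = variance_scaling_stddev(400, 100, factor=2.0, mode="FAN_IN")
glorot = variance_scaling_stddev(400, 100, factor=1.0, mode="FAN_AVG")
print(he, glorot)
```

The wider the layer, the smaller the initial weights, which is what keeps the signal from growing or shrinking as it passes through many layers.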

Vanishing Gradients and Activation Functions

You might also notice I changed the activation function from tanh to elu. This is due to the problem of vanishing gradients. Since we are using Gradient Descent to train our system, any zero gradients will stop training improvement in that dimension. If you get large values out of the neuron then the gradient of the tanh function will be near zero and this causes training to get stalled. The relu function has a similar problem: if the value ever goes negative then the gradient is zero and again training will likely stall and get stuck there. One solution to this is to use the elu function or a “leaky” relu function. Below are the graphs of elu, leaky relu and relu.

Leaky relu has a low sloped linear function for negative values. Elu uses an exponential type function to flatten out a bit to the left of zero so if things go a bit negative they can recover. Although if things go more negative with elu, they will get stuck again. Elu has the advantage that it is rigged to be differentiable at 0 to avoid special cases. Practically speaking both of these activation functions have given very good results in very deep Neural Networks which would otherwise get stuck during training with tanh, sigmoid or relu.
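The gradient behaviour described above is easy to check numerically. Here is a small sketch in plain Python that defines the three activation functions and their derivatives, along with the derivative of tanh for comparison. The leaky slope of 0.01 and alpha of 1.0 are common choices that I am assuming here, not values taken from my model.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# Columns: input, then the gradients of tanh, relu, leaky relu and elu.
for x in (-5.0, -0.5, 0.5, 5.0):
    print(x, tanh_grad(x), relu_grad(x), leaky_relu_grad(x), elu_grad(x))
```

For slightly negative inputs relu has no gradient at all, leaky relu keeps a small constant one and elu still has a healthy one. For large negative inputs the elu gradient fades away too, which is the “stuck again” case.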

Scaling the Input Data

Neural networks work best if all the data is between zero and one. Previously we didn’t scale our data properly and just did an approximation by dividing by the first value. All that code has been deleted and we now use SciKit Learn’s MinMaxScaler object instead. You fit the scaler using the training data and then use it to transform any data you process. The code for us is:

from sklearn.preprocessing import MinMaxScaler

# Scale all the training data to the range [0,1].
scaler = MinMaxScaler(copy=False)
scaler.fit(train_dataset)
scaler.transform(train_dataset)
scaler.transform(valid_dataset)
scaler.transform(test_dataset)
scaler.transform(final_row)

The copy=False parameter basically says to do the conversion in place rather than producing a new copy of the data.
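For anyone curious what the scaler is doing underneath, here is a minimal sketch of min-max scaling in plain Python. It is not SciKit Learn’s code, the helper names and the sample numbers are made up, and it assumes no column is constant (which would divide by zero). The point it illustrates is that the ranges come from the training data only, so later data can land outside zero to one.

```python
def fit_min_max(rows):
    """Return per-column (min, max) learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def transform(rows, ranges):
    """Scale each column with the ranges learned from the training data."""
    return [[(value - lo) / (hi - lo) for value, (lo, hi) in zip(row, ranges)]
            for row in rows]

train = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]
test = [[40.0, 250.0]]   # 40.0 is above anything seen in training

ranges = fit_min_max(train)
print(transform(train, ranges))
print(transform(test, ranges))
```

The first column of the test row comes out above one because it is larger than the training maximum, which is worth remembering with stock prices that trend upwards over time.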

SciKit Learn has a lot of useful utility functions that can greatly help with using TensorFlow and are well worth looking at even if you aren’t using a SciKit Learn Machine Learning function.

Summary

The field of Neural Networks is evolving rapidly and the best practices keep getting better. TensorFlow is a very dynamic and quickly evolving tool set which can sometimes be a challenge to keep up with.

The main learnings I wanted to share here are:

  • TensorFlow’s high level APIs
  • More sophisticated initialization like He Initialization
  • Avoiding vanishing gradients with elu or leaky ReLU
  • Scaling the input data to between zero and one

These are just a few of the new things that I could incorporate. In the future I’ll address how to handle time series data in a better manner.

Written by smist08

October 16, 2017 at 9:42 pm

Components Leading to Strong AI

with one comment

Introduction

There have been a lot of advances in AI in the past couple of years. A lot of these advances are better simulating the various functions of the brain. These include the convolutional neural networks which are very good at image recognition and new techniques to incorporate memory into neural networks.

Very Deep Neural Networks

In the early days of Neural Networks, finding the weights for the connections was very difficult and often performed by hand. Then the gradient descent algorithm came along and allowed bigger Neural Networks to be trained. Then in 1986 a groundbreaking paper by D. E. Rumelhart, G. E. Hinton and R. J. Williams showed how to use back propagation to train a multi-level Neural Network with Gradient Descent. However the shape of the surface that is being optimized is often very ill suited to this algorithm, containing many local minima, or more usually being very flat and not indicating the direction to take. Plus depending on the problem the training data may contain lots of errors that can mislead the training process.

With recent tweaks to the training algorithms, researchers have managed to train very deep Neural Networks. For instance the Oxford Visual Geometry Group (VGG) has released a pre-trained 19 layer Neural Network for image recognition.

This is a great building block for other image manipulation projects like Image Style Transfer that we looked at previously.

Now these Neural Networks are starting to resemble the architecture and structure of biological Neurons in the human brain such as the following from the human cortex.

This shows that we are starting to accurately simulate the computational engine in our brains.

The Road to Memory

Although the deep neural networks in the last section are very large and powerful at some problems, they fail at other problems, primarily due to a lack of memory or context. For instance if you are translating text word by word, you need to remember the previous words in the sentence to get a correct translation based on the context. Or you need to do a first pass word by word and then, knowing the whole, correct mistakes based on now knowing more generally what is being said. Similarly as an algorithm deals with the world, it should learn about the world as it explores and gathers more information. Just retraining the whole Neural Network for each bit of new information is very inefficient.

For language translation and speech recognition the use of Recurrent Neural Networks (RNNs) has proven quite effective. In these the outputs from Neurons can feed into the inputs of the same layer or into the inputs of previous layers. The networks of the previous section were all feed forward Neural Networks since the output of a layer only feeds the input of the next layer. RNNs aren’t true non feed forward networks since they don’t iterate to find a solution with everything stabilized. Rather, the outputs from use n go into the inputs of use n+1. In this way these act as a sort of memory from usage to usage, allowing the network to preserve some context from say word to word in translation.
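The idea of outputs from use n feeding into use n+1 can be sketched with a single recurrent unit in plain Python. This is a toy with made-up weights and hypothetical function names, not a trained network, but it shows how an input at the first step still influences the state several steps later.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a single-unit recurrent cell (weights are made up)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run(sequence):
    h = 0.0                      # empty memory before the first input
    states = []
    for x in sequence:
        h = rnn_step(x, h)       # output of use n becomes an input of use n+1
        states.append(h)
    return states

print(run([1.0, 0.0, 0.0]))   # the first input echoes through later steps
print(run([0.0, 0.0, 0.0]))   # no input, so nothing to remember
```

In the first run the later inputs are zero, yet the state stays positive and slowly decays. That fading echo is the “sort of memory” a plain RNN provides.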

More recent research has led to Neural Networks that can actually have memory banks. These include Long Short-Term Memory Cells (LSTM Cells) and Gated Recurrent Unit (GRU) Cells.

These artificial neurons have the ability to store memory values (as well as forget memory values). The key difficulty in adding memory to Neural Networks was in how to train them. Gradient Descent and all its variations require that the function being optimized is differentiable or very nearly so. Putting things in memory, reading memory and erasing memory are very discrete functions. These sorts of functions are not differentiable and can’t be patched since they are flat with zero derivative elsewhere. Something with a zero derivative doesn’t give any information to Gradient Descent as to which direction to go. The solution to this was to replace the discrete functions with probability distributions that are differentiable. So rather than say put something in memory, the function gives you a probability that you should put the value in memory and then you do so if say the probability is greater than 50%.
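Here is a small sketch in plain Python of why the discrete version can’t be trained and the smooth version can. The function names and numbers are hypothetical. The hard write stores the candidate only when its control value z is positive, which is the same as a sigmoid probability above 50%. The soft write uses the sigmoid value as a gate that blends the old and new values, which is the style of gating LSTM and GRU cells use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hard_write(memory, candidate, z):
    """Discrete write: store the candidate only when z is positive."""
    return candidate if z > 0 else memory

def soft_write(memory, candidate, z):
    """Differentiable write: blend old and new values using a gate in (0, 1)."""
    gate = sigmoid(z)
    return gate * candidate + (1.0 - gate) * memory

def numeric_grad(f, z, eps=1e-4):
    """Central difference estimate of the derivative of f at z."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

memory, candidate = 0.0, 1.0
hard = numeric_grad(lambda z: hard_write(memory, candidate, z), 2.0)
soft = numeric_grad(lambda z: soft_write(memory, candidate, z), 2.0)
print(hard, soft)
```

The hard write has a zero derivative with respect to z, so Gradient Descent learns nothing about when to write. The soft write has a non-zero derivative, so the decision itself becomes trainable.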

Learning

I think the current tools for training Neural Networks work quite well for deep feedforward Neural Networks. I think they do a good job of training the weights to use in the various network layers. However I don’t think they provide a good solution for training systems with memory. The brain probably uses some process similar to what we do to train the input weights and outputs to biological Neurons, such as Hebbian Learning. However I don’t think this is what is used to decide whether to remember something or not. I think we still have a long way to go before effectively using memory in our Neural Networks even though just a little bit of memory is greatly improving our translators, speech and text recognition programs.

Summary

The field of Neural Networks is making great progress. This is due to advances in refining the training process of deep Neural Networks along with advances in making artificial Neurons more sophisticated by adding elements like memory banks. Combine this with the fast pace of development of GPUs allowing essentially low cost supercomputers for training and running these networks and the large amount of venture capital that is flowing into anything AI related and we are seeing a true renaissance in the AI field.

Does someone have a true deep AI running in their lab already? Perhaps, but if they don’t, I think we are starting to get quite close.

Written by smist08

September 29, 2017 at 9:12 pm

Playing with Image Style Transfer

leave a comment »

Introduction

Last time we introduced Image Style Transfer, an AI algorithm that combines the contents of one image with the style of another image. In this article we are going to look at some ways to play with this process in more advanced ways. We are going to play with Anish Athalye’s implementation which is on GitHub here. This implementation is really good at allowing lots of tuning and playing.

Playing around this way is quite time consuming since you have to run Gradient Descent to find the solution, rather than just applying canned solutions. Since I ran all these on an older MacBook Air with no GPU, I had to use a lower resolution in the interest of time. At lower resolution (the MacOS’s small size) it took about an hour for each image. At medium resolution, it took about six hours to generate an image. This is ok for running overnight but doesn’t allow a lot of play. Makes me wonder if I should get a beefy desktop computer with a good NVidia GPU?

I found a really good YouTube video explaining Image Style Transfer here which is well worth a watch.

Playing with Algorithms

We’ve seen in previous articles how we can play with the tunable parameters in AI algorithms to get quite different results. Here we’ll look at the effects of playing with some parameters as well as fiddling with the algorithm itself.

The basic observation that led to Image Style Transfer was that a deep image recognition neural network extracts the features related to content in the lower layers and the features related to style in the higher layers. Interestingly the human brain’s image recognition neurons appear to be structured in the same sort of way and it is believed there is a fair bit of similarity between how an advanced image recognition algorithm works and how the brain works. This separation of content from style is then the basis for merging and manipulating these.

The Image Style Transfer algorithm works by starting with an image of white noise and then iterating it using stochastic gradient descent to minimize the difference between the content in one image and the style in the other. This is the loss function we often talk about in AI. The interesting part of the algorithm is that we aren’t training the neural network matrix weights, since these are pre-done by the VGG group, but we are training the input image. So we have a loss function like:

Total Loss = Loss of content from first image + Loss of style from the second image

We can then play with this Loss function in various ways which we’ll experiment with in the rest of this article.

Apply Some Weights

Usually in Machine Learning algorithms we apply weights everywhere that we can use to tune things. The same applies here. We can weight the contributions from content versus style in the total loss formula to achieve more of a contribution from style or content.
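As a minimal sketch of what weighting the loss means, here it is as a plain Python function. The loss values are made-up numbers standing in for what the network would compute, and the style weight of 500 is an assumption for illustration only.

```python
def total_loss(content_loss, style_loss, content_weight=5.0, style_weight=500.0):
    """Weighted sum of the two loss terms; the default weights are illustrative."""
    return content_weight * content_loss + style_weight * style_loss

# Made-up raw losses for one candidate image.
content_loss, style_loss = 2.0, 0.03

for cw in (1.0, 5.0, 25.0):
    loss = total_loss(content_loss, style_loss, content_weight=cw)
    share = cw * content_loss / loss   # fraction of the loss due to content
    print(cw, loss, round(share, 3))
```

As the content weight goes up, content makes up a bigger share of the total, so Gradient Descent works harder on matching the content image and less on matching the style.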

First we take a picture of Tetrahedron Peak and combine it with Vincent van Gogh’s Starry Night using the default settings of the algorithm:

Now we can try playing with the weight of the content contribution. Lower means more style, higher means more content. In the image above the content weight was the default of 5.

Notice the image on the left is much more abstract with the large stars appearing all over.

Using Multiple Styles

Last time we used one style at a time to get our result. But you can actually use the algorithm to incorporate multiple styles at once. In this case we just generalize the Loss function above as:

Total Loss = Loss of content from first image + Loss of style from style image 1 +
                 Loss of style from style image 2

Of course we can then further generalize this to any number of style images.
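The generalized loss can be sketched in a few lines of plain Python. The function names and loss values are hypothetical, and normalising the blend weights so they sum to one is my assumption about a sensible way to do it, so that weights like 3 and 1 mean 75% and 25%.

```python
def blended_style_loss(style_losses, blend_weights):
    """Weighted sum of per-style losses, with blend weights normalised to 1."""
    total = sum(blend_weights)
    return sum(w / total * loss for w, loss in zip(blend_weights, style_losses))

def total_loss(content_loss, style_losses, blend_weights):
    return content_loss + blended_style_loss(style_losses, blend_weights)

# Made-up losses against two style images.
style_losses = [4.0, 8.0]

print(total_loss(1.0, style_losses, [1, 1]))   # equal blend of both styles
print(total_loss(1.0, style_losses, [3, 1]))   # 75% first style, 25% second
```

Adding a third or fourth style image is just a matter of making both lists longer.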

We’ll use our Starry Night combination and also use Picasso’s Dora Maar:

Now we will use both pictures for the style and see what we get:

This weights the styles of Starry Night and Dora Maar equally. However you can see from the Loss formula that we can easily weight the components and get say 75% Starry Night and 25% Dora Maar:

 

Now if we reverse the weights and do Starry Night at 25% and Dora Maar at 75%:

Playing with the Neural Network

We can also play with the Neural Network used. We can change a number of parameters in the Neural Network as well as introduce various scaling and weight factors.

Pooling Type

For instance there are things called Pooling Layers in the network. These reduce the resolution of the image and help with moving from fine level details to higher level abstractions. There are two commonly used types of pooling layers, namely average pooling and max pooling. We can try either of these to see what effect that might have on the image style transfer.

Here we see that average pooling favoured fine details and preserved more of the content image. Whereas max pooling used more of the style image and is a bit more abstract.
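To show the difference between the two pooling types, here is a minimal sketch in plain Python that pools a tiny made-up 4x4 “image” over non-overlapping 2x2 blocks. Real networks do this on large tensors, and the helper names here are hypothetical.

```python
def pool2x2(image, reduce_fn):
    """Apply reduce_fn to each non-overlapping 2x2 block of a 2D list."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            block = [image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1]]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

image = [
    [1.0, 3.0, 0.0, 0.0],
    [5.0, 7.0, 0.0, 8.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
print(pool2x2(image, max))       # keeps only the strongest value per block
print(pool2x2(image, average))   # smooths each block into its mean
```

Look at the top right block: max pooling keeps the single bright pixel, while average pooling spreads it out. Every pixel contributes to the average, which fits with average pooling preserving more fine detail.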

Exponential Style Layer Weight

Another thing we can do is magnify some layers over others. For instance we can magnify each style layer over the last one as follows:

weight(layer<n+1>) = weight_exp*weight(layer<n>)

The default is 1 (i.e. none). Here is Tetrahedron Peak using 0.2 and 2.0.

A factor less than one means more original content since some style layers are suppressed, and a factor greater than one magnifies some style layer contributions. Since the style layers aren’t all weighted the same, this is a bit different than just changing the weighting factor between content and style.
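The layer weight formula above is just a geometric progression, which this small plain Python sketch makes explicit. The function name and the choice of five style layers are assumptions, and normalising the weights so they sum to one is also my assumption, made so the different factors are easy to compare.

```python
def style_layer_weights(num_layers, weight_exp=1.0):
    """Geometric layer weights, normalised to sum to 1 (normalisation assumed)."""
    raw = [weight_exp ** i for i in range(num_layers)]
    total = sum(raw)
    return [w / total for w in raw]

for factor in (0.2, 1.0, 2.0):
    print(factor, [round(w, 3) for w in style_layer_weights(5, factor)])
```

With a factor of 0.2 nearly all the weight sits on the first style layer, with 1.0 every layer counts equally, and with 2.0 the last layers dominate.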

Iterations

Another parameter that is fun to play with is the number of iterations that Gradient Descent runs for. Below we can see a sequence of images as the number of iterations is increased. We can see the content and style of the image forming out of the initial white noise.

At this resolution we are pretty much converged at 500 iterations; however, for higher resolution and more complicated images, more iterations might be necessary. We could also use a stopping criterion, such as stopping when the loss function changes by less than some delta, rather than using a fixed number of iterations.
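Here is a minimal sketch of that kind of stopping criterion, using plain gradient descent on a toy one-variable loss rather than a real style transfer loss. The function name, the learning rate and the delta value are all assumptions chosen for illustration.

```python
def minimize(loss_fn, grad_fn, x, learning_rate=0.1,
             max_iterations=1000, delta=1e-6):
    """Run gradient descent until the loss changes by less than delta.

    Returns (x, loss, iterations_used). max_iterations is a safety cap
    in case the loss never settles.
    """
    previous = loss_fn(x)
    for iteration in range(1, max_iterations + 1):
        x = x - learning_rate * grad_fn(x)
        current = loss_fn(x)
        if abs(previous - current) < delta:
            return x, current, iteration  # converged early
        previous = current
    return x, previous, max_iterations


# Toy problem: the minimum of (x - 3)^2 is at x = 3.
def loss(x):
    return (x - 3.0) ** 2


def grad(x):
    return 2.0 * (x - 3.0)


best_x, best_loss, used = minimize(loss, grad, x=0.0)
```

On this toy problem the loop stops long before the cap of 1000 iterations, which is the same saving we would hope for on a low resolution image.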

This problem converges quite well since it is mathematically well defined. Often in AI, we don’t get this good behaviour because the training data has lots of errors and/or lots of noise. Here we are just training against a content picture and one or more style pictures, so by definition there isn’t any erroneous data. These challenges would have been faced and solved by the team developing the VGG image recognition neural network that we get to just use and don’t have to worry about training.

Summary

As we can see, we can get quite a few different effects by tuning the algorithm while using the same style picture as a reference. Simple tools like Prisma or deepart.io don’t let you play with all these parameters. As a photographer who is trying to get a specific effect, you want the power and flexibility to tune your style transfer exactly. Right now the only way to do this is to run the AI algorithms on your computer and play with them, which is very time consuming. I suspect once this technology is incorporated in more advanced tools then various degrees of tuning will be possible. Adobe has been demonstrating Image Style Transfer in their labs, and it will be interesting to see if they incorporate it into Photoshop and then how much tuning is possible. Also if it runs in the Adobe Creative Cloud, it will be interesting to see whether it’s quicker running that way than running on your own computer.

 

Written by smist08

August 21, 2017 at 4:29 pm

An Introduction to Image Style Transfer

with 2 comments

Introduction

Image Style Transfer is an AI technique that is becoming quite popular for enhancing or stylizing photos. It takes one picture (often a classical painting) and then applies the style of that picture to another picture. For example, I could take this photo of the Queen of Surrey passing Hopkins Landing:

I could then combine it with the style of Vincent van Gogh’s Starry Night:

Feeding both through the AI algorithm gives:

In this article, we’ll look at some of the ways you can accomplish this yourself, either by using online services or by running your own Neural Network with TensorFlow.

Playing with Image Style Transfer

There are lots of services that let you play with this. Generally, applying a canned style to your own picture is quite fast (a few seconds). Providing your own photo as the style photo is more involved, since it involves “training” the style and this can take 30 minutes (or more).

Probably the most popular program is the Prisma app for either iPhone or Android. This app has a large number of pre-trained styles and can apply any of them to any photo on your phone. This app works quite well and gives plenty of variety to play with. Plus it’s free. Here is the ferry in Prisma’s comic theme:

If you want to provide your own photo as the style reference then deepart.io is a good choice. This is available as a web app as well as either an iPhone or Android app. The good part about this for photographers is that you can copy photos from your good camera to your computer and then use this program’s website, no phone required. This site has some pre-programmed styles based on Vincent van Gogh which work really quickly and produce good results. Then it has the ability to upload a style photo. Processing a style is more work and typically takes 25 minutes (you can pay to have it processed quicker, but not that much quicker). If you don’t mind the wait this site is free and works quite well. Here is an example of the ferry picture above van Gogh’ized by deepart.io (sorry they don’t label the styles so I don’t know which painting this is styled from):

Playing More Directly

These programs are great fun, but I like to tinker with things myself on my computer. So can I run these programs myself? Can I get the source code? Fortunately the answer to both is yes. This turns out to be a bit easier than you first might think, largely due to a project out of the Visual Geometry Group (VGG) at the University of Oxford. They created an exceptional image recognition neural network that they trained and won several competitions with. It turns out that the backbone to doing Image Style Transfer is to have a good image recognition Neural Network. This Neural Net is 19 layers deep and Oxford released the fully trained network for anyone to use. Several people have then taken this network, figured out how to load it into TensorFlow and created some really good Image Style Transfer programs based on this. The first program I played with was Anish Athalye’s program posted on GitHub here. This program uses VGG and can train a neural network for a given style picture. Anish has quite a good write up on his blog here.

Then I played with a program by Shafeen Tejani that expanded on Anish’s, which is on GitHub here along with a blog post here. This program lets you keep the trained network so you can perform the transformation quickly on any picture you like. This is similar to how Prisma works. The example up in the introduction was created with this program. To train the network you require a training set of images like the Microsoft COCO collection.

Running these programs isn’t for everyone. You have to be used to running Python programs and have TensorFlow installed and working on your system. You need a few other dependent Python libraries and of course you need the VGG saved Neural Network. But if you already have Python and TensorFlow, I found both of these programs just ran and I could play with them quite easily.

The writeups on all these programs highly recommend having a good GPU to speed up the calculations. I’m playing on an older MacBook Air with no GPU and was able to get quite good results. One trick I found that helped is to play with reduced resolution images to help speed up the process, then run the algorithm on a higher resolution version when you have things right. I found I couldn’t use the full resolution from my DSLR (12meg), but had to use Apple’s “large” size (286KB).
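If you want to shrink images yourself before experimenting, the only fiddly part is keeping the aspect ratio. Here is a small sketch of that calculation; the function name and the sample sizes are hypothetical, and you would pass the result to whatever image library you use for the actual resize.

```python
def scaled_size(width, height, max_side):
    """Return (width, height) shrunk so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 4000x3000 photo reduced for quick experiments.
preview_size = scaled_size(4000, 3000, 1000)
```

Once the settings look right at the preview size, the same function with a larger max_side gives the dimensions for the final run.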

Summary

This was a quick introduction to Image Style Transfer. We are seeing this in more and more places. There are applications that can apply this same technique to videos. I expect this will become a standard part of all image processing software like Photoshop or GIMP. It also might remain the domain of specialty programs, as HDR has, since it is quite technical and resource intensive. In the meantime projects like VGG have made this technology quite accessible for anyone to play with.

Written by smist08

August 14, 2017 at 6:48 pm