Stephen Smith's Blog

Musings on Machine Learning…

The Brain’s Operating System

with one comment


Last time we posited that the human brain, and in fact any biological brain, is really a type of computer. If that is so, then how is it programmed? What is its operating system? What are the algorithms that facilitate learning and intelligent action? In this article we'll start to look at some of the properties of this biological operating system and how it compares to a modern computer operating system like Windows.


You can install Windows from a DVD that contains about 3.8 gigabytes of information. You install this on a computer whose construction required a huge specification of its own, covering the microprocessor, BIOS, memory and all the other components. So the amount of information required to specify the combined computer and operating system is far greater than 3.8 gigabytes.

The human genome, which is like the DVD for a human, is used both to construct the physical human and to provide any initial programming for the brain. This human genome contains only 3.2 gigabytes of information, and most of it is used to build the heart, liver, kidneys, legs, eyes and ears as well as the brain itself, leaving little room for any initial information to store in the brain.

Compared to modern computer operating systems, video games and ERP systems, this is an amazingly compact specification for something as complex as a human body.

This is partly why higher mammals like humans require so much learning as children. The amount of initial information we are born with is very small and limited to the basest survival requirements like knowing how to breathe and eat. Plus perhaps a few primal reflexes like a fear of snakes.


Windows runs on very reliable hardware and yet still crashes. The equivalent of a Blue Screen of Death (BSOD) in a human or animal would likely be fatal in the wild. Sure, in diseases like epilepsy an epileptic fit could be considered a biological BSOD, but in most healthy humans this doesn't happen. You could also argue that if you reboot your computer every night your chance of a BSOD is much lower; similarly, a human who sleeps every night works much more reliably than one who doesn't.

The brain has a further challenge: neural cells are dying all the time. It's estimated that about 9000 neurons die each day, yet the brain keeps functioning quite well in spite of this. It was originally thought that we were born with all our neurons and that after that they just died off, but more modern research has shown that new neurons are in fact produced, though not uniformly across the brain. Imagine how well Windows would run if 9000 transistors stopped working in your CPU every day and a few random new transistors were added every now and then. How many BSODs would this cause? It's a huge testament to the brain's operating system that it can run so reliably for so long under these conditions.

It is believed that the brain's algorithms must be quite simple to operate under these conditions, and able to easily adapt to changing configurations. Interestingly, in training artificial neural networks it has been found that randomly removing and adding neurons during the training process reduces overfitting and produces better results, even on perfectly reliable hardware. We talked about this "dropout" technique in this article about TensorFlow. So to some degree, perhaps this "unreliability" actually led to better intelligence in biological systems.
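As a rough sketch of the dropout idea (the layer values and rate below are made up for illustration, and real frameworks do this inside the training loop): during training each unit is randomly zeroed with some probability, and the survivors are scaled up so the expected total is unchanged.

```python
import random

def dropout(activations, rate, training=True):
    """Randomly zero a fraction `rate` of activations during training,
    scaling the survivors so the expected sum is unchanged
    (this is the common "inverted dropout" formulation)."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(42)
layer = [0.5, 1.2, -0.3, 0.8, 0.1, 2.0]
print(dropout(layer, rate=0.5))                   # training: some units zeroed, survivors scaled by 2
print(dropout(layer, rate=0.5, training=False))   # inference: activations pass through unchanged
```

At inference time dropout is turned off entirely; the scaling during training is what keeps the two modes statistically consistent.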


Most modern computers are roughly based on the von Neumann architecture, and when they do support parallelism it is through multiple von Neumann computers synchronized together (i.e. with multiple cores). The neurons in the brain don't follow this architecture; they operate much more independently and in parallel than the logic gates in a computer do. This is more to do with the skill of the programmer than a constraint on how computer hardware is put together. As a programmer it's hard enough to program and debug a computer that executes one instruction at a time in a simple linear fashion with simple flow-of-control statements. Programming a system where everything happens at once is beyond my ability as a programmer. Part of this comes down to economics. Biology programmed the brain's algorithms over millions of years using trial and error driven by natural selection. I'm required to produce a new program in usually less than one year (now three months in Internet time). Obviously if I put in a project plan to produce a new highly parallel super ERP system in even ten years, it would be rejected as too long, never mind taking a million years.

The brain does have synchronization mechanisms. It isn’t totally an uncontrolled environment. Usually as we study biological systems, at first they just look mushy, but as we study in more detail we find there is an elegant design to them and that they do tend to build on modular building blocks. With the brain we are just starting to get into understanding all these building blocks and how they build together to make the whole.

In studying more primitive nervous systems (without brains), there are two typical simple neural connections: one is a sensor connecting directly to a single action, and the other is a sensor connected to a coordinator that does something like activate two flippers together to swim straight. These simple coordinators are the start of the brain's synchronization mechanisms that lead to much more sophisticated coordinated behaviour.

Recurrence and Memory

The brain is also recurrent: it feeds its output back in as input, iterating in a sense to reach a solution. There is also memory, though the brain's memory is a bit different from a computer's. Computers have a memory bank that is separate from the executing program; the executing program accesses the memory, but these are two separate systems. In the brain there is only one system, and the neurons act as both logic gates (instruction executors) and memory. To some degree you can do this in a computer, but it's considered bad programming practice. As programmers we are trained to never hard-code data in our programs, and doing so will be pointed out to us at any code review. The brain, however, does this as a matter of course. But unlike our computer programs, which need a programmer assigned to do a maintenance update, the brain can change this embedded data dynamically whenever it likes.
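This feed-output-back-as-input idea can be illustrated with a toy example (not a model of real neurons, just the iteration pattern): apply a function to its own output until the answer stops changing.

```python
import math

def iterate_to_fixed_point(f, x, tol=1e-10, max_steps=1000):
    """Feed the output back in as input until successive values
    agree to within `tol` -- iterating toward a stable answer."""
    for _ in range(max_steps):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

# cos(x) = x has a single attracting fixed point near 0.739085
# (the "Dottie number"); any starting value converges to it.
print(iterate_to_fixed_point(math.cos, 1.0))
```

The interesting property, loosely analogous to recurrent networks settling on an interpretation, is that the loop converges to the same stable answer from a wide range of starting points.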

This leads to another fundamental difference between computers and humans: we update computers by re-installing software. The brain is never reinstalled; what you get initially is what you live with. It can learn and adapt, but never be reinstalled. This eliminates one of tech support's main solutions to problems, namely reinstalling the operating system. The other tech support standby, turning the computer off and on again, is quite difficult with biological brains and involves a defibrillator.


This article was an initial look at the similarities and differences between the brain and a modern computer. We looked at a few properties of the brain’s operating system and how they compare to something like Windows. Next time we’ll start to look at how the brain is programmed, namely how we learn.



Written by smist08

June 16, 2017 at 6:52 pm

Is the Brain Really a Computer?

with 2 comments


There is a lot of debate about whether the human brain is really a computer, something more than a computer, or something quite different from a computer. In this article I'm going to look at some of these arguments, many of which posit behaviours of the brain that are claimed to be impossible for a computer to exhibit.

Some of the arguments tend to be based on a need for humans to somehow be special, similar to the passion of the people who stuck to the idea that the Earth was the center of the universe because we were somehow special, and who couldn't bear the idea that we were located on one insignificant planet orbiting one of billions of suns in our galaxy, in a universe of billions of galaxies.

Other arguments are based around human behaviours like humour, saying it would be impossible to program a computer to create or really appreciate humour.

We’ll look at some of these arguments and consider them in the context of what we’ve been looking at in complex emergent behaviour of simple iterated systems.

The Brain Looks Like a Computer

As biologists study the workings of the brain, it appears very structurally similar to a modern computer, in the sense that a neuron has a number of inputs through synapses and dendrites that conduct the input signals into the cell body, which then performs a summing and limiting function to decide whether to fire an output signal through the axon to feed into other neurons. This structure is very similar to the basic logic gates that modern processing units are composed of. It is also a very simple and natural comparison, and often the simplest and most straightforward theory is the correct one.
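The summing-and-limiting behaviour described above can be sketched as the classic artificial neuron model (the weights and threshold here are made-up numbers chosen for illustration):

```python
def neuron(inputs, weights, threshold):
    """Sum the weighted inputs and 'fire' (output 1) only if the
    sum crosses the threshold -- a crude model of a biological
    neuron's summing and limiting function."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# With these particular weights the neuron behaves like an AND gate:
# only when both inputs fire does the sum (1.2) cross the threshold.
weights = [0.6, 0.6]
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", neuron([a, b], weights, threshold=1.0))
```

That a single threshold unit can reproduce a logic gate is the simplest version of the neuron/logic-gate comparison made in the paragraph above.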

Emotional Computers

One argument against the brain being a computer is that computers are logical and not emotional. How could a computer program be humorous? How could a computer program appreciate humor? How could a computer program ever be jealous? A lot of these arguments were once used to highlight how humans are different from animals, the claim being that animals never find anything funny or exhibit jealousy; that these are strictly human traits and show how we are special and fundamentally different from animals. However, modern animal research shows that animals do exhibit these behaviours and that we aren't special in these regards. In fact anyone who owns two or more dogs will certainly see a lot of jealousy exhibited, and any dog owner knows that dogs find some things exceedingly funny. I think the people who promote these ideas have really put on blinders, and have some deep-down need to be special, to avoid all the rather clear evidence to the contrary.

There is now a branch of AI that is looking to add emotion to computer systems, so that personal assistants can be humorous and can understand and take into account our emotional state so they can be better assistants. I tend to think that long term this forcing of emotion into chat-bots and such is unnecessary and that as these programs become more complex we will see emotions start to surface as emergent properties like some of the emergent behaviour we talked about here and here.

Quantum Complexity

Another argument is that the billions of neurons in the brain would amount to a computer if they worked only electrically and chemically, but that this wouldn't be good enough to produce human intelligence. The argument here is that neurons hide in their structure small constructs that operate at the quantum level, and that these combine to form some sort of new, much more powerful computing structure that might be like a computer or might not. If it is like a computer, then it's many orders of magnitude more complex than current computer hardware, so AI can't be anywhere close yet. Or else the quantum nature of these behaviours is beyond a Turing machine and much more powerful.

The problem with this argument is that neurons have been studied in great depth by biologists and nothing like this has been found. Further, neurons don't contain any way to network or communicate such quantum processes with other cells. We've also studied and simulated much simpler life forms that have just a few neurons and managed to accurately simulate their behaviour, indicating that we do have a fairly good idea of how neurons work.

I think these arguments tend to be blind to how complex a few billion neurons are already and how complex emergent properties from such a system can be.

Something Undiscovered

Perhaps a more religious argument is that there is some force or dimension that science hasn't discovered. Perhaps intelligence doesn't reside entirely in the brain but in something like a soul, and it's having this soul that leads to human-level intelligence. Thinkers started to unravel this argument back in the 1600s, when it was usually referred to as Cartesian Dualism. It is understood how the neurons in the brain control the body through our nervous system; the question becomes, how does the soul interact with or affect the brain?

What science has shown is that if the interaction were through a known force like electromagnetism or the weak nuclear force, then we would be able to detect and see it in action, and it has never been observed. What is then posited is that it must be via a force that science hasn't discovered yet. However, quantum field theory eliminates this possibility. There can certainly be undiscovered forces, but thanks to experiments in devices like the Large Hadron Collider, we know that any undiscovered force interacting with ordinary matter would be so powerful that we could detect it; the interactions would be like nuclear explosions going off (i.e. very hard to miss). This is because if a force interacts with a particle like an electron, then quantum field theory says you can produce the carrier particle for that force by crashing an electron into an anti-electron (positron) with sufficient energy. We've now done this with all the particles up to very high energy levels, enough to know there is no unknown low-energy force that could be doing this. Incidentally, this is basically the same argument used to show that life after death is impossible, because we would be able to detect the interaction at the point of death.


As biologists study the brain, it does appear that the brain acts like a computer. As our studies get more and more detailed, we are steadily eliminating the contending theories. Further, being a computer doesn't limit us in any way, because we know how complex and amazing emergent behaviour can be when simple systems are iterated.


Written by smist08

June 14, 2017 at 9:05 pm

Posted in Artificial Intelligence


Intelligence Through Emergent Behaviour – Part 2

with one comment


Last time we looked at a simple physical dynamical system, namely Taylor Couette fluid flow, where a very simple experiment led to more and more complicated solutions as a parameter was varied, namely the speed of the inner cylinder. In this article we are going to look at an example from mathematics and an example from computer science. What we are interested in is how very complicated behaviour results from very simply stated problems. We are looking for insights into how something as complicated as human intelligence can result from the simple behaviour of billions of our neurons. Or, for that matter, can we have intelligence arise from the simple behaviour of billions of logic gates in a modern computer? The amazing thing is that as we study nature we see this phenomenon more and more, whether it's fractals appearing in nature or the chaotic behaviour of ecological systems. It turns out that much of the richness and complexity of our environment can result from a few very simple rules.

The Mandelbrot Set

Everyone has seen fantastic images of the Mandelbrot Set, with ever more intricate detail revealed as you zoom in.

To see more detail along with more fabulous graphics, have a look at the Wikipedia article here.

The definition of the Mandelbrot Set is remarkably simple. It is the set of complex numbers c for which the quadratic map

Zn+1 = Zn^2 + c

starting from Z0 = 0, remains bounded forever. In the graphics, black represents the points that stay bounded, and the remaining points are colored by how fast they diverge. The Mandelbrot Set is a fractal, meaning that as you zoom in on it you see similar structures recurring at all magnifications. The key point is that we get this infinite complexity out of such a simple defining equation. We are used to simple formulas like quadratics leading to simple, predictable behaviour like parabolas. However, once you start iterating simple formulas you start getting this amazingly rich complexity appearing.
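The membership test is only a few lines of code. This is the standard escape-time check (the 100-iteration cutoff is an arbitrary choice; points very near the boundary need more iterations to classify):

```python
def in_mandelbrot(c, max_iter=100):
    """Return True if c appears to be in the Mandelbrot Set:
    iterate z -> z*z + c from z = 0 and check the orbit stays
    bounded. Once |z| exceeds 2 the orbit is guaranteed to escape."""
    z = 0
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False
    return True

print(in_mandelbrot(0))     # True:  0 -> 0 -> 0 ...
print(in_mandelbrot(-1))    # True:  0 -> -1 -> 0 -> -1 ... (period 2)
print(in_mandelbrot(1))     # False: 0 -> 1 -> 2 -> 5 -> 26 ... escapes
```

Colored renderings simply record, for each escaping c, how many iterations it took before |z| exceeded 2.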

The Game of Life

Another well-known example of complex behaviour arising from very simple rules is John Conway's Game of Life. The definition of the game is quite simple, so I'll just quote the Wikipedia entry:

“The universe of the Game of Life is an infinite two-dimensional orthogonal grid of square cells, each of which is in one of two possible states, alive or dead, or “populated” or “unpopulated”. Every cell interacts with its eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:

1. Any live cell with fewer than two live neighbours dies, as if caused by underpopulation.
2. Any live cell with two or three live neighbours lives on to the next generation.
3. Any live cell with more than three live neighbours dies, as if by overpopulation.
4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.

The initial pattern constitutes the seed of the system. The first generation is created by applying the above rules simultaneously to every cell in the seed—births and deaths occur simultaneously, and the discrete moment at which this happens is sometimes called a tick (in other words, each generation is a pure function of the preceding one). The rules continue to be applied repeatedly to create further generations.”

For lots of examples and images have a look at the full Wikipedia article here, which includes animations such as a fixed pattern spawning gliders that fly off diagonally.

Some games just die out, others are extremely chaotic. Some people tried to "solve" the Game of Life, i.e. given an initial configuration, find a formula that predicts what will happen without having to run the full simulation. But it was shown that you can actually build a Universal Turing Machine inside the game, and hence solving this in general is equivalent to the Halting Problem and is therefore impossible.
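The four rules quoted above translate almost directly into code. A common trick (an implementation choice, not part of Conway's rules) is to store only the live cells as a set of (x, y) coordinates, which sidesteps the infinite grid:

```python
from collections import Counter

def step(live):
    """Advance Conway's Game of Life one generation.
    `live` is a set of (x, y) coordinates of live cells."""
    # Count how many live neighbours every relevant cell has
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Rule 4 (birth on exactly 3) plus rules 1-3 (survival on 2 or 3)
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A "blinker": three cells in a row oscillate with period 2
blinker = {(0, 1), (1, 1), (2, 1)}
print(step(blinker))                      # the row flips to a column
print(step(step(blinker)) == blinker)     # True: back where it started
```

Everything else (gliders, glider guns, even the embedded Turing machine) comes out of repeatedly applying this one function.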


The Mandelbrot Set is one example of many where a very simple problem statement leads to infinitely complex solutions. The Game of Life shows how another simple statement leads to extremely complex and unpredictable behaviour. Further, since you can create a Universal Turing Machine in the Game of Life, the game is Turing complete, meaning you could perform any computation using the Game of Life. This partly shows why Turing's work is so influential, and how so many things that we may not think of as computers are in fact equivalent to computers.

Our brain consists of billions of neurons that follow very simple rules applied over and over in an iterative manner, and as we've seen, this can lead to very complicated, rich and stable patterns emerging. Our thesis, then, is that this is the foundation of intelligence.



Written by smist08

June 2, 2017 at 6:27 pm

Intelligence Through Emergent Behaviour – Part 1

with 2 comments


One of the arguments against Strong AI asks how computers can somehow break out of their programming to be creative, or how you could ever program a computer to be self-aware. The argument is usually along the lines that AIs are typically programmed with a lot of linear algebra (matrix operations) to form Neural Networks, or are programmed as lots of if statements as in Random Forests. These seem like very predetermined operations, so how could they ever produce anything creative, or anything beyond what the system was initially trained to do?

This article is going to look at how fairly simply defined systems can produce remarkably complex behaviours that go way beyond what you would imagine. This study started with the mathematical analysis of how physical systems can start with very simple behaviour and, as more energy is added, become more and more complex, resulting in what appears to be purely chaotic behaviour. But these studies show there is a lot of structure in that chaos, and that this structure is quite stable.

The arguments used against Strong AI also apply to the human brain, which consists of billions of fairly simple elements, namely our neurons, each of which performs a fairly simple operation, yet combined they yield our very complex human behaviour. This also helps explain the spectrum of intelligence as you go up the evolutionary chain from fairly simple animals to mammals to primates to humans.

Taylor Couette Flow

Taylor Couette Flow comes from fluid mechanics, where you have an experiment with water between two cylinders. Fluid mechanics may seem far away from AI, but this is one of my favourite examples of the transition from simple to complex behaviour, since it's what I wrote my master's thesis on long ago (plus there really is a certain inter-connectedness of all things).

Consider the outer cylinder stationary and the inner cylinder rotating.

At slow speeds the fluid next to the inner cylinder will move at the speed of that cylinder and the fluid next to the outer cylinder will be stationary, with the fluid speed varying smoothly (nearly linearly, for a narrow gap) in between, giving nice simple non-turbulent flow. The motion of the fluid in this experiment is governed by the Navier-Stokes equations, which generally can't be solved exactly, but in this case it can be shown that for slow speeds this is the solution and that this solution is unique and stable (to solve the equations you have to assume the cylinders are infinitely long to avoid end effects). Stable means that if you perturb the flow it will return to this solution after a period of time (i.e. if you mix it up with a spoon, it will settle down again to this simple flow).
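For what it's worth, this slow-speed laminar solution is easy to write down and check. The azimuthal velocity is the classic circular Couette profile v(r) = A·r + B/r, with A and B fixed by the no-slip conditions at the two walls (the radii and rotation speed below are made-up numbers for illustration):

```python
def couette_profile(r1, omega1, r2):
    """Laminar circular Couette flow: inner cylinder of radius r1
    spinning at angular speed omega1, outer cylinder of radius r2
    at rest. The azimuthal velocity is v(r) = A*r + B/r, where A
    and B come from the no-slip boundary conditions
    v(r1) = omega1*r1 and v(r2) = 0."""
    a = omega1 * r1**2 / (r1**2 - r2**2)
    b = -a * r2**2
    return lambda r: a * r + b / r

v = couette_profile(r1=1.0, omega1=2.0, r2=2.0)
print(v(1.0))   # ~2.0: fluid moves with the inner cylinder
print(v(2.0))   # ~0.0: fluid is stationary at the outer cylinder
```

For a narrow gap the 1/r term barely varies across the gap, which is why the profile looks essentially linear between the two walls.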

As you speed up the inner cylinder, at some point centrifugal force will become sufficient to cause fluid to flow outward from the inner cylinder and fluid to then flow inward to fill the gap. What is observed are called Taylor cells where the fluid forms what looks like cylinders of flow.

Again the Navier-Stokes equations are solvable, and we can show that now there are two new stable solutions (the second being rotation of the cells in the opposite direction) and that the original linear solution, although it still exists, is no longer stable. We call this a bifurcation: as we vary a parameter, new solutions to the differential equations appear.

As we increase the speed of the inner cylinder, we get further bifurcations where more and more, smaller, faster-spinning Taylor cells appear and the previous solutions become unstable. But past a certain point the structure changes again and we start getting new phenomena, for instance waves appearing.

And as we keep progressing we get more and more complicated patterns appearing.

But an interesting property is that the overall macro-structure of these flows is stable, meaning if we stir the fluid with a spoon, after it settles down it will appear the same at the macro level. This indicates the flow isn't totally random chaotic behaviour; there is a lot of high-level structure to this very complicated fluid flow. It can be shown that often these stable macro-structures in fact have a fractal structure, in which case we call them strange attractors.

This behaviour is very common in differential equations and dynamical systems where you vary a parameter (in our case the speed of the inner cylinder).
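A much simpler system shows the same cascade. The logistic map x → r·x·(1−x) is the standard textbook example of bifurcation as a parameter varies (it has nothing to do with fluid flow; it's just the cheapest system in which to watch the phenomenon). At low r the iteration settles to one value, then period-doubles into 2, then 4 values, and eventually becomes chaotic:

```python
def attractor(r, x=0.5, settle=1000, sample=64):
    """Iterate the logistic map x -> r*x*(1-x) past its transient,
    then count how many distinct values it keeps cycling through."""
    for _ in range(settle):
        x = r * x * (1 - x)
    seen = set()
    for _ in range(sample):
        x = r * x * (1 - x)
        seen.add(round(x, 6))
    return sorted(seen)

print(len(attractor(2.5)))   # 1 value: a stable fixed point
print(len(attractor(3.2)))   # 2 values: first period doubling
print(len(attractor(3.5)))   # 4 values: doubled again
print(len(attractor(3.9)))   # many values: chaos
```

Each period doubling here plays the role of a bifurcation in the Taylor Couette experiment: the old solution loses stability and a more complicated stable pattern takes over.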

If you are interested in some YouTube videos of Taylor Couette flow, have a look here or here.

What Does This Have to Do with Intelligence?

OK, this is all very interesting, but what does it have to do with intelligence? The point is that the Taylor Couette experiment is a very simple physical system that can produce amazing complexity. Brains consist of billions of very simple neurons and computers consist of billions of very simple transistor logic gates. If a simple system like Taylor Couette flow can produce such complexity then what is the potential for complexity beyond our understanding in something as complicated as the brain or computers?

In the next article we'll look at how this same emergence of complexity out of simplicity appears in computer programs, to start to see how it can lead to intelligent behaviour.



Written by smist08

May 22, 2017 at 6:03 pm

The Road to Strong AI

with 2 comments


There have been a great many strides in the field of Artificial Intelligence (AI) lately, with self driving cars becoming a reality, computers now routinely beating human masters at chess and go, computers accurately recognizing speech and even providing real time translation between languages. We have digital assistants like Alexa or Siri.

This article expands on some ideas in my previous article: "Some Thoughts on Artificial Intelligence". That article is a little over a year old, and I thought I might want to expand on some of its ideas here, mostly since I've been reading quite a few articles and books recently that claim this is all impossible and that true machine intelligence will never happen. I think there are always a large number of people who argue that anything that hasn't happened yet is impossible; after all, a large number of people still believed human flight was impossible after the Wright brothers had actually flown, and for that matter it's amazing how many people still believe the world is flat. Covering this all in one article is too much, so I'll start with an overview this time and then expand on some of the topics in future articles.

The Quest for Strong AI

Strong AI, or Artificial General Intelligence, usually refers to the goal of producing a true intelligence with consciousness, self-awareness and any other cognitive functions that a human possesses. This is the form of AI you typically see in science fiction movies. Weak AI refers to solving narrow tasks and appearing intelligent at doing them. Weak AI is what you are typically seeing with computers playing Go or chess, self-driving cars or machine pattern recognition. For practical purposes weak AI research is proving able to solve all sorts of common problems, and there are a great many algorithms that contribute to making this work well.

At this point Strong AI tends to be more a topic for research, but at the same time many companies are working hard on it, often, we suspect, in highly secretive labs.

Where is AI Today?

A lot of AI researchers and practitioners today consider themselves to be working on modules that will later be connected to build a much larger whole. Perhaps a good model for this is the current self-driving car, where people are working on all sorts of individual components, like vision recognition, radar interpretation, choice of what to do next, and interpreting feedback from the last action. All of these modules are then connected up to form the whole. A self-driving car makes a good model of what can be accomplished this way, but note that I don't think anyone would say a self-driving car has any sort of self-awareness or consciousness, even to the level of, say, a cat or dog.

Researchers in strong AI today are building individual components, for instance good visual pattern recognition that uses algorithms very similar to how neurologists have determined the visual cortex in the brain works. Then they are putting these components together on a "bus" and getting them to work together. At this point they are developing more and more modules, but they are still really working in the weak AI world and haven't figured out quite how to make the jump to strong AI.

The Case Against Strong AI

There have been quite a few books recently about why strong AI is impossible, usually arguing that the brain isn’t a computer, that it is something else. Let’s have a look at some of these arguments.

This argument takes a few different forms. One compares the brain to a typical von Neumann architecture computer, and I think it's clear to everyone that this isn't the architecture of the brain. But the von Neumann architecture was just a convenient way for us poor humans to build computers in a fairly structured way that weren't too hard to program. Brains are clearly highly parallel and distributed. However, Turing's work on computability does say that all Turing-complete computers can compute the same things, so a von Neumann computer could in principle be programmed for intelligence (if the brain is some sort of computer). But like all theoretical results, this says nothing about performance or practicality.

I recently read "Beyond Zero and One" by Andrew Smart, which seems to argue that machines can never hallucinate or do LSD and hence must somehow be fundamentally different from the brain. The book doesn't say what the brain is, if it isn't a computer, just that it can't be one.

I don't buy this argument. I tend to believe that machine intelligence doesn't need to fail the same way human brains fail when damaged, but at the same time we learn an awful lot about the brain by studying it when it malfunctions. It may turn out that hallucinations are a major driver of creativity, and that once we achieve a higher level of AI, AIs will in fact hallucinate, have dreams and exhibit the same creativity as humans. One theory is that LSD removes the filters through which we perceive the world and opens us up to greater possibilities; if this is the case, removing or changing filters is probably easier for AIs than for biological brains.

Another common argument is that the brain is more than a current digital computer, and is in fact a quantum computer of far greater complexity than we currently imagine: that it isn't chemical reactions that drive intelligence but quantum interactions, and that every neuron is really a quantum computer in its own right. I don't buy this argument at all, since the scale and speed of the brain exactly match those of the general chemical reactions we understand in biology, and the components of the brain are much larger than the electronic circuits where we start to see quantum phenomena.

A very good book on modern physics is "The Big Picture" by Sean Carroll. This book shreds a lot of the weird quantum-brain theories and also shows how a lot of the flakier theories (usually involving souls and such) are impossible under modern physics.

The book is interesting in that it explains very well the areas we don't understand, but it also shows how much of what happens at our scale (the Earth, the Solar System, etc.) is mathematically provable to be completely understood to a very high accuracy. For instance, if there were an unknown force that interacts with the brain, then we would be able to see its force-carrier particle when we crash antiprotons into protons or positrons into electrons. And since we haven't seen any such particle up to very high energies, if something unknown exists it would have to operate at the energy of a nuclear explosion.

Consciousness and Intelligence in Animals

I recently read "Are We Smart Enough To Know How Smart Animals Are?" by Frans de Waal. This was an excellent book highlighting how we humans often use our own prejudices and sense of self-importance to denigrate or deny the abilities of the "lesser" animals. The book contains many examples of intelligent behaviour in animals, including acts of reasoning, memory, communication and emotion.

I think the modern study of animal intelligence is showing that intelligence and self-awareness aren't just on/off attributes; there are levels and degrees. I think this bodes very well for machine intelligence, since it shows that many facets of intelligence can be achieved at a complexity far less than that inherent in a human brain.


I don’t recommend the book “Beyond Zero and One”, however I strongly recommend the books: “Are We Smart Enough to Know How Smart Animals Are?” and “The Big Picture”. I don’t think intelligence will turn out to be unique to humans and as we are recognizing more and more intelligence in animals, so we will start to see more and more intelligence emerging in computers. In future articles we will look at how the brain is a computer and how we are starting to copy its operations in electronic computers.




Written by smist08

May 16, 2017 at 7:49 pm

Posted in Artificial Intelligence


Playing the Kaggle Two Sigma Challenge – Part 5

with one comment


This posting concludes my coverage of Kaggle’s Two Sigma Financial Modeling Challenge. I introduced the challenge here and then blogged on what I did in December, January and February. I had planned to write this article after the challenge was fully finished, however Kaggle is still reviewing the final entries and I’m not sure how long that will take. I don’t know whether the delay is caused by Google’s purchase of Kaggle or just the nature of judging a competition where code is entered rather than data uploaded. I’ll go by what is known at this point, and if there are any further surprises I’ll post an update.

Public vs Private Leaderboards

During the competition you are given a dataset which you can download or run against in the Kaggle Cloud. You could train on all this data offline, or if you did a test run in the cloud it would train on half the data and test on the other half. When you submitted an entry it would train on this entire dataset and then test against a hidden dataset that we competitors never saw.

This hidden dataset was divided into two parts, ⅓ of the data was used to calculate our public score which was revealed to us and provided our score on the leaderboard during the competition. The other ⅔ was reserved for the final evaluation at the end of the competition which was called the private leaderboard score. The private leaderboard score was secret to us, so we could only make decisions based on the public leaderboard scores.

My Final Submissions

Your last step in the competition is to choose two submissions as your official entries. These are then judged on their private leaderboard scores (which you don’t know). If you don’t choose two entries, Kaggle selects your two highest-scoring ones. I just chose my two highest entries (based on the public leaderboard).

The Results Revealed

So after the competition closed on March 1, the private leaderboard was made public. This was quite a shock, since it didn’t resemble the public leaderboard at all. Kaggle hasn’t said much about the internal mechanisms of the competition, so some of the following is speculation, either on my part or from the public forums. It appears that the ⅓ – ⅔ split of the data wasn’t a random sample but a time-based split, and that market conditions were quite different in the later two thirds than in the first third. This led to quite different scores. I myself dropped from 71st place to 1086th place (out of 2071).
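A minimal sketch of the difference between a time-based and a random ⅓ – ⅔ split (the timestamp values here are made up for illustration):

```python
import numpy as np

# Hypothetical sketch contrasting a time-based split with a random split
# of a hidden test set into a public third and a private two-thirds.
# The timestamp range is invented for illustration.
rng = np.random.RandomState(0)
timestamps = np.arange(906, 1813)

# Time-based split: the public score comes from the earliest third only,
# so a market-regime change in the later two-thirds never shows up on
# the public leaderboard.
cutoff = timestamps[len(timestamps) // 3]
public_time = timestamps[timestamps < cutoff]
private_time = timestamps[timestamps >= cutoff]

# Random split: both portions sample every market regime.
shuffled = rng.permutation(timestamps)
public_rand = shuffled[: len(timestamps) // 3]
private_rand = shuffled[len(timestamps) // 3:]

print(len(public_time), len(private_time))
```

With the time-based cut, a model tuned to the public third can look far better (or worse) than it really is on the private portion, which matches the shake-up described above.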

What Went Wrong

At first I didn’t know what had gone wrong. I had lots of theories but had no way to test them. A few days later they revealed the private leaderboard scores for all our submissions so I could get a better idea of what worked and what didn’t. It appears that the thing that really killed me was using Orthogonal Matching Pursuit in my ensemble of algorithms. Any submission that included this had a much worse private leaderboard score than the public leaderboard score. On the reverse side any submission that used Ridge regression did better on the private leaderboard score than the public leaderboard.

Since I chose my two best entries, they were based on the same algorithm and got the same bad private leaderboard score. I should have chosen something quite different as my second entry and with luck would have gotten a much better overall score.

There is a tendency to blame overfitting when solutions do well on the public leaderboard but badly on the private one. But with only two submissions a day, and given the size of the data, I don’t think that was the case here. I think it’s more a matter of not having a representative test set, especially given how big the shake-up was.

What Worked

Before I knew the private leaderboard scores for all my submissions, I was worried the problem was caused by either offline training or my use of reinforcement learning, but these turned out to be ok. So here is a list of what worked for me:

  • Training offline was fine. It provided good results and fit the current Kaggle competition.
  • RANSAC did provide better results.
  • My reinforcement learning results gave equivalent improvements on both the public and private leaderboards.
  • Lasso and ElasticNet worked about the same for both leaderboards.
  • ExtraTreesRegressor worked about the same for both leaderboards.
  • Using current and old time series data worked about the same for both leaderboards.

My best private leaderboard submission was an ExtraTreesRegressor where it had added columns for last timestamp data on a few select columns. I then had several ensembles score well as long as they didn’t include Orthogonal Matching Pursuit.
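A hedged sketch of that kind of model on synthetic data (the column names are made up, since the real competition columns were anonymized): lagged "last timestamp" columns are added per instrument, then an ExtraTreesRegressor is fit.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical sketch: add "last timestamp" columns for a few selected
# features, then fit an ExtraTreesRegressor. Column names are invented;
# the real competition data was anonymized.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "id": np.tile(np.arange(10), 20),           # 10 instruments
    "timestamp": np.repeat(np.arange(20), 10),  # 20 time steps
    "feat_a": rng.randn(200),
    "feat_b": rng.randn(200),
    "y": rng.randn(200),
})

# For each instrument, shift the selected columns back one timestamp.
for col in ["feat_a", "feat_b"]:
    df[col + "_prev"] = df.groupby("id")[col].shift(1)
df = df.dropna()  # the first timestamp has no previous values

features = ["feat_a", "feat_b", "feat_a_prev", "feat_b_prev"]
model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(df[features], df["y"])
print(model.predict(df[features].head(5)))
```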

How the Winners Won

Several people who scored in the top ten revealed their winning strategies. A key idea from those with stock market experience was to partition the data based on an estimate of market volatility and then use a different model for each volatility range: for instance, one model when things are calm (small deltas) and a different model when volatile. This seemed to be a common theme. One competitor who did quite well divided the data into three volatility ranges and used Ridge regression on each. Another added the volatility as a calculated column and then used offline gradient boosting to generate the model. The very top people have so far kept their solutions secret, probably waiting for the next stock market competition to come along.
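A minimal sketch of the volatility-partitioning idea on synthetic data; the volatility estimate here is a crude stand-in, since the winners’ actual features and volatility measures weren’t published:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical sketch: bucket rows by an estimated volatility and fit
# one Ridge model per bucket. Data and the volatility proxy are synthetic.
rng = np.random.RandomState(1)
X = rng.randn(600, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(600)

# Crude stand-in for a volatility estimate: magnitude of the target.
vol = np.abs(y)
edges = np.quantile(vol, [0.0, 1 / 3, 2 / 3, 1.0])
buckets = np.digitize(vol, edges[1:-1])  # 0 = calm, 1 = medium, 2 = volatile

models = {}
for b in range(3):
    mask = buckets == b
    models[b] = Ridge(alpha=1.0).fit(X[mask], y[mask])

# At prediction time, route each row to the model for its bucket.
preds = np.array([models[b].predict(x.reshape(1, -1))[0]
                  for x, b in zip(X, buckets)])
```

In a real system the bucket for a new row would have to come from a volatility estimate computable at prediction time, not from the target itself.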

Advice for New Competitors

Here is some advice I have based on my current experience:

  • Leverage the expertise on the public forum. Great code and ideas get posted here. Perhaps even wait a bit before starting, so you can get a jump start.
  • Research a bit of domain knowledge in the topic. Usually a bit of domain knowledge goes a long way.
  • Deal with outliers, either by removing or clipping them or using an algorithm like RANSAC to reduce their effect.
  • Don’t spend a lot of time fine tuning tunable parameters. The results will be different on the private leaderboard anyway so you don’t really know if this is helping or not.
  • Know a range of algorithms. Usually an ensemble wins. Know gradient boosting, it seems to always be a contender.
  • Don’t get too wrapped up in the competition, there is a lot of luck involved. Remember this is only one specific dataset (usually) and there is a large amount of data that you will be judged against that you don’t get to test against.
  • Select your two entries to win based on completely different algorithms so you have a better chance with the private leaderboard.
  • Only enter a challenge if you are actually interested in the topic. Don’t just enter for the sake of entering.

Some Suggestions for Improvement

Based on my experience with this challenge, here are some of my frustrations that I think could be eliminated by some changes to the competition.

  • Don’t make all the data anonymous. Include a one line description of each column. To me the competition would have been way better (and fairer) if we knew better what we were dealing with.
  • Given the processing restrictions in the competition, provide a decent dataset that isn’t full of missing values. It would have been better if reasonable values had been filled in or some other system used; given the size of the dataset, there wasn’t much we competitors could do about them ourselves.
  • A more rectangular structure to the data would have helped processing within the limited resources and improved accuracy. For instance, stocks that enter the portfolio still have prices from before that point, and these could have been included. This would have made treating the dataset as a time series easier.
  • Including columns for the overall market (like the Dow 30 or S&P 500) would have been great. Most stock models heavily follow the market, so this would have been a big help.
  • Be more upfront on the rules. For instance is offline training allowed? Be explicit.
  • Provide information on how the public/private leaderboard data is split. Personally I think it should have been a random sample rather than a time based split.
  • Give the VMs a bit more oomph. Perhaps now that Google owns Kaggle these will get more power in the Google cloud. But keep them free, unlike the YouTube challenge where you get a $300 credit which is used up very quickly.


This wraps up my coverage of the Kaggle Two Sigma Financial Challenge. It was very interesting and educational participating in the challenge. I was disappointed in the final result, but that is part of the learning curve. I will enter future challenges (assuming Google keeps doing these) and hopefully can apply what I’ve learned along the way.

Written by smist08

March 13, 2017 at 7:36 pm

Posted in Artificial Intelligence


Playing the Kaggle Two Sigma Challenge – Part 4

with one comment


The Kaggle Two Sigma Financial Modeling Challenge ran from December 1, 2016 through March 1, 2017. In previous blog posts I introduced the challenge, then covered what I did in December and in January. In this posting I’ll continue with what I did in February. This consisted of refining my earlier work, improving the methods I was using and getting more done during the Kaggle VM runs.

The source code for these articles is located here. One file is the code I used to train offline; you can see how I comment/uncomment code to try different things. Another file shows how to use these results for 3 regression models and 1 random forest model. The offline file uses the datafile train.h5, which is obtained from the Kaggle competition. I can’t redistribute this, but you can get it from Kaggle by acknowledging the terms of use.

Training Offline

Usually training was the slowest part of running these solutions. It was quite hard to set up a solution with ensemble averaging when you only had time to train one algorithm. Within the Kaggle community there are a number of people who religiously rely on gradient boosting for their solutions, and gradient boosting has provided the key components of previous winning solutions. Unfortunately in this competition it was very hard to get gradient boosting to converge within the runtime provided. Some of the participants took to training gradient boosting offline on their own computers and then inserting the trained model into the source code to run in the Kaggle VM. This was quite painful, since the trained model is a binary Python object, so they pickled it to a byte string and output that as an ASCII representation of the hex digits, which they could cut and paste into the Kaggle source code. The problem was that the Kaggle source file is limited to 1MB in size, which limited the size of the model they could use. However, a number of people got this to work.
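A sketch of that pickle-to-hex trick using a toy model (the participants used gradient boosting models, but the mechanics are the same):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of the trick described above: pickle an offline-trained model,
# render the bytes as a hex string to paste into the submitted source,
# then reconstruct the model inside the Kaggle VM. Toy model and data.
X = np.arange(10.0).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

hex_blob = pickle.dumps(model).hex()  # this string goes into the script
assert len(hex_blob) < 1_000_000      # must fit the 1MB source limit

# ...inside the Kaggle VM...
restored = pickle.loads(bytes.fromhex(hex_blob))
print(restored.predict([[100.0]]))
```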

I thought about this and realized that for linear regression this is much easier. In linear regression the model only requires the coefficient array, which is the size of the number of variables, plus the intercept. So generating these and cutting/pasting them into the Kaggle solution is quite easy. I was a bit worried that the final test would train on different data, which would cause this method to fail, but in the end it turned out to be ok. A few people questioned whether this was against the rules of the competition, but no one could quote an exact rule preventing it, just that you might need to provide the code that produced the numbers. Kaggle never gave a definitive answer to this question when asked.
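For a linear model the transported state is just two arrays. A sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch: for a linear model only coef_ and intercept_ need to travel,
# so they can be printed offline and pasted into the submission as
# plain Python literals. Data here is synthetic.
rng = np.random.RandomState(2)
X = rng.randn(200, 4)
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + 0.01 * rng.randn(200)

model = Lasso(alpha=0.01).fit(X, y)
print(list(model.coef_), model.intercept_)  # copy these into the script

# Inside the submitted script, prediction is just a dot product:
coef = np.array(model.coef_)
intercept = model.intercept_
preds = X @ coef + intercept
```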

Bigger Ensembles

With all this in mind, I trained my regression models offline. Some algorithms are quite slow, so this opened up quite a few possibilities. I basically ran through all the regression algorithms in scikit-learn and then used a collection of those that gave the best scores individually. Scikit-learn has a lot of regression algorithms and many of them didn’t perform very well. The best results I got were from Lasso, ElasticNet (with L1 ratios bigger than 0.4) and Orthogonal Matching Pursuit. Generally I found the algorithms that eliminated a lot of variables (setting their coefficients to zero) worked best. I was a bit surprised that Ridge regression worked quite badly for me (more on that next time). I also tried adding some polynomial components using the scikit-learn PolynomialFeatures function, but I couldn’t find anything useful there.

I trained these models using cross-validation (ie the CV versions of the functions). Cross-validation divides the data up and does various training/testing on different folds to find the best results. To some degree this avoids overfitting and provides more robustness to bad data.
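For reference, the CV variants look like this in scikit-learn; the data here is synthetic, not the competition data:

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

# Sketch of the "CV versions": these choose the regularization strength
# by internal cross-validation over folds of the training data.
rng = np.random.RandomState(3)
X = rng.randn(300, 8)
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.randn(300)

lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=[0.5, 0.7, 0.9], cv=5).fit(X, y)
print(lasso.alpha_, enet.alpha_)  # strengths chosen by cross-validation
```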

Further, I ran these regressions on two views of the data: one on last-timestamp plus current data for a bunch of columns, and the other on the whole dataset but just for the current timestamp. Once this was done for one regression, adding more regressions didn’t seem to slow down processing much, and the overall time I was using wasn’t much. So I had enough processing time left over to add an ExtraTreesRegressor, which was trained during the runs.

It took quite a few submissions to figure out a good balance of solutions. Perhaps with more time a better optimum could have been obtained, but hard time limits are often good.


A number of people in the competition with more of a data background spent quite a bit of time cleaning the data, which seemed quite noisy with quite a few bad outliers. I wasn’t really keen on this and wanted my ML algorithms to handle it for me. This is when I discovered the scikit-learn functions for dealing with outliers and modeling errors. The one I found useful was RANSAC (RANdom SAmple Consensus). I thought this was quite a clever algorithm: it uses subsets of the data to figure out the outliers (by how far they fall from various predictions) and finds a good subset of the data without outliers to train on. You pass a linear model into RANSAC to use for estimating, and you can get the coefficients out at the end to use. The downside is that running RANSAC is very slow; to get good results it took me about 8 hours to train a single linear model.

The good news is that using RANSAC rather than plain cross-validation improved my score quite a bit, and as a result I ended up in about 70th place before the competition ended. You can pass the cross-validation version of a function into RANSAC to perhaps get even better results, but I found this too slow (it was still running after a day or two).
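A hedged sketch of RANSAC in scikit-learn, on synthetic data with injected outliers (my actual models and data were different):

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Sketch: RANSAC fits linear models on random subsets of the data,
# flags rows far from the consensus fit as outliers, and fits the
# final model on the inliers only. Data and outliers are synthetic.
rng = np.random.RandomState(4)
X = rng.randn(300, 3)
y = X @ np.array([1.0, -1.0, 0.5]) + 0.05 * rng.randn(300)
y[:20] += 50.0  # inject gross outliers

# The default base estimator is LinearRegression; a CV variant can be
# passed in instead, at the cost of much longer runtimes.
ransac = RANSACRegressor(random_state=0)
ransac.fit(X, y)

# The coefficients come from the inlier-only fit.
print(ransac.estimator_.coef_, ransac.inlier_mask_.sum())
```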


This wraps up what I did in February and basically the RANSAC version of my best Ensemble is what I submitted as my final result for the competition. Next time I’ll discuss the final results of the competition and how I did on the final test dataset.

Written by smist08

March 7, 2017 at 9:25 pm