Posts Tagged ‘machine learning’
As I started learning about machine learning and playing with simple problems, I wasn’t really satisfied with the standard datasets everyone starts out with like MNINST. So I went to playing with stock market predictions which was fun, but there was really no metric on how well I was doing (especially since I wasn’t going to use real money). Then as I was reading various books on machine learning and AI, I often ran into references to Kaggle competitions.
Kaggle is a company that hosts machine learning competitions. It also facilitates hosting data sets, mentoring people and promoting machine learning education and research. The competitions are very popular in the machine learning community and often have quite large cash prizes, though a lot of people just do it to get Kaggle competition badges.
Kaggle appealed to me because there were quite a few interesting data sets and you could compare how your algorithms were doing against the other people playing there. The competitions usually run for 3 or 4 months and I wanted to start one at the beginning rather than jump into the middle of one or play with the dataset of an already completed competition so I waited for the next one to start.
The next competition to start was the AllState Claims Severity challenge where you predicted how much money someone was going to cost the insurance company. There was no prize money with this one and I wasn’t really keen on the problem. However the dataset was well suited to using TensorFlow and Neural Networks, so I started on that. I only played with it for a month before abandoning it for the next competition, but my last score ended up being 1126th out of 3055 teams.
Then came the Outbrain Click prediction competition. Outbrain is that annoying company that places ads at the bottom of news articles and they wanted help better predicting what you might be interested in. This competition had $25,000 in prizes, but besides not really wanting to help Outbrain, I quickly realized that the problem was way beyond what I could work on with my little MacBook Air. So I abandoned that competition very quickly.
Next came the Santander Bank Product Recommendation competition where you predicted other banking products to recommend to customers. There was $60,000 in prize money for this one. I played with this one a bit, but didn’t really get anywhere. I think the problem was largely dealing with bad data which although important, isn’t really what I’m interested in.
Then along came the Two Sigma Financial Modelling Challenge. This one had $100,000 in prize money and was actually a problem I’m interested in. I played this competition to completion and my plan is to detail my journey over a few blog posts. Only the top seven entries receive any prize money and since these competitions are entered by University research teams, corporate AI departments and many other people, it is very hard to actually win any money. The top 14 win a gold medal, the top 5% get a silver medal and the top 10% a bronze. 2071 teams entered this competition.
The Two Sigma Financial Modeling Challenge
One of the problems I had with the first set of challenges was that you ran your models on your own computer and then just submitted the results to the competition for evaluation. This put me at a disadvantage to people with access to much more powerful computers or access to large cloud computing infrastructure. I didn’t really want to spend money on the cloud or buy specialized hardware for a hobby. With the Two Sigma challenge this changed completely. With this challenge you run your code in a Kaggle hosted docker image and rather than submit your results, you submit your code and it has to run in the resources of the Kaggle hosted image. This then leveled the playing field for all the competitors. This restricted you to programming in Python (which I like, but many R programmers objected to) and using the standard Python machine learning libraries, but they seemed to include anything I’d ever heard of.
The provided dataset consisted of 60 or so fundamental metrics, 40 or so technical indicators and 5 derived indicators. Plus in the test set the value of what you are trying to predict. No further explanation of what anything was was given, it was up to you to make what you could of the data and predict the target variable. The data was grouped by id and timestamp so you appeared to be tracking a portfolio of about 1000 stocks or commodities through time. The stocks in the portfolio changed over time and when you were predicting values you weren’t explicitly given the previous values of what you were predicting. There was some feedback in that you predicted the results for the portfolio one timestamp at a time and received a score for each submitted group for a timestamp.
We were given about 40% of the data to train our models and do test runs, which we could either do locally or in the Kaggle cloud. Then when you submitted your code for evaluation it ran on the full dataset, training on that 40% and then predicting the rest that we never saw. You could only submit two entries a day for evaluation, so you had to make each one count. This was to stop various algorithms of simply being able to overfit the evaluation data to win the competition (ie limit cheating).
We could run against the test data all we liked in the Kaggle cloud. But there were restrictions on processing time and memory usage. When running we were allowed 20 minutes processing and 8Gig of RAM. When submitting we were allowed 1 hour or processing and 16Gig or RAM. This tended to work out due to the size difference in the data sets used. Given the size of the dataset, this meant your training and evaluation had to be very efficient to run under the given constraints. We’ll discuss the tradeoffs I had to make to run in this environment quite a bit over the coming blog articles, this especially meant you had to use pandas and numpy very efficiently and avoided Python loops at all costs.
Well I seem to have filled up this blog post without even starting the competition. In future posts I’ll cover the various algorithms that I tried. What worked and what didn’t. And the progression of results that eventually combined to give me my final score. I’ll also discuss what is revealed by the winners and how what they did differed from what I did, what I missed and what I got right. Anyway I think this going to take quite a few posts to cover everything.