Written By: Eddy Yeo
Part 1 of a 2 Part Series
Massive datasets are now available online for every sport, so sports analytics is no longer an exclusive pursuit of sporting professionals. It is now an everyman activity. Any amateur armed with a computer, internet access and maybe a bit of brains can now have opinions that are just as sound as any professional analyst's. Power to the people!
Actually, this is not exactly true. Yes, amateurs can now use computers to do their own number crunching. However, this only means that the power has shifted from the sports insiders to those well-versed in the arts of data analysis. Pittsburgh's very own Bill Benter has made a hefty profit from a computer horse betting system he developed with some of his friends. By some accounts, he has a PhD in Physics. CMU's Professor AC Thomas has written numerous research papers on hockey and baseball. And yes, he has a PhD in Statistics.
I don't think the amateur data scientist should be discouraged, however. In the age of the internet, learning data analysis is not as hard as it was before. Statistics and machine learning classes are offered at most universities and even on educational websites like Coursera.
This is the first blog post of a three-part series where I will attempt to train a neural network to predict horse racing results. Depending on its performance, I may evaluate betting strategies for financial gain. I will be posting all source code and documenting my design choices, and I would greatly appreciate any feedback or comments on them. If I don't post my source code, it probably means I am making a lot of money, in which case I won't need any feedback... Note that I will not assume any prior knowledge of neural networks or machine learning in my posts.
What does training a neural network mean?
The training process is similar to a student preparing for an exam. A student has an understanding of a subject. His understanding is the 'neural network'. (Incidentally, in this case his understanding is literally a biological neural network, which is what our artificial neural network is trying to model.) To prepare for the exam, he works through exams from past semesters, and then checks the solutions. These past exams and solutions are the training dataset. Depending on his performance, he modifies his understanding. On the day of his exam, he uses his understanding to 'predict' the answers to the questions in the exam, without seeing the solutions.
Similarly, we will create an artificial neural network for horse racing. We set it up with some initial state, representing our initial understanding. This may very well be a random state if we have zero understanding. We feed in data from past races with results, and tweak our neural network to improve its performance. Of course, we use computers to do this, as the datasets could be huge.
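To make the "start random, then tweak" idea concrete, here is a minimal sketch in Python. It is not the horse racing network: it trains a single artificial neuron on a made-up toy dataset (the logical OR function), and the function names, learning rate and number of passes are all my own assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(examples, epochs=2000, learning_rate=0.5, seed=0):
    """Train one sigmoid neuron by repeatedly nudging its weights."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    # Random initial state: the network starts with "zero understanding".
    weights = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):
        for inputs, target in examples:
            # Attempt the "past exam question", then check the solution.
            error = predict(weights, bias, inputs) - target
            # Tweak each weight a little in the direction that reduces the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Made-up toy training set (inputs, correct answer): the logical OR function.
toy_examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(toy_examples)
```

After training, `predict(weights, bias, [0, 0])` should come out below 0.5 and the other three cases above 0.5, which is the neuron's way of "answering the exam". A real network has many such neurons and far messier data, but the loop has the same shape.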
There are many online sources explaining the details of neural networks, so I will try to explain them as simply as possible.
There are three types of nodes in our network: input nodes, output nodes and hidden layer nodes.
Input nodes represent the inputs to the network. They receive inputs from the outside world (e.g. past-year exam questions). These input nodes are typically connected to hidden layer nodes. Hidden layer nodes are called 'hidden' because they are not exposed to the 'outside world'. Their inputs and outputs are internal to the network. Based on the signals they receive from the input nodes, hidden layer nodes then send signals to either other hidden layer nodes (we can have multiple hidden layers) or output nodes. Based on the signals they receive, output nodes output a value (e.g. 'predicted' exam answers).
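The flow of signals from input nodes through hidden nodes to output nodes can be sketched in a few lines of Python. This is only an illustration under my own assumptions: every node computes a weighted sum of its incoming signals and passes it through a sigmoid, and the weights below are arbitrary made-up numbers rather than anything learned from data.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute the output signal of every node in one layer.

    weights[i] holds the connection weights coming into node i.
    """
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
            for node_weights, b in zip(weights, biases)]


def forward(inputs, layers):
    """Pass signals through each layer in turn: input -> hidden -> output."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical network: 3 input nodes, 2 hidden nodes, 1 output node.
hidden_layer = ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
output_layer = ([[1.0, -1.0]], [0.0])

prediction = forward([1.0, 0.5, -1.0], [hidden_layer, output_layer])
```

Here `prediction` is a list with one number between 0 and 1, the value produced by the single output node. Adding another hidden layer just means adding another `(weights, biases)` pair to the list.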
This is a very nice picture from a Stanford webpage:
So how do we build a network that makes accurate predictions? There are many parts of the network that we can tweak:
- The weight of each connection between two nodes. This is typically tweaked by the computer as it processes the training data.
- The number of hidden layers
- The number of nodes in each hidden layer
- How each node should process inputs to produce an output (called the activation function)
Each of these design choices involves weighing tradeoffs and understanding the specific problem we are trying to solve (horse racing, in this case).
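One way to see the tradeoff is to write the design choices down and count how many connection weights the computer would have to tune. The sketch below is hypothetical: the `NetworkDesign` class, the layer sizes and the choice of activation functions are assumptions of mine, not the design I will end up using.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


def relu(x):
    """A common activation function: pass positives through, clip negatives to 0."""
    return max(0.0, x)


@dataclass
class NetworkDesign:
    """The tweakable choices, apart from the weights themselves."""
    n_inputs: int
    hidden_layer_sizes: List[int]          # one entry per hidden layer
    n_outputs: int
    activation: Callable[[float], float]   # how a node processes its input

    def layer_sizes(self):
        return [self.n_inputs, *self.hidden_layer_sizes, self.n_outputs]

    def n_connection_weights(self):
        """Every node connects to every node in the next layer."""
        sizes = self.layer_sizes()
        return sum(a * b for a, b in zip(sizes, sizes[1:]))


# Two made-up designs for a network with 10 inputs and 1 output.
small = NetworkDesign(10, [5], 1, math.tanh)
large = NetworkDesign(10, [20, 20], 1, relu)

print(small.n_connection_weights())  # 10*5 + 5*1 = 55
print(large.n_connection_weights())  # 10*20 + 20*20 + 20*1 = 620
```

The larger design has more than ten times as many weights to learn, so it can represent more complicated patterns but also needs more training data to pin those weights down.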
There is also the issue of data preprocessing. Most algorithms assume that inputs are real numbers within a certain range, so we have to process inputs to transform them into real numbers. For example, for horse racing, if the horse is male, we may want to map it to 1 and if it's female, to -1.
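As a rough sketch of what this preprocessing might look like in Python, the snippet below maps a horse's sex to 1 or -1 as described above and squeezes a numeric input into the range -1 to 1. The field names (`sex`, `age_years`), the example horse and the assumed age range of 2 to 10 years are all made up for illustration.

```python
def encode_sex(sex):
    """Map a categorical value to a real number: male -> 1, female -> -1."""
    mapping = {"male": 1.0, "female": -1.0}
    return mapping[sex.lower()]


def scale_to_range(value, low, high):
    """Linearly map a value from [low, high] onto [-1, 1]."""
    if high == low:
        raise ValueError("low and high must differ")
    return 2.0 * (value - low) / (high - low) - 1.0


# A made-up horse record; real data would come from past race results.
horse = {"sex": "Female", "age_years": 4}

features = [
    encode_sex(horse["sex"]),
    scale_to_range(horse["age_years"], low=2, high=10),  # assumed age range
]
print(features)  # [-1.0, -0.5]
```

The resulting list of numbers is what would be fed to the input nodes, one number per node.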
Perhaps the most important decisions are related to what variables we'll use as inputs. I'm guessing this is the toughest part of the problem to crack.
To be continued...
Over the next two weeks, I will be reading research papers by those who have tried using neural networks to predict horse race results, and doing the actual implementation of my prediction system.