As previously explained; I am a pretty big fan of evolutionary computing. Meaning that in my algorithms I like to use nature as an example. Nature has some pretty sweet ways of finding a (local) optimal solution in an extremely big solution space. For BookieBacker I use an adapted version of the genetic algorithm. Before you can use this method, you need to do two things. The first is translating your problem into a representation of DNA. The second one is defining a cost function. This cost function should be able to evaluate your string of DNA and tell you how good it is. For example, a string of DNA might be the composition of a random animal (wings, claws, beak, teeth, fur, number of eyes etc) and the cost function should indicate if this random animal can survive in the outside world (can it find food, can it take on predators etc). So for animals (and humans) our attributes are the genetic algorithm solution, the big scary world outside is our train/test area and whether we survive (or reproduce) is our cost function that we need to optimize.
That’s a great story, but the question remains; how do I translate football data into DNA? And what is my cost function? What if we translate previous results into attributes. For instance; check out the contents of the full dataset right here. Let’s say we chose a random set of five of these attributes;
1. Number of home wins for the home team in the previous 3 matches. (X1)
2. Difference in goals scored between the teams in the previous two matches. (X2)
3. The result of the last four confrontations between these two teams. (X3)
4. The number of clean sheets for the away team in the past 3 matches. (X4)
5. The number of losses with 2 goals or more for the away team in the past 3 matches. (X5)
Great, nice selection right?
We will use these variable to see if dividing all the matches that we have with these attributes in two groups can land us a profit. The two groups of course being; ‘I will bet on this’ and ‘I will not bet on this’. But the values of this set of attributes are not interpreted or weighted yet, they are just a set of incomparable values without a direction towards betting yes or no. So we need something to make them tick. Referring back to DNA, I don’t just want a location for my ‘claws’ attribute, I want to know how BIG my claws are going to be.
So we also need a set of random coefficients, let’s say that we draw the set [-10,5,10,-5,3] that means that we can calculate our ‘random’ total for this and other matches as;
The result of this formula on all matches, re-scaled to numbers between 0 and 1, can represent the chance for, for instance, a home win. Betting on those matches where the % chance of a home win gives you and advantage over the bookie will probably not result in a profit for this first set of attributes and coefficients. So what are we going to be doing about that?
This is where the evolution comes in place. If you see your set of X1-X5 as a string of DNA and thus an animal, imagine making 50 animals. So 50 sets of X1-X5 (all different indicators) and 50 sets of coefficients. If you evaluate the result of all these 50 sets you can sort them from most profitable to least profitable. Just like in nature, not every animal will survive.
The genetic algorithm is an iterative approach. The 50 animals that we take into account in our next iteration are;
– The top ten performers of the previous iteration.
– 10 Mutations of the top ten performers of the previous iteration.
– 10 Children of the top ten performers of the previous iteration.
– 20 brand new sets.
Mutations? Children? What kind of sick and twisted algorithm is this? Hear me out. If a solution performs well, you obviously did something right. You put emphasis on the correct variable or set of variables. It is, however, not certain that this solution is optimal.
First mutation. If we want to run a marathon, we need to practice. We need to break the 20km, 25km, 30km and 35km barrier before we can attempt the 42km. That is your own mutation. Practice, stronger legs, bigger lung capacity.
To mutate the top 10 performers, we can chose to make the variables (not the coefficients) switch places. So X1 switches places with X2 and is now multiplied by (in the example) 5. You keep the same attributes but look for a better weighting solution.
Secondly, children of top performers. If we imagine the African Savanna a long time ago, some horse-like animals were having trouble reaching high enough places in the trees to eat leafs. Out of a group of those animals, which one do you think got the most ass? That’s right, the one with the longest neck. His children had a longer than usual neck and that made them more desirable as well. Repeat for a 1000 times and there you have it, the giraffe!
To create children from your top ten performers, pick 2 performers and make two new sets using the X1-X3 of your first solution with X4-X5 of you second and the other way around.
A third possibility is that you were not at all on the right track to create an animal that survives. We might be trying to select the optimal set of claws to survive in the ocean. Claws are always nice but an animal that can breath underwater is the clear winner in every ocean battle. Maybe Will Ferrell can explain this better. In conclusion, we will also take another set of 20 brand new animals with us in the equation.
Next time I’ll write they’ll be more details and a link to GitHub code. Hopefully you can use it yourselves. Any questions? firstname.lastname@example.org or twitter.com/bookiebacker.