I can’t wait to share the results of my football prediction model with you. I promise to contain my excitement and stick to the facts. The model took much more time than initially anticipated. It took me about 4 months from idea to the somewhat working model. My hands are still sore from all copying and pasting. The findings, however are quite promising and a few back tests showed to be quite profitable.
But first, a quick refresher: this football prediction model is based on relative player salaries (for methodology, see part 2 of this blog). Why did I not use a more traditional xG prediction model? I have outlined their limitations in part 1.
The process of Creating the Football Prediction Model for Serie A
My goal was to analyze at least 5,000 matches and estimate how often more expensive football squads win at home and away. I had to go back and find Serie A wage bills that are 10-12 years old. The more the merrier. Tiresome internet browsing revealed 11 years worth of data. Italy’s Serie A seasons from 2006/2007 up to 2017/2018. There were a total of 4180 matches that were analyzed. A bit short of my goal but it will do. The aim was to learn the relationship between the player salaries and match outcome.
Team wages were taken from La Gazzetta Delo Sport. Here’s a sample salary report that the model uses. I almost had to learn Italian to find them all 🙂
The model was intended to simulate various what if scenarios. What if the home team outspends the visitors four-fold, how often do they win? (62% of the time) How often do they tie? (22.5%) and are those probabilities reflected in the betting market (Yes, but NOT always).
The key metric had to be simple, so I called it the ratio. Simply put, it’s the wage bill of a home team divided by the visiting team.
After looking at 4,200 matches, I broke them down into categories of about 300 matches each, sorted by the ratio.
Quick Video Overview of the Football Prediction Model
Here’s a quick snapshot of results. To paint a picture, lets sort data by ratio and break the data set into clusters. Each cluster representing about 300 matches. This is a fairly small sample size. We then do a quick pivot table in Excel and arrange the probability of match outcome in each of the clusters. Here’s what we get:
Intuitively it makes sense. When teams have have similar wage bill (Ratio ~1), the home team will win 49% of the time. The draw will occur 27% of the time and 24% of the time away team will win. These are the matches like Napoli vs Inter and Verona vs Parma in 2015.
But look what happens when you move away toward the edges of the grid. When a large outsider plays an away game, the team only has a 5% chance of winning (Milan vs Capri in 2016 for instance). But the visiting team will try very hard to contain the favourite. As a result the probability of the draw is 19%.
Quick graphical representation. We see a few bumps but I attribute it to smaller sample size.
How does this Model compare to the Bookies?
On average, the bookmaker odds are aligned with the statistical model. There are however some irregularities and trends. A few clusters are overvalued by bettors and a few are undervalued. Or so it seems. Also, certain matches have a very large gap between statistical probability and the bookies. Lets not forget we are looking at the static model and this will happen occasionally. A star player gets injured. The team decides to give some rest to the 7 starting players. Or perhaps the stadium is banned for fan behaviour. All of these factors would influence the odds significantly. However on few occasions the model uncovers some profitable betting opportunities.
But it’s a topic for another blog post and a long analysis. If you want to see a separate post about it, just drop me a note or simply comment below.
Oh, and I am pretty sure I have discovered a fixed match in Serie A at the end of the season. Obvious I can’t prove it was fixed but the odds were VERY suspect.
Football Prediction Model Back Tests
What the point of all this hard modelling work without exploring monetization opportunities, right? 🙂 The formula was simple: look for value bets in historical performance data. For this to be a good test, two separate leagues other than Serie A had to be used. I back tested it on one season of EPL games and one season of La Liga. Thanks to Football Data, they had statistical results not only of the matches, but also the bookmaker odds.
Simulation consisted of looking for the a) difference between Pinnacle odds and statistical probabilities of the model and b) probability of event ocuring. Every time system showed an edge, I instructed the spreadsheet to allocate a bet based on probability of an event. If this is confusing, take a quick look at the video.
Bet Criteria: Only bet if expected bet value is >15%. I tried various values, but 15% looks optimal.
Bet amount: I used flat bet amount depending on the probability of outcome. I wanted an average stake to be near $20, so simply multiplied probability of an outcome by $40. If a home win had a value of >15% and 67% probability of occurring, I would stake $26.8 ($40*.67)
Starting Bankroll: $500 for each test
Both scenarios were profitable, which is fantastic news! La Liga was profitable from week one and kept going, but with EPL was more volatile and by the end of October the bankroll went down to $248. It got back to $500 range by mid November. If I was betting in real life, I would be nervous with two losing months in a row. But nevertheless, if I started with $1000, I would be $1,316 richer.
For the purposes of this back test, I didn’t increase the bets as the bankroll got bigger, I didn’t use Kelly Criterion or any of the complex bet allocations. I am sure that with some modelling, I can reduce the risk and optimize for profit.
So Stan, can you REALLY beat the bookies with this football prediction model?
That is a million dollar question now, isn’t it? I think I can. With time. Please understand that this is a very simplistic model that does not account for important factors like injuries, team motivation, player form, etc. My goal is to take it to the next step. That’s a totally different can of worms, but I am prepared. I think 🙂
P.S. if I stop blogging about this model in the near future, I am probably making money with it. Or I just stopped blogging. Which is less likely 🙂
P.P.S. While the model is fairly basic now, I don’t mind sharing it with readers. All 11 of you! But first I have to figure out how to upload Excel spreadsheets to the blog 😉 You could really motivate me by leaving some comments below in case you are interested;)