Predicting the FIFA World Cup Group Stage

Written by: Adam Tucker (@adamtb182)

Introduction

The FIFA World Cup has kicked off, and all soccer fans (myself included) are ecstatic for its return. No other tournament on Earth compares to this for me. There simply is no other event that brings together so many quality teams and talented players to battle it out for glory. Unfortunately, the United States failed to qualify, so I was left without a home country to root for. I thought that all I wanted out of this World Cup was to see some great soccer, until I remembered that I have learned a few tricks in statistics over the past four years. So, in the absence of a US team to root for, my new goal is to see how well I can predict the Group Stage.

Data

Before any modeling can be done, there needs to be a thorough data search and manipulation effort to provide a framework for analysis. I will point you to the datasets I used in case you would like to work with them yourself. Many thanks to these people for providing such awesome data. I used three different data sources to create a match prediction model: the World Cup 2018 dataset, provided by Nuggs on Kaggle; the FIFA Soccer Rankings, provided by Tadhg Fitzgerald on Kaggle; and the International Football Results, provided by Mart Jürisoo, also on Kaggle. The World Cup 2018 dataset was used to set up an R data frame to predict the Group Stage results after fitting the model. The FIFA Soccer Rankings, which consist of the monthly FIFA rankings dating back to 1993, were used to pull out potential predictors for team form and team strength. The International Football Results dataset, which consists of international football match results dating back to 1892, was used for its match outcomes to fit the model and for its record of home and neutral fields to control for the home team’s advantage.

I won’t bore you with the manipulation details, but there are a few key edits I made to the data. First and foremost, I only considered games that took place from 1993 to 2018, because those dates coincide with the available rankings data. In addition, I filtered out all international friendlies. Teams often have no incentive to play their strongest players in these games, so I believed that keeping them would skew my results.
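The two filters described above can be sketched in a few lines. My actual data work was done in R, so treat this Python version as an illustration only: the field names ("date", "tournament") and the "Friendly" label are assumptions about how the results file is laid out, and the three records are made up.

```python
from datetime import date

# Made-up records for illustration; the field names are assumptions about
# the results file, not something confirmed in this post.
matches = [
    {"date": date(1990, 6, 8), "home_team": "Team A", "away_team": "Team B",
     "tournament": "FIFA World Cup"},
    {"date": date(2005, 3, 26), "home_team": "Team C", "away_team": "Team D",
     "tournament": "Friendly"},
    {"date": date(2014, 6, 12), "home_team": "Team E", "away_team": "Team F",
     "tournament": "FIFA World Cup"},
]


def keep_match(match, first_year=1993, last_year=2018):
    """Keep competitive matches played inside the rankings window."""
    in_window = first_year <= match["date"].year <= last_year
    is_friendly = match["tournament"].strip().lower() == "friendly"
    return in_window and not is_friendly


filtered = [m for m in matches if keep_match(m)]
print(len(filtered), "of", len(matches), "matches kept")
```

Only the third record survives: the first is too early for the rankings data and the second is a friendly.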

Model

Group stage games have three possible outcomes: win, loss, or tie. Since there is no possibility of a penalty shootout, this natural trichotomy of categorical outcomes guided me in choosing the multinomial logistic regression model, an extension of the logistic regression model to categorical outcomes that aren’t binary. A multinomial logistic regression model estimates the log odds of a win and of a loss against a baseline outcome, in this case a tie. Formally, my model estimates the following:
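In standard notation, the two equations look roughly like this. It is a sketch under assumptions: the symbols are shorthand I am introducing here, and I am assuming the predictors are exactly the five that appear in the coefficient table in the Results section.

```latex
\begin{aligned}
\log\frac{P(\text{win})}{P(\text{tie})}
  &= \beta^{W}_{0} + \beta^{W}_{1} S_{1} + \beta^{W}_{2} S_{2}
   + \beta^{W}_{3} F_{1} + \beta^{W}_{4} F_{2} + \beta^{W}_{5} N \\
\log\frac{P(\text{loss})}{P(\text{tie})}
  &= \beta^{L}_{0} + \beta^{L}_{1} S_{1} + \beta^{L}_{2} S_{2}
   + \beta^{L}_{3} F_{1} + \beta^{L}_{4} F_{2} + \beta^{L}_{5} N
\end{aligned}
```

Here $S_1$ and $S_2$ stand for the strength of team one and team two, $F_1$ and $F_2$ for their form, and $N$ is an indicator for a match played on neutral ground. Wins and losses are always from team one's point of view.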

 

 

I wanted to test my soccer intuition with data, so as predictors of match outcomes I considered team strength, team form, and home field advantage. As a proxy for team strength, I chose the FIFA world rankings, which have been shown to be a good predictor of game outcomes by Professor Roger Pielke of the University of Colorado Boulder. Similarly, as a proxy for team form, I chose the change in monthly rank. I also controlled for home field advantage to see if that would give Russia an edge. All in all, I had a sample of a little over 10,000 international matches from 1993-2018 to fit my model.

The multinomial logistic model outputs the log odds of a win and of a loss relative to a tie, which convert into probabilities of a win, a loss, and a tie. In the Group Stage, a win is worth three points, a tie is worth one point, and a loss is worth zero. Therefore, instead of just choosing the outcome with the highest probability, I chose to calculate the number of points a team can expect from its three games. More formally, I calculated:
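Written in LaTeX, with the probability symbols as shorthand I am introducing here, the statistic is:

```latex
E[\text{points}] = \sum_{i=1}^{3} \Big( 3 \cdot P(\text{win}_i) + 1 \cdot P(\text{tie}_i) + 0 \cdot P(\text{loss}_i) \Big)
```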

where the index i refers to one of the three group stage games a country plays. I did this to attempt to control for some of the crazy results we are sure to see in the World Cup, but unfortunately three games is not a huge number of observations. I’m sure that this statistic derived from the model will get some picks wrong, but who doesn’t love an underdog making it through?
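To make the difference between expected points and "pick the most likely outcome" concrete, here is a small Python sketch. The function names and the probabilities are made up for illustration and are not output from my model.

```python
POINTS = {"win": 3, "tie": 1, "loss": 0}


def expected_points(match_probs):
    """Sum 3*P(win) + 1*P(tie) + 0*P(loss) over a team's matches."""
    total = 0.0
    for probs in match_probs:
        if abs(sum(probs[k] for k in POINTS) - 1.0) > 1e-9:
            raise ValueError("probabilities for a match must sum to 1")
        total += sum(POINTS[k] * probs[k] for k in POINTS)
    return total


def most_likely_points(match_probs):
    """Points a team would get if every match ended in its likeliest outcome."""
    return sum(POINTS[max(POINTS, key=lambda k: probs[k])] for probs in match_probs)


# Made-up probabilities: a slight favorite in each of its three games.
team_x = [{"win": 0.40, "tie": 0.35, "loss": 0.25}] * 3

print(most_likely_points(team_x))  # every game is called a win
print(round(expected_points(team_x), 2))
```

Calling every game for the slight favorite hands this team the maximum nine points, while the expected value is roughly half of that, which is a fairer summary of three close games.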

Results

Here is a summary of my model’s coefficient estimates (note that these are in the form of the log odds described above, compared against the baseline outcome of a tie) and their p-values from Wald tests:

Coefficient        | Estimate (loss) | Estimate (win) | P-value (loss)  | P-value (win)
Intercept          | -0.424          | 0.385          | 1.1015*10^(-11) | 1.500*10^(-12)
Team One Strength  | 0.017           | -0.018         | 0.000           | 0.000
Team Two Strength  | -0.016          | 0.021          | 0.000           | 0.000
Team One Form      | -0.022          | 0.044          | 4.402*10^(-8)   | 0.000
Team Two Form      | 0.035           | -0.031         | 0.000           | 2.220*10^(-16)
Neutral Ground     | 0.517           | -0.333         | 3.109*10^(-15)  | 4.214*10^(-8)

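When you remember that my model predicts the outcome for team one and that the coefficients are relative to the match ending in a tie, the signs on them make sense. For example, strength is measured by FIFA rank, where a larger number means a weaker team, so a positive loss coefficient and a negative win coefficient on Team One Strength are what I would expect. In addition, the Wald tests report that all of the coefficients are significant, so I like where this model is going. So which teams does my model think will advance? Below is a graph of my model’s estimates of the expected points for each country in the group stage, followed by a summary of its picks.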
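If you want to play with the coefficients in the table above, the log odds can be turned back into probabilities as in the following Python sketch. It rests on assumptions I should flag: that strength enters as the raw FIFA rank, that form is the raw change in rank, and that Neutral Ground is a 0/1 indicator. The function name and the example inputs are made up, so do not expect this to reproduce my reported numbers exactly.

```python
import math

# Coefficients copied from the table above (log odds relative to a tie).
# Order: intercept, team one strength, team two strength,
#        team one form, team two form, neutral ground.
LOSS_COEFS = (-0.424, 0.017, -0.016, -0.022, 0.035, 0.517)
WIN_COEFS = (0.385, -0.018, 0.021, 0.044, -0.031, -0.333)


def outcome_probabilities(rank_one, rank_two, form_one, form_two, neutral):
    """Return win/tie/loss probabilities for team one.

    Assumes strength is the raw FIFA rank, form is the change in rank,
    and neutral is True for a match on neutral ground.
    """
    x = (1.0, rank_one, rank_two, form_one, form_two, 1.0 if neutral else 0.0)
    eta_loss = sum(b * v for b, v in zip(LOSS_COEFS, x))
    eta_win = sum(b * v for b, v in zip(WIN_COEFS, x))
    # The tie is the baseline, so its linear predictor is 0 and exp(0) = 1.
    denom = 1.0 + math.exp(eta_loss) + math.exp(eta_win)
    return {
        "win": math.exp(eta_win) / denom,
        "tie": 1.0 / denom,
        "loss": math.exp(eta_loss) / denom,
    }


# Hypothetical match on neutral ground: team ranked 5th against team ranked 50th.
probs = outcome_probabilities(5, 50, 0, 0, neutral=True)
print({k: round(v, 3) for k, v in probs.items()})
```

Swapping the two ranks should flip the picture, with the loss probability becoming the largest of the three.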

 

Group | Team 1    | Team 2
A     | Uruguay   | Egypt
B     | Portugal  | Spain
C     | France    | Peru
D     | Argentina | Iceland
E     | Brazil    | Switzerland
F     | Germany   | Mexico
G     | Belgium   | England
H     | Poland    | Colombia


Quantifying Uncertainty

I find that people love to tell you what they know, but I think that telling people what you don’t know is just as important. In order to get a sense of the variability of expected points, I did a nonparametric bootstrap of the statistic by resampling cases. The error bars in the above graph correspond to +/- one bootstrapped standard error. Once I take the variability into account, I lose some confidence in my predictions. More specifically, I cannot confidently say that teams such as Iceland, Mexico, and Colombia will advance, because their standard error bars overlap with those of the teams I picked for third place in their groups.
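For readers who have not seen a case-resampling bootstrap before, here is a minimal Python sketch of the idea. It is not my actual code: the function name is made up, and to keep it short the statistic is just average points per match on made-up outcomes, whereas for the model the statistic would presumably involve refitting on each resample and recomputing expected points.

```python
import random
import statistics


def bootstrap_se(cases, statistic, n_boot=1000, seed=0):
    """Standard error of `statistic`, estimated by resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(cases)
    replicates = []
    for _ in range(n_boot):
        # Draw n cases with replacement from the original sample.
        resample = [cases[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # The spread of the replicates approximates the sampling variability.
    return statistics.stdev(replicates)


# Made-up points per match (3 = win, 1 = tie, 0 = loss) for illustration.
points = [3, 0, 1, 3, 3, 0, 1, 1, 3, 0, 3, 1]
se = bootstrap_se(points, statistics.mean)
print(round(se, 3))
```

The seed is fixed only so that repeated runs give the same number; the estimate itself will move a little with a different seed or a different number of resamples.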

Summary and Future Work

Now I have a little bit more at stake this World Cup: seeing how well my model does! However, I’m starting to sweat a little given how well Russia has performed and how poorly Argentina has. Unfortunately, since my model predicts only a win, loss, or tie rather than a score line, it won’t extend well to the knockout stage, where every game needs a winner. Some future work needs to be done for the knockout stage, perhaps with competing Poisson processes. Many factors go into deciding the outcome of the beautiful game, which means that no model is ever perfect. But isn’t that why we all watch in the first place?

 

All code can be found here in a GitHub repository for this project.
