Predicting Who Will Win the World Cup with Wolfram Language

The FIFA World Cup is underway. From June 12 to July 13, 32 national football teams play against each other to determine the FIFA world champion for the next four years. Who will succeed? Experts and fans all have their opinions, but is it possible to answer this question in a more scientific way? Football is an unpredictable sport: few goals are scored, the supposedly weaker team often manages to win, and referees make mistakes. Nevertheless, by investigating the data of past matches and using the new machine learning functions of the Wolfram Language, Predict and Classify, we can attempt to predict the outcome of matches.

The first step is to gather data. FIFA results will soon be accessible from Wolfram|Alpha, but for now we have to do it the hard way: scrape the data from the web. Fortunately, many websites gather historical data (www.espn.co.uk, www.rsssf.com, www.11v11.com, etc.) and all the scraping and parsing can be done with Wolfram Language functions. We first stored web pages locally using URLSave and then imported these pages using Import[myfile,"XMLObject"] (and Import[myfile,"Hyperlinks"] for the links). Using XML objects allows us to keep the structure of the page, and the content can be parsed using Part and pattern-matching functions such as Cases. After the scraping, we cleaned and interpreted the data: for example, we had to infer the country from a large number of cities and used Interpreter to do so:

From scraping various websites, we obtained a dataset of about 30,000 international matches of 203 teams from 1950 to 2014 and 75,000 players. Loaded into the Wolfram Language, its size is about 200MB of data. Here is a match and a player example stored in a Dataset:

Matches include score, date, location, competition, players, referee, etc., along with players’ birth date, height, weight, number of selections for their national team, etc. However, the dataset contains missing elements: most players have missing characteristics, for example. Fortunately, machine learning functions such as Predict and Classify can handle missing data automatically.

Before starting to construct a predictive model, let’s compute some amusing statistics about football matches and players.

The mean number of goals per match is 2.8 (which, over 90 minutes, corresponds to roughly one goal every 32 minutes on average). Here is the distribution of this variable:

It can be roughly approximated by a PoissonDistribution with mean 2.8, which tells us that the probability rate for a goal to happen is about the same in most matches. Another interesting analysis is the evolution of the mean number of goals per match from the 1950s to present day:

We see that in the ’50s, almost four goals were scored on average, while sadly it is only about 2.5 goals per match nowadays. As a result, the probability for teams to tie is now higher (almost 25% end in draws now, against 20% in the ’50s).
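One way to see why fewer goals should mean more draws is a back-of-the-envelope model. The Python sketch below is only an illustration and not part of the original analysis: it assumes that both teams score independently and that each follows a Poisson distribution with half of the match's mean goal count. The function names are made up for this example.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def draw_probability(goals_per_match, max_goals=30):
    """Probability that two equally strong teams score the same number of goals.

    Assumption: each team scores independently, Poisson-distributed,
    with mean goals_per_match / 2. Scores above max_goals are ignored.
    """
    per_team = goals_per_match / 2.0
    return sum(poisson_pmf(k, per_team) ** 2 for k in range(max_goals + 1))

print(round(draw_probability(4.0), 3))  # high-scoring era
print(round(draw_probability(2.5), 3))  # low-scoring era
```

Under these simplifying assumptions, a mean of 4 goals per match gives a draw probability of roughly 21%, and a mean of 2.5 gives roughly 27%, which goes in the same direction as the shift seen in the data.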

Here are the evolutions of the (estimated) probabilities to win when teams are playing in their home country and when they are playing away:

The effect of playing at home is important: teams have about a 50% chance of winning when they are at home, but only a 27% chance when they are away! A naive prediction strategy might then be to always predict the victory of the home team. But there is not always a home team: for this World Cup, the only home team is Brazil.

Let’s now analyze what we can determine about players. Here is the average player height for matches played in a given year:

As expected, players tend to be taller (matching the growth of the entire population). However, they have not gotten heavier (at least not in the last 30 years); in fact, they are getting thinner. Here is their average Body Mass Index (BMI, computed as weight/height²) as a function of time:

We can see that in the ’70s, players’ average BMI increased from 23 kg/m² to 24 kg/m². In the ’80s, the average BMI stayed roughly the same, and since the ’90s it has been steadily decreasing, down to 22.8 kg/m² in 2014. It is hard to interpret the reasons for this behavior, though one could argue that in modern football, speed and agility are preferred over impact skills.

Let’s now dive into the predictions of football matches. In order to predict the winning probabilities of the World Cup, we need to be able to predict the results of individual matches. Predicting the exact score would be interesting, but it is not necessary for our problem. Instead we prefer predicting whether the first team will win (labeled Team1), the second team will win (labeled Team2), or the match will end in a draw (labeled Draw). We thus want a classifier for the classes Team1, Team2, and Draw.

A first classifier would be to pick a class randomly with a uniform distribution, which would give 33% accuracy. To do better, we can use some of the statistical information we gathered earlier on: for example, we know that only 23% of matches are tied, so we could then predict either Team1 or Team2 at random, which would give 38.5% accuracy. To improve upon these naive baselines, we need to start using information about matches and teams, that is, to extract “features” and use them in machine learning algorithms.
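The two baseline figures above follow from simple arithmetic, spelled out here as a small Python sketch. The function names are invented for the illustration, and the second baseline assumes that non-drawn matches split evenly between Team1 and Team2 wins (that is, the order of the two teams carries no information).

```python
def uniform_guess_accuracy(n_classes=3):
    """Expected accuracy when picking one of n_classes uniformly at random."""
    return 1.0 / n_classes

def never_draw_accuracy(draw_rate):
    """Expected accuracy when guessing Team1 or Team2 at random and never Draw.

    Assumption: the non-drawn matches are split evenly between the two teams.
    """
    return (1.0 - draw_rate) / 2.0

print(round(uniform_guess_accuracy(), 3))   # about one third
print(round(never_draw_accuracy(0.23), 3))  # the 38.5% quoted above
```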

With our dataset, we can construct many features in order to feed machine learning algorithms: the number of goals scored in previous matches, the fact that a team plays at home, etc. These algorithms try to find statistical patterns in these features, which will be used to predict the outcome of matches. With the new functions Classify and Predict, we don’t have to worry about how these algorithms work or which one to choose, but only about which features we want to give them. In our problem, we want to predict classes, and thus we will use the Classify function.

We saw in the previous analyses that when teams are playing in their country they have a greater chance of winning. This effect is also present for continents (although in a much less important way). We thus construct a first classifier that uses features indicating whether teams play in their own country or continent. The Country feature will be set to Team1 if the first team plays in their own country, Team2 if the second team plays in their own country, and Neutral if both teams play away. Same goes for the Continent feature (when both teams are from the same continent, the feature is also set to Neutral). Our dataset uses associations to have named features; here is a sample of it:

In order to assess the quality of our classifier, we split the dataset into a training set and a test set, which is composed of the 2000 most recent matches (the dataset is sorted by date here):
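For readers who want to reproduce these two steps outside the Wolfram Language, here is a rough Python sketch of the home-advantage feature and of the date-based split. It is not the code used for this post: the function names, the dictionary layout, and the `"date"` key are assumptions made for the example.

```python
def location_feature(team1_home, team2_home, venue):
    """Return "Team1", "Team2", or "Neutral" for a Country or Continent feature.

    team1_home and team2_home are the teams' own country (or continent),
    venue is the country (or continent) where the match is played.
    If both teams are at home (same continent, say) the result is "Neutral".
    """
    if venue == team1_home and venue != team2_home:
        return "Team1"
    if venue == team2_home and venue != team1_home:
        return "Team2"
    return "Neutral"

def split_by_recency(matches, n_test=2000):
    """Sort matches by date and keep the n_test most recent ones as the test set."""
    ordered = sorted(matches, key=lambda m: m["date"])  # ISO dates sort correctly as text
    cut = max(len(ordered) - n_test, 0)
    return ordered[:cut], ordered[cut:]

print(location_feature("Brazil", "Croatia", "Brazil"))     # Team1 plays at home
print(location_feature("Spain", "Netherlands", "Brazil"))  # neutral ground
```

Splitting by date rather than at random keeps the evaluation honest: the classifier is only ever tested on matches that happened after everything it was trained on.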

We can now train the classifier with a simple command:

With this dataset, the k-nearest neighbors algorithm has been selected by Classify. We can now evaluate the classification performance on the test set:

We obtain about 48% accuracy, which roughly corresponds to the 50% accuracy when always predicting a home win (except that the test set also contains matches played in neutral locations).

Let’s now add a very valuable feature: the Elo ratings of teams. Originally developed for chess, the Elo rating system has been adapted for football (see “World Football Elo Ratings”). This system rates teams according to how good they are. The rating has a probabilistic interpretation: if D = Elo_team1 - Elo_team2, then the predicted probability for team1 to win is P(D) = 1/(1 + 10^(-D/400)).

The Elo rating of all teams starts at 1500 (this value is arbitrary). After a match is played by a given team, their Elo rating is updated according to the formula Elo_new = Elo_old + K * (r - P(D)), where P(D) is the probability for the team to win, r is a variable marked 1 if the team won, 0 if they lost, and 0.5 for a draw, and K is a coefficient that depends on the match type and the difference of goals. Here is an implementation of the rating update in the Wolfram Language:
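For readers following along in another language, the two formulas above can also be sketched in Python. This is an illustrative translation of the formulas only, not the Wolfram Language listing: the function names are invented, and K is simply passed in as a number rather than derived from the match type and goal difference.

```python
def win_probability(elo_team, elo_opponent):
    """P(D) = 1 / (1 + 10^(-D/400)) with D = elo_team - elo_opponent."""
    d = elo_team - elo_opponent
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def updated_elo(elo_team, elo_opponent, result, k):
    """Elo_new = Elo_old + K * (r - P(D)).

    result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k is assumed to be given; the article derives it from the
    competition and the goal difference.
    """
    return elo_team + k * (result - win_probability(elo_team, elo_opponent))

# Two teams at the starting rating; the first one wins a match with K = 20.
print(updated_elo(1500, 1500, 1.0, 20))  # 1510.0
```

Note that the update is zero-sum when both teams use the same K: whatever one team gains, the other loses.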

In the Wolfram Language implementation, matchWeight gives a weight depending on the competition (60 for World Cup finals, 20 for friendly matches, etc.). Here are the computed Elo ratings with our dataset (restricted to matches before the World Cup):

Here is the time evolution of the Elo ratings for some selected teams:

We then compute, before each match, the Elo ratings of both teams and add them as features. Here is a training example:

Again we train a classifier and test its accuracy:

This time, Classify chose the logistic regression method. With this new classifier, about 58.3% of test set examples are correctly classified, which is a great improvement upon the previous classifier. In matches where draws are forbidden (in the knockout phase, for example), this classifier obtains 75.7% accuracy.

Let’s now add some extra features that we think are relevant in order to build a better classifier. Adding more features can lead to overfitting (that is, modeling patterns that are just statistical fluctuations, thus reducing the generalization of our prediction to new examples). Fortunately, Classify has automatic regularization methods to avoid overfitting, so we should not be too concerned about that. We choose to add four extra features for each team:

– goal average of the last three matches
– mean age of players
– mean number of national selections of players
– mean Body Mass Index of players

Here is a training example of the dataset:

Let’s now train our final classifier:

The logistic regression has again been used. We now generate a ClassifierMeasurements[...] object in order to query various performance results:

We now have 58.9% accuracy on the test set. In knockout-type matches, this classifier gives 76.5% accuracy. As we can see, it is only a marginal improvement on the previous classifier. This confirms how powerful the Elo rating feature is, and it is a sign that, from now on, accuracy percentages will be hard to improve. However, we have to keep in mind that our dataset contained many missing values for these extra features.

Let’s now have a look at the confusion matrix for the classification on the test set:

This matrix shows the counts c_ij of class i examples classified as class j. The rows represent the true classes, while the columns represent the predicted classes. For example, we can read that among the 779 matches won by Team1, two have been classified as Draw, 600 as Team1, and 177 as Team2. Interestingly, the classifier decides to predict Draw very rarely. This is due to the low proportion of tied matches (only 23%), but it does not mean the classifier excludes the possibility of draws; here are the classification probabilities on an example:
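As an aside on the confusion matrix described above, here is a small Python sketch of how such a matrix is counted from true and predicted labels. The helper name is made up, and the toy data only reconstructs the Team1 row quoted in the text (the other rows are left empty because their counts are not given here).

```python
from collections import Counter

CLASSES = ["Draw", "Team1", "Team2"]

def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Entry [i][j] counts examples of true class i predicted as class j."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

# Toy data rebuilding only the Team1 row from the text: 779 matches won by Team1.
true_labels = ["Team1"] * 779
predicted_labels = ["Draw"] * 2 + ["Team1"] * 600 + ["Team2"] * 177

matrix = confusion_matrix(true_labels, predicted_labels)
team1_row = matrix[CLASSES.index("Team1")]
print(team1_row)                                  # [2, 600, 177]
print(round(team1_row[1] / sum(team1_row), 3))    # share of Team1 wins correctly predicted
```

Read this way, about 77% of the matches actually won by Team1 were predicted as such.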

Is it possible to improve upon this classifier? Certainly, but we will probably need more and better-quality data. It would be interesting to have access to national championship results, infer players’ skills, how players interact together, etc. With our data, the prospects for improvement seem limited, so we will continue using this classifier to predict World Cup matches.

Our goal is to predict the probabilities for each team to reach a given stage of the competition (round of 16, quarter-finals, semi-finals, finals, and victory). We must infer these probabilities from the outcome probabilities of individual matches given by the classifier. One way to do so would be to compute the probabilities for all possible World Cup results. Unfortunately, the number of possible configurations grows exponentially with the number of matches; it will thus be very slow to compute. Instead, we will simulate World Cup results through Monte Carlo simulations: for each match, we randomly pick one of the outcomes (with RandomChoice) according to their distribution. We can then simulate the development of many imaginary World Cups and count how many times a given team reached a given stage.
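The basic Monte Carlo step, drawing one outcome according to the classifier's probabilities, can be sketched in Python as follows. This is an analogue of the weighted RandomChoice call rather than the original code; the function name and the probabilities in the example are invented.

```python
import random

def simulate_match(probabilities, rng=random):
    """Draw one outcome ("Team1", "Draw", or "Team2") with the given probabilities."""
    outcomes = list(probabilities)
    weights = [probabilities[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

# Hypothetical classifier output for one match.
example = {"Team1": 0.5, "Draw": 0.2, "Team2": 0.3}

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [simulate_match(example, rng) for _ in range(10000)]
frequencies = {o: draws.count(o) / len(draws) for o in example}
print(frequencies)  # close to the input probabilities
```

With enough draws the observed frequencies approach the input probabilities, which is exactly what makes counting over many simulated tournaments a valid estimate.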

We first compute the features associated with each team (continent, Elo rating, mean age, etc.). Here are the features for Brazil:

Using this, we construct a function converting the features of both teams into features used by the classifier:

In the group stage, a victory is three points, a draw one point, and a defeat zero points. Only the first and second teams qualify. Here is a function that simulates the qualified teams for the “round of 16”:

As we cannot compute goal averages, if two teams have an equal number of points, their order is chosen randomly.
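The group-stage rules just described (round robin, 3/1/0 points, random tie-break, top two qualify) can be sketched in Python like this. It is a stand-in for illustration, not the function used in this post: `match_probabilities` is a hypothetical callable that plays the role of the classifier, and the placeholder team names are not real teams.

```python
import itertools
import random

def simulate_group(teams, match_probabilities, rng=random):
    """Simulate one round-robin group and return the two qualified teams.

    match_probabilities(team1, team2) must return a dict with the keys
    "Team1", "Draw", and "Team2" (a stand-in for the classifier).
    """
    points = {team: 0 for team in teams}
    for team1, team2 in itertools.combinations(teams, 2):
        probs = match_probabilities(team1, team2)
        outcome = rng.choices(
            ["Team1", "Draw", "Team2"],
            weights=[probs["Team1"], probs["Draw"], probs["Team2"]],
        )[0]
        if outcome == "Team1":
            points[team1] += 3
        elif outcome == "Team2":
            points[team2] += 3
        else:
            points[team1] += 1
            points[team2] += 1
    # Rank by points; ties are broken by a random number, as in the text.
    ranking = sorted(teams, key=lambda t: (points[t], rng.random()), reverse=True)
    return ranking[:2]

def toy_probabilities(team1, team2):
    """Hypothetical classifier: team "A" always wins, other matches are a coin toss."""
    if team1 == "A":
        return {"Team1": 1.0, "Draw": 0.0, "Team2": 0.0}
    if team2 == "A":
        return {"Team1": 0.0, "Draw": 0.0, "Team2": 1.0}
    return {"Team1": 1 / 3, "Draw": 1 / 3, "Team2": 1 / 3}

print(simulate_group(["A", "B", "C", "D"], toy_probabilities, random.Random(1)))
```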

We then code a function that simulates a knockout round from a list of countries. To do so, we use the option ClassPriors in order to tell the classifier that the probability of Draw in this phase is 0:
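Outside the Wolfram Language, a simple way to approximate this step is to drop the Draw probability and renormalize the other two. The Python sketch below shows that renormalization; it is an assumption made for illustration and is not necessarily identical to what the ClassPriors option does internally. The function name and the example numbers are invented.

```python
def knockout_probabilities(probabilities):
    """Remove the Draw outcome and rescale Team1/Team2 so they sum to 1."""
    total = probabilities["Team1"] + probabilities["Team2"]
    return {
        "Team1": probabilities["Team1"] / total,
        "Team2": probabilities["Team2"] / total,
    }

# Hypothetical group-stage style prediction turned into a knockout prediction.
print(knockout_probabilities({"Team1": 0.5, "Draw": 0.2, "Team2": 0.3}))
```

In this example the 20% of probability mass assigned to a draw is shared between the two teams in proportion to their winning chances, giving 62.5% and 37.5%.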

We can now have our full simulation function:

Here is one simulation and the corresponding plot of the tournament tree:

We can now perform many trials and count how many times each team reaches a given level of the competition.
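To make the counting step concrete, here is a self-contained Python sketch that simulates a small single-elimination bracket many times and tallies the title winners. It is deliberately simplified and is not the simulation used here: it replaces the classifier by the Elo win-probability formula, only counts tournament wins (not every stage), and uses placeholder teams with invented ratings.

```python
import random
from collections import Counter

def elo_win_probability(elo1, elo2):
    """Elo-based probability that the first team beats the second (no draws)."""
    return 1.0 / (1.0 + 10.0 ** (-(elo1 - elo2) / 400.0))

def simulate_knockout(teams, ratings, rng):
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    remaining = list(teams)
    while len(remaining) > 1:
        winners = []
        for i in range(0, len(remaining), 2):
            a, b = remaining[i], remaining[i + 1]
            p = elo_win_probability(ratings[a], ratings[b])
            winners.append(a if rng.random() < p else b)
        remaining = winners
    return remaining[0]

def title_frequencies(teams, ratings, n_trials=10000, seed=0):
    """Fraction of simulated tournaments won by each team."""
    rng = random.Random(seed)
    wins = Counter(simulate_knockout(teams, ratings, rng) for _ in range(n_trials))
    return {team: wins[team] / n_trials for team in teams}

# Placeholder teams and ratings, invented for the example.
ratings = {"A": 1800, "B": 1500, "C": 1500, "D": 1500}
frequencies = title_frequencies(["A", "B", "C", "D"], ratings)
print(frequencies)
```

The estimates become more stable as the number of trials grows, which is why a large number of simulations is used for the real tournament.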

After performing 100,000 simulations, here is what we obtained for winning probabilities:

As one might expect, Brazil is the favorite, with a probability to win of 42.5%. This striking result is due to the fact that Brazil both has the highest Elo rating and plays at home. Spain and Germany follow and are the most serious challengers, with about 21.5% and 15.6% probability to win, respectively. According to our model, there is an almost 80% chance that one of these three teams will win the World Cup.

Let’s now look at the probabilities to get out of the group phase:

This ranking follows the ranking of final victory. There are some interesting things to note: while Germany and Argentina have about the same probability to get out of their group, Germany is more than three times as likely to win. This is partly due to the fact that Germany has strong opponents in its group (Portugal, USA, and Ghana), while Argentina is in quite a weak group.

Finally, here are plots of the probabilities to reach each stage of the competition for the nine favorite teams:

We can see the domination of Europe and South America in football.

At the time of writing (June 17), some matches have already been played. Let’s see how our classifier would have predicted them:

From the first 15 matches, 11 have been correctly classified, which gives 73.3% accuracy. This is higher than expected; we have been lucky. We will report the final accuracy on all the matches after the World Cup is over.

So what else can we do with this classifier? Besides being disappointed that our favorite team has little chance of winning, one straightforward application is for betting. How could we do that? Let’s say that we just want to bet on the result of matches (Team1 wins, Team2 wins, or Draw). The naive approach would be to bet on the outcome predicted by the classifier, but this is not the best strategy. What we really want is to maximize our gain according to the probabilities predicted by the classifier and the bookmaker odds. In order to do so, we can use the option UtilityFunction, which sets the utility function of the classifier. This function defines our utility for each pair of actual-predicted classes. In order to make a decision, the classifier maximizes the expected utility. By default, the utility is 1 when an example is correctly classified, and 0 otherwise; therefore, the most likely class is predicted. In our case, the utility should be our money gain: if we make the correct prediction, it will be the betting odds for the corresponding outcome, and otherwise it will be 0. Here is how we can construct such a utility function using associations:

However, if we add the betting odds in the utility, the decision is the opposite:

It thus seems reasonable to bet on Switzerland. Now, should we blindly follow the decision of the classifier? Well, there are some counterarguments. First, this method does not take into account our risk aversion: it will choose the maximum expected utility no matter what the risks are. This strategy wins in the long run, but might lead to severe losses at a given time. We also have to consider the quality of the predictions: are they better than bookmakers’ odds? Betting odds reflect what people think, and people often put feelings into their bets (e.g., they have a tendency to bet on their favorite team). In that sense, a cold machine learning algorithm will perform better. On the other hand, many bettors already use algorithms to bet, and these are probably more sophisticated than this one. So use at your own risk!
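The expected-utility decision described above can be written down in a few lines of Python. This sketch is only an illustration of the rule (utility equal to the betting odds when the prediction is right, 0 otherwise); the function name, the probabilities, and the odds are all invented and do not correspond to any real match.

```python
def best_bet(probabilities, odds):
    """Pick the outcome maximizing expected gain = probability * betting odds."""
    expected_gain = {o: probabilities[o] * odds[o] for o in probabilities}
    choice = max(expected_gain, key=expected_gain.get)
    return choice, expected_gain

# Hypothetical classifier probabilities and bookmaker odds (payout per unit staked).
probabilities = {"Team1": 0.60, "Draw": 0.25, "Team2": 0.15}
odds = {"Team1": 1.5, "Draw": 3.5, "Team2": 8.0}

most_likely = max(probabilities, key=probabilities.get)
choice, expected_gain = best_bet(probabilities, odds)
print(most_likely)  # the default decision: the most probable class
print(choice)       # the decision once the odds enter the utility
```

In this made-up example the most likely outcome is Team1, but the long odds on Team2 give it the highest expected gain, so the utility-based decision flips, just as in the match discussed above. An expected gain above 1 per unit staked is what makes a bet attractive under this rule.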

Taking the 4 top teams, the probability that all 4 would survive the first round is only ~56% (by my estimate from the bar graph). Thus not anomalous for one of the 4 to be out. Of course given that one would be out, there was a priori only a 25% chance it would be Spain. But that could be said of any of the 4.

Thank you for your comment! In its current state, the Classify function first uses the number of examples, number of features, type of data, etc. to determine possible models. Then, the best model is selected by cross-validation: the models are trained on a part of the data, and tested on another part (the operation might be repeated using a different data split to improve the statistical relevance).

I really don’t understand how the USA is supposed to perform better at all than Uruguay, Italy, Russia or Mexico. Even Japan, who everybody knows as a sure loser in the first round, scores better than Russia, Mexico and Costa Rica (who seem to have a good team as of late). I’m guessing that there is some sort of confounding effect: some of the processed variables may be irrelevant, or may not weigh nearly as much as you think they do.

Then again, Spain was quickly eliminated after performing much worse than expected, so statistics can only predict so much in the end. Anyhow, I guess that Brazil-X, where X is Argentina, Germany, France, England, Uruguay or Costa Rica, is a good bet for the final. Teams from outside Latin America or Europe are out almost by default, barring the odd African surprise.

Not that I follow football (boring) but near everybody around does, so in the end you get some info that may be missing in the algorithm.

Soccer world cups happen once every 4 years, they are quick and not very thorough, and national teams have a high turnover rate, so there is very little reason to build statistics on the history of “Italy” or “Spain”, as all teams are way too different from one tournament to the next.
The fact that your predictions have been defied so consistently, in my view, lends more credibility to soccer as a proper sport, where psychological resilience, rehearsals and athletic condition (if we could measure them meaningfully) should be much better predictors for a match’s outcome than historical analyses.

For instance, the emergence of Spain as a serial trophy winner over the last few years has been explained with their peculiar style of playing (“tiki-taka”), which used to confuse opponents. Apparently, the surprise effect just faded over the years, they did not evolve fast enough, and now what once was confusing has become too predictable. This is very logical and straightforward, it has a predictably huge impact on results, but it is just noise for a purely historical analysis…

This is not true. There are lots (I’m talking crowds of 30,000 to 40,000) of ticket-paying fans who travelled from Argentina, Chile, Colombia and Uruguay, not counting those who already live in Brazil. These teams will have the home advantage against anyone (except when playing against Brazil). Just watch any of their games.

No algorithm and no sort of data can predict who will win the World Cup.
According to your graph, the winning probability of Spain is 2nd among 32 teams.
Just see: big favorites Spain are already knocked out of the tournament after playing 2 matches. Looking into the past history of Spain, they are the defending World Cup champions and Euro championship winners. The team has not changed a lot since the last World Cup.

Data can help, but it cannot predict the correct winner. Let a few things remain in the hands of God.

Finally, I appreciate that you tried something out of the box, and I am looking forward to reading many more interesting articles from you guys.

In this article you start by importing XML files into the “Wolfram Language”. Could you describe, or point me to, the basics of the Wolfram Language, because I don’t really know what it is and how I can use it to make the model you made (or other models for that matter).

I’d love to build my own model (also for other football competitions) but I’ve got no idea where to start.

The Wolfram Language is a complete programming language which has a very large number of built-in functions, algorithms, as well as data. For example if you wanted to sort a list you could use the built-in function Sort:

Sort[{d,b,c,a}]

which will result in a list where the elements are indeed sorted:

{a,b,c,d}

You can try this yourself in the Wolfram Programming Cloud (which gives you access to explore and use the Wolfram Language). There are of course way more functions than this, and you can explore the Documentation Center to get a better idea of the scope that the Wolfram Language can cover as well as examples contained in the Code Gallery. There are also quite a few training videos that you can watch for free as well. In particular for this model, Etienne used some pattern matching to prepare his data for analysis and some machine learning techniques to make predictions. I hope this gets you started in the right direction, but if you find out you still have more questions, please feel free to post some questions in the Wolfram Community. I wouldn’t be surprised if you found some other football fans out there with a similar interest in creating models to predict matches.

As interesting as this analysis is, it is all based on historical evidence. The fact that Brazil completely fell apart in the last two games of the World Cup wasn’t reflected in its Elo rating whatsoever.

Based on what I saw in the group stage games, I saw Brazil as a weak team that got lucky (even with Neymar). Germany looked strong from the start, a well put together team.

Argentina showed some strength against Germany, with of course Messi being able to get by the German defense a couple of times.

It was an interesting analysis, but one cannot use Elo ratings as a good predictor at the World Cup stage. If one watched all the group-stage games, one could potentially pick out which teams would move on. The only other factor is bad calls or “fixed” calls by the referee. Some games are legitimate, but if you watch, there may be a few key games that could be fixed. Perhaps the Wald-Wolfowitz test could pick out the fixed games?

Elo ratings don’t add up in one respect. Given an initial start of 1500 (an arbitrary number as you said) and given 4 fictional teams all playing each other only once. What is not right about the rating is when a weaker team plays a stronger team and wins more points than if it were the other way around. Given our example if we switch up when the teams played they end up with a different score in the end. So in that respect the rating will need to change.

Also, given the FIFA scandals over the last 20 years, all those ratings will definitely be skewed; perhaps we do need to do the Wald-Wolfowitz test to determine where things are going awry.