Researchers have predicted the result after simulating the entire soccer tournament 100,000 times.
With the tournament only a few days away, some researchers have set out to predict the winner of the 2018 World Cup. The Cup starts in Russia on Thursday and is likely to be one of the most-watched sporting events in history, even more popular than the Olympic Games. So the possible winners are of great interest.
One way to gauge the possible outcomes is to look at bookmakers' odds. These companies use professional statisticians to analyze extensive databases of results in a way that quantifies the probability of the different outcomes of any possible match. In this way, bookmakers can offer odds on all the games that will be played in the coming weeks, as well as odds on the potential winners.
An even better estimate comes from comparing the odds of many different bookmakers. This approach suggests that Brazil is the clear favourite to win the 2018 World Cup, with a probability of 16.6%, followed by Germany (12.8%) and Spain (12.5%).
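As a rough illustration of how a consensus estimate might be built from several bookmakers, the sketch below converts decimal odds into implied probabilities, rescales them so that each bookmaker's probabilities sum to one, and then averages across bookmakers. The bookmaker names, the odds and the function names are all hypothetical, and simple averaging is an assumption here, not necessarily the method behind the figures quoted above.

```python
from statistics import mean

# Hypothetical decimal odds from three made-up bookmakers (illustrative only,
# not real quotes). "Other" lumps together every remaining team.
ODDS = {
    "bookie_a": {"Brazil": 5.0, "Germany": 5.5, "Spain": 7.0, "Other": 1.90},
    "bookie_b": {"Brazil": 4.8, "Germany": 6.0, "Spain": 6.5, "Other": 1.95},
    "bookie_c": {"Brazil": 5.2, "Germany": 5.8, "Spain": 7.5, "Other": 1.85},
}


def implied_probabilities(odds):
    """Convert one bookmaker's decimal odds into probabilities that sum to 1."""
    raw = {team: 1.0 / price for team, price in odds.items()}
    # The raw values typically sum to more than 1 because of the bookmaker's
    # margin, so rescale them.
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


def consensus(all_odds):
    """Average the normalised probabilities over all bookmakers."""
    per_bookie = [implied_probabilities(odds) for odds in all_odds.values()]
    teams = per_bookie[0].keys()
    return {team: mean(b[team] for b in per_bookie) for team in teams}


if __name__ == "__main__":
    ranked = sorted(consensus(ODDS).items(), key=lambda kv: kv[1], reverse=True)
    for team, probability in ranked:
        print(f"{team}: {probability:.1%}")
```

Rescaling before averaging keeps one bookmaker with an unusually large margin from skewing the consensus.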
But in recent years, researchers have developed machine learning techniques that have the potential to outperform conventional statistical approaches. What do these new techniques predict as the likely outcome of the 2018 World Cup?
One answer comes from the work of Andreas Groll at the Technical University of Dortmund in Germany and some colleagues. The team uses a combination of machine learning and conventional statistics, a method called a random forest approach, to identify a different winner.
First, some background. The random forest technique has emerged in recent years as a powerful way to analyze large data sets while avoiding some of the pitfalls of other data-mining methods. It is based on the idea that a future event can be predicted by a decision tree in which the outcome at each branch is calculated by reference to a set of training data.
However, decision trees suffer from a well-known problem. In the final stages of the branching process, decisions can be seriously distorted by training data that is sparse and prone to wide variation at this level of detail, a problem known as overfitting.
The random forest approach is different. Instead of calculating the outcome at every branch, the process calculates the outcome of randomly chosen branches. And it does so many times, each time with a different set of randomly selected branches. The final result is the average of all these randomly constructed decision trees.
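The averaging idea can be sketched in a few lines of plain Python. This is a deliberately tiny stand-in, not the model Groll and co built: each "tree" has a single split, the two features (a ranking difference and a difference in Champions League players) and the toy match data are hypothetical, and real random forests grow much deeper trees.

```python
import random

# Hypothetical toy data: (ranking difference, Champions League player difference)
# for team A versus team B, and whether team A won (1) or not (0).
ROWS = [(10, 3), (8, 1), (5, 2), (-6, -2), (-9, -1), (-4, -3), (2, 4), (-3, -4)]
LABELS = [1, 1, 1, 0, 0, 0, 1, 0]


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, splitting at that feature's mean."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    overall = sum(labels) / len(labels)
    # Each leaf predicts the mean label of the training rows that land in it.
    left_value = sum(left) / len(left) if left else overall
    right_value = sum(right) / len(right) if right else overall
    return feature, threshold, left_value, right_value


def predict_stump(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value


def fit_forest(rows, labels, n_trees=200, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: resample the training rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Randomly pick the feature this tree is allowed to split on.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """The forest's prediction is the average over all of its trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


if __name__ == "__main__":
    forest = fit_forest(ROWS, LABELS)
    print(predict_forest(forest, (7, 2)))    # estimated chance that team A wins
    print(predict_forest(forest, (-7, -2)))
```

Because every tree sees a different resampled data set and a randomly chosen feature, no single quirk of the training data dominates the averaged prediction.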
This approach has significant advantages. First, it does not suffer from the same problem of overfitting that affects ordinary decision trees. It also reveals which factors are most important in determining the outcome.
So, if a particular decision tree includes many parameters, it is easy to see which have the greatest impact on the result and which do not. These less important factors may be ignored in the future.
Groll and co use exactly this approach to model the 2018 World Cup. They model the outcome of each game that teams can play and use the results to build the most likely course of the tournament.
Groll and co begin with a wide range of potential factors that might determine the outcome. These include economic factors such as a country's GDP and population, FIFA's ranking of national teams, and properties of the teams themselves, such as their average age, the number of Champions League players they have, whether they have home advantage, and so on.
Interestingly, the random forest approach also allows Groll and co to include other ranking attempts, such as the rankings used by bookmakers.
Feeding all this into the model provides some interesting insights. For example, the most influential factors are the team rankings created by other methods, including those of bookmakers, FIFA and others.
Other important factors include GDP and the number of Champions League players in the team. Unimportant factors include the population of the country, the nationality of the coach, and so on.
The predictions that come out of this process differ from others in some important respects. For starters, the random forest method picks Spain as the most likely winner, with a probability of 17.8 per cent.
However, an important factor in this prediction is the structure of the tournament itself. If Germany gets through the group stage of the competition, it is more likely to face strong opposition in the round of 16. Because of this, the random forest method calculates that Germany's chances of reaching the quarterfinals are 58 per cent. By contrast, Spain is unlikely to face strong opposition in the round of 16, so it has a 73 per cent chance of reaching the quarterfinals.
If both reach the quarterfinals, they have a more or less equal chance of winning. "Spain is slightly favoured with respect to Germany, mainly due to the fact that Germany has a comparatively high chance of dropping out in the round of sixteen," say Groll and co.
But there is an additional twist. The random forest process makes it possible to simulate the entire tournament, and this produces a different result.
Groll and co simulated the entire tournament 100,000 times. "According to the most likely tournament course, instead of the Spanish, the German team would win the World Cup," they say.
Of course, because of the huge number of possible permutations of games, this particular course is extremely unlikely. Groll and co put the odds at around 1 in 100,000.
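To see how repeated simulation can separate "the team most likely to win" from "the single most likely course of the tournament", here is a minimal Monte Carlo sketch. It is not the researchers' model: the eight teams, their strength ratings, the bracket and the rule that a team wins with probability proportional to its strength are all hypothetical stand-ins for the match predictions a random forest would supply.

```python
import random
from collections import Counter

# Hypothetical strength ratings and bracket order for an 8-team knockout.
STRENGTH = {"A": 9, "B": 8, "C": 6, "D": 5, "E": 5, "F": 4, "G": 3, "H": 2}
BRACKET = ["A", "H", "D", "E", "B", "G", "C", "F"]


def win_probability(a, b):
    """Assumed chance that team a beats team b (proportional to strength)."""
    return STRENGTH[a] / (STRENGTH[a] + STRENGTH[b])


def simulate_once(rng):
    """Play one tournament; return the champion and the full course of winners."""
    teams = list(BRACKET)
    course = []
    while len(teams) > 1:
        next_round = []
        # Neighbouring teams in the list meet each other in this round.
        for a, b in zip(teams[::2], teams[1::2]):
            winner = a if rng.random() < win_probability(a, b) else b
            next_round.append(winner)
        course.extend(next_round)
        teams = next_round
    return teams[0], tuple(course)


def simulate(n_runs=100_000, seed=0):
    """Repeat the tournament n_runs times, counting champions and courses."""
    rng = random.Random(seed)
    winners, courses = Counter(), Counter()
    for _ in range(n_runs):
        champion, course = simulate_once(rng)
        winners[champion] += 1
        courses[course] += 1
    return winners, courses


if __name__ == "__main__":
    n_runs = 100_000
    winners, courses = simulate(n_runs)
    for team, count in winners.most_common(3):
        print(f"{team} wins in {count / n_runs:.1%} of simulations")
    course, count = courses.most_common(1)[0]
    print(f"Most frequent course {course} occurs in {count / n_runs:.2%} of runs")
```

Even the most frequent course accounts for only a small share of the runs, because its probability is the product of every individual match outcome along the way. With 64 matches in the real tournament instead of seven, that share becomes far smaller.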
So there you have it. At the beginning of the tournament, Spain has the best chance of winning, according to Groll and company. But if Germany reaches the quarterfinals, then it becomes the favourite.
The tournament starts on Thursday, when the hosts, Russia, face Saudi Arabia. Sadly, neither of these teams seems likely to reach the quarterfinals.
Various apps have also been created to help fans follow the World Cup, including apps in Peru for following the Peruvian national team.