Simulating the Rugby World Cup 2019 Japan in R

I really like running simulation models before sporting events because they can give you a much greater depth of understanding of team performance compared to the ‘raw’ odds that you might get from the media or bookmakers, or the often varied opinions of different sports pundits.

Yes, Ireland usually get knocked out in the Quarter Finals, and this is what many people are saying is most likely to happen again this year – but does the data show this? Pool C is the ‘pool of death’, right? With England, France and Argentina vying for qualification in the top two spots. Or, does Pool D hold that title, with Wales, Australia and Fiji battling it out? Oh, and we know Italy is having a tough time right now, but if we could have 10,000 world cups, would they win it even once?

That’s where a simulation can provide a lot of extra oomph for a little extra effort. If we take the ‘raw’ odds and then add in some volatility, we can understand the distribution of outcomes.

That being said, with the Rugby World Cup 2019 officially starting in just over two weeks on 20 September, I thought I’d best run a simulation model.  The model is relatively simple, but nevertheless I hope it can provide some insight into both the tournament and also more generally how to simulate a sporting competition!

Venues of the RWC 2019 Japan
Structure of the tournament. Source: RWC2019

Rating Data – How good is a team?  Given that, how likely is team A to beat team B?

I used two different data sources:

World Rugby, the global governing body of rugby union, publishes official world rankings weekly. These have some shortcomings, although they’re widely followed by the media and have been around for many years.

Rugby Pass, a rugby broadcaster from New Zealand, publishes its Rugby Pass Index. It’s a bit of a black-box as they don’t publish their methodology, and it has only been around for a year, but they claim to use machine learning and it appears to work at player-granularity.

When plotted for each team, there is a fairly linear relationship between the two ratings systems, although the World Rugby rankings show a smaller gap between the lower and the higher teams.

I could write another post comparing the two ratings systems, but in a nutshell I believe the Rugby Pass Index more accurately represents the current form of teams, although it doesn’t track the lower-tier nations as well as the World Rugby rankings do.

A source of recent volatility in the ratings has been the summer friendly matches, which are counted as full fixtures by both systems, although some teams may have regarded them as experimental/warm-up matches.

So I decided to take a consensus, using the mean of both ratings for each team, for before and after the summer friendly matches. To allow for teams improving or dropping off between now and the start of the tournament, a rand(-5,5) adjustment was made to each rating in each simulation. Finally, Japan was given a 1.5pt home advantage adjustment as host.
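To make the adjustment concrete, here is a minimal Python sketch of how a per-simulation rating could be built. It is not the R script behind this post. The function name `simulation_rating` is hypothetical, rand(-5,5) is assumed to be a uniform draw, and the input ratings are assumed to be on a comparable scale. How a rating is then turned into a win probability is not covered here.

```python
import random
from statistics import mean


def simulation_rating(ratings, is_host=False, rng=random):
    """Consensus rating for one team in one simulated tournament.

    ratings: the team's ratings from both systems, before and after the
             summer friendlies (assumed to be on a comparable scale).
    """
    consensus = mean(ratings)            # mean of the available ratings
    form_shift = rng.uniform(-5, 5)      # assumption: rand(-5,5) is uniform
    home_bonus = 1.5 if is_host else 0.0  # host adjustment from the post
    return consensus + form_shift + home_bonus
```

Calling this once per team at the start of every simulated tournament gives each team a slightly different rating each time, which is what opens up the less likely pathways.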

The favourite team doesn’t always win – how much leeway for luck is there? We need a Distribution for Points Scored in a Rugby Match.

For those less familiar with rugby, these are examples of scorelines, from RWC 2015.

In a game like soccer, you get a lot of draws because goals don’t happen very often, at least compared to rugby where it’s not unusual to see 60pts or more scored per team.  What does a distribution of points look like?

If we can define this, then, combined with the ‘raw’ probability of winning, we can take random samples from it across many simulations to understand the distribution and volatility of outcomes.

For the general probability distribution of occurrence of tries scored, conversions given a try has been scored, and penalties scored, I have recycled a distribution I calculated previously based on a season of the European club Pro 14 competition. Perhaps in another post I will look at how this compares to the distribution in international/RWC matches but for an initial model this is sufficient.

Now that we have the probability of winning, and can randomly sample from the distribution of points scored, we can build our Monte Carlo model!

For each match simulation, a random sample was taken from the probability distribution of tries, conversions and penalties scored, and essentially multiplied by the probability of winning for each team. For each tournament simulation, the results were recorded and the tournament was simulated again, until we got 10,000 simulated tournament results.
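As a rough illustration of the match-level step, here is a minimal Python sketch. It is not the R script behind this post: the probability mass functions and the conversion rate are made-up placeholders rather than the Pro 14 values, and scaling the sampled counts by `2 * p_win` is only one assumed reading of "multiplied by the probability of winning". It repeats a single match rather than a whole tournament.

```python
import random

# Hypothetical probability mass functions, NOT the Pro 14 values from the post.
TRIES_PMF = {0: 0.10, 1: 0.20, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.10}
PENALTIES_PMF = {0: 0.20, 1: 0.30, 2: 0.30, 3: 0.15, 4: 0.05}
P_CONVERSION = 0.7  # assumed chance that any given try is converted


def sample(pmf, rng):
    """Draw one value from a {value: probability} mapping."""
    values = list(pmf)
    weights = list(pmf.values())
    return rng.choices(values, weights=weights, k=1)[0]


def simulate_score(p_win, rng):
    """Sample one team's score, scaling the sampled counts by its win probability."""
    scale = 2 * p_win  # assumption: an even match (p_win = 0.5) leaves the sample unchanged
    tries = round(sample(TRIES_PMF, rng) * scale)
    penalties = round(sample(PENALTIES_PMF, rng) * scale)
    conversions = sum(rng.random() < P_CONVERSION for _ in range(tries))
    return 5 * tries + 2 * conversions + 3 * penalties


def simulate_match(p_win_a, rng):
    """Return (score_a, score_b) for one simulated match."""
    return simulate_score(p_win_a, rng), simulate_score(1 - p_win_a, rng)


def win_rate(p_win_a, n_sims=10_000, seed=2019):
    """Share of simulated matches that team A wins outright."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        score_a, score_b = simulate_match(p_win_a, rng)
        wins += score_a > score_b
    return wins / n_sims
```

Because the scores are sampled rather than fixed, the favourite still loses some of the time, and that is where the volatility in the tournament results comes from.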

Finally, let’s look at some results – the pool stages.

Pool A: Ireland and Scotland most likely to qualify, although it’s plausible that Japan in particular could humble one of them and qualify instead.

Pool B: this is far more clear-cut! New Zealand and South Africa are very likely to qualify, unless Italy cause an upset. Italy, in turn, are unlikely to be upset by Namibia and Canada, who will battle it out for fourth place.

Pool C: England are fairly likely to qualify, but Argentina and France would have to battle it out for runner-up. The USA then have a fair chance of causing an upset, although it would have to be an upset against the winner of Argentina v France, which would seem unlikely. For a future version, I hope to capture this by dynamically changing the ratings as the tournament simulation progresses.

Pool D: a similar picture to Pool A with Australia and Wales being favourites to qualify, unless Fiji manage to force an upset. Equally, Georgia are looking like a strong fourth team, who could themselves upset Fiji by taking third place. All whilst Uruguay look the most certain of any team to finish bottom of their group.

I’m calling out Pool D now as a good pool to watch for the neutral fan!

Won’t you tell me, who’s going to win the World Cup?

Results of the knock-out stages. Probabilities are cumulative, i.e. they show how often a team played in a quarter-final, semi-final, etc.

It’s perhaps no surprise that the New Zealand All Blacks, current World Cup holders with the strongest legacy in world rugby, are favourites to win (28% chance), with close to double the odds of their nearest rivals (16% chance). However, their four nearest rivals (Ireland, England, Wales and South Africa) all share roughly the same odds, as a bloc accounting for 60% of the chance. The remaining chance is taken up by Australia (5% chance), Argentina, Scotland and France (1-2% chance each), with the remaining teams having less than 1% chance. So, unlike at the FIFA soccer World Cup last year, which saw relative outsider Croatia make the final, it’s very unlikely a team from outside these top nine teams will win the tournament.

This story carries through when we look back to the semi-finals and quarter-finals. With effectively six very strong teams and three fairly strong teams, in most cases there are only nine teams vying for eight places in the knock-out stages, which means there is not much volatility in this regard.

Fiji, Japan, USA, Georgia, Samoa, Italy all have a plausible chance of causing an upset in the pool stages and making it to the quarter-finals.

Barring a near-miracle, Russia, Tonga, Uruguay, Canada and Namibia are unlikely to reach the knock-out stages, although I look forward to some good pool stage matches from these teams, who successfully battled through qualification to make it to Japan.

Are Ireland going to go out in the quarter-finals?

This is the subject of much debate given Ireland’s recent volatile performance, and the notable statistic that Ireland have been knocked out in the quarter-finals in six of the past eight World Cups. As an Ireland fan, it pains me to say that Ireland (45% absolute chance) are the team most likely to exit at the quarter-finals! It doesn’t make it any better, but they are followed by Scotland (41% chance) and Australia (36% chance).

Most likely to exit at the semi-finals are Wales (28% chance), England (27% chance) and South Africa (24% chance).
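The difference between the cumulative probabilities (how often a team played in a given round) and the absolute exit chances quoted above can be sketched in a few lines of Python. This is an illustration only, not the R script behind this post. The stage labels and the function name `stage_probabilities` are hypothetical, and the input is assumed to be one "furthest stage" label per simulated tournament for a single team.

```python
from collections import Counter

# Stage at which a team's tournament ended; "winner" means it won the final.
STAGES = ["pool", "quarter-final", "semi-final", "final", "winner"]


def stage_probabilities(furthest_stages):
    """Return (reach_prob, exit_prob) for one team.

    reach_prob[stage]: share of simulations in which the team got at least that far.
    exit_prob[stage]:  share of simulations in which the team's run ended there.
    """
    n_sims = len(furthest_stages)
    counts = Counter(furthest_stages)
    exit_prob = {stage: counts[stage] / n_sims for stage in STAGES}

    reach_prob = {}
    reached = 0
    for stage in reversed(STAGES):  # accumulate from the final backwards
        reached += counts[stage]
        reach_prob[stage] = reached / n_sims
    return reach_prob, exit_prob
```

The chance of exiting at the quarter-finals is then the chance of reaching a quarter-final minus the chance of reaching a semi-final.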

Wrap-up

Well, I hope you enjoyed this as much as I did.  The code, as ever, is available via Git here. I’ll hopefully follow-up soon with a post about a Sankey plot to visualise the stats more effectively.

Road to Rugby World Cup 2019: Rugby scores decomposition

With the Rugby World Cup 2019 Japan starting on 20th September, I thought I’d take a look at the tournament from a few different statistical angles. For this post I’ll be looking at the problem: given a rugby score, how can we decompose it into possible combinations of tries, conversions, penalties and drop goals?

Context

I have a dataframe of results for almost all professional and international rugby union scores since the 2012/13 season, more than 10,000 matches. This is nice in terms of ‘breadth’ of the sample – however in terms of depth it’s a bit lacking! For each result I only have home/away team and home/away score, for example:

27/07/2019  New Zealand  South Africa  16  16

I was curious: is it possible to decompose the results into valid combinations of scoring methods? Then, perhaps as a second stage, estimate the probability of occurrence of each combination for a given score? The first question I will be looking at in this post, and the second will be next up in the series!

I’ve never seen a match of rugby before! What are the scoring methods you’re referring to?

TRY (5 points): awarded when an attacking player grounds the ball in the area at the end of the pitch (“in-goal area”).

CONVERSION (2 points): the team who has scored a try immediately gets to kick at goal for another 2 points before kick-off restart.

PENALTY GOAL (3 points): when an infringement is made, a penalty may be awarded to the other team who may then choose to take a penalty kick at goal.

DROP GOAL (3 points): a player may, at any time in play, drop-kick the ball over and between the posts.

PENALTY TRY (7 points): if a foul has stopped the attacking team from scoring then a penalty try is awarded, worth a full 7 points. Note: these happen fairly rarely, so I include this just for completeness and don’t refer to penalty tries from here on.

I took the videos above from World Rugby Laws of the Game which is a great resource if you want to learn more about the laws of the game.

Grouping the Elementary Scoring Methods

  • 3 points: penalty goal or drop goal.
  • 5 points: unconverted try (i.e. a try has been scored, but the conversion did not score the extra 2 points)
  • 7 points: converted try (i.e. a try has been scored, and the conversion succeeded in scoring the extra 2 points).

Starting Off the Analysis: Scores 0-7

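| Score | Penalties or drop goals (3pt ea) | Unconverted tries (5pt ea) | Converted tries (7pt ea) |
|-------|----------------------------------|----------------------------|--------------------------|
| 0     | 0                                | 0                          | 0                        |
| 3     | 1                                | 0                          | 0                        |
| 5     | 0                                | 1                          | 0                        |
| 6     | 2                                | 0                          | 0                        |
| 7     | 0                                | 0                          | 1                        |
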
  • 0 is obviously a valid score.
  • 3, 5, 7 are obtained from elementary scores only.
  • 6 is obtained only from two penalties.
  • 1, 2 and 4 are not valid scores as they cannot be sums of 3, 5 and 7.

Onwards! Scores 8-10

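| sc | pd | ut | ct |
|----|----|----|----|
| 8  | 1  | 1  | 0  |
| 9  | 3  | 0  | 0  |
| 10 | 1  | 0  | 1  |
| 10 | 0  | 2  | 0  |

(sc = score, pd = penalties or drop goals, ut = unconverted tries, ct = converted tries)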

The only way forward is to score 3, 5, or 7 points! So new valid scores/combinations are {previous scores} + {3,5,7}. In the table above:

  • 8 is the row for 5 points but plus 1 penalty/drop goal.
  • 9 is the row for 6 points but plus 1 penalty/drop goal.
  • 10 is either a converted try plus 1 penalty/drop goal OR an unconverted try plus another unconverted try (hence two rows).

Scripting

With the general rule established, it is fairly easy to script it:

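for each i in 8:150
    if i - 3 in validscores, copy row(s) of i - 3, increment pen/dg field by 1 and set score to i
    if i - 5 in validscores, copy row(s) of i - 5, increment unconverted try field by 1 and set score to i
    if i - 7 in validscores, copy row(s) of i - 7, increment converted try field by 1 and set score to i
    drop any duplicate rows for i, then add i to validscores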

I wrote a script in R, which can be found on my github repo along with the results for scores up to 150 points.
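For readers who would like to try the rule without the R script, here is a minimal Python sketch of the same idea. It is an illustration rather than a port of my script: the function name `decompose_scores` is hypothetical, it starts from a score of 0 instead of the hand-built table for 0-7, and it uses sets so that a combination reached by two different routes is only kept once.

```python
def decompose_scores(max_score=150):
    """Map each valid score to its set of (pd, ut, ct) combinations.

    pd = penalties or drop goals (3pt), ut = unconverted tries (5pt),
    ct = converted tries (7pt).
    """
    combos = {0: {(0, 0, 0)}}  # a score of 0 needs no scoring events
    for score in range(1, max_score + 1):
        found = set()
        # Try to reach this score by adding one 3, 5 or 7 point event
        # to every combination of the corresponding lower score.
        for index, points in enumerate((3, 5, 7)):
            for combo in combos.get(score - points, ()):
                bumped = list(combo)
                bumped[index] += 1
                found.add(tuple(bumped))
        if found:  # scores such as 1, 2 and 4 never get an entry
            combos[score] = found
    return combos
```

For example, `decompose_scores()[16]` holds the two combinations `(3, 0, 1)` and `(2, 2, 0)`, and the scores 1, 2 and 4 are absent from the result.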

What about the New Zealand vs South Africa match mentioned at the start?

| sc | pd | ut | ct |
|----|----|----|----|
| 16 | 3  | 0  | 1  |
| 16 | 2  | 2  | 0  |

Both teams scored 16 points. Both teams got there through one converted try and three penalties, corresponding to the first row of two possible ways to reach 16 points.

Are all scoring combinations equally likely?

No, for a given score, not all scoring combinations are equally likely because even if all scoring methods were of equal probability (1/3 probability each), they contribute different amounts of points and so this would make the likelihood uneven!
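One simple way to see this, sketched in Python under assumptions of my own choosing: treat every scoring event as an independent draw between the three methods, weight each combination by its multinomial probability (number of orderings times the probability of each event), and then normalise across the combinations that reach the score. The function name `combination_probabilities` is hypothetical, and this is not necessarily the approach the next post will take.

```python
from math import factorial


def combination_probabilities(combos, p=(1 / 3, 1 / 3, 1 / 3)):
    """Relative likelihood of each (pd, ut, ct) combination for one score.

    p: assumed probabilities that a scoring event is a penalty/drop goal,
       an unconverted try or a converted try.
    """
    weights = {}
    for pd, ut, ct in combos:
        n_events = pd + ut + ct
        # Number of distinct orders in which these events could occur.
        orderings = factorial(n_events) // (factorial(pd) * factorial(ut) * factorial(ct))
        weights[(pd, ut, ct)] = orderings * p[0] ** pd * p[1] ** ut * p[2] ** ct
    total = sum(weights.values())
    return {combo: weight / total for combo, weight in weights.items()}
```

Even with equal probabilities the weights differ, both because combinations need different numbers of scoring events and because some can occur in more orders than others.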

The table below shows all of the possible scoring combinations relating to the score of 48 points.

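| sc | pd | ut | ct |
|----|----|----|----|
| 48 | 16 | 0  | 0  |
| 48 | 12 | 1  | 1  |
| 48 | 11 | 3  | 0  |
| 48 | 9  | 0  | 3  |
| 48 | 8  | 2  | 2  |
| 48 | 7  | 4  | 1  |
| 48 | 6  | 6  | 0  |
| 48 | 5  | 1  | 4  |
| 48 | 4  | 3  | 3  |
| 48 | 3  | 5  | 2  |
| 48 | 2  | 7  | 1  |
| 48 | 2  | 0  | 6  |
| 48 | 1  | 9  | 0  |
| 48 | 1  | 2  | 5  |
| 48 | 0  | 4  | 4  |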

It’s pretty unlikely that a team would ‘rack up’ so many points through 16 penalties/drop goals without scoring any tries! Equally, it’s unlikely a team would get there through just scoring tries alone. Intuitively, it would seem that it would be through a mixture that a team would be most likely to get there.

If we know the relative likelihood of occurrence of the three scoring methods to each other then we can calculate the probability of scoring combinations for a given score. That’s what we’ll be looking at in the next post!

Simulating the Six Nations 2019 Rugby Tournament in R: Final Round Update

In an earlier post I blogged how I had made a Monte Carlo simulation model of the Six Nations Rugby Tournament.  With the final round of the tournament approaching this Saturday, I decided to do a quick update.

Who can win at this stage?
Wales, England, or Ireland can still win.  Scotland, France and Italy do not have enough points at this stage to win.  Quite a good article from the London Evening Standard explains the detail.  The current league table is below.

Actual standings after round 4 out of 5

Who is playing who in the final round?

What is the simulation model based upon?
The model takes a random sample from a probability mass function for tries, conversions and penalties, and combines it with a win probability (pwin) for each team, calculated from the RugbyPass Index for both home and away teams.  If you want to know more, feel free to look at my previous post (linked above) or the R script (linked at the bottom).

What does the simulated league table look like after the final round?
Running a simulation for the final three games, and adding these results on to the actual points each team has achieved after round 4, we get the distribution of league points shown below.

Apologies: a box plot can be a bit odd for discrete data such as this.  Please forgive me!  If I had the time I would reform this into something like a stacked histogram which would be more accurate 🙂

It should be noted that, whilst the ‘standard’ scoring scheme applies for these final matches, i.e.

  • 4pt for a win, 2pt for a draw, 0pt for a loss.
  • plus 1 bonus pt for scoring 4 tries or more, regardless of win/lose/draw.
  • plus 1 bonus pt if a team has lost but by 7 game points or less.

…there are also 3 additional points awarded if a team wins the ‘Grand Slam’ (wins all of their matches).  Wales are the only candidate for this: they have so far won every match, and if they win their final match they get these extra points, which ensures they win the tournament.

This rule avoids the situation where a team could lose one match but obtain maximum bonus points in the others, finishing up with more points overall than a team that has won every match but never obtained any bonus points.
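The scoring scheme above is easy to get slightly wrong, so here is a minimal Python sketch of it. It is not the R script behind this post, and the function names `match_points` and `final_league_points` are hypothetical.

```python
GRAND_SLAM_BONUS = 3  # extra league points for winning every match


def match_points(points_for, points_against, tries_for):
    """League points one team earns from a single match."""
    if points_for > points_against:
        league_points = 4
    elif points_for == points_against:
        league_points = 2
    else:
        league_points = 0
    if tries_for >= 4:  # try bonus, regardless of the result
        league_points += 1
    if 0 < points_against - points_for <= 7:  # losing bonus
        league_points += 1
    return league_points


def final_league_points(points_after_round_4, final_match_points, won_all_matches):
    """Add a simulated final-round result to the actual points so far."""
    bonus = GRAND_SLAM_BONUS if won_all_matches else 0
    return points_after_round_4 + final_match_points + bonus
```

A team that loses by 12 game points with three tries therefore picks up nothing, while a team that loses by 5 having scored four tries still takes 2 league points.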

So then, what are the final standings likely to look like?
After having run a simulation of the final round, the results are below.

 

Wales are “firm favourites” to win the tournament.  England have a “reasonable chance”.  Ireland retain an “outside chance”.

How does all of this compare to expectations before the start of the tournament?


Ireland, England, and Wales were predicted to be in close contention.  Wales have outperformed the prediction (mainly due to beating England).  England have outperformed the prediction (mainly due to beating Ireland, and due to amassing a lot of bonus points).  Ireland have underperformed against the prediction (mainly due to losing to England, and then narrowly missing out on bonus points: they scored only 3 tries against Scotland and lost to England by only 12 game pt).

Scotland beat Italy with a bonus point victory, but they have only managed to pick up one bonus point in their other games.  Picking up points against England in their final match will be tough, so they are likely to underperform.  France will likely beat Italy and perform roughly as expected.  Italy look set to finish bottom again this year, as predicted (however imho they could be a team to watch in their final match, as they’ll be playing a presently disorientated France, at home in Rome).

It has been an interesting journey for me simulating sports tournaments over the past few months.  Monte Carlo approaches can help you see the wood for the trees in complex situations, which has applications not just in sport but in industry as well.

Maybe this has inspired you to have a go yourself?  If so, the code for this blog post is available via Git here.  Although if you wish to have a play or to adopt the code, the original version is much cleaner, available here.  Good luck!