[latexpage]
In the article posted 4 weeks ago about the dark side of the tennis world I talked about how vulnerable the tennis sport is to match fixing. Bets can even be placed on every single point. However, is there a way to predict the outcome of matches and maybe even points? There are multiple articles that have tried to solve this problem. Many of these articles mention Markov models as I explained earlier. Now I will be digging deeper into this subject and try to predict a match between Rafael Nadal and Roger Federer.
Markov model
Every tennis match consists of sets, which then consist of games, which in turn consist of points. We can therefore make three different, but very similar, Markov models. To be even more precise, we will include tiebreaks in our model. The first Markov model represents winning a game. In every game a player must win four points to win a game. In addition, when the score is 40-40, a game is won with a two point difference. This is called deuce. As I have shown a few weeks back, such a model would look like this.
Note that a score of 30-30 is mathematically basically the same as deuce. We denoted the probability that the player serving wins the point by $p$. So losing the point will have a probability of $1-p$. This is because a point is always won by someone, so the probability of a point being won is 1. Moreover, we make the assumption that every point in a tennis game is iid (independent and identically distributed). This is not always true as some articles have shown, but these are very small deviations which we will ignore for simplicity.
The Markov model for a set, however, is a little bit more complicated. Each player takes turns in service after each game that is played. This will be important to remember for the remainder of the article. A player can win a game on his service game, but can also be “broken” by his opponent. Let the probability of winning a game on your own service be denoted by $p_g$. As the model for a game, a set is eventually won by someone. Therefore, the probability of this happening is again 1. So the probability of losing a game on your service will be denoted by $1-p_g$. Now the first game is played and it is your opponent’s turn to serve. Assuming the skills of the two players are different, the probability that the opponent wins a game is different than $p_g$. So we let the probability of the opponent winning a game be denoted by $q_g$. Therefore, the probability that the opponent loses a game will be denoted by $1-q_g$. After this game the other player will be serving again until someone wins the set. Graphically shown:
This is what a Markov model of a set would look like. $p_t$ and $q_t$ denote the probability of a player winning a tiebreak. Only if the score is 6-6 in games a tiebreak will be played. There is also a possibility to make a model out of this, but this one is really similar to the model of a game, so we will skip this to keep on track. Moreover, in a very special case, when it is 2-2 in sets and 6-6 in games, players must play further until there is a two game difference. Only then will a player win a match. However, because this phenomenon is very rare, we will assume that this rule does not exist. So there is always a tiebreak at 6-6.
At last, someone must win a match. Let us assume we are playing a best-of-five tennis match, that is when you have to win at least three sets to win a match. Let the probability of a player winning a set be denoted by $p_s$. Then losing a set will have a probability of $1-p_s$ as there will always be a set that must be won by either of the two players. This model will look a lot like the Markov model for a game, however it is slightly different since we cannot have a so called deuce in sets. This is the Markov model for a match:
As we can observe it looks a lot like a game. The big difference is that at 2-2, a final set is played which will determine the outcome of the match.
Deterministic Markov chains
The Markov models described above are called Markov chains. In very common terms, a Markov chain models the state of a system of a random variable that changes through time. In our case, it models the probability of winning a point, game or set. Hence, it is actually possible to come up with a formula that determines the probability of winning a game, set or match. The model takes three input parameters: $p_A$, $q_B$ and the score. Where $p_A$ and $q_B$ are the probabilities of player A or B winning a point on their serve. We assume that these probabilities are constant throughout the match to keep the model somewhat simplistic.
Let $(a,b)$ be the current score for player A and B respectively. For example, $(30,15) = 2-1$ and $(40-0) = 3-0$. Then by definition of a Markov chain, the state of the sore $(a,b)$ changes into $(a+1,b)$ with probability $p_A$. Therefore the transition from $(a,b)$ to $(a,b+1)$ occurs with probability $q_A = 1 – p_A$. By letting $P_A$ be the probability of player A winning a game on his serve, we can describe this in the form of a recursion formula. When the score is $(a,b)$ we have:
\begin{equation}
P_A(a,b) = p_AP_A(a+1,b) + (1 – p_A)P_A(a,b+1)
\end{equation}
Where the boundaries are determined as:
\begin{equation}
P_A(a,b) = 1 \Leftrightarrow a =4, b \leq 2
\end{equation}
\begin{equation}
P_A(a,b) = 0 \Leftrightarrow a \leq 2, b = 4
\end{equation}
The logic behind the boundaries is as follows. The probability when player A wins the game occurs only if the player scored four points whereas his opponent scores less than 3 points, since a game is only won by a two point difference.
To go even further we can actually alternate this formulas a little bit to determine the probability of winning a set and a match. The formulas are very similar, since the Markov models share the same principle: you either lose or win the game, set or match with a constant probability $p_A$. Assuming player A begins with serving in the first set, we can make the same recursion formulas:
\begin{equation}
P_{set}(a,b) = p_A^{game}P_{set}(a+1,b) + (1 – p_A^{game})P_{set}(a,b+1)
\end{equation}
\begin{equation}
P_{set}(a,b) = p_B^{game}P_{set}(a+1,b) + (1 – p_B^{game})P_{set}(a,b+1)
\end{equation}
The first equation is for even $a+b$, since player A begins the set serving he will only serve on even games. The second equation is therefore for odd $a+b$, where $p_A^{game}$ is the probability of player A winning a game on his serve from score $(0,0)$. The same goes for player B. $P_{set}(a,b)$ is then the probability of player A winning the set from score $(a,b)$ while serving first. The boundaries for these formulas are also a little bit different:
\begin{equation}
P_{set}(a,b) = 1 \Leftrightarrow a \geq 6, a-b \leq 2
\end{equation}
\begin{equation}
P_{set}(a,b) = 0 \Leftrightarrow b \geq 6, b-a \leq 2
\end{equation}
\begin{equation}
P_{set}(6,6) = p_{tiebreak}
\end{equation}
Where $p_{tiebreak}$ is the probability of player A winning a tiebreak from score $(0,0)$ while serving first. The other boundaries almost follow the same reasoning as before. A set can only be won when a player won 6 or more games given that the difference in games is always two. Only the tiebreak, at 6-6 in games, is an exception.
Now we can come up with the recursion formulas for player A winning the match from score $(a,b)$:
\begin{equation}
P_{match}(a,b) = p_A^{set}P_{match}(a+1,b) + (1 – p_A^{set})P_{match}(a,b+1)
\end{equation}
\begin{equation}
P_{match}(a,b) = p_B^{set}P_{match}(a+1,b) + (1 – p_B^{set})P_{match}(a,b+1)
\end{equation}
Where the boundaries are:
\begin{equation}
P_{match}(3,b) = 1 \Leftrightarrow b \leq 2
\end{equation}
\begin{equation}
P_{match}(a,3) = 0 \Leftrightarrow a \leq 2
\end{equation}
\begin{equation}
P_{match}(2,2) = p_A^{set}
\end{equation}
Again the first equation is for even $a+b$ and the second for odd $a+b$. If the match is at 2-2 in sets, one deciding set is played. As $P_{match}(a,b)$ is the probability of player A winning the set, we have that at $P_{match}(2,2)$ must equal $p_A^{set}$.
Important variables
In this sport there are a lot of variables to consider that will determine the outcome of a match. Luckily the ATP, which is the Association of men’s Tennis Professionals, displays the statistics of every match and player freely on their site. With this data we can actually use our econometric knowledge to do some regressions and determine which variables are statistically significant that determine the outcome of a match. To determine which variables were of utmost importance for Nadal to win from Federer, a basic regression model could be used to determine this:
\begin{equation}
\mathbf{y} = \boldsymbol{\alpha}+ \boldsymbol{\beta} \mathbf{x^T} + \boldsymbol{\epsilon}$
\end{equation}
Where $\mathbf{y}$ is the outcome of a match, $\boldsymbol{\alpha}$ is a constant for a specific match, $\boldsymbol{\beta}$ is the coefficient, $\mathbf{x^T}$ are the different variables we want to test and $\boldsymbol{\epsilon}$ is a residual. However, note that our $\mathbf{y}$ consists of binary numbers only, since either Nadal loses which will result in a 0, or Nadal wins which gives a 1. Therefore, we must use the logit model instead to determine the significance of our variables. This model will look like this:
\begin{equation}
P(Y=1|x_1, \dots , x_n) = \frac{exp(\beta_0 + \beta_1x_1 + \dots + \beta_nx_n)}{1 + exp(\beta_0 + \beta_1x_1 + \dots + \beta_nx_n)}
\end{equation}
Further defining:
\begin{equation}
logit(x) = log(\frac{x}{1 – x}) \Longrightarrow
\end{equation}
\begin{equation}
logit(P(Y=1|x_1, \dots , x_n)) =
\beta_0 + \beta_1x_1 + \dots + \beta_nx_n
\end{equation}
Where $P(Y=1|x_1, \dots ,x_n)$ denotes the probability that Nadal wins the match. Using this we can determine which variables are most significant to use.
After computing some regression analyses with the logit model, the conclusion was that there were four variables that were significant enough to determine the outcome of a match. One of the variables is the surface the match is played on. As every player has its own preference for a specific surface to play on, it is quite obvious that the preferred surface results in better performance of the player. Two other variables are related to one’s service. In tennis, having a strong and consistent service is very important to win a match. The analyses indicated that the percentage of points won on the first and the percentage of points won on the second serve are one of the most significant variables that determine the outcome of a match. At last, the percentage of return games won. This variable indicates how many games a player has “broken” his opponent in percentages. Recall that a “broken” player is someone who lost his service game, which is a game where the player is at serve.
Implementation
We have determined how the dynamics of a tennis match work using the Markov models. In addition, we also discovered which variables in a tennis match are statistically significant such that Nadal won. Past matches between these two players showed that the surface played a huge role in the outcome of the match. Nadal only won one time on grass against Federer, while Federer only won twice from Nadal on clay. However, in most of their matches, the player who served better usually won. Meaning that the player who had a higher percentage of points won on his first and second serve usually won the match. Thus we cannot just rely on one variable to determine some kind of probability who will win.
Using all this information, we can make a function in R that predicts the probabilities of winning the match of both players. The function is kept very simplistic to keep the main idea. Using the data that is provided by the site of the ATP we can determine what the probability could be when Federer and Nadal would play against each other now. We have four key variables: percentage of first serve points won, percentage of second serve won, percentage of return points won and the ranking of the player. The latter usually already determines the outcome of a match. Having a higher rank makes one the favorable player, hence the factor for this variable adding to the probability is largest. If the outcome of a match would only be determined by the ranking, it would look like this:
In the real model, this would of course be very different. We have multiple variables that have a big influence on the outcome of a match. For example, assume a match is only determined by serving. Than the three variables: percentage of first serve points won, percentage of second serve won, percentage of return points won, play the most important role in our function. Observe here that the function also gives the probability of player B, so the variables must also be known for player B. The function will then exist out of six variables, because we have three variables for player A and three for B. Eventually our function would look like this:
Where p and q are the percentages of points won on first serve for Nadal and Federer respectively; r and s are the percentages of points won on the second serve for Nadal and Federer respectively; dA and dB are the percentages of points won in return games for Nadal and Federer respectively; surf is the surface on which the match will be played. The numbers that are added to the probabilities are rough estimates according to the coefficients of our logit model and the statistical significance of the variables. It is very difficult to say what the exact probabilities will be as this requires a lot of data and Nadal and Federer only played 40 times against each other. However, these estimates can indicate a fair enough prediction of what the probabilities of winning will be. When using both players all time career stats we get the following results:
prob_winning_game(1, 3, 0.7195, 0.7739, 0.5745, 0.5686, 0.3360, 0.2672, “Grass”) = (0.55, 0.45).
prob_winning_game(1, 3, 0.7195, 0.7739, 0.5745, 0.5686, 0.3360, 0.2672, “Gravel”) = (0.75, 0.25).
So if the next Fedal match is played on grass, Nadal will have a probability of 55% to win the match and Federer will have a slight disadvantage with 45% chance to win. However, as expected, when the match is played on gravel Nadal will have a big advantage. He will have 75% chance of winning the match while Federer only has a 25% chance. As it is a predictive model of course, real life results could be different. However, when these two legends will play again against each other, it would be a match no one would want to miss.
Sources
- Bevc, M. (2015, 14 april). Predicting the Outcome of Tennis Matches From Point-by-Point Data.
- T. J. Barnett, Mathematical modelling in hierarchical games with specific reference to tennis. PhD thesis, Ph.D. Thesis, Swinburne University of Technology, Melbourne, Australia, 2006.
- F. J. Klaassen and J. R. Magnus, “Are points in tennis independent and identically distributed? evidence from a dynamic binary panel data model,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 500–509, 2001.
This article is written by Sam Ansari