# Interesting Statistics Papers: Hidden Markov Models in Tennis

I’ve been taking a computational biology class where we use Hidden Markov Models to predict things like whether part of a DNA sequence is a promoter, coding region, poly-A, or non-coding (intergenic) region. You can read a little more about HMMs here. They also have big applications in finance, but the paper I’m discussing here is more closely related to the biological models.

For DNA, the basic idea is that each of the 4 states (promoter, coding region, poly-A, intergenic) can emit any of the nucleotides A, T, G, C. So the HMM takes a DNA sequence (for example, x = AACGTAGCC…) and, for each letter, tries to predict whether it belongs to a promoter, coding, poly-A, or intergenic region.

I’ve been thinking about how these types of models can be applied to sports. For example, can we tell if a team is on a hot streak or in a slump based on in-game play or game-to-game performance?

http://xm2vtsdb.ee.surrey.ac.uk/CVSSP/Publications/papers/kolonias-sspr04.pdf

This paper by Kolonias et al. uses hidden Markov models to predict who will win a tennis point based on the events that happen within each point. The big idea is to figure out which moves a tennis player can make, in terms of serving or returning balls, that will increase their probability of scoring (or decrease their opponent’s).

They do this by defining 5 major states within their HMM: 1st serve, 2nd serve, ace, rally, and point scored, where the first 4 each contain sub-Markov models.

The main transitions are determined by the rules of tennis as follows. After a serve (1st or 2nd), you can either fault and serve again, score on the serve (an ace), have the serve returned to start a rally, or double fault and have a point scored against you. From the 1st serve state you can transition to any of the other states (2nd serve, ace, rally, point), while from the 2nd serve state you can’t transition back to the 1st serve. The ace and rally states always end with a point being scored.
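The top-level transition structure described above can be sketched in a few lines of code. Here is a minimal Python sketch; the state names are my own, and the transition probabilities are illustrative placeholders, not the values the paper estimates from data:

```python
# Top-level states of the tennis HMM described above.
states = ["serve1", "serve2", "ace", "rally", "point"]

# Allowed transitions per the rules of tennis; probabilities are placeholders.
transitions = {
    "serve1": {"serve2": 0.35, "ace": 0.10, "rally": 0.50, "point": 0.05},
    "serve2": {"ace": 0.05, "rally": 0.80, "point": 0.15},  # cannot go back to serve1
    "ace":    {"point": 1.0},   # an ace always ends with a point scored
    "rally":  {"point": 1.0},   # a rally always ends with a point scored
    "point":  {},               # absorbing state: the point is over
}

# Sanity check: every non-absorbing row of probabilities sums to 1.
for state, row in transitions.items():
    assert state == "point" or abs(sum(row.values()) - 1.0) < 1e-9
```

The zero-probability transitions (e.g. 2nd serve back to 1st serve) are exactly where the rules of tennis constrain the model before any data are seen.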

The real wow factor of this model comes in the sub-Markov models.

Let’s take a look at the graphic for the Rally sub model shown in the paper:

The ball can rally back and forth until an unforced error is made or a point is forced. This could certainly be enriched by adding the direction of each shot or how hard it was hit back.

I’m interested in how this type of model would work for predicting whether or not a point is scored in basketball. For example, how many passes are too many, and how many lead to a high scoring probability? How effective are ball screens followed by a pass? How much do different amounts of spacing affect scoring in several common plays (give-and-go, hero ball, etc.)?

These are questions that could be answered using this type of model.

Anyways, I thought this was a neat idea and it was nice to see Hidden Markov Models outside of the human genome.

If you liked this check out some of my other posts!

-Andrew G Chapple

# Quick Tutorial for a Reversible Jump MCMC sampler

The following slides are a presentation that I made for a class on constructing a reversible jump Markov chain Monte Carlo sampler, which everyone seems to be terrified of for some reason. All we really have to do is be careful that our storage matrix is formatted to keep track of the model dimension; storing that dimension in an additional variable is recommended.

If you would like this presentation for your class please email me!

If you liked this, check out some of my other posts!

-Andrew G. Chapple

# Can’t we all use the same Parameterizations of Distributions?

This is something that’s been bothering me for a while in statistics. The exponential density, for instance, is written as either

$f(x|\lambda)= \lambda e^{-\lambda x} \hspace{3mm} \text{or} \hspace{3mm} f(x|\beta)=\frac{1}{\beta} e^{-x/\beta}$

depending on whom you ask. Obviously the two are equivalent with $\lambda = \frac{1}{\beta}$, but discrepancies in other distributions’ parameterizations have wide-ranging effects in statistics.

I’ll be the first to admit that I’ve accidentally used the wrong parameterization several times for the gamma functions [rgamma(), dgamma(), etc.] in R. They use the parameterization on the right:

$f(x|\alpha,\beta)= \frac{1}{\Gamma(\alpha) \beta^\alpha } x^{\alpha-1} e^{-x/\beta} \hspace{3mm} \text{or} \hspace{3mm} f(x|\alpha,\theta)=\frac{\theta^\alpha}{\Gamma(\alpha)} x^{\alpha -1} e^{-x \theta}$

but the left parameterization is the way the Gamma distribution is introduced in several textbooks.
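To see that the two forms agree once $\theta = 1/\beta$, here is a quick Python check (scipy’s `gamma` happens to use a shape/scale parameterization, so it plays the role of the left form; the variable names are mine):

```python
import math
from scipy import stats

alpha, beta = 2.0, 3.0      # shape and SCALE (the "left" parameterization)
theta = 1.0 / beta          # the equivalent RATE (the "right" parameterization)
x = 1.7

# scipy.stats.gamma takes a shape `a` and a `scale`:
left = stats.gamma.pdf(x, a=alpha, scale=beta)

# Writing out the rate-parameterized density directly:
right = theta**alpha / math.gamma(alpha) * x**(alpha - 1) * math.exp(-x * theta)

assert abs(left - right) < 1e-12   # identical densities once theta = 1/beta
```

The danger is precisely that both calls run without error; only the numbers coming out of your MCMC tell you that you fed a scale where a rate was expected.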

Luckily, I noticed that something was going majorly awry in my Bayesian MCMC when I used rgamma() intending the left parameterization, and I corrected to the right one. But can you imagine how many people have accidentally used the wrong density function in their analyses?

Wikipedia gives a pretty good summary of some of the reasons for the different parameterizations. I’ve tended to stick to the parameterization with $\theta$, but I learned the $\beta$ parameterization from the classic Casella and Berger. We know that $\text{Gamma}(p/2,2) \sim \chi_p^2$ under the $\beta$ parameterization, but it is $\text{Gamma}(p/2,1/2)$ under the alternative. Is that much of a difference? I don’t think so.

As a Bayesian who uses CRAN a lot (I’m sure many of you also use dgamma), I say we should just consolidate to the parameterization with $(\alpha, \theta)$ instead of $(\alpha,\beta)$. The one thing still lacking is the relationship between the Gamma and Weibull distributions.

But then again, why don’t the Weibull people (reliability and survival gurus), instead of using this parameterization

$f(x| k, \lambda) = \frac{k}{\lambda} \left(\frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}$

use this one

$f(x| k, \theta) = k \theta^k x^{k-1} e^{-\theta^k x^k}$

along with the $(\alpha, \theta)$ parameterization of the Gamma distribution? (Also, PLEASE don’t use this one: $f(x|b,k)=bkx^{k-1} \exp [-bx^k]$.) It also doesn’t help that, of the 100 or so survival models available on CRAN, all three of these parameterizations show up when a Weibull hazard is desired.
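The two Weibull forms above are related by $\theta = 1/\lambda$, which a few lines of Python confirm (the function names are mine, written just for this check):

```python
import math

def weibull_scale(x, k, lam):
    """Density with shape k and scale lambda (the first form above)."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_rate(x, k, theta):
    """Density with shape k and rate-style parameter theta (the second form)."""
    return k * theta**k * x ** (k - 1) * math.exp(-(theta**k) * (x**k))

k, lam = 1.5, 2.0
theta = 1.0 / lam           # the two forms coincide when theta = 1/lambda
x = 0.9
assert abs(weibull_scale(x, k, lam) - weibull_rate(x, k, theta)) < 1e-12
```

Setting $k=1$ in either form recovers the exponential density, which is the functional relationship we would like to keep across the whole family.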

So can we all consolidate and agree to use the same parameterizations? (i.e., let’s all use $(\alpha, \theta)$ for the Gamma, $\lambda$ for the exponential, and $(k,\theta)$ for the Weibull distributions.)

Then we’ll still have a functional relationship between the Gamma, Weibull, Chi-Squared, and Exponential distributions, but we won’t have to deal with those silly denominators or, God forbid, having to specify a rate parameter in CRAN functions instead of using the default.

Thanks.

-Andrew G Chapple

Check out some of my other posts if you enjoyed this one.

# Fun with Data: NFL playoff predictions with a simple who-beat-who model

With the playoffs coming up, people are throwing their hats into the prediction ring. Most broadcasters just pick teams based on recent performance, matchups, and who’s at home. As a side note, I should point out that these picks are based on internal models. While they might not be statistical in the sense of using concrete data (some probably do use raw numbers), commentators still look at who-beat-who, bye weeks, recent streaks, and the “eye test,” and internally weight each factor in some way that produces a prediction.

Other outlets are statistical in the usual sense. FiveThirtyEight uses Elo, which produces a rating estimating how good each team is; they use these ratings to project win probabilities for any two teams.

One model that I like, and that’s simple (hence everyone can understand it), is the Bradley-Terry model. This model, in its simplest form, takes who-beat-who data and produces prediction probabilities. It has been popular in tennis predictions.

Think of it this way:

My Saints beat the Seahawks this season, and the Seahawks beat the AFC regular-season champion Patriots. While my Saints have largely been a disappointment this season, the fact that they beat a quality opponent should show that they at least have the potential to play up to high-level teams.

Likewise, the model gives the Saints a slightly higher probability of beating the Patriots head to head than they would have had if the Seahawks had lost to the Pats.

The Bradley-Terry model (which can be extended to incorporate random effects like who is at home, etc.) looks at the results of all 256 regular-season games and uses this information to get predictions. Essentially, the model for the probability that team $j$ beats team $k$ is:

$P[j \text{ beats } k] = \frac{\exp[\lambda_j - \lambda_k]}{1+\exp[\lambda_j - \lambda_k]}$

and the coefficients $\lambda_j, \lambda_k$ are estimated through maximum likelihood, i.e. choosing $\lambda_1 , \lambda_2 , ... , \lambda_{31}$ (one team, Arizona, is treated as the baseline team to compare to) to maximize the following binomial likelihood

$L(Data | \lambda_1, ..., \lambda_{31}) = \prod \limits_{k=0}^{31} \prod \limits_{j \ne k} \left[\frac{\exp[\lambda_j - \lambda_k]}{1+\exp[\lambda_j - \lambda_k]} \right]^{n_{jk}} \left[1-\frac{\exp[\lambda_j - \lambda_k]}{1+\exp[\lambda_j - \lambda_k]} \right]^{N_{jk}-n_{jk}}$

where $\lambda_0 = 0$, $N_{jk}$ is the number of times that teams $j$ and $k$ play and $n_{jk}$ is the number of times that team $j$ beats team $k$.
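As a minimal sketch of this maximum likelihood fit, here is a Python version on a toy 4-team dataset (the win/loss counts here are made up for illustration, not the NFL results, and the helper names are mine):

```python
import numpy as np
from scipy.optimize import minimize

# Toy who-beat-who data for 4 teams: N[j, k] = games between j and k,
# n[j, k] = wins of team j over team k (so n[j, k] + n[k, j] = N[j, k]).
N = np.array([[0, 2, 2, 0],
              [2, 0, 2, 2],
              [2, 2, 0, 2],
              [0, 2, 2, 0]])
n = np.array([[0, 1, 0, 0],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [0, 0, 1, 0]])

def neg_log_lik(free):
    lam = np.concatenate([[0.0], free])       # lambda_0 = 0 is the baseline team
    ll = 0.0
    for j in range(4):
        for k in range(4):
            if j != k and N[j, k] > 0:
                p_jk = 1.0 / (1.0 + np.exp(-(lam[j] - lam[k])))  # P(j beats k)
                ll += n[j, k] * np.log(p_jk)  # each game counted once, in the winner's cell
    return -ll

fit = minimize(neg_log_lik, np.zeros(3), method="BFGS")
lam_hat = np.concatenate([[0.0], fit.x])

# Team 2 beat team 0 both times they played, so the fit should favor it:
p_2_beats_0 = 1.0 / (1.0 + np.exp(-(lam_hat[2] - lam_hat[0])))
```

Fitted strengths come out on the logit scale, so any two teams can be compared through the logistic formula above even if they never played each other.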

Some drawbacks: one issue is that every NFL team plays ONLY 13 of the 31 other teams (~42%). This limits the information available, but it also highlights the model’s reliance on third-party comparisons, i.e. looking at game results against a common opponent of two teams. It also means we are unlikely to get extreme victory probabilities (say >0.8); if the season were as long as, say, the NBA’s, we could get more decisive predictions.

In college football, teams play 11 of 347 total Division I schools (~3%), so the Bradley-Terry model is probably not useful there. [Research extension here!]

The last thing I’ll mention is that since the NFL splits into two conferences for the playoffs, we have much more information within each conference. Teams play 11 of the 15 other teams in their conference (~73%), and every two teams in a conference share at least one common opponent.

Enough with the model discussion. The code I used is found here (I used the logit transformation). I used these Bradley-Terry fits to get predicted win probabilities and then simulated the entire playoffs many times via Monte Carlo. For a look at some of the model probabilities, here are the wild card games with the home team’s probability of winning in parentheses.

NFC Wild Card: NYG @ GB (48.81%), DET @ SEA (51.79%)

AFC Wild Card: Oak @ Hou (46.2%), Mia @ Pit (51.43%)

Given the parity in the NFL, it’s probably not surprising that these probabilities are all pretty close to 50%. There are extensions to the Bradley-Terry model that can incorporate extra variables, like who’s at home or certain opposing team statistics, which would move these probabilities further from 50%. However, for simplicity I avoided these (maybe next year, depending on how these predictions go!). I used all of these probabilities to simulate the entire playoffs 1,000,000 times, keeping in mind that the Patriots and the Cowboys play the lowest advancing seed (3 potential opponents in round 2). I kept track of how many games each team won, who won the AFC/NFC, and who won the Super Bowl.
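The simulation loop itself is simple. Here is a toy Python version for a single 4-team mini-bracket, with placeholder win probabilities rather than the fitted Bradley-Terry values from the post:

```python
import random

random.seed(1)

# Placeholder probabilities that the first-listed team wins (NOT the fitted values).
p_win = {
    ("GB", "NYG"): 0.49, ("SEA", "DET"): 0.52,
    ("GB", "SEA"): 0.55, ("GB", "DET"): 0.55,
    ("NYG", "SEA"): 0.50, ("NYG", "DET"): 0.50,
}

def play(a, b):
    """Simulate one game between a and b; return the winner."""
    p = p_win[(a, b)] if (a, b) in p_win else 1.0 - p_win[(b, a)]
    return a if random.random() < p else b

counts = {}
for _ in range(100_000):
    champ = play(play("GB", "NYG"), play("SEA", "DET"))  # two semis, then a final
    counts[champ] = counts.get(champ, 0) + 1
# counts / 100_000 estimates each team's chance of winning this mini-bracket.
```

The full-playoff version is the same idea with the real bracket logic (byes, reseeding) layered on top of the fitted pairwise probabilities.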

The Big Kicker

Nothing too surprising here. The Patriots, Cowboys, Chiefs, and Falcons won the Super Bowl most often in the simulations, but they also had the most regular-season wins. They also have the luxury of a bye, so they don’t risk playing an extra game. 65% of Super Bowl champions have had byes, and these simulations gave a bye team a 55% chance of winning. I should also note that the Raiders are projected to appear in the Super Bowl most often among the non-bye teams, which does not incorporate the fact that their quarterback Derek Carr has a broken fibula.

The Texans, sadly, had the worst chances at winning the big dance.

Anyways, I just wanted to apply this simple but cool model to see how it predicted the playoffs. I plan to do this for the NBA, as long as the NBA has an Excel spreadsheet of wins (the hardest part of this was putting the data together). I could also use additional variables like home-field advantage, yards/game, etc. to move these probabilities away from 0.5; those coefficients would be estimated as fixed effects in each probability function.

I’m hoping these sims are wrong and the Texans pull off a miracle.

If you enjoyed this post, take a look at some of my other Adventures in Statistics.

-Andrew G Chapple

# Fun with Data: Was it Osweiler’s Time to Go?

On Sunday the Houston Texans benched their new 72-million-dollar quarterback Brock Osweiler in favor of the unproven (and not nearly as highly paid) Tom Savage. Since the Texans are actually still legitimately in the hunt for the playoffs, unlike my favorite squad, the New Orleans Saints, I decided to check for myself: did Osweiler really deserve to be benched?

Before jumping into any additional insight I can provide, let’s just look at the raw season numbers on the ESPN leaderboard.

• Osweiler is 30th out of the 31 qualified quarterbacks in QBR, with an average rating of 71.64.
• Osweiler is 29th of 31 quarterbacks in average passing yards per game, with a paltry 193/game.
• Osweiler is 27th of 31 quarterbacks in completion percentage, completing only 59.6% of his passes. Only 4 quarterbacks have completion percentages below 60%.
• Osweiler is tied with Blake Bortles for second in interceptions, with 16 on the season.
• Osweiler is tied for 23rd in TDs with 14 – 2 fewer than the number of interceptions he’s thrown.

So at first glance at the season statistics, you can see that yes, the Texans made the right move. Statistically, his season is among the 3-5 worst of all 2016 quarterbacks.

However, this doesn’t necessarily mean that the Texans can’t be successful in spite of Osweiler; look at Peyton Manning last year, for example. He finished with a QBR of 67.9 and a 9-17 TD-INT ratio in 10 games, but he led the Broncos to an 8-2 record in those games. Compare that to Osweiler’s 6-4 record through his first 10 games, with a 68.53 QBR and a TD-INT ratio of 9-10.

One might wonder: did the Texans win 8 games in spite of Osweiler?

Let’s take a look at his stats in wins. In wins, including the game in which he was benched, Osweiler has a QBR of 73.1 and averaged 187.75 yards/game.

This is shocking because if you only consider Texans wins:

He would still rank 30th out of 31 qualified quarterbacks in QBR and Yards/GM !


Additionally, his TD-INT ratio in these games was 10-10. So basically, even in wins, Osweiler played poorly. Osweiler’s best win came against the Colts in week 6, when he had a QBR of 90.7 (out of a maximum possible 158.3), but this came against a Colts defense that has given up the 6th-most passing yards and is tied for the 3rd-fewest interceptions in the NFL.

However, the trend also played a part in his benching, as evidenced in the graph below of his game-by-game QBR in wins and losses.

The blue line here represents the average passer rating (90.9) for all qualified QBs in the NFL. Osweiler only had one game above the league-average QBR (and two above the league median of 90.4). That game came in a loss in week 12, the Texans’ third loss in a row, and was followed by two sub-60 QBR, multi-INT games that somehow ended in wins. In his last 4 games, he recorded a QBR below 60 in 3 of them.

Osweiler threw more touchdowns than interceptions in only 4 of the 14 games he started this season. He never threw for more than 300 yards, topping out at 269 against the Colts (he threw for 268 against the Chiefs in week 2). But aside from the whole body of work being subpar, he performed even more poorly over the past 5 weeks, taking the Texans from 6-3 to 8-6 (and the two wins came in spite of him). The Texans were right to try something new.

Now I’m just hoping the Texans make the playoffs and somehow the Saints pull off a miracle and make it in.

If you liked this, check out some of my other Sports posts!

If you enjoyed this post, take a look at some of my other Adventures in Statistics.

-Andrew G Chapple

# The Excellent fits of Bayesian Piecewise Exponential Models

If you didn’t already know, one of my favorite models for time-to-event data is the piecewise exponential model for hazard functions. These models assume that there exists a partition of the time scale $0 = s_0 < s_1 < \cdots < s_{J+1}$ and that on each interval $(s_{j-1},s_{j})$ the hazard is a constant value $\lambda_j$ for $j=1,…,J+1$. The most famous use of this in the frequentist literature is the Cox proportional hazards model, which determines the locations $\mathbf{s}$ and heights $\lambda_j$ by maximizing the partial likelihood. If you fix $\mathbf{s}$, the maximum likelihood estimate of $\lambda_j$ is similar to the maximum likelihood estimator for the exponential distribution with censoring. Namely, it is $\hat{\lambda}_j=\frac{m_j}{\sum \limits_{Y_i>s_{j-1}} \left(\min(Y_i,s_j)-s_{j-1}\right)}$, where the $Y_i$ are the failure or censoring times of the observations and $m_j$ is the number of events in $(s_{j-1},s_j]$. However, as previously mentioned, this requires fixing $\mathbf{s}$ to some reasonable partition that has enough data to approximate the hazard in each interval and adequately models the true shape of the hazard.

Another issue is how we decide what value $J$ should be. We could set $J, \mathbf{s}, \mathbf{\lambda}$ to maximize the likelihood by trying out many different models. We could also allow $J$ to vary using reversible jump MCMC, as I have before. However, one issue with this approach is that when there is little data, or the data are not spread out much in time, the reversible jump tends to pick out only one or two pieces for the hazard. When you look at the actual fits to the data, these provide terrible approximations to the hazards and just pick out the means. So, if we are trying to get a good fit for a hazard, we should fix the number of split points at some reasonable value and let the locations and the heights on each interval vary. This number has to be a compromise between adequately modeling the shape of the hazard and ensuring that there is enough data in each interval to get a proper fit. My adviser always says that 3-5 pieces can accomplish this, and in fact they do quite a good job. If we have at least, say, 100 observations, using the 0.2 quantiles of the data is a good starting point for these split points, which places about 20 observations in each interval to start.
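As a quick check of the fixed-split-point MLE just described, here is a Python sketch on simulated, uncensored exponential data (my own toy setup), where the true hazard is constant at 0.5 everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=500)   # event times; true hazard = 1/2 everywhere

# Fix split points at the 0.2, 0.4, 0.6, 0.8 quantiles, as suggested above.
s = np.concatenate([[0.0], np.quantile(y, [0.2, 0.4, 0.6, 0.8]), [np.inf]])

lam_hat = []
for j in range(1, len(s)):
    at_risk = y > s[j - 1]                                      # alive entering interval j
    exposure = np.sum(np.minimum(y[at_risk], s[j]) - s[j - 1])  # time at risk in interval j
    m_j = np.sum((y > s[j - 1]) & (y <= s[j]))                  # events in interval j
    lam_hat.append(m_j / exposure)

# Each entry of lam_hat should land near the true constant hazard of 0.5.
```

With quantile-based split points, each interval gets the same number of events, which is exactly the "enough data per interval" requirement mentioned above.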

There are many different Bayesian models on how to model the hazard heights, but typically the following prior is assumed for the split point locations:

$\pi(\mathbf{s}|J) \propto \frac{(2J+1)! \prod \limits_{j=1}^{J+1} (s_j-s_{j-1})}{s_{J+1}^{2J+1}}$

which gives a low probability of having an interval with no events. In my experience this doesn’t always help, and we will sometimes get intervals with only a few events (say 2-3), so I always ensure that each interval contains at least 10 observations so that the hazard heights are estimated well. Now let’s take a look at one example of a prior on the heights $\mathbf{\lambda}$ that I find works rather well.

$\log(\lambda_1) \sim N(0,5^2) \hspace{3mm} \log(\lambda_{j}) \sim N(\log(\lambda_{j-1}),\sigma^2)$

for $j=2,..,J+1$, where $\sigma^2 \sim IG(a,b)$ or $\sigma \sim U[0,A]$, both of which are conjugate in the sampler. The prior standard deviation for $\log(\lambda_1)$ was chosen to be 5 to discourage hazard values that are too extreme (near 0 or very large), since this does not usually happen in practice. Anyways, I’ve gotten too verbose with this description, so let’s take a look at some hazard fits. Below are a few different hazard fits and the corresponding survival function fits, using 10,000 iterations with a 90% burn-in and about 100 patients.
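To make the prior concrete, here is a small Python sketch that draws one realization of the hazard heights from the random-walk prior above (here $\sigma$ is fixed purely for illustration, whereas the model above puts a hyperprior on it):

```python
import numpy as np

rng = np.random.default_rng(42)

J = 4                 # number of split points, so J + 1 hazard pieces
sigma = 0.5           # random-walk sd, fixed here for illustration only

log_lam = np.empty(J + 1)
log_lam[0] = rng.normal(0.0, 5.0)                   # log(lambda_1) ~ N(0, 5^2)
for j in range(1, J + 1):
    log_lam[j] = rng.normal(log_lam[j - 1], sigma)  # log(lambda_j) ~ N(log(lambda_{j-1}), sigma^2)

lam = np.exp(log_lam)   # hazard heights, positive by construction
```

Working on the log scale keeps every height positive and makes neighboring pieces similar a priori, which is what smooths the fitted hazard across intervals.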

Let’s take a look at a few lognormal distributions:

Here are fits for two different Weibull distributions:

And Survival Distribution fits:

As you can see, the piecewise exponential model does a pretty good job approximating both the hazard and the survival curves for some standard time-to-event distributions. Let’s see how well it does for a mixture distribution. Below are the fits for a 0.4*Weibull(3,1) + 0.6*Lognormal(-1,2) mixture.

The piecewise exponential model can even pick out multiple modes! Let’s see how this hazard approximation translates to survival approximation.

It’s able to pick out the shape pretty well, with the bend happening around 2 months.

So as you can see, the piecewise exponential model, if used correctly, can give a good approximation to many different time-to-event hazards/survival curves. Even better, it’s easily interpretable!

If you enjoyed this post, take a look at some of my other Adventures in Statistics.

-Andrew G Chapple

# Fun with Data: A Monte Carlo Simulation of the 2016 Presidential Election

Unless you’ve been living under a rock for the past few months, you know that tomorrow marks the end (hopefully) of a highly contentious and divisive election. You’ve probably seen several different forecasts using different statistical models, including 538, Jay DeSart, and the New York Times. You might have even seen the Twitter fight between Nate Silver and the wonderfully unbiased people at the Huffington Post. All of these models make different assumptions, and 538 even has two different models: one that incorporates historical data and one that just uses polls.

So I’ve decided to throw my hat into the prediction game, using a very simple Monte Carlo simulation.  But just like the other methods, I have to outline my assumptions:

1. Recent Polls may not be entirely accurate, but they are at least a decent estimate of the true voting proportions in each state.
2. Given recent polls, the results in each state are INDEPENDENT. Many of these models do not have this assumption and they use historical trends to predict blocks of states: i.e. A president who won Georgia always won South Carolina etc. I think that these trends are reflected in the polling data.
3. Independents will not win any states. I’m actually planning to vote for Gary Johnson tomorrow, but I don’t realistically expect him, nor Stein nor McMullin, to take any states. Hope that doesn’t burst any bubbles.

So here’s the strategy. I take, at most, the 4 most recent polls of likely voters, which I got from 538’s website in the state-by-state predictions. For the most part, these are polls taken after October 26th, but in a few cases, like Maine’s district 1 and 3 polls, I had to use the only poll available, which was from September.

For each simulation, in each state I randomly draw one of these polls, normalize the Clinton and Trump voting percentages by dividing by the total proportion of voters who would vote for either major candidate, and draw a Bernoulli random variable with P(Clinton wins) = (Clinton %)/(Clinton % + Trump %). I do this based on my assumption that no independents will win any states. Then I assign the state’s electoral votes to the winner.
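Here is a Python sketch of one such simulation on a three-state toy map. The poll numbers are hypothetical placeholders, not the polls I actually used; the electoral-vote counts are the real 2016 values for these states:

```python
import random

random.seed(2016)

# Hypothetical (clinton %, trump %) pairs standing in for the 4 most recent polls.
polls = {
    "FL": [(46, 47), (48, 46), (45, 46), (47, 47)],
    "PA": [(48, 44), (47, 46), (46, 44), (49, 45)],
    "OH": [(43, 48), (44, 46), (42, 47), (45, 46)],
}
electoral_votes = {"FL": 29, "PA": 20, "OH": 18}

def simulate_once():
    clinton_total = 0
    for state, state_polls in polls.items():
        c, t = random.choice(state_polls)     # draw one recent poll at random
        p_clinton = c / (c + t)               # normalize to the two-party share
        if random.random() < p_clinton:       # Bernoulli draw for the state's winner
            clinton_total += electoral_votes[state]
    return clinton_total

totals = [simulate_once() for _ in range(100_000)]
# `totals` is the toy analogue of the electoral-vote totals tallied across simulations.
```

The full 50-state version is identical in structure; only the poll table and the electoral-vote table grow.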

I did this 100,000 times and got the following histogram of electoral vote totals for Clinton.

Here the blue line represents the number of electoral votes required to win: 270. 59.624% of these simulations had Hillary Clinton reaching the needed 270 votes, while only 38.15% had Donald Trump reaching 270 electoral votes. This leaves 2.226% of the simulations in which the election would go to the House of Representatives for the final decision, which has only happened once.

On average, Clinton received 281.81 electoral votes and Trump received 256.19. The medians were similar, with Clinton receiving a median of 283 electoral votes and Trump 255. The 95% interval from this sampling procedure was (183, 376) for Clinton and (162, 355) for Trump.

It’s also worth noting that, based on these simulations, Hillary Clinton has a substantial 18.463% chance of breaking the record for the largest share of the electoral college: 60.8%.

Anyways, if I had to place a bet based on this and other predictions, I’m taking Hillary Clinton as the winner with 283 electoral votes – i.e., I think it’s going to be a lot closer than most people (except 538) are predicting.

Remember: “All models are wrong, but some are useful.” Before I get berated by HuffPo like they berated Nate Silver for his conservative prediction, I just want to remind everyone that this:

is also a model, based on one monkey’s prophetic foresight.

I’m inclined to believe that predictions like this one and 538’s are more accurate than the monkey’s, but even the great Nate Silver has been way off before.

If you enjoyed this post, take a look at some of my other Adventures in Statistics.

If you enjoyed this post check out my five “Fun with Congressional Data” series, with the links and descriptions posted below.

-Andrew G Chapple