Who should win NBA MVP?

By now I’m sure you’ve seen a case for the four big candidates: Russel Westbrook, James Harden, Lebron James and Kawhi Leonard or maybe just a breakdown of who should win. With the voting process ending today at 3PM eastern, I’m going to throw my own hat into the who-should-win sweepstakes.

Even though I reside in Houston, I’m going to try my best to be unbiased here and just look at the numbers.

First I’m going to look at something most analysts haven’t, statistics versus playoff teams. How well do these candidates play against the top competition? I downloaded the data from BasketBallReference.com and looked at the average statistics in those games against playoff teams, each player played at least 37 of these games.


This table should immediately disqualify James Harden based on his negative Real +/- statistics in these games, which measures how the point margin will change over 100 possessions when the player is in the game versus when they aren’t. It incorporates the teams defensive efficiency when that player is on the floor. When Harden was on the floor against playoff teams the average margin was -.3 points less than when he was off the floor. All other MVP candidates have a positive +/- stat against Playoff teams. Obviously since these are means, they are heavily influenced by especially poor performances but the medians tell the same story for James: Harden = 0, Leonard = 1, Lebron =2, Westbrook = 3. Harden’s defensive liability hurts him here. Westbrook had 6 poor games where he had a +/- of > -19.

Now that Harden is out, Lebron also gets the boot after having a losing record 12-15 since the All-Star break. You cannot collapse down the stretch like that.

Now we’re down to Westbrook versus Leonard.

As far as playoff teams, Leonard has a higher win percentage, a higher +/- mean (but not median), a higher average FG % and a lower turnover margin. Westbrook leads Leonard in the three major statistical categories by a pretty large margin in these games.

Obviously we know that Westbrook had the better season in terms of raw statistics throughout the season, averaging a triple double, leading the NBA in scoring and breaking the record for the most triple doubles in a season. It has been mentioned that his teammates helped Westbrook get these records. Leonard is the league’s best two way player on the second best team in the NBA while Westbrook’s Thunder finished 10th overall. But who’s more valuable to their team?

In games against playoff teams, when Westbrook had a +/- that was <0, they won only 5.26 % of those games compared to Leonard’s Spurs winning 31.25 % of their games. When Westbrook played awful and had a +/- < -10, the Thunder did not beat any playoff team (which happened 9 times) where the Spurs won once out of the three times Leonard played this poorly. Looking at the entire season, the Spurs won 34.78% of their games when Leonard had a (-) +/- while the Thunder won 13%.

This indicates that Westbrook’s play is much more important to the Thunder’s success. When he plays poorly, the Thunder win less than when the Spurs when Leonard plays poorly. It’s also worth mentioning that the Spurs won 5/6 games they played without Leonard this year. I’d venture to say that the Spurs are still a playoff team if Leonard is not on that team (you know they’d be better than the Trailblazers) but the Thunder are bottom feeders if Westbrook had left with Durant.

So for me the MVP is Russell Westbrook.

If you enjoyed this post, check out some of my other blogs!

-Andrew G Chapple






Package SCRSELECT: Using ReturnModel()

Here you can find more details about how to use the SCRSELECT package, particularly the main function ReturnModel(). I will post a more advanced tutorial for users that want to manually do SCRSELECT() or DICTAUG() later here.

The link to the package documentation can be found on CRAN here. This code is related to a recent paper: “Bayesian variable selection for a semi-competing  risks model with three hazards”  which is joint work with Dr. Marina Vannucci, Dr. Peter Thall and Dr. Steven Lin.

This function performs Bayesian variable selection on a semi-competing risks model which gives the posterior probabilities of inclusion for each variable in each hazard model and determines the optimal model based on thresholds determined by the DIC statistic. More details on this can be found in the paper, which will appear in print in August.

The package has 4 main functions:

  • SCRSELECT – Performs Stochastic Search Variable Selection (SVSS) on the semi-competing risks model and saves posterior samples to a given location and summaries of the variable selection parameters and MCMC acceptance rates.
  •  DICTAUG – takes posterior means from SCRSELECT and performs a grid search to find the optimal thresholds for each hazard variable inclusion.
  •  ReturnModel- Uses SCRSELECT and DICTAUG to determine the best final model, then this performs the final MCMC to obtain inference.
  •  SCRSELECTRUN – Used in the ReturnModel function

The most useful function here is ReturnModel() as it performs SVSS, finds the optimal final model (i.e. what variables are included) and then returns important summaries of the posterior along with writing the posterior to a desired folder location. I’ll briefly go over the inputs prior to looking at the output and explaining it.

  • ReturnModel(Y1,I1,Y2,I2,X,hyper,inc,c,BSVSS,BDIC,Path)
    • Y1: Vector of effusion or death time.
    • I1: Vector of indicators for a non-terminal event.
    • Y2: Vector of death or censoring times.
    • I2: Vector of death indicators.
    • X: Covariates ~ the last inc columns will not be selected and will be a part of all three hazards for any possible model.
    • Hyper: Large vector of hyperparameters ~ see documentation.
    • Inc: How many covariates should be excluded from selection.
    • c: shrinkage hyperparameter, values between 10-50 are suggested.
    • BSVSS: how many iterations to run the SVSS (suggested = 100,000)
    • BDIC: how many iterations to run the DIC search (suggested = 10,000)
    • Path = where to save posterior samples of final model

I will show what the output from the application in the paper looks like in the program. I used the example code given on CRAN, but ran these chains out for a more reasonable amount of time. I let BSVSS = 10,000 and BDIC  = 1,000 indicating that the variable selection MCMC should be carried out for 10,000 iterations and the DIC-tau_g procedure should be run on 1,000 iterations.  I will set inc=2, meaning that the SVSS will always include the last two variables in each hazard.


First this program runs two different chains through the SVSS procedure and save the relevant results. Progress of each of the two chains is printed every 5000 iterations. After each chain is finished, the important posterior quantities of the combined sample are saved and the marginal posterior probabilities of inclusion are printed for each hazard. Given that this example data is just simulated noise, it’s not surprising here that we don’t see some variables being more important than others. After these values are printed, the DIC-\tau_g proceeds by first searching over (.05,\tau_2,\tau_3) for the optimal DIC. This proceeds by increasing \tau_1 by .05.


Once the grid search finishes, the program outputs the DIC value for the optimal model(s). It prints more than 1 value if two tau vectors produced DIC values that were within 1 of each other. Next the final model variable inclusion is printed where a TRUE indicates that that variable is important in that specific hazard. We can see, for example, that only variable 4 is included in the first hazard.

Next the program takes this final model and performs MCMC to obtain important posterior quantities for inference. The first thing that is printed is the posterior probability that a covariate increases a hazard, which for example, we see that variable 7 (which was not allowed to be selected out) had a posterior probability of a non-terminal event of .73. Likewise, Variable 2 in the hazard of death after a non-terminal event was shown to be prognostic for death (P=.83). The last bit of output from the function is shown below (excuse the output format, I’ll fix that up!):


We are left with the posterior median hazard ratios in the first line of each hazard, followed by the .025 % and .975 % quantiles, respectively. Once again, notice that variables not included in the final model have NA listed in their posterior summaries.

I hope this helps clear up how to use the package! As a note of practical usage, I would suggest that the user pick a value of c and see what those marginal posterior probabilities of inclusion are after the first step. You want to encourage separation so a small value of c that leads to very large marginal posterior probabilities of inclusion will be as unhelpful as large values of c which will lead to very small marginal posterior probabilities of inclusion. Ideally, you want to see some variables with high inclusion probabilities and some with low in each of the three hazards.

I also want to note that while I do not have guides here for the functions SCRSELECT() and DICTAUG(), these are pieces of the ReturnModel() function. An experienced user of ReturnModel() might want to further investigate the posterior distribution of the SVSS procedure, which you would use SCRSELECT() to do as this saves all posterior samples. Then they might save the important quantities for manually using DICTAUG(), however this is not suggested to those who are not already experienced with using ReturnModel().

If you have any questions don’t hesitate to ask! Also if you enjoyed this post, check out some of my other blog posts.

Happy Coding!

-Andrew G Chapple

Audio Slides: Bayesian Variable Selection for a semi-competing risks model with three hazard functions

This is the link for a 5-minute video summary of my latest paper: Bayesian Variable Selection for a semi-competing risks model with three hazard functions.


You can find the link to the paper below (once it’s uploaded, it will be sometime soon).

If you enjoyed this check out some of my other posts!

-Andrew G Chapple

Interesting Statistics Papers: Hidden Markov Models in Tennis

I’ve been taking a computational biology class where we use Hidden Markov Models to predict things like whether a part of a DNA sequence is a promoter, coding region, polya or non-coding (intergenic) region. You can read a little more about HMMs here. They  also have big applications in finance but the paper I’m discussing here is more related to the biological models.

For the DNA the basic idea is for each of the 4 states= promoter, coding region, polya, intergene they can each emit nucleotides = A,T,G,C. So the HMM takes a DNA sequence ( for example x=AACGTAGCC…) and for each letter tries to predict if it’s a promoter, coding region, polya or intergene region.

I’ve been thinking about how these types models can be applied to sports. For example, can we tell if a team is on a hot streak or slump based on game play or game to game play?

This paper: Tracking the evolution of a tennis match using Hidden Markov Models


By Kolonias et al. uses hidden markov models to predict who will receive a tennis point based on events that happen within each round. The big idea is what moves can a tennis player make in terms of returning balls or serving that will increase (decrease) their (their opponents) probability of scoring.

They do this by making 5 major states within their HMM: 1st serve, second serve, ace, rally and point scored where the first 4 each have sub markov models.

The main transitions are determined by the rules of tennis as follows. After a serve (1st or second), you can either fault and serve again, score on the serve (ace), have the serve returned to start a rally, or double fault and get a point scored against you. From the first serve state you can transition to any of the other states – 2nd serve, Ace, Rally, Point while from the second serve you can’t transition back to the first serve. Ace and Rally states are always ended by a point scored.

The real wow factor of this model comes in the sub markov models.

Let’s take a look at the graphic for the Rally sub model shown in the paper:

The ball can rally back and forth until an unforced error is made or a point is forced. Surely this could be complicated by adding direction of the shot or how hard they hit it back.

I’m interested in how this type of model works in predicting whether or not a point is scored in basketball. For example how many passes is too many and how many will lead to a high scoring probability. How effective are ball screens followed by a pass? How much does different amounts of spacing effect scoring in several common plays (give and go, hero ball etc)

These are questions that could be answered using this type of model.

Anyways, I thought this was a neat idea and it was nice to see Hidden Markov Models outside of The human genome.

If you liked this check out some of my other posts!

-Andrew G Chapple

Quick Tutorial for a Reversible Jump MCMC sampler

The following slides are a presentation that I made for a class involving constructing a reversible jump markov chain monte carlo sampler – Which everyone seems to be terrified of for some reason. All we have to do is to be careful about our storage matrix and that it is formatted to keep track of the matrix dimension. An additional variable storage of this dimension is recommended.


















If you would like this presentation for your class please email me!

If you liked this, check out some of my other posts!

-Andrew G. Chapple

Can’t we all use the same Parameterizations of Distributions?

This is something that’s been bothering me for a while in statistics.

I’ll start with an example, the exponential distribution is either:

f(x|\lambda)= \lambda e^{-\lambda x} \hspace{3mm} \text{or} \hspace{3mm} f(x|\beta)=\frac{1}{\beta} e^{-x/\beta}

depending on who you ask. Obviously they are equivalent with \lambda = \frac{1}{\beta} but the discrepancies in other distribution parameterizations have wide ranging effects in statistics.

I’ll be the first to admit, I’ve accidentally used the wrong parameterization several times for the gamma functions [rgamma(), dgamma(), etc…] in R.  They use the parameterization on the right.

f(x|\alpha,\beta)= \frac{1}{\Gamma(\alpha) \beta^\alpha } x^{\alpha-1} e^{-x/\beta}  \hspace{3mm} \text{or} \hspace{3mm}  f(x|\alpha,\theta)=\frac{\theta^\alpha}{\Gamma(\alpha)} x^{\alpha -1} e^{-x \theta}

but the left parameterization is the way that the Gamma distribution is introduced in several textbooks.

Now luckily I noticed that something was going majorly awry in my Bayesian MCMC when I used rgamma() with the intent of using the left parameterization and corrected to the right parameterization but can you imagine how many people accidentally used the wrong density functions in their analysis?

Wikipedia gives a pretty good summary of some of the reasons for the different parameterizations, I know I’ve tended to stick to the parameterization with \theta but I learned the \beta parameterization in the classic Casella Berger. We know that GAMMA(p/2,2) \sim \chi_p^2 under this parameterization, but it is a GAMMA(p/2,1/2) under the alternative parameterization. Is that much of a difference? I don’t think so.

As a Bayesian who uses CRAN a lot (I’m sure many of you also use dgamma) I say we should just consolidate to the parameterization with (\alpha, \theta) instead of (\alpha,\beta). The one thing that’s lacking is the relationship between the Gamma and Weibull distributions.

But then again, Weibull people (Reliability and Survival Gurus) instead of using this parameterization

f(x| k, \lambda) = \frac{k}{\lambda} \left(\frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}

can we please use

f(x| k, \theta) = k \theta^k x^{k-1} e^{-\theta^k x^k}

and the (\alpha, \theta) parameterization of the Gamma distribution? (Also PLEASE don’t use this one: f(x|b,k)=bkx^{k-1} \exp [-bx^k]). It also doesn’t help things that out of the 100 or so survival models available on CRAN, they use all three of these different parameterizations when a Weibull hazard is desired.

So can we all consolidate and agree to use the same parameterizations? [i.e. let’s all use  (\alpha, \theta) for the Gamma, \lambda for the exponential and (k,\theta) for the Weibull distributions]

Then we’ll still have a functional relationship between the Gamma, Weibull, Chi-Squared and Exponential distributions, but we won’t have to deal with those silly denominators or even God forbid, having to specify a rate parameter in CRAN instead of using the default.


-Andrew G Chapple

Check out some of my other posts if you enjoyed this one.



Fun with Data: NFL playoff predictions with a simple who-beat-who model

With the playoffs coming up, people are tipping their hats into the prediction arena. Most broadcasters just pick teams based on recent performance, matchups and who’s home. As a side note, I should tell everyone that these picks are based on internal models. While they might not be statistical in terms of using concrete data (some probably do use raw numbers), commentators still look at the data of who-beat-who, bye weeks, recent streaks and the “eye test” and internally weight each factor in some way that produces a prediction.

Other outlets are actually statistical in the usual sense. FiveThirtyEight uses ELO which produces a statistic estimating how good a team is. They use these statistics in projecting win probabilities for two teams.

One model that I like that’s simple (and hence everyone can understand it) is the Bradley-Terry model. This model in it’s simplest form, takes data of who beat who and gets a prediction probability. It has had popular application in Tennis predictions.

Think of it this way:

My Saints beat the Seahawks this season, who beat the AFC regular season champion Patriots. While my Saints have largely been a disappointment this season, the fact that they beat a quality opponent should show you that they at least have the potential to play up to high level teams.

Likewise the model would give the Saints a slightly higher probability of beating the Patriots head to head than they would have if the Seahawks lost to the Pats.

The Bradley-Terry model (which can be extended to incorporate random effects like who is home etc) looks at the results of all 256 regular season games and uses this information to get predictions. Essentially, the model for the probability that team j beats team k is:

P[j  \text{beats}  k]  = \frac{\exp[\lambda_j - \lambda_k]}{1+\exp[\lambda_j - \lambda_k]}

and the coefficients \lambda_j, \lambda_k are estimated through maximum likelihood estimation, i.e. choosing \lambda_1 , \lambda_2 , ... , \lambda_{31} (one team~Arizona~ is treated as the baseline team to compare to) to maximize the following binomial likelihood

L(Data | \lambda_1, ..., \lambda_{31}) = \prod \limits_{k=0}^{31} \prod \limits_{j \ne k} \left[\frac{\exp[\lambda_j - \lambda_k]}{1+\exp[\lambda_j - \lambda_k]} \right]^{n_{jk}} \left[1-\frac{\exp[\lambda_j - \lambda_k]}{1+\exp[\lambda_j - \lambda_k]} \right]^{N_{jk}-n_{jk}}

where \lambda_0 = 0, N_{jk} is the number of times that teams j and k play and n_{jk} is the number of times that team j beats team k.

Some Drawbacks: One issue is that every NFL team ONLY plays 13/31 other teams (~41%). This limits some of the information, but also highlights the need to make third party comparisons-i.e. looking at game results for a common opponent between two teams. This means that we will be unlikely to get extreme prediction probabilities of victory (say >.8), but if the season was as long as say the NBA, we could get more stern predictions.

In college football, teams play 11/347 total division I schools (~3%). So the Bradley-Terry model is probably not useful [Research extension here!]

The last thing I’ll mention is since the NFL is split into two conferences for the playoffs, we have much more information. Teams play 11/15 other teams in their conference (~73%) and every two teams in a conference must share at least one common opponent.

Anyways enough with the model discussion. The code I used is found here. (I used the logit transformation). Anyways, I used these Bradley Terry models to get predicted probabilities of winning and used these to simulate the entire playoffs many times via Monte Carlo. For a look at some of the model probabilities, here are the wild card games with the probability of winning for the home team in ( ).

NFC: WildCard

             NYG @ GB (48.81%)                            DET @ SEA (51.79 %)

AFC: WildCard

               Oak @ Hou (46.2%)                                   Mia @ Pit (51.43%)

Given the parity in the NFL, it’s probably not surprising that these probabilities are all pretty close to 50%. There are extensions to the Bradley-Terry model that can incorperate extra variables, like who’s at home or winning as a function of certain opposing team statistics, which would move these probabilities further away. However, for simplicity I avoided these ((maybe next year depending on how these predictions go!)). So I used all these probabilities to simulate the entire playoffs 1,000,000 times keeping in mind that the Patriots and the Cowboys play the lowest advancing seed (3 potential opponents in round 2). I kept track of how many games each team won, who won the AFC/NFC and who won the Superbowl.

The Big Kicker


Nothing too surprising here. The Patriots, Cowboys,Chiefs and Falcons had the highest number of Super Bowl wins in the simulations but they also had the most regular season wins. They also have the luxury of getting a bye and do not risk playing an extra game. 65% of Superbowl champions have had byes and these simulations gave this a 55% chance of happening. I also should note that the Raiders are projected as the team to appear in the Superbowl the most out of the non-bye teams, which does not incorporate the fact that their quarterback Derek Carr has a broken fibula.

The Texans, sadly, had the worst chances at winning the big dance.

Anyways, I just wanted to apply this simple but cool model to see how it predicted the playoffs. I plan to do this for the NBA as long as the NBA has an excel spread sheet of wins (since the hardest part of this was putting the data together). I could also use additional variables like homefield advantage, yards/game etc. to move these probabilities away from .5. Those coefficients would be estimated as fixed effects in each probability function.

I’m hoping these sims are wrong and the Texans pull off a miracle.

If you enjoyed this post, take a look at some of my other Adventures in Statistics.

-Andrew G Chapple