Buck Fama: A look at rank and point differentials between LSU and Alabama in the Saban era

Today marks the 12th time that LSU has played Alabama since Nick Saban took over as head coach, only winning 3 out of the past 11 matchups with 6 losses in a row. Over these 11 games, Alabama scored an average of 7.2 more points per game than LSU, with a victory margin of 11.5 points in the 8 wins against LSU.

I went back and looked at the rankings of each team in these games, found on these team/year pages from ESPN, (It really didn’t take but 10 min to tabulate), to see if there were any interesting trends. In only four of the 11 games, LSU was ranked better than Alabama, with LSU winning the first two and losing the second two outings. On average, Alabama has been ranked 4.2 spots higher than LSU.

Below is a plot of the Alabama – LSU point differentials as a function of the rank differentials, where each black point represents a game played against Saban. If Alabama won, the point differential is positive and if they had a higher ranking, the rank differential is negative. For example, this year Alabama is ranked #2 and LSU is ranked #19 so the rank differential is 2-19=-17, which is the biggest differential since Saban took over. The three large triangles represent the three overtime games and the purple dot is the predicted point differential for today’s game.

The two black lines here mark 0’s so that points to the right of the vertical line indicate that LSU was ranked higher, while points below the horizontal line indicate a victory for LSU. I fit a linear regression model to these data points to predict the point differential of today’s game based on the rank differential and this model estimates that Alabama will win by 12.9 points. The regression line is plotted above in purple and has intercept 5.30 and slope -.44.

For lagniappe, I fit a logistic regression model for the rank data to get an estimated probability of victory for LSU. ESPN says that LSU has a 6.9 % chance of winning, while the usual logistic regression model on the rank data only gives LSU a 1.2 % chance of winning. Neither suggests a positive outcome for the Tigers, but upsets do happen.

Who knows, an upset here might be karma repaid for the loss to Troy, but what a strange season that would be…

Geaux Tigers!!

If you enjoyed this post, check out some of my other posts!

-Andrew G. Chapple

Adding common legends to multiple R plots

Recently I submitted a paper to a journal that required compound figures with multiple plots to be uploaded as one combined single .eps file. This challenge turned out to be daunting, as I was trying to combine 4 figures, each requiring a legend. Doing this in R using par(mfrow=c(2,2)) produced plots that were unreadable due to the shrinked sizes of each plot and the legend any one of the plots. Likewise there were not too many options in terms of combining the plots .eps files into one .eps file online.

The solution: xpd=”NA” in the legend options.

The following code was used to add a common legend to the

> legend(-.35,1.5, legend = c("0", "1","2","3"),
+ lty = c(2,3,4,5),
+ title = "Subgroup",xpd="NA")

It seems like most of the time, that this code produces an error or warning, but does what we want, producing the following graphic.

So we have the desired output, one common legend in the middle of four plots. Now the coordinates here for the legend were tried via trial and error. For some reason, different sets of graphics have different coordinates that correspond to (0,0) so you need to move the legend around until you get what you want. The following two graphics shows some of the trial and errors that it took to get the legend placed correctly.

It usually doesn’t take too long to get it placed right.

I hope this is useful! And if not, it is at least here as a reference for me.

If you liked this post, check out some of my other blog posts.

-Andrew G. Chapple

Who should win NBA MVP?

By now I’m sure you’ve seen a case for the four big candidates: Russel Westbrook, James Harden, Lebron James and Kawhi Leonard or maybe just a breakdown of who should win. With the voting process ending today at 3PM eastern, I’m going to throw my own hat into the who-should-win sweepstakes.

Even though I reside in Houston, I’m going to try my best to be unbiased here and just look at the numbers.

First I’m going to look at something most analysts haven’t, statistics versus playoff teams. How well do these candidates play against the top competition? I downloaded the data from BasketBallReference.com and looked at the average statistics in those games against playoff teams, each player played at least 37 of these games.

This table should immediately disqualify James Harden based on his negative Real +/- statistics in these games, which measures how the point margin will change over 100 possessions when the player is in the game versus when they aren’t. It incorporates the teams defensive efficiency when that player is on the floor. When Harden was on the floor against playoff teams the average margin was -.3 points less than when he was off the floor. All other MVP candidates have a positive +/- stat against Playoff teams. Obviously since these are means, they are heavily influenced by especially poor performances but the medians tell the same story for James: Harden = 0, Leonard = 1, Lebron =2, Westbrook = 3. Harden’s defensive liability hurts him here. Westbrook had 6 poor games where he had a +/- of > -19.

Now that Harden is out, Lebron also gets the boot after having a losing record 12-15 since the All-Star break. You cannot collapse down the stretch like that.

Now we’re down to Westbrook versus Leonard.

As far as playoff teams, Leonard has a higher win percentage, a higher +/- mean (but not median), a higher average FG % and a lower turnover margin. Westbrook leads Leonard in the three major statistical categories by a pretty large margin in these games.

Obviously we know that Westbrook had the better season in terms of raw statistics throughout the season, averaging a triple double, leading the NBA in scoring and breaking the record for the most triple doubles in a season. It has been mentioned that his teammates helped Westbrook get these records. Leonard is the league’s best two way player on the second best team in the NBA while Westbrook’s Thunder finished 10th overall. But who’s more valuable to their team?

In games against playoff teams, when Westbrook had a +/- that was <0, they won only 5.26 % of those games compared to Leonard’s Spurs winning 31.25 % of their games. When Westbrook played awful and had a +/- < -10, the Thunder did not beat any playoff team (which happened 9 times) where the Spurs won once out of the three times Leonard played this poorly. Looking at the entire season, the Spurs won 34.78% of their games when Leonard had a (-) +/- while the Thunder won 13%.

This indicates that Westbrook’s play is much more important to the Thunder’s success. When he plays poorly, the Thunder win less than when the Spurs when Leonard plays poorly. It’s also worth mentioning that the Spurs won 5/6 games they played without Leonard this year. I’d venture to say that the Spurs are still a playoff team if Leonard is not on that team (you know they’d be better than the Trailblazers) but the Thunder are bottom feeders if Westbrook had left with Durant.

So for me the MVP is Russell Westbrook.

If you enjoyed this post, check out some of my other blogs!

-Andrew G Chapple

Package SCRSELECT: Using ReturnModel()

Here you can find more details about how to use the SCRSELECT package, particularly the main function ReturnModel(). I will post a more advanced tutorial for users that want to manually do SCRSELECT() or DICTAUG() later here.

The link to the package documentation can be found on CRAN here. This code is related to a recent paper: “Bayesian variable selection for a semi-competing  risks model with three hazards”  which is joint work with Dr. Marina Vannucci, Dr. Peter Thall and Dr. Steven Lin.

This function performs Bayesian variable selection on a semi-competing risks model which gives the posterior probabilities of inclusion for each variable in each hazard model and determines the optimal model based on thresholds determined by the DIC statistic. More details on this can be found in the paper, which will appear in print in August.

The package has 4 main functions:

• SCRSELECT – Performs Stochastic Search Variable Selection (SVSS) on the semi-competing risks model and saves posterior samples to a given location and summaries of the variable selection parameters and MCMC acceptance rates.
•  DICTAUG – takes posterior means from SCRSELECT and performs a grid search to find the optimal thresholds for each hazard variable inclusion.
•  ReturnModel- Uses SCRSELECT and DICTAUG to determine the best final model, then this performs the final MCMC to obtain inference.
•  SCRSELECTRUN – Used in the ReturnModel function

The most useful function here is ReturnModel() as it performs SVSS, finds the optimal final model (i.e. what variables are included) and then returns important summaries of the posterior along with writing the posterior to a desired folder location. I’ll briefly go over the inputs prior to looking at the output and explaining it.

• ReturnModel(Y1,I1,Y2,I2,X,hyper,inc,c,BSVSS,BDIC,Path)
• Y1: Vector of effusion or death time.
• I1: Vector of indicators for a non-terminal event.
• Y2: Vector of death or censoring times.
• I2: Vector of death indicators.
• X: Covariates ~ the last inc columns will not be selected and will be a part of all three hazards for any possible model.
• Hyper: Large vector of hyperparameters ~ see documentation.
• Inc: How many covariates should be excluded from selection.
• c: shrinkage hyperparameter, values between 10-50 are suggested.
• BSVSS: how many iterations to run the SVSS (suggested = 100,000)
• BDIC: how many iterations to run the DIC search (suggested = 10,000)
• Path = where to save posterior samples of final model

I will show what the output from the application in the paper looks like in the program. I used the example code given on CRAN, but ran these chains out for a more reasonable amount of time. I let BSVSS = 10,000 and BDIC  = 1,000 indicating that the variable selection MCMC should be carried out for 10,000 iterations and the DIC-tau_g procedure should be run on 1,000 iterations.  I will set inc=2, meaning that the SVSS will always include the last two variables in each hazard.

First this program runs two different chains through the SVSS procedure and save the relevant results. Progress of each of the two chains is printed every 5000 iterations. After each chain is finished, the important posterior quantities of the combined sample are saved and the marginal posterior probabilities of inclusion are printed for each hazard. Given that this example data is just simulated noise, it’s not surprising here that we don’t see some variables being more important than others. After these values are printed, the DIC-\tau_g proceeds by first searching over (.05,\tau_2,\tau_3) for the optimal DIC. This proceeds by increasing \tau_1 by .05.

Once the grid search finishes, the program outputs the DIC value for the optimal model(s). It prints more than 1 value if two tau vectors produced DIC values that were within 1 of each other. Next the final model variable inclusion is printed where a TRUE indicates that that variable is important in that specific hazard. We can see, for example, that only variable 4 is included in the first hazard.

Next the program takes this final model and performs MCMC to obtain important posterior quantities for inference. The first thing that is printed is the posterior probability that a covariate increases a hazard, which for example, we see that variable 7 (which was not allowed to be selected out) had a posterior probability of a non-terminal event of .73. Likewise, Variable 2 in the hazard of death after a non-terminal event was shown to be prognostic for death (P=.83). The last bit of output from the function is shown below (excuse the output format, I’ll fix that up!):

We are left with the posterior median hazard ratios in the first line of each hazard, followed by the .025 % and .975 % quantiles, respectively. Once again, notice that variables not included in the final model have NA listed in their posterior summaries.

I hope this helps clear up how to use the package! As a note of practical usage, I would suggest that the user pick a value of c and see what those marginal posterior probabilities of inclusion are after the first step. You want to encourage separation so a small value of c that leads to very large marginal posterior probabilities of inclusion will be as unhelpful as large values of c which will lead to very small marginal posterior probabilities of inclusion. Ideally, you want to see some variables with high inclusion probabilities and some with low in each of the three hazards.

I also want to note that while I do not have guides here for the functions SCRSELECT() and DICTAUG(), these are pieces of the ReturnModel() function. An experienced user of ReturnModel() might want to further investigate the posterior distribution of the SVSS procedure, which you would use SCRSELECT() to do as this saves all posterior samples. Then they might save the important quantities for manually using DICTAUG(), however this is not suggested to those who are not already experienced with using ReturnModel().

If you have any questions don’t hesitate to ask! Also if you enjoyed this post, check out some of my other blog posts.

Happy Coding!

-Andrew G Chapple

Audio Slides: Bayesian Variable Selection for a semi-competing risks model with three hazard functions

This is the link for a 5-minute video summary of my latest paper: Bayesian Variable Selection for a semi-competing risks model with three hazard functions.

http://audioslides.elsevier.com//ViewerLarge.aspx?source=1&doi=10.1016/j.csda.2017.03.002

You can find the link to the paper below (once it’s uploaded, it will be sometime soon).

If you enjoyed this check out some of my other posts!

-Andrew G Chapple

Interesting Statistics Papers: Hidden Markov Models in Tennis

I’ve been taking a computational biology class where we use Hidden Markov Models to predict things like whether a part of a DNA sequence is a promoter, coding region, polya or non-coding (intergenic) region. You can read a little more about HMMs here. They  also have big applications in finance but the paper I’m discussing here is more related to the biological models.

For the DNA the basic idea is for each of the 4 states= promoter, coding region, polya, intergene they can each emit nucleotides = A,T,G,C. So the HMM takes a DNA sequence ( for example x=AACGTAGCC…) and for each letter tries to predict if it’s a promoter, coding region, polya or intergene region.

I’ve been thinking about how these types models can be applied to sports. For example, can we tell if a team is on a hot streak or slump based on game play or game to game play?

http://xm2vtsdb.ee.surrey.ac.uk/CVSSP/Publications/papers/kolonias-sspr04.pdf

By Kolonias et al. uses hidden markov models to predict who will receive a tennis point based on events that happen within each round. The big idea is what moves can a tennis player make in terms of returning balls or serving that will increase (decrease) their (their opponents) probability of scoring.

They do this by making 5 major states within their HMM: 1st serve, second serve, ace, rally and point scored where the first 4 each have sub markov models.

The main transitions are determined by the rules of tennis as follows. After a serve (1st or second), you can either fault and serve again, score on the serve (ace), have the serve returned to start a rally, or double fault and get a point scored against you. From the first serve state you can transition to any of the other states – 2nd serve, Ace, Rally, Point while from the second serve you can’t transition back to the first serve. Ace and Rally states are always ended by a point scored.

The real wow factor of this model comes in the sub markov models.

Let’s take a look at the graphic for the Rally sub model shown in the paper:

The ball can rally back and forth until an unforced error is made or a point is forced. Surely this could be complicated by adding direction of the shot or how hard they hit it back.

I’m interested in how this type of model works in predicting whether or not a point is scored in basketball. For example how many passes is too many and how many will lead to a high scoring probability. How effective are ball screens followed by a pass? How much does different amounts of spacing effect scoring in several common plays (give and go, hero ball etc)

These are questions that could be answered using this type of model.

Anyways, I thought this was a neat idea and it was nice to see Hidden Markov Models outside of The human genome.

If you liked this check out some of my other posts!

-Andrew G Chapple

Quick Tutorial for a Reversible Jump MCMC sampler

The following slides are a presentation that I made for a class involving constructing a reversible jump markov chain monte carlo sampler – Which everyone seems to be terrified of for some reason. All we have to do is to be careful about our storage matrix and that it is formatted to keep track of the matrix dimension. An additional variable storage of this dimension is recommended.

If you would like this presentation for your class please email me!

If you liked this, check out some of my other posts!

-Andrew G. Chapple