# Buck Fama: A look at rank and point differentials between LSU and Alabama in the Saban era

Today marks the 12th time that LSU has played Alabama since Nick Saban took over as head coach, only winning 3 out of the past 11 matchups with 6 losses in a row. Over these 11 games, Alabama scored an average of 7.2 more points per game than LSU, with a victory margin of 11.5 points in the 8 wins against LSU.

I went back and looked at the rankings of each team in these games, found on these team/year pages from ESPN, (It really didn’t take but 10 min to tabulate), to see if there were any interesting trends. In only four of the 11 games, LSU was ranked better than Alabama, with LSU winning the first two and losing the second two outings. On average, Alabama has been ranked 4.2 spots higher than LSU.

Below is a plot of the Alabama – LSU point differentials as a function of the rank differentials, where each black point represents a game played against Saban. If Alabama won, the point differential is positive and if they had a higher ranking, the rank differential is negative. For example, this year Alabama is ranked #2 and LSU is ranked #19 so the rank differential is 2-19=-17, which is the biggest differential since Saban took over. The three large triangles represent the three overtime games and the purple dot is the predicted point differential for today’s game.

The two black lines here mark 0’s so that points to the right of the vertical line indicate that LSU was ranked higher, while points below the horizontal line indicate a victory for LSU. I fit a linear regression model to these data points to predict the point differential of today’s game based on the rank differential and this model estimates that Alabama will win by 12.9 points. The regression line is plotted above in purple and has intercept 5.30 and slope -.44.

For lagniappe, I fit a logistic regression model for the rank data to get an estimated probability of victory for LSU. ESPN says that LSU has a 6.9 % chance of winning, while the usual logistic regression model on the rank data only gives LSU a 1.2 % chance of winning. Neither suggests a positive outcome for the Tigers, but upsets do happen.

Who knows, an upset here might be karma repaid for the loss to Troy, but what a strange season that would be…

Geaux Tigers!!

If you enjoyed this post, check out some of my other posts!

-Andrew G. Chapple

# Modeling ISIL terror attacks and their intensities via marked point process Bayesian models

Since formally declaring itself a caliphate in 2014, the Islamic State of Iraq and the Levant (ISIL) has become the most deadly terror organization in the world, particularly affecting the middle east. Over 300 terrorist attacks have been attributed to ISIL in 2017 alone, while they continue to lose territory in Syria and Iraq, recently having Mosul freed from their control. However, it’s not clear whether this loss in territory has decreased their ability to carry out many devastating attacks or if ISIL has become more deadly as they gained more experience.

To model this trend, I use a model similar to the usual Bayesian Marked Point processes but endowed with slightly differing prior distributions to encourage dependency between adjacent intervals and discourage time intervals where few attacks happened.

Data from the National Consortium for the Study of Terrorism and Responses to Terrorism (START) was examined to see if the rates or intensities of ISIL attacks changed over time. Since this data only consists of terror attacks up until 2016, data coming from January 1, 2017 to October 22,2017 was obtained from another source obtained through crowdsourcing and Story Maps. The model is applied and picks out several interesting intervals of significant changes in ISIL’s effectiveness that match up to major news stories or events.

Note: If you are interested in the model, prior distributions and MCMC sampling scheme, skip to the end of this post. For everyone else, here’s the results of the application.

Application Results

The model used is described after the application in detail. In general, the model used tries to determine how many intervals of time since 2014 have different rates of attacks or fatality rates and how these rates vary over time. The number of intervals and the boundary of the intervals are considered random variables, along with the mean rate and number of fatalities on each interval.

This type of model will allow us to determine if there was some period of time from 2014-Now where ISIL was more or less effective and look at the events around these time periods. This could give some evidence that different strategies are more or less effective.

I used data from the National Consortium for the Study of Terrorism and Responses to Terrorism (START), only keeping data from after January 1, 2014 and terror attacks attributed to ISIL or its known partners. Since this data only consists of terror attacks up until 2016, data coming from January 1, 2017 to October 22,2017 was obtained from another source obtained through crowdsourcing and Story Maps. Only confirmed or suspected ISIL attacks were used, so for example the Las Vegas shooting was not included although ISIL claimed responsibility. We removed any data points where the number of fatalities were not available, leaving 4893 attacks since 2014. The average number of fatalities over the entire dataset was 7.8.

We applied the PieceExpIntensity package (developed for this application) to the data assuming that $\mu_J = 10$ with a maximum allowed 20 split points and 100,000 iterations, burning in the first half of the posterior samples. In the posterior, no samples had more than 10 different intervals of varying activity with 5 split points occurring the most, with probability .46. Four and six split points occurred with probabilities .30 and .22 which were the second and third most visited values of $J$ in the posterior distribution.

This model had mean split point locations at 1, 2014 of 54.3, 100.2, 159.7, 162.5 and 777.6 days after January 1, 2014. This corresponded to four split points occurring after the dates February 24, March 11, June 8 and June 12 2014 with a later mean split point location after February 19, 2016. There is such a short interval here between June 8th and June 12, 2014 because ISIL took Tikrit on June 12, 2014 during the Northern Iraq offensive, killing over $1,500$ people on this date. There were 23 separate attacks conducted by ISIL during this time period with 8 of these attacks killing 2,252 people and the other 15 not resulting in any casualties. The estimated posterior mean number of fatalities per attack on this interval was a staggering $93.0$, $85.3$ higher than the next highest mean fatality rate. Thus the method was able to accurately find a known period of high terrorist activity.

We examine the posterior samples of $\pmb{s}, \pmb{\lambda}, \pmb{\theta} | J = 5$ to draw inference on ISIL’s effectiveness. Plots of the posterior means and credible intervals for $\exp[\pmb{\theta}]$ and $\pmb{\lambda}$ is shown below.

We consider parameters that do not have overlapping credible intervals to be vastly different, displaying some significance. The credible interval for $\lambda_6 | J=5$ does not overlap with any of the other credible intervals, indicating that the rate of terror attacks carried out by ISIL decreased drastically from February 20, 2016 until today compared to previous time periods. This was a two months after Ramadi was retaken and four months before Fallujah was retaken from ISIL, which suggest that military intervention has decreased the risk of an attack. None of the other $j$ have credible intervals that don’t overlap with that for $\lambda_1$ for $J=5$. When looking at the credible intervals for $\exp[\pmb{\theta}]$, we see that intervals $5$ and $6$ (occuring from 6/13/2014 until today) have significantly higher fatality rates than the first three time intervals. Recall here that the posterior mean and credible interval for $\exp[\theta_4]$ are not displayed in Figure 1 but are 93 and (83.9, 100.2).

These results suggest that ISIL’s ability to carry out attacks has decreased dramatically since February 20,2016 but the deadliness of these attacks has increased compared to the early year’s after ISIL’s founding. Interestingly, the United States launched a successful airstrike on Libyan ISIL members including a senior leader who directed several attacks Noureddine Chouchane on February 19. It appears that the increase in the military push back, which resulted in a significant loss of ISIL teritory, has hindered ISIL’s ability to carry out attacks, but that their experience in conducting so many attacks since 2014 makes attacks much more deadly. It could also be that ISIL is feeling the pressure of losing their lands and attempting to dissuade further invasion with deadlier attacks when they have the ability to do so.

Model, Prior and Sampling information

We require a model that can identify intervals of time where the rate and/or the intensity of events change. To accomplish this end, we use the piecewise exponential distribution as it can flexibly estimate the number of intervals needed to accurately characterize the distribution and the hazard of an event within each disjoint interval. Let $X_i >0$ denote the $i$th event time with $Y_i \in \{0,1,... \}$ denoting the intensity of the event. In our case, $Y_i$ is the number of fatalities caused by attack $i$, which reflects the intensity of each attack. Define $\pmb{s}=(s_0=0, s_1, ...., s_J, s_{J+1}=X_\text{max})$ as a partition of the time scale from $0$ to the maximum observed value for $\pmb{X}$ ($X_\text{max}$) where rates and intensities of events differ. For all $X \in (s_{j-1},s_j]$, we assume that the hazard of an event is $h(X) = \exp(\lambda_j)$ and that the the conditional distribution of the intensity is $Y | X \in (s_{j-1},s_j] = \text{Poisson}(\exp(\theta_j))$. We assume that given a time scale partition $\pmb{s}$ all pairs $(X_i,Y_i)$ for which $X_i$ lies in this interval are independent and that observations in different intervals are also independent. Denote $\tilde{\theta}_j$ as $\exp[\theta_j]$. These assumptions lead to the following likelihood

$L(\mathbf{X},\mathbf{Y} | \mathbf{s}, \mathbf{\lambda}, \mathbf{\mu}) = \prod \limits_{i=1}^n \sum \limits_{j=1}^{J+1} I\left[ X_i \in (s_{j-1},s_j] \right] P \left( X_i \in (s_{j-1},s_j] | \mathbf{s}, \mathbf{\lambda} \right) f(Y_i | X_i \in (s_{j-1},s_j], \theta_j ) =$

$=\prod \limits_{j=1}^{J+1} \exp \left[ d_j (\lambda_j-\tilde{\theta_j}) - \exp(\lambda_j) \sum \limits_{i \in R_j} \Delta_{ij} \right] \frac{\tilde{\theta}_j^{\sum \limits_{k \in D_j} Y_k}}{\prod \limits_{k \in D_j} Y_k !} ,$

where $d_j$ is the number of events in the interval $(s_{j-1},s_j]$, $R_j$ is the set of events in the data set that have not yet occurred by time $s_{j-1}$ and $\Delta_{ij}=\min(X_i,s_{j})-s_{j-1}$. Additionally, $D_j$ is the set of attacks that happened between $s_{j-1}$ and $s_j$. We adopt a Bayesian approach by endowing the parameters $J, \pmb{s}, \pmb{\lambda}, \pmb{\theta}$ with priors. We assume that the split point locations positions $\mathbf{s}$ are a priori uniform but that they are spaced out in a way to discourage intervals with few events. This is done by drawing $2J+1$ uniform random variables from 0 to $X_\text{max}$ and taking the even order statistics as $\pmb{s}$. Formally, this prior is:

$\mathbf{s}|J \propto \frac{(2J+1)! \prod \limits_{j=1}^{J+1}(s_j-s_{j-1})}{s_{J+1}^{2J+1}}$

Since we do not know how many times the rate and severity of these attacks changed, we do not fix $J$ and assume that $J \sim POI(\mu_J)$ where $\mu_J$ is set to less than 20, with a default suggestion of $\mu_J=10$. We impose spatial dependency in the vectors $\pmb{\theta}$ and $\pmb{\lambda}$ with the following priors:

$\lambda_1 \sim N(0, 25), \hspace{5mm} \lambda_j | \lambda_{j-1}, \sigma_\lambda \sim N(\lambda_{j-1}, \sigma_\lambda) \hspace{10mm} \text{and} \hspace{10mm} \theta_1 \sim N(0, 25), \hspace{5mm} \theta_j | \theta_{j-1}, \sigma_\theta \sim N(\theta_{j-1}, \sigma_\theta)$

This allows us to establish some dependency between adjacent intervals of $\pmb{s}$ which will help us model how the effectiveness and rate of attacks have changed over time. Finally, we assume that $\sigma_\theta \propto 1/\sigma_\theta$ and $\sigma_\lambda \propto 1/\sigma_\lambda$ to encourage more dependency.

We perform a reversible jump Markov Chain Monte Carlo (MCMC) sampling scheme on the parameters $J, \pmb{s}, \pmb{\lambda}, \pmb{\theta}$, reccommending at least 10,000 iterations to explore the sample space and obtain some convergence on $J$.

For the sampling scheme, we do the following moves:

•  Metropolis Hastings move on $\pmb{s} | J$. We shuffle the entries of $\pmb{s}$ for $j=1,...,J$ if $J>0$ by proposing $s_j^{*} \sim \text{Unif}[s_{j-1}, s_{j+1}]$ and do not change $\pmb{\lambda}$ or $\pmb{\theta}$. This move is accepted based on the likelihood ratio and the prior ratio for $\pmb{s}|J$.
• Metropolis Hastings moves on $\pmb{\lambda}$ and $\pmb{\theta}$. Let $\pmb{\gamma}$ denote either of these vectors for the time being. For $j=1,...,J+1$, we propose $\gamma_j^{*} \sim \text{Unif}[\gamma_j-.25, \gamma_j+.25]$ as seen in Haneuse et al. and accept based on the likelihood ratio and the prior ratio for $\gamma_j|\gamma_{(-j)}, \pmb{s}, \sigma_\gamma$.
•  Gibbs step on $\sigma_\lambda^2$ and $\sigma_\theta^2$ by drawing $\sigma_\gamma^2$ from an inverse gamma distribution with parameters $J/2+1$ and $\sum \limits_{j=1}^{J+1} (\gamma_{j+1}-\gamma_j)^2$ for $\gamma \in \{\lambda, \theta \}$.
• Metropolis Hastings Green move on $J , \pmb{s}, \pmb{\lambda}, \pmb{\theta}$ which is done every iteration via add and delete moves (if applicable). For an add move, we propose $J^{*}=J+1$ and add an extra split point by drawing it from $\text{Unif}[0,X_\text{max}]$ to obtain $\pmb{s}^{*}$. Let $s_{*}$ denote the location of this additional split point and given that $s_{*} \in (s_{j-1},s_j)$, we define the new values $\gamma_{j}^{*}$ and $\gamma_{j+1}^{*}$ for $\pmb{\gamma} = \{\pmb{\lambda}, \pmb{\theta} \}$ via the multiplicative perturbation of $\exp[\gamma_{j+1}^{*}]/\exp[\gamma_j^{*}]=(1-U_\gamma)/U_\gamma$ suggested by Green and a weighted average $(s_{*}-s_{j-1})\gamma_j^{*}+(s_{j}-s_{*})\gamma_{j+1}^{*} = (s_{j}-s_{j-1})\gamma_j$ which yields the new values.

$\gamma_{j}^{*} = \gamma_j - \frac{s_j-s_{*}}{s_{j}-s_{j-1}} \log \left( \frac{1-U_\gamma}{U_\gamma} \right) \hspace{5mm} \text{and} \hspace{5mm} \gamma_{j+1}^{*} = \gamma_j + \frac{s_{*}-s_{j-1}}{s_{j}-s_{j-1}} \log \left( \frac{1-U_\gamma}{U_\gamma} \right)$

where $U_\gamma \sim \text{Unif}[0,1]$ is drawn for both $\pmb{\lambda}$, $\pmb{\theta}$ separately. $J^{*}, \pmb{s}^{*}, \pmb{\lambda}^{*}, \pmb{\theta}^{*}$ is accepted over the previous values with probability equal to the product of the likelihood ratio, the prior ratio and the following jacobians to maintain detailed parameter balance between the parameter spaces.

$\begin{vmatrix} \mathrm{d} \lambda_j^{*} / \mathrm{d} \lambda_j & \mathrm{d} \lambda_j^{*} / \mathrm{d} U_\lambda \\ \mathrm{d} \lambda_{j+1}^{*} / \mathrm{d} \lambda_{j+1} & \mathrm{d} \lambda_j^{*} / \mathrm{d} U_\lambda \\ \end{vmatrix} \begin{vmatrix} \mathrm{d} \theta_j^{*} / \mathrm{d} \theta_j & \mathrm{d} \theta_j^{*} / \mathrm{d} U_\theta \\ \mathrm{d} \theta_{j+1}^{*} / \mathrm{d} \theta_{j+1} & \mathrm{d} \theta_j^{*} / \mathrm{d} U_\theta \\ \end{vmatrix} = [U_\lambda (1-U_\lambda) U_\theta (1-U_\theta)]^{-1}$

• Delete: For a delete move, we randomly choose some $j=1,...,J$ and propose removing $s_j$ from $\pmb{s}^{*}$, which sets $\theta_j^{*}$ and $\lambda_j^{*}$ equal to the weighted average of $\theta_{j},\theta_{j+1}$ and $\lambda_j,\lambda_{j+1}$. For $\gamma_j \in \{\lambda_j, \theta_j \}$ we set $\gamma_j^{*} = \left[ \gamma_j (s_j-s_{j-1}) + \gamma_{j+1} (s_{j+1}-s_j) \right]/(s_{j+1}-s_{j-1})$. We assume a multiplicative perturbation of $\exp[\gamma_{j+1}]/\exp[\gamma_j]=(1-U_\gamma)/U_\gamma$. We accept $J^{*}, \pmb{s}^{*}, \pmb{\lambda}^{*}, \pmb{\theta}^{*}$ with probability equal to the likelihood ratio, prior ratio and the following jacobian.

$\begin{vmatrix} \mathrm{d} \lambda_j^{*} / \mathrm{d} \lambda_j & \mathrm{d} \lambda_j / \mathrm{d} \lambda_{j+1} \\ \mathrm{d} U_\lambda / \mathrm{d} \lambda_{j} & \mathrm{d} U_\lambda / \mathrm{d} \lambda_{j+1} \\ \end{vmatrix}\begin{vmatrix} \mathrm{d} \theta_j^{*} / \mathrm{d} \theta_j & \mathrm{d} \theta_j^{*} / \mathrm{d} \theta_{j+1} \\ \mathrm{d} U_\theta / \mathrm{d} \theta_{j} & \mathrm{d} U_\theta / \mathrm{d} \theta_{j+1} \\ \end{vmatrix}= U_\lambda (1-U_\lambda) U_\theta (1-U_\theta)$

Again here $U_\gamma \sim \text{Unif}[0,1]$ is drawn for both $\pmb{\lambda}$, $\pmb{\theta}$ separately.

We try both delete moves if $J>0$ and add moves if $J at every iteration. Potentially, this could result in shuffling $\pmb{s}$ with adjusted values for $\pmb{\lambda}, \pmb{\theta}$ if both an add and delete move are executed within an iteration.

After the MCMC has been completed, we burn in the first half of the samples and draw inference on how the rate and intensity of events have changed over time by looking at posterior samples that have $J$ values equal to the mode of $J$ over all posterior samples. Generally, one mode is selected at a significantly higher proportion than the second most visited value of $J$ but if two values of $J$ are visited with frequencies that differ by less than $.1$ then posterior inference should be conducted on both values of $J$. In general, the two values will be $J$ and $J+1$ ($J-1$) and the extra split point seen in one set of samples may not lead to very different inferential conclusions.

# Adding common legends to multiple R plots

Recently I submitted a paper to a journal that required compound figures with multiple plots to be uploaded as one combined single .eps file. This challenge turned out to be daunting, as I was trying to combine 4 figures, each requiring a legend. Doing this in R using par(mfrow=c(2,2)) produced plots that were unreadable due to the shrinked sizes of each plot and the legend any one of the plots. Likewise there were not too many options in terms of combining the plots .eps files into one .eps file online.

The solution: xpd=”NA” in the legend options.

The following code was used to add a common legend to the

> legend(-.35,1.5, legend = c("0", "1","2","3"),
+ lty = c(2,3,4,5),
+ title = "Subgroup",xpd="NA")

It seems like most of the time, that this code produces an error or warning, but does what we want, producing the following graphic.

So we have the desired output, one common legend in the middle of four plots. Now the coordinates here for the legend were tried via trial and error. For some reason, different sets of graphics have different coordinates that correspond to (0,0) so you need to move the legend around until you get what you want. The following two graphics shows some of the trial and errors that it took to get the legend placed correctly.

It usually doesn’t take too long to get it placed right.

I hope this is useful! And if not, it is at least here as a reference for me.

If you liked this post, check out some of my other blog posts.

-Andrew G. Chapple

# Who should win NBA MVP?

By now I’m sure you’ve seen a case for the four big candidates: Russel Westbrook, James Harden, Lebron James and Kawhi Leonard or maybe just a breakdown of who should win. With the voting process ending today at 3PM eastern, I’m going to throw my own hat into the who-should-win sweepstakes.

Even though I reside in Houston, I’m going to try my best to be unbiased here and just look at the numbers.

First I’m going to look at something most analysts haven’t, statistics versus playoff teams. How well do these candidates play against the top competition? I downloaded the data from BasketBallReference.com and looked at the average statistics in those games against playoff teams, each player played at least 37 of these games.

This table should immediately disqualify James Harden based on his negative Real +/- statistics in these games, which measures how the point margin will change over 100 possessions when the player is in the game versus when they aren’t. It incorporates the teams defensive efficiency when that player is on the floor. When Harden was on the floor against playoff teams the average margin was -.3 points less than when he was off the floor. All other MVP candidates have a positive +/- stat against Playoff teams. Obviously since these are means, they are heavily influenced by especially poor performances but the medians tell the same story for James: Harden = 0, Leonard = 1, Lebron =2, Westbrook = 3. Harden’s defensive liability hurts him here. Westbrook had 6 poor games where he had a +/- of > -19.

Now that Harden is out, Lebron also gets the boot after having a losing record 12-15 since the All-Star break. You cannot collapse down the stretch like that.

Now we’re down to Westbrook versus Leonard.

As far as playoff teams, Leonard has a higher win percentage, a higher +/- mean (but not median), a higher average FG % and a lower turnover margin. Westbrook leads Leonard in the three major statistical categories by a pretty large margin in these games.

Obviously we know that Westbrook had the better season in terms of raw statistics throughout the season, averaging a triple double, leading the NBA in scoring and breaking the record for the most triple doubles in a season. It has been mentioned that his teammates helped Westbrook get these records. Leonard is the league’s best two way player on the second best team in the NBA while Westbrook’s Thunder finished 10th overall. But who’s more valuable to their team?

In games against playoff teams, when Westbrook had a +/- that was <0, they won only 5.26 % of those games compared to Leonard’s Spurs winning 31.25 % of their games. When Westbrook played awful and had a +/- < -10, the Thunder did not beat any playoff team (which happened 9 times) where the Spurs won once out of the three times Leonard played this poorly. Looking at the entire season, the Spurs won 34.78% of their games when Leonard had a (-) +/- while the Thunder won 13%.

This indicates that Westbrook’s play is much more important to the Thunder’s success. When he plays poorly, the Thunder win less than when the Spurs when Leonard plays poorly. It’s also worth mentioning that the Spurs won 5/6 games they played without Leonard this year. I’d venture to say that the Spurs are still a playoff team if Leonard is not on that team (you know they’d be better than the Trailblazers) but the Thunder are bottom feeders if Westbrook had left with Durant.

So for me the MVP is Russell Westbrook.

If you enjoyed this post, check out some of my other blogs!

-Andrew G Chapple

# Package SCRSELECT: Using ReturnModel()

Here you can find more details about how to use the SCRSELECT package, particularly the main function ReturnModel(). I will post a more advanced tutorial for users that want to manually do SCRSELECT() or DICTAUG() later here.

The link to the package documentation can be found on CRAN here. This code is related to a recent paper: “Bayesian variable selection for a semi-competing  risks model with three hazards”  which is joint work with Dr. Marina Vannucci, Dr. Peter Thall and Dr. Steven Lin.

This function performs Bayesian variable selection on a semi-competing risks model which gives the posterior probabilities of inclusion for each variable in each hazard model and determines the optimal model based on thresholds determined by the DIC statistic. More details on this can be found in the paper, which will appear in print in August.

The package has 4 main functions:

• SCRSELECT – Performs Stochastic Search Variable Selection (SVSS) on the semi-competing risks model and saves posterior samples to a given location and summaries of the variable selection parameters and MCMC acceptance rates.
•  DICTAUG – takes posterior means from SCRSELECT and performs a grid search to find the optimal thresholds for each hazard variable inclusion.
•  ReturnModel- Uses SCRSELECT and DICTAUG to determine the best final model, then this performs the final MCMC to obtain inference.
•  SCRSELECTRUN – Used in the ReturnModel function

The most useful function here is ReturnModel() as it performs SVSS, finds the optimal final model (i.e. what variables are included) and then returns important summaries of the posterior along with writing the posterior to a desired folder location. I’ll briefly go over the inputs prior to looking at the output and explaining it.

• ReturnModel(Y1,I1,Y2,I2,X,hyper,inc,c,BSVSS,BDIC,Path)
• Y1: Vector of effusion or death time.
• I1: Vector of indicators for a non-terminal event.
• Y2: Vector of death or censoring times.
• I2: Vector of death indicators.
• X: Covariates ~ the last inc columns will not be selected and will be a part of all three hazards for any possible model.
• Hyper: Large vector of hyperparameters ~ see documentation.
• Inc: How many covariates should be excluded from selection.
• c: shrinkage hyperparameter, values between 10-50 are suggested.
• BSVSS: how many iterations to run the SVSS (suggested = 100,000)
• BDIC: how many iterations to run the DIC search (suggested = 10,000)
• Path = where to save posterior samples of final model

I will show what the output from the application in the paper looks like in the program. I used the example code given on CRAN, but ran these chains out for a more reasonable amount of time. I let BSVSS = 10,000 and BDIC  = 1,000 indicating that the variable selection MCMC should be carried out for 10,000 iterations and the DIC-tau_g procedure should be run on 1,000 iterations.  I will set inc=2, meaning that the SVSS will always include the last two variables in each hazard.

First this program runs two different chains through the SVSS procedure and save the relevant results. Progress of each of the two chains is printed every 5000 iterations. After each chain is finished, the important posterior quantities of the combined sample are saved and the marginal posterior probabilities of inclusion are printed for each hazard. Given that this example data is just simulated noise, it’s not surprising here that we don’t see some variables being more important than others. After these values are printed, the DIC-\tau_g proceeds by first searching over (.05,\tau_2,\tau_3) for the optimal DIC. This proceeds by increasing \tau_1 by .05.

Once the grid search finishes, the program outputs the DIC value for the optimal model(s). It prints more than 1 value if two tau vectors produced DIC values that were within 1 of each other. Next the final model variable inclusion is printed where a TRUE indicates that that variable is important in that specific hazard. We can see, for example, that only variable 4 is included in the first hazard.

Next the program takes this final model and performs MCMC to obtain important posterior quantities for inference. The first thing that is printed is the posterior probability that a covariate increases a hazard, which for example, we see that variable 7 (which was not allowed to be selected out) had a posterior probability of a non-terminal event of .73. Likewise, Variable 2 in the hazard of death after a non-terminal event was shown to be prognostic for death (P=.83). The last bit of output from the function is shown below (excuse the output format, I’ll fix that up!):

We are left with the posterior median hazard ratios in the first line of each hazard, followed by the .025 % and .975 % quantiles, respectively. Once again, notice that variables not included in the final model have NA listed in their posterior summaries.

I hope this helps clear up how to use the package! As a note of practical usage, I would suggest that the user pick a value of c and see what those marginal posterior probabilities of inclusion are after the first step. You want to encourage separation so a small value of c that leads to very large marginal posterior probabilities of inclusion will be as unhelpful as large values of c which will lead to very small marginal posterior probabilities of inclusion. Ideally, you want to see some variables with high inclusion probabilities and some with low in each of the three hazards.

I also want to note that while I do not have guides here for the functions SCRSELECT() and DICTAUG(), these are pieces of the ReturnModel() function. An experienced user of ReturnModel() might want to further investigate the posterior distribution of the SVSS procedure, which you would use SCRSELECT() to do as this saves all posterior samples. Then they might save the important quantities for manually using DICTAUG(), however this is not suggested to those who are not already experienced with using ReturnModel().

If you have any questions don’t hesitate to ask! Also if you enjoyed this post, check out some of my other blog posts.

Happy Coding!

-Andrew G Chapple

# Audio Slides: Bayesian Variable Selection for a semi-competing risks model with three hazard functions

This is the link for a 5-minute video summary of my latest paper: Bayesian Variable Selection for a semi-competing risks model with three hazard functions.

http://audioslides.elsevier.com//ViewerLarge.aspx?source=1&doi=10.1016/j.csda.2017.03.002

You can find the link to the paper below (once it’s uploaded, it will be sometime soon).

If you enjoyed this check out some of my other posts!

-Andrew G Chapple

# Interesting Statistics Papers: Hidden Markov Models in Tennis

I’ve been taking a computational biology class where we use Hidden Markov Models to predict things like whether a part of a DNA sequence is a promoter, coding region, polya or non-coding (intergenic) region. You can read a little more about HMMs here. They  also have big applications in finance but the paper I’m discussing here is more related to the biological models.

For the DNA the basic idea is for each of the 4 states= promoter, coding region, polya, intergene they can each emit nucleotides = A,T,G,C. So the HMM takes a DNA sequence ( for example x=AACGTAGCC…) and for each letter tries to predict if it’s a promoter, coding region, polya or intergene region.

I’ve been thinking about how these types models can be applied to sports. For example, can we tell if a team is on a hot streak or slump based on game play or game to game play?

http://xm2vtsdb.ee.surrey.ac.uk/CVSSP/Publications/papers/kolonias-sspr04.pdf

By Kolonias et al. uses hidden markov models to predict who will receive a tennis point based on events that happen within each round. The big idea is what moves can a tennis player make in terms of returning balls or serving that will increase (decrease) their (their opponents) probability of scoring.

They do this by making 5 major states within their HMM: 1st serve, second serve, ace, rally and point scored where the first 4 each have sub markov models.

The main transitions are determined by the rules of tennis as follows. After a serve (1st or second), you can either fault and serve again, score on the serve (ace), have the serve returned to start a rally, or double fault and get a point scored against you. From the first serve state you can transition to any of the other states – 2nd serve, Ace, Rally, Point while from the second serve you can’t transition back to the first serve. Ace and Rally states are always ended by a point scored.

The real wow factor of this model comes in the sub markov models.

Let’s take a look at the graphic for the Rally sub model shown in the paper:

The ball can rally back and forth until an unforced error is made or a point is forced. Surely this could be complicated by adding direction of the shot or how hard they hit it back.

I’m interested in how this type of model works in predicting whether or not a point is scored in basketball. For example how many passes is too many and how many will lead to a high scoring probability. How effective are ball screens followed by a pass? How much does different amounts of spacing effect scoring in several common plays (give and go, hero ball etc)

These are questions that could be answered using this type of model.

Anyways, I thought this was a neat idea and it was nice to see Hidden Markov Models outside of The human genome.

If you liked this check out some of my other posts!

-Andrew G Chapple