# Difficulties in Fitting Mixture Distributions for Survival Data

I’m currently working on improving fits of underlying patient hazard functions when there isn’t much patient information (100-200 patients). That isn’t enough data to fit the highly parameterized structures of Bayesian nonparametric analysis, so we had to look at other approaches. One approach we tried was piecewise exponential models, which I love, but they don’t approximate the hazard too well. While they’re a good tool for drawing inference from data, they’re not accurate enough for decision making, particularly patient decisions. Instead, the wild goose chase I’ve been on recently is fitting patient distributions with a finite mixture of several common survival distributions: Weibull, lognormal, gamma and exponential. The model is the following:

$f(T|\mathbf{w},\mathbf{\theta}) = \sum \limits_{k=1}^{J} w_k f_k(T|\theta_k)$

where $\sum \limits_{k=1}^{J} w_k =1$ and each $f_k(T|\theta_k)$ takes the form of a Weibull, lognormal, gamma or exponential distribution. I’ve taken a Bayesian approach here by assuming $\mathbf{w} \sim Dirichlet(\mathbf{2})$. In general, using $J=6$ (or 7 if you also include an exponential in the mixture) seems to do a decent job [key word: DECENT], but it doesn’t operate the way you would like. You would think that if, say, you generate data directly from a lognormal distribution, the weights $w_k$ that correspond to the lognormal components would go to 1 and the rest would go to 0. However, in actual simulations, this doesn’t always happen. An example of this is a pair of fits to the same simulated data, shown below:
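Before looking at the fits, here is a rough sketch of what the mixture density looks like in code. This is not the code behind the fits in this post. The six components and their parameter values are arbitrary choices of mine for illustration (the split of $J=6$ across families is an assumption), and the function names are hypothetical. The only pieces taken from the model above are the weighted sum and the $Dirichlet(\mathbf{2})$ prior on the weights.

```python
import math
import random

def weibull_pdf(t, shape, scale):
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def lognormal_pdf(t, mu, sigma):
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def gamma_pdf(t, shape, rate):
    # Computed on the log scale to avoid overflow for large shapes.
    log_f = (shape * math.log(rate) + (shape - 1) * math.log(t)
             - rate * t - math.lgamma(shape))
    return math.exp(log_f)

def sample_dirichlet(alpha, rng):
    # Normalised independent Gamma(alpha_k, 1) draws are Dirichlet(alpha).
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# ASSUMPTION: illustrative components only, two per family, so J = 6.
COMPONENTS = [
    (weibull_pdf, (1.5, 2.0)),
    (weibull_pdf, (5.0, 2.0)),
    (lognormal_pdf, (0.0, 0.5)),
    (lognormal_pdf, (1.0, 0.5)),
    (gamma_pdf, (2.0, 0.5)),
    (gamma_pdf, (3.0, 1.0)),
]

def mixture_pdf(t, weights, components=COMPONENTS):
    # f(T | w, theta) = sum_k w_k * f_k(T | theta_k), for t > 0
    return sum(w * pdf(t, *theta) for w, (pdf, theta) in zip(weights, components))

rng = random.Random(0)
# One draw of the weights from the Dirichlet(2, ..., 2) prior.
weights = sample_dirichlet([2.0] * len(COMPONENTS), rng)
```

Calling `mixture_pdf(1.0, weights)` evaluates the prior-draw mixture density at $T=1$. In an actual fit, both `weights` and the component parameters would be updated by the MCMC rather than fixed like this.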

In both of these pictures, the red line represents the true hazard that the patients’ data are drawn from, which corresponds to a 50/50 mixture of a gamma(2,.5) and a weibull(5,2) distribution. The black dotted lines are the MCMC fits after running 10,000 iterations and discarding the first 9,000 as burn-in. The blue vertical lines mark the 80th and 90th percentiles of the random sample, so we only see about 10 of 100 patients after the second blue line and shouldn’t be as concerned about the fit past this point.
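For reference, a true hazard like the red curve can be computed in closed form. The sketch below is my own illustration, and it assumes R-style parameterizations that the post does not spell out: gamma(2,.5) as shape 2 and rate .5, and weibull(5,2) as shape 5 and scale 2. With a different convention the curve would change. The thing to notice is that the hazard of a mixture is the mixture density divided by the mixture survival function, not the average of the two component hazards.

```python
import math

# ASSUMPTION: gamma(2, .5) means shape = 2, rate = 0.5.
def gamma2_pdf(t, rate=0.5):
    # Gamma density with shape 2: rate^2 * t * exp(-rate * t)
    return rate ** 2 * t * math.exp(-rate * t)

def gamma2_sf(t, rate=0.5):
    # Survival function of a shape-2 gamma (Erlang): (1 + rate * t) * exp(-rate * t)
    return (1.0 + rate * t) * math.exp(-rate * t)

# ASSUMPTION: weibull(5, 2) means shape = 5, scale = 2.
def weibull_pdf(t, shape=5.0, scale=2.0):
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def weibull_sf(t, shape=5.0, scale=2.0):
    return math.exp(-((t / scale) ** shape))

def mixture_hazard(t, w=0.5):
    # h(t) = f(t) / S(t), with f and S both taken from the mixture.
    f = w * gamma2_pdf(t) + (1.0 - w) * weibull_pdf(t)
    s = w * gamma2_sf(t) + (1.0 - w) * weibull_sf(t)
    return f / s

# Evaluate the hazard on a coarse grid of follow-up times.
grid = [0.5 * i for i in range(1, 21)]
hazard = [mixture_hazard(t) for t in grid]
```

Under these assumed parameterizations, the Weibull component has essentially died out at late times, so the mixture hazard there is driven by the gamma component alone.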

Here you can see the issue. The left picture seems to do a pretty good job of fitting this mixture (at least up until the 90th percentile), and in fact it picks up the weights well: the Weibull and gamma weights are each around .45-.55 in the MCMC.

However, the right picture shows the failure and curse of these mixture models.

The fitted weights pick up ONLY a lognormal component, by chance, and the MCMC continues on these parameters. After a few iterations with the lognormal weight high, the MCMC moves the parameters $\mathbf{\theta_k}$ such that it will never assign less weight to the lognormal component.

This causes the MCMC to get “stuck” in a local mode that it will never jump out of, so the fit will always be poor. This is one reason people turn to Dirichlet processes instead of these finite mixtures for modeling.

If you enjoyed this post, take a look at some of my other Adventures in Statistics.

-Andrew G Chapple