It is said that a distribution $f(x)$ is a mixture $k$ distributions component: $f_1(x), ..., f_k(x)$ if:
$$f(x) = \sum_{i=1}^k \pi_i f_i(x)$$
where $\pi_i$ are the so called mixing weights, $0 \le \pi_i \le 1$, and $\pi_1 + ... + \pi_k = 1$. More information about mixture distribution can be read in Wikipedia.
Herein, we will show how to fit numerical data to a mixture of probability distributions model using usefr R package.
To generate from a mixture distribution the function
mixtdistr will
be used.
library(usefr) set.seed(3) # set a seed for random generation # ========= A mixture of three distributions ========= phi = c(3/10, 7/10) # Mixture proportions # --------------------------------------------------------- # === Named vector of the corresponding distribution function parameters # must be provided args <- list(gamma = c(shape = 2, scale = 0.1), weibull = c(shape = 3, scale = 0.5)) # ------------------------------------------------------------ # ===== Sampling from the specified mixture distribution ==== X <- rmixtdistr(n = 1e5, phi = phi , arg = args) X <- na.omit(X) X[1:20]
The graphics for the simulated dataset and the corresponding theoretical mixture distribution. The density curve is drawn applying function dmixtdistr:
hist(X, 90, freq = FALSE, las = 1, family = "serif", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "cyan1", border = "deepskyblue", xlim = c(0, 1.5)) x1 <- seq(-4, 1.5, by = 0.001) lines(x1, dmixtdistr(x1, phi = phi, arg = args), col = "red")
The nonlinear fit of this dataset is NOT straightforward! Herein, we used function fitMixDist
FIT <- fitMixDist(X, args = list(gamma = c(shape = NULL, scale = NULL), weibull = c(shape = NULL, scale = NULL)), npoints = 50, usepoints = 1000) summary(FIT$fit)
It seems to be that, depending on the OS, numerical algorithm variant, or the computer processor, the Cholesky factorization of a real symmetric positive-definite square matrix can fail (see for example: the same example at: genomaths.com).
Fortunately, to bypass the numerical issue, we can apply a numerical trick. In the current case, Gamma and Weibull distribution of this data are not quite different. So, firstly, we can try the fitting to a mixture of Weibull distributions and to get some guessing for starting parameter values for each distribution.
FIT <- fitMixDist(X, args = list(weibull = c(shape = NULL, scale = NULL), weibull = c(shape = NULL, scale = NULL)), npoints = 50, usepoints = 1000) summary(FIT$fit)
Next, the estimated parameter values are supplied as starting parameter values for the mixture of Gamma and Weibull distributions, and the numerical problem is solved:
FIT <- fitMixDist(X, args = list(gamma = c(shape = 1.55, scale = 0.31), weibull = c(shape = 3.4, scale = 0.52)), npoints = 50, usepoints = 1000) summary(FIT$fit)
Function dmixtdistr is applied to draw the density curve:
hist(X, 90, freq = FALSE, las = 1, family = "serif", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "seagreen1", border = "deepskyblue", xlim = c(0, 1.5), cex.lab = 1.2, main = "") x1 <- seq(-4, 10, by = 0.001) lines(x1, dmixtdistr(x1, phi = FIT$phi, arg = FIT$args), col = "red") mtext("Histogram of Gamma & Weibull Mixture Distributions.", cex = 1.4, font = 3, family = "serif")
A bootstrap goodness-of-fit (GOF) test is performed with function mcgoftest.
The parameter values are taken from the previous fitted mixture distribution model. Notice the particular way to set up the list of parameters. The Null hypothesis is that the data set follows a mixture of Gamma and Weibull distributions with the estimated parameter values.
pars <- c(list(phi = FIT$phi), arg = list(FIT$args)) mcgoftest(varobj = X, distr = "mixtdistr", pars = pars, num.sampl = 999, sample.size = 99999, stat = "chisq", num.cores = 4, breaks = 200, seed = 123)
set.seed(3) # set a seed for random generation # ========= A mixture of three distributions ========= phi = c(3/10, 7/10) # Mixture proportions # ------------------------------------------------------------ # === Named vector of the corresponding distribution function parameters # must be provided args <- list(gamma = c(shape = 2, scale = 0.1), weibull = c(shape = 2.1, scale = 0.12)) # ------------------------------------------------------------ # ===== Sampling from the specified mixture distribution ==== X <- rmixtdistr(n = 1e5, phi = phi , arg = args) # ---------------------- Histogram --------------------------- hist(X, 90, freq = FALSE, las = 1, family = "serif", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "seagreen1", border = "deepskyblue", xlim = c(0, 1.5), cex.lab = 1.2, main = "") x1 <- seq(-4, 10, by = 0.001) lines(x1, dmixtdistr(x1, phi = phi, arg = args), col = "red") mtext("Histogram of Gamma & Weibull Mixture Distributions.", cex = 1.4, font = 3, family = "serif")
Next, a search of the best single probability distribution model is accomplished with function fitCDF.
The generalized Gamma distribution model seems to be the best fitted model, followed by the Gamma model.
cdfp <- fitCDF(X, distNames = c("Gamma", "Weibull", "3P Weibull", "Generalized 3P Gamma"), xlabel = "My Nice Variable Label", plot = TRUE, plot.num = 2, font.lab = 3, font = 2, font.axis = 2, family = "serif", cex.lab = 1.3, cex.axis = 1.3 )
Clearly, without the previous knowledge about the true CDF, the graphics cast a suspicion about how good the best fitted model is. Althout the Q-Q plot is not good, the cross-validation correlation coefficient R (R.Cross.val) support a strong average of cross-validation predictive power of the generalized Gamma distribution model:
cdfp$gof
Also, the GoF testing based on the Monte Carlos Anderson–Darling statistic does not reject the model.
mcgoftest(varobj = X, distr = "ggamma", pars = cdfp$bestfit$par, num.sampl = 999, stat = "ks", sample.size = 99999, num.cores = 4)
As before the Weibull distribution model seems to be the best mixture model
FIT1 <- fitMixDist(X, args = list(weibull = c(shape = NULL, scale = NULL), weibull = c(shape = NULL, scale = NULL)), npoints = 100, usepoints = 2000) summary(FIT$fit)
The mixture distribution model using the Gamma as one of marginal distributions fails.
FIT2 <- fitMixDist(X, args = list(gamma = c(shape = NULL, scale = NULL), weibull = c(shape = NULL, scale = NULL)), npoints = 100, usepoints = 2000) summary(FIT2$fit)
Next, the estimated parameter values from the Weibull mixture distribution model are applied as starting parameter values to get the best fitting model for a Gamma-Weibull mixture distribution model.
FIT2 <- fitMixDist(X, args = list(gamma = FIT1$fit$par[1:2], weibull = FIT1$fit$par[1:2]), npoints = 100, usepoints = 2000) summary(FIT2$fit)
par(mfcol = c(2, 2)) xl <- c(min(X, na.rm = TRUE), max(X, na.rm = TRUE)) n = 1e5 ### --------------------------------------------------------------------- pars <- cdfp$bestfit$par set.seed(3) # set a seed for random generation xl <- c(min(X, na.rm = TRUE), max(X, na.rm = TRUE)) q = rggamma(n = n, alpha = pars[1], scale = pars[2], psi = pars[3]) qqplot(x = q, y = X, las = 1, family = "serif", bty="n", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "red", cex.lab = 1.2, main = expression("Q-Q plot for the Generalized Gamma distribution model"), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", xlim = xl, ylim = xl) qqline(y = X, distribution = function(p) qggamma(p, alpha = pars[1], scale = pars[2], psi = pars[3]), probs = c(0.1, 0.6), col = 2) ### --------------------------------------------------------------------- q = rmixtdistr(n = n, phi = FIT1$phi , arg = FIT1$args) qqplot(x = q, y = X, las = 1, family = "serif", bty="n", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "red", cex.lab = 1.2, main = expression("Q-Q plot for the Weibull mixture distribution model"), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", xlim = xl, ylim = xl) qqline(y = X, distribution = function(p) qmixtdistr(p, phi = FIT1$phi , arg = FIT1$args), probs = c(0.1, 0.6), col = 2) ### --------------------------------------------------------------------- set.seed(3) # set a seed for random generation pars = cdfp$fit$Gamma$par q = rgamma(n = n, shape = pars[1], scale = pars[2]) qqplot(x = q, y = X, las = 1, family = "serif", bty="n", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "red", cex.lab = 1.2, main = expression("Q-Q plot for the Gamma distribution model"), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", xlim = xl, ylim = xl) qqline(y = X, distribution = function(p) qgamma(p, shape = pars[1], scale = pars[2]), probs = c(0.1, 0.6), col = 2) ### --------------------------------------------------------------------- q = rmixtdistr(n = n, phi = FIT2$phi , arg = FIT2$args) qqplot(x = q, y = X, las = 1, family = "serif", bty="n", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, family = "serif", col = "red", cex.lab = 1.2, main = expression("Q-Q plot for the Gamma-Weibull mixture distribution model"), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", xlim = xl, ylim = xl) qqline(y = X, distribution = function(p) qmixtdistr(p, phi = FIT2$phi , arg = FIT2$args), probs = c(0.1, 0.6), col = 2)
The Q-Q plots suggest that the Gamma-Weibull mixture distribution model is the best fitted model.
The Kolmogorov-Smirno test does not reject the Null hypothesis about that the X follows Gamma-Weibull mixture distribution model.
set.seed(1) pars <- c(list(phi = FIT2$phi), arg = list(FIT2$args)) mcgoftest(varobj = X, distr = "mixtdistr", pars = pars, num.sampl = 999, sample.size = 99999, stat = "ks", num.cores = 4, seed = 123)
Although the Monte Carlo $mc_p.value = 0.316$ is lower than the value estimated for the generalized gamma model ($mc_p.value = 1$), the $KS.stat.p.value = 0.289$ of the simple Kolmogorov-Smirno test support the mixture model, which for case of the generalized gamma model ($KS.stat.p.value = 0$) is reject.
To identify the best fitted probability distribution model for an experimental data set can be a laborious task. Usually, researchers move forward with the first single CDF model that is supported by a Goodness-of-Fit tests. However, after digging deep in the structure of the experimental data, it can turn out that the data would come from a stratified population, better modelled by a mixture of probability distributions.
Here is the output of sessionInfo()
on the system on which this document was
compiled running pandoc r rmarkdown::pandoc_version()
:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.