bootstrapEM_Class: Bootstrap analysis of EM clustering
In cookpa/socialdefeat: Methods for analyzing social defeat data

View source: R/bootstrapLatencyClustering.R

Performs EM classification with mclust, then estimates uncertainty via bootstrap resampling.

By default, the classification model is a two-component Gaussian mixture with equal variance, and bootstrap resampling is stratified so that the proportion of short-latency (SL) and long-latency (LL) subjects stays constant across bootstraps.

bootstrapEM_Class(latency, its = 1000, stratification = "initial", modelNames = "E")

`latency`	A vector of average latency values.
`its`	Number of bootstrap iterations.
`stratification`	By default ("initial"), the proportion of SL / LL in each bootstrap is fixed, based on the original classification (the prior). This helps stabilize the bootstrap fit if the distribution is skewed such that there are relatively few SL or LL samples. We want to avoid producing bootstraps that contain few or no samples from one of the groups. If "probabilistic", we resample the original classification probabilistically. This allows us to vary the stratification according to uncertainty in the prior classification. For example, consider the subjects (A,B,C,D,E) with prior probabilities (0, 0.05, 0.45, 0.95, 1). This contains 3 SL and 2 LL. But if we resample the groups using these probabilities, we will classify A as LL with probability 0, B with probability 0.05, C with probability 0.45, and so on. If "none", the stratification is disabled and the bootstraps are produced by randomly sampling the original data with replacement, without regard to the initial classification.
`modelNames`	Passed to `Mclust` to allow or disallow different models. The default "E" means that the Gaussian distributions within a given model have equal standard deviation (SD). The SD differs between bootstraps, and the value for each bootstrap is recorded in the returned `em_std`. If `modelNames = "V"`, the SD of each Gaussian is free to vary, and `em_std` contains two values for each bootstrap. If `modelNames = c("E", "V")`, `Mclust` chooses the appropriate model for each bootstrap. Equal variance models imply a monotonically increasing probability of LL classification with increasing latency. With unequal variance, the tails of the distributions might be quite different, leading to nonsensical results such as an increasing probability of LL at very low latencies. Be careful about interpreting the results if you allow unequal variance.
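The caution about unequal variance can be checked numerically. The following is a minimal sketch in base R, not part of the package: `post_ll` is a hypothetical helper, and the mixing proportions, means and SDs are made-up illustrative values rather than fitted ones.

```r
# Posterior probability of the LL component for a two-component
# Gaussian mixture, computed directly with dnorm().
# post_ll is a hypothetical helper, not a function from this package.
post_ll <- function(x, mix, mu, sd) {
  d_sl <- mix[1] * dnorm(x, mu[1], sd[1])
  d_ll <- mix[2] * dnorm(x, mu[2], sd[2])
  d_ll / (d_sl + d_ll)
}

x <- 0:900

# Illustrative parameters only (assumed, not estimated from data).
mix <- c(0.8, 0.2)
mu  <- c(200, 600)

p_equal   <- post_ll(x, mix, mu, sd = c(100, 100))
p_unequal <- post_ll(x, mix, mu, sd = c(60, 200))

# Equal variance: the LL probability never decreases with latency.
monotone_equal <- all(diff(p_equal) >= 0)

# Unequal variance: the wider LL component dominates the far left tail,
# so the LL probability is higher at 0 seconds than at moderate latencies.
monotone_unequal <- all(diff(p_unequal) >= 0)
```

With these values `monotone_equal` is `TRUE` and `monotone_unequal` is `FALSE`: under equal variance the log-odds of LL are linear in latency, whereas under unequal variance they are quadratic and can turn upward again at the low end.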

This function is used to classify average latency scores into two groups: the "short latency" (SL) group, with low stress resilience, and the "long latency" (LL) group, with high stress resilience.

The bootstrap resampling is done by sampling, with replacement, from the SL and LL groups defined by the initial call to Mclust on the original data. The proportion of SL and LL in each bootstrap remains fixed unless `stratification` is set to "probabilistic" or "none".
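The three stratification modes can be sketched in base R as follows. This is an illustration of the idea under stated assumptions, not the package's implementation: `resample_indices` is a hypothetical helper, the latency values are invented, the prior probabilities are the five-subject example from the `stratification` argument, and a subject is assumed to count as LL when its prior probability exceeds 0.5.

```r
set.seed(1)

# Hypothetical latencies for subjects A-E, with the prior LL
# probabilities from the five-subject example.
latency  <- c(A = 120, B = 210, C = 400, D = 610, E = 750)
prior_ll <- c(0, 0.05, 0.45, 0.95, 1)

# Return the indices of one bootstrap sample.
resample_indices <- function(prior_ll, stratification = "initial") {
  n <- length(prior_ll)
  if (stratification == "none") {
    # Plain bootstrap: ignore the initial classification.
    return(sample.int(n, n, replace = TRUE))
  }
  is_ll <- if (stratification == "probabilistic") {
    # Redraw each subject's group from its prior probability.
    runif(n) < prior_ll
  } else {
    # "initial": keep the original grouping (assumed cutoff of 0.5).
    prior_ll > 0.5
  }
  sl_idx <- which(!is_ll)
  ll_idx <- which(is_ll)
  # Resample within each group so the group sizes are preserved.
  c(sl_idx[sample.int(length(sl_idx), replace = TRUE)],
    ll_idx[sample.int(length(ll_idx), replace = TRUE)])
}

boot_latency <- latency[resample_indices(prior_ll, "initial")]
```

Indexing with `sample.int` rather than calling `sample` on the index vector avoids the surprising behaviour of `sample(x)` when `x` has length one, which matters when a group contains a single subject.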

By default, the model for each bootstrap is fixed as a mixture of two Gaussian distributions with equal variance. Mclust is run on the resampled data; it uses hierarchical clustering for initialization, followed by EM.
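A single bootstrap fit might look roughly like the sketch below. It assumes the mclust package is installed and uses an unstratified resample for brevity; the variable names echo the returned components but are chosen here for illustration, and the rule that the component with the larger mean is LL is an assumption rather than something stated in this documentation.

```r
set.seed(20140123)

# Simulated latencies, truncated to the 0-900 second range.
latency <- c(pmax(rnorm(60, 200, 100), 0), pmin(rnorm(15, 600, 100), 900))

if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)  # attach so Mclust() can find its helper functions

  # One unstratified bootstrap sample, for brevity.
  boot_latency <- sample(latency, replace = TRUE)

  # Two components (G = 2) with equal variance ("E").
  fit <- Mclust(boot_latency, G = 2, modelNames = "E")

  # Assumption: the component with the larger mean is the LL group.
  ll <- which.max(fit$parameters$mean)

  em_mean <- fit$parameters$mean
  em_mix  <- fit$parameters$pro
  em_sd   <- sqrt(fit$parameters$variance$sigmasq)

  # EM probability of LL for each original subject under this fit.
  r <- predict(fit, newdata = latency)$z[, ll]
}
```

Repeating this for every bootstrap and stacking the results would give matrices analogous to `em_mean`, `em_mix` and `r_boot`.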

A list with the following components

`bootProbLL`	The probability that a subject is classified as LL, defined as the proportion of bootstraps in which this subject was classified LL. This is distinct from the EM probability derived from the Gaussian mixture model.
`bootThreshLL`	The smallest integer value of the average latency that would be classified as LL using the model from each bootstrap.
`clusters`	The object returned from the initial call to Mclust on the entire data set. This contains the baseline classification that is used to initialize the bootstraps.
`em_mean`	A matrix containing the cluster means from all bootstraps.
`em_mix`	A matrix containing the Gaussian mixing parameters from all bootstraps.
`em_std`	A matrix containing the estimated standard deviation of the Gaussian distributions from all bootstraps.
`its`	The number of bootstraps.
`latency`	The latency vector passed to the function.
`priorProbLL`	The probability of LL classification from Mclust on the original data. This is used to initialize the bootstraps and optionally to restratify the data before each bootstrap.
`r_boot`	A matrix containing the EM probability for each subject in the input latency vector, from all bootstraps.
`r_curve_boot`	A matrix containing the EM probability for all integers 1:900, from all bootstraps. Used for deriving percentiles for the classification across the spectrum of possible latency values (0-900 seconds).
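The relationship between these components can be illustrated with simulated data. The sketch below makes several assumptions that this documentation does not confirm: rows index bootstraps and columns index latency values or subjects, a subject counts as LL when its EM probability exceeds 0.5, and a logistic curve with a randomly shifted midpoint stands in for each bootstrap's fitted model.

```r
set.seed(42)
its  <- 200
grid <- 1:900

# Stand-in for per-bootstrap fits: logistic curves whose midpoint varies.
midpoints <- rnorm(its, mean = 400, sd = 20)
r_curve_boot <- t(sapply(midpoints, function(m) plogis((grid - m) / 25)))

# Smallest integer latency classified as LL in each bootstrap
# (assumed rule: EM probability greater than 0.5).
bootThreshLL <- apply(r_curve_boot, 1, function(p) grid[which(p > 0.5)[1]])

# Pointwise 2.5%, 50% and 97.5% percentiles of the curve across bootstraps.
curve_ci <- apply(r_curve_boot, 2, quantile, probs = c(0.025, 0.5, 0.975))

# Four hypothetical subjects; because grid is 1:900, an integer latency
# can be used directly as a column index.
latency <- c(150, 390, 410, 650)
r_boot  <- r_curve_boot[, latency]

# Proportion of bootstraps in which each subject is classified LL.
bootProbLL <- colMeans(r_boot > 0.5)
```

Here `bootProbLL` counts hard classifications across bootstraps, whereas the columns of `r_boot` hold the soft EM probabilities, which is the distinction drawn in the description of `bootProbLL`.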

The ordering of the per-subject values (such as bootProbLL) is the same as in the latency vector passed to bootstrapEM_Class.

Philip A Cook <cookpa@pennmedicine.upenn.edu>

plotBootstrapEM_Class, Mclust

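set.seed(20140123)

# Simulate 60 short-latency subjects, truncated below at 0 seconds
sl <- rnorm(60, 200, 100)
sl[sl < 0] <- 0

# Simulate 15 long-latency subjects, truncated above at 900 seconds
ll <- rnorm(15, 600, 100)
ll[ll > 900] <- 900

# Classify the pooled latencies and run 100 bootstrap iterations
boot <- bootstrapEM_Class(c(sl, ll), its = 100)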

## Not run: plotBootstrapEM_Class(boot)