bootstrapEM_Class: Bootstrap analysis of EM clustering

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/bootstrapLatencyClustering.R

Description

Performs EM classification with mclust, then estimates uncertainty via bootstrap resampling.

The classification model is a Gaussian mixture model with two components and equal variance. Bootstrap resampling is stratified such that the proportion of short and long latency in each bootstrap remains constant.

Usage

1
bootstrapEM_Class(latency, its = 1000, stratification = "initial", modelNames = "E")

Arguments

latency

A vector of average latency values.

its

Number of bootstrap iterations.

stratification

By default ("initial"), the proportion of SL / LL in each bootstrap is fixed, based on the original classification (the prior). This helps stabilize the bootstrap fit if the distribution is skewed such that there are relatively few SL or LL samples. We want to avoid producing bootstraps that contain few or no samples from one of the groups.

If "probabilistic", we resample the original classification probabilistically. This allows us to vary the stratification according to uncertainty in the prior classification.

For example, consider the subjects (A,B,C,D,E) with prior probabilities (0, 0.05, 0.45, 0.95, 1). This contains 3 SL and 2 LL. But if we resample the groups using these probabilities, we will classify A as LL with probability 0, B with probability 0.05, C with probability 0.45, and so on.

If "none", the stratification is disabled and the bootstraps are produced by randomly sampling the original data with replacement, without regard to the initial classification.

modelNames

Passed to Mclust to allow or disallow different models. The default "E" means that the Gaussian distributions inside a particular model will have equal standard deviation. The SD will be different for each bootstrap, and the value for each bootstrap is recorded in the returned vector em_std. If modelNames = "V", then the SD of each Gaussian is variable, and em_std contains two values for each bootstrap. If modelNames = c("E", "V"), then Mclust will choose the appropriate model for each bootstrap.

Equal variance models imply a monotonically increasing probablity of LL classification with increasing latency. With unequal variance, the tails of the distributions might be quite different, lead to nonsensical results such as increasing probability of LL at very low latencies. Be careful about interpreting the results if you allow unequal variance.

Details

This function is used to classify average latency scores into two groups, the "short latency" (SL) with low stress resilience and the "long latency" (LL) with high stress resilience.

The bootstrap resampling is done by sampling, with replacement, from the SL and LL groups defined by the initial call to Mclust on the original data. The proportion of SL and LL in each bootstrap remains fixed, unless restratify=TRUE.

The model for each bootstrap is fixed as a mixture of two Gaussian distributions with equal variance. Mclust is run on the resampled data, which uses hierarchical clustering for initialization followed by EM.

Value

A list with the following components

bootProbLL

The probability that a subject is classified as LL, defined as the number of times this subject was classified LL over all bootstraps. This is distinct from the EM probability derived from the Gaussian mixture model.

bootThreshLL

The smallest integer value of the average latency that would be classified as LL using the model from each bootstrap.

clusters

The object returned from the intial call to Mclust on the entire data set. This contains the baseline classification that used to initialize the bootstraps.

em_mean

A matrix containing the cluster means from all bootstraps.

em_mix

A matrix containing the Gaussian mixing parameters from all bootstraps.

em_std

A matrix containing the estimated variance of the Gaussian distributions from all bootstraps.

its

The number of bootstraps.

latency

The latency vector passed to the function.

priorProbLL

The probability of LL classification from Mclust on the original data. This is used to initialize the bootstraps and optionally to restratify the data before each bootstrap.

r_boot

A matrix containing the EM probability for each subject in the input latency vector, from all bootstraps.

r_curve_boot

A matrix containing the EM probability for all integers 1:900, from all bootstraps. Used for deriving percentiles for the classification across the spectrum of possible latency values (0-900 seconds).

The ordering of the per-subject values (such as bootProbLL) is the same as in the latency vector passed to bootstrapClassification.

Author(s)

Philip A Cook <cookpa@pennmedicine.upenn.edu>

See Also

plotBootstrapEM_Class Mclust

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
##---- Should be DIRECTLY executable !! ----
##-- ==>  Define data, use random,
##--	or do  help(data=index)  for the standard data sets.

set.seed(20140123)

sl <- rnorm(60, 200, 100)
sl[which(sl < 0)] <- 0

ll <- rnorm(15, 600, 100)
ll[which(ll > 900)] <- 900

boot = bootstrapEM_Class(c(sl, ll), its = 100)

## Not run
## Not run: plotBootstrapEM_Class(boot)

cookpa/socialdefeat documentation built on May 17, 2019, 10:12 p.m.