bootstrapClusterClass: Bootstrap analysis of clustering algorithms
In cookpa/socialdefeat: Methods for analyzing social defeat data


View source: R/bootstrapLatencyClustering.R

Performs classification with kmeans, hc or pam, then estimates uncertainty via bootstrap resampling.

Bootstrap resampling is stratified such that the proportion of short and long latency in each bootstrap remains constant.

bootstrapClusterClass(latency, its = 1000, stratification = "initial", algorithm = c("kmeans", "hc", "pam"), emModelNames = "E")

`latency`	A vector of average latency values.
`its`	Number of bootstrap iterations.
`stratification`	Controls how the bootstrap samples are stratified. If "initial" (the default), the proportion of SL / LL in each bootstrap is fixed, based on the original classification (the prior). This stabilizes the bootstrap fit when the distribution is skewed and there are relatively few SL or LL samples, because it avoids bootstraps that contain few or no samples from one of the groups. If "probabilistic", the data are classified with `Mclust` and the original classification is resampled probabilistically, so that the stratification varies according to the uncertainty in the prior classification. For example, consider the subjects (A, B, C, D, E) with prior probabilities of LL membership (0, 0.05, 0.45, 0.95, 1). The most likely classification contains 3 SL and 2 LL, but if we resample the groups using these probabilities, A is classified as LL with probability 0, B with probability 0.05, C with probability 0.45, and so on. If "none", stratification is disabled and the bootstraps are produced by randomly sampling the original data with replacement, without regard to the initial classification.
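The "probabilistic" option can be illustrated with the five-subject example above. The sketch below is one reading of that description, not the package's own code: it assumes each subject is relabelled by an independent Bernoulli draw with its prior probability of being LL. The names `prior_ll` and `draw_labels` are hypothetical.

```r
# Prior probabilities of LL membership for subjects A to E (from the example).
prior_ll <- c(A = 0, B = 0.05, C = 0.45, D = 0.95, E = 1)

set.seed(1)

# Draw one SL / LL labelling: each subject is LL with its own prior
# probability (assumed here to be an independent Bernoulli draw).
draw_labels <- function(p) {
  ifelse(rbinom(length(p), size = 1, prob = p) == 1, "LL", "SL")
}

# Repeat for 1000 bootstraps: rows are bootstraps, columns are subjects.
labels <- t(replicate(1000, draw_labels(prior_ll)))
colnames(labels) <- names(prior_ll)

# Fraction of draws in which each subject was labelled LL.
colMeans(labels == "LL")

# The number of LL subjects now varies from draw to draw.
table(rowSums(labels == "LL"))
```

Subjects A and E never change group, while C moves between groups in a little under half of the draws, so the SL / LL proportions differ across bootstraps instead of staying fixed.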
`algorithm`	The clustering algorithm. By default, calls `kmeans` from the `stats` package. Alternatively, `pam` (`cluster`) or `hc` (`mclust`) can be used.
`emModelNames`	Only used if `stratification = "probabilistic"`. Passed to `Mclust` to allow or disallow different models. The default "E" means that the Gaussian components of the mixture model have equal standard deviation. If `emModelNames = "V"`, the SD of each Gaussian is variable. If `emModelNames = c("E", "V")`, `Mclust` chooses the appropriate model.

This function is used to classify average latency scores into two groups: the "short latency" (SL) group, with low stress resilience, and the "long latency" (LL) group, with high stress resilience.

By default, the bootstrap resampling is done by sampling, with replacement, from the SL and LL groups defined by the initial call to the clustering algorithm on the original data. The proportion of SL and LL in each bootstrap remains fixed unless `stratification = "probabilistic"`, in which case an EM algorithm is used to define prior probabilities, and the stratification is recomputed by sampling these priors at each bootstrap.
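A minimal sketch of one stratified bootstrap draw, using only `kmeans` from `stats`, might look like the following. The simulated latencies and the helper `resample_idx` are illustrative assumptions, and the package's internal implementation may differ.

```r
set.seed(42)

# Simulated latencies (illustrative values only): 12 short, 4 long.
latency <- c(rnorm(12, 200, 50), rnorm(4, 600, 50))

# Initial two-group classification on the original data.
fit <- kmeans(latency, centers = 2, nstart = 10)

# Treat the cluster with the larger center as LL, the other as SL.
ll_cluster <- which.max(fit$centers[, 1])
is_ll <- fit$cluster == ll_cluster

# Resample indices with replacement within one group. Indexing through
# sample.int() avoids the surprising behavior of sample() on a single value.
resample_idx <- function(idx) idx[sample.int(length(idx), replace = TRUE)]

# One stratified bootstrap: SL and LL are resampled separately, so the
# group sizes match the original classification.
boot_idx <- c(resample_idx(which(!is_ll)), resample_idx(which(is_ll)))
boot_latency <- latency[boot_idx]

c(original_LL = sum(is_ll), bootstrap_LL = sum(is_ll[boot_idx]))
```

Because each group is resampled on its own, a bootstrap can never lose the smaller group entirely, which is the failure mode that unstratified resampling risks with skewed data.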

The classification boundary for each bootstrap is the mid-point between the two centroids / means. This boundary is used to classify the original latency data.
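For one-dimensional data with two centers, assigning each point to its nearest center is the same as thresholding at the mid-point, which is why a single boundary value is enough. A small sketch with made-up centers and latencies follows; assigning a value exactly on the boundary to SL is an assumption here.

```r
# Hypothetical cluster means from one bootstrap fit (illustrative values).
centers <- c(SL = 210, LL = 590)

# The boundary is the mid-point between the two centers.
boundary <- mean(centers)

# Classify the original latencies against the boundary: 1 (SL), 2 (LL).
latency <- c(50, 180, 320, 410, 640, 900)
clusters <- ifelse(latency > boundary, 2L, 1L)

data.frame(latency, clusters)
```

With these values the boundary is 400, so the first three latencies are classified as SL and the last three as LL.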

A list with the following components

`bootProbLL`	The probability that a subject is classified as LL, defined as the fraction of bootstraps in which this subject was classified LL.
`its`	The number of bootstrap iterations.
`latency`	The original latency data used for the original classification and resampled for the bootstrap.
`boundary`	Approximate cluster boundary between SL and LL means (from kmeans or hc) or medoids (from pam) computed on the original data.
`boundary_boot`	Approximate boundary point from all bootstraps.
`centers`	Cluster centers or medoids fit to the original data.
`class_boot`	A matrix containing the classification of sample data at each bootstrap.
`clusters`	The classification of the original data: 1 (SL), 2 (LL).

The ordering of the per-subject values (such as bootProbLL) is the same as in the latency vector.
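The relationship between `class_boot` and `bootProbLL` can be sketched as below. The matrix is hypothetical, and its orientation (one row per bootstrap, one column per subject) is an assumption, not something the description above confirms.

```r
# Hypothetical classification matrix: one row per bootstrap, one column per
# subject, coded 1 (SL) or 2 (LL).
class_boot <- rbind(
  c(1, 1, 2, 2),
  c(1, 2, 2, 2),
  c(1, 1, 1, 2),
  c(1, 1, 2, 2)
)

# Fraction of bootstraps in which each subject was classified LL.
boot_prob_ll <- colMeans(class_boot == 2)
boot_prob_ll
```

Here the four subjects get probabilities 0, 0.25, 0.75 and 1, in the same order as the columns, which mirrors how the per-subject values follow the order of the latency vector.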

The cluster boundaries are defined as halfway between the two cluster means at each bootstrap.

Philip A Cook <cookpa@pennmedicine.upenn.edu>

`plotBootstrapClusterClass`, `kmeans`, `pam`, `hc`


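library(socialdefeat)

set.seed(20140123)

# Simulated short-latency group, truncated below at 0
sl <- rnorm(60, 200, 100)
sl[sl < 0] <- 0

# Simulated long-latency group, truncated above at 900
ll <- rnorm(15, 600, 100)
ll[ll > 900] <- 900

# Classify with pam and estimate uncertainty from 100 bootstraps
boot <- bootstrapClusterClass(c(sl, ll), its = 100, algorithm = "pam")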


## Not run: plotBootstrapClusterClass(boot)