bootstrapClusterClass: Bootstrap analysis of clustering algorithms


View source: R/bootstrapLatencyClustering.R

Description

Performs classification with kmeans, hc or pam, then estimates uncertainty via bootstrap resampling.

By default, bootstrap resampling is stratified so that the proportion of short latency (SL) and long latency (LL) samples in each bootstrap remains constant.

Usage

bootstrapClusterClass(latency, its = 1000, stratification = "initial",
                      algorithm = c("kmeans", "hc", "pam"), emModelNames = "E")

Arguments

latency

A vector of average latency values.

its

Number of bootstrap iterations.

stratification

By default ("initial"), the proportion of SL / LL in each bootstrap is fixed, based on the original classification (the prior). This helps stabilize the bootstrap fit if the distribution is skewed such that there are relatively few SL or LL samples. We want to avoid producing bootstraps that contain few or no samples from one of the groups.

If "probabilistic", classify the data with Mclust and resample the original classification probabilistically. This allows us to vary the stratification according to uncertainty in the prior classification.

For example, consider the subjects (A, B, C, D, E) with prior probabilities of being LL of (0, 0.05, 0.45, 0.95, 1). A hard classification of these subjects gives 3 SL and 2 LL. If we instead resample the group labels using these probabilities, A is classified as LL with probability 0, B with probability 0.05, C with probability 0.45, and so on, so the SL / LL split can vary from one bootstrap to the next.

If "none", the stratification is disabled and the bootstraps are produced by randomly sampling the original data with replacement, without regard to the initial classification.

algorithm

The clustering algorithm. By default, calls kmeans from the stats package. Alternatively, pam (cluster) or hc (mclust) can be used.

emModelNames

Only used if stratification = "probabilistic". Passed to Mclust to allow or disallow different models. The default "E" means that the Gaussian distributions inside a particular model will have equal standard deviation. If emModelNames = "V", then the SD of each Gaussian is variable. If emModelNames = c("E", "V"), then Mclust will choose the appropriate model.
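The probabilistic resampling of group labels described for the stratification argument can be sketched in base R as follows. This is an illustration only, not the package's internal code: the subject names and probabilities are the ones from the example above, and the variable names (probLL, isLL, freqLL) are made up for this sketch.

```r
# Illustrative sketch; not taken from the package source.
set.seed(1)

# Prior probability of being LL for subjects A to E (from the example above).
probLL <- c(A = 0, B = 0.05, C = 0.45, D = 0.95, E = 1)

# One draw of group labels: each subject is LL with its own probability.
isLL <- rbinom(length(probLL), size = 1, prob = probLL) == 1
labels <- ifelse(isLL, "LL", "SL")
names(labels) <- names(probLL)

# Repeating the draw many times shows how often each subject lands in LL.
draws <- replicate(2000, rbinom(length(probLL), size = 1, prob = probLL))
freqLL <- rowMeans(draws)
names(freqLL) <- names(probLL)
```

Subjects A and E always keep their labels, while C switches group in roughly 45% of draws, so the number of LL samples varies between bootstraps.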

Details

This function is used to classify average latency scores into two groups, the "short latency" (SL) with low stress resilience and the "long latency" (LL) with high stress resilience.

The bootstrap resampling is done by sampling, with replacement, from the SL and LL groups defined by the initial call to the clustering algorithm on the original data. The proportion of SL and LL in each bootstrap remains fixed, unless stratification = "probabilistic", in which case an EM algorithm is used to define prior probabilities, and the stratification is recomputed by sampling these priors at each bootstrap.
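A minimal sketch of one fixed-proportion bootstrap draw, assuming the initial classification is available as a logical vector. The helper name stratifiedResample and its arguments are hypothetical and do not appear in the package.

```r
# Hypothetical helper for illustration; not part of the package.
stratifiedResample <- function(latency, isLL) {
  slIdx <- which(!isLL)
  llIdx <- which(isLL)
  # Resample positions within each group so the group sizes stay fixed.
  # Indexing through sample.int() avoids the special behaviour of
  # sample() when it is given a single number.
  bootSL <- slIdx[sample.int(length(slIdx), replace = TRUE)]
  bootLL <- llIdx[sample.int(length(llIdx), replace = TRUE)]
  latency[c(bootSL, bootLL)]
}

set.seed(42)
latency <- c(120, 180, 210, 250, 560, 640)
isLL <- latency > 400          # stand-in for the initial classification
boot <- stratifiedResample(latency, isLL)
```

Every draw contains four SL values and two LL values, which is what keeps the fit stable when one group is small.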

The classification boundary for each bootstrap is the mid-point between the two centroids / means. This boundary is used to classify the original latency data.
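The mid-point rule can be sketched with kmeans from the stats package. The data here are made up, and the sketch fits the original data once rather than a bootstrap sample; it is meant only to show how a boundary turns into class labels.

```r
# Illustrative sketch with made-up latency values.
set.seed(1)
latency <- c(150, 180, 200, 220, 260, 560, 600, 650)

# Two-cluster fit; nstart > 1 guards against a poor random start.
fit <- kmeans(latency, centers = 2, nstart = 5)

# Boundary is halfway between the two cluster means.
boundary <- mean(fit$centers)

# Classify the data against the boundary: 1 (SL), 2 (LL).
clusters <- ifelse(latency > boundary, 2, 1)
```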

Value

A list with the following components:

bootProbLL

The probability that a subject is classified as LL, defined as the number of times this subject was classified LL divided by the total number of bootstraps.

its

The number of bootstrap iterations.

latency

The original latency data used for the original classification and resampled for the bootstrap.

boundary

Approximate cluster boundary between SL and LL means (from kmeans or hc) or medoids (from pam) computed on the original data.

boundary_boot

Approximate boundary point from all bootstraps.

centers

Cluster centers or medoids fit to the original data.

class_boot

A matrix containing the classification of sample data at each bootstrap.

clusters

The classification of the original data: 1 (SL), 2 (LL).

The ordering of the per-subject values (such as bootProbLL) is the same as in the latency vector.

The cluster boundaries are defined as halfway between the two cluster means at each bootstrap.
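As a rough illustration of how a per-subject probability such as bootProbLL relates to per-bootstrap classifications, the sketch below uses a small made-up matrix. The layout (subjects in rows, bootstraps in columns) is an assumption for this example; check class_boot in the returned list for its actual orientation.

```r
# Made-up classifications: 1 (SL), 2 (LL).
# Assumed layout: one row per subject, one column per bootstrap.
classBoot <- matrix(c(1, 1, 1, 1,
                      1, 2, 1, 2,
                      2, 2, 2, 2),
                    nrow = 3, byrow = TRUE)

# Fraction of bootstraps in which each subject was classified LL.
probLL <- rowMeans(classBoot == 2)
```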

Author(s)

Philip A Cook <cookpa@pennmedicine.upenn.edu>

See Also

plotBootstrapClusterClass, kmeans, pam, hc

Examples


set.seed(20140123)

sl <- rnorm(60, 200, 100)
sl[which(sl < 0)] <- 0

ll <- rnorm(15, 600, 100)
ll[which(ll > 900)] <- 900

boot <- bootstrapClusterClass(c(sl, ll), its = 100, algorithm = "pam")


## Not run: plotBootstrapClusterClass(boot)

cookpa/socialdefeat documentation built on May 17, 2019, 10:12 p.m.