BootKmeans: BootKmeans() function
In MHCtools: Analysis of MHC Data in Non-Model Species

Description

BootKmeans is a wrapper for the kmeans() function of the 'stats' package that adds bootstrapping. Bootstrapping k-estimates may be desirable for data sets in which the BIC- vs. k-values do not produce a clear inflection point ("elbow").

Usage

BootKmeans(
  z1_matrix,
  z2_matrix,
  z3_matrix,
  z4_matrix,
  z5_matrix,
  threshold = 0.01,
  no_scans = 1000,
  max_k = 40,
  iter.max = 1e+06,
  nstart = 200,
  algorithm = "Hartigan-Wong",
  path_out = path_out
)

Arguments

`z1_matrix`	a matrix with numerical values of the first z-descriptor for each amino acid position in all sequences in the data set.
`z2_matrix`	a matrix with numerical values of the second z-descriptor for each amino acid position in all sequences in the data set.
`z3_matrix`	a matrix with numerical values of the third z-descriptor for each amino acid position in all sequences in the data set.
`z4_matrix`	a matrix with numerical values of the fourth z-descriptor for each amino acid position in all sequences in the data set.
`z5_matrix`	a matrix with numerical values of the fifth z-descriptor for each amino acid position in all sequences in the data set.
`threshold`	a numerical value between 0 and 1 specifying the threshold of reduction in BIC for selecting a k estimate for each kmeans clustering model. The value specifies a proportion of the max observed reduction in BIC when increasing k by 1 (default 0.01).
`no_scans`	an integer specifying the number of k estimation scans to run (default 1,000).
`max_k`	an integer specifying the hypothetical maximum number of clusters to detect (default 40). In each k estimation scan, the algorithm runs a kmeans() clustering model for each value of k between 1 and max_k.
`iter.max`	an integer specifying the maximum number of iterations allowed in each kmeans() clustering model (default 1,000,000).
`nstart`	an integer specifying the number of random sets of initial centers, drawn from the rows of the input data, that are tried in each kmeans() clustering model (default 200).
`algorithm`	a character string specifying the algorithm used by the kmeans() clustering function, one of c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen") (default "Hartigan-Wong").
`path_out`	a user defined path to the folder where the output files will be saved.

Details

BootKmeans() performs multiple runs of kmeans() scanning k-values from 1 to a maximum value defined by the user. In each scan, an optimal k-value is estimated using a user-defined threshold of BIC reduction. The method is an automated version of visually inspecting elbow plots of BIC- vs. k-values. The number of scans to be performed is defined by the user.
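
A single k-estimation scan of the kind described above could be sketched as follows. This is a minimal illustration on simulated data, not the MHCtools source code: the selection rule (stop at the first k where moving to k + 1 reduces BIC by less than threshold times the largest observed reduction) is an assumption based on the description of the threshold argument, and the toy matrix x stands in for real z-descriptor data.

```r
# Illustrative sketch only; the stopping rule is an assumption, not MHCtools code
set.seed(1)

# Toy data: three well-separated groups of 30 points in 4 dimensions
x <- rbind(
  matrix(rnorm(120, mean = 0), ncol = 4),
  matrix(rnorm(120, mean = 6), ncol = 4),
  matrix(rnorm(120, mean = 12), ncol = 4)
)

max_k <- 8
threshold <- 0.01

# Fit one kmeans() model per k and compute BIC = D + log(n) * m * k
bic <- sapply(seq_len(max_k), function(k) {
  fit <- kmeans(x, centers = k, iter.max = 100, nstart = 10)
  fit$tot.withinss +
    log(length(fit$cluster)) * ncol(fit$centers) * nrow(fit$centers)
})

# delta[i] is the reduction in BIC when going from k = i to k = i + 1
delta <- -diff(bic)

# Assumed rule: pick the first k whose next step gains less than
# threshold * (largest observed reduction); fall back to max_k otherwise
below <- which(delta < threshold * max(delta))
k_est <- if (length(below) > 0) below[1] else max_k
k_est
```

Repeating such a scan many times and tabulating the resulting k_est values is the bootstrapping idea; BootKmeans() automates this and writes the per-scan output to path_out.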

For each k-estimate scan, the algorithm produces a summary of the statistics, including the total within-cluster sum of squares, AIC, and BIC, an elbow plot (BIC vs. k), and a set of cluster files corresponding to the estimated optimal k-value. It also produces a table summarizing the statistics of the final selected kmeans() models corresponding to the estimated optimal k-values of each scan.

After running BootKmeans() on a data set, it is recommended to subsequently evaluate the repeatability of the bootstrapped k-estimation scans with the ClusterMatch() function also included in MHCtools.

Input data format: A set of five z-matrices containing numerical values of the z-descriptors (z1-z5) for each amino acid position in a sequence alignment. Each column should represent an amino acid position and each row one sequence in the alignment.
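
Before calling BootKmeans(), it may help to confirm that the five z-matrices follow the format described above. The sketch below uses a hypothetical helper, check_z_matrices(), which is not part of MHCtools; it only checks that exactly five numeric matrices with identical dimensions are supplied, and the toy matrices are simulated.

```r
# Hypothetical helper, not part of MHCtools: returns TRUE when exactly five
# numeric matrices with identical dimensions are supplied
check_z_matrices <- function(...) {
  z <- list(...)
  if (length(z) != 5) return(FALSE)
  is_num_matrix <- vapply(z, function(m) is.matrix(m) && is.numeric(m),
                          logical(1))
  if (!all(is_num_matrix)) return(FALSE)
  dims <- vapply(z, dim, integer(2))   # 2 x 5 matrix: rows, columns
  all(dims == dims[, 1])               # every matrix matches the first
}

# Toy input: 8 sequences (rows) by 6 amino acid positions (columns)
z <- lapply(1:5, function(i) matrix(runif(8 * 6), nrow = 8, ncol = 6))

ok  <- check_z_matrices(z[[1]], z[[2]], z[[3]], z[[4]], z[[5]])
bad <- check_z_matrices(z[[1]], z[[2]], z[[3]], z[[4]], z[[5]][, 1:5])
ok
bad
```

Here ok is TRUE, while bad is FALSE because the fifth matrix has one column fewer than the others.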

If you publish data or results produced with MHCtools, please cite both of the following references:
Roved, J. (2022). MHCtools: Analysis of MHC data in non-model species. CRAN.
Roved, J. (2024). MHCtools 1.5: Analysis of MHC sequencing data in R. In S. Boegel (Ed.), HLA Typing: Methods and Protocols (2nd ed., pp. 275–295). Humana Press. https://doi.org/10.1007/978-1-0716-3874-3_18

Value

The function creates three folders in path_out. For each scan, these hold the estimated k-clusters saved as .RData files, an elbow plot saved as a .pdf file, and a stats summary table saved as a .csv file. A summary of all scans performed in the bootstrap run is also saved in path_out as a .csv file and shown in the console. Should alternative elbow plots be desired, they may be produced manually from the stats presented in the summary tables for each scan.

Note

If z-matrices were generated with the DistCalc() function, please make sure to load the z-matrices from the .csv files exported by DistCalc(). Calling e.g. 'z1_matrix' without loading the exported tables will engage the default test data set in MHCtools.

Setting max_k too high can cause kmeans to fail with the error "more cluster centers than distinct data points" - this problem can be solved by reducing max_k.
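
One way to choose a safe upper bound is to count the distinct rows in the data before setting max_k, since kmeans() cannot place more centers than there are distinct data points. The sketch below uses simulated matrices and assumes that the five z-matrices are combined column-wise before clustering; that assumption is not confirmed by this page, and staying at or below the bound does not by itself guarantee that every model will fit.

```r
# Toy stand-ins for the five z-matrices (12 sequences by 3 positions each)
set.seed(7)
z_list <- lapply(1:5, function(i) matrix(round(rnorm(12 * 3), 2), nrow = 12))

# Repeat the first six sequences so that only six rows are distinct
z_list <- lapply(z_list, function(z) z[c(1:6, 1:6), ])

# Assumption: the five matrices are joined column-wise before clustering
z_all <- do.call(cbind, z_list)

# Number of distinct sequences in z-descriptor space
n_distinct <- nrow(unique(z_all))

# Cap the requested max_k (here the default of 40) at that number
max_k <- min(40, n_distinct)
max_k
```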

AIC and BIC are calculated from the kmeans model objects by the following formulae:
- AIC = D + 2*m*k
- BIC = D + log(n)*m*k
in which:
- m = ncol(fit$centers)
- n = length(fit$cluster)
- k = nrow(fit$centers)
- D = fit$tot.withinss
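
The formulae above can be applied directly to any object returned by kmeans(). The sketch below wraps them in a hypothetical helper, kmeans_ic(), which is not part of MHCtools, and runs it on simulated data.

```r
# Hypothetical helper, not part of MHCtools: AIC and BIC for a kmeans() fit,
# following the formulae given in this Note
kmeans_ic <- function(fit) {
  m <- ncol(fit$centers)    # number of variables (columns)
  n <- length(fit$cluster)  # number of observations (rows)
  k <- nrow(fit$centers)    # number of clusters
  D <- fit$tot.withinss     # total within-cluster sum of squares
  c(AIC = D + 2 * m * k, BIC = D + log(n) * m * k)
}

set.seed(42)
x <- matrix(rnorm(200), ncol = 4)   # 50 observations, 4 variables
fit <- kmeans(x, centers = 3, nstart = 5)
ic <- kmeans_ic(fit)
ic
```

Both criteria share the term D and differ only in the penalty per parameter, 2 for AIC and log(n) for BIC, so BIC penalizes additional clusters more strongly whenever n exceeds about 7 observations.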

Examples

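library(MHCtools)
# Use the example z-matrices bundled with MHCtools
z1_matrix <- z1_matrix
z2_matrix <- z2_matrix
z3_matrix <- z3_matrix
z4_matrix <- z4_matrix
z5_matrix <- z5_matrix
# Write all output files to a temporary folder
path_out <- tempdir()
# Settings far below the defaults keep this example quick to run
BootKmeans(z1_matrix, z2_matrix, z3_matrix, z4_matrix, z5_matrix, threshold=0.01,
  no_scans=10, max_k=20, iter.max=10, nstart=10, algorithm="Hartigan-Wong",
  path_out=path_out)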

MHCtools documentation built on April 3, 2025, 7:17 p.m.