ref_pacs: Generate Reference PAC Scores


View source: R/ref_pacs.R

Description

This function computes reference PAC scores from simulated or permuted data based on an input matrix.

Usage

ref_pacs(dat, max_k = 3, ref_method = "pc_norm", B = 100, reps = 100,
  distance = "euclidean", cluster_alg = "hclust",
  hclust_method = "average", p_item = 0.8, p_feature = 1,
  wts_item = NULL, wts_feature = NULL, pac_window = c(0.1, 0.9),
  logit = TRUE, seed = NULL, parallel = TRUE)

Arguments

dat

Probe-by-sample omic data matrix (probes in rows, samples in columns). Data should be filtered and normalized prior to analysis.

max_k

Integer specifying the maximum cluster number to evaluate. Default is max_k = 3, but a more reasonable rule of thumb is the square root of the sample size.

ref_method

How should null data be generated? Options include "pc_norm", "pc_unif", "cholesky", "range", and "permute". See Details.

B

Number of reference datasets to generate.

reps

Number of subsamples to draw for consensus clustering.

distance

Distance metric for clustering. Supports all methods available in dist and vegdist, as well as those implemented in the bioDist package.

cluster_alg

Clustering algorithm to implement. Currently supports hierarchical ("hclust"), k-means ("kmeans"), and k-medoids ("pam").

hclust_method

Method to use if cluster_alg = "hclust". See hclust.

p_item

Proportion of items to include in each subsample.

p_feature

Proportion of features to include in each subsample.

wts_item

Optional vector of item weights.

wts_feature

Optional vector of feature weights.

pac_window

Lower and upper bounds of the consensus index sub-interval over which to calculate the PAC. Both bounds must lie in the open interval (0, 1).

logit

Logit transform PAC output? Allows for faster convergence of the null distribution toward normality, which aids in downstream statistical testing.

seed

Optional seed for reproducibility.

parallel

If a parallel backend is loaded and available, should the function use it? Highly advisable if hardware permits.
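
To make the roles of reps, p_item, distance, hclust_method, pac_window, and logit concrete, here is a minimal base R sketch of consensus clustering followed by a PAC calculation. It is an illustration only, not the package's implementation: the helper names consensus_matrix and pac_score are hypothetical, feature subsampling and weights are ignored, and the PAC is taken to be the difference of the empirical CDF of pairwise consensus indices at the two window bounds.

```r
# Toy probe-by-sample matrix: 200 features (rows), 20 samples (columns)
set.seed(1)
dat <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)

# Hypothetical helper: consensus matrix for a fixed k via subsampled hclust
consensus_matrix <- function(dat, k, reps = 100, p_item = 0.8) {
  n <- ncol(dat)
  together <- matrix(0, n, n)  # times a pair landed in the same cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample
  for (r in seq_len(reps)) {
    idx  <- sort(sample.int(n, size = floor(p_item * n)))
    d    <- dist(t(dat[, idx, drop = FALSE]), method = "euclidean")
    cl   <- cutree(hclust(d, method = "average"), k = k)
    same <- outer(cl, cl, "==")
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / pmax(sampled, 1)  # avoid 0/0 for pairs never drawn together
}

# Hypothetical helper: proportion of ambiguous clustering within a window
pac_score <- function(cm, pac_window = c(0.1, 0.9)) {
  ci  <- cm[lower.tri(cm)]  # consensus indices for unique sample pairs
  cdf <- ecdf(ci)
  cdf(pac_window[2]) - cdf(pac_window[1])
}

cm  <- consensus_matrix(dat, k = 2, reps = 50)
pac <- pac_score(cm)
logit_pac <- qlogis(pac)  # log(pac / (1 - pac)); infinite if pac is 0 or 1
```

A lower PAC means fewer sample pairs with intermediate consensus values, that is, a more stable partition at that k.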

Details

Suitable reference PAC scores are essential to test the magnitude and significance of cluster stability. This function generates B simulated or permuted datasets with similar properties to dat, but with random sample cluster structure. The expected value of k for these datasets is therefore 1, and PAC scores for each k form a null distribution that tends toward normality as B increases.

ref_pacs currently supports five methods for generating null datasets from a given input matrix: "pc_norm", "pc_unif", "cholesky", "range", and "permute".

The first two options use the data's true eigenvectors to preserve feature-wise covariance while scrambling sample-wise covariance. "pc_norm" tends to generate the most realistic null data, while Monte Carlo replicates generated via "pc_unif" converge more quickly to a true k of 1. Both methods are fast and stable when features outnumber samples. When samples outnumber features, ref_method defaults to "cholesky", which takes longer to compute, but is better suited for such cases. "range" and "permute" are included for convenience, but are not recommended since they do not preserve feature-wise covariance, which may bias results.
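
One way to read the eigenvector-based approach is sketched below in base R: simulate principal component scores from a normal distribution, then rotate them back into feature space with the true eigenvectors. This is a rough illustration of the idea, not the package's code. The use of prcomp, the centering step, and drawing each component with standard deviation equal to the observed component's standard deviation are all assumptions.

```r
set.seed(1)
dat <- matrix(rnorm(500 * 12), nrow = 500, ncol = 12)  # probes x samples

# PCA with samples as observations, so the rotation holds feature eigenvectors
pca <- prcomp(t(dat))
n_samples <- nrow(pca$x)

# Assumed null model: independent normal scores, one column per component,
# each with the standard deviation of the corresponding true component
null_pcs <- sapply(pca$sdev, function(s) rnorm(n_samples, mean = 0, sd = s))

# Back-transform to feature space using the true eigenvectors, restore the
# feature means, and return to probe-by-sample orientation
null_dat <- t(sweep(null_pcs %*% t(pca$rotation), 2, pca$center, "+"))
```

Because the simulated scores are drawn independently for each sample, any grouping among samples in null_dat arises by chance, while the features keep roughly the covariance structure encoded in the eigenvectors.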

Just as reference PAC distributions are the theoretical core of the CCtestr approach to cluster validation, ref_pacs is the computational core of the CCtestr package. This function can take some time to execute, and should ideally be run in parallel, especially with large datasets.

Value

A matrix with B rows and max_k - 1 columns containing null PAC scores, with one column for each cluster number k from 2 to max_k.
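
As a sketch of how a matrix of this shape might be used downstream, the base R snippet below compares observed PAC scores against the null columns. Everything in it is a placeholder: null_pacs is simulated noise standing in for real ref_pacs output, the observed scores and the column names k2 and k3 are hypothetical, and the one-sided test direction assumes that lower PAC indicates greater stability.

```r
set.seed(1)
B <- 100
max_k <- 3

# Placeholder for ref_pacs() output: B rows, max_k - 1 columns (logit scale)
null_pacs <- matrix(rnorm(B * (max_k - 1), mean = -2, sd = 0.3), nrow = B,
                    dimnames = list(NULL, paste0("k", 2:max_k)))

# Hypothetical observed logit PAC scores for k = 2 and k = 3
obs_pacs <- c(k2 = -3.1, k3 = -2.2)

# Normal approximation: z-score and lower-tail p-value for each k
z      <- (obs_pacs - colMeans(null_pacs)) / apply(null_pacs, 2, sd)
p_norm <- pnorm(z)

# Empirical alternative: share of null scores at or below the observed score
p_emp <- (colSums(sweep(null_pacs, 2, obs_pacs, "<=")) + 1) / (B + 1)
```

The normal approximation relies on the null distribution being close to normal, which is the stated motivation for the logit option. The empirical p-value makes no such assumption but cannot fall below 1 / (B + 1).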

Examples

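# Simulate a probe-by-sample matrix: 1000 features (rows), 12 samples (columns)
mat <- matrix(rnorm(1000 * 12), nrow = 1000, ncol = 12)
# Generate null PAC scores using the PC-based normal reference method
rp <- ref_pacs(mat, ref_method = "pc_norm")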
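
The square-root rule of thumb for max_k mentioned under Arguments could be applied as follows. Rounding down and enforcing a minimum of 2 are assumptions made for this sketch, and the ref_pacs call is left commented out so the snippet runs without the package installed.

```r
mat <- matrix(rnorm(1000 * 12), nrow = 1000, ncol = 12)

# Samples are columns, so the sample size is ncol(mat)
n_samples <- ncol(mat)

# Square root of the sample size, rounded down, but never below 2
max_k <- max(2L, as.integer(floor(sqrt(n_samples))))

# rp <- ref_pacs(mat, max_k = max_k, seed = 123)
```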

dswatson/M3C documentation built on Aug. 20, 2017, 2:46 p.m.