ref_pacs: Generate Reference PAC Scores
In dswatson/cc_testr: Rigorously test cluster stability


This function computes reference PAC (proportion of ambiguous clustering) scores from data simulated or permuted from an input matrix.

ref_pacs(dat, max_k = 3, ref_method = "pc_norm", B = 100, reps = 100,
  distance = "euclidean", cluster_alg = "hclust",
  hclust_method = "average", p_item = 0.8, p_feature = 1,
  wts_item = NULL, wts_feature = NULL, pac_window = c(0.1, 0.9),
  logit = TRUE, seed = NULL, parallel = TRUE)

`dat`	Probe by sample omic data matrix. Data should be filtered and normalized prior to analysis.
`max_k`	Integer specifying the maximum cluster number to evaluate. Default is `max_k = 3`, but a more reasonable rule of thumb is the square root of the sample size.
`ref_method`	How should null data be generated? Options include `"pc_norm"`, `"pc_unif"`, `"cholesky"`, `"range"`, and `"permute"`. See Details.
`B`	Number of reference datasets to generate.
`reps`	Number of subsamples to draw for consensus clustering.
`distance`	Distance metric for clustering. Supports all methods available in `dist` and `vegdist`, as well as those implemented in the `bioDist` package.
`cluster_alg`	Clustering algorithm to implement. Currently supports hierarchical (`"hclust"`), k-means (`"kmeans"`), and k-medoids (`"pam"`).
`hclust_method`	Method to use if `cluster_alg = "hclust"`. See `hclust`.
`p_item`	Proportion of items to include in each subsample.
`p_feature`	Proportion of features to include in each subsample.
`wts_item`	Optional vector of item weights.
`wts_feature`	Optional vector of feature weights.
`pac_window`	Lower and upper bounds for the consensus index sub-interval over which to calculate the PAC. Both bounds must lie in the open interval (0, 1).
`logit`	Should PAC scores be logit-transformed? The transform speeds convergence of the null distribution toward normality, which aids downstream statistical testing.
`seed`	Optional seed for reproducibility.
`parallel`	If a parallel backend is loaded and available, should the function use it? Highly advisable if hardware permits.
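As a rough illustration of what `pac_window` and `logit` control, the sketch below computes a PAC score from a toy consensus matrix in base R. The helper `pac_score` and the toy matrix are hypothetical and are not part of the package; the PAC is taken here as the fraction of sample pairs whose consensus index falls inside the window, which is an assumption about the internal calculation. The last lines apply the square-root rule of thumb for `max_k` to an assumed sample size of 25.

```r
# Hypothetical helper for illustration; not a function exported by the package.
pac_score <- function(consensus, pac_window = c(0.1, 0.9), logit = TRUE) {
  stopifnot(length(pac_window) == 2,
            pac_window[1] > 0, pac_window[2] < 1,
            pac_window[1] < pac_window[2])
  # Use each sample pair once: the lower triangle of the symmetric matrix
  idx <- consensus[lower.tri(consensus)]
  cdf <- ecdf(idx)
  # Fraction of pairs with a consensus index in (lower, upper]
  pac <- cdf(pac_window[2]) - cdf(pac_window[1])
  # qlogis() is the logit; it returns -Inf or Inf if pac is exactly 0 or 1
  if (logit) qlogis(pac) else pac
}

# Toy consensus matrix for 4 samples (made-up values)
consensus <- diag(4)
consensus[lower.tri(consensus)] <- c(1, 0.5, 0, 0.3, 0, 1)
consensus[upper.tri(consensus)] <- t(consensus)[upper.tri(consensus)]

pac_raw   <- pac_score(consensus, logit = FALSE)  # 2 of 6 pairs are ambiguous
pac_logit <- pac_score(consensus)                 # same score on the logit scale

# Rule of thumb for max_k: square root of the sample size (assumed n = 25)
n_samples <- 25
max_k <- floor(sqrt(n_samples))
```

This sketch does not guard against PAC values of exactly 0 or 1, for which the logit is infinite; how the package handles those boundary cases is not stated here.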

Suitable reference PAC scores are essential for testing the magnitude and significance of cluster stability. This function generates `B` simulated or permuted datasets with properties similar to those of `dat` but with no true cluster structure among samples. The expected value of k for these datasets is therefore 1, and the PAC scores for each k form a null distribution that tends toward normality as `B` increases.

ref_pacs currently supports five methods for generating null datasets from a given input matrix:

"pc_norm" simulates the principal components by taking random draws from a normal distribution with variance equal to the true eigenvalues. Data are subsequently back-transformed to their original dimensions by matrix multiplication with the true eigenvector matrix.
"pc_unif" simulates the principal components by taking random draws from a uniform distribution with ranges equal to those of the true principal components. Data are subsequently back-transformed to their original dimensions by matrix multiplication with the true eigenvector matrix.
"cholesky" simulates random Gaussian noise around the nearest positive-definite approximation to dat's feature-wise covariance matrix.
"range" selects random values uniformly from each feature's observed range.
"permute" shuffles each feature's observed values.
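The two principal component methods can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's own code: the toy matrix, the helper `back_transform`, and the choice to re-add the feature means are all hypothetical.

```r
set.seed(1)
# Toy probe-by-sample matrix: 50 features (rows) x 10 samples (columns)
dat <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

# PCA with samples as rows, so the eigenvectors live in feature space
pca  <- prcomp(t(dat))
n    <- nrow(pca$x)   # number of samples

# "pc_norm"-style scores: normal draws whose variance matches each
# eigenvalue (pca$sdev^2)
sim_norm <- sapply(pca$sdev, function(s) rnorm(n, mean = 0, sd = s))

# "pc_unif"-style scores: uniform draws over each component's observed range
sim_unif <- apply(pca$x, 2, function(pc) runif(n, min = min(pc), max = max(pc)))

# Hypothetical helper: map simulated scores back to the original dimensions
back_transform <- function(scores) {
  null_t <- scores %*% t(pca$rotation)         # samples x features
  null_t <- sweep(null_t, 2, pca$center, "+")  # re-add feature means (assumption)
  t(null_t)                                    # back to features x samples
}

null_norm <- back_transform(sim_norm)
null_unif <- back_transform(sim_unif)
```

Because the same eigenvector matrix is reused, relationships among features carry over to the null data, while each sample's scores are drawn independently.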

The first two options use the data's true eigenvectors to preserve feature-wise covariance while scrambling sample-wise covariance. "pc_norm" tends to generate the most realistic null data, while Monte Carlo replicates generated via "pc_unif" converge more quickly to a true k of 1. Both methods are fast and stable when features outnumber samples. When samples outnumber features, ref_method defaults to "cholesky", which takes longer to compute, but is better suited for such cases. "range" and "permute" are included for convenience, but are not recommended since they do not preserve feature-wise covariance, which may bias results.
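The remaining three methods can be sketched in base R along the same lines. This is an illustration under assumptions rather than the package's implementation: the toy matrix has more samples than features, so its covariance matrix is already positive definite and the nearest positive-definite approximation step is skipped (something like `Matrix::nearPD` could fill that role), and centring the Gaussian noise on the feature means is likewise assumed.

```r
set.seed(2)
# Toy probe-by-sample matrix: 5 features (rows) x 40 samples (columns)
dat    <- matrix(rnorm(5 * 40), nrow = 5, ncol = 40)
n_feat <- nrow(dat)
n_samp <- ncol(dat)

# "cholesky"-style: Gaussian noise with the feature-wise covariance of dat
sigma <- cov(t(dat))   # 5 x 5 feature covariance matrix
R     <- chol(sigma)   # upper triangular factor, t(R) %*% R equals sigma
z     <- matrix(rnorm(n_samp * n_feat), nrow = n_samp, ncol = n_feat)
null_chol <- t(z %*% R) + rowMeans(dat)   # feature means re-added (assumption)

# "range"-style: uniform draws within each feature's observed range
null_range <- t(apply(dat, 1, function(f) runif(n_samp, min(f), max(f))))

# "permute"-style: shuffle each feature's values independently across samples
null_perm <- t(apply(dat, 1, sample))
```

In the last two cases each feature is handled on its own, which is why any covariance between features in `dat` is lost.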

Just as reference PAC distributions are the theoretical core of the CCtestr approach to cluster validation, ref_pacs is the computational core of the CCtestr package. This function can take some time to execute, and should ideally be run in parallel, especially with large datasets.

A matrix with `B` rows and `max_k - 1` columns containing null PAC scores, one column for each cluster number k from 2 to `max_k`.
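One way such a matrix might be used downstream is sketched below. Everything here is a placeholder: `null_pacs` is filled with random numbers in the documented shape rather than real output, the column labels and the observed scores in `obs_pacs` are made up, and the z-score with a one-sided normal p-value is an assumed testing approach, not necessarily the one the package applies.

```r
set.seed(3)
B     <- 100
max_k <- 4

# Placeholder for the returned matrix: B rows, max_k - 1 columns
# (random values standing in for logit-transformed null PAC scores)
null_pacs <- matrix(rnorm(B * (max_k - 1), mean = -1, sd = 0.3),
                    nrow = B, ncol = max_k - 1)
colnames(null_pacs) <- paste0("k", 2:max_k)   # assumed labels

# Made-up observed scores on the same (logit) scale
obs_pacs <- c(k2 = -2.1, k3 = -1.2, k4 = -0.9)

# Summarise the null distribution for each k
null_mean <- colMeans(null_pacs)
null_sd   <- apply(null_pacs, 2, sd)

# Lower PAC means more stable clusters, so test the lower tail
z <- (obs_pacs - null_mean) / null_sd
p <- pnorm(z)
```

The normal approximation relies on the null scores being roughly Gaussian for each k, which is the reason given above for the logit transform.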