refactor: Sparse principal component analysis using ReFACTor


View source: R/TCA.R

Description

Performs unsupervised feature selection followed by principal component analysis (PCA) under a row-sparse model using the ReFACTor algorithm. For example, for tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), refactor can capture the variation in cell-type composition, which was shown to be a dominant sparse signal in methylation data.

Usage

refactor(
  X,
  k,
  sparsity = 500,
  C = NULL,
  C.remove = FALSE,
  sd_threshold = 0.02,
  num_comp = NULL,
  rand_svd = FALSE,
  log_file = "TCA.log",
  debug = FALSE,
  verbose = TRUE
)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k sources. Note that X must include row names and column names and that NA values are currently not supported. X should not include features that are constant across all observations.

k

A numeric value indicating the dimension of the signal in X (i.e. the number of sources).

sparsity

A numeric value indicating the sparsity of the signal in X (the number of signal rows).

C

An n by p design matrix of covariates that will be accounted for in the feature selection step. An intercept term will be included automatically. Note that C must include row names and column names and that NA values are currently not supported; set C to be NULL if there are no such covariates.

C.remove

A logical value indicating whether the covariates in C should be accounted for not only in the feature selection step, but also in the final calculation of the principal components (i.e. if C.remove == TRUE then the selected features will be adjusted for the covariates in C prior to calculating principal components). Set C.remove to TRUE when the ReFACTor components are intended to be used for correction in a downstream analysis; set C.remove to FALSE when ReFACTor is used only for capturing the sparse signals in X (i.e. regardless of correction).

sd_threshold

A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (i.e. features with standard deviation lower than sd_threshold will be excluded). Set sd_threshold to NULL to turn off this filter. Note that removing features with very low variability tends to improve speed and performance.

num_comp

A numeric value indicating the number of ReFACTor components to return.

rand_svd

A logical value indicating whether to use randomized SVD for estimating the low-rank structure of the data in the first step of the algorithm; randomized SVD can result in a substantial speedup for large data.

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file; note that if verbose == FALSE then no output file will be generated regardless of the value of log_file.

debug

A logical value indicating whether to set the logger to a more detailed debug level; set debug to TRUE before reporting issues.

verbose

A logical value indicating whether to print logs.
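
The requirements on X listed above (row and column names, no NA values, no constant features) can be checked before calling refactor. The following is a minimal sketch in base R; check_refactor_input is a hypothetical helper that is not part of the TCA package, and it only reports how many features would fall below sd_threshold rather than reproducing the package's own filtering.

```r
# Hypothetical helper (not part of TCA): checks the documented requirements
# on X and counts the features that an sd_threshold filter would exclude.
check_refactor_input <- function(X, sd_threshold = 0.02) {
  stopifnot(is.matrix(X), is.numeric(X))
  stopifnot(!is.null(rownames(X)), !is.null(colnames(X)))
  stopifnot(!anyNA(X))
  feature_sd <- apply(X, 1, sd)  # one standard deviation per feature (row)
  if (any(feature_sd == 0)) stop("X contains constant features")
  if (is.null(sd_threshold)) return(0L)  # filter turned off
  sum(feature_sd < sd_threshold)
}

# Toy input: 20 features by 10 observations, with one near-constant feature
set.seed(1)
X <- matrix(runif(20 * 10), nrow = 20,
            dimnames = list(paste0("f", 1:20), paste0("s", 1:10)))
X["f1", ] <- 0.5 + (1:10) * 1e-4
n_low <- check_refactor_input(X)
```

Here n_low counts the rows whose standard deviation is below the default threshold of 0.02.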

Details

ReFACTor is a two-step algorithm for sparse principal component analysis (PCA) under a row-sparse model. The algorithm performs an unsupervised feature selection by ranking the features based on their correlation with their values under a low-rank representation of the data, followed by a calculation of principal components using the top ranking features (ReFACTor components).
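
The two steps described above can be illustrated with a short base R sketch. This is not the package implementation: refactor_sketch is a hypothetical name, the low-rank representation is taken from a plain truncated SVD of the row-centered data, and the choice to standardize the selected features before PCA is an assumption made for the illustration.

```r
# Illustrative sketch only (hypothetical function, not the TCA implementation).
refactor_sketch <- function(X, k, sparsity, num_comp = k) {
  # Step 1a: rank-k approximation of the row-centered data (features x observations)
  Xc <- X - rowMeans(X)
  s <- svd(Xc, nu = k, nv = k)
  X_low <- s$u %*% diag(s$d[seq_len(k)], nrow = k) %*% t(s$v)
  # Step 1b: score each feature by the correlation between its observed
  # values and its values under the low-rank approximation
  feature_score <- vapply(seq_len(nrow(Xc)),
                          function(i) cor(Xc[i, ], X_low[i, ]), numeric(1))
  names(feature_score) <- rownames(X)
  ranked <- names(sort(feature_score, decreasing = TRUE))
  # Step 2: PCA on the top-ranked features, with observations as rows
  top <- ranked[seq_len(sparsity)]
  pca <- prcomp(t(X[top, , drop = FALSE]), center = TRUE, scale. = TRUE)
  list(scores = pca$x[, seq_len(num_comp), drop = FALSE],
       coeffs = pca$rotation[, seq_len(num_comp), drop = FALSE],
       ranked_list = ranked)
}

# Simulated data: 200 features, 50 observations; only the first 20 features
# carry a rank-2 signal, the rest are noise.
set.seed(1)
m <- 200; n <- 50; k <- 2; n_signal <- 20
W <- matrix(rnorm(n * k), n, k)                    # hidden sources per observation
X <- matrix(rnorm(m * n), m, n,
            dimnames = list(paste0("f", 1:m), paste0("s", 1:n)))
X[1:n_signal, ] <- X[1:n_signal, ] +
  matrix(rnorm(n_signal * k, sd = 3), n_signal, k) %*% t(W)
fit <- refactor_sketch(X, k = 2, sparsity = 20)
```

In this toy setting the top of fit$ranked_list should be dominated by the signal-carrying features, and the shapes of fit$scores and fit$coeffs follow the n by num_comp and sparsity by num_comp layout described under Value.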

Note that ReFACTor is tuned towards capturing sparse signals of the dominant sources of variation in the data. Therefore, in the presence of other potentially dominant factors in the data (i.e. beyond the variation of interest), these factors should be accounted for by including them as covariates (see argument C). In cases where the ReFACTor components are intended to be used as covariates in a downstream analysis alongside the covariates in C (e.g., in a standard regression analysis or in a TCA regression), it is advised to set the argument C.remove to be TRUE. This will adjust the selected features for the information in C prior to the calculation of the ReFACTor components, which will therefore capture only signals that are not present in C (and as a result may benefit the downstream analysis by potentially capturing more signals beyond the information in C).
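
One way to picture the adjustment for C is as a per-feature linear regression whose residuals replace the original values. The sketch below assumes exactly that (ordinary least squares with an intercept); adjust_for_covariates is a hypothetical helper and the package may adjust the features differently.

```r
# Hypothetical helper (not part of TCA): removes the linear effect of the
# covariates C (n x p) from each feature (row) of X (m x n).
adjust_for_covariates <- function(X, C) {
  design <- cbind(1, C)        # add an intercept column
  fit <- lm.fit(design, t(X))  # regress all features at once; t(X) is n x m
  t(fit$residuals)             # back to features x observations
}

set.seed(1)
n <- 30; m <- 5
C <- matrix(rnorm(n * 2), n, 2,
            dimnames = list(paste0("s", 1:n), c("cov1", "cov2")))
# Features that depend on the first covariate with increasing strength
X <- matrix(rnorm(m * n), m, n) + outer(1:m, C[, 1])
X_adj <- adjust_for_covariates(X, C)
# The adjusted features are orthogonal to every covariate
max_cross <- max(abs(X_adj %*% C))
```

Principal components computed from X_adj cannot be explained linearly by C, which is why such components can complement C in a downstream regression.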

Value

A list with the estimated components of the ReFACTor model.

scores

An n by num_comp matrix of the ReFACTor components (the projection scores).

coeffs

A sparsity by num_comp matrix of the coefficients of the ReFACTor components (the projection loadings).

ranked_list

A vector with the features in X, ranked by their scores in the feature selection step of the algorithm; the top scoring features (set according to the argument sparsity) are used for calculating the ReFACTor components. Note that features that were excluded according to sd_threshold will not appear in this ranked_list.

Note

For very large input matrices it is advised to use randomized SVD to speed up the feature selection step (see argument rand_svd).
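
To show why a randomized SVD can be faster, here is a generic sketch of the standard random range-finder approach in base R. It is not the routine that refactor calls internally; rand_svd_sketch and its oversample parameter are hypothetical names used only for this illustration.

```r
# Generic randomized SVD sketch (hypothetical, not the routine used by TCA):
# project onto a small random subspace, then run an exact SVD on the small matrix.
rand_svd_sketch <- function(A, k, oversample = 10) {
  l <- min(k + oversample, ncol(A))
  Omega <- matrix(rnorm(ncol(A) * l), ncol(A), l)  # random test matrix
  Q <- qr.Q(qr(A %*% Omega))                       # orthonormal basis for the sampled range
  B <- crossprod(Q, A)                             # small l x n projected matrix
  s <- svd(B, nu = k, nv = k)
  list(u = Q %*% s$u, d = s$d[seq_len(k)], v = s$v)
}

# Compare against the exact SVD on a matrix of exact rank 3
set.seed(1)
A <- matrix(rnorm(300 * 3), 300, 3) %*% matrix(rnorm(3 * 40), 3, 40)
fast <- rand_svd_sketch(A, k = 3)
exact <- svd(A, nu = 3, nv = 3)
max_diff <- max(abs(fast$d - exact$d[1:3]))
```

The expensive decomposition is applied only to the small matrix B, so the cost is driven by k plus the oversampling margin rather than by the full dimensions of A.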

References

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nature Methods 2017.

Examples

data <- test_data(100, 200, 3, 0, 0, 0.01)
ref <- refactor(data$X, k = 3, sparsity = 50)

cozygene/TCA documentation built on Feb. 18, 2021, 1:17 a.m.