tca: Fitting the TCA model

Description Usage Arguments Details Value Note References Examples

View source: R/TCA.R

Description

Fits the TCA model for an input matrix of observations coming from a mixture of k sources, under the assumption that each observation is a mixture of unique source-specific values (in each feature in the data). For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca allows to model the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual.

Usage

1
2
3
4
5
tca(X, W, C1 = NULL, C1.map = NULL, C2 = NULL, refit_W = FALSE,
  refit_W.features = NULL, refit_W.sparsity = 500,
  refit_W.sd_threshold = 0.02, tau = NULL, parallel = FALSE,
  num_cores = NULL, max_iters = 10, log_file = "TCA.log",
  debug = FALSE, verbose = TRUE)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported. X should not include features that are constant across all observations.

W

An n by k matrix of weights - the weights of k sources for each of the n mixtures (observations). All the weights must be positive and each row, corresponding to the weights of a single observation, must sum up to 1. Note that W must include row names and column names and that NA values are currently not supported. In case where only initial estimates of W are available, tca can be set to re-estimate W (see refit_W).

C1

An n by p1 design matrix of covariates that may affect the hidden source-specific values (possibly a different effect on each source). Note that C1 must include row names and column names and should not include an intercept term. NA values are currently not supported.

C1.map

An p1 by k matrix of 0/1 values, indicating for each of the p1 covariates in C1 whether to consider its potential effects on the values of each of the k sources (e.g., if position i,j in C1.map is 1 then the potential effect of the i-th covariate in C1 on the j-th source will be considered). If C1.map == NULL then effects for all covariates in C1 in each of the sources will be considered.

C2

An n by p2 design matrix of covariates that may affect the mixture (i.e. rather than directly the sources of the mixture; for example, variables that capture biases in the collection of the measurements). Note that C2 must include row names and column names and should not include an intercept term. NA values are currently not supported.

refit_W

A logical value indicating whether to re-estimate the input W under the TCA model.

refit_W.features

A vector with the names of the features in X to consider when re-estimating W (i.e. when refit_W == TRUE). This is useful since oftentimes just a subset of the features in X will be informative for estimating W. If refit_W.features == NULL then the ReFACTor algorithm will be used for performing feature selection (see also refit_W.sparsity, refit_W.sd_threshold).

refit_W.sparsity

A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating W (activated only if refit_W == TRUE and refit_W.features == NULL). Note that refit_W.sparsity must be lower or equal to the number of features in X. For more information, see the argument sparsity in refactor.

refit_W.sd_threshold

A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (activated only if refit_W == TRUE and refit_W.features == NULL). For more information, see the argument sd_threshold in refactor.

tau

A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If tau == NULL then tca will estimate tau.

parallel

A logical value indicating whether to use parallel computing (possible when using a multi-core machine).

num_cores

A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.

max_iters

A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model (max_iters iterations will be used as long as the optimization does not converge in earlier iterations).

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file; note that if verbose == TRUE then no output file will be generated regardless of the value of log_file.

debug

A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.

verbose

A logical value indicating whether to print logs.

Details

The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by Z_{hj}^i the source-specific value of observation i in feature j source h, the TCA model assumes:

Z_{hj}^i \sim N(μ_{hj},σ_{hj}^2)

where μ_{hj},σ_{hj} represent the mean and standard deviation that are specific to feature j source h. The model further assumes that the observed value of observation i in feature j is a mixture of k different sources:

X_{ji} = ∑_{h=1}^k W_{ih}Z_{hj}^i + ε_{ji}

where W_{ih} is the non-negative proportion of source h in the mixture of observation i such that ∑_{h=1}^kW_{ih} = 1, and ε_{ji} \sim N(0,τ^2) is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in W are, in general, unique for each individual, therefore each entry in the data matrix X is coming from a unique distribution (i.e. a different mean and a different variance).

In cases where the true W is unknown, tca can be provided with initial estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can also be provided with a user-specified list of features to be used in the re-estimation (see argument refit_W.features).

Factors that systematically affect the source-specific values Z_{hj}^i can be further considered (see argument C1). In that case, we assume:

Z_{hj}^i \sim N(μ_{hj}+c^{(1)}_i γ_j^h,σ_{hj}^2)

where c^{(1)}_i is a row vector from C1, corresponding to the values of the p_1 factors for observation i, and γ_j^h is a vector of p_1 corresponding effect sizes.

Factors that systematically affect the mixture values X_{ji}, such as variables that capture biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume:

X_{ji} \sim ∑_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i δ_j + ε_{ij}

where c^{(2)}_i is a row vector from C2, corresponding to the values of the p_2 factors for observation i, and δ_j is a vector of p_2 corresponding effect sizes.

Value

A list with the estimated parameters of the model. This list can be then used as the input to other functions such as tcareg.

W

An n by k matrix of weights. If refit_W == TRUE then this is the re-estimated W; otherwise this is the input W

mus_hat

An m by k matrix of estimates for the mean of each source in each feature.

sigmas_hat

An m by k matrix of estimates for the standard deviation of each source in each feature.

tau_hat

An estimate of the standard deviation of the i.i.d. component of variation in X. If an input value was provided for tau (i.e. tau != NULL) then tau_hat == tau.

gammas_hat

An m by k*p1 matrix of the estimated effects of the p1 factors in C1 on each of the m features in X, where the first p1 columns are the source-specific effects of the p1 factors on the first source, the following p1 columns are the source-specific effects on the second source and so on.

deltas_hat

An m by p2 matrix of the estimated effects of the p2 factors in C2 on the mixture values of each of the m features in X.

Note

The function tca may require a long running time when the input matrix X is very large; to alleviate this, it is strongly advised to use the parallel argument, given that a multi-core machine is available.

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2018.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Examples

1
2
data <- test_data(100, 20, 3, 1, 1, 0.01)
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)

cozygene/TCA documentation built on April 3, 2020, 3:14 p.m.