Fits the TCA model for an input matrix of features by observations that come from a mixture of k sources, under the assumption that each observation is a mixture of unique (unobserved) source-specific values (in each feature in the data). This function further allows statistically testing the effect of covariates on source-specific values. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca models the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual. In addition, it allows statistically testing the effects of covariates and phenotypes on methylation at the cell-type level.
tca(
X,
W,
C1 = NULL,
C1.map = NULL,
C2 = NULL,
refit_W = FALSE,
refit_W.features = NULL,
refit_W.sparsity = 500,
refit_W.sd_threshold = 0.02,
tau = NULL,
vars.mle = FALSE,
constrain_mu = FALSE,
parallel = FALSE,
num_cores = NULL,
max_iters = 10,
log_file = "TCA.log",
debug = FALSE,
verbose = TRUE
)
X |
An |
W |
An |
C1 |
An |
C1.map |
An |
C2 |
An |
refit_W |
A logical value indicating whether to re-estimate the input |
refit_W.features |
A vector with the names of the features in |
refit_W.sparsity |
A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating |
refit_W.sd_threshold |
A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in |
tau |
A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If |
vars.mle |
A logical value indicating whether to use maximum likelihood estimation when learning the variances in the model. If |
constrain_mu |
A logical value indicating whether to constrain the estimates of the mean parameters (i.e. \{μ_{hj}\}; see details below), in which case they will be constrained to the range of the values in |
parallel |
A logical value indicating whether to use parallel computing (possible when using a multi-core machine). |
num_cores |
A numeric value indicating the number of cores to use (activated only if |
max_iters |
A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model ( |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by Z_{hj}^i the source-specific value of observation i in feature j and source h. The TCA model assumes:
Z_{hj}^i \sim N(μ_{hj},σ_{hj}^2)
where μ_{hj},σ_{hj} represent the mean and standard deviation that are specific to feature j, source h. The model further assumes that the observed value of observation i in feature j is a mixture of k different sources:
X_{ji} = ∑_{h=1}^k W_{ih}Z_{hj}^i + ε_{ji}
where W_{ih} is the non-negative proportion of source h in the mixture of observation i such that ∑_{h=1}^k W_{ih} = 1, and ε_{ji} \sim N(0,τ^2) is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in W are, in general, unique for each individual; therefore, each entry in X comes from a unique distribution (i.e. a different mean and a different variance).
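The generative model above can be illustrated with a small base-R simulation. This is only a sketch under assumed, arbitrary sizes and parameter ranges (n, m, k, the ranges of mu and sigma, and tau are illustrative choices, not values taken from the package); it does not call tca.

```r
set.seed(1)
n <- 6; m <- 4; k <- 3        # observations, features, sources (arbitrary sizes)

# Mixture proportions W (n x k): non-negative rows that sum to 1
W <- matrix(rexp(n * k), n, k)
W <- W / rowSums(W)

# Feature- and source-specific means and standard deviations (m x k each)
mu    <- matrix(runif(m * k), m, k)
sigma <- matrix(runif(m * k, 0.01, 0.1), m, k)
tau   <- 0.01                 # sd of the i.i.d. measurement noise

# X[j, i] = sum_h W[i, h] * Z_hj^i + eps_ji, with Z_hj^i ~ N(mu[j, h], sigma[j, h]^2)
X <- matrix(NA_real_, m, n)
for (i in seq_len(n)) {
  # Source-specific values of observation i (m x k), drawn afresh for each observation
  Z_i <- matrix(rnorm(m * k, mean = mu, sd = sigma), m, k)
  X[, i] <- Z_i %*% W[i, ] + rnorm(m, sd = tau)
}

# Mean and sd implied by the model for every entry of X (both m x n)
X_mean <- mu %*% t(W)
X_sd   <- sqrt(sigma^2 %*% t(W^2) + tau^2)
```

Because the rows of W differ across observations, X_mean and X_sd generally differ from entry to entry, which is the sense in which each entry of X has its own distribution.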
In cases where the true W is unknown, tca can be provided with noisy estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can be provided with a user-specified list of features to be used in the re-estimation, assuming that such a list of the features most informative for estimating W exists (see argument refit_W.features).
Covariates that systematically affect the source-specific values Z_{hj}^i can be further considered (see argument C1). In that case, we assume:
Z_{hj}^i \sim N(μ_{hj}+c^{(1)}_i γ_j^h,σ_{hj}^2)
where c^{(1)}_i and γ_j^h correspond to the p_1 covariate values of observation i (i.e. a row vector from C1) and their effect sizes, respectively.
Covariates that systematically affect the mixture values X_{ji}, such as variables that capture technical biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume:
X_{ji} = ∑_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i δ_j + ε_{ji}
where c^{(2)}_i and δ_j correspond to the p_2 covariate values of observation i (i.e. a row vector from C2) and their effect sizes, respectively.
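To make the roles of the two covariate types concrete, the following base-R sketch simulates data in which C1 shifts the means of the source-specific values and C2 shifts the observed mixture directly. The sizes, the effect-size scales, and the array layout used for gamma and delta are assumptions made for illustration and are not the package's internal representation.

```r
set.seed(2)
n <- 8; m <- 5; k <- 2; p1 <- 2; p2 <- 1   # arbitrary sizes

W <- matrix(rexp(n * k), n, k)
W <- W / rowSums(W)                        # rows sum to 1
mu    <- matrix(runif(m * k), m, k)
sigma <- matrix(0.05, m, k)
tau   <- 0.01

C1 <- matrix(rnorm(n * p1), n, p1)         # row i is c^(1)_i
C2 <- matrix(rnorm(n * p2), n, p2)         # row i is c^(2)_i

# Hypothetical layout: gamma[j, h, ] holds gamma_j^h, delta[j, ] holds delta_j
gamma <- array(rnorm(m * k * p1, sd = 0.1), dim = c(m, k, p1))
delta <- matrix(rnorm(m * p2, sd = 0.1), m, p2)

X <- matrix(NA_real_, m, n)
for (i in seq_len(n)) {
  # Shift of the mean of Z_hj^i caused by C1: c^(1)_i gamma_j^h for every (j, h)
  shift <- apply(gamma, c(1, 2), function(g) sum(C1[i, ] * g))   # m x k
  Z_i <- matrix(rnorm(m * k, mean = mu + shift, sd = sigma), m, k)
  # C2 acts on the mixture itself, after the sources have been combined
  X[, i] <- Z_i %*% W[i, ] + delta %*% C2[i, ] + rnorm(m, sd = tau)
}
```

The distinction matters for interpretation: an effect in gamma is specific to one source, whereas an effect in delta is shared by the whole mixture.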
Since the standard deviation of X_{ji} is specific to observation i and feature j, we can obtain p-values for the estimates of γ_j^h and δ_j by dividing each observed data point x_{ji} by its estimated standard deviation and calculating T-statistics under a standard linear regression framework.
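One possible reading of this procedure, for a single feature and a single C1 covariate, is sketched below in base R. It is an illustration only: it uses the true simulated parameters in place of estimates, the design matrix D is a construction assumed here, and the package's actual implementation may differ.

```r
set.seed(3)
n <- 200; k <- 2
W <- matrix(rexp(n * k), n, k)
W <- W / rowSums(W)
c1 <- rnorm(n)                       # one covariate (p_1 = 1)

mu_j    <- c(0.3, 0.7)               # source-specific means for feature j
sigma_j <- c(0.05, 0.10)             # source-specific sds for feature j
gamma_j <- c(0.05, 0)                # covariate effect in source 1 only
tau     <- 0.02

# Simulate feature j: column h of Z holds Z_hj^i for all observations i
Z <- sapply(seq_len(k), function(h) rnorm(n, mu_j[h] + c1 * gamma_j[h], sigma_j[h]))
x_j <- rowSums(W * Z) + rnorm(n, sd = tau)

# Observation-specific sd of x_ji implied by the model
# (true parameters here; in practice these would be estimates)
sd_i <- sqrt(as.vector(W^2 %*% sigma_j^2) + tau^2)

# Design: proportions carry the means, proportions times covariate carry the effects
D <- cbind(W, W * c1)
colnames(D) <- c(paste0("mu_", seq_len(k)), paste0("gamma_", seq_len(k)))

# Scale the response and each design row by 1 / sd_i, then fit ordinary least squares
y  <- x_j / sd_i
Ds <- D / sd_i
fit   <- lm(y ~ 0 + Ds)
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
```

After scaling, the residuals have approximately unit variance for every observation, which is what justifies reading the t-statistics from a standard linear regression.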
A list with the estimated parameters of the model. This list can then be used as the input to other functions such as tcareg.
W |
An |
mus_hat |
An |
sigmas_hat |
An |
tau_hat |
An estimate of the standard deviation of the i.i.d. component of variation in |
gammas_hat |
An |
deltas_hat |
An |
gammas_hat_pvals |
An |
gammas_hat_pvals.joint |
An |
deltas_hat_pvals |
An |
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
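A minimal usage sketch follows. It simulates a small data set in base R and calls tca only if the TCA package is installed. The data sizes and parameter ranges are arbitrary, and the row and column naming scheme is an assumption about what tca expects of its inputs (names on X and W that match across observations); the argument names are those listed in the usage above.

```r
set.seed(4)
n <- 100; m <- 20; k <- 3            # arbitrary sizes

# Simulate features-by-observations data from the basic model (no covariates)
W <- matrix(rexp(n * k), n, k)
W <- W / rowSums(W)
mu    <- matrix(runif(m * k), m, k)
sigma <- matrix(runif(m * k, 0.01, 0.1), m, k)
X <- matrix(NA_real_, m, n)
for (i in seq_len(n)) {
  Z_i <- matrix(rnorm(m * k, mean = mu, sd = sigma), m, k)
  X[, i] <- Z_i %*% W[i, ] + rnorm(m, sd = 0.01)
}

# Assumed naming convention: observation names shared by X (columns) and W (rows)
rownames(X) <- paste0("feature", seq_len(m))
colnames(X) <- rownames(W) <- paste0("obs", seq_len(n))
colnames(W) <- paste0("source", seq_len(k))

# Fit the model only when the package is available
if (requireNamespace("TCA", quietly = TRUE)) {
  mdl <- TCA::tca(X = X, W = W,
                  log_file = tempfile(fileext = ".log"),
                  verbose = FALSE)
  str(mdl$mus_hat)                   # estimated means, one of the list elements described above
}
```

Writing the log to a temporary file avoids leaving a TCA.log file in the working directory.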