ctassoc: Cell-Type-Specific Association Testing

Description Usage Arguments Details Value References See Also Examples


Cell-Type-Specific Association Testing


  C = NULL,
  test = "full",
  regularize = FALSE,
  num.cores = 1,
  chunk.size = 1000,
  seed = 123



Matrix (or vector) of traits; samples x traits.


Matrix of cell type composition; samples x cell types.


Matrix (or vector) of bulk omics measurements; markers x samples.


Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric.


Statistical test to apply; either "full", "marginal", "nls.identity", "nls.log", "nls.logit", "propdiff.identity", "propdiff.log", "propdiff.logit" or "reducedrankridge".


Whether to apply Tikhonov (ie ridge) regularization to β_{h j k}. The regularization parameter is chosen automatically according to an unbiased version of (Lawless & Wang, 1976). Effective for nls.* and propdiff.* tests.


Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored.


The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size.


Seed for random number generation.


Let the indexes be h for cell type, i for sample, j for marker (CpG site or gene), k for each trait that has cell-type-specific effect, and l for each trait that has a uniform effect across cell types. The input data are X_{i k}, C_{i l}, W_{i h} and Y_{j i}, where C_{i l} can be omitted. X_{i k} and C_{i l} are the values for two types of traits, showing effects that are cell-type-specific or not, respectively. Thus, calling X_{i k} and C_{i l} as "traits" and "covariates" gives a rough idea, but is not strictly correct. W_{i h} represents the cell type composition and Y_{j i} represents the marker level, such as methylation or gene expression. For each tissue sample, the cell type proportion W_{i h} is the proportion of each cell type in the bulk tissue, which is measured or imputed beforehand. The marker level Y_{j i} in bulk tissue is measured and provided as input.

The parameters we estimate are the cell-type-specific trait effect β_{h j k}, the tissue-uniform trait effect γ_{j l}, and the basal marker level α_{h j} in each cell type.

We first describe the conventional linear regression models. For marker j in sample i, the maker level specific to cell type h is

α_{h j} + ∑_k β_{h j k} * X_{i k}.

This is a representative value rather than a mean, because we do not model a probability distribution for cell-type-specific expression. The bulk tissue marker level is the average weighted by W_{i h},

μ_{j i} = ∑_h W_{i h} [ α_{h j} + ∑_k β_{h j k} * X_{i k} ] + ∑_l γ_{j l} C_{i l}.

The statistical model is

Y_{j i} = μ_{j i} + ε_{j i},

ε_{j i} ~ N(0, σ^2_j).

The error of the marker level is is noramlly distributed with variance σ^2_j, independently among samples.

The full model is the linear regression

Y_{j i} = (∑_h α_{h j} * W_{i h}) + (∑_{h k} β_{h j k} * W_{i h} * X_{i k}) + (∑_l γ_{j l} * C_{i l}) + error.

The marginal model tests the trait association only in one cell type h, under the linear regression,

Y_{j i} = (∑_{h'} α_{h' j} * W_{i h'}) + (∑_k β_{h j k} * W_{i h} * X_{i k}) + (∑_l γ_{j l} * C_{i l}) + error.

The nonlinear model simultaneously analyze cell type composition in linear scale and differential expression/methylation in log/logit scale. The normalizing function is the natural logarithm f = log for gene expression, and f = logit for methylation. Conventional linear regression can be formulated by defining f as the identity function. The three models are named nls.log, nls.logit and nls.identity. We denote the inverse function of f by g; g = exp for gene expression, and g = logistic for methylation. The mean normalized marker level of marker j in sample i becomes

μ_{j i} = f(∑_h W_{i h} g( α_{h j} + ∑_k β_{h j k} * X_{i k} )) + ∑_l γ_{j l} C_{i l}.

The statistical model is

f(Y_{j i}) = μ_{j i} + ε_{j i},

ε_{j i} ~ N(0, σ^2_j).

The error of the marker level is is noramlly distributed with variance σ^2_j, independently among samples.

The ridge regression aims to cope with multicollinearity of the interacting terms W_{i h} * X_{i k}. Ridge regression is fit by minimizing the residual sum of squares (RSS) plus λ ∑_{h k} β_{h j k}^2, where λ > 0 is the regularization parameter.

The propdiff tests try to cope with multicollinearity by, roughly speaking, using mean-centered W_{i h}. We obtain, instead of β_{h j k}, the deviation of β_{h j k} from the average across cell types. Accordingly, the null hypothesis changes. The original null hypothesis was β_{h j k} = 0. The null hypothesis when centered is β_{h j k} - (∑_{i h'} W_{i h'} β_{h' j k}) / (∑_{i h'} W_{i h'}) = 0. It becomes difficult to detect a signal for a major cell type, because β_{h j k} would be close to the average across cell types. The tests propdiff.log and propdiff.logit include an additional preprocessing step that converts Y_{j i} to f(Y_{j i}). Apart from the preprocessing, the computations are performed in linear scale. As the preprocessing distorts the linearity between the dependent variable and (the centered) W_{i h}, I actually think propdiff.identity is better.


A list with one element, which is named "coefficients". The element gives the estimate, statistic, p.value in tibble format. In order to transform the estimate for α_{h j} to the original scale, apply plogis for test = nls.logit and exp for test = nls.log. The estimate for β_{h j k} by test = nls.log is the natural logarithm of fold-change, not the log2. If numerical convergence fails, NA is returned for that marker.


Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353

See Also




omicwas documentation built on Jan. 13, 2021, 8:48 a.m.