Description Usage Arguments Details Value References See Also Examples
Cell-Type-Specific Association Testing
1 2 3 4 5 6 7 8 9 10 11 |
X |
Matrix (or vector) of traits; samples x traits. |
W |
Matrix of cell type composition; samples x cell types. |
Y |
Matrix (or vector) of bulk omics measurements; markers x samples. |
C |
Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric. |
test |
Statistical test to apply; either |
regularize |
Whether to apply Tikhonov (ie ridge) regularization
to β_{h j k}.
The regularization parameter is chosen automatically according to
an unbiased version of (Lawless & Wang, 1976).
Effective for |
num.cores |
Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored. |
chunk.size |
The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size. |
seed |
Seed for random number generation. |
Let the indexes be h for cell type, i for sample, j for marker (CpG site or gene), k for each trait that has cell-type-specific effect, and l for each trait that has a uniform effect across cell types. The input data are X_{i k}, C_{i l}, W_{i h} and Y_{j i}, where C_{i l} can be omitted. X_{i k} and C_{i l} are the values for two types of traits, showing effects that are cell-type-specific or not, respectively. Thus, calling X_{i k} and C_{i l} as "traits" and "covariates" gives a rough idea, but is not strictly correct. W_{i h} represents the cell type composition and Y_{j i} represents the marker level, such as methylation or gene expression. For each tissue sample, the cell type proportion W_{i h} is the proportion of each cell type in the bulk tissue, which is measured or imputed beforehand. The marker level Y_{j i} in bulk tissue is measured and provided as input.
The parameters we estimate are the cell-type-specific trait effect β_{h j k}, the tissue-uniform trait effect γ_{j l}, and the basal marker level α_{h j} in each cell type.
We first describe the conventional linear regression models. For marker j in sample i, the maker level specific to cell type h is
α_{h j} + ∑_k β_{h j k} * X_{i k}.
This is a representative value rather than a mean, because we do not model a probability distribution for cell-type-specific expression. The bulk tissue marker level is the average weighted by W_{i h},
μ_{j i} = ∑_h W_{i h} [ α_{h j} + ∑_k β_{h j k} * X_{i k} ] + ∑_l γ_{j l} C_{i l}.
The statistical model is
Y_{j i} = μ_{j i} + ε_{j i},
ε_{j i} ~ N(0, σ^2_j).
The error of the marker level is is noramlly distributed with variance σ^2_j, independently among samples.
The full
model is the linear regression
Y_{j i} = (∑_h α_{h j} * W_{i h}) + (∑_{h k} β_{h j k} * W_{i h} * X_{i k}) + (∑_l γ_{j l} * C_{i l}) + error.
The marginal
model tests the trait association only in one
cell type h, under the linear regression,
Y_{j i} = (∑_{h'} α_{h' j} * W_{i h'}) + (∑_k β_{h j k} * W_{i h} * X_{i k}) + (∑_l γ_{j l} * C_{i l}) + error.
The nonlinear model simultaneously analyze cell type composition in
linear scale and differential expression/methylation in log/logit scale.
The normalizing function is the natural logarithm f = log for gene
expression, and f = logit for methylation. Conventional linear regression
can be formulated by defining f as the identity function. The three models
are named nls.log
, nls.logit
and nls.identity
.
We denote the inverse function of f by g; g = exp for
gene expression, and g = logistic for methylation.
The mean normalized marker level of marker j in sample i becomes
μ_{j i} = f(∑_h W_{i h} g( α_{h j} + ∑_k β_{h j k} * X_{i k} )) + ∑_l γ_{j l} C_{i l}.
The statistical model is
f(Y_{j i}) = μ_{j i} + ε_{j i},
ε_{j i} ~ N(0, σ^2_j).
The error of the marker level is is noramlly distributed with variance σ^2_j, independently among samples.
The ridge regression aims to cope with multicollinearity of the interacting terms W_{i h} * X_{i k}. Ridge regression is fit by minimizing the residual sum of squares (RSS) plus λ ∑_{h k} β_{h j k}^2, where λ > 0 is the regularization parameter.
The propdiff tests try to cope with multicollinearity by, roughly speaking,
using mean-centered W_{i h}.
We obtain, instead of β_{h j k}, the deviation of
β_{h j k} from the average across cell types.
Accordingly, the null hypothesis changes.
The original null hypothesis was β_{h j k} = 0.
The null hypothesis when centered is
β_{h j k} - (∑_{i h'} W_{i h'} β_{h' j k}) / (∑_{i h'} W_{i h'}) = 0.
It becomes difficult to detect a signal for a major cell type,
because β_{h j k} would be close to the average across cell types.
The tests propdiff.log
and propdiff.logit
include
an additional preprocessing step that converts Y_{j i} to f(Y_{j i}).
Apart from the preprocessing, the computations are performed in linear scale.
As the preprocessing distorts the linearity between the dependent variable
and (the centered) W_{i h},
I actually think propdiff.identity
is better.
A list with one element, which is named "coefficients".
The element gives the estimate, statistic, p.value in tibble format.
In order to transform the estimate for α_{h j} to the original scale,
apply plogis
for test = nls.logit
and
exp
for test = nls.log
.
The estimate for β_{h j k} by test = nls.log
is
the natural logarithm of fold-change, not the log2.
If numerical convergence fails, NA
is returned for that marker.
Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353
ctcisQTL
1 2 3 4 5 6 7 | data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.