Description Usage Arguments Details Value References See Also Examples

Cell-Type-Specific Association Testing

1 2 3 4 5 6 7 8 9 10 11 |

`X` |
Matrix (or vector) of traits; samples x traits. |

`W` |
Matrix of cell type composition; samples x cell types. |

`Y` |
Matrix (or vector) of bulk omics measurements; markers x samples. |

`C` |
Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric. |

`test` |
Statistical test to apply; either |

`regularize` |
Whether to apply Tikhonov (ie ridge) regularization
to |

`num.cores` |
Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored. |

`chunk.size` |
The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size. |

`seed` |
Seed for random number generation. |

Let the indexes be
*h* for cell type, *i* for sample,
*j* for marker (CpG site or gene),
*k* for each trait that has cell-type-specific effect,
and *l* for each trait that has a uniform effect across cell types.
The input data are *X_{i k}*, *C_{i l}*, *W_{i h}* and *Y_{j i}*,
where *C_{i l}* can be omitted.
*X_{i k}* and *C_{i l}* are the values for two types of traits,
showing effects that are cell-type-specific or not, respectively.
Thus, calling *X_{i k}* and *C_{i l}* as "traits" and "covariates"
gives a rough idea, but is not strictly correct.
*W_{i h}* represents the cell type composition and
*Y_{j i}* represents the marker level,
such as methylation or gene expression.
For each tissue sample, the cell type proportion *W_{i h}*
is the proportion of each cell type in the bulk tissue,
which is measured or imputed beforehand.
The marker level *Y_{j i}* in bulk tissue is measured and provided as input.

The parameters we estimate are
the cell-type-specific trait effect *β_{h j k}*,
the tissue-uniform trait effect *γ_{j l}*,
and the basal marker level *α_{h j}* in each cell type.

We first describe the conventional linear regression models.
For marker *j* in sample *i*,
the maker level specific to cell type *h* is

*α_{h j} + ∑_k β_{h j k} * X_{i k}.*

This is a representative value rather than a mean, because we do not model
a probability distribution for cell-type-specific expression.
The bulk tissue marker level is the average weighted by *W_{i h}*,

*μ_{j i} = ∑_h W_{i h} [ α_{h j} + ∑_k β_{h j k} * X_{i k} ] +
∑_l γ_{j l} C_{i l}.*

The statistical model is

*Y_{j i} = μ_{j i} + ε_{j i},*

*ε_{j i} ~ N(0, σ^2_j).*

The error of the marker level is is noramlly distributed with variance
*σ^2_j*, independently among samples.

The `full`

model is the linear regression

*Y_{j i} = (∑_h α_{h j} * W_{i h}) +
(∑_{h k} β_{h j k} * W_{i h} * X_{i k}) +
(∑_l γ_{j l} * C_{i l}) +
error.*

The `marginal`

model tests the trait association only in one
cell type *h*, under the linear regression,

*Y_{j i} = (∑_{h'} α_{h' j} * W_{i h'}) +
(∑_k β_{h j k} * W_{i h} * X_{i k}) +
(∑_l γ_{j l} * C_{i l}) +
error.*

The nonlinear model simultaneously analyze cell type composition in
linear scale and differential expression/methylation in log/logit scale.
The normalizing function is the natural logarithm *f* = log for gene
expression, and *f* = logit for methylation. Conventional linear regression
can be formulated by defining *f* as the identity function. The three models
are named `nls.log`

, `nls.logit`

and `nls.identity`

.
We denote the inverse function of *f* by *g*; *g* = exp for
gene expression, and *g* = logistic for methylation.
The mean normalized marker level of marker *j* in sample *i* becomes

*μ_{j i} = f(∑_h W_{i h} g( α_{h j} + ∑_k β_{h j k} * X_{i k} )) +
∑_l γ_{j l} C_{i l}.*

The statistical model is

*f(Y_{j i}) = μ_{j i} + ε_{j i},*

*ε_{j i} ~ N(0, σ^2_j).*

The error of the marker level is is noramlly distributed with variance
*σ^2_j*, independently among samples.

The ridge regression aims to cope with multicollinearity of
the interacting terms *W_{i h} * X_{i k}*.
Ridge regression is fit by minimizing the residual sum of squares (RSS) plus
*λ ∑_{h k} β_{h j k}^2*, where *λ > 0* is the
regularization parameter.

The propdiff tests try to cope with multicollinearity by, roughly speaking,
using mean-centered *W_{i h}*.
We obtain, instead of *β_{h j k}*, the deviation of
*β_{h j k}* from the average across cell types.
Accordingly, the null hypothesis changes.
The original null hypothesis was *β_{h j k} = 0*.
The null hypothesis when centered is
*β_{h j k} - (∑_{i h'} W_{i h'} β_{h' j k}) / (∑_{i h'} W_{i h'}) = 0*.
It becomes difficult to detect a signal for a major cell type,
because *β_{h j k}* would be close to the average across cell types.
The tests `propdiff.log`

and `propdiff.logit`

include
an additional preprocessing step that converts *Y_{j i}* to *f(Y_{j i})*.
Apart from the preprocessing, the computations are performed in linear scale.
As the preprocessing distorts the linearity between the dependent variable
and (the centered) *W_{i h}*,
I actually think `propdiff.identity`

is better.

A list with one element, which is named "coefficients".
The element gives the estimate, statistic, p.value in tibble format.
In order to transform the estimate for *α_{h j}* to the original scale,
apply `plogis`

for `test = nls.logit`

and
`exp`

for `test = nls.log`

.
The estimate for *β_{h j k}* by `test = nls.log`

is
the natural logarithm of fold-change, not the log2.
If numerical convergence fails, `NA`

is returned for that marker.

Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353

ctcisQTL

1 2 3 4 5 6 7 | ```
data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients
``` |

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.