View source: R/simDat_scRNAseq_eQTL.R
simDat.eQTL.scRNAseq | R Documentation |
Generate gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on ZINB mixed effects regression model.
simDat.eQTL.scRNAseq(nSubj = 50,
nCellPerSubj = 100,
zero.p = 0.01,
m.int = 0,
sigma.int = 1,
slope = 1,
theta = 1,
MAF = 0.45)
nSubj |
integer. Total number of subjects. |
nCellPerSubj |
integer. Number of cells per subject. |
zero.p |
numeric. Probability that an excess zero occurs. |
m.int |
numeric. Mean of random intercept (see details). |
sigma.int |
numeric. Standard deviation of random intercept (see details). |
slope |
numeric. Slope (see details). |
theta |
numeric. dispersion parameter of negative binomial distribution.
The smaller |
MAF |
numeric. Minor allele frequency of the SNP. |
This function simulates gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on zero-inflated negative binomial (ZINB) regression model with only one covariate: genotype. That is, the read counts of a gene follows a mixture of 2-component distributions. One component takes only one value: zero. The other component is negative binomial distribution, which takes non-negative values 0, 1, 2, .... The log mean of the negative binomial distribution is linear function of the genotype.
Denote Y_{ij}
as the read counts for the j
-th cell of
the i
-th subject, i=1,\ldots, n
, j=1,\ldots, m
,
n
is the number of subjects, and m
is the number of cells per subject.
Denote p
as the probability that Y_{ij}=0
is an excess zero.
With probability 1-p
, Y_{ij}
follows a negative binomial distribution NB(\mu, \theta)
, where \mu
is the mean (i.e., \mu=E(Y_{ij})
) and \theta
is the dispersion parameter.
The variance of the NB distribution is \mu + \mu^2/\theta
.
The relationship between gene expression and genotype for the i
-th subject is characterized by the equation
\mu_i = \exp(\beta_{0i} + \beta_{1} x_i),
where \beta_{0i}
is the random intercept following a normal
distribution N(\beta_0, \sigma^2)
to account for within-subject correlation of gene expression, \beta_0
is the mean of the random intercept, \sigma
is the standard deviation of the random intercept, \beta_1
is the slope, and x_i
is the additive-coded genotype for the SNP with minor allele frequency MAF
.
We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the
probabilities of the 3 genotypes (0, 1, 2)
are (1-MAF)^2,
2 MAF (1-MAF), MAF^2
, respectively.
For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.
A data frame with 3 columns:
id |
subject id |
geno |
additive-coded genotype of the SNP |
counts |
gene expression of the gene |
Xianjun Dong <XDONG@rics.bwh.harvard.edu>, Xiaoqi Li<xli85@bwh.harvard.edu>, Tzuu-Wang Chang <Chang.Tzuu-Wang@mgh.harvard.edu>, Scott T. Weiss <restw@channing.harvard.edu>, Weiliang Qiu <weiliang.qiu@gmail.com>
Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021;, btab385
frame = simDat.eQTL.scRNAseq(nSubj = 5,
nCellPerSubj = 3,
zero.p = 0.01,
m.int = 0,
sigma.int = 1,
slope = 1,
theta = 1,
MAF = 0.45)
print(dim(frame))
print(frame[1:10,])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.