View source: R/simDat_scRNAseq_eQTL.R
| simDat.eQTL.scRNAseq | R Documentation | 
Generate gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on ZINB mixed effects regression model.
simDat.eQTL.scRNAseq(nSubj = 50, 
		     nCellPerSubj = 100, 
		     zero.p = 0.01, 
		     m.int = 0, 
		     sigma.int = 1, 
		     slope = 1, 
		     theta = 1, 
		     MAF = 0.45)
| nSubj | integer. Total number of subjects. | 
| nCellPerSubj | integer. Number of cells per subject. | 
| zero.p | numeric. Probability that an excess zero occurs. | 
| m.int | numeric. Mean of random intercept (see details). | 
| sigma.int | numeric. Standard deviation of random intercept (see details). | 
| slope | numeric. Slope (see details). | 
| theta | numeric. dispersion parameter of negative binomial distribution.
The smaller  | 
| MAF | numeric. Minor allele frequency of the SNP. | 
This function simulates gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on zero-inflated negative binomial (ZINB) regression model with only one covariate: genotype. That is, the read counts of a gene follows a mixture of 2-component distributions. One component takes only one value: zero. The other component is negative binomial distribution, which takes non-negative values 0, 1, 2, .... The log mean of the negative binomial distribution is linear function of the genotype.
Denote Y_{ij} as the read counts for the j-th cell of
the i-th subject, i=1,\ldots, n, j=1,\ldots, m,
n is the number of subjects, and m is the number of cells per subject.
Denote p as the probability that Y_{ij}=0 is an excess zero.
With probability 1-p, Y_{ij} follows a negative binomial distribution NB(\mu, \theta), where \mu is the mean (i.e., \mu=E(Y_{ij})) and \theta is the dispersion parameter.
The variance of the NB distribution is \mu + \mu^2/\theta.
The relationship between gene expression and genotype for the i-th subject is characterized by the equation 
\mu_i = \exp(\beta_{0i} + \beta_{1} x_i),
where \beta_{0i} is the random intercept following a normal
distribution N(\beta_0, \sigma^2) to account for within-subject correlation of gene expression, \beta_0 is the mean of the random intercept, \sigma is the standard deviation of the random intercept, \beta_1 is the slope, and x_i is the additive-coded genotype for the SNP with minor allele frequency MAF.
We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the
probabilities of the 3 genotypes (0, 1, 2) are (1-MAF)^2,
		2 MAF (1-MAF), MAF^2, respectively.
For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.
A data frame with 3 columns:
| id | subject id | 
| geno | additive-coded genotype of the SNP | 
| counts | gene expression of the gene | 
Xianjun Dong <XDONG@rics.bwh.harvard.edu>, Xiaoqi Li<xli85@bwh.harvard.edu>, Tzuu-Wang Chang <Chang.Tzuu-Wang@mgh.harvard.edu>, Scott T. Weiss <restw@channing.harvard.edu>, Weiliang Qiu <weiliang.qiu@gmail.com>
Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021;, btab385
frame = simDat.eQTL.scRNAseq(nSubj = 5, 
		     nCellPerSubj = 3, 
		     zero.p = 0.01, 
		     m.int = 0, 
		     sigma.int = 1, 
		     slope = 1, 
		     theta = 1, 
		     MAF = 0.45)
print(dim(frame))
print(frame[1:10,])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.