powerEQTL.scRNAseq.sim: Power Calculation for Association Between Genotype and Gene...

Description Usage Arguments Details Value Note Author(s) References Examples

View source: R/powerEQTL.scRNAseq.sim.R

Description

Power calculation for association between genotype and gene expression based on single cell RNAseq data via simulation from ZINB mixed effects regression model. This function can be used to calculate one of the 4 parameters (power, sample size, minimum detectable slope, and minimum allowable MAF) by setting the corresponding parameter as NULL and providing values for the other 3 parameters.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
powerEQTL.scRNAseq.sim(slope, 
		       n, 
		       m, 
                       power = NULL,
		       m.int = -1, 
		       sigma.int = 1, 
		       zero.p = 0.1, 
		       theta = 1, 
		       MAF = 0.2, 
		       FWER = 0.05, 
		       nTests = 1, 
		       nSim = 1000, 
		       estMethod = "GLMMadaptive", 
		       nCores = 1,
                       n.lower = 2.01,
                       n.upper = 1e+4,
                       slope.lower =  1e-6,
                       slope.upper = log(1.0e+6),
                       MAF.lower = 0.05,
                       MAF.upper = 0.49
)

Arguments

slope

numeric. Slope (see details).

n

integer. Total number of subjects.

m

integer. Number of cells per subject.

power

numeric. Power for testing if the slope is equal to zero.

m.int

numeric. Mean of random intercept (see details).

sigma.int

numeric. Standard deviation of random intercept (see details).

zero.p

numeric. Probability that an excess zero occurs.

theta

numeric. Dispersion parameter of negative binomial distribution. The smaller theta is, the larger variance of NB random variable is.

MAF

numeric. Minor allele frequency of the SNP.

FWER

numeric. Family-wise Type I error rate.

nTests

integer. Number of tests (i.e., number of all (SNP, gene) pairs) in eQTL analysis.

nSim

integer. Number of simulated datasets to be generated.

estMethod

character. Indicates which method would be used to fit zero inflated negative binomial mixed effects model. Currently, the possible choice is “GLMMadaptive”.

nCores

integer. Number of computer cores used by mclapply for parallel computing. For Windows, nCores=1.

n.lower

numeric. Lower bound of the total number of subjects. Only used when calculating total number of subjects.

n.upper

numeric. Upper bound of the total number of subjects. Only used when calculating total number of subjects.

slope.lower

numeric. Lower bound of the slope. Only used when calculating minimum slope.

slope.upper

numeric. Upper bound of the slope Only used when calculating minimum slope.

MAF.lower

numeric. Lower bound of the MAF. Only used when calculating minimum MAF.

MAF.upper

numeric. Upper bound of the MAF Only used when calculating minimum MAF.

Details

This function calculates the power for testing if genotypes of a SNP is associated with the expression of a gene via nSim datasets generated from zero-inflated negative binomial (ZINB) regression model.

Each dataset is generated from zero-inflated negative binomial mixed effects regression model with only one covariate: genotype. That is, the read counts of a gene follows a mixture of 2-component distributions. One component takes only one value: zero. The other component is negative binomial distribution, which takes non-negative values 0, 1, 2, .... The log mean of the negative binomial distribution is linear function of the genotype.

For each dataset, the p-value for testing if the slope for genotype is equal to zero will be calculated.

The proportion of p-values < α is the estimated power, where α = FWER/nTests.

Each simulated dataset contains gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells. The gene expression levels (read counts) follow zero-inflated negative binomial distribution. Denote Y_{ij} as the read counts for the j-th cell of the i-th subject, i=1,…, n, j=1,…, m, n is the number of subjects, and m is the number of cells per subject. Denote p as the probability that Y_{ij}=0 is an excess zero. With probability 1-p, Y_{ij} follows a negative binomial distribution NB(μ, θ), where μ is the mean (i.e., μ=E(Y_{ij})) and θ is the dispersion parameter. The variance of the NB distribution is μ + μ^2/θ. The relationship between gene expression and genotype for the i-th subject is characterized by the equation

μ_i = \exp(β_{0i} + β_{1} x_i),

where β_{0i} is the random intercept following a normal distribution N(β_0, σ^2) to account for within-subject correlation of gene expression, β_0 is the mean of the random intercept, σ is the standard deviation of the random intercept, β_1 is the slope, and x_i is the additive-coded genotype for the SNP with minor allele frequency MAF.

We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the probabilities of the 3 genotypes (0, 1, 2) are (1-MAF)^2, 2 MAF (1-MAF), MAF^2, respectively.

For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.

Value

power if the input parameter power = NULL.

sample size (total number of subjects) if the input parameter n = NULL;

minimum detectable slope if the input parameter slope = NULL;

minimum allowable MAF if the input parameter MAF = NULL.

Note

The speed of simulation approach is slow. It is recommended to run the function powerEQTL.scRNAseq.sim in parallel computing environment (e.g., Unix/Linux) and set nCores > 1. Also, it is important to set appropriate ranges to search sample size, minimum detectable slope, or minimum allowable MAF. If the function (e.g., GLMMadaptive) to fit data returns an error message for a simulated dataset, then the function powerEQTL.scRNAseq.sim will try a few more runs with different simulated datasets. If all the runs failed, zero (for power estimation) or NA or NaN (for estimating slope, sample size, or MAF) would be output. It is recommended to set nSim >= 1000 to get stable results.

Author(s)

Xianjun Dong <XDONG@rics.bwh.harvard.edu>, Xiaoqi Li<xli85@bwh.harvard.edu>, Tzuu-Wang Chang <Chang.Tzuu-Wang@mgh.harvard.edu>, Scott T. Weiss <restw@channing.harvard.edu>, Weiliang Qiu <weiliang.qiu@gmail.com>

References

Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021;, btab385

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
nSubj = 10
nCellPerSubj = 10

# calculate power
power = powerEQTL.scRNAseq.sim(
  slope = 1.62, # slope
  n = nSubj, # total number of subjects
  m = nCellPerSubj, # number of cells per subject 
  power = NULL, # power to be estimated
  m.int = -1, # mean of random intercept
  sigma.int = 1, # SD of the random intercept
  zero.p = 0.01, # probability that an excess zero occurs
  theta = 1, # dispersion parameter of NB distribution NB(mu, theta)
  MAF = 0.45, 
  FWER = 0.05,
  nTests = 1,
  nSim  =  5, # number of simulations
  estMethod = "GLMMadaptive", # parameter estimation method for ZINB
  nCores = 1 # number of computer cores used by 'mclapply'
)

print(power)

powerEQTL documentation built on July 22, 2021, 9:08 a.m.