In suke18/POWSC: Simulation, power evaluation, and sample size recommendation for single cell RNA-seq

\tableofcontents

\vspace{.1in}

Installation and quick start

Installation

To install this package, start R (version > "4.1") and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("POWSC")

Quick start on using POWSC

We use one scRNA-seq template data about ES-MEF cell types.

suppressMessages(library(POWSC))
data("es_mef_sce")
sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"]
set.seed(12)
rix = sample(1:nrow(sce), 500)
sce = sce[rix, ]
est_Paras = Est2Phase(sce)
sim_size = c(100, 200) # A numeric vector
pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW")

Background

Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage for high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq under the context of differential expression, this topic in the field of single-cell RNA sequencing is understudied. Moreover, the unique data characteristics present in scRNA-seq including sparsity and heterogeneity gain the challenge.

Introduction

POWSC is an R package designed for power assessment and sample size estimation in scRNA-seq experiment. It contains three main functionalities: (1). Parameter Estimation: adopted and modified from the core of SC2P. (2). Data Simulation: consider two cases: paired-wise comparison and multiple comparisons. (3). Power Evaluation: provide both stratified (detailed) and marginal powers

Use POWSC

In the context of differential expression (DE) analysis, scientists are usually interested in two different scenarios: (1) within cell type: comparing the same cell types across biological conditions such as case vs. control, which reveals the expression change of a particular cell type under different contexts. (2) between cell types: comparing different cell types under the same condition, which identifies biomarkers to distinguish cell types. In either case, the experiment starts from a number of cells randomly picked from a tissue sample consisting of a mixture of different cell types. The only factor one can control is the total number of cells.

For two-group comparison

In the first scenario, the numbers of cells for a particular cell type in different biological conditions are often similar, barring significant changes in cell composition. It uses one cell type as the benchmark data and perturbs the genes with mixture proportion ($\pi$) and log fold changes ($lfc$) as the DE genes according two forms.

# Users can customize how many cells they need as to change the number of n. 
simData = Simulate2SCE(n=100, estParas1 = est_Paras, estParas2 = est_Paras)
de = runDE(simData$sce, DE_Method = "MAST")
estPower1 = Power_Disc(de, simData = simData)
estPower2 = Power_Cont(de, simData = simData)

For multiple-group comparisons

In the second scenario, the numbers of cells for distinct cell types can be very different, so the power for DE highly depends on the mixing proportions. Here, we use 1000*57 expression matrix sampled from the patient ID AB_S7 from the GSE67835 data. The original data GSE67835 can be download here: https://www.dropbox.com/s/55zdktqfqiwfs3l/GSE67835.RData?dl=0.

data("GSE67835_AB_S7_sce")
sim_size = 200 # use a large sample is preferrable
cell_per = c(0.2, 0.3, 0.5)
col = colData(sce)
exprs = assays(sce)$counts
(tb = table(colData(sce)$Patients, colData(sce)$cellTypes))
# use AB_S7 patient as example and take three cell types: astrocytes hybrid and neurons
estParas_set = NULL
celltypes = c("oligodendrocytes", "hybrid", "neurons")
for (cp in celltypes){
    print(cp)
    ix = intersect(grep(cp, col$cellTypes), grep("AB_S7", col$Patients))
    tmp_mat = exprs[, ix]
    tmp_paras = Est2Phase(tmp_mat)
    estParas_set[[cp]] = tmp_paras
}
######### 
#########  Simulation part
######### 
sim = SimulateMultiSCEs(n = sim_size, estParas_set = estParas_set, multiProb = cell_per)

######### 
#########  DE analysis part
######### 
DE_rslt = NULL
for (comp in names(sim)){
    tmp = runDE(sim[[comp]]$sce, DE_Method = "MAST")
    DE_rslt[[comp]] = tmp
}

######### 
######### Summarize the power result
######### 
pow_rslt = pow1 = pow2 = pow1_marg = pow2_marg = NULL
TD = CD = NULL
for (comp in names(sim)){
    tmp1 = Power_Disc(DE_rslt[[comp]], sim[[comp]])
    tmp2 = Power_Cont(DE_rslt[[comp]], sim[[comp]])
    TD = c(TD, tmp2$TD); CD = c(CD, tmp2$CD)
    pow1_marg = c(pow1_marg, tmp1$power.marginal)
    pow2_marg = c(pow2_marg, tmp2$power.marginal)
    pow_rslt[[comp]] = list(pow1 = tmp1, pow2 = tmp2)
    pow1 = rbind(pow1, tmp1$power)
    pow2 = rbind(pow2, tmp2$power)
}

######### 
######### Demonstrate the result by heatmap
######### 
library(RColorBrewer); library(pheatmap)
breaksList = seq(0, 1, by = 0.01)
colors = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList))
dimnames(pow1) = list(names(sim), names(tmp1$CD))
dimnames(pow2) = list(names(sim), names(tmp2$CD))
pheatmap(pow2, display_numbers = TRUE, color=colors, show_rownames = TRUE,
         cellwidth = 30, cellheight = 40, legend = TRUE,
         border_color = "grey96", na_col = "grey",
         cluster_row = FALSE, cluster_cols = FALSE,
         breaks = seq(0, 1, 0.01),
         main = "")