knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
You can install scPOST from GitHub with devtools:
library(devtools)
devtools::install_github(repo = "immunogenomics/scpost")
library(scpost)
To simulate data, scPOST uses an input prototype dataset (such as public or pilot data). scPOST estimates two types of variation often found in multi-sample single-cell data:
- Cell state frequency variation: how the frequencies of cell states vary and covary across samples
- Gene expression variation: how gene expression (in principal component space) varies across samples and batches
scPOST requires the following inputs for each cell:
- A cell state (cluster) assignment
- The sample the cell came from
- The batch the cell was processed in
- Low-dimensional embedding coordinates (e.g., principal components)
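As a minimal sketch of the expected structure (toy values and hypothetical column names, not the RA data used below), the per-cell inputs might look like this:

# Hypothetical per-cell metadata: one row per cell
toy_meta <- data.frame(
    cellID  = paste0("cell", 1:6),
    sample  = rep(c("sample1", "sample2"), each = 3),      # sample of origin
    batch   = rep(c("batch1", "batch2"), length.out = 6),  # processing batch
    cluster = rep(c("clus0", "clus1"), times = 3)          # cell state assignment
)
# Hypothetical low-dimensional embeddings: one row per cell, one column per PC
toy_pcs <- matrix(rnorm(6 * 20), nrow = 6,
                  dimnames = list(toy_meta$cellID, paste0("PC", 1:20)))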
Let's take a look at a prototype dataset that we will apply scPOST to.
ra_FibObj$meta %>% str
ra_FibObj$embeddings %>% head(2)
In the metadata, the columns we will use are:
- sample: the sample each cell came from
- batch: the batch each cell was processed in
- clusOnlyFib: each cell's cluster (cell state) assignment
In the embeddings, we have each cell's principal component (PC) coordinates; this tutorial uses the first 20 PCs.
This tutorial will run through the scPOST framework, which comprises 3 steps.
Step 1: Parameter estimation
Step 2: Dataset simulation
Step 3: Association testing for cell state frequency shifts (e.g. expansion or deletion)
Here, we will simulate fibroblast data from the rheumatoid arthritis (RA) dataset described in Zhang F, Wei K, Slowikowski K, Fonseka C, Rao DA, et al., Nature Immunol (2020). The input prototype dataset we will use is contained in "ra_FibObj".
Through cytometry, the authors identified an HLA-DRA+ fibroblast cell state that was expanded in inflammatory RA samples compared to osteoarthritis samples. However, they were unable to detect this expansion in their single-cell RNA sequencing (scRNA-seq) data due to sample size (n = 21; 17 cases, 4 controls).
Let's use scPOST to find study design changes that might increase power.
With the "estimateFreqVar" function, we estimate each cell state's mean frequency and covariance across samples
raFib_freqEstimates <- estimateFreqVar(meta = ra_FibObj$meta, clusCol = 'clusOnlyFib', sampleCol = 'sample', logCov = TRUE) raFib_freqEstimates %>% str
With the "estimatePCVar" function, we estimate each cell state's gene expression variation. We use linear mixed effects models to estimate this variance, which we can deconvolute into different sources of variance. From these models, we estimate the following for each cell state:
Computation time will depend on the size of the input data. For larger datasets, we recommend saving the aforementioned estimates so that they can be reused in future simulations.
raFib_pcEstimates <- estimatePCVar(
    pca = ra_FibObj$embeddings,
    npcs = 20,
    meta = ra_FibObj$meta,
    clusCol = 'clusOnlyFib',
    sampleCol = 'sample',
    batchCol = 'batch'
)
raFib_pcEstimates %>% str(1)

# If desired, save these estimates for future use
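As a minimal sketch (the file names here are illustrative), the estimates can be saved with base R's saveRDS and reloaded later with readRDS:

# Optional: save the estimates so they can be reused in future simulations
saveRDS(raFib_freqEstimates, file = "raFib_freqEstimates.rds")
saveRDS(raFib_pcEstimates, file = "raFib_pcEstimates.rds")

# In a later session, reload them with:
# raFib_freqEstimates <- readRDS("raFib_freqEstimates.rds")
# raFib_pcEstimates <- readRDS("raFib_pcEstimates.rds")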
Next, we'll generate in silico fibroblast datasets and use MASC (Mixed-effects Association of Single Cells) to test whether we can detect an expansion or depletion of a cluster in our simulated data.
For users who wish to estimate power to detect differential abundance, we recommend the "simDataset.withMASC" function, which performs Steps 2 and 3 together and streamlines file naming and handling. Users who only wish to simulate a dataset can use the "simDataset.base" function.
Important: before running simulations, create a folder where the simulation results will be saved.
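For example, a folder matching the "save_path" used later in this tutorial can be created with base R:

# Create the save folder (recursive so intermediate directories are created too)
save_dir <- file.path(getwd(), "scpostSims/gettingStarted")
dir.create(save_dir, recursive = TRUE, showWarnings = FALSE)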
The cohort of the original RA scRNA-seq data consisted of 17 cases and 4 controls. Here, we'll simulate 5 datasets that each have 20 samples (17 cases and 3 controls).
Let's set the number of samples, the number of cells per sample, and the batch structure.
set.seed(23)

# Set the number of samples, number of cells per sample, and create batch structure
ncases <- 17
nctrls <- 3
nbatches <- 4
batchStructure <- distribSamples(ncases = ncases, nctrls = nctrls, nbatches = nbatches)
ncells <- rep(250, times = ncases + nctrls)
names(ncells) <- batchStructure$sample_names
batchStructure %>% str
Next, we'll set up a parameter table with the "createParamTable" function that we'll use to run multiple simulations:
- We'll simulate realistic levels of variation by setting "b_scale", "s_scale", and "cf_scale" equal to 1.
- We'll induce a fold-change of 5 in "clus0", the HLA-DRA+ fibroblast cell state.
- We'll set up a folder where we will save our results.
params <- createParamTable(
    nreps = 5,
    clus = "clus0",
    fc = 5,
    ncases = ncases,
    nctrls = nctrls,
    nbatches = nbatches,
    b_scale = 1,
    s_scale = 1,
    cf_scale = 1,
    res_use = 0.6,
    cond_induce = "cases",
    save_path = file.path(getwd(), "scpostSims/gettingStarted/")
)
params %>% head(2)
The "simDataset.MASC" function will simulate datasets and save them as a list containing: - Metadata information for each simulated cell - p-values from applying MASC to test for differential abundance. We obtain a p-value for each simulated cell state. - For larger simulations, we recommend increasing "mc.cores" to make use of multiple cores if possible
suppressWarnings({
    lapply(seq(nrow(params)), function(x){
        simDataset.withMASC(
            save_path = params[x, 'save_path'],
            rep = params[x, 'rep'],
            seed = params[x, 'seed'],
            ncases = params[x, 'ncases'],
            nctrls = params[x, 'nctrls'],
            nbatches = params[x, 'nbatches'],
            batchStructure = batchStructure,
            ncells = ncells,
            centroids = raFib_pcEstimates$centroids,
            pc_cov_list = raFib_pcEstimates$pc_cov_list,
            batch_vars = raFib_pcEstimates$batch_vars,
            b_scale = params[x, 'b_scale'],
            sample_vars = raFib_pcEstimates$sample_vars,
            s_scale = params[x, 's_scale'],
            cfcov = raFib_freqEstimates$cfcov,
            cf_scale = params[x, 'cf_scale'],
            meanFreqs = raFib_freqEstimates$meanFreq,
            clus = params[x, 'clus'],
            fc = params[x, 'fc'],
            cond_induce = params[x, 'cond_induce'],
            res_use = params[x, 'res_use'],
            mc.cores = 1,
            clusterData = TRUE,
            returnPCs = FALSE
        )
    })
})
Now, let's simulate 20-sample datasets with almost the same study design, but with a more balanced split of cases and controls: 10 cases and 10 controls.
set.seed(23)

# Set the number of samples, number of cells per sample, and create batch structure
ncases <- 10
nctrls <- 10
nbatches <- 4
batchStructure <- distribSamples(ncases = ncases, nctrls = nctrls, nbatches = nbatches)
ncells <- rep(250, times = ncases + nctrls)
names(ncells) <- batchStructure$sample_names

params <- createParamTable(
    nreps = 5,
    clus = "clus0",
    fc = 5,
    ncases = ncases,
    nctrls = nctrls,
    nbatches = nbatches,
    b_scale = 1,
    s_scale = 1,
    cf_scale = 1,
    res_use = 0.6,
    cond_induce = "cases",
    save_path = file.path(getwd(), "scpostSims/gettingStarted/")
)
params %>% head(2)
suppressWarnings({
    lapply(seq(nrow(params)), function(x){
        simDataset.withMASC(
            save_path = params[x, 'save_path'],
            rep = params[x, 'rep'],
            seed = params[x, 'seed'],
            ncases = params[x, 'ncases'],
            nctrls = params[x, 'nctrls'],
            nbatches = params[x, 'nbatches'],
            batchStructure = batchStructure,
            ncells = ncells,
            centroids = raFib_pcEstimates$centroids,
            pc_cov_list = raFib_pcEstimates$pc_cov_list,
            batch_vars = raFib_pcEstimates$batch_vars,
            b_scale = params[x, 'b_scale'],
            sample_vars = raFib_pcEstimates$sample_vars,
            s_scale = params[x, 's_scale'],
            cfcov = raFib_freqEstimates$cfcov,
            cf_scale = params[x, 'cf_scale'],
            meanFreqs = raFib_freqEstimates$meanFreq,
            clus = params[x, 'clus'],
            fc = params[x, 'fc'],
            cond_induce = params[x, 'cond_induce'],
            res_use = params[x, 'res_use'],
            mc.cores = 1,
            clusterData = TRUE,
            returnPCs = FALSE
        )
    })
})
Now that we've simulated data and performed association testing, we can calculate our estimated power.
We only simulated 5 replicates for each study design; to achieve a more accurate estimate, we recommend running many more replicates. Here, we load the results tables from the MASC analysis (named "res" in the saved list that was created by the "simDataset.withMASC" function).
dir <- file.path(getwd(), "scpostSims/gettingStarted")
filenames <- list.files(path = dir, full.names = TRUE, pattern = 'res') %>% basename
resTables <- lapply(filenames, function(x){
    readRDS(file.path(dir, x))[["res"]]
})
If you induced a fold change in multiple cell states, you can stratify the power results by cell state (instead of aggregating power over all cell states). Here, we only induced a fold change in the HLA-DRA+ fibroblast cell state, so that is not necessary.
getPowerFromRes(
    resFiles = filenames,
    resTables = resTables,
    threshold = 0.05,
    z = 1.96,
    stratByClus = FALSE
)
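If we had induced fold changes in multiple cell states, the per-state breakdown would be obtained by flipping the flag (same inputs as above):

# Stratify power results by cell state instead of aggregating across states
getPowerFromRes(
    resFiles = filenames,
    resTables = resTables,
    threshold = 0.05,
    z = 1.96,
    stratByClus = TRUE
)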
From this small number of simulations, we estimated that the unbalanced study design would have 0% power. However, by making the proportion of cases and controls more balanced, we see an estimated power of 80%. This example showcases how scPOST can be used to evaluate how changing a study design might affect power.
For more accurate estimates, we recommend running higher numbers of simulations. In Millard et al., we ran 500 simulations for each of these study designs, and estimated 12% power for the unbalanced study design and 60% power for the balanced study design.
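As a rough illustration of why more replicates help (this treats the power estimate as a binomial proportion, which is our assumption and not a statement about how "getPowerFromRes" computes its intervals), the half-width of a normal-approximation 95% confidence interval around an estimated power shrinks as the number of replicates grows:

# Half-width of a normal-approximation 95% CI for an estimated power p over n replicates
ci_halfwidth <- function(p, n, z = 1.96) z * sqrt(p * (1 - p) / n)

ci_halfwidth(p = 0.12, n = 500)  # ~0.03 with 500 replicates
ci_halfwidth(p = 0.80, n = 5)    # ~0.35 with only 5 replicates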
sessionInfo()