all_times <- list() # store the time for each chunk knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { now <<- Sys.time() } else { res <- difftime(Sys.time(), now, units = "secs") all_times[[options$label]] <<- res } } })) knitr::opts_chunk$set( tidy = TRUE, tidy.opts = list(width.cutoff = 95), message = FALSE, warning = FALSE, time_it = TRUE )
In this article we overview each scPipeline module, and provide details on how to specify parameters for analysis. For each module, we provide a description, example reports and documentation for implementation (i.e., parameter specification).
Prior to running any scPipeline module, please visit our getting started with scPipeline article to setup up scPipeline.
In general, our convention is to store all analysis parameters in a list named parameter.list
at the start of the script.
# example parameter.list <- list( input.file = "../data/demo.rds", ..., print.inline = F, save.flag = T )
In all scPipeline modules, parameter.list
is already specified, so users need only to configure the analysis parameters (e.g., specify input data, set cluster resolution, etc.) prior to running the module. In general, the default settings will yield reasonable results, however a little time spent optimizing parameters will often yield better outcomes.
Initial appraisal of scRNA-seq count matrix data and overview of quality control metrics, data filtering, and normalization. Generates initial UMAP clusters, and PCA analyses. A preprocessed seurat object is saved as output.
Quality control: Cells are filtered based on genes/cell recovered and fraction of counts belonging to mitochondrial genes. Injured/broken cells are characterized by low gene recovery and high mitochondrial content whereas doublets are characterized by extremely high gene recovery. Based on these behaviors, high quality non-injured single cells can be retained for downstream analysis using the gene.upperlimit
, gene.lowerlimit
, and mt.upperlimit
analysis parameters.
Count Normalization: Count matrices are normalized using SCTransform. In brief, the count data are modeled using regularized negative binomial regression to remove the influence of technical covariates, including sequencing depth and mitochondrial content. The resulting model residuals are treated as normalized count data for downstream analysis.
Dimensional Reduction: Following normalization, select features (i.e., genes) are used to reduce the dimensionality of the data set. This effectively eases downstream computational burden, increases the signal-to-noise and allows for data visualization. We perform principal component analysis and use the top N principal components that explain 90% (user-defined; pca.var.threshold
) variance for Louvain clustering and UMAP embedding. The UMAP algorithm embeds reduced data (e.g., principal components) into a lower dimensional representation while preserving the global structure of the transcriptomic data to facilitate visualization.
For all the example reports shown here, datasets were subsampled to 10,000 cells and minimal cell filtering was performed due to original count matrices having been cleaned by authors.
show contents
Inputs
One of the following data formats must be provided as input:
barcodes.tsv
, genes.tsv
, matrix.mtx
Outputs
show table
Parameter |Description |Argument
--| ---------- | ------
input.object | path to seurat object | character
input.cell_ranger | path to folder containing cell ranger output | character
input.matrix | path gene x cell count matrix (.tsv, .csv format) | character
save.flag | save Seurat object from current analysis | logical
output.object.directory | path to folder in which to save Seurat object | character
output.object.file | name of file to which Seurat object is saved. Must have ".Rdata" suffix (e.g., 'object.Rdata') | character
subsample.n | value specifying how many cells to sub sample. Set to NA if none. | numeric
cluster.resolution | value of the resolution parameter passed to Seurat::FindClusters(...)
. Use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities. Default is 1. | numeric
gene.upperlimit | Upper limit of genes per cell. Value between [0, Inf], but more than gene.lowerlimit
. Any cells exceeding this threshold are filtered out. Default is 9000. | numeric
gene.lowerlimit | Lower limit of genes per cell. Value between [0, Inf], but less than gene.upperlimit
. Any cells below this threshold are filtered out. Default is 200. | numeric
mt.upperlimit | Upper limit of mitochondrial content per cell. Value between [0, 100]. Cells exceeding this threshold are filtered out. Default is 10. | numeric
unmatched.rate.filter.flag | Logical specifying whether to filtered cells using unmatched rates. Default is TRUE.
vars2regress | Variables to regress out during scaling step. Default is 'percent.mt'. | vector of character(s)
correct.artifact | identify artifact cells and omit from dimensional reduction. Default is TRUE. | logical
feature.select.method | Feature selection method for dimensional reduction.| - hvg (default): highly-variable genes
- deviance
scale.method | Expression scaling method. | - sct (default): SCTransform residuals
- null_residuals
pca.method | principal component analysis method. | - pca (default)
- glmpca
pca.component.select.method | method for selecting number of principal components for clustering and UMAP embedding. | - cum_var (default)
- elbow
pca.var.threshold | value between [0, 1] specifying cumulative variance explained by pca. Ignored if pca.component.select.method
is not 'cum_var'. | numeric
conserve.memory | If TRUE, the residual matrix for all genes computed using SCTransform is never created in full; recommended for large data sets, but will take longer to run. Default is FALSE | logical
species.filter.flag | Logical specifying whether to filter species. Default is FALSE. | logical
species.include | Specify which species to retain in analysis. Ignored if species.filter.flag
is FALSE | - Hs: human
- Mm: murine
nworkers | Number of workers used for parallel implementation of SCTransform. Default is 1. | numeric
save.pdf | Logical specifying whether to save figures and tables in separate directory. | logical
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical
developer | Developer mode. Default is FALSE. | logical
Batch correction and integration of independent experiments. Generates a new Seurat integrated
assay using one of four approaches:
See Stuart et al. (2019). Cell for a description of rPCA and CCA, PolaĆski et al. (2020). Bioinformatics for BBKNN, and Hie et al. (2019). Nature Biotechnology for Scanorama.
Our selection of integration methods is based on the benchmarking efforts published by Luecken et al. (2021). Nature Methods
The primary aim of data integration is to identify common cell populations between datasets. However, because many integration methods involve non-linear warping, many statistical assumptions are broken in the process and the count matrix in the integrated
assay is not appropriate for downstream differential gene expression. If downstream differential expression (DEG) analyses are intended, use counts from the data
slot in the SCT
assay; Data are processed using PrepSCTFindMarkers()
thereby ensuring that counts have been corrected for varying sequencing-depth across data sets.
In addition to specifying the parameter.list
(analysis parameters), the Integration Module additionally requires a separate list, named input.data
, to specify the data sets used for integration. See below for details and examples.
show parameter.list
Parameter |Description |Argument
--| ---------- | ------
save.file | name of output file | character
integration.feature.select.method | method used to select integration features | - hvg (default): highly-variable genes
- deviance
integration.k.filter | number of neighbors (k) to use when filtering anchors. Default is 200. Passed to Seurat::FindIntegrationAnchors(...)
. Ignored if integration.method = "BBKNN" or "Scanorama". | numeric
integration.k.anchor | number of neighbors (k) to use when picking anchors. Dictates strength of alignment. 5 is recommended, 20 suggested for stronger alignments. Passed to Seurat::FindIntegrationAnchors(...)
. Ignored if integration.method = "BBKNN" or "Scanorama". | numeric
integration.k.weight | number of neighbors to consider when weighting anchors. Passed to Seurat::IntegrateData(...)
. Default is 100. Ignored if integration.method = "BBKNN" or "Scanorama". | numeric
integration.n.genes | number of genes used for integration. 1000-3000 recommended. | numeric
integration.method | integration method. | - rPCA
- CCA
- BBKNN
- Scanorama
integration.limit.memory | Specify whether to limit max memory allocated during integration. Default is T. Ignored if integration.method = "BBKNN" or "Scanorama". | logical
integration.max.memory | max memory allocated during integration, in terms of Gb. Ignored if integration.limit.memory = F. Default is 150. Ignored if integration.method = "BBKNN" or "Scanorama". | numeric
integration.cluster.resolution | value of the resolution parameter passed to Seurat::FindClusters(...)
, which is run after data integration. Use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities. Default is 1. | numeric
integration.compute.diversity | Compute change in population diversity before and after integration. Default is T | logical
pca.var.threshold | proportion of variance used for clustering and UMAP embedding. 0.9 is recommended. | numeric [0,1]
correct.artifact | omit artifact genes. Default is F | logical
vars2regress | Variables to regress out during scaling step. Default is 'percent.mt'. | vector of character(s)
correct.artifact | identify artifact genes (i.e., genes detected in only one sample) and omit from integration step. Recommended only if integrated datasets have similar cellular compositions. Default is F | logical
n.workers | number of workers to use for parallel implementation. Provided as list, with each named entry corresponding to different task (e.g., n.workers = list(import = 4, integration = 1)) | list
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical
save.pdf | Logical specifying whether to save figures and tables in separate directory. | logical
update.log | update analysis log. Ignored if developer = F. | logical
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical
developer |Developer mode. Default is FALSE. | logical
show input.data
The input.data
list is a named list, with each entry corresponding to a separate input dataset. For a basic use, given 2 or more seurat objects for integration, we can specify the path to the input data using the file
argument as follows:
# basic input specification input.data <- list( dataset1 = list(file = "dataset1.Rdata"), # path to .Rdata file containing seurat object named 'so' dataset2 = list(file = "dataset2.Rdata"), dataset3 = list(file = "dataset3.Rdata") )
Alternatively, if a given data set contains multiple batches of data, we can partition the seurat object into subsets which will then be handled as independent datasets in the downstream integration step. This can be accomplished using the integrate.by
argument, which specifies which feature to use to split the seurat object. The feature specified by integrate.by
must be present in the metadata of the corresponding seurat object. In the example below, we split dataset 1 using the Barcode
feature prior to integration:
# spliting seurat object into subset prior to integration input.data <- list( dataset1 = list( file = "dataset1.Rdata", # path to .Rdata file containing seurat object named 'so' integrate.by = "Barcode" # meta data features specifying sample labels ), dataset2 = list( file = "dataset2.Rdata" ), dataset3 = list( file = "dataset3.Rdata" ) )
Users can also filter their inputs prior to integration using the meta.feature.name
and meta.feature.filter
arguments. For example, if dataset 1 consists of multiple cell types, and we are interested in only integrating the immune compartment, we can set meta.feature.filter
to "lymphoid|myeloid" and meta.feature.name
to "cell.type", with the latter corresponding to the meta data column present in the seurat object:
# filtering seurat object prior to integration input.data <- list( dataset1 = list( file = "dataset1.Rdata", # path to .Rdata file containing seurat object named 'so' meta.feature.name = "cell.type", # meta data column used to filter data meta.feature.filter = "lymphoid|myeloid" # features to include for downstream integration ), dataset2 = list( file = "dataset2.Rdata" ), dataset3 = list( file = "dataset3.Rdata" ) )
Note that the filter uses non-exact matching, so please be aware that substrings may unintentionally filter out data of interest if incorrectly specified. For example, "cell_type_1" will filter "cell_type_1", "cell_type_10", "cell_type_11", etc. A few additional examples for filtering:
Clustering is performed at several candidate resolutions, and the optimal resolution is informed using:
Clustering is performed using the Seurat implementation of FindNeighbors()
and FindClusters()
. In brief, a cell X cell shared nearest neighbor (SNN) graph is first constructed by determining the neighborhood overlap (Jaccard Index) between every cell and its k^th^ nearest neighbor. Then, the SNN graph is clustered using the Louvain algorithm. The main parameter affecting the number of clusters generated is the resolution
. Higher resolutions yield more clusters, whereas lower resolution yield fewer clusters.
show contents
Inputs
so
. Outputs
show table
Parameter |Description |Argument
--| ---------- | ------
input.file | path to seurat object | character
cluster.resolution | range of resolution values used for evaluating candidate cluster configurations. Recommended = c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3) | vector of numeric(s)
lfc.threshold | log-fold change threshold used in differential-expression analysis. 0.5 is recommended. | numeric
fdr.threshold | fdr-value significance threshold. Value between [0,1]. Default is 0.05. | numeric
group.singletons | Group singletons into nearest cluster. If FALSE, assign all singletons to a "singleton" group. For certain difficult-to-cluster data sets, group.singletons
= FALSE may yield faster results. Default is FALSE | logical
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical
subsample_factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric
subsample_n | value specifying how many cells to sub sample. Set to NA if none. | numeric
n.workers | number of workers used for parallel implementation. parallel::detectCores()
is recommended. | numeric
only.pos | only included genes with logFC > 0. Default is TRUE. | logical
save.pdf | save figures and tables | logical
update.log | update analysis log. Ignored if developer = F. | logical
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical
developer |Developer mode. Default is FALSE. | logical
Annotation of cell populations identified by unsupervised clustering. Cell-type markers are scored using the Miko scoring pipeline.
Prior to running this module, we recommend running the cluster optimization module to identify the appropriate resolution for clustering.
show contents
Inputs
so
. Outputs
show table
Parameter |Description |Argument
--| ---------- | ------
input.file | path to seurat object | character
external.markers | cell-type markers used for annotation | named list; names are cell-types, entries are genes.
E.g., list(cell_A = c("gene1", "gene2", "gene3"), cell_B = c("gene4", "gene5", "gene6"))
internal.markers | name of scMiko internal gene set(s). Run names(geneSets)
for list of available gene sets. | character
marker.filter | filter marker sets using grepl
pattern matching. E.g., Manno2021 for all La Manno 2021 atlas-derived markers in Cell_Catalogy
stored in scMiko geneSets
. | character
coherence.threshold | value of minimum gene set coherence. Gene sets with coherence scores below this threshold are disqualified from consideration. Value must be between [0,1]. Default is 0.8. | numeric
p.threshold | p-value significance threshold. Value between [0,1]. Default is 0.05. | numeric
do.center | center scores by null model predictions. FALSE is recommended | logical
frequent.filter.threshold | path to seurat object | character
filter.frequent.fliers | path to seurat object | character
min.representation | minimum fraction of genes required for gene set consideration. If fraction of genes present in seurat object is below this threshold, gene set is disqualified. Value between [0,1]. Default is 0.5. | numeric
cluster.resolution | value of the resolution parameter passed to Seurat::FindClusters(...)
. Use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities. Default is 1. | numeric
fdr.correction | apply Benjamini & Hochberg correction | logical
clean.clusters | filter cells that exceed 3 median absolute deviations from cluster center in UMAP coordinate space. Default is FALSE. | logical
max.cells | max number of cells used in Miko scoring pipeline. Default is 20000. | numeric
subsample_factor | value specifying fraction of cells to sub sample. Value between [0,1] | numeric
min.geneset.size | maximal gene set size. Default is 200. | numeric
max.geneset.size | minimal gene set size. Default is 3. | numeric
pathway.db | database used for functional annotation of gene sets |- GO
- Bader
n.workers | number of workers used for parallel implementation. Default is parallel::detectCores()
. | numeric
save.pdf | save figures and tables. | logical
show.top.n.gene.panels | max number of gene sets per cluster to show in report. Default is 5 | numeric
max.gene.panels.allowed | max number of overall gene sets to show in report. Default is 75 | numeric
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical
update.log | update analysis log. Ignored if developer = F. | logical
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical
developer | Developer mode. Default is FALSE. | logical
Gene program/module identification. Three independent methods are implemented to identify gene programs based on gene co-expression profiles:
Gene set enrichment is performed on each gene program to facilitate functional annotation.
show contents
Inputs
so
. Outputs
show table
Parameter |Description |Argument
--| ---------- | ------
input.file | path to seurat object | character
cluster.resolution | Value of the resolution parameter, use a value above 1.0 if you want to obtain a larger number of communities. | numeric
subsample_factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric
subsample_n | value specifying how many cells to sub sample. Set to NA if none. | numeric
subset.data | Subset Seurat object according to specified meta data field. Only specified meta data entries are retained, while remaining of data is omitted. Set as "no.subset" for not subsetting. | data frame specifying which field (df$field
) to subset on, and which field entries to retain (df$subgroups
)
batch.feature | meta data field to use for batch correction (set to NA if unspecified) | character
feature.selection.method | method used to select features for module detection analysis. | - expr: expression-based criterion
- hvg: highly-variable genes
- deviance: deviance-based criterion
max.features | max number of features to include in module detection analysis (NA if unspecified)
min.pct | minimal feature expression (applicable only if select.method
= "expr"). Value between [0,1], use 0.1 for large network and 0.5 for small network. | numeric
data.type | Data representation used for module detection | - pearson: based on SCTransform (recommended)
- deviance: based on multinomial null model
purity.optimization.step.size | value of step size used for optimizing module purity. 0.5 recommended. | numeric
filter.parameters | specify which clusters to include or omit | named list with 2 sets of entries; 'include' entries specify which clusters to retain; 'omit' entries specify which clusters to omit.
E.g., list(include = c(0,1,2), omit = c(3,4,5))
general.weight.by.var | Weight feature embeddings by the variance of each PC. Default is FALSE. | logical
general.pca.cum.sum | cumulative variance explained by PCs used for SSN analysis. Value between (0,1], 0.9 is recommended. | numeric
general.umap.knn | number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50. | numeric
general.cluster.purity | target purity of network modules. Value between [0,1], 0.8 is recommended. | logical
general.scale.free.topology | apply scale-free transform to shared nearest neighbor graph. Default is TRUE. | logical
general.ica | perform independent component analysis (ICA). | logical
general.nmf | perform non-negative matrix factorization (NMF). | logical
general.nmf.k | number of latent factors to compute with NMF. Can be a range of values (e.g., c(5, 10, 15)) | vector of numeric(s)
general.robust.pca | use robust PCA for dimensional reduction (warning: computationally intensive) | logical
n.workers | number of workers used for parallel implementations. | numeric
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical
save.pdf | save figures and tables | logical
update.log | update analysis log. Ignored if developer = F. | logical
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical
developer |Developer mode. Default is FALSE. | logical
\
Given a query gene or gene set, scRNAseq expression is evaluated and gene similarity metrics are computed to identify associated genes.
show contents
Inputs
so
. Outputs
show table
Parameter |Description |Argument
--| ---------- | ------
input.file | path to seurat object | character
features | features used for analysis.
If feature.format
= 1, features
must be vector of features present in seurat object (e.g., c("PRRX1", "TCF4")).
If feature.format
= 2, features
must be name of column spreadsheet containing features (e.g., "HALLMARK_genes").
If feature.format
= 3, features
argument is ignored.
feature.format | format of features provided by user. | - 0: user-provided genes
- 1: features are stored in spreadsheet.
- 2: discovery mode.
feature.csv.path | Path to spreadsheet containing features. Required if using feature.format
= 1. For intended use, features in spreadsheet must be organized in columns that can be referenced by the column header using the features
argument. | character
discovery.method | Differential expression algorithm used in discovery mode (feature.format
= 2) | - Wilcoxon: Identifies DEGs using Wilcoxon Rank Sum test genes
- CDI: Identifies DEGs using co-dependency test
- Gini: Identifies group-specific features using gini inequality index
discovery.pct.min | Features that are expressed below this value are omitted from differential expression analysis. Range between [0,1], 0.1 is recommended. Required only if feature.format
= 2. | numeric
discovery.group.by | Name of object meta data field to group cells by when performing differential expression. E.g., "seurat_clusters". Required only if feature.format
= 2. | character
discovery.top.n | path to seurat object | character
discovery.max.n | minimum fraction of genes required for gene set consideration. If fraction of genes present in seurat object is below this threshold, gene set is disqualified. Value between [0,1]. Default is 0.5. | numeric
as.module | aggregate features into single meta-program. Default is FALSE. | logical
cor.method | feature association method | - pearson: Pearson correlation
- spearman: Spearman correlation
- rho_p: Proportionality constant
- cdi: co-dependency index
cor.top.n | Number of top associated features to show. 70 is recommended. | numeric
cor.which.data | aslot to use for feature association analysis. "data" is recommended. | - data
- scale
cor.min.expr | Features that are expressed below this value are omitted from feature association analysis. Range between [0,1], 0.025-0.1 is recommended. | numeric
general.cluster.resolution | Value of the resolution parameter, use a value above 1.0 if you want to obtain a larger number of communities. | numeric
general.subsample.factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric
general.subset | Subset Seurat object according to specified meta data field. Only specified meta data entries are retained, while remaining of data is omitted. Set as "no.subset" for not subsetting. | data frame specifying which field (df$field
) to subset on, and which field entries to retain (df$subgroups
)
general.clean.clusters | omit cells exceeding 3 median absolute deviations from cluster center in UMAP coordinate space. | logical
general.save.pdf | save figures and tables | logical
general.print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical
general.rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical
general.barcode.recode | Named list where names are new labels, and entries are old labels. If is provided for old labels, treated as wild card. e.g. list(in.vitro = c("invitro), in.vivo = NA). See scMiko::recordBarcode(...)
for details. | Named list where names are new labels, and entries are old labels.
expression.do.hex | show UMAP expression using schex. Default is FALSE | logical
n.workers | number of workers used for parallel implementation. Set to 1 for non-parallel run. | numeric
developer | Developer mode. Default is FALSE. | logical
Session Info
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.