all_times <- list()  # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      now <<- Sys.time()
    } else {
      res <- difftime(Sys.time(), now, units = "secs")
      all_times[[options$label]] <<- res
    }
  }
}))
knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 95),
  message = FALSE,
  warning = FALSE,
  time_it = TRUE
)

In this article we overview each scPipeline module, and provide details on how to specify parameters for analysis. For each module, we provide a description, example reports and documentation for implementation (i.e., parameter specification).

Prior to running any scPipeline module, please visit our getting started with scPipeline article to setup up scPipeline.

scPipeline Parameter specification

In general, our convention is to store all analysis parameters in a list named parameter.list at the start of the script.

# example
parameter.list <- list(
  input.file = "../data/demo.rds",
  ...,
  print.inline = F,
  save.flag = T
)

In all scPipeline modules, parameter.list is already specified, so users need only to configure the analysis parameters (e.g., specify input data, set cluster resolution, etc.) prior to running the module. In general, the default settings will yield reasonable results, however a little time spent optimizing parameters will often yield better outcomes.


Module 1: Quality Control (QC) & Preprocessing

Description

Initial appraisal of scRNA-seq count matrix data and overview of quality control metrics, data filtering, and normalization. Generates initial UMAP clusters, and PCA analyses. A preprocessed seurat object is saved as output.

Quality control: Cells are filtered based on genes/cell recovered and fraction of counts belonging to mitochondrial genes. Injured/broken cells are characterized by low gene recovery and high mitochondrial content whereas doublets are characterized by extremely high gene recovery. Based on these behaviors, high quality non-injured single cells can be retained for downstream analysis using the gene.upperlimit, gene.lowerlimit, and mt.upperlimit analysis parameters.

Count Normalization: Count matrices are normalized using SCTransform. In brief, the count data are modeled using regularized negative binomial regression to remove the influence of technical covariates, including sequencing depth and mitochondrial content. The resulting model residuals are treated as normalized count data for downstream analysis.

Dimensional Reduction: Following normalization, select features (i.e., genes) are used to reduce the dimensionality of the data set. This effectively eases downstream computational burden, increases the signal-to-noise and allows for data visualization. We perform principal component analysis and use the top N principal components that explain 90% (user-defined; pca.var.threshold) variance for Louvain clustering and UMAP embedding. The UMAP algorithm embeds reduced data (e.g., principal components) into a lower dimensional representation while preserving the global structure of the transcriptomic data to facilitate visualization.

Example Reports

For all the example reports shown here, datasets were subsampled to 10,000 cells and minimal cell filtering was performed due to original count matrices having been cleaned by authors.

Inputs and Outputs

show contents

Inputs

One of the following data formats must be provided as input:

  • Seurat object: must contain RNA assay with gene x cell count matrix
  • Cell Ranger output: folder must contain barcodes.tsv, genes.tsv, matrix.mtx
  • gene x cell count matrix

Outputs

  • Seurat object (.Rdata file)
  • Analysis report (.HTML file)

Analysis Parameters

show table

Parameter |Description |Argument --| ---------- | ------ input.object | path to seurat object | character input.cell_ranger | path to folder containing cell ranger output | character input.matrix | path gene x cell count matrix (.tsv, .csv format) | character save.flag | save Seurat object from current analysis | logical output.object.directory | path to folder in which to save Seurat object | character output.object.file | name of file to which Seurat object is saved. Must have ".Rdata" suffix (e.g., 'object.Rdata') | character subsample.n | value specifying how many cells to sub sample. Set to NA if none. | numeric cluster.resolution | value of the resolution parameter passed to Seurat::FindClusters(...). Use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities. Default is 1. | numeric gene.upperlimit | Upper limit of genes per cell. Value between [0, Inf], but more than gene.lowerlimit. Any cells exceeding this threshold are filtered out. Default is 9000. | numeric gene.lowerlimit | Lower limit of genes per cell. Value between [0, Inf], but less than gene.upperlimit. Any cells below this threshold are filtered out. Default is 200. | numeric mt.upperlimit | Upper limit of mitochondrial content per cell. Value between [0, 100]. Cells exceeding this threshold are filtered out. Default is 10. | numeric unmatched.rate.filter.flag | Logical specifying whether to filtered cells using unmatched rates. Default is TRUE. vars2regress | Variables to regress out during scaling step. Default is 'percent.mt'. | vector of character(s) correct.artifact | identify artifact cells and omit from dimensional reduction. Default is TRUE. | logical feature.select.method | Feature selection method for dimensional reduction.| - hvg (default): highly-variable genes
- deviance scale.method | Expression scaling method. | - sct (default): SCTransform residuals
- null_residuals pca.method | principal component analysis method. | - pca (default)
- glmpca pca.component.select.method | method for selecting number of principal components for clustering and UMAP embedding. | - cum_var (default)
- elbow pca.var.threshold | value between [0, 1] specifying cumulative variance explained by pca. Ignored if pca.component.select.method is not 'cum_var'. | numeric conserve.memory | If TRUE, the residual matrix for all genes computed using SCTransform is never created in full; recommended for large data sets, but will take longer to run. Default is FALSE | logical species.filter.flag | Logical specifying whether to filter species. Default is FALSE. | logical species.include | Specify which species to retain in analysis. Ignored if species.filter.flag is FALSE | - Hs: human
- Mm: murine nworkers | Number of workers used for parallel implementation of SCTransform. Default is 1. | numeric save.pdf | Logical specifying whether to save figures and tables in separate directory. | logical print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical developer | Developer mode. Default is FALSE. | logical


Module 2: Data Integration

Description

Batch correction and integration of independent experiments. Generates a new Seurat integrated assay using one of four approaches:

  1. rPCA: reciprocal principal component analysis (Seurat)
  2. CCA: canonical correlation analysis + mutual nearest neighbors-based integration (Seurat)
  3. BBKNN: batch-balanced K-nearest neighbors-based integration
  4. Scanorama: SVD + Mutual nearest neighbors-based integration

See Stuart et al. (2019). Cell for a description of rPCA and CCA, PolaƄski et al. (2020). Bioinformatics for BBKNN, and Hie et al. (2019). Nature Biotechnology for Scanorama.

Our selection of integration methods is based on the benchmarking efforts published by Luecken et al. (2021). Nature Methods

The primary aim of data integration is to identify common cell populations between datasets. However, because many integration methods involve non-linear warping, many statistical assumptions are broken in the process and the count matrix in the integrated assay is not appropriate for downstream differential gene expression. If downstream differential expression (DEG) analyses are intended, use counts from the data slot in the SCT assay; Data are processed using PrepSCTFindMarkers() thereby ensuring that counts have been corrected for varying sequencing-depth across data sets.

Analysis Parameters

In addition to specifying the parameter.list (analysis parameters), the Integration Module additionally requires a separate list, named input.data, to specify the data sets used for integration. See below for details and examples.

show parameter.list

Parameter |Description |Argument --| ---------- | ------ save.file | name of output file | character integration.feature.select.method | method used to select integration features | - hvg (default): highly-variable genes
- deviance integration.k.filter | number of neighbors (k) to use when filtering anchors. Default is 200. Passed to Seurat::FindIntegrationAnchors(...). Ignored if integration.method = "BBKNN" or "Scanorama". | numeric integration.k.anchor | number of neighbors (k) to use when picking anchors. Dictates strength of alignment. 5 is recommended, 20 suggested for stronger alignments. Passed to Seurat::FindIntegrationAnchors(...). Ignored if integration.method = "BBKNN" or "Scanorama". | numeric integration.k.weight | number of neighbors to consider when weighting anchors. Passed to Seurat::IntegrateData(...). Default is 100. Ignored if integration.method = "BBKNN" or "Scanorama". | numeric integration.n.genes | number of genes used for integration. 1000-3000 recommended. | numeric integration.method | integration method. | - rPCA
- CCA
- BBKNN
- Scanorama integration.limit.memory | Specify whether to limit max memory allocated during integration. Default is T. Ignored if integration.method = "BBKNN" or "Scanorama". | logical integration.max.memory | max memory allocated during integration, in terms of Gb. Ignored if integration.limit.memory = F. Default is 150. Ignored if integration.method = "BBKNN" or "Scanorama". | numeric integration.cluster.resolution | value of the resolution parameter passed to Seurat::FindClusters(...), which is run after data integration. Use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities. Default is 1. | numeric integration.compute.diversity | Compute change in population diversity before and after integration. Default is T | logical pca.var.threshold | proportion of variance used for clustering and UMAP embedding. 0.9 is recommended. | numeric [0,1] correct.artifact | omit artifact genes. Default is F | logical vars2regress | Variables to regress out during scaling step. Default is 'percent.mt'. | vector of character(s) correct.artifact | identify artifact genes (i.e., genes detected in only one sample) and omit from integration step. Recommended only if integrated datasets have similar cellular compositions. Default is F | logical n.workers | number of workers to use for parallel implementation. Provided as list, with each named entry corresponding to different task (e.g., n.workers = list(import = 4, integration = 1)) | list print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical save.pdf | Logical specifying whether to save figures and tables in separate directory. | logical update.log | update analysis log. Ignored if developer = F. | logical rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical developer |Developer mode. Default is FALSE. | logical

show input.data

The input.data list is a named list, with each entry corresponding to a separate input dataset. For a basic use, given 2 or more seurat objects for integration, we can specify the path to the input data using the file argument as follows:

# basic input specification
input.data <- list(
  dataset1 = list(file = "dataset1.Rdata"),    # path to .Rdata file containing seurat object named 'so'
  dataset2 = list(file = "dataset2.Rdata"),
  dataset3 = list(file = "dataset3.Rdata")
)

Alternatively, if a given data set contains multiple batches of data, we can partition the seurat object into subsets which will then be handled as independent datasets in the downstream integration step. This can be accomplished using the integrate.by argument, which specifies which feature to use to split the seurat object. The feature specified by integrate.by must be present in the metadata of the corresponding seurat object. In the example below, we split dataset 1 using the Barcode feature prior to integration:

# spliting seurat object into subset prior to integration
input.data <- list(
  dataset1 = list(
    file = "dataset1.Rdata",        # path to .Rdata file containing seurat object named 'so'
    integrate.by = "Barcode"        # meta data features specifying sample labels
  ),       
  dataset2 = list(
    file = "dataset2.Rdata"
  ),
  dataset3 = list(
    file = "dataset3.Rdata"
  )
)

Users can also filter their inputs prior to integration using the meta.feature.name and meta.feature.filter arguments. For example, if dataset 1 consists of multiple cell types, and we are interested in only integrating the immune compartment, we can set meta.feature.filter to "lymphoid|myeloid" and meta.feature.name to "cell.type", with the latter corresponding to the meta data column present in the seurat object:

# filtering seurat object prior to integration
input.data <- list(
  dataset1 = list(
    file = "dataset1.Rdata",                   # path to .Rdata file containing seurat object named 'so'
    meta.feature.name = "cell.type",           # meta data column used to filter data
    meta.feature.filter = "lymphoid|myeloid"   # features to include for downstream integration
  ),       
  dataset2 = list(
    file = "dataset2.Rdata"
  ),
  dataset3 = list(
    file = "dataset3.Rdata"
  )
)

Note that the filter uses non-exact matching, so please be aware that substrings may unintentionally filter out data of interest if incorrectly specified. For example, "cell_type_1" will filter "cell_type_1", "cell_type_10", "cell_type_11", etc. A few additional examples for filtering:

  • meta.feature.filter = c("Cell_A", "Cell_B") will match "Cell_A1", "Cell_A2" and "Cell_B1", but not "Cell_C1".
  • meta.feature.filter = c("Cell_A|Cell_B") will NOT match "Cell_A1", "Cell_A2" or "Cell_B1"
  • meta.feature.filter = c("Cell") will match "Cell_A1", "Cell_A2", "Cell_B1", and "Cell_C1".


Module 3: Cluster Optimization

Description

Clustering is performed at several candidate resolutions, and the optimal resolution is informed using:

Clustering is performed using the Seurat implementation of FindNeighbors() and FindClusters(). In brief, a cell X cell shared nearest neighbor (SNN) graph is first constructed by determining the neighborhood overlap (Jaccard Index) between every cell and its k^th^ nearest neighbor. Then, the SNN graph is clustered using the Louvain algorithm. The main parameter affecting the number of clusters generated is the resolution. Higher resolutions yield more clusters, whereas lower resolution yield fewer clusters.

Example Reports

Inputs and Outputs

show contents

Inputs

  • Seurat object (.Rds or .RData file)
    • If an .RData file is provided, the seurat object must be named so.

Outputs

  • Analysis report (.HTML file)

Analysis Parameters

show table

Parameter |Description |Argument --| ---------- | ------ input.file | path to seurat object | character cluster.resolution | range of resolution values used for evaluating candidate cluster configurations. Recommended = c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3) | vector of numeric(s) lfc.threshold | log-fold change threshold used in differential-expression analysis. 0.5 is recommended. | numeric fdr.threshold | fdr-value significance threshold. Value between [0,1]. Default is 0.05. | numeric group.singletons | Group singletons into nearest cluster. If FALSE, assign all singletons to a "singleton" group. For certain difficult-to-cluster data sets, group.singletons = FALSE may yield faster results. Default is FALSE | logical print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical subsample_factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric subsample_n | value specifying how many cells to sub sample. Set to NA if none. | numeric n.workers | number of workers used for parallel implementation. parallel::detectCores() is recommended. | numeric only.pos | only included genes with logFC > 0. Default is TRUE. | logical save.pdf | save figures and tables | logical update.log | update analysis log. Ignored if developer = F. | logical rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical developer |Developer mode. Default is FALSE. | logical


Module 4: Cell-Type Annotation

Description

Annotation of cell populations identified by unsupervised clustering. Cell-type markers are scored using the Miko scoring pipeline.

Prior to running this module, we recommend running the cluster optimization module to identify the appropriate resolution for clustering.

Example Reports

Inputs and Outputs

show contents

Inputs

  • Seurat object (.Rds or .RData file)
    • If an .RData file is provided, the seurat object must be named so.

Outputs

  • Analysis report (.HTML file)

Analysis Parameters

show table

Parameter |Description |Argument --| ---------- | ------ input.file | path to seurat object | character external.markers | cell-type markers used for annotation | named list; names are cell-types, entries are genes.
E.g., list(cell_A = c("gene1", "gene2", "gene3"), cell_B = c("gene4", "gene5", "gene6")) internal.markers | name of scMiko internal gene set(s). Run names(geneSets) for list of available gene sets. | character marker.filter | filter marker sets using grepl pattern matching. E.g., Manno2021 for all La Manno 2021 atlas-derived markers in Cell_Catalogy stored in scMiko geneSets. | character coherence.threshold | value of minimum gene set coherence. Gene sets with coherence scores below this threshold are disqualified from consideration. Value must be between [0,1]. Default is 0.8. | numeric p.threshold | p-value significance threshold. Value between [0,1]. Default is 0.05. | numeric do.center | center scores by null model predictions. FALSE is recommended | logical frequent.filter.threshold | path to seurat object | character filter.frequent.fliers | path to seurat object | character min.representation | minimum fraction of genes required for gene set consideration. If fraction of genes present in seurat object is below this threshold, gene set is disqualified. Value between [0,1]. Default is 0.5. | numeric cluster.resolution | value of the resolution parameter passed to Seurat::FindClusters(...). Use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities. Default is 1. | numeric fdr.correction | apply Benjamini & Hochberg correction | logical clean.clusters | filter cells that exceed 3 median absolute deviations from cluster center in UMAP coordinate space. Default is FALSE. | logical max.cells | max number of cells used in Miko scoring pipeline. Default is 20000. | numeric subsample_factor | value specifying fraction of cells to sub sample. Value between [0,1] | numeric min.geneset.size | maximal gene set size. Default is 200. | numeric max.geneset.size | minimal gene set size. Default is 3. | numeric pathway.db | database used for functional annotation of gene sets |- GO
- Bader n.workers | number of workers used for parallel implementation. Default is parallel::detectCores(). | numeric save.pdf | save figures and tables. | logical show.top.n.gene.panels | max number of gene sets per cluster to show in report. Default is 5 | numeric max.gene.panels.allowed | max number of overall gene sets to show in report. Default is 75 | numeric print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical update.log | update analysis log. Ignored if developer = F. | logical rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical developer | Developer mode. Default is FALSE. | logical


Module 5: Gene Program Discovery

Description

Gene program/module identification. Three independent methods are implemented to identify gene programs based on gene co-expression profiles:

Gene set enrichment is performed on each gene program to facilitate functional annotation.

Example Reports

Inputs and Outputs

show contents

Inputs

  • Seurat object (.Rds or .RData file)
    • If an .RData file is provided, the seurat object must be named so.

Outputs

  • Analysis report (.HTML file)

Analysis Parameters

show table

Parameter |Description |Argument --| ---------- | ------ input.file | path to seurat object | character cluster.resolution | Value of the resolution parameter, use a value above 1.0 if you want to obtain a larger number of communities. | numeric subsample_factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric subsample_n | value specifying how many cells to sub sample. Set to NA if none. | numeric subset.data | Subset Seurat object according to specified meta data field. Only specified meta data entries are retained, while remaining of data is omitted. Set as "no.subset" for not subsetting. | data frame specifying which field (df$field) to subset on, and which field entries to retain (df$subgroups) batch.feature | meta data field to use for batch correction (set to NA if unspecified) | character feature.selection.method | method used to select features for module detection analysis. | - expr: expression-based criterion
- hvg: highly-variable genes
- deviance: deviance-based criterion max.features | max number of features to include in module detection analysis (NA if unspecified) min.pct | minimal feature expression (applicable only if select.method = "expr"). Value between [0,1], use 0.1 for large network and 0.5 for small network. | numeric data.type | Data representation used for module detection | - pearson: based on SCTransform (recommended)
- deviance: based on multinomial null model purity.optimization.step.size | value of step size used for optimizing module purity. 0.5 recommended. | numeric filter.parameters | specify which clusters to include or omit | named list with 2 sets of entries; 'include' entries specify which clusters to retain; 'omit' entries specify which clusters to omit.
E.g., list(include = c(0,1,2), omit = c(3,4,5)) general.weight.by.var | Weight feature embeddings by the variance of each PC. Default is FALSE. | logical general.pca.cum.sum | cumulative variance explained by PCs used for SSN analysis. Value between (0,1], 0.9 is recommended. | numeric general.umap.knn | number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50. | numeric general.cluster.purity | target purity of network modules. Value between [0,1], 0.8 is recommended. | logical general.scale.free.topology | apply scale-free transform to shared nearest neighbor graph. Default is TRUE. | logical general.ica | perform independent component analysis (ICA). | logical general.nmf | perform non-negative matrix factorization (NMF). | logical general.nmf.k | number of latent factors to compute with NMF. Can be a range of values (e.g., c(5, 10, 15)) | vector of numeric(s) general.robust.pca | use robust PCA for dimensional reduction (warning: computationally intensive) | logical n.workers | number of workers used for parallel implementations. | numeric print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical save.pdf | save figures and tables | logical update.log | update analysis log. Ignored if developer = F. | logical rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical developer |Developer mode. Default is FALSE. | logical

\


Module 6: Gene Expression and Associations

Description

Given a query gene or gene set, scRNAseq expression is evaluated and gene similarity metrics are computed to identify associated genes.

Example Reports

Inputs and Outputs

show contents

Inputs

  • Seurat object (.Rds or .RData file)
    • If an .RData file is provided, the seurat object must be named so.

Outputs

  • Analysis report (.HTML file)

Analysis Parameters

show table

Parameter |Description |Argument --| ---------- | ------ input.file | path to seurat object | character features | features used for analysis.
If feature.format = 1, features must be vector of features present in seurat object (e.g., c("PRRX1", "TCF4")).
If feature.format = 2, features must be name of column spreadsheet containing features (e.g., "HALLMARK_genes").
If feature.format = 3, features argument is ignored. feature.format | format of features provided by user. | - 0: user-provided genes
- 1: features are stored in spreadsheet.
- 2: discovery mode. feature.csv.path | Path to spreadsheet containing features. Required if using feature.format = 1. For intended use, features in spreadsheet must be organized in columns that can be referenced by the column header using the features argument. | character discovery.method | Differential expression algorithm used in discovery mode (feature.format = 2) | - Wilcoxon: Identifies DEGs using Wilcoxon Rank Sum test genes
- CDI: Identifies DEGs using co-dependency test
- Gini: Identifies group-specific features using gini inequality index discovery.pct.min | Features that are expressed below this value are omitted from differential expression analysis. Range between [0,1], 0.1 is recommended. Required only if feature.format = 2. | numeric discovery.group.by | Name of object meta data field to group cells by when performing differential expression. E.g., "seurat_clusters". Required only if feature.format = 2. | character discovery.top.n | path to seurat object | character discovery.max.n | minimum fraction of genes required for gene set consideration. If fraction of genes present in seurat object is below this threshold, gene set is disqualified. Value between [0,1]. Default is 0.5. | numeric as.module | aggregate features into single meta-program. Default is FALSE. | logical cor.method | feature association method | - pearson: Pearson correlation
- spearman: Spearman correlation
- rho_p: Proportionality constant
- cdi: co-dependency index cor.top.n | Number of top associated features to show. 70 is recommended. | numeric cor.which.data | aslot to use for feature association analysis. "data" is recommended. | - data
- scale cor.min.expr | Features that are expressed below this value are omitted from feature association analysis. Range between [0,1], 0.025-0.1 is recommended. | numeric general.cluster.resolution | Value of the resolution parameter, use a value above 1.0 if you want to obtain a larger number of communities. | numeric general.subsample.factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric general.subset | Subset Seurat object according to specified meta data field. Only specified meta data entries are retained, while remaining of data is omitted. Set as "no.subset" for not subsetting. | data frame specifying which field (df$field) to subset on, and which field entries to retain (df$subgroups) general.clean.clusters | omit cells exceeding 3 median absolute deviations from cluster center in UMAP coordinate space. | logical general.save.pdf | save figures and tables | logical general.print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical general.rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical general.barcode.recode | Named list where names are new labels, and entries are old labels. If is provided for old labels, treated as wild card. e.g. list(in.vitro = c("invitro), in.vivo = NA). See scMiko::recordBarcode(...) for details. | Named list where names are new labels, and entries are old labels. expression.do.hex | show UMAP expression using schex. Default is FALSE | logical n.workers | number of workers used for parallel implementation. Set to 1 for non-parallel run. | numeric developer | Developer mode. Default is FALSE. | logical


Session Info

sessionInfo()



NMikolajewicz/scMiko documentation built on June 28, 2023, 1:41 p.m.