``` {r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Contact: alexander.ferrena@einsteinmed.org

# Single Dataset Analysis

This vignette will provide guidance for analysis of an individual single-cell RNAseq (scRNAseq) dataset.

Package installation:

install.packages("devtools") #if devtools not already installed
devtools::install_github('apf2139/tamlabscpipeline', build_vignettes = T)

Dependency installation:

install.packages("tidyverse")
install.packages('Seurat')
devtools::install_github(repo = 'ChristophH/sctransform')
install.packages("cowplot")
devtools::install_github('chris-mcginnis-ucsf/DoubletFinder')

#BioconductorPkgs
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MAST")
BiocManager::install("fgsea")

## Startup

First, load up the package and dependencies. Running `library(tamlabscpipeline)` will examine the dependencies, compare your versions with the latest known compatible versions for tamlabscpipeline, and issue warnings if you are behind or ahead of the latest tested versions.

``` {r, eval=FALSE}
library(Seurat)
library(sctransform)
library(tidyverse)
library(cowplot)
library(DoubletFinder)
library(MAST)
library(fgsea)
library(tamlabscpipeline)
```

## Quickstart

The following commands will perform QC filtering, clustering, differential gene expression testing and exploratory gene set enrichment analysis on the H5 output file from the Cellranger pipeline.

``` {r, eval=FALSE}
pdf('qcplots.pdf')

sobj <- seuratpipeline(data = filepath, # path to dataset
                       format = h5)     # format; typically cellranger dir or H5

dev.off()

degdir <- 'clusterDEG_2020-01-08_MAST'

deg.acrossclusters(sobj, outdir = degdir)

gseapipeline.clusters(inputfolder = degdir)
```

Altogether, these functions will return a QC-filtered and clustered Seurat object, a directory of per-cluster differential expression results, and an exploratory GSEA heatmap.

On a dataset of ~17,000 genes x ~5,500 cells using a 2018 MacBook Pro this took roughly 4 hours. Plenty of time to read the rest of this vignette for more details on the underlying methods and how to further assess the quality of your dataset; think about some hypotheses you'd like to explore in your data; and ponder the precarious global geopolitical situation we find ourselves living in.

See below for further details and ideas for QC, data exploration and analysis.

## QC and analysis

Let's assume you have a dataset at the ready. I will try to provide a small example dataset for reproducible tutorials in the future. Currently this dataset should be the output of Cellranger in the form of a directory or an H5 file.

The `seuratpipeline()` function will perform QC and initial clustering of your dataset. This involves filtering out high mitochondrial content cells, low library size cells, and doublets; then normalizing with SCTransform (SCT) and clustering with the Louvain method as implemented in Seurat.

Here is a fuller explanation of the function and its options. The object `filepath` stores a string containing the path to an example file.

``` {r, eval=FALSE}
pdf('qcplots.pdf')

sobj <- seuratpipeline(
               # required arguments: data file path and format of data.
               data = filepath,
               format = h5, # h5 is not the default, but is typical.

               # only used for the kallisto | bustools pipeline
               transcript_gene_file = NULL,

               project = NULL, # name of seurat object, sobj@project.name

               # "baseline", global filtration.
               baselinefilter.mad = F,
                baseline.mito.filter = T,
                madmax.dist.percentmito.baseline = 'predict',
                baseline.libsize.filter = T,
                madmax.dist.nCount_RNA.baseline = 'predict',

               # auto-identify and remove high-mito-content clusters.
               removemitomaxclust = T,

               # iterative filtering options for first-pass clusters.
               iterativefilter = T,
                iterativefilter.libsize = 'lefttail',
                iterativefilter.mito = F,

               # perform cell cycle correction via regression. Default F.
               # Cell cycle scores are still calculated if set to FALSE.
               cellcycleregression = F,

               # manually choose which genes are used in PCA.
               # the default (NULL) uses variable genes called by SCT.
               PCAgenelist = NULL,

               # Jackstraw analysis to test for significant PCs. Default = F
               jackstraw = F,

               # which PCs to use in clustering and dim-reduction for viz.
               # default = 1:30
               dims = 1:30,

               # "resolution parameter" used for Louvain clustering.
               # default is three values; the final value is set as the active identity:
               res = c(0.5, 1.5, 1.0)
               )

dev.off()
```

This function is designed to give a rough but reasonable initial clustering across different datasets. On a dataset of ~17,000 genes x ~5,500 cells using a 2018 MacBook Pro, this took roughly 20 minutes.
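
Since this is the slowest interactive step, it can be convenient to save the returned object and reload it in later sessions rather than re-running the pipeline. A minimal sketch using base R (the file name is arbitrary):

``` {r, eval=FALSE}
# save the QC'd, clustered object so it can be reloaded in later sessions
saveRDS(sobj, file = 'sobj_default.rds')

# later, in a new session:
sobj <- readRDS('sobj_default.rds')
```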


By default, this function will perform QC and clustering of the data. Default QC steps are as follows:

  1. First-pass clustering, involving normalization via SCT and Louvain clustering in Seurat with default parameters.
  2. Identification and removal of the mitochondrial cluster. Without any prefiltering, there will almost certainly be a cluster strongly driven by mitochondrial RNA content.
     + This step generates two violin plots: one before removal, and one after removal and reclustering.
  3. Second-pass clustering after mito cluster removal, followed by iterative library size filtering.
     + By default, this second pass only filters on minimum library size. Thresholds for minimum library size are based on median absolute deviation (MAD) cutoffs learnt directly from the data (a hand-computed sketch of this kind of cutoff is shown after this list).
     + Filtering is skipped for clusters below an arbitrary minimum size (100 cells), as MAD learning performs quite poorly with small cell numbers.
     + This step generates two visualizations per cluster: one showing the MAD threshold learnt for that cluster, the other showing the actual cutoff point that threshold implies. Plots are generated even for clusters under 100 cells, but those clusters are not actually subjected to library size filtration.
  4. Third-pass clustering, involving the identification and removal of doublets. Doublet identification relies on DoubletFinder.
     + This generates a line graph that you can ignore. This step also performs its own multi-pass clustering using default parameters.
  5. Cell cycle scores are calculated. By default, there is no automatic cell cycle correction, as its necessity varies from dataset to dataset and should be assessed manually.
  6. A finalized normalization and clustering step using SCT and Seurat Louvain clustering takes place.
  7. The function returns a Seurat object containing fully normalized and scaled matrices; clustering assignments; and dimensionality reduction embeddings, including PCA, tSNE and UMAP.
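
To make the library size step above more concrete, here is a rough, hand-computed sketch of what a median absolute deviation cutoff looks like for a single cluster. This is only an illustration of the general idea, using an arbitrary multiplier of 3 MADs and cluster `'0'` as an example; it is not the exact threshold-learning procedure used internally by `seuratpipeline()`:

``` {r, eval=FALSE}
# illustrative only: a simple lower MAD cutoff on library size for one cluster
clust.cells <- WhichCells(sobj, idents = '0')        # cells in cluster "0"
libsizes    <- sobj$nCount_RNA[clust.cells]

mad.cutoff <- median(libsizes) - 3 * mad(libsizes)   # arbitrary 3-MAD left-tail cutoff
table(libsizes < mad.cutoff)                         # how many cells would be flagged
```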

More detail on each option is given in the comments of the example call above.

Here are some suggestions for exploring the dataset and manually assessing quality.

``` {r, echo=FALSE}
library(Seurat)
library(sctransform)
library(tidyverse)
library(cowplot)
library(DoubletFinder)

knitr::opts_chunk$set(fig.width=7, fig.height=5)

sobj <- readRDS('../../tamlabtesting/sobj_default.rds')
```

Inspect the clustering via the `Seurat::DimPlot()` command.


`DimPlot(sobj, label = T)`


![](embed/0.dimplot.clusts.png){height=500 width=600}

`FeaturePlot(sobj, c('nCount_RNA', 'percent.mito', 'S.Score', 'G2M.Score'))`


![](embed/1.fplot.png){height=500 width=600}

`VlnPlot(sobj, c('nCount_RNA', 'nFeature_RNA', 'percent.mito'), ncol = 1)`


![](embed/2.vlnplot.png){height=500 width=600}

`DimPlot(sobj, group.by = 'Phase')`


![](embed/3.dimplot.phases.png){height=500 width=600}

A full assessment of quality will require cell type identification, for example via specific cell type markers. An example of this using Seurat's `FeaturePlot()` function is shown below:

`FeaturePlot(sobj, c('Ptprc', 'Cd3e', 'Cd19', 'Epcam'))`


![](embed/4.fplot.celltype.png){height=500 width=600}

Other cell type calling approaches make use of differentially expressed genes, the calling of which is discussed below. Automated (or at least semi-automated) cell-type calling is an active area of development and will be included in this suite soon.
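
In the meantime, Seurat's own marker-finding functions are handy for a quick manual pass; for example, something along these lines (the parameter values are just common defaults, and the fold-change column is named `avg_logFC` in older Seurat versions):

``` {r, eval=FALSE}
# find positively enriched markers for every cluster
markers <- FindAllMarkers(sobj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# top 5 markers per cluster by log fold-change (avg_logFC in Seurat v3)
top5 <- markers %>%
  group_by(cluster) %>%
  top_n(n = 5, wt = avg_log2FC)

DotPlot(sobj, features = unique(top5$gene)) + RotatedAxis()
```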

An integrated assessment of quality, and the decision to recluster with different QC parameters, is thus a non-trivial and manual process, but these plots and the parameters of the `seuratpipeline()` function allow for easy tweaking.
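
For example, if the plots above suggest residual high-mitochondrial cells or a strong cell cycle signal, the pipeline can simply be re-run with the relevant options changed. A hedged sketch, with parameter choices that are purely illustrative rather than recommendations:

``` {r, eval=FALSE}
pdf('qcplots_rerun.pdf')

sobj2 <- seuratpipeline(data = filepath,
                        format = h5,
                        iterativefilter.mito = T,  # also apply per-cluster mito filtering
                        cellcycleregression = T)   # regress out cell cycle scores

dev.off()
```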

***

## Differentially Expressed Genes

After clustering, it is often useful to run differential expression testing to identify differentially expressed genes (DEGs). These can be used as input for Gene Set Enrichment Analysis (GSEA).

For differential expression testing, Seurat provides wrappers for tests such as the [MAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4676162/) test, specifically developed to identify DEGs while attempting to correct for the biases and noise inherent to scRNAseq data.

If you would like to use the GSEA pipeline included with tamlabscpipeline, it would be beneficial to use the function `deg.acrossclusters()`. This function is a wrapper around Seurat's `FindMarkers()` (itself a wrapper around tests like MAST), and can be used as follows:

``` {r, eval=FALSE}
deg.acrossclusters(sobj, idents, test, latent.vars, outdir)
```

This function will save output files to `outdir`: a directory containing .rds files, each storing a dataframe output by `Seurat::FindMarkers()`. This function takes a while. To help deal with the risk of interruptions, it is written to pick up where it previously left off if stopped part-way through; simply re-run the original invocation and it will try to resume.
On the same dataset mentioned previously (15982 genes x 3522 cells post filtration) with 24 clusters, using a 2018 MacBook Pro, this took ~3 hours and 45 mins.
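
The saved .rds files can be read back in for inspection at any time. A minimal sketch, assuming the `degdir` directory from the Quickstart above (the exact file names inside the directory will depend on your cluster labels):

``` {r, eval=FALSE}
# list the per-cluster DEG results written by deg.acrossclusters()
degfiles <- list.files(degdir, pattern = '\\.rds$', full.names = TRUE)

# read one cluster's results and look at the most significant genes
deg <- readRDS(degfiles[1])               # a dataframe from Seurat::FindMarkers()
head(deg[order(deg$p_val_adj), ])
```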

***

Options:

* sobj: the seurat object, for example outputted by `tamlabscpipeline::seuratpipeline()`
* idents: a categorical variable contained within the Seurat object metadata. Typically, the clustering assignment contained within `sobj$seurat_clusters`. The metadata is accessed via `sobj@meta.data`, which returns a large dataframe containing information for every cell in the dataset; it is best to use `head(sobj@meta.data)` to view it. Default is `seurat_clusters`, which if used directly from `seuratpipeline()` will correspond to clusters calculated at resolution = 1.0.
* test: which test to perform; see `?Seurat::FindMarkers` for details and options. Default is MAST.
* latent.vars: variables which will be regressed out in order to minimize their signal in DEG testing; typically technical noise. Passed as a string or character vector referring to the names of columns in `sobj@meta.data`. Some tests, such as MAST, can only perform noise regression on quantitative rather than categorical variables.
* outdir: the directory to save output files to. Defaults to "ClusterDEG_(Seurat object project name)_(date)_(test)". An example call using these options is sketched after this list.
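
Putting these options together, a call might look like the sketch below. The argument values are purely illustrative (`percent.mito` and `seurat_clusters` are metadata columns created by `seuratpipeline()`, as discussed above), and passing them as strings is assumed here:

``` {r, eval=FALSE}
deg.acrossclusters(sobj,
                   idents      = 'seurat_clusters',
                   test        = 'MAST',
                   latent.vars = 'percent.mito',     # quantitative noise variable
                   outdir      = 'clusterDEG_custom')
```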

***

## GSEA for each cluster

A gene set enrichment analysis (GSEA) pipeline is provided for individual datasets in the form of the `gseapipeline.clusters()` function.

``` {r, eval=FALSE}
gseapipeline.clusters(inputfolder, pathways = hallmark, nperm = 10000,
                      makepdf = F, pdfname = NULL, filter_nonsig_pathways = F)
```

This function takes the result of the `deg.acrossclusters()` function as input. Simply set inputfolder equal to a filepath string pointing to the directory created by that function.

As output, `gseapipeline.clusters()` will return a heatmap, colored by normalized enrichment and denoting significance cutoffs. The heatmap will look something like this: 
![](embed/5.heatmap.png){height=500 width=600}

Significance cutoffs of 0.05 and 0.25 are used: the 0.05 level because the underlying FGSEA algorithm allows standard alpha cutoffs, and the 0.25 level for historical reasons, following the FDR cutoff suggested by the original GSEA algorithm. Please see [Korotkevich, Sukhov, and Sergushichev 2019](https://www.biorxiv.org/content/10.1101/060012v2) for details on the underlying algorithm.

For the same example dataset, on 24 clusters using the default exploratory Hallmark mouse gene signatures with default nperm = 10000, this function takes ~10 seconds.
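
For orientation, the underlying per-cluster call is conceptually similar to the sketch below, in which DEG results are turned into a ranked, named vector of statistics and passed to `fgsea`. The ranking metric shown is just one common choice and is not necessarily the one used internally:

``` {r, eval=FALSE}
# conceptual sketch of a single-cluster fgsea run
degfiles <- list.files(degdir, pattern = '\\.rds$', full.names = TRUE)
deg      <- readRDS(degfiles[1])                  # Seurat::FindMarkers() output for one cluster

ranks <- setNames(deg$avg_log2FC, rownames(deg))  # genes ranked by log fold-change
ranks <- sort(ranks, decreasing = TRUE)           # (avg_logFC in older Seurat versions)

res <- fgsea::fgsea(pathways = tamlabscpipeline::hallmark,
                    stats    = ranks,
                    nperm    = 10000)
head(res[order(res$padj), ])
```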

***

Options / details:

* inputfolder: a string pointing to the location of the `deg.acrossclusters()` output directory. This contains .rds files storing dataframes outputted by `Seurat::FindMarkers()`.
* pathways: a named "list of lists" of genes. The format of this object is important for the function to work and can be previewed in the `tamlabscpipeline::hallmark` object. The name of each list element provides the Y axis; the genes within each list element are the target genes used for GSEA. If NULL, defaults to MSigDB's Hallmark pathways for mouse. 
  + The `msigdbr` package is extremely useful for finding mouse orthologs of the Broad's [MSigDB](http://software.broadinstitute.org/gsea/msigdb/index.jsp) repository of gene signatures. Please see the associated [vignette](https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html). A typical workflow for mouse datasets is to find an interesting gene set using the Broad's GUI and then use the mouse orthologs via msigdbr; a sketch of this appears after this list. The default Hallmark geneset, meant for exploratory analysis, was built in this way.
* nperm: how many permutations FGSEA will use. Facilitates significance testing. Default is 10000.
* makepdf: whether to print to pdf. Only for Mac users currently :/ this is because of limitations with special character encoding when printing to PDF. The marks for significance placed on the heatmap are specially encoded, and Mac makes use of the quartz pdf device. Windows users are invited to explore wrapping the function in a CairoPDF call. This will hopefully be fixed soon.
  + pdfname: name of pdf. Only for Mac / Quartz users
* filter_nonsig_pathways: whether to show only rows that have at least one significant cell in the output heatmap. Default is False. Can be useful for polished plotting after exploratory data analysis.
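
As an example of the msigdbr workflow mentioned above, a custom `pathways` list can be built roughly as follows. This is a sketch only: the collection chosen here is arbitrary, and msigdbr's argument names have changed across versions, so check `?msigdbr` for your installed version:

``` {r, eval=FALSE}
library(msigdbr)

# pull mouse orthologs of an MSigDB collection (Hallmark shown here)
msig <- msigdbr(species = "Mus musculus", category = "H")

# convert to a named list of gene vectors -- the format expected for `pathways`
pathways <- split(msig$gene_symbol, msig$gs_name)

gseapipeline.clusters(inputfolder = degdir, pathways = pathways)
```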


***
#### Final notes

There have been some major improvements over the previous versions, including much better cell QC (more accurately removing low-quality cells while being nicer to okay-quality cells). There have also been important changes to clean up the code, remove unnecessary options, optimize speed, and retain full scaled.data matrices in the final output. Tammela lab users are therefore encouraged to use this version of all code rather than previously shared versions.


### Parallelization

Parallelization, which should increase the speed of all functions, is in the works.





