``` {r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Contact: alexander.ferrena@einsteinmed.org

# Single Dataset Analysis

This vignette will provide guidance for analysis of an individual single-cell RNAseq (scRNAseq) dataset.

Package installation:

install.packages("devtools") #if devtools not already installed
devtools::install_github('apf2139/tamlabscpipeline', build_vignettes = T)

Dependency installation:

install.packages("tidyverse")
install.packages('Seurat')
devtools::install_github(repo = 'ChristophH/sctransform')
install.packages("cowplot")
devtools::install_github('chris-mcginnis-ucsf/DoubletFinder')

#BioconductorPkgs
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MAST")
BiocManager::install("fgsea")

## Startup

First, load up the package and dependencies. Running `library(tamlabscpipeline)` will examine the dependencies, compare your versions with the latest known compatible versions for tamlabscpipeline, and issue warnings if you are behind or ahead of the latest tested versions.

``` {r, eval=FALSE}
library(Seurat)
library(sctransform)
library(tidyverse)
library(cowplot)
library(DoubletFinder)
library(MAST)
library(fgsea)
library(tamlabscpipeline)
```

## Quickstart

The following commands will perform QC filtering, clustering, differential gene expression testing and exploratory gene set enrichment analysis on the H5 output file from the Cellranger pipeline.

``` {r, eval=FALSE}
pdf('qcplots.pdf')

sobj <- seuratpipeline(data = filepath, # path to dataset
                       format = h5)     # format; typically cellranger dir or H5

dev.off()

degdir <- 'clusterDEG_2020-01-08_MAST'

deg.acrossclusters(sobj, outdir = degdir)

gseapipeline.clusters(inputfolder = degdir)
```

Altogether, these functions will return a QC-filtered and clustered Seurat object, a directory of per-cluster differential expression results, and an exploratory GSEA heatmap.

On a dataset of ~17,000 genes x ~5,500 cells using a 2018 MacBook Pro this took roughly 4 hours. Plenty of time to read the rest of this vignette for more details on the underlying methods and how to further assess the quality of your dataset; think about some hypotheses you'd like to explore in your data; and ponder the precarious global geopolitical situation we find ourselves living in.

See below for further details and ideas for QC, data exploration and analysis.

## QC and analysis

Let's assume you have a dataset at the ready. I will try to provide a small example dataset for reproducible tutorials in the future. Currently this dataset should be the output of Cellranger in the form of a directory or an H5 file.

The `seuratpipeline()` function will perform QC and initial clustering of your dataset. This involves filtering out high mitochondrial content cells, low library size cells, and doublets; then normalizing with SCTransform (SCT) and clustering with the Louvain method as implemented in Seurat.

Here is a fuller explanation of the function and its options. The object `filepath` stores a string containing the path to an example file.

``` {r, eval=FALSE}
pdf('qcplots.pdf')

sobj <- seuratpipeline(
               # required arguments: data file path and format of data.
               data = filepath,
               format = h5, # h5 is not the default, but is typical.

               # only used for the kallisto | bustools pipeline
               transcript_gene_file = NULL,

               project = NULL, # name of seurat object, sobj@project.name

               # "baseline", global filtration.
               baselinefilter.mad = F,
                baseline.mito.filter = T,
                madmax.dist.percentmito.baseline = 'predict',
                baseline.libsize.filter = T,
                madmax.dist.nCount_RNA.baseline = 'predict',

               # auto-identify and remove high-mito-content clusters.
               removemitomaxclust = T,

               # iterative filtering options for first-pass clusters.
               iterativefilter = T,
                iterativefilter.libsize = 'lefttail',
                iterativefilter.mito = F,

               # perform cell cycle correction via regression. Default F.
               # Cell cycle scores are still calculated if set to FALSE.
               cellcycleregression = F,

               # manually choose which genes are used in PCA.
               # the default (NULL) uses variable genes called by SCT.
               PCAgenelist = NULL,

               # Jackstraw analysis to test for significant PCs. Default = F
               jackstraw = F,

               # which PCs to use in clustering and dim-reduction for viz.
               # default = 1:30
               dims = 1:30,

               # "resolution parameter" used for Louvain clustering.
               # default is three values; the final value is set as the active identity:
               res = c(0.5, 1.5, 1.0)
               )

dev.off()
```

This function is designed to give a rough but reasonable initial clustering across different datasets. On a dataset of ~17,000 genes x ~5,500 cells using a 2018 MacBook Pro, this took roughly 20 minutes.
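
Since this is the slowest interactive step, it can be convenient to save the returned object and reload it in later sessions rather than re-running the pipeline. A minimal sketch using base R (the file name is arbitrary):

``` {r, eval=FALSE}
# save the QC'd, clustered object so it can be reloaded in later sessions
saveRDS(sobj, file = 'sobj_default.rds')

# later, in a new session:
sobj <- readRDS('sobj_default.rds')
```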


By default, this function will perform QC and clustering of the data. Default QC steps are as follows:

  1. First-pass clustering, involving normalization via SCT and Louvain clustering in Seurat with default parameters.
  2. Identification and removal of the mitochondrial cluster. Without any prefiltering, there will almost certainly be a cluster strongly driven by mitochondrial RNA content.
     + This step generates two violin plots: one before removal, and one after removal and reclustering.
  3. Second-pass clustering after mito cluster removal, followed by iterative library size filtering.
     + By default, this second pass only filters on minimum library size. Thresholds for minimum library size are based on median absolute deviation (MAD) cutoffs learnt directly from the data (a hand-computed sketch of this kind of cutoff is shown after this list).
     + Filtering is skipped for clusters below an arbitrary minimum size (100 cells), as MAD learning performs quite poorly with small cell numbers.
     + This step generates two visualizations per cluster: one showing the MAD threshold learnt for that cluster, the other showing the actual cutoff point that threshold implies. Plots are generated even for clusters under 100 cells, but those clusters are not actually subjected to library size filtration.
  4. Third-pass clustering, involving the identification and removal of doublets. Doublet identification relies on DoubletFinder.
     + This generates a line graph that you can ignore. This step also performs its own multi-pass clustering using default parameters.
  5. Cell cycle scores are calculated. By default, there is no automatic cell cycle correction, as its necessity varies from dataset to dataset and should be assessed manually.
  6. A finalized normalization and clustering step using SCT and Seurat Louvain clustering takes place.
  7. The function returns a Seurat object containing fully normalized and scaled matrices; clustering assignments; and dimensionality reduction embeddings, including PCA, tSNE and UMAP.
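
To make the library size step above more concrete, here is a rough, hand-computed sketch of what a median absolute deviation cutoff looks like for a single cluster. This is only an illustration of the general idea, using an arbitrary multiplier of 3 MADs and cluster `'0'` as an example; it is not the exact threshold-learning procedure used internally by `seuratpipeline()`:

``` {r, eval=FALSE}
# illustrative only: a simple lower MAD cutoff on library size for one cluster
clust.cells <- WhichCells(sobj, idents = '0')        # cells in cluster "0"
libsizes    <- sobj$nCount_RNA[clust.cells]

mad.cutoff <- median(libsizes) - 3 * mad(libsizes)   # arbitrary 3-MAD left-tail cutoff
table(libsizes < mad.cutoff)                         # how many cells would be flagged
```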

More detail on each option is given in the comments of the example call above.

Here are some suggestions for exploring the dataset and manually assessing quality.

``` {r, echo=FALSE}
library(Seurat)
library(sctransform)
library(tidyverse)
library(cowplot)
library(DoubletFinder)

knitr::opts_chunk$set(fig.width=7, fig.height=5)

sobj <- readRDS('../../tamlabtesting/sobj_default.rds')
```

Inspect the clustering via the `Seurat::DimPlot()` command.


`DimPlot(sobj, label = T)`


![](embed/0.dimplot.clusts.png){height=500 width=600}

`FeaturePlot(sobj, c('nCount_RNA', 'percent.mito', 'S.Score', 'G2M.Score'))`


![](embed/1.fplot.png){height=500 width=600}

`VlnPlot(sobj, c('nCount_RNA', 'nFeature_RNA', 'percent.mito'), ncol = 1)`


![](embed/2.vlnplot.png){height=500 width=600}

`DimPlot(sobj, group.by = 'Phase')`


![](embed/3.dimplot.phases.png){height=500 width=600}

A full assessment of quality will require cell type identification, for example via specific cell type markers. An example of this using Seurat's `FeaturePlot()` function is shown below:

`FeaturePlot(sobj, c('Ptprc', 'Cd3e', 'Cd19', 'Epcam'))`


![](embed/4.fplot.celltype.png){height=500 width=600}

Other cell type calling approaches make use of differentially expressed genes, the calling of which is discussed below. Automated (or at least semi-automated) cell-type calling is an active area of development and will be included in this suite soon.
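
In the meantime, Seurat's own marker-finding functions are handy for a quick manual pass; for example, something along these lines (the parameter values are just common defaults, and the fold-change column is named `avg_logFC` in older Seurat versions):

``` {r, eval=FALSE}
# find positively enriched markers for every cluster
markers <- FindAllMarkers(sobj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# top 5 markers per cluster by log fold-change (avg_logFC in Seurat v3)
top5 <- markers %>%
  group_by(cluster) %>%
  top_n(n = 5, wt = avg_log2FC)

DotPlot(sobj, features = unique(top5$gene)) + RotatedAxis()
```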

An integrated assessment of quality, and the decision to recluster with different QC parameters, is thus a non-trivial and manual process, but these plots and the parameters of the `seuratpipeline()` function allow for easy tweaking.
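
For example, if the plots above suggest residual high-mitochondrial cells or a strong cell cycle signal, the pipeline can simply be re-run with the relevant options changed. A hedged sketch, with parameter choices that are purely illustrative rather than recommendations:

``` {r, eval=FALSE}
pdf('qcplots_rerun.pdf')

sobj2 <- seuratpipeline(data = filepath,
                        format = h5,
                        iterativefilter.mito = T,  # also apply per-cluster mito filtering
                        cellcycleregression = T)   # regress out cell cycle scores

dev.off()
```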

***

## Differentially Expressed Genes

After clustering, it is often useful to run differential expression testing to identify differentially expressed genes (DEGs). These can be used as input for Gene Set Enrichment Analysis (GSEA).

For differential expression testing, Seurat provides wrappers for tests such as the [MAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4676162/) test, specifically developed to identify DEGs while attempting to correct for the biases and noise inherent to scRNAseq data.

If you would like to use the GSEA pipeline included with tamlabscpipeline, it would be beneficial to use the function `deg.acrossclusters()`. This function is a wrapper around Seurat's `FindMarkers()` (itself a wrapper around tests like MAST), and can be used as follows:

``` {r, eval=FALSE}
deg.acrossclusters(sobj, idents, test, latent.vars, outdir)
```

This function will save output files to `outdir`: a directory containing .rds files, each storing a dataframe output by `Seurat::FindMarkers()`. This function takes a while. To help deal with the risk of interruptions, it is written to pick up where it previously left off if stopped part-way through; simply re-run the original invocation and it will try to resume.
On the same dataset mentioned previously (15982 genes x 3522 cells post filtration) with 24 clusters, using a 2018 MacBook Pro, this took ~3 hours and 45 mins.
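
The saved .rds files can be read back in for inspection at any time. A minimal sketch, assuming the `degdir` directory from the Quickstart above (the exact file names inside the directory will depend on your cluster labels):

``` {r, eval=FALSE}
# list the per-cluster DEG results written by deg.acrossclusters()
degfiles <- list.files(degdir, pattern = '\\.rds$', full.names = TRUE)

# read one cluster's results and look at the most significant genes
deg <- readRDS(degfiles[1])               # a dataframe from Seurat::FindMarkers()
head(deg[order(deg$p_val_adj), ])
```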

***

Options:

* sobj: the seurat object, for example outputted by `tamlabscpipeline::seuratpipeline()`
* idents: a categorical variable contained within the Seurat object metadata. Typically, the clustering assignment contained within `sobj$seurat_clusters`. The metadata is accessed via `sobj@meta.data`, which returns a large dataframe containing information for every cell in the dataset; it is best to use `head(sobj@meta.data)` to view it. Default is `seurat_clusters`, which if used directly from `seuratpipeline()` will correspond to clusters calculated at resolution = 1.0.
* test: which test to perform; see `?Seurat::FindMarkers` for details and options. Default is MAST.
* latent.vars: variables which will be regressed out in order to minimize their signal in DEG testing; typically technical noise. Passed as a string or character vector referring to the names of columns in `sobj@meta.data`. Some tests, such as MAST, can only perform noise regression on quantitative rather than categorical variables.
* outdir: the directory to save output files to. Defaults to "ClusterDEG_(Seurat object project name)_(date)_(test)". An example call using these options is sketched after this list.
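
Putting these options together, a call might look like the sketch below. The argument values are purely illustrative (`percent.mito` and `seurat_clusters` are metadata columns created by `seuratpipeline()`, as discussed above), and passing them as strings is assumed here:

``` {r, eval=FALSE}
deg.acrossclusters(sobj,
                   idents      = 'seurat_clusters',
                   test        = 'MAST',
                   latent.vars = 'percent.mito',     # quantitative noise variable
                   outdir      = 'clusterDEG_custom')
```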

***

## GSEA for each cluster

A gene set enrichment analysis (GSEA) pipeline is provided for individual datasets in the form of the `gseapipeline.clusters()` function.

``` {r, eval=FALSE}
gseapipeline.clusters(inputfolder, pathways = hallmark, nperm = 10000,
                      makepdf = F, pdfname = NULL, filter_nonsig_pathways = F)
```

This function takes the result of the `deg.acrossclusters()` function as input. Simply set inputfolder equal to a filepath string pointing to the directory created by that function.

As output, `gseapipeline.clusters()` will return a heatmap, colored by normalized enrichment and denoting significance cutoffs. The heatmap will look something like this: 
![](embed/5.heatmap.png){height=500 width=600}

Significance cutoffs of 0.05 and 0.25 are used: the 0.05 level because the underlying FGSEA algorithm allows standard alpha cutoffs, and the 0.25 level for historical reasons, following the FDR cutoff suggested by the original GSEA algorithm. Please see [Korotkevich, Sukhov, and Sergushichev 2019](https://www.biorxiv.org/content/10.1101/060012v2) for details on the underlying algorithm.

For the same example dataset, on 24 clusters using the default exploratory Hallmark mouse gene signatures with default nperm = 10000, this function takes ~10 seconds.
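
For orientation, the underlying per-cluster call is conceptually similar to the sketch below, in which DEG results are turned into a ranked, named vector of statistics and passed to `fgsea`. The ranking metric shown is just one common choice and is not necessarily the one used internally:

``` {r, eval=FALSE}
# conceptual sketch of a single-cluster fgsea run
degfiles <- list.files(degdir, pattern = '\\.rds$', full.names = TRUE)
deg      <- readRDS(degfiles[1])                  # Seurat::FindMarkers() output for one cluster

ranks <- setNames(deg$avg_log2FC, rownames(deg))  # genes ranked by log fold-change
ranks <- sort(ranks, decreasing = TRUE)           # (avg_logFC in older Seurat versions)

res <- fgsea::fgsea(pathways = tamlabscpipeline::hallmark,
                    stats    = ranks,
                    nperm    = 10000)
head(res[order(res$padj), ])
```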

***

Options / details:

* inputfolder: a string pointing to the location of the `deg.acrossclusters()` output directory. This contains .rds files storing dataframes outputted by `Seurat::FindMarkers()`.
* pathways: a named "list of lists" of genes. The format of this object is important for the function to work and can be previewed in the `tamlabscpipeline::hallmark` object. The name of each list element provides the Y axis; the genes within each list element are the target genes used for GSEA. If NULL, defaults to MSigDB's Hallmark pathways for mouse. 
  + The `msigdbr` package is extremely useful for finding mouse orthologs of the Broad's [MSigDB](http://software.broadinstitute.org/gsea/msigdb/index.jsp) repository of gene signatures. Please see the associated [vignette](https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html). A typical workflow for mouse datasets is to find an interesting gene set using the Broad's GUI and then use the mouse orthologs via msigdbr; a sketch of this appears after this list. The default Hallmark geneset, meant for exploratory analysis, was built in this way.
* nperm: how many permutations FGSEA will use. Facilitates significance testing. Default is 10000.
* makepdf: whether to print to pdf. Only for Mac users currently :/ this is because of limitations with special character encoding when printing to PDF. The marks for significance placed on the heatmap are specially encoded, and Mac makes use of the quartz pdf device. Windows users are invited to explore wrapping the function in a CairoPDF call. This will hopefully be fixed soon.
  + pdfname: name of pdf. Only for Mac / Quartz users
* filter_nonsig_pathways: whether to show only rows that have at least one significant cell in the output heatmap. Default is False. Can be useful for polished plotting after exploratory data analysis.
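
As an example of the msigdbr workflow mentioned above, a custom `pathways` list can be built roughly as follows. This is a sketch only: the collection chosen here is arbitrary, and msigdbr's argument names have changed across versions, so check `?msigdbr` for your installed version:

``` {r, eval=FALSE}
library(msigdbr)

# pull mouse orthologs of an MSigDB collection (Hallmark shown here)
msig <- msigdbr(species = "Mus musculus", category = "H")

# convert to a named list of gene vectors -- the format expected for `pathways`
pathways <- split(msig$gene_symbol, msig$gs_name)

gseapipeline.clusters(inputfolder = degdir, pathways = pathways)
```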


***
#### Final notes

There have been some major improvements over the previous versions, including much better cell QC (more accurately removing low-quality cells while being nicer to okay-quality cells). There have also been important changes to clean up the code, remove unnecessary options, optimize speed, and retain full scaled.data matrices in the final output. Tammela lab users are therefore encouraged to use this version of all code rather than previously shared versions.


### Parallelization

Parallelization, which should increase the speed of all functions, is in the works.





