knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Contact: alexander.ferrena@einsteinmed.org
This vignette will provide guidance for analysis of an individual single-cell RNAseq (scRNAseq) dataset.
install.packages("devtools") #if devtools not already installed devtools::install_github('apf2139/tamlabscpipeline', build_vignettes = T)
install.packages("tidyverse") install.packages('Seurat') devtools::install_github(repo = 'ChristophH/sctransform') install.packages("cowplot") devtools::install_github('chris-mcginnis-ucsf/DoubletFinder') #BioconductorPkgs if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MAST") BiocManager::install("fgsea")
First, load up the package and dependencies. running library(tamlabscpipeline)
will examine the depencies and compare your versions with the latest known compatible versions for tamlabscpipeline
and issue warnings if you are behind / ahead of the latest test.
library(Seurat) library(sctransform) library(tidyverse) library(cowplot) library(DoubletFinder) library(MAST) library(fgsea)
library(tamlabscpipeline)
The following commands will perform QC filtering, clustering, differential gene expression testing and exploratory gene set enrichment analysis on the H5 output file from the Cellranger pipeline.
pdf('qcplots.pdf') sobj <- seuratpipeline(data = filepath, #path to dataset format = h5 #format; typically cellranger dir or H5 ) dev.off() degdir = 'clusterDEG_2020-01-08_MAST' deg.acrossclusters(sobj, degdir) gseapipeline.clusters(inputfolder = degdir)
Altogether, these functions will return:
On a dataset of ~17,000 genes x ~5,500 cells using a 2018 MacBook Pro this took roughly 4 hours. Plenty of time to read the rest of this vignette for more details on the underlying methods and how to further assess the quality of your dataset; think about some hypotheses you'd like to explore in your data; and ponder the precarious global geopolitical situation we find ourselves living in.
seuratpipeline()
takes roughly 20 minsdeg.acrossclusters()
takes roughly 3 hours and 45 mins; this function has been written such that it can pick up where left off if interruptedgseapipeline.clusters()
takes roughly 10 seconds.See below for further details and ideas for QC, data exploration and analysis.
Let's assume you have a dataset at the ready. I will try to provide a small example dataset for reproducible tutorials in the future. Currently this dataset should be the output of Cellranger in the form of a directory or an H5 file.
The seuratpipeline()
function will perform QC and initial clustering of your dataset. This will involve filtering out high mitochondrial content cells, low library size cells, and doublets; then normalizing using SingleCellTransform (SCT) and clustering using the Louvain method as implemented in Seurat.
Here is some explanation about the function and its options. The object filepath
stores a string containing the path to an example file.
pdf('qcplots.pdf') sobj <- seuratpipeline( #required arguments: data file path and format of data. data = filepath, format = h5,#h5 not default, but typical. #only used for kallisto | bustools pipeline transcript_gene_file = NULL, project = NULL, #name of seurat object, sobj@project.name #"baseline", global filtration. baselinefilter.mad = F, baseline.mito.filter = T, madmax.dist.percentmito.baseline = 'predict', baseline.libsize.filter = T, madmax.dist.nCount_RNA.baseline = 'predict, #auto id and remove hi-mito content clusters. removemitomaxclust = T, #iterative filtering options for first-pass clusters. iterativefilter = T, iterativefilter.libsize = 'lefttail', iterativefilter.mito = F, #perform cell cycle correction via regression. Default F. #Still calculates cell cycle scores if set to False. cellcycleregression = F, #manually choose which genes used in PCA. #default if NULL is variable genes called by SCT. PCAgenelist=NULL, #Jackstraw analysis to test for significant PCs. Default = F jackstraw = F, #which PCs to use in clustering and dim-reduction for viz. #default = 1:30 dims = 1:30, #"Resolution parameter" used for Louvain Clustering. #default is three values, final value set as active identity: res = c(0.5, 1.5, 1.0) ) dev.off()
This function is designed to give a rough but pretty good minimum initial clustering across different datasets. On a dataset of ~17,000 genes x ~5,500 cells using a 2018 MacBook Pro this took roughly 20 minutes.
By default, this function will perform QC and clustering of the data. Default QC steps are as follows:
More details about options:
``` {r, echo=FALSE}
Inspect the clustering via the `Seurat::DimPlot()` command. `DimPlot(sobj, label = T)` ![](embed/0.dimplot.clusts.png){height=500 width=600} `FeaturePlot(sobj, c('nCount_RNA', 'percent.mito', 'S.Score', 'G2M.Score'))` ![](embed/1.fplot.png){height=500 width=600} `VlnPlot(sobj, c('nCount_RNA', 'nFeature_RNA', 'percent.mito'), ncol = 1)` ![](embed/2.vlnplot.png){height=500 width=600} `DimPlot(sobj, group.by = 'Phase')` ![](embed/3.dimplot.phases.png){height=500 width=600} A full assesment of quality will require cell type identification, for example via specific cell type markers. An example of this using Seurat's `FeaturePlot()` function is shown below: `FeaturePlot(sobj, c('Ptprc', 'Cd3e', 'Cd19', 'Epcam'))` ![](embed/4.fplot.celltype.png){height=500 width=600} Other cell type calling approaches make use of differentially expressed genes, the calling of which is discussed below. Automated (at at least semi-automated) cell-type calling is an active area of development and will be included in this suite soon. An integrated assessment of quality and the decision to recluster using different QC parameters is thus a non-trivial and manual process, but these functions and the different parameters in the `seuratpipeline()` function will allow easy tweaking. *** ## Differentially Expressed Genes After clustering, it is often useful to run differential expression testing to identify differentiall expressed genes (DEGs). This can be used as input for Gene Set Enrichment Analysis (GSEA). For differential expression testing, Seurat provides wrappers for tests such as the [MAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4676162/) test, specifically developed to identify DEGs while attempting to correct for the biases and noise inherent to scRNAseq data. If you would like to use the GSEA pipeline included with tamlabscpipeline, it would be beneficial to use the function `deg.acrossclusters()`. This function is a wrapper around Seurat's wrapper, which can be used as follows:
deg.acrossclusters(sobj, idents, test, latent.vars, outdir )
This function will save output files to `outdir`. Output is a directory containing .rds files which store dataframes containing the output of Seurat::FindMarkers(). This function takes a while. To help deal with the risk of interruptions, this function is written in a way that tries to pick up where previously left off if stopped part-way through; simply recall the original invocation and it will try to pickup where left off. On the same dataset mentioned previously (15982 genes x 3522 cells post filtration) and 24 clusters, using a MacBookPro 2018, this took ~3 hours and 45 mins. *** Options: * sobj: the seurat object, for example outputted by `tamlabscpipeline::seuratpipeline()` * idents: a categorical variable contained within the Seurat object metadata. Typically, the clustering assignment contained withing `sobj$seurat_clusters`. Meta.data acessed via `sobj@meta.data`, which returns a large dataframe containing information for every cell in the dataset; best to use `head(sobj@meta.data)` to view. Default is `seurat_clusters`, which if used directly from `seuratpipelne()` will correspond to clusters calculated at resolution = 1.0. * test: which test to perform; see `?Seurat::FindMarkers` for details and options. Default is MAST. * latent.vars: variables which will be regressed out in order to minimize their signal in DEG testing. Typically, noise. Called via a string or character vector referring to the name of columns in `sobj@meta.data`. Some tests, such as MAST, can only perform noise regression on quantitative rather than categorical variables. * outdir; the directory to save output files to. Defaults to "ClusterDEG_(Seurat object project name)_(date)_(test)". *** ## GSEA for each cluster A gene set enrichment analysis (GSEA) pipeline is provided for individual datasets in the form of the `gseapipeline.clusters()` function.
gseapipeline.clusters(inputfolder, pathways = hallmark, nperm = 10000, makepdf = F, pdfname = NULL, filter_nonsig_pathways = F )
This function takes the result of the `deg.acrossclusters()` function as input. Simply set inputfolder equal to the a filepath string pointing to the directory created by that function. As output, `gseapipeline.clusters()` will return a heatmap, colored by normalized enrichment and denoting significance cutoffs. The heatmap will look something like this: ![](embed/5.heatmap.png){height=500 width=600} Significance cutoffs of >0.05 and >0.25 are used due to the underlying FGSEA algorithm allowing for standard alpha cutoffs, and for historical reasons inspired by the original GSEA algorithm respectively. Please see [Korotkevich, Sukhov, and Sergushichev 2019](https://www.biorxiv.org/content/10.1101/060012v2) for details on underlying algorithm. For the same example dataset, on 24 clusters using the default exploratory Hallmark mouse gene signatures with default nperm = 10000, this function takes ~10 seconds. *** Options / details: * inputfolder: string pointing to location of `deg.acrossclusters()` output directory. This contains .rds files storing dataframes outputted by Seurat::FindMarkers(). * pathways: A named "list of lists" of genes. The format of this object is important for functionality of this function and can be previewed in the tamlabscpipeline::hallmark object. The names of each list element provides the Y axis; the genes within each list element are the target genes used for GSEA. If NULL, defaults to the Msigdb's Hallmark pathways for mouse. + The `msigdbr` package is extremely useful for finding mouse orthologs of the Broad's [Msigdb](http://software.broadinstitute.org/gsea/msigdb/index.jsp) repository of gene signatures. Please see the associated [vignette](https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html). Typical workflow for mouse datasets is to find an interesting gene set using the Broad's GUI and then use the mouse orthologs via msigdbr. The default Hallmark geneset, meant for exploratory analysis, was found in this way. * nperm: how many permutations FGSEA will use. Facilitates significance testing. Default is 10000. * makepdf: whether to print to pdf. Only for Mac users currently :/ this is because of limitations with special character encoding / printing to PDF. The marks for siginificance placed on the heatmap are specially encoded. Mac makes use of the quartz pdf device. Windows users are invited to explore wrapping the function in a CairoPDF call. This will hopefully be fixed soon. + pdfname: name of pdf. Only for Mac / Quartz users * filter_nonsig_pathways: whether to show only rows that have at least one signifcant cell in the output heatmap. Default is False. Can be useful for polished plotting after exploratory data analysis. *** #### Final notes There have been some major improvements over the previous versions, including much better cell QC (more accrurately removing low-quality cells while being nicer to okay-quality cells). Also, there have been important changes made to clean up the code, remove unnecesarry options, optimize speed, and retain full scaled.data matrices at the final output. Tammela lab users are therefore encouraged to use this version of all code rather than previously shared versions. ### Parallelization Parallelization should increase the speed of all functions. Parallelization is in the works. ``` {r, echo=FALSE} #rm(sobj)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.