knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The ascend
R package provides simple, robust tools for the processing and
exploration of data generated by single cell RNA sequencing (scRNA-seq).
Summarized in Figure 1, the following stages of scRNA-seq analysis workflows can
be performed with ascend
:
This vignette provides general overview of the ascend
R package.
Other vignettes cover the following topics:
ascend
and package dependanciesascend
ascend
tips and tricksThe main source of input is an expression matrix, or a gene-barcode matrix containing transcript or read counts. Columns represent individual cells, while rows represent genes. Expression matrices are usually the endpoint of scRNA-seq processing pipelines. Data generated by Chromium, DropSeq and inDrop have been tested with this package.
We have supplied an example expression matrix that has been sampled from a larger dataset that is dsecribed in the publication by [Daniszewski et al. 2017] ().
library(ascend) data(raw_set) EMSet <- raw_set counts <- counts(EMSet) counts[1:5, 1:5]
ascend
is able to use any row and column names in the expression matrix,
provided they abide by the following criteria:
It is beneficial to prepare cell and gene information to accompany the expression matrix.
If you are working with over 10,000 cells on a personal computer - it is highly
recommended you load your expression matrix as a
sparse matrix. This
decreases the amount of memory used by the object and enables ascend
to take
advantage of the matrix functions from the Matrix
package.
Metadata refers to information about other data; in the context of scRNA-seq
analysis, this would include information such as sample-specific experimental
conditions for each cell, and categories a gene is linked to. As ascend
makes
extensive use of this information for tasks such as quality control,
normalisation and differential expression, metadata should be included in an
EMSet.
Cell information should be loaded into a data frame containing cell identifiers in the first column (cell_barcode), the name of the batch the cell was sequenced in (batch) in the second column, and additional information such as experimental conditions in subsequent columns.
This data is loaded into an EMSet slot called "colInfo".
# Load dataframe stored in colInfo slot. col_info <- colInfo(EMSet) print(col_info)
ascend
will prepare this data frame for you if none is supplied, but will
assume all cells belong to a single batch.
Users can also update the dataframe with additional information. For our example, we will add a label that indicates whether or not the cell expresed a surface marker called THY1 during flow sorting. We know all THY1-expressing cells were sorted into sample 1 while THY1-negative cells were sorted into sample 2.
# Retrieve list of batches stored in colInfo batch_list <- col_info$batch # Replace 1 values in col_info with "Positive" by identifying indices of 1 # values in the list batch_list[which(batch_list == 1)] <- "Positive" # Replace 2 values in col_info with "Negative" by identifying indices of 2 # values in the list batch_list[which(batch_list == 2)] <- "Negative" # Add as a new column in col_info data frame. col_info$THY1 <- batch_list # Update EMSet with it colInfo(EMSet) <- col_info # Print colInfo print(colInfo(EMSet))
Cells can also be labelled based on the expression of specific genes. For
example, we can use the addGeneLabel
function to label cells that express
POU4F1, POU4F2 and POU4F3.
gene_markers <- c("POU4F1", "POU4F2", "POU4F3") EMSet <- addGeneLabel(EMSet, gene = gene_markers) colInfo(EMSet)
Gene information should be loaded into a dataframe containing the gene identifiers used in the expression matrix in the first column, followed by other information.
This data is loaded into the EMSet slot rowInfo
.
# Load dataframe stored in colInfo slot. row_info <- rowInfo(EMSet) print(row_info)
ascend
will also prepare this data frame for you if none is supplied, but
it will only contain the gene identifiers used in the expression matrix.
If you need to access other packages that use different
gene identifiers (eg. ENSEMBL gene identifiers for cyclone), you can store these
identifiers in the data frame and switch the dataset to that set of identifiers
using the convertGeneID
function.
ensembl_set <- convertGeneID(EMSet, new.annotation = "ensembl_gene_id") print("rowInfo dataframe") rowInfo(ensembl_set) print("Count matrix") counts(ensembl_set)[1:5, 1:5]
You may define genes as controls. It is required if you would like to use
ascend
's filtering functions. Controls are usually mitochondrial and ribosomal
genes, but can also include spike-ins if they were included in the study.
Controls should be organised into groups of control types and in a named list, and the genes must be present in the expression matrix.
# Example controls - generate them from row names controls <- list(Mt = grep("^Mt-", rownames(EMSet), ignore.case = TRUE, value = TRUE), Rb = grep("^Rps|^Rpl", rownames(EMSet), ignore.case = TRUE, value = TRUE)) # Set controls in the EMSet controls(EMSet) <- controls # Retrieve controls print(controls(EMSet))
Sometimes, you will need to remove the controls from the scRNA-seq data. This
can be done with the excludeControl
function.
EMSet <- excludeControl(EMSet, control = c("Mt", "Rb"))
The core of the ascend
R package is the Expression and Metadata Set (EMSet).
This object stores user-supplied data in the form of a gene-cell expression
matrix generated by any scRNA-seq platform, in addition to metadata such as
cell- or gene-related information. Information generated by ascend
's
analytical functions are stored in the EMSet and are accessed by downstream
functions. Operations are logged within the object and users may review quality
control metrics that are updated every time a change is made to the count
matrix.
The EMSet extends the SingleCellExperiment superclass from Bioconductor ensuring all layers of the object remain synchronized and users have access to specialized methods for data access. This also enables integration into other single cell analytical workflows; an EMSet can be created from a SingleCellExperiment object, and an EMSet can be converted into a SingleCellExperiment. Convenience functions are supplied to ensure all information stored in an EMSet are preserved upon conversion.
An EMSet consists of the following slots and information:
ascend
environment are stored in the normcounts assay, and the
log-transformed normalised values are stored in the logcounts assay.ascend
.ascend
, such as cell-related
quality control metrics.ascend
.ascend
, such as gene-related
quality control metrics.ascend
offers dimensionality reduction via Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbourhood Embedding (t-SNE).An EMSet can be created with the EMSet
function. As the EMSet is built off
the SingleCellExperiment class, users may load assays that are stored in a pre-
existing SingleCellExperiment object into this function.
# Get sample data to make a new set counts <- counts(EMSet) col_info <- colInfo(EMSet) row_info <- rowInfo(EMSet) controls <- controls(EMSet) # Creating an EMSet from scratch em_set <- EMSet(list(counts = counts), colInfo = col_info, rowInfo = row_info, controls = controls) # Loading an EMSet from a SingleCellExperiment single_cell_experiment <- SingleCellExperiment(list(counts = counts)) em_set <- EMSet(single_cell_experiment, colInfo = col_info, rowInfo = row_info, controls = controls)
Refer to the Analysis of retina ganglion cells with ascend vignette for more information on how to prepare and load data into an EMSet.
An EMSet is organised as layers of information accessible via functions. These functions can be used to get and set elements of an EMSet.
# Load a 'complete' EMSet data(analyzed_set) EMSet <- analyzed_set # Get assays count_matrix <- counts(EMSet) norm_matrix <- normcounts(EMSet) logcounts_matrix <- logcounts(EMSet) # Set assays counts(EMSet) <- count_matrix normcounts(EMSet) <- norm_matrix logcounts(EMSet) <- logcounts_matrix # Get gene and cell information col_info <- colInfo(EMSet) col_data <- colData(EMSet) row_info <- rowInfo(EMSet) row_data <- rowData(EMSet) # Set gene and cell information colInfo(EMSet) <- col_info colData(EMSet) <- col_data rowInfo(EMSet) <- row_info rowData(EMSet) <- row_data # Get reduced dimensionality data tsne_matrix <- reducedDim(EMSet, "TSNE") pca_matrix <- reducedDim(EMSet, "PCA") umap_matrix <- reducedDim(EMSet, "UMAP") # Get cluster analysis clusterAnalysis <- clusterAnalysis(EMSet) # Get progress log progressLog <- progressLog(EMSet)
The EMSet can also be treated as as a data frame through the use of row/column accessors.
# Reduce EMSet to first ten cells and genes tiny_EMSet <- EMSet[1:10,1:10] # Review content in smaller dataset print(counts(tiny_EMSet)) print(colInfo(tiny_EMSet)) print(rowInfo(tiny_EMSet))
The subgroups within the EMSet can also be subsetted from the EMSet with the
subsetCondition
function. The condition needs to be set in the colInfo slot.
# Subset batch 1 from the combined EMSet Batch1_EMSet <- subsetCondition(EMSet, by = "batch", conditions = list(batch = c(1)))
If you have an EMSet created with older versions of ascend
< 0.6.0, you may
import your old objects into the new environment as follows:
# Read in old EMSet stored as an RDS file legacy_EMSet <- readRDS("LegacyEMSet.rds") # Update to new object # Please ensure your new object has the same name as the old object legacy_EMSet <- updateObject(legacy_EMSet)
Please note that if your data has been normalised, it will be stored in the
counts
and normcounts
slots.
Most plots in ascend
are generated via the ggplot2 package. These plots can be
modified further by the user with ggplot2 functions, as described in the package
documentation.
| Function | Description | |-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | plotGeneralQC | Generates a list of quality control plots | | plotLibsizeBarplot | Called by plotGeneralQC. Generates a barplot representing number of reads per cell. | | plotLibsizeHist | Called by plotGeneralQC. Generates a histogram representing the number of cells with a specific library size. | | plotAverageGeneCount | Called by plotGeneralQC. Generates a histogram representing the number of genes with a specific average expression level. | | plotTopGenesBoxplot | Called by plotGeneralQC. Generates a series of boxplots representing the expression of the top most abundant genes. | | plotTopGenesPerSample | Called by plotGeneralQC. Generates a beeswarm-violin plot for each sample that represents the relationship between the counts of the top 500 most expressed genes and the top 100 most expressed genes. | | plotFeatureHist | Called by plotGeneralQC if controls are defined. Histogram of non-control counts associated with each cell. | | plotGeneNumber | Called by plotGeneralQC if controls are defined. Histogram of non-control genes associated with each cell. | | plotControlHist | Called by plotGeneralQC. Generates a histogram of the percentage of control gene expression for each cell. | | plotControlPctPerSample | Called by plotGeneralQC. Generates a beeswarm-violin plot representing the proportion of control-associated genes to whole expression of each gene. | | plotNormQC | Generates plots for the quality control of normalised data. Plots include a histogram for pre-normalised and normalised library sizes, scatter boxplots for selected genes and scatter boxplots for counts sampled from 100 genes. | | plotPCAVariance | Generates a scree plot of percentage variance of genes, sorted from largest to smallest. | | plotPCA | Generates a scatter plot based off the PCA matrix. | | plotTSNE | Generates a scatter plot based off the TSNE matrix. | | plotMDS | Generates a scatter plot based of MDS values derived from the PCA matrix and distance matrix. | | plotUMAP | Generates a scatter plot based of UMAP values derived from the expression matrix. | | plotStabilityDendro | Generates a dendrogram with coloured bars representing stability. DOES NOT USE GGPLOT2. | | plotStability | Generates a line graph representing stability, consecutive index and Rand index. | | plotDendrogram | Generates a colour-labelled dendrogram with cluster sizes added. DOES NOT USE GGPLOT2. | | plotVolcano | Generates a volcano plot (scatter plot) using output generated by the differential expression functions of the ascend package. | | plotVariableGenes | Generates a scatter plot to aid in the detection of variable genes. |
t-SNE plots are a popular way to visualise scRNA-seq data. As t-SNE plots are
best used in conjunction with information generated from other analysis, the
plotTSNE function in ascend
can use information stored in the colInfo slot
label cells.
Before we can generate a t-SNE plot, we need to reduce the data with the method.
We will supply the runTSNE
function with a seed value as we may want to
replicate this analysis. To speed up the generation of a t-SNE matrix, we will
use the values generated by Principal Component Analysis (PCA) function runPCA
.
EMSet <- runTSNE(EMSet, dims = 2, PCA = TRUE, seed = 1)
You can retrieve the t-SNE, UMAP and PCA matrix using the reducedDim
function.
# Retrieve dimensionality-reduced matrices tsne_matrix <- reducedDim(EMSet, "TSNE") pca_matrix <- reducedDim(EMSet, "PCA") umap_matrix <- reducedDim(EMSet, "UMAP") # Display retrieved matrices print("t-SNE matrix") print(tsne_matrix[1:10, ]) print("PCA matrix") print(pca_matrix[1:10, 1:5]) print("UMAP matrix") print(umap_matrix[1:10, ])
We can then use plotTSNE
to visualise the clusters we identified via the
runCORE function.
tsne_plot <- plotTSNE(EMSet, group = "cluster") tsne_plot
As the t-SNE plot was generated via ggplot2, we can use ggplot2 functions to modify the graph. In this case, we want to change the colours of the clusters.
library(ggplot2) # Set colour palette tsne_plot <- tsne_plot + scale_color_manual(values=c("#bb5f4c", "#8e5db0", "#729b57")) # Add title tsne_plot <- tsne_plot + ggtitle("Clusters", subtitle = "tSNE plot") # Put legend on the bottom tsne_plot <- tsne_plot + theme(legend.position = "bottom") tsne_plot
A UMAP plot can be generated and labelled in a similar fashion.
library(ggplot2) # Generate plot using plotUMAP umap_plot <- plotUMAP(EMSet, group = "cluster", Dim1 = 1, Dim2 = 2) # Set colour palette umap_plot <- umap_plot + scale_color_manual(values=c("#bb5f4c", "#8e5db0", "#729b57")) # Add title umap_plot <- umap_plot + ggtitle("Clusters", subtitle = "UMAP plot") # Put legend on the bottom umap_plot <- umap_plot + theme(legend.position = "bottom") umap_plot
Many Bioconductor packages that are dedicated to scRNA-seq analysis use classes
that are related to the SingleCellExperiment class. As the EMSet contains
additional information, we have provided a conversion function convert
that
preserves this information within a SingleCellExperiment object. The resulting
SingleCellExperiment object can then be reverted to the EMSet using the same
function.
# Convert to SingleCellExperiment SingleCellExperiment <- convert(EMSet, to = "sce") head(SingleCellExperiment) # Revert to SingleCellExperiment EMSet <- convert(SingleCellExperiment, to = "EMSet") head(EMSet)
The Seurat package is the most popular tool for the analysis of scRNA-seq data. The convert function can also be used to transform an EMSet to a Seurat object. As with SingleCellExperiment objects, we can turn a Seurat object into an EMSet with the same function.
SeuratObject <- convert(EMSet, to = "seurat") SeuratObject EMSet <- convert(SeuratObject, to = "EMSet")
The scone
package provides more advanced methods for filtering and
the comparison of normalisation methods. You can use the convert
function to
import and export data for use with the scone
workflow.
sconeObject <- convert(EMSet, to = "scone") sconeObject EMSet <- convert(sconeObject, to = "EMSet")
We have created wrappers for specific tasks in scran
, DESeq
and DESeq2
.
They provide alternatives for specific stages in the ascend
workflow and
provide more in-depth analysis.
This method is regarded as an advanced normalisation method for scRNA-seq data. This method groups the cells into pools for size factor calculation before they are applied to the dataset.
scran_normalised <- scranNormalise(EMSet, quickCluster = FALSE, min.mean = 1e-05)
As the Homo sapiens training dataset that comes with scran uses ENSEMBL gene identifiers, we need to convert the EMSet's gene identifiers to them. Fortunately, we have retained these in rowInfo under the "ensembl_gene_id" column.
# Convert identifiers to ENSEMBL gene identifiers ensembl_set <- convertGeneID(RawSet, new.annotation = "ensembl_gene_id") # Load training data from scran training_data <- readRDS(system.file("exdata", "human_cycle_markers.rds", package = "scran")) # Run cell cycle scran_cellcycle <- scranCellCycle(ensembl_set, training_set = training_data) # Show cell cycle results colInfo(scran_cellcycle)
We have included wrappers for DESeq as the differential expression analysis for our earlier scRNA-seq datasets were preformed with this package. We found the results were comparable with other methods. Our optimisations have decreased the runtime of this package as well. If you have the memory resources, you can use the parallel option to speed up the runtime of the package. This may not be appropriate for some datasets though; if this function fails, please run the function without the parallel option.
cluster2_vs_others <- runDESeq(EMSet, group = "cluster", condition.a = 2, condition.b = c(1, 3), ngenes = 1500, parallel = TRUE)
We have also included wrappers for DESeq2 on request. This package requires substantially more time to run.
cluster1_vs_others <- runDESeq2(EMSet, group = "cluster", condition.a = 1, condition.b = c(2, 3), ngenes = 1500)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.