Single-cell RNA-seq analysis utilities

knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)


r Biocpkg("scuttle") provides various low-level utilities for single-cell RNA-seq data analysis, focusing on basic normalization, quality control and transformation of expression values. These functions are intended for use by analysts, usually at the start of the analysis workflow; or by developers of other packages, to assemble more complex functions in downstream analysis steps.

To demonstrate, we will obtain the classic Zeisel dataset from the r Biocpkg("scRNAseq") package. In this case, the dataset is provided as a SingleCellExperiment object. However, most r Biocpkg("scuttle") functions can also be used with raw expression matrices or instances of the more general SummarizedExperiment class.

sce <- ZeiselBrainData()

Note: A more comprehensive description of the use of r Biocpkg("scuttle") functions (along with other packages) in a scRNA-seq analysis workflow is available at

Cell-level quality control

Computing metrics

The perCellQCMetrics() function computes a variety of basic cell-level metrics, including sum, total number of counts for the cell (i.e., the library size); detected, the number of features for the cell that have counts above the detection limit (default of zero); and subsets_X_percent, percentage of all counts that come from the feature control set named X.

is.mito <- grep("mt-", rownames(sce))
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))

It is often convenient to store this in the colData() of our SingleCellExperiment object for future reference. One can either do so manually:

colData(sce) <- cbind(colData(sce), per.cell)

... or just use the addPerCellQC() function:

sce2 <- addPerCellQC(sce, subsets=list(Mito=is.mito))

Filtering low-quality cells

We identify low-quality cells by setting a threshold on each of these metrics using the isOutlier() function. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median; values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using type="lower") for the log-total counts at 3 MADs from the median. <- !isOutlier(per.cell$sum, type="lower", log=TRUE)
filtered <- sce[,]

For typical scRNA-seq applications, quickPerCellQC() will conveniently detect outliers across several common metrics. This uses the total count, number of detected features and the percentage of counts in gene sets of diagnostic value (e.g., mitochondrial genes, spike-in transcripts) to identify which cells to discard and for what reason.

qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
filtered <- sce[,!qc.stats$discard]

The isOutlier approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, differences in total RNA content between cell types. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the book for more details.

Computing feature-level statistics

Some basic feature-level statistics are computed by the perFeatureQCMetrics() function. This includes mean, the mean count of the gene/feature across all cells; detected, the percentage of cells with non-zero counts for each gene; subsets_Y_ratio, ratio of mean counts between the cell control set named Y and all cells.

# Pretending that the first 10 cells are empty wells, for demonstration.
per.feat <- perFeatureQCMetrics(sce, subsets=list(Empty=1:10))

A more refined calculation of the average is provided by the calculateAverage() function, which adjusts the counts by the relative library size (or size factor) prior to taking the mean.

ave <- calculateAverage(sce)

These metrics tend to be more useful for informing the analyst about the overall behavior of the experiment, rather than being explicitly used to filter out genes. For example, one would hope that the most abundant genes are the "usual suspects", e.g., ribosomal proteins, actin, histones.

Utilities for normalizaton

Computing size factors

r Biocpkg("scuttle") provides a number of utilities for global scaling normalization, where the counts for each cell are divided by a cell-specific "size factor" to adjust for uninteresting differences in sequencing depth and capture efficiency. By default, the size factor is automatically computed from the library size of each cell using the librarySizeFactors() function. This calculation simply involves scaling the library sizes so that they have a mean of 1 across all cells.


r Biocpkg("scuttle") also implements two other basic methods of computing size factors, either from the geometric mean or using a r Biocpkg("DESeq")-esque approach based on the median. These have their own strengths and weaknesses that are discussed in the relevant documentation pages.


Note that if size factors are explicitly provided in the SingleCellExperiment, they will be used by the downstream normalization functions. Size factors can be manually added by sizeFactors()<- or with the functions like:

sce <- computeLibraryFactors(sce)

Normalizing expression values

The most commonly used function is logNormCounts(), which calculates log~2~-transformed normalized expression values by dividing each count by the size factor for the cell, adding a constant pseudo-count and log-transforming. The resulting values can be roughly interpreted on the same scale as log-transformed counts and are stored in "logcounts".

sce <- logNormCounts(sce)

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay. Here, we use the normalizeCounts() function to perform some custom normalization with random size factors.

assay(sce, "normed") <- normalizeCounts(sce, 
    size.factors=runif(ncol(sce)), pseudo.count=1.5)

r Biocpkg("scuttle") can also calculate counts-per-million using the aptly-named calculateCPM() function. The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object. Related functions include calculateTPM() and calculateFPKM(), which do pretty much as advertised.

assay(sce, "cpm") <- calculateCPM(sce)

Aggregation across groups or clusters

The aggregateAcrossCells() function is helpful for aggregating expression values across groups of cells. For example, we might wish to sum together counts for all cells in the same cluster, possibly to use as a summary statistic for downstream analyses (e.g., for differential expression with r Biocpkg("edgeR")). This will also perform the courtesy of sensibly aggregating the column metadata for downstream use.

agg.sce <- aggregateAcrossCells(sce, ids=sce$level1class)
colData(agg.sce)[,c("ids", "ncells")]

It is similarly possible to sum across multiple factors, as shown below for the cell type and the tissue of origin. This yields one column per combination of cell type and tissue, which allows us to conveniently perform downstream analyses with both factors.

agg.sce <- aggregateAcrossCells(sce, 
    ids=colData(sce)[,c("level1class", "tissue")])
colData(agg.sce)[,c("level1class", "tissue", "ncells")]

Summation across rows may occasionally be useful for obtaining a measure of the activity of a gene set, e.g., in a pathway. Given a list of gene sets, we can use the sumCountsAcrossFeatures() function to aggregate expression values across features. This is usually best done by averaging the log-expression values as shown below.

agg.feat <- sumCountsAcrossFeatures(sce,
    ids=list(GeneSet1=1:10, GeneSet2=11:50, GeneSet3=1:100),
    average=TRUE, exprs_values="logcounts")

Similar functions are available to compute the number or proportion of cells with detectable expression in each group. To wit:

agg.n <- numDetectedAcrossCells(sce,
    ids=colData(sce)[,c("level1class", "tissue")])

Miscellaneous functions

Reading in sparse matrices

Normally, sparse matrices are provided in the MatrixMarket (.mtx) format, where they can be read efficiently into memory using the readMM() function from the r CRANpkg("Matrix") package. However, for some reason, it has been popular to save these files in dense form as tab- or comma-separate files. This is an inefficient and inconvenient approach, requiring users to read in the entire dataset in dense form with functions like read.delim() or read.csv() (and hope that they have enough memory on their machines to do so).

In such cases, r Biocpkg("scuttle") provides the readSparseCounts() function to overcome excessive memory requirements. This reads in the dataset in a chunkwise manner, progressively coercing each chunk into a sparse format and combining them into a single sparse matrix to be returned to the user. In this manner, we never attempt to load the entire dataset in dense format to memory.

# Mocking up a dataset to demonstrate:
outfile <- tempfile()
write.table(counts(sce)[1:100,], file=outfile, sep="\t", quote=FALSE)

# Reading it in as a sparse matrix:
output <- readSparseCounts(outfile)

Making gene symbols unique

When publishing a dataset, the best practice is to provide gene annotations in the form of a stable identifier like those from Ensembl or Entrez. This provides an unambiguous reference to the identity of the gene, avoiding difficulties with synonynms and making it easier to cross-reference. However, when working with a dataset, it is more convenient to use the gene symbols as these are easier to remember.

Thus, a common procedure is to replace the stable identifiers in the row names with gene symbols. However, this is not straightforward as the gene symbols may not exist (NAs) or may be duplicated. To assist this process, r Biocpkg('scuttle') provides the uniquifyFeatureNames() function that emit gene symbols if they are unique; append the identifier, if they are duplicated; and replace the symbol with the identifier if the former is missing.

# Original row names are Ensembl IDs.
sce.ens <- ZeiselBrainData(ensembl=TRUE)

# Replacing with guaranteed unique and non-missing symbols:
rownames(sce.ens) <- uniquifyFeatureNames(
    rownames(sce.ens), rowData(sce.ens)$originalName

Creating a data.frame

The makePerCellDF() and makePerFeatureDF() functions create data.frames from the SingleCellExperiment object. In the makePerCellDF() case, each row of the output data.frame corresponds to a cell and each column represents the expression of a specified feature across cells, a field of the column metadata, or reduced dimensions (if any are available).

out <- makePerCellDF(sce, features="Tspan12")

In the makePerFeatureDF() case, each row of the output data.frame corresponds to a gene and each column represents the expression profile of a specified cell or the values of a row metadata field.

out2 <- makePerFeatureDF(sce, cells=c("1772063062_D05",
    "1772063061_D01", "1772060240_F02", "1772062114_F05"))

The aim is to enable the data in a SingleCellExperiment to be easilybe used in functions like model.matrix() or in ggplot(), without requiring users to manually extract the desired fields from the SingleCellExperiment to construct their own data.frame.

Session information {.unnumbered}


Try the scuttle package in your browser

Any scripts or data that you put into this service are public.

scuttle documentation built on Dec. 19, 2020, 2 a.m.