In LTLA/scuttle: Single-Cell RNA-Seq Analysis Utilities

knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(BiocStyle)
set.seed(10918)

Introduction

r Biocpkg("scuttle") provides various low-level utilities for single-cell RNA-seq data analysis, typically used at the start of the analysis workflows or within high-level functions in other packages. This vignette will discuss the use of miscellaneous functions for scRNA-seq data processing. To demonstrate, we will obtain the classic Zeisel dataset from the r Biocpkg("scRNAseq") package, and apply some quick quality control to remove damaged cells.

library(scRNAseq)
sce <- ZeiselBrainData()

library(scuttle)
sce <- quickPerCellQC(sce, subsets=list(Mito=grep("mt-", rownames(sce))),
    sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent")) 

sce

Note: A more comprehensive description of the use of r Biocpkg("scuttle") functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

Aggregation across groups or clusters

The aggregateAcrossCells() function is helpful for aggregating expression values across groups of cells. For example, we might wish to sum together counts for all cells in the same cluster, possibly to use as a summary statistic for downstream analyses (e.g., for differential expression with r Biocpkg("edgeR")). This will also perform the courtesy of sensibly aggregating the column metadata for downstream use.

library(scuttle)
agg.sce <- aggregateAcrossCells(sce, ids=sce$level1class)
head(assay(agg.sce))
colData(agg.sce)[,c("ids", "ncells")]

It is similarly possible to sum across multiple factors, as shown below for the cell type and the tissue of origin. This yields one column per combination of cell type and tissue, which allows us to conveniently perform downstream analyses with both factors.

agg.sce <- aggregateAcrossCells(sce, 
    ids=colData(sce)[,c("level1class", "tissue")])
head(assay(agg.sce))
colData(agg.sce)[,c("level1class", "tissue", "ncells")]

Summation across rows may occasionally be useful for obtaining a measure of the activity of a gene set, e.g., in a pathway. Given a list of gene sets, we can use the sumCountsAcrossFeatures() function to aggregate expression values across features. This is usually best done by averaging the log-expression values as shown below.

sce <- logNormCounts(sce)
agg.feat <- sumCountsAcrossFeatures(sce,
    ids=list(GeneSet1=1:10, GeneSet2=11:50, GeneSet3=1:100),
    average=TRUE, exprs_values="logcounts")
agg.feat[,1:10]

Similar functions are available to compute the number or proportion of cells with detectable expression in each group. To wit:

agg.n <- summarizeAssayByGroup(sce, statistics="prop.detected",
    ids=colData(sce)[,c("level1class", "tissue")])
head(assay(agg.n))

Reading in sparse matrices

Normally, sparse matrices are provided in the MatrixMarket (.mtx) format, where they can be read efficiently into memory using the readMM() function from the r CRANpkg("Matrix") package. However, for some reason, it has been popular to save these files in dense form as tab- or comma-separate files. This is an inefficient and inconvenient approach, requiring users to read in the entire dataset in dense form with functions like read.delim() or read.csv() (and hope that they have enough memory on their machines to do so).

In such cases, r Biocpkg("scuttle") provides the readSparseCounts() function to overcome excessive memory requirements. This reads in the dataset in a chunkwise manner, progressively coercing each chunk into a sparse format and combining them into a single sparse matrix to be returned to the user. In this manner, we never attempt to load the entire dataset in dense format to memory.

# Mocking up a dataset to demonstrate:
outfile <- tempfile()
write.table(as.matrix(counts(sce)[1:100,]), 
    file=outfile, sep="\t", quote=FALSE)

# Reading it in as a sparse matrix:
output <- readSparseCounts(outfile)
class(output)

Making gene symbols unique

When publishing a dataset, the best practice is to provide gene annotations in the form of a stable identifier like those from Ensembl or Entrez. This provides an unambiguous reference to the identity of the gene, avoiding difficulties with synonynms and making it easier to cross-reference. However, when working with a dataset, it is more convenient to use the gene symbols as these are easier to remember.

Thus, a common procedure is to replace the stable identifiers in the row names with gene symbols. However, this is not straightforward as the gene symbols may not exist (NAs) or may be duplicated. To assist this process, r Biocpkg('scuttle') provides the uniquifyFeatureNames() function that emit gene symbols if they are unique; append the identifier, if they are duplicated; and replace the symbol with the identifier if the former is missing.

# Original row names are Ensembl IDs.
sce.ens <- ZeiselBrainData(ensembl=TRUE)
head(rownames(sce.ens)) 

# Replacing with guaranteed unique and non-missing symbols:
rownames(sce.ens) <- uniquifyFeatureNames(
    rownames(sce.ens), rowData(sce.ens)$originalName
)
head(rownames(sce.ens))

Creating a `data.frame`

The makePerCellDF() and makePerFeatureDF() functions create data.frames from the SingleCellExperiment object. In the makePerCellDF() case, each row of the output data.frame corresponds to a cell and each column represents the expression of a specified feature across cells, a field of the column metadata, or reduced dimensions (if any are available).

out <- makePerCellDF(sce, features="Tspan12")
colnames(out)

In the makePerFeatureDF() case, each row of the output data.frame corresponds to a gene and each column represents the expression profile of a specified cell or the values of a row metadata field.

out2 <- makePerFeatureDF(sce, cells=c("1772063062_D05",
    "1772063061_D01", "1772060240_F02", "1772062114_F05"))
colnames(out2)

The aim is to enable the data in a SingleCellExperiment to be easily used in functions like model.matrix() or in ggplot(), without requiring users to manually extract the desired fields from the SingleCellExperiment to construct their own data.frame.