knitr::opts_chunk$set(
  echo = TRUE,
  cache = FALSE,
  collapse = TRUE,
  comment = "#>",
  crop = NA
)
suppressPackageStartupMessages({
    require(BiocStyle)
})
htmltools::tagList(rmarkdown::html_dependency_font_awesome())

Preparation of example scRNAseq data {#prepare-scrnaseq-data}

In this workshop, we use example data from the r Biocpkg("TENxPBMCData") package. This package provides an R / Bioconductor resource for representing and manipulating different single-cell RNA-seq data sets profiling peripheral blood mononuclear cells (PBMC) generated by 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets).

library(TENxPBMCData)

The man page for the TENxPBMCData() function gives an idea of the datasets that are available from this package. It can be opened with the following command.

help(TENxPBMCData)

Here, we use the pbmc3k dataset, which contains gene expression profiles for 2,700 single peripheral blood mononuclear cells. The first time this dataset is loaded, this command downloads the dataset to a local cache, which takes some time, depending on the speed of your internet connection. Subsequent times, the same command loads the dataset directly from the local cache.

sce <- TENxPBMCData(dataset = "pbmc3k")

At this point we can inspect the dataset in the console.

sce

The dataset is provided as an object of the SingleCellExperiment class. In particular, this summary view indicates that the following pieces of information are available:

Note that a SingleCellExperiment object (or, more generally, any SummarizedExperiment object) like this one already contains sufficient information to launch an interactive application instance to visualize the available data and metadata, using the iSEE() function.

For the purpose of this workshop, we first apply some preprocessing to the SingleCellExperiment object, in order to populate it with more information that can be visualized with iSEE.

We start by adding column names to the object, and use gene symbols instead of Ensembl IDs as row names. In the case where multiple Ensembl identifiers correspond to the same gene symbol, the scater::uniquifyFeatureNames function concatenates the Ensembl ID and the gene symbol in order to generate unique feature names.

library(scater)
colnames(sce) <- paste0("Cell", seq_len(ncol(sce)))
rownames(sce) <- scater::uniquifyFeatureNames(
    ID = rowData(sce)$ENSEMBL_ID,
    names = rowData(sce)$Symbol_TENx
)
head(rownames(sce))

Next, we use the r Biocpkg("scater") package to calculate gene- and cell-level quality metrics. These metrics are added as columns to the rowData and colData slots of the SingleCellExperiment object, respectively. We also add some additional metrics that are not automatically computed by the r Biocpkg("scater") package.

MT <- rownames(sce)[grep("^MT-", rownames(sce))]
sce <- scater::addPerCellQC(sce, subsets = list(MT = MT))
sce <- scater::addPerFeatureQC(sce)
sce$log10_total <- log10(sce$total)
rowData(sce)$n_cells <- as.integer(rowData(sce)$detected * ncol(sce))
rowData(sce)$log10_total <- log10(rowSums(assay(sce, "counts")) + 1)
sce

We filter out a few cells with a large fraction of the counts coming from mitochondrial genes, since these may be damaged cells. Notice the reduced number of columns in the dataset below.

(sce <- sce[, sce$subsets_MT_percent < 5])

Next, we calculate size factors and normalized and log-transformed expression values, using the r Biocpkg("scran") and r Biocpkg("scater") packages. Note that it is typically recommended to pre-cluster the cells before computing the size factors, as follows:

# set.seed(1000)
# clusters <- scran::quickCluster(sce, BSPARAM = IrlbaParam())
# sce <- scran::computeSumFactors(sce, cluster = clusters, min.mean = 0.1)

However, for time reasons, we will skip the pre-clustering step in this workshop.

library(scran)
assay(sce, "counts") <- as(assay(sce, "counts"), "sparseMatrix") # dirty fix for DockerHub
sce <- scran::computeSumFactors(sce, min.mean = 0.1)
summary(sizeFactors(sce))
sce <- scater::logNormCounts(sce)

In order to extract the most informative genes, we first model the mean-variance trend and decompose the variance into biological and technical components.

dec <- scran::modelGeneVar(sce)
top.dec <- dec[order(dec$bio, decreasing = TRUE), ] 
head(top.dec)

Next, we apply Principal Components Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) to generate low-dimensional representations of the cells in our data set. These low-dimensional representations are added to the reducedDim slot of the SingleCellExperiment object.

library(BiocSingular)
set.seed(1000)
sce <- scran::denoisePCA(sce, technical = dec)
ncol(reducedDim(sce, "PCA"))
set.seed(1000)
sce <- scater::runTSNE(sce, dimred = "PCA", perplexity = 30)
sce <- scater::runUMAP(sce, dimred = "PCA")
sce

After this, we cluster the cells using a graph-based algorithm, and find 'marker genes' for each cluster as the genes that are significantly upregulated in the cluster compared to each of the other inferred clusters. The adjusted p-values from this test, for each cluster, are added to the rowData slot of the object.

snn.gr <- scran::buildSNNGraph(sce, use.dimred = "PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)
markers <- scran::findMarkers(sce, groups = sce$Cluster,
                              test.type = "t",
                              direction = "up", pval.type = "all")
for (i in names(markers)) {
    rowData(sce)[, paste0("FDR_cluster", i)] <- 
        markers[[i]]$FDR[match(rownames(sce), 
                               rownames(markers[[i]]))]
}
sce

Finally, we assign a label to each cell, based on their individual transcriptome profiles, using the SingleR method and the Monaco immune data (https://doi.org/10.1016/j.celrep.2019.01.041) as a reference.
For each prediction, we assign the labels values to a specific colData element

library(SingleR)
library(celldex)
ref_monaco <- MonacoImmuneData()

Here we assign the cell type according to the main classification scheme (this includes r knitr::combine_words(names(table(ref_monaco$label.main))))

pred_monaco_main <- SingleR(test = sce, ref = ref_monaco, labels = ref_monaco$label.main)
table(pred_monaco_main$labels)
sce$labels_main <- pred_monaco_main$labels

We do something similar with a more fine-grained classification, this time including the cell subtypes (e.g., for r sort(unique(ref_monaco$label.main))[1], the subtypes would include r knitr::combine_words(names(table((ref_monaco$label.main), (ref_monaco$label.fine))[1, ])[which(table((ref_monaco$label.main), (ref_monaco$label.fine))[1,] > 0)]))

pred_monaco_fine <- SingleR(test = sce, ref = ref_monaco, labels = ref_monaco$label.fine)
table(pred_monaco_fine$labels)
sce$labels_fine <- pred_monaco_fine$labels

Similarly, we use the information contained in the cell ontology labels.

pred_monaco_ont <- SingleR(test = sce, ref = ref_monaco, labels = ref_monaco$label.ont)
table(pred_monaco_ont$labels)
sce$labels_ont <- pred_monaco_ont$labels

The next table shows the relationship between the coarse and fine grained assignments in the data at hand.

table(sce$labels_main,
      sce$labels_fine)

This concludes the preparation of the data. We have now a SingleCellExperiment object that contains different types of abundance values, representations in reduced dimensions, as well as a range of row (feature) and column (cell) metadata.

Cache data set

We use a function implemented in this workshop package, to cache the precomputed object in an file that we will use as a starting point in the recipes vignette.

library(iSEEWorkshopEuroBioc2020)
cache_demo_sce(sce, "pbmc3k")


iSEE/iSEEWorkshopEuroBioc2020 documentation built on Jan. 1, 2021, 3:24 a.m.