bcbio_filtered.rda object needs to be processed before run this template:

# Set seed for reproducibility
set.seed(1454944673L)
library(knitr)

opts_chunk[["set"]](
    audodep = TRUE,
    cache = TRUE,
    cache.lazy = TRUE,
    error = TRUE,
    fig.height = 8L,
    fig.retina = 2L,
    fig.width = 8L,
    message = FALSE,
    tidy = TRUE,
    warning = FALSE
)
library(ggplot2)
library(cowplot)
library(tidyverse)
library(Seurat)
rows = readRDS(file.path(params$data_dir, params$rowdata))
theme_set(theme_light(base_size = 11))
theme_update(legend.position = "bottom")

This workflow is adapted from the following sources:

To identify clusters, the following steps will be performed:

  1. Normalization and transformation of the raw gene counts per cell to account for differences in sequencing depth.
  2. Identification of high variance genes.
  3. Regression of sources of unwanted variation (e.g. number of UMIs per cell, cell cycle phase).
  4. Identification of the primary sources of heterogeneity using principal component (PC) analysis and heatmaps.
  5. Clustering cells based on significant PCs (metagenes).

Initialize Seurat

First, let's create a seurat object using the raw counts from the cells that have passed our quality control filtering parameters. Next, the raw counts are normalized using global-scaling normalization with the NormalizeData() function. This (1) normalizes the gene expression measurements for each cell by the total expression; (2) multiplies this by a scale factor (10,000 by default); and (3) log-transforms the result. Following normalization, the FindVariableGenes() function is then called, which calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. Finally, the genes are scaled and centered using the ScaleData() function.

seurat = readRDS(file.path(params$data_dir, params$seurat))

Plot variable genes

To better cluster our cells, we need to detect the genes that are most variable within our dataset. We can plot dispersion (a normalized measure of to cell-to-cell variation) as a function of average expression for each gene to identify a set of high-variance genes.

VariableGenePlot(seurat, do.text = F)

Regress out unwanted sources of variation

Your single-cell dataset likely contains "uninteresting" sources of variation. This can include technical noise, batch effects, and/or uncontrolled biological variation (e.g. cell cycle). Regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering [@Buettner2015-ur]. To mitigate the effect of these signals, Seurat constructs linear models to predict gene expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the seurat@scale.data slot, and are used for dimensionality reduction and clustering.

Cell-cycle scoring

First, we assign each cell a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase. We assign scores in the CellCycleScoring() function, which stores S and G2/M scores in seurat@meta.data, along with the predicted classification of each cell in either G2M, S or G1 phase.

PCAPlot(seurat, group.by = "Phase")

Here we are checking to see if the cells are grouping by cell cycle. If we don't see clear grouping of the cells into G1, G2M, and S clusters on the PCA plot, then it is recommended that we don't regress out cell-cycle variation. When this is the case, remove S.Score and G2M.Score from the variables to regress (vars_to_regress) in the R Markdown YAML parameters.

seurat = readRDS(file.path(params$data_dir, params$seurat_tsne))

Apply regression variables

Here we are regressing out variables of uninteresting variation, using the vars.to.regress argument in the ScaleData() function. When variables are defined in the vars.to.regress argument, Seurat regresses them individually against each gene, then rescales and centers the resulting residuals.

We generally recommend minimizing the effects of variable read count depth (nUMI).

When regressing out the effects of cell-cycle variation, include S.Score and G2M.Score in the vars.to.regress argument. Cell-cycle regression is generally recommended but should be avoided for samples containing cells undergoing differentiation.

PCAPlot(seurat, group.by = "Phase")

Linear dimensionality reduction {.tabset}

Next, we perform principal component analysis (PCA) on the scaled data with RunPCA(). By default, the genes in seurat@var.genes are used as input, but can be defined using the pc.genes argument. ProjectPCA() scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components. Though we don't use this further here, it can be used to identify markers that are strongly correlated with cellular heterogeneity, but may not have passed through variable gene selection. The results of the projected PCA can be explored by setting use.full = TRUE for PrintPCA().

PCHeatmap()

In particular, PCHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the "extreme" cells on both ends of the spectrum, which dramatically speeds plotting for large datasets (here we used 3000).

PCHeatmap(
    seurat,
    cells.use = 3000,
    do.balanced = TRUE,
    label.columns = FALSE,
    labRow = FALSE,
    pc.use = 1:params$pc_compute,
    remove.key = TRUE)

PrintPCA()

Some times is useful to look at the top 30 genes for each PC from the PCA to know which PCA are more intresting to use downstream.

projection = seurat@dr$pca@gene.loadings
lapply(1:params$pc_compute, function(pc){
    bind_rows(
        sort(projection[,pc], decreasing = T)[1:15] %>% 
            data.frame(gene = names(.), score = ., stringsAsFactors = F),
        sort(projection[,pc])[1:15] %>% 
            data.frame(gene = names(.), score = ., stringsAsFactors = F)
    ) %>% left_join(rows, by = c("gene" = "gene_id")) %>% 
        dplyr::select(gene_name, score) %>% 
        set_names(paste(paste0("PC", pc), names(.)))
}) %>% bind_cols() %>% kable()

Determine significant principal components

To overcome the extensive technical noise in any single gene for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a "metagene" that combines information across a correlated gene set. Determining how many PCs to include downstream is therefore an important step.

PC selection — identifying the true dimensionality of a dataset — is an important step for Seurat, but can be challenging/uncertain:

We're using a heuristic approach here, by calculating where the principal components start to elbow. The plots below show where we have defined the principal compoment cutoff used downstream for dimensionality reduction. This is calculated automatically as the larger value of:

  1. The point where the principal components only contribute 5% of standard deviation (horizontal line).
  2. The point where the principal components cumulatively contribute 90% of the standard deviation (veritcal line).
  3. The last point where the decrease between pcX and pcX-1 is bigger than 0.1%. (change of color)

From this 3 metrics we used the larger value of point 1 and 2, and the smaller of this and point 3.

# Allow for user-defined PCs to use, otherwise calculate using a PC elbow plot
pct = seurat@dr$pca@sdev / sum(seurat@dr$pca@sdev) * 100
cum = cumsum(pct)
co1 = which(cum > 90 & pct < 5)[1]
co2 = sort(which((pct[1:length(pct)-1] - pct[2:length(pct)]) >0.1),
           decreasing = T)[1] + 1
co = min(co1, co2)
data.frame(pct = pct, cumsum = cum, rank = 1:length(pct)) %>% 
    ggplot(aes(cumsum, pct, label = rank, color = rank>=co)) +
    geom_text() +
    geom_vline(xintercept = 90, color = "grey") +
    geom_hline(yintercept = min(pct[pct>5]), color = "grey")
pc_use <- co

We are using r length(pc_use) principal components for dimensionality reduction calculations.

Cluster the cells

Seurat uses a graph-based clustering approach, inspired by SNN-Cliq [@Xu2015-je] and PhenoGraph [@Levine2015-hr]. This approach embeds cells in a graph structure, by default using a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. As in PhenoGraph, Seurat first constructs a KNN graph based on the euclidean distance in PCA space, and refines the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard distance). To cluster the cells, it then applies modularity optimization techniques [@Blondel2008-rf], to iteratively group cells together, with the goal of optimizing the standard modularity function.

The FindClusters() function implements the procedure, and contains a resolution argument that sets the "granularity" of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.6-1.2 typically returns good results for single cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters are saved in the seurat@ident slot.

Regarding the value of the resolution argument, use a value < 1 if you want to obtain fewer clusters.

A useful feature in Seurat v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted formatted summary of the parameters that were chosen.

PrintFindClustersParams(seurat)

Run non-linear dimensional reduction (tSNE)

Seurat continues to use tSNE as a powerful tool to visualize and explore these datasets. While we no longer advise clustering directly on tSNE components, cells within the graph-based clusters determined above should co-localize on the tSNE plot. This is because the tSNE aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. As input to the tSNE, we suggest using the same PCs as input to the clustering analysis, although computing the tSNE based on scaled gene expression is also supported using the genes.use argument.

PrintTSNEParams(seurat)

Note that tSNE is not PCA! The measurement of distance in a tSNE plot is difficult to interpret, and is most helpful for the relationships of close neighbors. To better infer separation distance between the putative clusters, PCA is aswell shown.

Clusters with different resolutions {.tabset}

res_values = names(slot(seurat, "meta.data"))[grepl("res.", names(slot(seurat, "meta.data")))]
all_res = lapply(res_values,
                 function(res){
                     TSNEPlot(SetAllIdent(seurat, res),
                              do.return = TRUE,do.label = TRUE) +
                         theme(legend.position="none") +
                         ggtitle(res)
                 })
plot_grid(plotlist = all_res)
# in case you want to change it
seurat = SetAllIdent(seurat, id = params$res)

Clusters {.tabset}

tsne_label = FetchData(seurat, vars.all = c("ident", "tSNE_1", "tSNE_2"))  %>% 
    as.data.frame() %>% 
    group_by(ident) %>%
    summarise(x=mean(tSNE_1), y=mean(tSNE_2))
pca_label = FetchData(seurat, vars.all = c("ident", "PC1", "PC2"))  %>% 
    as.data.frame() %>% 
    mutate(ident = seurat@ident) %>% 
    group_by(ident) %>%
    summarise(x=mean(PC1), y=mean(PC2))

TSNE

plot_grid(
    TSNEPlot(
        seurat,
        do.label = TRUE,
        do.return = TRUE) +
        ggtitle("tSNE"),
    PCAPlot(
        seurat,
        do.label = TRUE,
        do.return = TRUE) +
        ggtitle("PCA")
)

Other variables {.tabset}

group_by = c("Phase", "sample") # Add whatever column in colData to color by other group

class_tsne_data = FetchData(seurat, vars.all = c("ident", "tSNE_1", "tSNE_2", group_by))
class_pca_data = FetchData(seurat, vars.all = c("ident", "PC1", "PC2", group_by))

lapply(group_by, function(variable) {
cat("\n\n###", variable, "\n\n")
    p = plot_grid(
        ggplot(class_tsne_data, aes(tSNE_1, tSNE_2)) +
            geom_point(aes_string(color = variable), alpha = 0.7) +
            scale_color_brewer(palette = "Set2")  +
            geom_text(data=tsne_label, aes(label=ident, x, y)),
        ggplot(class_pca_data, aes(PC1, PC2)) +
            geom_point(aes_string(color = variable), alpha = 0.7) +
            scale_color_brewer(palette = "Set2")  +
            geom_text(data=pca_label, aes(label=ident, x, y)),
        nrow = 1, align = "v"
    ) 
    print(p)
}) %>% invisible()

Cluster quality control

Let's look at the variance in the number of UMI counts (nUMI), gene detection (nGene), and the percentage of mitochondrial gene expression (mitoRatio), to see if there are any obvious cluster artefacts. We can also assess cell cycle batch effects (S.Score, G2M.Score) and any principal component bias toward individual clusters.

variables =  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
qc_data = FetchData(seurat, vars.all = c(variables, "ident", "tSNE_1", "tSNE_2"))
lapply(variables, function(qc){
    ggplot(qc_data, aes(tSNE_1, tSNE_2)) +
        geom_point(aes_string(color=qc), alpha = 0.7) +
        scale_color_gradient(guide = FALSE, low = "grey90", high = "blue")  +
        geom_text(data=tsne_label, aes(label=ident, x, y)) +
        ggtitle(qc)
}) %>% plot_grid(plotlist = .)

PCs importance

columns = c(paste0("PC", 1:params$pc_compute),
            "ident",
            "tSNE_1", "tSNE_2")
pc_data = FetchData(seurat, vars.all = columns)

lapply(paste0("PC", 1:params$pc_compute), function(pc){
    ggplot(pc_data, aes(tSNE_1, tSNE_2)) +
        geom_point(aes_string(color=pc), alpha = 0.7) +
        scale_color_gradient(guide = FALSE, low = "grey90", high = "blue")  +
        geom_text(data=tsne_label, aes(label=ident, x, y)) +
        ggtitle(pc)
}) %>% plot_grid(plotlist = .)
# Custom genes {.tabset}

library(rio)
# symbols vector with list of genes 
custom_genes = rows %>% filter(gene_name  %in% symbols) %>%
    select(gene_id, gene_name)
plot_selected = custom_genes$gene_id
names(plot_selected) = custom_genes$gene_name

tsne = FetchData(seurat, vars.all = c("tSNE_1", "tSNE_2"))
gene_data = FetchData(seurat, vars.all = custom_genes$geneID)
colnames(gene_data) = names(plot_selected)[match(plot_selected, colnames(gene_data))]
gene_data = cbind(tsne, gene_data)

lapply(custom_genes$gene_name, function(g){
    ggplot(gene_data, aes(tSNE_1, tSNE_2)) +
    geom_point(aes_string(color=g), alpha = 0.7) +
    scale_color_gradient(guide = FALSE, low = "grey90", high = "blue")  +
    geom_text(data=tsne_label, aes(label=ident, x, y)) +
    ggtitle(g)
}) %>% plot_grid(plotlist = .)
lapply(custom_genes$gene_name, function(g){
    cat("\n\n## ", g, "\n\n")
    ggplot(gene_data, aes_string("condition", g)) +
        geom_jitter(size = 0.5, alpha = 0.4) +
        geom_violin() +
        ggtitle(g) + facet_wrap(~ident)
})

Session

sessionInfo()


hbc/hbcABC documentation built on Sept. 5, 2019, 9:51 p.m.