knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(geneBasisR)
library(SingleCellExperiment)
library(tibble)
library(ggplot2)
library(ggpubr)

Introduction

geneBasis is an iterative greedy approach for gene selection. It strives a gene selection that will preserve transcriptional relationships between cells, represented as kNN-graphs.

Here we exemplify how geneBasis can be utilized for the design of targeted gene panels (just main steps; for the extended workflow, covering various aspects of gene selection process, please see tutorials here: https://github.com/MarioniLab/geneBasisR).

Load data

We will be working with mouse embryo, E8.5. Initial filtering for uninteresting genes was already performed.

data("sce_mouseEmbryo", package = "geneBasisR")

# Fetch meta-data and umap coordinates, merge together
meta = as.data.frame(colData(sce_mouseEmbryo))
umaps = as.data.frame(reducedDim(sce_mouseEmbryo, "UMAP"))
umaps = rownames_to_column(umaps, var = "cell")
meta = merge(meta, umaps)

# Fetch colour scheme for celltypes from meta-data
celltype_colors_df = unique(meta[, c("celltype" , "colour")])
celltype_colors = c(paste0("#",as.character(celltype_colors_df$colour)))
names(celltype_colors) = as.character(celltype_colors_df$celltype)

Cell type composition for mouse embryo

Let's see which cell types (and in which quantities) we observe in mouse embryo.

# celltype composition
tab = as.data.frame(table(sce_mouseEmbryo$celltype))
colnames(tab) = c("celltype" , "n")
tab = tab[order(tab$n) , ]
tab$celltype = factor(tab$celltype , levels = tab$celltype)
p1 = ggplot(tab , aes(x = celltype , y = log2(n) , fill = celltype)) +
  geom_bar(stat = "identity" , position = "dodge") +
  scale_fill_manual(values = celltype_colors) +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  labs(y = "log2(# cells)", x = "Cell type")

# UMAP
p2 = ggplot(meta , aes(x = x , y = y , col = celltype)) +
  geom_point(size=1,alpha = .9) +
  scale_color_manual(values = celltype_colors) +
  theme_classic() +
  labs(x = "UMAP-1" , y = "UMAP-2") +
  ggtitle("UMAP")

# combine
p = ggarrange(p1,p2, common.legend = T)
p

Select gene panel of size 10

To select gene panel, use gene_search.

The main arguments of gene_search include:

n_genes_total = 10
genes_stat = gene_search(sce_mouseEmbryo , genes_base = c("Hba-x", "Acta2", "Ttr", "Crabp1" , "Hoxaas3"), n_genes_total = n_genes_total, batch = "sample", verbose = T)
genes = genes_stat$gene

Evaluation of the gene panel

Cell type confusion matrix

Cell type mapping for a single gene panel can be computed separately using get_celltype_mapping.

We also included visualization function that returns heatmap for the confusion matrix: plot_mapping_heatmap.

celltype_mapping = get_celltype_mapping(sce_mouseEmbryo , genes.selection = genes, batch = "sample", return.stat = F)
p = plot_mapping_heatmap(celltype_mapping$mapping, title = "Cell type confusion matrix")
p

Cell score

Henceforth, to be less wordy, we refer to cell neighborhood preservation score as cell score; gene prediction score as gene score.

We estimate the quality of the panel using evaluate_library. In this example, we will only compute cell score (see extended tutorials for more in-depth analysis).

stat = evaluate_library(sce_mouseEmbryo, genes, genes.all = rownames(sce_mouseEmbryo), batch = "sample", 
                        library.size_type = "single", celltype.id = "celltype",
                        return.cell_score_stat = T, return.gene_score_stat = F, return.celltype_stat = F, verbose = FALSE)

Cell score - distribution per cell type

Let's assess whether certain cell types are consistently 'under-preserved' by looking at the distribution of cell scores across cell types.

cell_score_stat = stat$cell_score_stat[stat$cell_score_stat$n_genes == n_genes_total , ] 
cell_score_stat = merge(cell_score_stat, meta) 

p = ggplot(cell_score_stat , aes(x = celltype , y = cell_score, fill = celltype)) + 
  geom_boxplot() + 
  scale_fill_manual(values = celltype_colors) + 
  theme_classic() + 
  labs(y = "Cell neighborhood preservation score" , x = "# genes") + 
  theme(legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 
p 

We see that PGCs and blood progenitors in average have lower scores. This is consistent with observation from Fig. 2.

To note, it is not particularly surprising since PGC and blood progenitors are very rare cell types, and our panel size is only 10. Therefore we suggest that more genes should be included into the panel to successfully resolve these cell types.

Characterisation of the selected genes

Let's actually look into what genes we select.

What are the genes we are selecting - visual inspection

Let's plot UMAPs, colored by expression of the selected genes, to have an intuitive grasp of what kind of genes we select. UMAPs can be easily plotted with provided function plot_umaps_w_counts. Note that we require UMAP coordinates to be stored in reducedDim(sce) under the name 'UMAP', and colnames for the coordinates should be 'x' and 'y'.

If UMAPs are not provided, you can compute them using get_umap_coordinates and then store it in sce.

p = plot_umaps_w_counts(sce_mouseEmbryo, genes)
p

Overall we see that most of the genes are expressed in some transcriptional regions but not others (i.e. like cell type markers behavior).

We also select cell state marker - cell cycle gene Ube2c.

UMAP-representation on how well data is preserved for the selected panel

Let's compare UMAP plots when using whole transcriptome and using the selected panel (UMAP-coordinates can be calculated using get_umap_coordinates).

For consistency, e will re-calculate UMAP coordinates for the whole transcriptome as well.

umaps_all = get_umap_coordinates(sce_mouseEmbryo, genes = rownames(sce_mouseEmbryo), batch = "sample", nPC = 50)
colnames(umaps_all) = c("cell", "x_all", "y_all")

umaps_selection = get_umap_coordinates(sce_mouseEmbryo, genes = genes, batch = "sample")
colnames(umaps_selection) = c("cell", "x_selection", "y_selection")

umaps_combined = merge(umaps_all, umaps_selection)

meta = as.data.frame(colData(sce_mouseEmbryo))
meta = merge(meta, umaps_combined)

# plot
p1 = ggplot(meta , aes(x = x_all , y = y_all , col = celltype)) +
  geom_point(size=1,alpha = .9) +
  scale_color_manual(values = celltype_colors) +
  theme_classic() +
  labs(x = "UMAP-1" , y = "UMAP-2") +
  ggtitle("All genes")
p2 = ggplot(meta , aes(x = x_selection , y = y_selection , col = celltype)) +
  geom_point(size=1,alpha = .9) +
  scale_color_manual(values = celltype_colors) +
  theme_classic() +
  labs(x = "UMAP-1" , y = "UMAP-2") +
  ggtitle("All genes")
p = ggarrange(p1,p2,common.legend = T)
p

Average expression across cell types and co-expression

Let's look a bit more systematically into how cell type specific are expressions of the selected genes.

p = plot_expression_heatmap(sce_mouseEmbryo, genes = genes, value.type = "mean") 
p 

Also, let's look in overall co-expression between selected genes.

p = plot_coexpression(sce_mouseEmbryo, genes = genes) 
p 

Estimation of redundancy

Finally, let's estimate the redundancy of the gene panel on gene/celltype level: for each gene, we temporarily remove it from the panel and compare the accuracy of cell type mappings with and without removed gene (per cell type).

The function to get this stat is get_redundancy_stat. We also provide the function that will plot the results as a heatmap: plot_redundancy_stat. Red corresponds to cases where cell type mapping accuracy drops with the removal of the gene, green - where cell type mapping accuracy increases with the removal of the gene (that can occur for cell type that were not well mapped to begin with).

redundancy_stat = get_redundancy_stat(sce_mouseEmbryo, genes, genes_to_assess = genes, batch = "sample") 
p = plot_redundancy_stat(redundancy_stat) 
p 

From this analysis, we can conclude the particular relevance of some genes for some cell types e.g. Plvap for Endothelium, Acta2 for Cardiomyocytes, etc.

This line of analysis also allows to estimate overall redundancy of the panel which is useful when planning -FISH experiments.

Session Info

sessionInfo()


MarioniLab/geneBasisR documentation built on June 30, 2023, 2:04 p.m.