Guided tutorial

The easiest way to use scigenex is to perform the following steps using the Seurat R package:

The resulting object can be used as input to SciGeneX. You can also provide a normalized matrix as input.

The dataset

For this tutorial, we'll be using a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset available from the 10x Genomics website. This dataset contains 2700 individual cells sequenced on the Illumina NextSeq 500 and can be downloaded from 10X Genomics web site or via the SeuratData library.

Preparing the pbmc3k dataset

In this step, we'll carry out the classic pre-processing steps of the tutorial. Please refer to this tutorial for more information. If you have already pre-processed your data with Seurat, or if you have a normalized count matrix as input, you can skip this step.

suppressMessages(suppressWarnings(library(ggplot2, quietly = TRUE)))
suppressMessages(suppressWarnings(library(patchwork, quietly = TRUE)))
suppressMessages(suppressWarnings(library(Seurat, quietly = TRUE)))
suppressMessages(suppressWarnings(library(SeuratData, quietly = TRUE)))
suppressMessages(suppressWarnings(InstallData("pbmc3k")))
suppressMessages(suppressWarnings(data("pbmc3k")))
library(ggplot2)
library(patchwork)
library(Seurat, quietly = TRUE)
library(SeuratData)

InstallData("pbmc3k")
data("pbmc3k")

We next run the classical steps of the seurat pipeline. For more information, you can check Seurat website here.

# Quality control
pbmc3k[["percent.mt"]] <- PercentageFeatureSet(pbmc3k, pattern = "^MT-")
pbmc3k <- subset(pbmc3k, subset = percent.mt < 5 & nFeature_RNA > 200)

# Normalizing
pbmc3k <- NormalizeData(pbmc3k)

# Identification of highly variable genes
pbmc3k <- FindVariableFeatures(pbmc3k, selection.method = "vst", nfeatures = 2000)

# Scaling data
pbmc3k <- ScaleData(pbmc3k, features = rownames(pbmc3k), verbose = FALSE)

# Perform principal component analysis
pbmc3k <- RunPCA(pbmc3k, features = VariableFeatures(object = pbmc3k), verbose = FALSE)

# Cell clustering
pbmc3k <- FindNeighbors(pbmc3k, dims = 1:10, verbose = FALSE)
pbmc3k <- FindClusters(pbmc3k, resolution = 0.5, verbose = FALSE)

# Dimension reduction
pbmc3k <-suppressWarnings(RunUMAP(pbmc3k, dims = 1:10, verbose = FALSE))
DimPlot(pbmc3k, reduction = "umap")

Extracting gene modules using SciGeneX

In this section, we use the previously generated Seurat object as input to run the main SciGeneX commands. This command executes the main algorithm which will:

First, we'll load the SciGeneX library. To limit the verbosity of the SciGeneX functions, we'll set the verbosity level to zero (which will disable information and debug messages).

suppressMessages(suppressWarnings(library(scigenex)))
scigenex::set_verbosity(0)
library(scigenex)
scigenex::set_verbosity(0)

Then we call successively the select_genes() function which will select informative genes (i.e co-regulated), then the gene_clustering() function which will call MCL and partition the dataset into gene modules.

# Select informative genes
pbmc_scigenex <- select_genes(pbmc3k,
                              k = 50,
                              distance_method = "pearson",
                              which_slot = "data",
                              row_sum=2)

# Run MCL
pbmc_scigenex <- gene_clustering(pbmc_scigenex,
                                 s = 5,
                                 threads = 2,
                                 inflation = 2)

The object produced is a ClusterSet objet that is a S4 object that is intented to store gene modules.

isS4(pbmc_scigenex)
pbmc_scigenex

There are various methods associated with the ClusterSet objects.

show_methods()

The current object contains r nrow(pbmc_scigenex) informative genes, r ncol(pbmc_scigenex) samples and r nclust(pbmc_scigenex) gene modules.

nrow(pbmc_scigenex)
ncol(pbmc_scigenex)
nclust(pbmc_scigenex)

At this stage, several modules need to be filtered, as many of them may be singletons. Interestingly, the ClusterSet class implements the indexing operator/function ("["). The first argument/dimension of the indexing function corresponds to the cluster stored in the object. The second dimension corresponds to the cell/spot names. As an example, we can simply store gene modules whose size (*i.e. number of genes) is greater than 7 using the following code. The result is an object containing gene modules r nclust(pbmc_scigenex[clust_size(pbmc_scigenex) > 7, ]).

pbmc_scigenex <- pbmc_scigenex[clust_size(pbmc_scigenex) > 7, ]
nclust(pbmc_scigenex)

It may also be important to filter out gene based on dispersion. Several parameters can be computed for each cluster using the cluster_stats()function.

plot_cluster_stats(cluster_stats(pbmc_scigenex)) + 
 ggplot2::theme(axis.text.y = ggplot2::element_blank(),
                axis.text.x = element_text(angle=45, vjust = 0.5),
                panel.grid = element_blank()) 

Here we will select gene modules based on standard deviation (> 0.1) and rename the cluster (from 1 to the number of clusters):

pbmc_scigenex <- pbmc_scigenex[cluster_stats(pbmc_scigenex)$sd_total > 0.1, ] 
pbmc_scigenex <- rename_clust(pbmc_scigenex)
nclust(pbmc_scigenex)

Then we check the statistics again.

plot_cluster_stats(cluster_stats(pbmc_scigenex)) + 
 ggplot2::theme(axis.text.y = ggplot2::element_blank(),
                axis.text.x = element_text(angle=45, vjust = 0.5),
                panel.grid = element_blank()) 

Clusters of genes are stored in the gene_clusters slot. One can access the gene names from a cluster using the get_genes() command. By default, all genes are returned.

# Extract gene names from the 5th gene cluster
genes_module_5 <- get_genes(pbmc_scigenex, cluster = 5)
head(genes_module_5)

One can also access the gene to cluster mapping using get_genes().

head(gene_cluster(pbmc_scigenex))
tail(gene_cluster(pbmc_scigenex))

Heatmap visualization

Visualization of specific genetic modules

Visualizing a heatmap containing all cells and modules can be time-consuming and require significant memory resources. A first alternative is to examine gene modules individually or a subset of modules. Gene modules can be visualized using the plot_heatmap() the plot_ggheatmap() functions. The former is primarily intended to provide an interactive visualization based on the iheatmapr library, and allows easy browsing of results and zooming in on particular regions of the heatmap. This second solution leads to a diagram that is more easily customized, as it is based on the ggplot framework. Here, we use plot_ggheatmap() to view the first 4 clusters. Note that here, we choose to order the columns/cells on the results of Seurat::FindClusters.

plot_ggheatmap(pbmc_scigenex[1:4, ], 
               use_top_genes = FALSE,
               ident=Idents(pbmc3k)) + ggtitle("Cluster 1 to 4") 

Heatmap of representative genes

However, an alternative is to extract the most representative genes from each group. This can be achieved using the top_genes() function. This function stores the identifiers of these representative genes in the top_genes slot of the ClusterSet object. The get_genes() function is used to access the top_genes slot.

pbmc_scigenex <- top_genes(pbmc_scigenex)
genes_cl5_top <- get_genes(pbmc_scigenex, cluster = 5, top = TRUE)
genes_cl5_top

Both plot_heatmap() and plot_ggheatmap() support the use_top_genesargument:

plot_ggheatmap(pbmc_scigenex, 
               use_top_genes = TRUE,
               ident=Seurat::Idents(pbmc3k)) + ggtitle("All clusters (top genes)") +
               theme(strip.text.y = element_text(size=4))

Interactive heatmap

A very interesting feature of SciGeneX is its ability to display gene expression levels in cells/spots using interactive heatmaps. With this function, the user can interactively evaluate expression levels in selected cells or groups, or over the whole dataset. However, it is generally advisable, when using all clusters, to restrict the analysis using top_genes=TRUE. Here, we'll display expression levels in cells from clusters 3 to 5.

plot_heatmap(pbmc_scigenex[3:5, ], 
               use_top_genes = TRUE,
               cell_clusters =Seurat::Idents(pbmc3k))

Exporting modules

Gene modules can be exported using the cluster_set_to_xls(). This function will create a Excel workbook that will contain the mapping from genes to modules.

tmp_file <- file.path(tempdir(), "pbmc_scigenex_out.xls")
out <- suppressWarnings(file.remove(tmp_file))
tmp_file <- file.path(tempdir(), "pbmc_scigenex_out.xls")
cluster_set_to_xls(object=pbmc_scigenex, file_path=tmp_file)

Functional enrichment analysis

Functional enrichment analysis can be performed for each gene module using the enrich_go() function. Enrichments can be displayed using the plot_clust_enrichments() function.

# Functional enrichment analysis
pbmc_scigenex <- enrich_go(pbmc_scigenex, specie = "Hsapiens", ontology = "BP")
plot_clust_enrichments(pbmc_scigenex, gradient_palette=colors_for_gradient("Je1"), 
                       floor=50,
                       nb_go_term = 2) + 
  theme(axis.text.y = element_text(size=4))

Mapping cell populations markers onto the gene modules

Given a set of markers, the plot_markers_to_clusters() function can be used to map cell type markers to gene modules. This function will display jaccard and hypergeometric statistics. Here we will use the markers from the sctype.app database.

library(dplyr)
library(tidyr)
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(tidyr)))
sctype <- "https://zenodo.org/record/8269433/files/sctype.app.tsv"
marker_table <- read.table(sctype, head=TRUE, sep="\t")
marker_table <- marker_table %>% 
                filter(Tissue == "Immune system") %>% 
                filter(!grepl('Pro-|Pre-|HSC|precursor|Progenitor', Cell.type)) %>%
                separate_rows(Marker_genes, convert = TRUE)

markers <- split(marker_table$Marker_genes, 
                 marker_table$Cell.type)
plot_markers_to_clusters(pbmc_scigenex, 
                         markers=markers, background = rownames(pbmc3k))

Working with spatial transcriptomics datasets

The scigenex package offers a certain number of functions dedicated to spatial transcriptomic data analysis. At the moment these functions have been mainly developed to analyse VISIUM technology (10X Genomics).

Searching for gene modules

Pre-process spatial transcriptomics data

Here we will use the stxBrain dataset as example. This dataset is available from SeuratData library and contains mouse brain spatial expression over in several datasets. Two datasets are for the posterior region, two for the anterior. We will use the anterior1 dataset that we will first pre-process using Seurat.

suppressWarnings(SeuratData::InstallData("stxBrain"))
brain1 <- LoadData("stxBrain", type = "anterior1")
brain1 <- NormalizeData(brain1, 
                        normalization.method = "LogNormalize",
                        verbose = FALSE)
brain1 <- ScaleData(brain1, verbose = FALSE)
brain1 <- FindVariableFeatures(brain1, verbose = FALSE)
brain1 <- RunPCA(brain1, assay = "Spatial", verbose = FALSE)
brain1 <- FindNeighbors(brain1, reduction = "pca", dims = 1:20, verbose = FALSE)
brain1 <- FindClusters(brain1, verbose = FALSE)
brain1 <- RunUMAP(brain1, reduction = "pca", dims = 1:20, verbose = FALSE)

DimPlot(brain1, reduction = "umap", label = TRUE)
SpatialDimPlot(brain1, label = TRUE, label.size = 3, pt.size.factor = 1.4)

Calling scigenex

Here we will used the alternative clustering methods proposed by scigenex ("reciprocal_neighborhood"). It is advise to increase slightly k (here k will be set to 80). After call to gene_clustering() we apply filtering on gene modules based on cluster size (min number of genes 7) and standard deviation (gene module sd > 0.1).

res_brain <- select_genes(data=brain1,
                    k=80,
                    distance_method="pearson",
                    row_sum = 5)

gc_brain <- gene_clustering(res_brain, 
                            method = "reciprocal_neighborhood", 
                            inflation = 2, 
                            threads = 4)

gcs_brain <- filter_cluster_size(gc_brain, min_cluster_size = 7)
df <- cluster_stats(gcs_brain) 
gcss_brain <- gcs_brain[df$sd_total > 0.1, ]
gcss_brain <- rename_clust(gcss_brain)
length(row_names(gcss_brain))
nclust(gcss_brain)

Interestingly, Scigenex algorithm is able to retrieve nclust(gcss_brain) gene modules. This is most probably reminiscent of cell complexity but also of numerous molecular pathways that are differentially activated across the organ and unanticipated complexity.

Visualizing corresponding heatmap

We then may display the corresponding heatmap using plot_ggheatmap().

gcss_brain <- top_genes(gcss_brain)
plot_ggheatmap(gcss_brain, 
               use_top_genes = TRUE,
               ident=Idents(brain1)) + ggtitle("All clusters (top genes)") +
               theme(strip.text.y = element_text(size=3))

Interactive heatmap

Again, as in the context of scRNA-seq, we may also use the powerful plot_heatmap() fonction which allows interactive exploration of all or specific clusters. Here we look at cluster 6 to 9.

plot_heatmap(gcss_brain[6:9,], 
             use_top_genes = TRUE, 
             cell_clusters = Seurat::Idents(brain1))
# Try selecting a subset of columns/rows
# The 'home' button can be used to reset
# the heatmap

Visualizing topological clusters

The scigenex library implements the plot_spatial() function to display topological information. In addition to the signal, specific regions can be highlighted using a hull that can be created using the display_hull() function. Here will also add a hull around seurat cluster 0 and 2. We will then display signal for "seurat_clusters" metadata.

hull_white <- display_hull(getFlippedTissueCoordinates(brain1),
                           ident = ifelse(Idents(brain1) %in% c(0, 2), 
                                          1, 0),
                           color = "white", 
                           size_x=4, size_y=3, 
                           hull_type = "wall", 
                           size = 0.5, 
                           step_x = 2.6, 
                           step_y=2.4, 
                           delta = 1.5)

plot_spatial(seurat_obj = brain1, 
             metadata = "seurat_clusters", 
             pt_size=2.5, coord_flip = T) + hull_white

We may also want to visualize a specific gene. Here we will look at the pattern of "Hpca" which is part of the cluster r setNames(which_clust( gcss_brain, "Hpca"), NULL) detected by Scigenex.

"Hpca" %in% gcss_brain
Hpca_clust <- which_clust( gcss_brain, "Hpca")
Hpca_clust

To this aim we will use the plot_spatial() function.

plot_spatial(seurat_obj = brain1, 
             gene_name = "Hpca",  
             pt_size=2.5, coord_flip = T) + hull_white  +
  ggtitle("Hpca expression pattern.")

Note that cluster r setNames(which_clust( gcss_brain, "Hpca"), NULL) also contains several genes related to Regulator Of G Protein family. This can be checked with a regular expression using the grep_clust() function:

grep_clust(gcss_brain[Hpca_clust, ], "^Rgs")

To visualize the topological distribution of signals in each cluster, we'll (i) extract the gene_clusters slot from the gcss object, (ii) calculate the module score using Seurat's AddModuleScore() function and (iii) store the results in the seurat object. Note that, for each gene module, the signal will be scaled from 0 to 1, allowing us to have a common legend for all plots.

brain1 <- AddModuleScore(brain1, features = gcss_brain@gene_clusters)

for(i in 1:nclust(gcss_brain)){ # Normalizing module scores
  tmp <- brain1[[paste0("Cluster", i, sep="")]] 
  max_tmp <- max(tmp)
  min_tmp <- min(tmp)
  brain1[[paste0("Cluster", i, sep="")]]  <- (tmp[,1] - min(tmp))/(max_tmp - min_tmp)
}

The topological profile of cluster r Hpca_clust (that contains Hpca, a strong hippocampus marker) is the following:

plot_spatial(seurat_obj = brain1, 
             metadata = paste("Cluster", setNames(Hpca_clust, NULL), sep=""), 
             pt_size=2.5, coord_flip = T) + hull_white

We can easily see that the Hpca-containing gene module pattern is very similar to the Hpca pattern. However, the analysis suggests a more complex tissue architecture than expected, as the Hpca signal extends beyond Seurat's cluster 0.

We will then display the topological signal of all clusters using the plot_spatial_panel() function.

p <- plot_spatial_panel(brain1, 
                        metadata = paste("Cluster", 1:nclust(gcss_brain), 
                                         sep=""), 
                   ncol_layout = 3, 
                   pt_size=0.7, 
                   guide='collect',
                   stroke = 0, size_title = 5, 
                   face_title = 'plain', 
                   barwidth = 0.25, barheight = 1.5, 
                   coord_flip=T) 
print(p + guide_area())

Comparing FindAllMarkers() and Scigenex gene modules

It may be interesting to compare the clusters obtained using an unsupervised approach (SciGeneX) with those obtained from Seurat::FindAllMarkers(), which searches for genes differentially expressed between cell populations deduced by Seurat::FindClusters().

As shown using the plot_cmp_genesets() function, Seurat::FindAllMarkers() tends to find markers that are not so specific to cell populations. Thus this markers are shared between gene modules.

seurat_brain_mk <- Seurat::FindAllMarkers(brain1)
seurat_brain_mk<- split(seurat_brain_mk$gene, seurat_brain_mk$cluster)
plot_cmp_genesets(seurat_brain_mk, seurat_brain_mk, layout = "square", transform="-log10" ) + 
  xlab("Seurat::FindAllMarkers()") + 
  ylab("Seurat::FindAllMarkers")

In contrast, gene partitioning in Scigenex is a hard clustering method (elements are not shared between clusters).

plot_cmp_genesets(gcss_brain@gene_clusters, gcss_brain@gene_clusters, layout = "square", transform="-log10") + 
  xlab("Scigenex") + 
  ylab("Scigenex") 

When comparing both one can see that, although some overlaps exist between gene clusters obtained from both appraoches, they are capturing different information that are probably complementary.

plot_cmp_genesets(gcss_brain@gene_clusters, seurat_brain_mk, layout = "square", transform="-log10" )  +
  ylab("Seurat::FindAllMarkers()") + 
  xlab("Scigenex")

Creating a report

A report can be created using the cluster_set_report() function. The arguments to this function are the processed clusterSet and the corresponding processed Seurat object. Ideally, the clusterSet object should contain functional annotations (see enrich_go()). Currently, the process of creating a report can be quite time consuming It can also produce heavy html files which may take some time to load in the web browser.

# Uncomment to prepare the report.
# gcss_brain <- enrich_go(gcss_brain, species = "Mmusculus", ontology = "BP")
# cluster_set_report(clusterset_object = gcss_brain, seurat_object = brain1)

Session info

sessionInfo()


dputhier/scigenex documentation built on May 31, 2024, 8:59 a.m.