In FloWuenne/scFunctions: Functions for single cell data analysis

knitr::opts_chunk$set(echo = TRUE)

library(scFunctions)
library(tidyverse)

Introduction

This tutorial will describe how to use the functions implemented in this package to further process the output from a typical run of the SCENIC pipeline. This tutorial assumes that you have processed your data up until the third step of the pipeline. The following data will be required for completely running this tutorial:

regulonAUC - The regulon activity scores in matrix format (output from file 3.4_regulonAUC.Rds)
cell classifications for your cell - For example using a metadata data frame from a seurat object

We provide a small test data set as part of this package, which can be found in ./example_data to help test the scripts and analysis and get familiar with the different data formats and plots. The example_data is from Wuennemann et al. and represents the AUC values for heart cells from E14.5 embryonic hearts.

If any of the defintions or terms in this tutorial are unclear, please visit the SCENIC FAQ page and see whether your question is answered there already:

SCENIC FAQ

Installation

You can easily install the package with the following command:

library(devtools)
install_github("FloWuenne/scFunctions")

Regulon analysis

Many of the analysis concepts and statistics that are calculated in this tutorial have been derived from a publication by Suo et al. (2018) in Cell Reports. In their publication, they used a modified version of SCENIC to call regulons for the Mouse Cell Atlas dataset.

Determine AUC thresholds

In the original implemntation of SCENIC, the authors included a function to determine thresholds for the AUC activity via the function: AUCell_exploreThresholds(). In this package, we wrote a very simple function that determines thresholds based on k-means clustering on the AUC distribution. This function performs comparable to the original implementation but is much faster. Please keep in mind however, that this function might not perform very well for setting thresholds for non-bimodal distribution, which are quite often observed for regulons. We still advise to manually check and adjust the thresholds prior to binarization of regulons!

Let us use the regulons AUC values to determine thresholds using our k-means function.

regulonAUC <- readRDS("../example_data/regulonAUC_subset.Rds")
kmeans_thresholds <- auc_thresh_kmeans(regulonAUC)

The thresholds are saved in list format where each regulon is the name of the list and the AUC threshold is the value of the list.

head(kmeans_thresholds)

While we use a sample k-means clustering approache with 2 clusters here to determine thresholds, there are obviously more sophisticated approaches than this to determine thresholds in the AUC distribution. Feel free to develop your own approaches to determine optimal thresholds to binarize regulons. I would be excited to hear back if you developed your own function to perform this task!

Binarize regulons using thresholds

Now that we have our thresholds, it is time to binarize the regulons using these thresholds.

binary_regulons <- binarize_regulons(regulonAUC,kmeans_thresholds)

Let's take a look at the first regulon in the binary regulon list.

head(binary_regulons$`Ybx1_extended (738g)`)

Next, we have to reformat the binary regulons into a big data frame that contains all of the binary regulons so that we can use them to calculate RRS scores.

library(tidyverse)

joined_bin_reg <- binary_regulons %>%
    reduce(left_join,by="cells")

rownames(joined_bin_reg) <- joined_bin_reg$cells
joined_bin_reg <- joined_bin_reg[2:ncol(joined_bin_reg)]

binary_regulons_trans <- as.matrix(t(joined_bin_reg))

Let's check that the data table is formatted correctly before proceeding:

binary_regulons_trans[1:4,1:3]

Calculate Regulon Specificity Score (RSS)

We now want to use the binary regulon activity together with the cell assignments to see how specific each predicted regulon is for each cell type. We can do this by calculating a regulon specificity score (RSS) which is based on the Jensen-Shannon divergence, a measure of the similarity between two probability distributions. Basically for the calculation of the RSS, we will calculate the Jensen-Shannon divergence between each vector of binary regulon activity overlaps with the assignment of cells to a specific cell type.

First, we need to load the cell assignments as a data frame. This data frame needs to have cell names that correspond with the binary regulon data frame as rownames and contain a column labeled "cell_type", which contains the assignments for all cells. For convenience, you can use the metadata table from a correspondin Seurat object, just make sure that you add a column labeled "cell_type".

metadata_sub <- readRDS("../example_data/metadata_sub.Rds")

head(metadata_sub)

Now that we are ready to calculate the RSS for all regulons over all cell types.

rrs_df <- calculate_rrs(metadata_sub,
              binary_regulons = binary_regulons_trans)

The output is a data frame with a RSS score for each regulon - cell type combination.

head(rrs_df)

We can visualize the RSS by performing ranking on the RSS scores with the most specific regulons ranking the highest per cell type. I have included a function (plot_rrs_ranking) to easily plot an RSS ranking plot from this data frame. The function has a couple of options, most of which are cosmetic. Importantly, you can either plot the RSS ranking for a cell type of interest or you can set cell_type = "all" to plot the RSS over all cell types. plot_extended determines whether you would like to plot the high confidence regulons only or if you want to plot the regulons named _extended, which also contain genes only based on motif prediction based on similarity.

plot_rrs_ranking(rrs_df,
                 "RMPH",
                 ggrepel_force = 1,
                 ggrepel_point_padding = 0.2,
                 top_genes = 4,
                 plot_extended = FALSE)

We can also easily visualize all regulons over all cell types using heatmaps. Let's first investigate the distribution of RSS over all cell types.

library(ggridges)
rrs_df_nona <- subset(rrs_df,RSS > 0)
ggplot(rrs_df_nona,aes(RSS,cell_type, fill = cell_type)) +
  geom_density_ridges(scale = 5, alpha = 0.75) +
  geom_vline(xintercept = 0.1) +
  theme(legend.position = "none")

The RSS distribution clearly shows that the RSS is highly dependent upon the cell type we are investigating. As we can see, resident macrophages show very high and specific RSS, while other cell types for which more similar cell types exist in the dataset, like cardiomyocytes show less specificty for the regulons. In this small toy example, it seems that ~ 0.05 - 0.1 will capture specific regulons for most cell types. For highlighting purposes, we are gonna filter with 0.4 to be able to easily visualize the result in a heatmap.

rrs_df_wide <- rrs_df %>%
  spread(cell_type,RSS)

rownames(rrs_df_wide) <- rrs_df_wide$regulon 
rrs_df_wide <- rrs_df_wide[,2:ncol(rrs_df_wide)]

## Subset all regulons that don't have at least an RSS of 0.7 for one cell type
rrs_df_wide_specific <- rrs_df_wide[apply(rrs_df_wide,MARGIN = 1 ,FUN =  function(x) any(x > 0.4)),]

We can then visualize the regulons that show an RSS over the defined threshold in this example using heatmapply, a heatmap library using plotly.

library(heatmaply)
heatmaply(rrs_df_wide_specific)

This concludes the section about RSS calculations.

Calculate conecction specificity index (CSI) for all regulons

The final statistics that we want to calculate is the connection specificty index. The CSI is a major of connectedness between the different regulons. Regulons that share high CSI likely are co-regulating downstream genes and are together responsible for cell function. You can read more about the theoretical details of CSI here:

GAIN

We can calculate the CSI scores for all regulon pairs based on the AUCs matrix for all regulons. Again, this function has a switch to either select high confidence regulons or run the CSI calculation only on _extended regulons (calc_extended = TRUE). Here we choose to calculate CSI only for high confidence regulons.

regulons_csi <- calculate_csi(regulonAUC,
                              calc_extended = FALSE)

Once we have calculated the CSI values for all regulon pairs, we can visualize the regulon modules using the function plot_csi_modules, which will create a heatmap of the CSI values. The function has an argument that lets you change the number of clusters the heatmap is divided into via nclust.

plot_csi_modules(regulons_csi,
                 nclust = 10,
                 font_size_regulons = 8)

Finally, we can also export the CSI cluster modules by running the following code with the same cluster number we used on to create the heatmap.

csi_csi_wide <- regulons_csi %>%
    spread(regulon_2,CSI)

  future_rownames <- csi_csi_wide$regulon_1
  csi_csi_wide <- as.matrix(csi_csi_wide[,2:ncol(csi_csi_wide)])
  rownames(csi_csi_wide) <- future_rownames

regulons_hclust <- hclust(dist(csi_csi_wide,method = "euclidean"))

clusters <- cutree(regulons_hclust,k= 10)
clusters_df <- data.frame("regulon" = names(clusters),
                          "csi_cluster" = clusters)

Let's get some statistics about the different clusters, for example, how many regulons are in each cluster and how many genes are in each of the regulons to determine whether there are some clusters that feature larger regulons.

# Check how many regulons are in each cluster
clusters_df_stats <- clusters_df %>%
  group_by(csi_cluster) %>%
  mutate("regulon" = as.character(regulon)) %>%
  tally()

ggplot(clusters_df_stats,aes(as.factor(csi_cluster),n,fill=as.factor(csi_cluster))) +
  geom_bar(color= "black",stat="identity") +
  theme(legend.position="none") +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "HC clusters",
       y = "# Regulons")

## Check average regulon size per cluster
clusters_df_regsizes <- clusters_df %>%
  separate(regulon, into = c("regulon_name","regulon_size"), sep=" ") %>%
  mutate("regulon_size" = gsub("\\(","",regulon_size)) %>%
  mutate("regulon_size" = gsub("\\g)","",regulon_size)) %>%
  mutate("regulon_size" = as.numeric(regulon_size))

ggplot(clusters_df_regsizes,aes(log10(regulon_size),as.factor(csi_cluster),fill=as.factor(csi_cluster))) + 
  geom_density_ridges() +
  scale_fill_brewer(palette = "Set3") +
  theme(legend.position = "none")

clusters_df_regsizes_summary <- clusters_df_regsizes %>%
  group_by(csi_cluster) %>%
  summarise("mean_regulon_size" = mean(regulon_size),
            "median_regulon_size" = median(regulon_size),
            "sd_regulon_size" = sd(regulon_size))

library(ggrepel)
## Plot correlation between number of regulons and regulon size
clusters_meta <-  full_join(clusters_df_stats,clusters_df_regsizes_summary,by="csi_cluster")
ggplot(clusters_meta,aes(n,log10(median_regulon_size),label=as.factor(csi_cluster),fill=as.factor(csi_cluster))) +
  geom_point(color= "black",pch= 21) +
  geom_label_repel() +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set3")

In our toy example it seems that there is some robust correlation between the number of regulons in a module and the size of the regulons in that module. Modules with more regulons generally also seem to contain larger modules.

Finally, let's calculate activity scores for each csi module based on the specificity scores of all the regulons in that module. For this, we will load the AUC matrix again and calculate the mean of AUC fr each regulon per cell type and add up these scores per module.

csi_cluster_activity_wide <- calc_csi_module_activity(clusters_df,
                                     regulonAUC,
                                     metadata_sub)

  pheatmap(csi_cluster_activity_wide,
           show_colnames = TRUE,
           color = viridis(n = 10),
           cluster_cols = TRUE,
           cluster_rows = TRUE,
           clustering_distance_rows = "euclidean",
           clustering_distance_cols = "euclidean")

This concludes the tutorial for the downstream analysis of SCENIC results. If you have any questions, please feel free to write me a mail or leave an issue on the github page. The functions written in this package were mainly written for own use and thus some dependencies and other details might not be fully defined. If you find any large bugs or issues, please let me know!

Enjoy the analysis of your SCENIC data!

Complete pipeline

For convenience, if you want to run the entire pipeline on one of your datasets, here is all the code needed for this in one chunk. Please be aware that the execution of all code can take some time, based on your dataset size. For a dataset with ~ 14 cell types and 14k cells, the entire pipelines takes 5 to 10 minutes!

## Will run complete pipeline, might take a while depending on your dataset size!
library(scFunctions)
library(tidyverse)

regulonAUC <- readRDS("../example_data/regulonAUC_subset.Rds")
metadata_sub <- readRDS("../example_data/metadata_sub.Rds")
number_of_regulon_clusters <- 10

## Binary regulons
kmeans_thresholds <- auc_thresh_kmeans(regulonAUC)
binary_regulons <- binarize_regulons(regulonAUC,kmeans_thresholds)

joined_bin_reg <- binary_regulons %>%
    reduce(left_join,by="cells")

rownames(joined_bin_reg) <- joined_bin_reg$cells
joined_bin_reg <- joined_bin_reg[2:ncol(joined_bin_reg)]

binary_regulons_trans <- as.matrix(t(joined_bin_reg))

## RRS
rrs_df <- calculate_rrs(metadata_sub,
              binary_regulons = binary_regulons_trans,
              cell_type_column = "cell_type")

rrs_df_wide <- rrs_df %>%
  spread(cell_type,RSS)

rownames(rrs_df_wide) <- rrs_df_wide$regulon
rrs_df_wide <- rrs_df_wide[,2:ncol(rrs_df_wide)]

## Subset all regulons that don't have at least an RSS of X for one cell type
rrs_df_wide_specific <- rrs_df_wide[apply(rrs_df_wide,MARGIN = 1 ,FUN =  function(x) any(x > 0.4)),]

## CSI
regulons_csi <- calculate_csi(regulonAUC,
                              calc_extended = FALSE,
                              verbose = FALSE)

plot_csi_modules(regulons_csi,
                 nclust = number_of_regulon_clusters,
                 font_size_regulons = 8)

csi_csi_wide <- regulons_csi %>%
    spread(regulon_2,CSI)

  future_rownames <- csi_csi_wide$regulon_1
  csi_csi_wide <- as.matrix(csi_csi_wide[,2:ncol(csi_csi_wide)])
  rownames(csi_csi_wide) <- future_rownames

regulons_hclust <- hclust(dist(csi_csi_wide,method = "euclidean"))

clusters <- cutree(regulons_hclust,k= number_of_regulon_clusters)
clusters_df <- data.frame("regulon" = names(clusters),
                          "csi_cluster" = clusters)

## CSI Cluster activity
csi_cluster_activity_wide <- calc_csi_module_activity(clusters_df,
                                     regulonAUC,
                                     metadata_sub)

  pheatmap(csi_cluster_activity_wide,
           show_colnames = TRUE,
           color = viridis(n = 10),
           cluster_cols = TRUE,
           cluster_rows = TRUE,
           clustering_distance_rows = "euclidean",
           clustering_distance_cols = "euclidean")