Tutorial under construction.

Compiled: 11-03-2024

Introduction

In this tutorial we show how to perform transcription factor (TF) regulatory network analysis with hdWGCNA. This is an additional type of network analysis that is complementary to the standard co-expression network analysis included in hdWGCNA. While hdWGCNA co-expression networks are undirected networks, where we do not have explicit information about which genes regulate each other, TF regulatory networks leverage TF binding motif information to build directed networks of TFs and their downstream target genes. Some of this analysis and the concepts used here are similar to another network analysis package called SCENIC, but this TF regulatory network analysis included in hdWGCNA is a distinct approach with key differences. Here, we demonstrate this analysis on a dataset of the human prefrontal cortex, but keep in mind that this analysis must be modified appropriately for different species.

Importantly, this approach was not included in the original hdWGCNA paper Morabito et al., Cell Reports Methods (2023), and we first described this method in our paper Childs & Morabito et al., Cell Reports (2024). If you use the TF regulatory network analysis in your research, please cite both of these papers.

# single-cell analysis package
library(Seurat)

# plotting and data science packages
library(tidyverse)
library(cowplot)
library(patchwork)
library(magrittr)

# co-expression network analysis packages:
library(WGCNA)
library(hdWGCNA)

# network analysis & visualization package:
library(igraph)

# using the cowplot theme for ggplot
theme_set(theme_cowplot())

# set random seed for reproducibility
set.seed(12345)

# re-load the Zhou et al snRNA-seq dataset
seurat_obj <- readRDS('data/Zhou_control.rds')

Install additional packages

For this analysis, we need to install some additional R packages to work with TF binding motifs. These broadly fall in two categories, tools for working TF motifs and genomic coordinates, and database tools to provide us information on TF motifs and genomic features. Two of the databases that we are using are specific to human data (EnsDb.Hsapiens.v86, BSgenome.Hsapiens.UCSC.hg38), and the JASPAR database includes motif information for multiple species. We also need to install xgboost, which includes an algorithm that we use to model TF regulation for each gene.

# install packages for dealing with TF motifs and genomic coordinates
BiocManager::install(c(
  'motifmatchr',
  'TFBSTools',
  'GenomicRanges'
))

# install database packages for human motifs & genomic features
BiocManager::install(c(
  'JASPAR2020',
  'EnsDb.Hsapiens.v86',
  'BSgenome.Hsapiens.UCSC.hg38'
)) 

# install xgboost 
install('xgboost')

# load these packages into R:
library(JASPAR2020)
library(motifmatchr)
library(TFBSTools)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38)
library(GenomicRanges)
library(xgboost)

Identify TFs in promoter regions

The first main step in our TF regulatory network analysis is to determine which genes are potentially regulated by each TF. We provide a function MotifScan which uses the algorithm motifmatchr to search for occurrences of different TF motifs within gene promoter regions. This function will store information in the seurat_obj about which TF motifs are present in each gene's promoter.

seurat_obj <- readRDS(file='/dfs7/swaruplab/smorabit/analysis/scWGCNA/data/zhou_tutorial.rds')

# use TFBSTools to get the motif position weight matrices 
# for the JASPAR 2020 database
pfm_core <- TFBSTools::getMatrixSet(
  x = JASPAR2020,
  opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE)
)

# run the motif scan
seurat_obj <- MotifScan(
  seurat_obj,
  species_genome = 'hg38',
  pfm = pfm_core,
  EnsDb = EnsDb.Hsapiens.v86
)

Construct TF Regulatory Network

Now that we have information about which TFs potentially regulate each gene based on TF motif presence, we have a basis for constructing a TF regulatory network. In this section, we use the function ConstructTFNetwork to construct a network of TFs and their putative target genes. This function leverages the extreme gradient boosting (XGBoost) algorithm, a powerful ensemble learning approach that we use to predict the expression of a given gene based on the expression of all TFs with a matching motif in that gene's promoter. This analysis will reveal a ranking of which TFs are best at predicting the expression of the target gene, which we consider the most likely regulators of that particular gene. Please read the methods section of our paper for more information.

Similar to the standard hdWGCNA co-expression network analysis, we need to define the set of genes that will be used for this analysis, and we need to explicitly define the expression matrix that will be used with the function SetDatExpr.

The user may decide which genes that they want to use for this analysis, but here we will use all of the genes that were assigned to a co-expression module based on the co-expression network analysis, and all genes corresponding to a TF.

# get the motif df:
motif_df <- GetMotifs(seurat_obj)

# keep all TFs, and then remove all genes from the grey module
tf_genes <- unique(motif_df$gene_name)
modules <- GetModules(seurat_obj)
nongrey_genes <- subset(modules, module != 'grey') %>% .$gene_name
genes_use <- c(tf_genes, nongrey_genes)

# update the gene list and re-run SetDatExpr
seurat_obj <- SetWGCNAGenes(seurat_obj, genes_use)
seurat_obj <- SetDatExpr(seurat_obj, group.by = 'cell_type', group_name='INH')

Now we are ready to run ConstructTFNetwork. Since this function models each gene, the runtime will scale wth the number of genes selected from the previous step, and it will scale with the number of metacells / metaspots that are used for this analysis.

library(xgboost)

seurat_obj <- ConstructTFNetwork(seurat_obj)

This function results in a table showing information about each potential TF-gene pair, which we can access using GetTFNetwork.

TODO: explain the results table

tfnet_df <- GetTFNetwork(seurat_obj)
head(tfnet_df)

Expand to see example with different XGBoost parameters

We can use the parameter model_params to pass a list of arguments for XGBoost. See this webpage for the full list of parameters.

seurat_obj <- ConstructTFNetwork(
    seurat_obj,
    model_params=list(
        objective = 'reg:squarederror',
        max_depth = 1,
        eta = 0.1,
        nthread=16,
        alpha=0.5
    )
)

Define TF Regulons

In this step, we use the whole TF network from the previous step to define "regulons" for each TF. As discussed by Aibar et al in the SCENIC paper, regulons are similar to co-expression modules, but the genes in each regulon are comprised of highly confident target genes for each TF. Essentially, this step is pruning the TF regulatory network to only keep the strongest TF-gene connections. Here we offer several strategies for defining TF regulons.

For this analysis, we employ Strategy "A", selecting the top 10 TFs for each gene.

seurat_obj <- AssignTFRegulons(
    seurat_obj,
    strategy = "A",
    reg_thresh = 0.01,
    n_tfs = 10
)

Expand to see example of Strategy "B"

seurat_obj <- AssignTFRegulons(
    seurat_obj,
    strategy = "B",
    reg_thresh = 0.01,
    n_genes = 50
)

Expand to see example of Strategy "C"

seurat_obj <- AssignTFRegulons(
    seurat_obj,
    strategy = "C",
    reg_thresh = 0.1
)

Calculate regulon expression signatures

In the hdWGCNA co-expression analysis, we compute aggregated expression scores for each module, called Module Eigengenes. In TF regulatory network analysis, we also have groups of genes (regulons) for which we can calculate gene expression scores. This will inform us which cells express genes which are likely regulated by specific TFs. Here we use the function RegulonScores to compute scores for each TF regulon.

Importantly, in our TF network analysis, there are some TF-gene pairs with positive co-expression, and some TF-gene pairs with negative co-expression. For regulon scoring, we provide the option target_type to select 'positive', 'negative', or 'both', making it possible to separately analyze the signatures for genes that are activated or repressed by a given TF.



smorabit/hdWGCNA documentation built on July 20, 2024, 8:58 p.m.