Tutorial under construction.
Compiled: 11-03-2024
In this tutorial we show how to perform transcription factor (TF) regulatory network analysis with hdWGCNA. This is an additional type of network analysis that is complementary to the standard co-expression network analysis included in hdWGCNA. While hdWGCNA co-expression networks are undirected networks, where we do not have explicit information about which genes regulate each other, TF regulatory networks leverage TF binding motif information to build directed networks of TFs and their downstream target genes. Some of this analysis and the concepts used here are similar to another network analysis package called SCENIC, but this TF regulatory network analysis included in hdWGCNA is a distinct approach with key differences. Here, we demonstrate this analysis on a dataset of the human prefrontal cortex, but keep in mind that this analysis must be modified appropriately for different species.
Importantly, this approach was not included in the original hdWGCNA paper Morabito et al., Cell Reports Methods (2023), and we first described this method in our paper Childs & Morabito et al., Cell Reports (2024). If you use the TF regulatory network analysis in your research, please cite both of these papers.
# single-cell analysis package library(Seurat) # plotting and data science packages library(tidyverse) library(cowplot) library(patchwork) library(magrittr) # co-expression network analysis packages: library(WGCNA) library(hdWGCNA) # network analysis & visualization package: library(igraph) # using the cowplot theme for ggplot theme_set(theme_cowplot()) # set random seed for reproducibility set.seed(12345) # re-load the Zhou et al snRNA-seq dataset seurat_obj <- readRDS('data/Zhou_control.rds')
For this analysis, we need to install some additional R packages
to work with TF binding motifs. These broadly fall in two categories,
tools for working TF motifs and genomic coordinates, and database tools
to provide us information on TF motifs and genomic features. Two of the
databases that we are using are specific to human data
(EnsDb.Hsapiens.v86
, BSgenome.Hsapiens.UCSC.hg38
), and the JASPAR
database includes motif information for multiple species. We also need to
install xgboost
, which includes an algorithm that we use to model
TF regulation for each gene.
# install packages for dealing with TF motifs and genomic coordinates BiocManager::install(c( 'motifmatchr', 'TFBSTools', 'GenomicRanges' )) # install database packages for human motifs & genomic features BiocManager::install(c( 'JASPAR2020', 'EnsDb.Hsapiens.v86', 'BSgenome.Hsapiens.UCSC.hg38' )) # install xgboost install('xgboost') # load these packages into R: library(JASPAR2020) library(motifmatchr) library(TFBSTools) library(EnsDb.Hsapiens.v86) library(BSgenome.Hsapiens.UCSC.hg38) library(GenomicRanges) library(xgboost)
The first main step in our TF regulatory network analysis is to
determine which genes are potentially regulated by each TF. We provide
a function MotifScan
which uses the algorithm motifmatchr
to search for occurrences of different TF motifs within gene promoter
regions. This function will store information in the seurat_obj
about which TF motifs are present in each gene's promoter.
seurat_obj <- readRDS(file='/dfs7/swaruplab/smorabit/analysis/scWGCNA/data/zhou_tutorial.rds') # use TFBSTools to get the motif position weight matrices # for the JASPAR 2020 database pfm_core <- TFBSTools::getMatrixSet( x = JASPAR2020, opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE) ) # run the motif scan seurat_obj <- MotifScan( seurat_obj, species_genome = 'hg38', pfm = pfm_core, EnsDb = EnsDb.Hsapiens.v86 )
Now that we have information about which TFs potentially regulate each gene based on
TF motif presence, we have a basis for constructing a TF regulatory network. In this
section, we use the function ConstructTFNetwork
to construct a network of TFs and their
putative target genes. This function leverages the extreme gradient boosting (XGBoost) algorithm,
a powerful ensemble learning approach that we use to predict the expression of a given gene
based on the expression of all TFs with a matching motif in that gene's promoter. This analysis
will reveal a ranking of which TFs are best at predicting the expression of the target gene,
which we consider the most likely regulators of that particular gene. Please read the methods
section of our paper for more information.
Similar to the standard hdWGCNA co-expression network analysis, we need to define the
set of genes that will be used for this analysis, and we need to explicitly
define the expression matrix that will be used with the function SetDatExpr
.
The user may decide which genes that they want to use for this analysis, but here we will use all of the genes that were assigned to a co-expression module based on the co-expression network analysis, and all genes corresponding to a TF.
# get the motif df: motif_df <- GetMotifs(seurat_obj) # keep all TFs, and then remove all genes from the grey module tf_genes <- unique(motif_df$gene_name) modules <- GetModules(seurat_obj) nongrey_genes <- subset(modules, module != 'grey') %>% .$gene_name genes_use <- c(tf_genes, nongrey_genes) # update the gene list and re-run SetDatExpr seurat_obj <- SetWGCNAGenes(seurat_obj, genes_use) seurat_obj <- SetDatExpr(seurat_obj, group.by = 'cell_type', group_name='INH')
Now we are ready to run ConstructTFNetwork
. Since this function models each gene,
the runtime will scale wth the number of genes selected from the previous step, and it
will scale with the number of metacells / metaspots that are used for this analysis.
library(xgboost) seurat_obj <- ConstructTFNetwork(seurat_obj)
This function results in a table showing information about each potential TF-gene pair,
which we can access using GetTFNetwork
.
TODO: explain the results table
tfnet_df <- GetTFNetwork(seurat_obj) head(tfnet_df)
Expand to see example with different XGBoost parameters
We can use the parameter model_params
to pass a list of arguments for XGBoost.
See this webpage for the full list of parameters.
seurat_obj <- ConstructTFNetwork( seurat_obj, model_params=list( objective = 'reg:squarederror', max_depth = 1, eta = 0.1, nthread=16, alpha=0.5 ) )
In this step, we use the whole TF network from the previous step to define "regulons" for each TF. As discussed by Aibar et al in the SCENIC paper, regulons are similar to co-expression modules, but the genes in each regulon are comprised of highly confident target genes for each TF. Essentially, this step is pruning the TF regulatory network to only keep the strongest TF-gene connections. Here we offer several strategies for defining TF regulons.
reg_thresh
). For this analysis, we employ Strategy "A", selecting the top 10 TFs for each gene.
seurat_obj <- AssignTFRegulons( seurat_obj, strategy = "A", reg_thresh = 0.01, n_tfs = 10 )
Expand to see example of Strategy "B"
seurat_obj <- AssignTFRegulons( seurat_obj, strategy = "B", reg_thresh = 0.01, n_genes = 50 )
Expand to see example of Strategy "C"
seurat_obj <- AssignTFRegulons( seurat_obj, strategy = "C", reg_thresh = 0.1 )
In the hdWGCNA co-expression analysis, we compute aggregated expression scores for each
module, called Module Eigengenes. In TF regulatory network analysis, we also have groups
of genes (regulons) for which we can calculate gene expression scores. This will inform
us which cells express genes which are likely regulated by specific TFs. Here we
use the function RegulonScores
to compute scores for each TF regulon.
Importantly, in our TF network analysis, there are some TF-gene pairs with positive
co-expression, and some TF-gene pairs with negative co-expression. For regulon scoring,
we provide the option target_type
to select 'positive', 'negative', or 'both',
making it possible to separately analyze the signatures for genes that are
activated or repressed by a given TF.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.