knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
In this vignette, we will build one regulatory network for LCL cell line samples and one for whole blood samples from the GTEx gene expression data using the netZooR package. Next, we will compare the two networks, and find the pathways enriched for genes differentially targeted between the LCL cell line and whole blood.
Cell lines are an essential tool in biomedical research and often used as surrogates for tissues. LCLs (obtained from the transformation of B cells present in whole blood) are among the most widely used continuous cell lines with the ability to proliferate indefinitely. By comparing the regulatory networks of LCL cell lines with its tissue of origin (whole blood), we find that LCLs exhibit large changes in their patterns of transcription factor regulation, specifically a loss of repressive transcription factor targeting of cell cycle genes.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager",repos = "http://cran.us.r-project.org") BiocManager::install("fgsea") # install.packages("reshape2",repos = "http://cran.us.r-project.org") install.packages("ggplot2",repos = "http://cran.us.r-project.org")
install.packages("devtools") library(devtools) devtools::install_github("netZoo/netZooR", build_vignettes = FALSE)
library(netZooR) library(fgsea) library(ggplot2) library(reshape2)
PANDA (Passing Attributes between Networks for Data Assimilation) is a method for constructing gene regulatory networks. It uses message passing to find congruence between 3 different data layers: protein-protein interaction (PPI), gene expression, and transcription factor (TF) motif data.
More details can be found in the published paper https://doi.org/10.1371/journal.pone.0064832.
Now we locate our ppi and motif priors. The ppi represents physical interactions between transcription factor proteins, and is an undirected network. The transcription factor motif prior represents putative regulation events where a transcription factor binds in the promotor of a gene to regulate its expression, as predicted by the presence of transcription factor binding motifs in the promotor region of the gene. The motif prior is thus a directed network linking transcription factors to their predicted gene targets. These are small example priors for the purposes of demonstrating this method. A complete set of motif priors by species can be downloaded from: https://sites.google.com/a/channing.harvard.edu/kimberlyglass/tools/resources
The function sourcePPI() can be used to source the protein-protein interaction network from the STRING database, but it does so only to produce an unweighted network that aggregates all types of interactions not only the physical interaction subnetwork. Please refer to STRINGdb website or R package to specify a PPI network suited for your application.
Let's take a look at the priors:
# download motif and ppi file from AWS Bucket system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/motif_subset.txt") system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/ppi_subset.txt") motif <- read.delim("./motif_subset.txt", stringsAsFactors=F, header=F) ppi <- read.delim("./ppi_subset.txt", stringsAsFactors=F, header=F) ppi[1:5,] motif[1:5,]
Now we locate our expression data. As an example, we will use a subset of the GTEx version 7 RNA-Seq data, downloaded from https://gtexportal.org/home/datasets. We start with a subset of RNA-Seq data (tpm normalized) for 1,000 genes from 130 LCL cell line samples and 407 whole blood samples.
# dowload and load the GTEx expression matrix (tpm normalized expression) system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/expression_tpm_lcl_blood_subset.txt") exp <- read.delim("./expression_tpm_lcl_blood_subset.txt", stringsAsFactors = F, check.names = F) # Log transform the tpm normalized expression exp <- log2(exp+1) # Determine the number of non-NA/non-zero rows in the expression data. This is to be able to ensure that PANDA will have enough values in the vectors to calculate pearson correlations between gene expression profiles in the construction of the gene co-expression prior. zero_na_counts <- apply(exp, MARGIN = 1, FUN = function(x) length(x[(!is.na(x) & x!=0) ])) # Maintain only genes with at least 20 valid gene expression entries exp <- exp[zero_na_counts > 20,] # # The set of genes in the expression, motif and ppi matrices must be the same # exp <- exp[rownames(exp) %in% motif$V2,] # motif_subset <- motif[(motif$V1 %in% rownames(exp)) & (motif$V2 %in% rownames(exp)),] # ppi_subset <- ppi[(ppi$V1 %in% motif_subset$V1) & (ppi$V2 %in% motif_subset$V1),] # Load the sample ids of LCL samples system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/LCL_samples.txt") lcl_samples <-read.delim("./LCL_samples.txt", header=FALSE, stringsAsFactors=FALSE) # Select the columns of the expression matrix corresponding to the LCL samples lcl_exp <- exp[,colnames(exp) %in% lcl_samples[,1]] # Load the sample ids of whole blood samples system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/WholeBlood_samples.txt") wblood_samples <-read.delim("./WholeBlood_samples.txt", header=FALSE, stringsAsFactors=FALSE) # Select the columns of the expression matrix corresponding to the whole blood samples wb_exp <- exp[,colnames(exp) %in% wblood_samples[,1]]
Now we run PANDA, pointing it to the parsed expression data, motif prior and ppi prior. We will point to the same motif and ppi priors for each PANDA run, which represents the initial putative regulatory information. We then point to the expression matrix correspoding to the LCL samples to generate the LCL regulatory network, and to the expression matrix corresponding to the whole blood samples to generate the whole blood regulatory network.
pandaLCL <- panda(motif, lcl_exp, ppi, mode="intersection") pandaWB <- panda(motif, wb_exp, ppi, mode="intersection") pandaLCL pandaWB
The regulatory network (bipartite graph) with edge weights representing the "likelihood" that a transcription factor binds the promotor of and regulates the expression of a gene.
# The bipartite regulatory network (transcription factors as rows and target genes as columns) regNetLCL <- pandaLCL@regNet regNetLCL[1:5,1:5] regNetWB <- pandaWB@regNet
In this section we will visualize parts of the network using the Cytoscape software. Download Cytoscape from: https://cytoscape.org and have the software open before calling the function.
# We will use the function vis.panda.in.cytoscape to plot a set of nodes and edges on Cytoscape. The input for this function is a data.frame of edges to plot with 4 columns: "tf", "gene", "motif" (TF motif present or not on gene promoter), "force" (edge weight calculated by PANDA). lcl_vis <- reshape2::melt(pandaLCL@regNet) wb_vis <- reshape2::melt(pandaWB@regNet) lcl_vis <- data.frame("TF"=as.character(lcl_vis[,1]),"Gene"=as.character(lcl_vis[,2]),"Motif"=NA,"Score"=as.numeric(lcl_vis[,3]),stringsAsFactors = FALSE) wb_vis <- data.frame("TF"=as.character(wb_vis[,1]),"Gene"=as.character(wb_vis[,2]),"Motif"=NA,"Score"=as.numeric(wb_vis[,3]),stringsAsFactors = FALSE) head(lcl_vis)
n=200 # number of edges to plot top <- order(lcl_vis$Score,decreasing=T)[1:n] lcl_vis_top <- lcl_vis[top,] # Plot in cytoscape (open Cytoscape before running this command) visPandaInCytoscape(lcl_vis_top, network_name="LCL") # Here we will load a customized visual style for our network, in which TF nodes are orange circles, target gene nodes are blue squares, and edges shade and width are the edge weight (likelyhood of regulatory interaction between the TF and gene). You can further customize the network style directly from Cytoscape. createPandaStyle(style_name="PandaStyle")
# Select the top differential edge weights betweeen LCL and whole blood diffRes <- pandaDiffEdges(lcl_vis, wb_vis, condition_name="LCL") head(diffRes) # Number of differential edges is: nrow(diffRes) # Select the top differential edges higher in LCL to plot in Cytoscape n=200 # number of edges to select from each condition diffResLCL <- diffRes[diffRes$LCL=="T",] diffResLCL <- diffResLCL[order(diffResLCL$Score,decreasing=TRUE),][1:n,] # Select the top differential edges higher in whole blood to plot in Cytoscape diffResWB <- diffRes[diffRes$LCL=="F",] diffResWB <- diffResWB[order(diffResWB$Score,decreasing=TRUE),][1:n,] # Combine top differential edges in LCL and WB to plot in Cytoscape diffRes_vis <- rbind(diffResLCL, diffResWB) # Plot the network (open Cytoscape before running this command) # Purple edges indicate higher edge weight in the defined "condition_name" parameter (LCL in our example), and green edges indicate higher edge weight in the other condition (whole blood in our example). visDiffPandaInCytoscape(diffRes_vis, condition_name = "LCL", network_name="diff.PANDA") # Apply the style to the network createDiffPandaStyle(style_name="Diff.PandaStyle", condition_name="LCL")
lcl_outdegree <- calcDegree(pandaLCL, type="tf") wb_outdegree <- calcDegree(pandaWB, type="tf") lcl_indegree <- calcDegree(pandaLCL, type="gene") wb_indegree <- calcDegree(pandaWB, type="gene") # Calculate the gene in-degree difference for two different panda regulatory networks (LCL minus whole blood) degreeDiff <- calcDegreeDifference(pandaLCL, pandaWB, type="gene") head(degreeDiff)
Well will use the fgsea package to perform gene set enrichment analysis. We need to point to a ranked gene list (for example the gene in-degree difference between LCL and whole blood), and a list of gene sets (or signatures) in gmt format to test for enrichment. The gene sets can be downloaded from MSigDB: http://software.broadinstitute.org/gsea/msigdb Same gene annotation should be used in the ranked gene list and gene sets. In our example we will use the KEGG pathways downloaded from MSigDB.
system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/c2.cp.kegg.v7.0.symbols.gmt") pathways <- gmtPathways("./c2.cp.kegg.v7.0.symbols.gmt") # To retrieve biological-relevant processes, we will load and use the complete ranked gene list (27,175 genes) calculated from the complete network instead of the 1,000 subset genes we used in this tutorial example to build PANDA networks within a very short run time. system("curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/lclWB_indegreeDifference.rnk") degreeDiff_all <- read.delim("./lclWB_indegreeDifference.rnk",stringsAsFactors = F,header=F) degreeDiff_all <- setNames(degreeDiff_all[,2], degreeDiff_all[,1]) fgseaRes <- fgsea(pathways, degreeDiff_all, minSize=15, maxSize=500, nperm=1000) head(fgseaRes) # Subset to pathways with FDR < 0.05 sig <- fgseaRes[fgseaRes$padj < 0.05,] # Get the top 10 significant pathways enriched for genes having lower targeting in LCLs sig[order(sig$NES)[1:10],]
Bubble plot of gene sets (KEGG pathways) on y-axis and adjusted p-value (padj) on x-axis. Bubble size indicates the number of genes in each gene set, and bubble color indicates the normalized enrichment score (NES). Blue is for negative NES (enrichment of higher targeted genes in whole blood), and red is for positive NES (enrichment of higher targeted genes in LCL).
dat <- data.frame(fgseaRes) # Settings fdrcut <- 0.05 # FDR cut-off to use as output for significant signatures dencol_neg <- "blue" # bubble plot color for negative ES dencol_pos <- "red" # bubble plot color for positive ES signnamelength <- 4 # set to remove prefix from signature names (2 for "GO", 4 for "KEGG", 8 for "REACTOME") asp <- 3 # aspect ratio of bubble plot charcut <- 100 # cut signature name in heatmap to this nr of characters # Make signature names more readable a <- as.character(dat$pathway) # 'a' is a great variable name to substitute row names with something more readable for (j in 1:length(a)){ a[j] <- substr(a[j], signnamelength+2, nchar(a[j])) } a <- tolower(a) # convert to lower case (you may want to comment this out, it really depends on what signatures you are looking at, c6 signatures contain gene names, and converting those to lower case may be confusing) for (j in 1:length(a)){ if(nchar(a[j])>charcut) { a[j] <- paste(substr(a[j], 1, charcut), "...", sep=" ")} } # cut signature names that have more characters than charcut, and add "..." a <- gsub("_", " ", a) dat$NAME <- a # Determine what signatures to plot (based on FDR cut) dat2 <- dat[dat[,"padj"]<fdrcut,] dat2 <- dat2[order(dat2[,"padj"]),] dat2$signature <- factor(dat2$NAME, rev(as.character(dat2$NAME))) # Determine what labels to color sign_neg <- which(dat2[,"NES"]<0) sign_pos <- which(dat2[,"NES"]>0) # Color labels signcol <- rep(NA, length(dat2$signature)) signcol[sign_neg] <- dencol_neg # text color of negative signatures signcol[sign_pos] <- dencol_pos # text color of positive signatures signcol <- rev(signcol) # need to revert vector of colors, because ggplot starts plotting these from below # Plot bubble plot g<-ggplot(dat2, aes(x=padj,y=signature,size=size)) g+geom_point(aes(fill=NES), shape=21, colour="white")+ theme_bw()+ # white background, needs to be placed before the "signcol" line xlim(0,fdrcut)+ scale_size_area(max_size=10,guide="none")+ scale_fill_gradient2(low=dencol_neg, high=dencol_pos)+ theme(axis.text.y = element_text(colour=signcol))+ theme(aspect.ratio=asp, axis.title.y=element_blank()) # test aspect.ratio
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.