This vignette provides a detail workflow on how to carry out an Orthologous Clustering approach to identify conserved and lineage specific co-expressed gene modules. In addition, we illustrate various aspects of data manipulation, filtering and visualization of large transcriptomic experiments.
The fastOC package can be obtainded from github and installed using install_github()
from the devtools package. It is important to note that fastOC depends on other R packages, so please make sure they are installed before you begin. These packages include igraph, Matrix, WGCNA, reshape2, fastcluster and dynamicTreeCut.
#install dependencies install.packages(c("igraph", "Matrix", "WGCNA", "reshape2", "fastcluster", "dynamicTreeCut"), dependencies = TRUE) #install fastOC require(devtools); install_github("mzinkgraf/fastOC");
Before you start the analysis, begin by setting up the work environment and load packages.
require(fastOC); options(scipen = 0); options(stringsAsFactors = FALSE); #change working directory setwd(".") getwd() #list.files()
Begin by loading or creating a list object containing the rpkm expression data. Each element in the list should pertain to the rpkm values for a given species and it is helpful to name each list element after the species.
<<<<<<< HEAD
data("rpkm") names(rpkm)
=======
2ca324f234124cdddc5696f8c7a2cc1193da04dc
##load rpkm data sets load("../data/rpkm.rda") names(rpkm)
data("rpkm") names(rpkm)
Example of the expression data and data.frame layout for rpkm$potri:
pander::pandoc.table(rpkm$potri[1:4,1:2])
Next, we will filter out genes that display low variance. A variance threshold is applied to each of the expression data because genes with low variance often result in weak or non-significant correlations.
rpkm_var <- filterVariance(rpkm, variance = 0.1)
In this section, we will generate a data frame that contains information about the species, genes and columns for storing results. The data frame is created from the rpkm list object and will be used through out the analysis.
GeneMeta <- createGeneMeta(rpkm_var)
Example of GeneMeta:
x<-GeneMeta[c(1,2,28779,64104),] row.names(x)<-NULL pander::pandoc.table(x)
Orhthologous relationships between genes are a central component of fastOC and are used to align co-expression patterns between species. The ortholougous relationships can be calculated locally using programs such as inParanoid or OrthoMCL, or obtained from phytozome.net using the PhytoMine portal. If using inParanoid, see the function parseInParanoid()
to format the inParanoid output into a two column format.
ortho = list() ortho$potri_eugra <- read.table("../inst/extdata/potri_eugra.txt", sep = " ", header = TRUE) ortho$potri_sapur <- read.table("../inst/extdata/potri_sapur.txt", sep = " ", header = TRUE) ortho$sapur_eugra <- read.table("../inst/extdata/sapur_eugra.txt", sep = " ", header = TRUE)
#or lazyload data(ortho)
Example of orthologous table for ortho$potri_eugra:
names(ortho$potri_eugra)[1:2]<-c("GeneA","GeneB") pander::pandoc.table(ortho$potri_eugra[1:4,1:2])
In this section, we will generate the multi-layer network in the iGraph edgelist format that will be used to identify co-expressed gene modules. This network consists of within species co-expression relationships and between species orthologous weight.
First, generate the edgelist for co-expressed genes within each species. The network is created using each gene and the top (default top=5) most highly connected neighbors.
gene_edgelist<-getEdgelist(rpkm_var, GeneMeta, top = 5)
Second, we will generate the orthologous weights in edgelist format that will be used to link the within species co-expression networks. The orthologous weight calculation takes into account the complete evolutionary relationship between genes and the strength of the weight between two genes is based on the number of orthologous connections^[Yan K-K, Wang D, Rozowsky J, Zheng H, Cheng C, Gerstein M. 2014. OrthoClust: an orthology-based network framework for clustering data across multiple species. Genome Biology 15(8): R100.]. In addition, a coupling constant (default = 1) can be applied to the orthologous weights to help balance the relative contribution of co-expression links within a species and the orthologous links across species.
ortho_weights <- getOrthoWeights(ortho, GeneMeta, couple_const = 1)
Third, the two edgelists are combined into a single edgelist object.
names(ortho_weights) <- names(gene_edgelist) combined_out <- rbind(gene_edgelist, ortho_weights) #write file to disk save(combined_out,file="Louvain_input.rdata")
load("../data/combined_out.rda"); load("../data/MultiSpp_trees.rda"); load("../data/occurance.rda");
Use the Louvain community detection method to identify groups of genes that are co-expressed within and orthologous across species. The Louvain method is heuristic algorithm and should be run multiple times to reduce noise and incease significance of detected modules^[Vincent DB, Jean-Loup G, Renaud L, Etienne L. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10): P10008.].
nRuns = 100 results <- louvain(GeneMeta, combined_out, nRuns) save(results, file="igraph_louvain.rdata")
Remove Louvain communities that have fewer than a minimum number of members. To get an idea of how many genes are assigned to each Louvain community use hist(colSums(occurance))
,
occurance <- filterCommunityAssign(results, minMem = 10) save(occurance, file = "Occurance_sparse.rdata")
Calculate co-appearacnce and perform hierarchical clustering to determine which genes have similar co-appearance. Co-appearance represents how often two genes are assigned to the same Louvain community.
MultiSpp_trees <- multiSppHclust(occurance, nRuns, GeneMeta) save(MultiSpp_trees, file = "MultiSppTrees.rdata")
Identify gene modules using dynamic tree cutting.
MultiSpp_modules <- multiSppModules(MultiSpp_trees, GeneMeta, minModuleSize = c(800,800,800), cut = c(0.5,0.5,0.5)) #add module assignemnts to columns 3 of GeneMeta GeneMeta[,3]<-do.call(c,lapply(seq_along(MultiSpp_modules), function(y, i) { y[[i]] }, y=MultiSpp_modules))
In this section, we will create a heatmap plot of the co-appearance results. One major challenge with visualizing these data is the sheer number of genes and visualizing the entire data set is very computationally intense and difficult to render. To get around this, we will sample every ith gene from the gene order defined by the hierarchical cluster step. By default the function plot_MultiSpp()
samples every 12th gene (sb=12)
#Create heatmap of co-appearance modules across all species par(fig=c(0,1,0,1)) plot_MultiSpp(GeneMeta = GeneMeta, order = MultiSpp_trees$order, sb=12, CA_keep = occurance, remove_0 = F, lwd = 1, cex = 0.5) #Add heatmap color bar par(fig=c(0,0.6,0.6,0.9), new=TRUE) color.bar(lut = colorRampPalette(c("white", "lightyellow", "red","black"))(n = 100), min = 0, max = 1, nticks=5, title = "Co-Appearance")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.