knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
As human studies often encounter numerous constraints, including larger biological heterogeneity, hidden confounding factors, greater cost and time, and potential ethical concerns, model organisms have played an indispensable role in pre-clinical research to understand pathogenesis of human diseases at the behavioral, cellular and molecular level. Despite their indispensable role in mechanistic investigation and drug development, molecular congruence of animal models to human has long been questioned and debated. We hereby introduce a comprehensive and quantitative framework, namely Congruence Analysis of Model Organisms (CAMO), for congruence and translational evaluation of animal models in their molecular responses in omics data. CAMO is served as an objective and quantitative approach to identify biomarkers, pathways and topological gene modules that are best or least mimicked by the model organism.
knitr::include_graphics("CAMO_workflow.png")
Figure 1 presents an overview of the CAMO pipeline, consisting of state-of-the-art methods and novel approaches for a thorough congruence evaluation.
(A). The steps to calculate genome wide and pathway level c-scores and d-scores for a pair of human study (HS) and mouse study (MS), which consists of
When multiple cohorts are jointly analyzed, c-scores and d-scores are calculated for all pair-wise studies in each individual pathway.
(B). Downstream machine learning and bioinformatics tools for comparative analysis and knowledge discovery using pairwise scores and their p-values of multiple studies, which includes
To demonstrate how to perform congruence analysis of multiple studies using CAMO package, we utilize two human and two mouse inflammatory diseases (Burn and Sepsis), which are part of the data used in case study 1 in the CAMO paper.
We first run differential anlysis (LIMMA for microarray data) followed by the Bayesian differential analysis to transform frequentist p-values to differential posterior probabilities for each study to derive the empirical distribution of DE indicators for each gene. The MCMC step in Bayesian analysis may take a while.
library(CAMO) data("hb")#load human burn data data("hb.group") summaryDE = indDE(data=hb,group=as.factor(hb.group),data.type="microarray", case.label="2", ctrl.label="1")#differential anlysis hb_pData = summaryDE[,c(3,1)] hb_MCMCout = bayes(hb_pData, seed=12345)#Bayesian differential analysis data("hs")#load human sepsis data data("hs.group") summaryDE = indDE(data=hs,group=as.factor(hs.group),data.type="microarray", case.label="2", ctrl.label="1")#differential anlysis hs_pData = summaryDE[,c(3,1)] hs_MCMCout = bayes(hs_pData, seed=12345)#Bayesian differential analysis data("mb")#load mouse burn data data("mb.group") summaryDE = indDE(data=mb,group=as.factor(mb.group),data.type="microarray", case.label="2", ctrl.label="1")#differential anlysis mb_pData = summaryDE[,c(3,1)] mb_MCMCout = bayes(mb_pData, seed=12345)#Bayesian differential analysis data("ms")#load mouse sepsis data data("ms.group") summaryDE = indDE(data=ms,group=as.factor(ms.group),data.type="microarray", case.label="2", ctrl.label="1")#differential anlysis ms_pData = summaryDE[,c(3,1)] ms_MCMCout = bayes(ms_pData, seed=12345)#Bayesian differential analysis
We merge the MCMC matrices by orthologs mapping between human and mouse genes.
mcmc.list = list(hb_MCMCout,hs_MCMCout,mb_MCMCout,ms_MCMCout) species = c(rep("human",2), rep("mouse",2))#specify species for each MCMC matrix data(hm_orth)##load orthologs file, retrieved from:https://fgr.hms.harvard.edu/diopt mcmc.merge.list <- merge(mcmc.list,species = species, ortholog.db = hm_orth, reference=1) save(mcmc.merge.list, file = "mcmc.merge.list.RData")
Then, we construct genome-wide c-scores and d-scores with permutation test to derive p-values.
dataset.names = c("hb","hs","mb","ms")#specify study names for each merged MCMC matrix set.seed(12345) ACS_ADS_global = multi_ACS_ADS_global(mcmc.merge.list,dataset.names, measure="Fmeasure",B=1000) save(ACS_ADS_global,file="ACS_ADS_global.RData")
For pathway level c-scores and d-scores, we first select pathways of interest by meta-enrichment-analysis and then c-scores and d-scores with permuted p-values can be calculated similarly.
#Select pathways by meta-enrichement-analysis data(human.pathway.list) #load pathway database, this includes KEGG and Reactome database select.pathway = pathSelect(mcmc.merge.list,pathway.list, pathwaysize.lower.cut = 5, pathwaysize.upper.cut=200, overlapsize.cut = 5, med.de.cut =3,min.de.cut=0, qfisher.cut = 0.05) (K = length(select.pathway)) select.pathway.list = pathway.list[select.pathway] save(select.pathway.list,file=paste(WD,"/select_",K,"_pathways.RData",sep="")) set.seed(12345) ACS_ADS_pathway = multi_ACS_ADS_pathway(mcmc.merge.list,dataset.names, select.pathway.list, measure="Fmeasure", B=1000,parallel=F) save(ACS_ADS_pathway,file="ACS_ADS_pathway.RData")
We apply the multi-dimensional scaling (MDS) plot using transformation of the genome-wide c-scores as dissimilarity measure provides congruence visualization.The plot will be automatically saved to the current directory.
mdsGlobal(ACS_ADS_global$ACS,row.names(ACS_ADS_global$ACS),sep="_",file="globalMDS_cscores.pdf")
We first select the optimal number of clusters by consensus clustering. The consensus CDF and delta area plot will be automatically saved to the current directory to determing the number of pathway clusters.
ACSpvalue.mat = ACS_ADS_pathway$ACSpvalue.mat results = ConsensusClusterPlus(d=t(-log10(ACSpvalue.mat)),maxK=10,reps=50,pItem=0.8,pFeature=1,title="Consensus Clustering",clusterAlg="hc",innerLinkage="ward.D2",finalLinkage="ward.D2",seed=12345,plot="png")
Then, we apply the consensus clustering with specified K=4 (suggested from last step) and allow pathways with small silhouette index to be removed from the current clusters repeatedly until every pathway has a silhouette index above the selected cutoff (0.1). Text mining will then be applied to each tight cluster. Heatmap and MDS of the minus-log-transformed p-values of pathway level c-scores in pair-wise studies will be saved as pdf files. Text minining results will be saved as a spreadsheet. In addition, the co-membership heatmaps summarizing the proportion of significantly concordant pathways within each pathway cluster between each pair of studies will also be saved.
data(hashtb_hsa) #load the pathway-phrase matrix for text mining set.seed(12345) multiOutput(mcmc.merge.list,dataset.names,select.pathway.list, ACS_ADS_pathway, output=c("clustPathway"), hashtb=hashtb,optK = 4,keywords_cut=0.2,comemberProb_cut=0.6)
CAMO package provides within pathway visualization tools to visualize the pairwise c-scores/d-scores for individual pathways of interests. For each pathway, "mdsModel" and "clustModel" provide MDS and clustering heatmap for study clusters, "genePM" provides a gene-wise heatmap of posterior mean of DE evidence for all studies, "keggView" and "reactomeView" generate the topology graph where gene nodes are colored by its concordance/discordance information in a study pair.
##mdsModel, clustModel, genePM sub_select.pathway.list = select.pathway.list[c("KEGG B cell receptor signaling pathway", "KEGG Leukocyte transendothelial migration", "Reactome TCR signaling")]#select pathway of interests multiOutput(mcmc.merge.list,dataset.names,select.pathway.list=sub_select.pathway.list, ACS_ADS_pathway, output=c("mdsModel","clustModel","genePM")) ##keggView multiOutput(mcmc.merge.list,dataset.names,select.pathway.list=sub_select.pathway.list, ACS_ADS_pathway, output=c("keggView"), ViewPairSelect = NULL, # will generate for all study pairs automatically kegg.species="hsa", KEGG.dataGisEntrezID=FALSE, # data genes are symbols, not entrezID KEGG.dataG2EntrezID=NULL, # when not provided, org.Hs.eg.db is used to map gene names KEGG.pathID2name=NULL) #when not provided, KEGGREST is used to retrieve pathway ID ##reactomeView multiOutput(mcmc.merge.list,dataset.names,select.pathway.list=sub_select.pathway.list, ACS_ADS_pathway, output=c("reactomeView"), ViewPairSelect = NULL,# will generate for all study pairs automatically reactome.species="HSA", Reactome.dataG2TopologyGtype=NULL, # when not provided, search for data gene names on Reactome topology directly Reactome.pathID2name=NULL) #when not provided, reactome.db is used to retrieve pathway ID
For KEGG pathways, a community detection algorithm is developed for a pair of models to identify concordant or discordant subnetworks based on the topological regulatory information. This is recommended for human pathways (KEGG species = "hsa") because the current knowledge about the the regulatory information in other species is very limited. We demonstrate how to run the local community detection algorithm and generate the KEGG topology plots with highlighted modules on the human pathway "KEGG Leukocyte transendothelial migration" below.
res_hsa04670 = KEGG_module(mcmc.merge.list, dataset.names,KEGGspecies="hsa", KEGGpathwayID="04670",data.pair = c("hs","ms"), gene_type = c("discordant"), DE_PM_cut = 0.2, minM = 4,maxM = NULL, B = 1000, cores = 1, search_method = c("Exhaustive"), Elbow_plot = T, filePath = getwd()) #Elbow plot of -log10(p-value) will be saved at the filePath res = KEGG_module_topology_plot(res_hsa04670,which_to_draw = 12) #KEGG toplogy with highlighted gene module will be saved locally
In this example, we compare the developmental stages of Caenorhabditis elegans (worm) and Drosophila melanogaster (fruit fly). Please refer to the CAMO paper for detailed data description.
Bayesian differential analysis to derive the empirical distribution of DE indicators in each of the 5 developmental stages of worm and fly. The MCMC step in Bayesian analysis may take a while.
library(CAMO) load(ce_e0) summaryDE <- indDE(data=ce_e0_dat,group=as.factor(ce_e0_group),data.type="microarray",case.label="2", ctrl.label="1") ce_e0_pData <- summaryDE[,c(3,1)] ce_e0_MCMCout <- bayes(ce_e0_pData, seed=12345) save(ce_e0_MCMCout,file="ce_e0_MCMCout.RData") load(ce_e1) summaryDE <- indDE(data=ce_e1_dat,group=as.factor(ce_e1_group),data.type="microarray",case.label="2", ctrl.label="1") ce_e1_pData <- summaryDE[,c(3,1)] ce_e1_MCMCout <- bayes(ce_e1_pData, seed=12345) save(ce_e1_MCMCout,file="ce_e1_MCMCout.RData") load(ce_e2) summaryDE <- indDE(data=ce_e2_dat,group=as.factor(ce_e2_group),data.type="microarray",case.label="2", ctrl.label="1") ce_e2_pData <- summaryDE[,c(3,1)] ce_e2_MCMCout <- bayes(ce_e2_pData, seed=12345) save(ce_e2_MCMCout,file="ce_e2_MCMCout.RData") load(ce_lar) summaryDE <- indDE(data=ce_lar_dat,group=as.factor(ce_lar_group),data.type="microarray",case.label="2", ctrl.label="1") ce_lar_pData <- summaryDE[,c(3,1)] ce_lar_MCMCout <- bayes(ce_lar_pData, seed=12345) save(ce_lar_MCMCout,file="ce_lar_MCMCout.RData") load(ce_dau) summaryDE <- indDE(data=ce_dau_dat,group=as.factor(ce_dau_group),data.type="microarray",case.label="2", ctrl.label="1") ce_dau_pData <- summaryDE[,c(3,1)] ce_dau_MCMCout <- bayes(ce_dau_pData, seed=12345) save(ce_dau_MCMCout,file="ce_dau_MCMCout.RData") load(dm_e0) summaryDE <- indDE(data=dm_e0_dat,group=as.factor(dm_e0_group),data.type="microarray",case.label="2", ctrl.label="1") dm_e0_pData <- summaryDE[,c(3,1)] dm_e0_MCMCout <- bayes(dm_e0_pData, seed=12345) save(dm_e0_MCMCout,file="dm_e0_MCMCout.RData") load(dm_e1) summaryDE <- indDE(data=dm_e1_dat,group=as.factor(dm_e1_group),data.type="microarray",case.label="2", ctrl.label="1") dm_e1_pData <- summaryDE[,c(3,1)] dm_e1_MCMCout <- bayes(dm_e1_pData, seed=12345) save(dm_e1_MCMCout,file="dm_e1_MCMCout.RData") load(dm_e2) summaryDE <- indDE(data=dm_e2_dat,group=as.factor(dm_e2_group),data.type="microarray",case.label="2", ctrl.label="1") dm_e2_pData <- summaryDE[,c(3,1)] dm_e2_MCMCout <- bayes(dm_e2_pData, seed=12345) save(dm_e2_MCMCout,file="dm_e2_MCMCout.RData") load(dm_lar) summaryDE <- indDE(data=dm_lar_dat,group=as.factor(dm_lar_group),data.type="microarray",case.label="2", ctrl.label="1") dm_lar_pData <- summaryDE[,c(3,1)] dm_lar_MCMCout <- bayes(dm_lar_pData, seed=12345) save(dm_lar_MCMCout,file="dm_lar_MCMCout.RData") load(dm_pup) summaryDE <- indDE(data=dm_pup_dat,group=as.factor(dm_pup_group),data.type="microarray",case.label="2", ctrl.label="1") dm_pup_pData <- summaryDE[,c(3,1)] dm_pup_MCMCout <- bayes(dm_pup_pData, seed=12345) save(dm_pup_MCMCout,file="dm_pup_MCMCout.RData")
Merge the MCMC matrices by orthologs mapping between worm and fly genes:
mcmc.list = list(ce_e0_MCMCout,ce_e1_MCMCout,ce_e2_MCMCout,ce_lar_MCMCout,ce_dau_MCMCout, dm_e0_MCMCout,dm_e1_MCMCout,dm_e2_MCMCout,dm_lar_MCMCout,dm_pup_MCMCout) data(cd_orth) species = c(rep("ce",5), rep("dm",5)) mcmc.merge.list <- merge(mcmc.list,species = species, ortholog.db = cd_orth, reference=1) dataset.names = c("ce.e0","ce.e1","ce.e2","ce.lar","ce.dau", "dm.e0","dm.e1","dm.e2","dm.lar","dm.pup") names(mcmc.merge.list) = dataset.names save(mcmc.merge.list,file="mcmc.merge.list.RData")
Construct genome-wide c-scores and d-scores with permutation test to derive p-values:
dataset.names = c("ce.e0","ce.e1","ce.e2","ce.lar","ce.dau", "dm.e0","dm.e1","dm.e2","dm.lar","dm.pup") set.seed(12345) ACS_ADS_global = multi_ACS_ADS_global(mcmc.merge.list,dataset.names, measure="Fmeasure",B=1000) save(ACS_ADS_global,file="ACS_ADS_global.RData")
Meta-pathway enrichment analysis to select pathways and calculate pathway level c-scores and d-scores:
#Select pathways by meta-enrichement-analysis data(worm.pathway.list) #load pathway database, this includes KEGG and Reactome database select.pathway0 = pathSelect(mcmc.merge.list,worm.pathway.list, pathwaysize.lower.cut=3, pathwaysize.upper.cut=500, overlapsize.cut=3,med.de.cut=0,min.de.cut=2, qfisher.cut = 1.01, # relax the requirement of q-values topPath.indStudy.num = 50)# select genes by gene rankings (K = length(select.pathway)) select.pathway.list = pathway.list[select.pathway] save(select.pathway.list,file=paste(WD,"/select_",K,"_pathways.RData",sep="")) set.seed(12345) ACS_ADS_pathway = multi_ACS_ADS_pathway(mcmc.merge.list,dataset.names, select.pathway.list, measure="Fmeasure", B=1000,parallel=F) save(ACS_ADS_pathway,file="ACS_ADS_pathway.RData")
mdsGlobal(ACS_ADS_global$ACS,row.names(ACS_ADS_global$ACS),sep="_",file="globalMDS_cscores.pdf")
Select the optimal number of clusters by consensus clustering. The consensus CDF and delta area plot will be automatically saved to the current directory to determing the number of pathway clusters.
ACSpvalue.mat = ACS_ADS_pathway$ACSpvalue.mat results = ConsensusClusterPlus(d=t(-log10(ACSpvalue.mat)),maxK=10,reps=50,pItem=0.8,pFeature=1,title="Consensus Clustering",clusterAlg="hc",innerLinkage="ward.D2",finalLinkage="ward.D2",seed=12345,plot="png")
Consensus tight clustering with specified K=6 to generate clustering heatmap, MDS, text mining results asng co-membership heatmaps:
data(hashtb_cel) #load the pathway-phrase matrix for text mining set.seed(12345) multiOutput(mcmc.merge.list,dataset.names,select.pathway.list, ACS_ADS_pathway, output=c("clustPathway"), hashtb=hashtb,optK = 6,keywords_cut=0.2,comemberProb_cut=0.7)
Generate within pathway clustering heatmap, MDS, gene-wise heatmap, KEGG topology plot and Reactome topology plot:
##mdsModel, clustModel, genePM sub_select.pathway.list = select.pathway.list[c("KEGG Homologous recombination", "KEGG Mismatch repair", "Reactome Nucleotide-binding domain, leucine rich repeat containing receptor (NLR) signaling pathways", "Reactome PKA activation")]#select pathway of interests multiOutput(mcmc.merge.list,dataset.names,select.pathway.list=sub_select.pathway.list, ACS_ADS_pathway, output=c("mdsModel","clustModel","genePM")) ##keggView data(Seqname2Entrez_ce)#data genes are in sequence names, need to be mapped to EntrezID for KEGG topology plot multiOutput(mcmc.merge.list,dataset.names,select.pathway.list=sub_select.pathway.list, ACS_ADS_pathway, output=c("keggView"), ViewPairSelect = NULL, # will generate for all study pairs automatically kegg.species="cel", KEGG.dataGisEntrezID=FALSE, # data genes are symbols, not entrezID KEGG.dataG2EntrezID=Seqname2Entrez_ce, # a data frame maping data gene names (1st col) to EntrezID (2nd col) KEGG.pathID2name=NULL) #when not provided, KEGGREST is used to retrieve pathway ID ##reactomeView data(Seqname2Publicname_ce)#data genes are in sequence names, need to be mapped to public names for Reactome topology plot multiOutput(mcmc.merge.list,dataset.names,select.pathway.list=sub_select.pathway.list, ACS_ADS_pathway, output=c("reactomeView"), ViewPairSelect = NULL,# will generate for all study pairs automatically reactome.species="CEL", Reactome.dataG2TopologyGtype=Seqname2Publicname_ce, # a data frame maping data gene names (1st col) to public names (2nd col) Reactome.pathID2name=NULL) #when not provided, reactome.db is used to retrieve pathway ID
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.