```{css, echo=FALSE} pre code, pre, code { white-space: pre !important; overflow-x: auto !important; overflow-y: auto !important; word-break: keep-all !important; word-wrap: normal !important; max-height: 300px; }
```r knitr::opts_chunk$set( collapse = TRUE, width=300, comment = "#>" )
meta-analysis of differential expression analyses (metaDEA) is a simple package that provide easy functions to integrate the results of previously computed differential expression analyses to find out the genes that are more consistently differentially expressed across different comparisons and their statistics. The power of this package is that it allows to perform meta-analysis on highly heterogeneneous transcriptomic data. Classical co-expression analyses is sensitive to outliers and require high-quality homogenous data, especially in terms of platform. To overcome this limitation we developed a simpler method that uses previously computed differential expression analyses. This allows to apply the suitable method for each specific platform (e.g. moderated t-statistics for microarray and negative binomial for RNA-sequencing data) and thereafter, integrating the results by simply analyzing the already computed log2 fold changes and adjusted p-values.
metaDEA can be installed from github
. You need to have the devtools
package installed. Also you will need the igraph
package and the plot3D
package as depenencies for the plotting function.
if ("devtools" %in% installed.packages()){ library(devtools) } else { install.packages("devtools") library(devtools) } if ("metaDEA" %in% installed.packages()){ library(metaDEA) } else{ install_github("Ilarius/metaDEA") library(metaDEA) }
The create_dataset
dataset function joins all the results of the differential expression analyses (DEAs) that you want to integrate in a single list.
In the example we load two different dataframes with the differential expression analyses performed on Down syndrome samples. The first (ipsc_deseq2
) is the results of a DEA performed with DESeq2
on an mRNA-profiling of induced pluripotent stem cells derived from fibroblasts of monozygotic twins discordant for trisomy 21; the second (thymus_limma
) is the results of a DEA peformed with limma
on microarray data of thymic samples from patient with and without Down syndrome.
options(width=300) data("ipsc_deseq2") head(ipsc_deseq2)
data("thymus_limma") head(thymus_limma[, c(8, 11:15)])
We then create the dataset by specifying the name(s) of the columns containing respectively the identifiers (GeneName
), the adjusted p-values (adj.P.Val
and padj
) and the log2 fold changes (logFC
and log2FoldChange
). Please note that in ìpsc_deseq2
there is no column for the identifiers (gene names are in the rownames) and therefore you do not need to specify anything in the col_name
argument. Also notice that the identifiers should be of the same type. Therefore, for the thymus_limma
dataset ensemble identifiers have been previously retrieved from the probe names with an annotation package.
list_array_2studies=create_dataset(thymus_limma, ipsc_deseq2,col_names="GeneName", col_adjpvals=c("adj.P.Val", "padj"), col_log2FC=c("logFC", "log2FoldChange"))
This returns a list of 3 lists, one called names
with the gene identifiers of the two comparisons (each one is a sublist named with the names of the DEAs), one called adjpval
with the adjused p-values of the comparisons and one called log2FC
with the log2 fold changes. The elements of each sublist are ordered in the same way like that there is a correspondance among names, adjusted p-values, and fold changes for each gene.
str(list_array_2studies)
Researchers might want to filter the comparisons included in the dataset based on fold changes and/or p-values to focus on the genes significantly changing in one condition compared to another. Here we upload a dataset containing the results of 124 comparisons between trisomic samples versus euploid samples that had been previously filter using a low threshold (adjusted p-value<0.1). These comparisons belong to both mouse and human samples and include different cells, tissues, and platform (both sequencing and microarrays). We use the function subset_metanalysis
to further filter each comparison keeping only the genes with adjusted p-value < 0.05 and absolute fold change >1.5. Since this filter could still be too low for some of the comparisons we also set a third threshold to keep only the first 500 genes in case there are more than 500 DE genes.
#load data data(list_array) # create a list of lists were only the first 500 most significant genes with adjusted p-value < 0.05 and fold change >1.5 or < -1.5 are included list_array.05_fc1.5_max500 <- subset_metanalysis(dataset=list_array, adjpval = 0.05, abslog2FC = log2(1.5), max_n_genes = 500 )
By applying these thresholds we end up with r length(list_array.05_fc1.5_max500$names)
comparisons.
Once we have our filtered dataset we might be interested in how specific genes are dysregulated across the different comparisons. We coded a simple function to retrieve this information by querying specific genes. The function find_fold_changes_and_pvalues
returns a dataframe with log2 fold changes and adjusted p-values for each one of the comparisons of the different studies. For this example we query the DYRK1A gene and find its deregulation (mainly upregulation) in 9 comparisons, as expected since this genes is coded on the triplicated chromosome and is one of the most studied for its pathogenetic role in several hallmarks of the syndrome.
#retrieve log2FCs and adjusted-p-values for the DYRK1A gene (ensemblID="ENSG00000157540") find_fold_changes_and_pvalues(gene="ENSG00000157540",dataset=list_array.05_fc1.5_max500)
We can also retrieve summary statistics of all the genes in the dataset to detect the genes that are most consistently deregulated across all comparisons. In order to rank genes we can consider in how many comparisons a given gene is found differentially expressed (occurences) or we can analyze the fold changes calculating for each gene their mean, standard deviation and median log2 fold change across the different comparisons. We also computed an original statistics that we called "pseudo t-score" corresponding to the ratio of the mean log2FC over the standard deviation of the log2FCs divided by the square root of the number of comparisons. This statistics will therefore be negative for overall downregulated genes and positive for overall upregulated genes and its absolute value will be higher for those genes with high and consistent changes in several comparisons, and lower for inconsistent and variable fold changes (e.g. upregulated in some datasets and downregulated in others) that are found only in a few comparisons.
In this example we focus on the top 50 genes and save computation time.
gene_stats=find_gene_statistics(dataset=list_array.05_fc1.5_max500, top=50) gene_stats
We found that the gene most consistently differentially expressed was SOD1 (ENSG00000142168) found in r max(gene_stats$occurences)
comparisons, while the gene with the highest fold change was the dynein axonemal heavy chain 11 (DNAH11, ENSG00000105877) with a log2 fold change of r max(gene_stats$mean_log2FC)
. However, the gene with the highest pseudo-t-score was the VPS26 endosomal protein sorting factor C (ENSG00000157538). Interestingly, all these genes (SOD1, DNAH11, and VPS26C) were coded on the triplicated chromosomes. However, DYRK1A (ENSG00000157540) was only found at position r which(gene_stats$names == "ENSG00000157540")
.
One of the advantages of our approach is that it allows to detect genes that are consistently differentially (co-DE genes) expressed together. By analysing our Down syndrome transcriptomic dataset we found that co-DE genes tend to: (1) be involved in the same gene or (2) disease ontology or in (3) physical and functional interactions reported in STRINGdb or (4) in the same genomic region.
Co-differential expression analysis has not to be confused with co-expression analysis. Co-expression between two genes means that gene levels across sample correlate, while in our case it means that the two genes are both differentially expressed in different comparisons.
Coming back to our example with DYRK1A by using the find_coDE
function we found that this gene was found co-DE with 2798 other genes. The most frequent co-DE interactions occurred with five genes that were found co-DE together with DYRK1A in 4 comparisons: CSF2RB (ENSG00000100368), PFKL (ENSG00000141959), USP16 (ENSG00000156256), PAXBP1 (ENSG00000159086), and BRWD1 (ENSG00000185658). Notably, PFKL, USP16, PAXBP1, and BRWD1 are all HSA21 genes.
#find co-DE genes for the DYRK1A gene (ensemblID="ENSG00000157540") code_dyrk1a=find_coDE(gene="ENSG00000157540",dataset=list_array.05_fc1.5_max500) dim(code_dyrk1a)
head(code_dyrk1a)
Let's move forward and find the network of co-DE interactions among the top 50 genes whose statistics we previously retrieved in this vignette. You can whether pass to the function the whole dataset with the option top=50
or pass the previously computed dataframes: one with the statistics and one with the co-differential expression results. We already computed the statistics for the most frequent 50 genes. We then pass these genes as input to the find_coDE
function to create our co-DE dataframe. When find_coDE
receives a vector of genes instead of a single gene it only retrieves co-DE interactions between the genes given as input. Thereafter, we pass both the dataframe with statistics and the one with co-DE interactions to the plot_gene_coDE_network
. This plot a graph where each node is a gene whose size is proportional to the number of occurrences, and each edge is a co-DE interaction with thickness proportional to the number of comparisons in which a given interaction was found.
We first plot the graph coloring the nodes with the mean fold change across samples. We set node.names
to FALSE
since our indentifiers are ensembl gene ID and therefore useless for the reader.
set.seed(1) #to obtain always the same plot #create the co-DE matrix for the most frequent 50 genes code_matrix=find_coDE(gene = gene_stats$names, dataset=list_array.05_fc1.5_max500) #plot the network set.seed(1) network=plot_gene_coDE_newtork(code_matrix=code_matrix, gene_stats_matrix=gene_stats,node_color="mean", node.names=FALSE)
As you can see only one gene is downregulated, corresponding to intergrin beta 8 which is crucial for neurogenesis and neurovascular homeostasis regulation may affect Down syndrome neuropathology.
The function also return the graph as an object for further analysis. Summary statistics of the graph shows that there are r igraph::ecount(network)
co-DE interactions between the 50 nodes.
summary(network)
We can also plot the same network this time coloring the nodes using the pseudo-t-score. Please note the usage of set.seed(1)
before each usage of the plotting function in order to have the nodes plotted in the same place.
set.seed(1) network=plot_gene_coDE_newtork(code_matrix=code_matrix, gene_stats_matrix=gene_stats,node_color="pseudotscore", node.names=FALSE)
You can see that some genes that had a relatively modest fold change now have become redder, meaning that their changes where consistent across the comparisons, while other genes have become pinkier, suggesting that their high mean log2 fold changes had been pushed by outliers.
If you wants to know more about the Down syndrome dataset used in this vignette please visit this web application:
https://ilariodetoma.shinyapps.io/shiny_meta-analysis/
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.