knitr::opts_chunk$set( tidy = FALSE, cache = FALSE, dpi = 72, dev = "png", message = FALSE, error = FALSE, warning = TRUE )
correlationAnalyzeR
is the R interface to the Correlation AnalyzeR database
and web application. The web version can be accessed here.
This package is designed to allow greater customization and control over the functions in the web interface. This vignette will demonstrate each function using an example. Additional info can be found in the reference manual.
To speed up the analysis, it is useful to generate a TERM2GENE
object ahead of time.
The GSEA_Type
argument specifies which gene set databases to pull annotations from.
See the details of ?getTERM2GENE
to see the different options.
library(correlationAnalyzeR) TERM2GENE <- getTERM2GENE(GSEA_Type = c("GO:BP")) # GO Biological Process
correlationAnalyzeR
can be used to predict gene function using analyzeSingleGenes()
(the equivalent of Single Gene Mode in the web application). In this example, Tissue
and Sample_Type
arguments were set in order to limit the analysis to co-expression correlations in normal brain samples.
res <- analyzeSingleGenes(genesOfInterest = c("BRCA1"), Tissue = "brain", Sample_Type = "normal", TERM2GENE = TERM2GENE)
This runs most of the core tasks for predicting gene functionality using this analysis mode. This includes running "corGSEA", an implementation of GSEA developed in this package for use on genome-wide co-expression correlations.
The results are a list containing several items:
head(res$correlations)
head(res$`P values`)
head(res$`BRCA1, Brain - Normal`$GSEA$eres)
res$`BRCA1, Brain - Normal`$corrHist
res$`BRCA1, Brain - Normal`$GSEA$GSEA_up
res$`BRCA1, Brain - Normal`$GSEA$GSEA_down
correlationAnalyzeR
relies on pre-calculated datasets which are stored in a cloud database. However, it is also possible for users to generate predictions from their own datasets. To generate a correlation matrix you can supply a read count matrix to generateCorrelations()
Here is an example using the airway
dataset. We first wrangle the dataset into a raw read count matrix:
library(airway) library(EnsDb.Hsapiens.v86) library(dplyr) data(airway) cts <- assay(airway) ens2gene <- ensembldb::select(EnsDb.Hsapiens.v86, keys = rownames(cts), columns = c("SYMBOL"), keytype = "GENEID") %>% dplyr::distinct(SYMBOL, .keep_all = TRUE) %>% dplyr::inner_join(y = data.frame("GENEID" = rownames(cts))) cts <- cts[ens2gene$GENEID,] rownames(cts) <- ens2gene$SYMBOL
We then generate the correlation matrix with the generateCorrelations()
function:
corrMat <- generateCorrelations(cts)
Once the correlation matrix is generated, it can be used as the input to analyzeSingleGenes()
via the corrMat
argument with a corrMat_label
set (this is the custom label used during plotting functions).
res <- analyzeSingleGenes(genesOfInterest = c("BRCA1"), corrMat = corrMat, corrMat_label = "User-supplied DataSet", TERM2GENE = TERM2GENE)
Here is the correlation histogram produced with the custom dataset:
res$`BRCA1, User-supplied DataSet`$corrHist
It is important to note that user-supplied datasets should provide enough samples to ensure robust co-expression calculations. In the above example, it is clear that there are not enough samples within the airway
dataset to support this calculation. In our experience, it is necessary to have at least 30 samples in most cases.
crossCompareMode
allows a user to examine the correlations across multiple tissue and disease conditions. For example, to analyze the correlations of BRCA1 across all tissues, we could do the following:
res <- analyzeSingleGenes(genesOfInterest = c("BRCA1"), crossCompareMode = TRUE)
The output is a list containing several tables:
head(res$BRCA1$correlations, n = 3)
head(res$BRCA1$VST_DF)
The output list also contains several plots:
res$BRCA1$VST_boxPlot
res$BRCA1$heatmapSmallDataCo
.res$BRCA1$heatmapSmallCo
res$BRCA1$heatmapBigDataCo
.res$BRCA1$heatmapBigCo
res$BRCA1$heatmapSmallDataCo
.res$BRCA1$heatmapSmallVar
res$BRCA1$heatmapBigDataVar
.res$BRCA1$heatmapBigVar
correlationAnalyzeR
can be used to analyze differences between two genes using analyzeGenePairs()
(the equivalent of Gene vs Gene Mode in the web application).
res <- analyzeGenePairs(genesOfInterest = c("BRCA1", "BRCA2"), Tissue = "all", Sample_Type = "all", TERM2GENE = TERM2GENE)
The analyzeGenePairs()
function performs analyzeSingleGenes()
on both of the supplied genes and then compares the results, generating several tables and figures:
The correlation between the two genes is visualized using a scatter plot of their VST-transformed expression values.
By Disease:
res$compared$VST_corrPlot$corrPlot_disease
res$compared$VST_corrPlot$corrPlot_tissue
res$compared$correlations %>% arrange(desc(average)) %>% head()
head(res$compared$`P values`)
head(res$compared$correlatedPathwaysDataFrame)
head(res$compared$VST_Data)
res$compared$correlationPlot
res$compared$correlationPlotBin
res$compared$correlationVarianceHeatmap
res$compared$correlationSimilarityHeatmap
res$compared$pathwayVarianceHeatmap
res$compared$pathwaySimilarityHeatmap
In analyzeGenePairs()
, cross-compare mode allows the user to analyze the co-expression of two genes across all tissue-disease conditions (geneVsGene
) or one gene in cancer vs normal (normalVsCancer
).
When genesOfInterest
is supplied with two different genes and crossCompareMode=TRUE
, then geneVsGene
mode is executed.
res <- analyzeGenePairs(genesOfInterest = c("BRCA1", "BRCA2"), crossCompareMode = TRUE)
This analysis produces a list containing several figures and tables:
res$Correlations %>% arrange(desc(average)) %>% head(n=3)
head(res$`P values`, n=3)
ggpubr::ggarrange(res$crossCompareVST$VST_boxPlotOne, res$crossCompareVST$VST_boxPlotTwo, nrow = 2)
head(res$crossCompareVST$VST_DF)
res$pairResList$`Bone Cancer`$scatterPlot
res$pairResList$`Bone Cancer`$heatMap
When genesOfInterest
is supplied with only one gene, Tissue
includes Cancer
and Normal
, and crossCompareMode=TRUE
, then normalVsCancer
mode is executed.
res <- analyzeGenePairs(genesOfInterest = c("BRCA1", "BRCA1"), Tissue = c("Cancer", "Normal"), crossCompareMode = TRUE)
The primary difference betwen normalVsCancer
and geneVsGene
is that analyzeGenePairs
will output:
res$crossCompareVST$VST_boxPlot
res$pairResList$`Immune - Normal`$scatterPlot
res$pairResList$`Immune - Normal`$heatMap
This mode provides an empirical approach for determining whether a gene is significantly correlated with a list of genes. This is an alternative to the typical Pearson correlation p value which can only determine whether any two genes are significantly co-expressed. To run this mode, use the geneVsGeneListAnalyze()
function:
res <- geneVsGeneListAnalyze(pairedGenesList = list("BRCA1" = c("BRCA2", "EZH2", "CCND1", "SLC7A11", "GCLC", "CDKN1A")), Sample_Type = "cancer", Tissue = "bone")
This returns several plots and tables:
head(res$correlations)
head(res$`P values`)
res$BRCA1$Correlation_histogram
res$BRCA1$sigTest$tTest_pvalsPlot
res$BRCA1$sigTest$meansPlot res$BRCA1$sigTest$mediansPlot
data.frame(means = res$BRCA1$sigTest$means, medians = res$BRCA1$sigTest$medians, pvals = res$BRCA1$sigTest$tTest_pvals) %>% head()
Many methods for dimensionality reduction exist, but most are focused on sample-level comparisons and few methods for analyzing feature-space topology exist. In the final analysis mode, correlationAnalyzeR
uses gene co-expression correlation values as a metric for dimensionality reduction via PCA
and tSNE
with agglomerative clustering to determine the topology of a list of genes.
The analysis can be accessed using the analyzeGenesetTopology()
function from this package:
genesOfInterest <- c("CDK12", "AURKB", "SFPQ", "NFKB1", "BRCC3", "BRCA2", "PARP1", "EZH2", "CCND1", "SLC7A11", "GCLC", "CDKN1A", "MTAP", "DHX9", "SON", "AURKA", "SETX", "BRCA1", "ATMIN") res <- analyzeGenesetTopology(genesOfInterest = genesOfInterest, Sample_Type = "cancer", Tissue = "bone")
This produces several tables and figures:
head(res$Correlation_Data)
head(res$`P values`)
res$variantGenesHeatmap_Top
head(res$variantGenesHeatmap_Top_MAT)
res$variantGenesHeatmap
head(res$variantGenesHeatmap_MAT)
res$cocorrelativeGenesHeatmap
head(res$cocorrelativeGenesHeatmap_MAT)
res$PCA_plot
res$PCA_data
res$inputGenes_pathwayEnrich_dotplot
head(res$inputGenes_pathwayEnrich_data)
clusterProfiler
during pathway enrichment (this is compatible with the other functions in the clusterProfiler
package):res$inputGenes_pathwayEnrich
Unlike the web application version of Correlation AnalyzeR, the R package is capable of handling arbitrarily-large gene lists for analyzeGenesetTopology()
. One instance where one might wish to perform an analysis like this could be in parsing an existing gene set from curated sources like Gene Ontology.
To obtain the list of genes for this analysis, it is convenient to use the msigdbr
package in the following manner:
library(tidyverse) MDF <- msigdbr::msigdbr(category = "C2", subcategory = "CGP") geneList <- MDF %>% filter(gs_name == "RIGGI_EWING_SARCOMA_PROGENITOR_UP") %>% pull(gene_symbol)
We have now obtained a vector with the 434 genes in the "RIGGI_EWING_SARCOMA_PROGENITOR_UP" gene set from the Chemical and Genetic Perturbations (CGP) database in the "C2" collection of MSigDB. A link to the info page for this gene set can be found here. Now, we can use this list as the input for analyzeGenesetTopology()
. NOTE: when a gene in our gene list is not found in the correlation data, it will automatically be skipped.
res <- analyzeGenesetTopology(genesOfInterest = geneList, Sample_Type = "cancer", Tissue = "bone")
Because of the large number of genes supplied, a tSNE was calculated instead of PCA. This behavior can be prevented by setting the alternativeTSNE
parameter to FALSE
. The visualization is designed to allow easier cluster interpretation and does not include gene labels:
res$TSNE_plot
However, the underlying plot data is supplied as well:
head(res$TSNE_data)
Which means that, using plotly
, it is straightforward to create an interactive visualization that includes gene name information:
plt <- (res$TSNE_data %>% ggplot(aes(x = tsne1, y = tsne2, color = hclust, label = geneNames)) + geom_point()) %>% plotly::ggplotly()
sessionInfo()
Feel free to email Henry Miller (millerh1@uthscsa.edu) any time with questions, bug reports, or if you want to contribute!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.