knitr::opts_chunk$set(
  tidy = FALSE,
  cache = FALSE,
  dpi = 72,
  dev = "png",
  message = FALSE, error = FALSE, warning = TRUE
)

correlationAnalyzeR is the R interface to the Correlation AnalyzeR database and web application. The web version can be accessed here.

This package is designed to allow greater customization and control over the functions in the web interface. This vignette will demonstrate each function using an example. Additional info can be found in the reference manual.

Analyze Single Genes

To speed up the analysis, it is useful to generate a TERM2GENE object ahead of time. The GSEA_Type argument specifies which gene set databases to pull annotations from. See the details of ?getTERM2GENE to see the different options.

library(correlationAnalyzeR)
TERM2GENE <- getTERM2GENE(GSEA_Type = c("GO:BP"))  # GO Biological Process

Basic Analysis

correlationAnalyzeR can be used to predict gene function using analyzeSingleGenes() (the equivalent of Single Gene Mode in the web application). In this example, Tissue and Sample_Type arguments were set in order to limit the analysis to co-expression correlations in normal brain samples.

res <- analyzeSingleGenes(genesOfInterest = c("BRCA1"), 
                          Tissue = "brain", Sample_Type = "normal",
                          TERM2GENE = TERM2GENE)

This runs most of the core tasks for predicting gene functionality using this analysis mode. This includes running "corGSEA", an implementation of GSEA developed in this package for use on genome-wide co-expression correlations.

The results are a list containing several items:

Tables

  1. The genome-wide correlations (Pearson's R) for BRCA1:
head(res$correlations)
  1. The associated correlation P values:
head(res$`P values`)
  1. The table of corGSEA results:
head(res$`BRCA1, Brain - Normal`$GSEA$eres)

Figures

  1. A histogram showing the genome-wide correlation value (R) distribution:
res$`BRCA1, Brain - Normal`$corrHist
  1. The top increasing corGSEA hits:
res$`BRCA1, Brain - Normal`$GSEA$GSEA_up
  1. The top decreasing corGSEA hits:
res$`BRCA1, Brain - Normal`$GSEA$GSEA_down

Supplying a custom dataset

correlationAnalyzeR relies on pre-calculated datasets which are stored in a cloud database. However, it is also possible for users to generate predictions from their own datasets. To generate a correlation matrix you can supply a read count matrix to generateCorrelations()

Here is an example using the airway dataset. We first wrangle the dataset into a raw read count matrix:

library(airway)
library(EnsDb.Hsapiens.v86)
library(dplyr)

data(airway)
cts <- assay(airway)
ens2gene <- ensembldb::select(EnsDb.Hsapiens.v86, keys = rownames(cts),
                              columns = c("SYMBOL"), keytype = "GENEID") %>%
  dplyr::distinct(SYMBOL, .keep_all = TRUE) %>%
  dplyr::inner_join(y = data.frame("GENEID" = rownames(cts)))
cts <- cts[ens2gene$GENEID,]
rownames(cts) <- ens2gene$SYMBOL

We then generate the correlation matrix with the generateCorrelations() function:

corrMat <- generateCorrelations(cts)

Once the correlation matrix is generated, it can be used as the input to analyzeSingleGenes() via the corrMat argument with a corrMat_label set (this is the custom label used during plotting functions).

res <- analyzeSingleGenes(genesOfInterest = c("BRCA1"), corrMat = corrMat,
                          corrMat_label = "User-supplied DataSet",
                          TERM2GENE = TERM2GENE)

Here is the correlation histogram produced with the custom dataset:

res$`BRCA1, User-supplied DataSet`$corrHist

It is important to note that user-supplied datasets should provide enough samples to ensure robust co-expression calculations. In the above example, it is clear that there are not enough samples within the airway dataset to support this calculation. In our experience, it is necessary to have at least 30 samples in most cases.

Cross-compare mode

crossCompareMode allows a user to examine the correlations across multiple tissue and disease conditions. For example, to analyze the correlations of BRCA1 across all tissues, we could do the following:

res <- analyzeSingleGenes(genesOfInterest = c("BRCA1"), crossCompareMode = TRUE)

The output is a list containing several tables:

  1. The co-expression correlations for BRCA1 across all tissues:
head(res$BRCA1$correlations, n = 3)
  1. All the VST-transformed counts for BRCA1 across all samples:
head(res$BRCA1$VST_DF)

The output list also contains several plots:

  1. A box plot comparing cancer and normal samples by VST
res$BRCA1$VST_boxPlot
  1. A heatmap of the top 30 co-correlated genes with BRCA1 (genes which show similar co-expression correlations to BRCA1). The accompanying values for this plot are in res$BRCA1$heatmapSmallDataCo.
res$BRCA1$heatmapSmallCo
  1. A heatmap of the top 200 co-correlated genes with BRCA1. The accompanying values for this plot are in res$BRCA1$heatmapBigDataCo.
res$BRCA1$heatmapBigCo
  1. A heatmap of the top 30 variably-correlated genes with BRCA1 (genes which show divergent co-expression correlations compared to BRCA1). The accompanying values for this plot are in res$BRCA1$heatmapSmallDataCo.
res$BRCA1$heatmapSmallVar
  1. A heatmap of the top 200 variably-correlated genes with BRCA1. The accompanying values for this plot are in res$BRCA1$heatmapBigDataVar.
res$BRCA1$heatmapBigVar

Analyze Gene Pairs

Basic Analysis

correlationAnalyzeR can be used to analyze differences between two genes using analyzeGenePairs() (the equivalent of Gene vs Gene Mode in the web application).

res <- analyzeGenePairs(genesOfInterest = c("BRCA1", "BRCA2"),
                        Tissue = "all", Sample_Type = "all",
                        TERM2GENE = TERM2GENE)

The analyzeGenePairs() function performs analyzeSingleGenes() on both of the supplied genes and then compares the results, generating several tables and figures:

  1. The correlation between the two genes is visualized using a scatter plot of their VST-transformed expression values.

  2. By Disease:

res$compared$VST_corrPlot$corrPlot_disease
res$compared$VST_corrPlot$corrPlot_tissue
  1. The gene co-expression correlations with the average Pearson R and variance included. Note that, with only two data points, variance is just 2x the squared deviation from the mean.
res$compared$correlations %>%
       arrange(desc(average)) %>%
  head()
  1. The p values of the correlation calculation.
head(res$compared$`P values`)
  1. The combined results of corGSEA for BRCA1 and BRCA2
head(res$compared$correlatedPathwaysDataFrame)
  1. The VST-transformed counts for BRCA1 and BRCA2 across all tissues
head(res$compared$VST_Data)
  1. A scatter plot comparing the genome-wide co-expression correlations for BRCA1 and BRCA2.
res$compared$correlationPlot
  1. The same plot, binned to reduce the computational requirements for plotting:
res$compared$correlationPlotBin
  1. A heatmap showing the genes with the top variance between BRCA1 and BRCA2 by co-expression correlation. This is simply a measure of the absolute difference between them.
res$compared$correlationVarianceHeatmap
  1. A heatmap showing the genes with the top similarity in co-expression correlation between BRCA1 and BRCA2.
res$compared$correlationSimilarityHeatmap
  1. A heatmap showing the pathways with the top variance between BRCA1 and BRCA2 by corGSEA score.
res$compared$pathwayVarianceHeatmap
  1. A heatmap showing the pathways with the top similarity in corGSEA score between BRCA1 and BRCA2.
res$compared$pathwaySimilarityHeatmap

Cross-compare mode

In analyzeGenePairs(), cross-compare mode allows the user to analyze the co-expression of two genes across all tissue-disease conditions (geneVsGene) or one gene in cancer vs normal (normalVsCancer).

Gene vs Gene

When genesOfInterest is supplied with two different genes and crossCompareMode=TRUE, then geneVsGene mode is executed.

res <- analyzeGenePairs(genesOfInterest = c("BRCA1", "BRCA2"), 
                        crossCompareMode = TRUE)

This analysis produces a list containing several figures and tables:

  1. The co-expression correlation results within each tissue-disease condition, along with the average co-expression values and variance for each gene.
res$Correlations %>%
       arrange(desc(average)) %>%
  head(n=3)
  1. The correlation p values for each condition.
head(res$`P values`, n=3)
  1. The VST box plots for each gene across conditions:
ggpubr::ggarrange(res$crossCompareVST$VST_boxPlotOne, res$crossCompareVST$VST_boxPlotTwo,
                  nrow = 2)
  1. The data which the expression boxplots are based upon:
head(res$crossCompareVST$VST_DF)
  1. For each tissue-disease condition, a scatter plot of the genome-wide co-expression correlations between BRCA1 and BRCA2.
res$pairResList$`Bone Cancer`$scatterPlot
  1. For each tissue-disease condition, a heatmap showing the top variable genes between BRCA1 and BRCA2.
res$pairResList$`Bone Cancer`$heatMap

Normal vs Cancer

When genesOfInterest is supplied with only one gene, Tissue includes Cancer and Normal, and crossCompareMode=TRUE, then normalVsCancer mode is executed.

res <- analyzeGenePairs(genesOfInterest = c("BRCA1", "BRCA1"), 
                        Tissue = c("Cancer", "Normal"),
                        crossCompareMode = TRUE)

The primary difference betwen normalVsCancer and geneVsGene is that analyzeGenePairs will output:

  1. A comparative boxplot showing the difference between cancer and normal conditions across tissues with respect to BRCA1 expression:
res$crossCompareVST$VST_boxPlot
  1. A list of scatter plots in which the genome-wide co-expression correlations for BRCA1 are compared between cancer and normal conditions:
res$pairResList$`Immune - Normal`$scatterPlot
  1. And the list also contains heatmaps showing the top variable genes with respect to BRCA1 gene co-expression between cancer and normal:
res$pairResList$`Immune - Normal`$heatMap

Gene vs Gene List Analysis

This mode provides an empirical approach for determining whether a gene is significantly correlated with a list of genes. This is an alternative to the typical Pearson correlation p value which can only determine whether any two genes are significantly co-expressed. To run this mode, use the geneVsGeneListAnalyze() function:

res <- geneVsGeneListAnalyze(pairedGenesList = list("BRCA1" = c("BRCA2", "EZH2", "CCND1",
                                                         "SLC7A11", "GCLC", "CDKN1A")),
                              Sample_Type = "cancer",
                              Tissue = "bone")

This returns several plots and tables:

  1. It returns to genome-wide correlations for BRCA1
head(res$correlations)
  1. Along with the p values corresponding to these Pearson correlations.
head(res$`P values`)
  1. It returns a histogram showing the genome-wide co-expression correlations for BRCA1 along with the secondary gene list annotated on top:
res$BRCA1$Correlation_histogram
  1. A plot showing the distribution of p values from bootstrapping with the observed p values for the specified gene list. NOTE that the summit represents the point with the highest density from the empirical distribution, not necessarily a specific observation.
res$BRCA1$sigTest$tTest_pvalsPlot
  1. A plot show the empirical distribution of bootstrapped mean and median correlation values with the observed correlation for the specified gene list shown, along with a p value that indicates significance. NOTE that this uses a simplistic approach to finding significance (anything > .95 is significant) and the method used above is preferred.
res$BRCA1$sigTest$meansPlot
res$BRCA1$sigTest$mediansPlot
  1. The data accompanying each plot:
data.frame(means = res$BRCA1$sigTest$means, 
           medians = res$BRCA1$sigTest$medians,
           pvals = res$BRCA1$sigTest$tTest_pvals) %>% head()

Gene List Topology Analysis

Many methods for dimensionality reduction exist, but most are focused on sample-level comparisons and few methods for analyzing feature-space topology exist. In the final analysis mode, correlationAnalyzeR uses gene co-expression correlation values as a metric for dimensionality reduction via PCA and tSNE with agglomerative clustering to determine the topology of a list of genes.

The analysis can be accessed using the analyzeGenesetTopology() function from this package:

genesOfInterest <- c("CDK12", "AURKB", "SFPQ", "NFKB1", "BRCC3", "BRCA2", "PARP1",
                     "EZH2", "CCND1", "SLC7A11", "GCLC", "CDKN1A", "MTAP",
                     "DHX9", "SON", "AURKA", "SETX", "BRCA1", "ATMIN")
res <- analyzeGenesetTopology(genesOfInterest = genesOfInterest,
                              Sample_Type = "cancer", Tissue = "bone")

This produces several tables and figures:

  1. The co-expression correlations for each gene in the supplied gene list:
head(res$Correlation_Data)
  1. The p values corresponding to these co-expression correlations:
head(res$`P values`)
  1. A heatmap of the top 50 variant genes across the gene list by co-expression values:
res$variantGenesHeatmap_Top
  1. The data matrix accompanying this heatmap:
head(res$variantGenesHeatmap_Top_MAT)
  1. A heatmap of the top 500 variant genes across the gene list by co-expression values:
res$variantGenesHeatmap
  1. The data matrix accompanying this heatmap:
head(res$variantGenesHeatmap_MAT)
  1. The same as (5), but with co-correlative genes (genes which are similarly co-correlated across each gene in the supplied list) instead of variant genes.
res$cocorrelativeGenesHeatmap
  1. And the corresponding data matrix:
head(res$cocorrelativeGenesHeatmap_MAT)
  1. The PCA plot showing the gene list members projected in PC1 and PC2, labeled, and colored by hierarchical cluster membership:
res$PCA_plot
  1. The data frame corresponding to (9):
res$PCA_data
  1. The pathway enrichment of the input gene list displayed as a dotplot:
res$inputGenes_pathwayEnrich_dotplot
  1. The pathway enrichment results in a data frame:
head(res$inputGenes_pathwayEnrich_data)
  1. The object generated by clusterProfiler during pathway enrichment (this is compatible with the other functions in the clusterProfiler package):
res$inputGenes_pathwayEnrich

Enriching with large gene lists

Unlike the web application version of Correlation AnalyzeR, the R package is capable of handling arbitrarily-large gene lists for analyzeGenesetTopology(). One instance where one might wish to perform an analysis like this could be in parsing an existing gene set from curated sources like Gene Ontology.

To obtain the list of genes for this analysis, it is convenient to use the msigdbr package in the following manner:

library(tidyverse)
MDF <- msigdbr::msigdbr(category = "C2", subcategory = "CGP")
geneList <- MDF %>%
  filter(gs_name == "RIGGI_EWING_SARCOMA_PROGENITOR_UP") %>%
  pull(gene_symbol)

We have now obtained a vector with the 434 genes in the "RIGGI_EWING_SARCOMA_PROGENITOR_UP" gene set from the Chemical and Genetic Perturbations (CGP) database in the "C2" collection of MSigDB. A link to the info page for this gene set can be found here. Now, we can use this list as the input for analyzeGenesetTopology(). NOTE: when a gene in our gene list is not found in the correlation data, it will automatically be skipped.

res <- analyzeGenesetTopology(genesOfInterest = geneList, 
                              Sample_Type = "cancer",
                              Tissue = "bone")

Because of the large number of genes supplied, a tSNE was calculated instead of PCA. This behavior can be prevented by setting the alternativeTSNE parameter to FALSE. The visualization is designed to allow easier cluster interpretation and does not include gene labels:

res$TSNE_plot

However, the underlying plot data is supplied as well:

head(res$TSNE_data)

Which means that, using plotly, it is straightforward to create an interactive visualization that includes gene name information:

plt <- (res$TSNE_data %>%
  ggplot(aes(x = tsne1, y = tsne2, color = hclust, label = geneNames)) +
  geom_point()) %>%
  plotly::ggplotly()

Session info

sessionInfo()

Questions

Feel free to email Henry Miller (millerh1@uthscsa.edu) any time with questions, bug reports, or if you want to contribute!



Bishop-Laboratory/correlationAnalyzeR documentation built on June 28, 2022, 8:31 p.m.