title: "decorate: Differential Epigenetic Correlation Test"
author: "Developed by Gabriel Hoffman"
date: "Run on r Sys.Date()
"
documentclass: article
output:
html_document:
toc: true
smart: false
vignette: >
%\VignetteIndexEntry{decorate: Differential Epigenetic Correlation Test}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
Standard analysis compares the differences in magnitude in epigenetic signal between two subsets of a dataset. The decorate workflow evaluates the local correlation structure between nearby epigenetic features and identify clusters of features that are differential correlation between two subsets of a dataset. Epigenetic datasets where decorate can be applied include ChIP-seq, ATAC-seq and DNA methylation.
decorate vr packageVersion("decorate")
\item 0) Data normalization and covariate correction
+ We assume this has already been perform using standard methods for your assay + decorate should be run on residuals after removing the effects of confounding variables
\item 1) Compute correlations between local pairs of features
+ `runOrderedClusteringGenome()`
\item 2) Perform hierarchical clustering of features
+ built into `runOrderedClusteringGenome()`
\item 3) Produce discrete clusters of features
+ `createClusters():` can take multiple parameter values to return multiple sets of clusters at different resolutions
\item 4) Filter clusters
- Based on strength of correlation + `scoreClusters():` evaluate the strength of the correlation structure of each cluster + `retainClusters():` identify clusters that pass a cutoff for strength of the correlation structure. By default filters by LEF, the lead eigen-value fraction, which is the fraction of variance explained by the first eigen-value. Higher LEF values indicate a stronger correlation structure. + `filterClusters():` apply filter to clusters based on this cutoff - Identify overlapping clusters and drop if redundant + `collapseClusters():` Collapse redundant clusters identified from using multiple parameter values in step (3). Redundancy of two clusters is evaluated using the Jaccard index, the fraction of epigenetic features shared between two clusters.
\item 5) Statistical test of differential correlation
+ `evalDiffCorr():` implements 16 tests of differential correlation with different properties. See [here](https://hoffmg01.u.hpc.mssm.edu/software/decorate/simulations.html) for comparison of methods. I recommend 'Box.permute', 'deltaSLE', 'sLED' to retain power while controlling the false positive rate + merge all results into one data.frame with `combineResults()`
\item 6) Data visualization
+ plotDecorate():
plot local correlation structure, clusters of features and genes in the region
+ plotCompareCorr():
plot correlation structure for two subsets of the data
+ plotScatterPairs()
: scatter plot for each pair of features for two subsets of the data
suppressPackageStartupMessages(library(knitr)) options(xtable.type="html") knitr::opts_chunk$set( echo=FALSE, warning=FALSE, message=TRUE, error = FALSE, tidy = FALSE, cache = TRUE, cache.lazy = FALSE, dev = c("png"), fig.width=7, fig.height=7) options(markdown.HTML.stylesheet = 'css/custom.css')
library(decorate) library(GenomicRanges) library(cowplot) library(limma) # load data data('decorateData') # Compute residuals with respect to variable of interested plus confounders design = model.matrix(~ Disease, metadata) fit = lmFit(simData, design) residValues = residuals( fit, simData) # Evaluate hierarchical clustering # adjacentCount is the number of adjacent peaks considered in correlation # use Spearman correlation to reduce the effects of outliers treeList = runOrderedClusteringGenome( residValues, simLocation, adjacentCount=500, method.corr="spearman" ) # Choose cutoffs and return clusters using multiple values for meanClusterSize # Clusters corresponding to each parameter value are returned # and then processed downstream # By using multiple parameter values, epigenetic features are included in clusters # at different resolutions treeListClusters = createClusters( treeList, method = "meanClusterSize", meanClusterSize=c(20, 30, 40, 50) ) # Evaluate strength of correlation for each cluster clstScore = scoreClusters(treeList, treeListClusters ) # Filter to retain only strong clusters # If lead eigen value fraction (LEF) > 30% then keep clusters # LEF is the fraction of variance explained by the first eigen-value clustInclude = retainClusters( clstScore, "LEF", 0.15 ) # get retained clusters treeListClusters_filter = filterClusters( treeListClusters, clustInclude) # collapse redundant clusters treeListClusters_collapse = collapseClusters( treeListClusters_filter, simLocation, jaccardCutoff=0.9) # Plot correlations and clusters in region defind by query # get single entry giving range of the region query = range(simLocation)
Number of cluster parameters: r length(treeListClusters)
meanClusterSize = c(20, 30, 40, 50)
Number of total clusters: r countClusters(treeListClusters)
treeListClusters
Number of clusters after filtering: r countClusters(treeListClusters_filter)
treeListClusters_filter
Number of clusters after collapsing: r countClusters(treeListClusters_collapse)
treeListClusters_collapse
dfDist = evaluateCorrDecay( treeList, simLocation, verbose=FALSE) plotCorrDecay( dfDist, outlierQuantile=1e-5 )
# load gene locations # this is ENSEMBL v86 from the hg38 assembly # but other versions and assemblies are available library(EnsDb.Hsapiens.v86)
if( is.loaded("EnsDb.Hsapiens.v86")){ detach('package:EnsDb.Hsapiens.v86', unload=TRUE) } library(EnsDb.Hsapiens.v86) ensdb = EnsDb.Hsapiens.v86
fig1 = plotDecorate( ensdb, treeList, treeListClusters, simLocation, query) fig2 = plotDecorate( ensdb, treeList, treeListClusters_filter, simLocation, query) plot_grid( fig1, fig2, ncol=2, labels=c('A: All clusters', 'B: After filtering') )
fig = plotDecorate( ensdb, treeList, treeListClusters_collapse, simLocation, query) plot_grid( fig, ncol=1, labels=c('C: After collapsing clusters'), hjust=0 )
Evaluate treeListClusters_collapse
, the filtered collapsed set of clusters
library(BiocParallel) register(SnowParam(4, progressbar=TRUE)) # Evaluate Differential Correlation between two subsets of data # Use Spearman correlation to reduce the effect of outliers resDiffCorr = evalDiffCorr( residValues, metadata$Disease, simLocation, treeListClusters_collapse, method='deltaSLE', method.corr="spearman") # get summary of results df = summary( resDiffCorr ) # print results head(df)
head(df)
The id column gives the clustering parameter from runOrderedClusteringGenome()
, so cluster 6 with id 10 is different from cluster 6 with id 50.
Combine results to merge properties of each cluster into a single data.frame
df_results = combineResults( resDiffCorr, clstScore, treeListClusters, simLocation) head(df_results)
Note that permutations are only used for test 'sLED'
, so is set to zero for all others.
# Histogram of LEF ggplot(df_results, aes(LEF, fill=id)) + geom_histogram(alpha=0.7) + theme_bw(17) + xlim(0,1) + theme(aspect.ratio=1, legend.position="bottom", plot.title = element_text(hjust = 0.5)) + scale_fill_discrete(name = "Requested mean cluster size") + xlab("Lead eigenvalue fraction (LEF)") + ggtitle("Summarize LEF") # Histogram of mean absolute correlation ggplot(df_results, aes(mean_abs_corr, fill=id)) + geom_histogram(alpha=0.7) + theme_bw(17) + xlim(0,1) + theme(aspect.ratio=1, legend.position="bottom", plot.title = element_text(hjust = 0.5)) + scale_fill_discrete(name = "Requested mean cluster size") + xlab("Mean absolute correlation") + ggtitle("Summarize absolute correlation") # Boxplot of number of features per cluster ggplot(df_results, aes(id, N, fill=id)) + geom_boxplot() + theme_bw(17) + theme(aspect.ratio=1, legend.position="bottom", plot.title = element_text(hjust = 0.5)) + scale_fill_discrete(name = "Feature per cluster") + xlab("Requested mean cluster size") + ylab("Number of features") + ggtitle("Summarize feature per cluster") + coord_flip()
# extract feature identifiers from most significant cluster peakIDs = getFeaturesInCluster( treeListClusters_collapse, df_results$chrom[1], df_results$cluster[1], df_results$id[1]) query = range(simLocation[names(simLocation) %in% peakIDs]) locText = with( data.frame(query), paste0(seqnames, ':', format(start, big.mark=','), '-', format(end, big.mark=','))) # plot comparison of correlation matrices for peaks in peakIDs # where data is subset by metadata$Disease main = paste0(df$chrom[1], ': cluster ', df$cluster[1], '\n', locText) plotCompareCorr( residValues, peakIDs, metadata$Disease) + ggtitle(main)
First, plot correlation structure for controls (metadata\$Disease==0). Then, plot correlation structure for cases (metadata\$Disease==1).
The cluster located at r locText
has a p-value of r format(df_results$pValue[1], digits=4)
and a test statistic of r format(df_results$stat[1], digits=3)
.
# get location of peaks in this cluster query = range(simLocation[names(simLocation) %in% peakIDs]) # expand window to include adjacent clusters window = 2000 start(query) = start(query) - window end(query) = end(query) + window fig1 = plotDecorate( ensdb, treeList, treeListClusters_collapse, simLocation, query, data=residValues[,metadata$Disease==0]) fig2 = plotDecorate( ensdb, treeList, treeListClusters_collapse, simLocation, query, data=residValues[,metadata$Disease==1]) plot_grid( fig1, fig2, ncol=2, labels=c('A: Contols (i.e. Disease==0)', 'B: Cases (i.e. Disease==1)') )
For all pairs of features in the significant cluster, make a scatterplot for cases and controls seperately.
plotScatterPairs( residValues, peakIDs, metadata$Disease) + ggtitle(main)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.