title: "decorate: Differential Epigenetic Correlation Test" author: "Developed by Gabriel Hoffman" date: "Run on r Sys.Date()" documentclass: article output: html_document: toc: true smart: false vignette: > %\VignetteIndexEntry{decorate: Differential Epigenetic Correlation Test} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc}


Standard analysis compares the differences in magnitude in epigenetic signal between two subsets of a dataset. The decorate workflow evaluates the local correlation structure between nearby epigenetic features and identify clusters of features that are differential correlation between two subsets of a dataset. Epigenetic datasets where decorate can be applied include ChIP-seq, ATAC-seq and DNA methylation.

decorate vr packageVersion("decorate")

Workflow

\item 0) Data normalization and covariate correction

+ We assume this has already been perform using standard methods for your assay
+ decorate should be run on residuals after removing the effects of confounding variables

\item 1) Compute correlations between local pairs of features

+ `runOrderedClusteringGenome()`

\item 2) Perform hierarchical clustering of features

+ built into `runOrderedClusteringGenome()`

\item 3) Produce discrete clusters of features

+ `createClusters():` can take multiple parameter values to return multiple sets of clusters at different resolutions

\item 4) Filter clusters

- Based on strength of correlation
  + `scoreClusters():` evaluate the strength of the correlation structure of each cluster
  + `retainClusters():` identify clusters that pass a cutoff for strength of the correlation structure.  By default filters by LEF, the lead eigen-value fraction, which is the fraction of variance explained by the first eigen-value.  Higher LEF values indicate a stronger correlation structure.
  + `filterClusters():` apply filter to clusters based on this cutoff

- Identify overlapping clusters and drop if redundant
  + `collapseClusters():` Collapse redundant clusters identified from using multiple parameter values in step (3).  Redundancy of two clusters is evaluated using the Jaccard index, the fraction of epigenetic features shared between two clusters.

\item 5) Statistical test of differential correlation

+ `evalDiffCorr():` implements 16 tests of differential correlation with different properties.  See [here](https://hoffmg01.u.hpc.mssm.edu/software/decorate/simulations.html) for comparison of methods.  I recommend 'Box.permute', 'deltaSLE', 'sLED' to retain power while controlling the false positive rate
+ merge all results into one data.frame with `combineResults()`

\item 6) Data visualization + plotDecorate(): plot local correlation structure, clusters of features and genes in the region + plotCompareCorr(): plot correlation structure for two subsets of the data + plotScatterPairs(): scatter plot for each pair of features for two subsets of the data

suppressPackageStartupMessages(library(knitr))

options(xtable.type="html")

knitr::opts_chunk$set(
  echo=FALSE,
  warning=FALSE,
  message=TRUE,
  error = FALSE,
  tidy = FALSE,
  cache = TRUE,
  cache.lazy = FALSE,
  dev = c("png"), 
  fig.width=7, fig.height=7)

options(markdown.HTML.stylesheet = 'css/custom.css')

Analysis of simulated data

library(decorate)
library(GenomicRanges)
library(cowplot)     
library(limma)     

# load data
data('decorateData')

# Compute residuals with respect to variable of interested plus confounders
design = model.matrix(~ Disease, metadata)
fit = lmFit(simData, design)
residValues = residuals( fit, simData)

# Evaluate hierarchical clustering
# adjacentCount is the number of adjacent peaks considered in correlation
# use Spearman correlation to reduce the effects of outliers
treeList = runOrderedClusteringGenome( residValues, simLocation, adjacentCount=500, method.corr="spearman" )

# Choose cutoffs and return clusters using multiple values for meanClusterSize 
# Clusters corresponding to each parameter value are returned 
# and then processed downstream
# By using multiple parameter values, epigenetic features are included in clusters 
# at different resolutions
treeListClusters = createClusters( treeList, method = "meanClusterSize", meanClusterSize=c(20, 30, 40, 50) )

# Evaluate strength of correlation for each cluster
clstScore = scoreClusters(treeList, treeListClusters )

# Filter to retain only strong clusters
# If lead eigen value fraction (LEF) > 30% then keep clusters
# LEF is the fraction of variance explained by the first eigen-value
clustInclude = retainClusters( clstScore, "LEF", 0.15 )

# get retained clusters
treeListClusters_filter = filterClusters( treeListClusters, clustInclude)

# collapse redundant clusters
treeListClusters_collapse = collapseClusters( treeListClusters_filter, simLocation, jaccardCutoff=0.9)

# Plot correlations and clusters in region defind by query
# get single entry giving range of the region
query = range(simLocation)  

Summary of clustering

Number of cluster parameters: r length(treeListClusters)

Number of total clusters: r countClusters(treeListClusters)

Number of clusters after filtering: r countClusters(treeListClusters_filter)

Number of clusters after collapsing: r countClusters(treeListClusters_collapse)

Make plots

Evaluate correlation structure versus distance

dfDist = evaluateCorrDecay( treeList, simLocation, verbose=FALSE)

plotCorrDecay( dfDist, outlierQuantile=1e-5 )

Plot correlation structure along genome

Load gene database

# load gene locations
# this is ENSEMBL v86 from the hg38 assembly
# but other versions and assemblies are available
library(EnsDb.Hsapiens.v86)
if( is.loaded("EnsDb.Hsapiens.v86")){
  detach('package:EnsDb.Hsapiens.v86', unload=TRUE)
}
library(EnsDb.Hsapiens.v86)      
ensdb = EnsDb.Hsapiens.v86
fig1 = plotDecorate( ensdb, treeList, treeListClusters, simLocation, query)
fig2 = plotDecorate( ensdb, treeList, treeListClusters_filter, simLocation, query)
plot_grid( fig1, fig2, ncol=2, labels=c('A: All clusters', 'B: After filtering') )

Plot after filtering and collapsing redundant clusters

fig = plotDecorate( ensdb, treeList, treeListClusters_collapse, simLocation, query)
plot_grid( fig, ncol=1, labels=c('C: After collapsing clusters'), hjust=0 )

Run test of differential correlation using sLED

Evaluate treeListClusters_collapse, the filtered collapsed set of clusters

library(BiocParallel)
register(SnowParam(4, progressbar=TRUE))

# Evaluate Differential Correlation between two subsets of data
# Use Spearman correlation to reduce the effect of outliers
resDiffCorr = evalDiffCorr( residValues, metadata$Disease, simLocation, 
  treeListClusters_collapse, method='deltaSLE',  method.corr="spearman")

# get summary of results
df = summary( resDiffCorr )

# print results
head(df)
head(df)    

The id column gives the clustering parameter from runOrderedClusteringGenome(), so cluster 6 with id 10 is different from cluster 6 with id 50. Combine results to merge properties of each cluster into a single data.frame

df_results = combineResults( resDiffCorr, clstScore, treeListClusters, simLocation)
head(df_results)

Note that permutations are only used for test 'sLED', so is set to zero for all others.

Summary of cluster properties

# Histogram of LEF
ggplot(df_results, aes(LEF, fill=id)) + geom_histogram(alpha=0.7) + 
  theme_bw(17) + xlim(0,1) + 
  theme(aspect.ratio=1, legend.position="bottom", 
    plot.title = element_text(hjust = 0.5)) + 
    scale_fill_discrete(name = "Requested mean cluster size") + 
    xlab("Lead eigenvalue fraction (LEF)") + ggtitle("Summarize LEF")

# Histogram of mean absolute correlation
ggplot(df_results, aes(mean_abs_corr, fill=id)) + geom_histogram(alpha=0.7) + theme_bw(17) + xlim(0,1) + 
  theme(aspect.ratio=1, legend.position="bottom", 
    plot.title = element_text(hjust = 0.5)) + 
    scale_fill_discrete(name = "Requested mean cluster size") + 
    xlab("Mean absolute correlation") + 
    ggtitle("Summarize absolute correlation")

# Boxplot of number of features per cluster
ggplot(df_results, aes(id, N, fill=id)) + geom_boxplot() + 
  theme_bw(17) + 
  theme(aspect.ratio=1, legend.position="bottom", 
    plot.title = element_text(hjust = 0.5)) + 
  scale_fill_discrete(name = "Feature per cluster") + 
  xlab("Requested mean cluster size") + 
  ylab("Number of features") + 
  ggtitle("Summarize feature per cluster") + 
  coord_flip()

Compare correlation structure in two subsets of the data

# extract feature identifiers from most significant cluster
peakIDs = getFeaturesInCluster( treeListClusters_collapse, df_results$chrom[1], df_results$cluster[1], df_results$id[1])

query = range(simLocation[names(simLocation) %in% peakIDs])

locText = with( data.frame(query), paste0(seqnames, ':', format(start, big.mark=','), '-', format(end, big.mark=',')))

# plot comparison of correlation matrices for peaks in peakIDs
#  where data is subset by metadata$Disease
main = paste0(df$chrom[1], ': cluster ', df$cluster[1], '\n', locText)   
plotCompareCorr( residValues, peakIDs, metadata$Disease) + ggtitle(main)

Compare correlation structure along genome

First, plot correlation structure for controls (metadata\$Disease==0). Then, plot correlation structure for cases (metadata\$Disease==1).

The cluster located at r locText has a p-value of r format(df_results$pValue[1], digits=4) and a test statistic of r format(df_results$stat[1], digits=3).

# get location of peaks in this cluster  
query = range(simLocation[names(simLocation) %in% peakIDs])

# expand window to include adjacent clusters
window = 2000 
start(query) = start(query) - window
end(query) = end(query) + window

fig1 = plotDecorate( ensdb, treeList, treeListClusters_collapse, simLocation, query, data=residValues[,metadata$Disease==0])

fig2 = plotDecorate( ensdb, treeList, treeListClusters_collapse, simLocation, query, data=residValues[,metadata$Disease==1])
plot_grid( fig1, fig2, ncol=2, labels=c('A: Contols (i.e. Disease==0)', 'B: Cases (i.e. Disease==1)') )

Pairwise scatter plots

For all pairs of features in the significant cluster, make a scatterplot for cases and controls seperately.

plotScatterPairs( residValues, peakIDs, metadata$Disease) + ggtitle(main)


GabrielHoffman/decorate documentation built on May 23, 2023, 1:29 a.m.