Similarity between two ChIP-Seq profiles"

BiocStyle::markdown()
library(knitr)


Package: r Rpackage("similaRpeak")
Authors: r packageDescription("similaRpeak")[["Author"]]
Version: r packageDescription("similaRpeak")$Version
Compiled date: r Sys.Date()
License: r packageDescription("similaRpeak")[["License"]]

Metrics which estimate similarity between two ChIP-Seq profiles

Astrid Deschenes, Elsa Bernatchez, Charles Joly Beauparlant, Fabien Claude Lamaze, Rawane Samb, Pascal Belleau and Arnaud Droit.

This package and the underlying similaRpeak code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.

Introduction

The similaRpeak package calculates metrics to estimate the level of similarity between two ChIP-Seq profiles.

The metrics are:

alt text

Loading the similaRpeak package

library(similaRpeak)

Inputs

ChIP-Seq profiles vectors

ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. For a specific region, the read count of aligned sequences at each position of the region is used to generate the ChIP-Seq profile for the region.

To estimate the level of similarity between two ChIP-Seq profiles for a specific region, two vector containing the profiles values, as reported in reads per million (RPM) for each position of the selected region, are needed. Both vector should have the same length and should not contain any negative value.

Aligned sequences are usually stored in BAM files. As example, a slimmed BAM file (align1.bam) is selected as well as a specific region (chr1:172081530-172083529). Using BAM file and region information, represented by as a GRanges object, the coverage for the specified region is extracted using specialized Bioconductor packages.

suppressMessages(library(GenomicAlignments))
suppressMessages(library(rtracklayer))
suppressMessages(library(Rsamtools))

bamFile01 <- system.file("extdata/align1.bam", package = "similaRpeak")

region <- GRanges(Rle(c("chr1")), IRanges(start = 172081530, end = 172083529), 
                strand= Rle(strand(c("*"))))

param <- ScanBamParam(which = region)

alignments01 <- readGAlignments(bamFile01, param = param)

coveragesRegion01 <- coverage(alignments01)[region]
coveragesRegion01

The coverages01 can easily be transformed to a vector of numerical value to obtain the raw ChIP-Seq profile for the selected region.

coveragesRegion01 <- as.numeric(coveragesRegion01$chr1)
length(coveragesRegion01)
summary(coveragesRegion01)

To convert the raw coverage to reads per million (RPM), the total number of reads present in the BAM file is needed to assign a weight at the coverage function.

param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery=FALSE))
count01 <- countBam(bamFile01, param = param)$records
coveragesRPMRegion01 <- coverage(alignments01, weight=1000000/count01)[region]
coveragesRPMRegion01 <- as.numeric(coveragesRPMRegion01$chr1)
length(coveragesRPMRegion01)
summary(coveragesRPMRegion01)

The read per millions values are quite low for the coveragesRPMRegion01 because the original BAM file has been reduced in size to simplify the example.

Other examples are available on the worflows section of the Bioconductor website.

Finally, the metagene package , available on Bioconductor, can also be used to generate ChIP-Seq profiles. An example is available on metagene wiki.

Metrics

Metric versus Pseudometric

Mathematically, a metric is considered as a function that quantifies the similarity between two objects.The function must return zero when the two objects are perfectly similar (identity of indiscernibles) and a non-negative value when are dissimilar.

The metrics present in the similaRpeak package do not strictly respect this standard but can all be translated to pseudometrics. A pseudometric is a function d which satisfies the axioms for a metric, except that instead of the identity of indiscernibles axiom, the metric must only return zero when it compare an object to itself.

By using the absolute value of the DIFF_POS_MAX metric, the definition of a pseudometric is formally respected. However, the respective position of the maximum peak of profiles is lost.

$$ |DIFF_POS_MAX| $$

By using the absolute value of the logarithm of the RATIO_AREA, RATIO_MAX_MAX, RATIO_INTERSECT and RATIO_NORMALIZED_INTERSECT metrics, the definition of a pseudometric is formally respected.

$$ |\log(RATIO)| $$


Metrics Presentation

To ease comparison, the same ChIP-Seq profiles are used in each metric description section. Those are ChIP-Seq profiles of two histone post-transcriptional modifications linked to highly active enhancers H3K27ac (DCC accession: ENCFF000ASG) and H3K4me1 (DCC accession: ENCFF000ARY) from the Encyclopedia of DNA Elements (ENCODE) data (Dunham I et al. 2012).

Here is how to load the demoProfiles data used in following sections. The ChIP-Seq profiles of the enhancers H3K27ac and H3K4me1 for 4 specifics regions are in reads per million (RPM).

data(demoProfiles)
str(demoProfiles)


RATIO_AREA

The RATIO_AREA metric is the ratio between the profile areas. The first profile (profile1 parameter) is always divided by the second profile (profile2 parameter). NA is returned if minimal area threshold (ratioAreaThreshold parameter, default = 1) is not respected for at least one of the profiles.

The RATIO_AREA metric can be useful to detect regions with similar coverage even if the profiles are different.

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(250, 17, "RATIO_AREA=1.03", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(375, 1000, "RATIO_AREA=0.06", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(285, 80, "RATIO_AREA=2.23", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 225, "RATIO_AREA=0.12", cex=1.5, col="red")

 

**similarity(profile1, profile2, ratioAreaThreshold = 1)**

metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_AREA

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                        demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_AREA

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                        demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_AREA

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                        demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_AREA


DIFF_POS_MAX

The DIFF_POS_MAX metric is the difference between the maximal peaks positions. The difference is always the first profile value (profile1 parameter) minus the second profile value (profile2 parameter). NA is returned if minimal peak value is not respected. A profile can have more than one position with the maximum value. In that case, the median position is used. A threshold (diffPosMaxTolerance parameter) can be set to consider all positions within a certain range of the maximum value. A threshold (diffPosMaxThresholdMaxDiff parameter) can also be set to ensure that the distance between two maximum values is not too wide. When this distance is not respected, it is assumed that more than one peak is present in the profile and NA is returned. Finally, a threshold (diffPosMaxThresholdMinValue parameter) can be set to ensure that the maximum value is egal or superior to a minimal value. When this minimum value is not respected, it is assumed that no peak is present in the profile and NA is returned.

The DIFF_POS_MAX metric can be useful to detect regions with shifted peaks.

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(250, 17, "DIFF_POS_MAX=-20", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(375, 1000, "DIFF_POS_MAX=-0.5", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(285, 80, "DIFF_POS_MAX=2.5", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 225, "DIFF_POS_MAX=2.5", cex=1.5, col="red")

**similarity(profile1, profile2, diffPosMaxTolerance = 0.01, diffPosMaxThresholdMaxDiff = 100, diffPosMaxThresholdMinValue = 1)**

metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$DIFF_POS_MAX

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                        demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$DIFF_POS_MAX

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                        demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$DIFF_POS_MAX

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                        demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$DIFF_POS_MAX


RATIO_MAX_MAX

The RATIO_MAX_MAX metric is the ratio between the peaks values in each profile. The first profile (profile1 parameter) is always divided by the second profile (profile2 parameter). NA if minimal peak value threshold (ratioMaxMaxThreshold parameter, default = 1) is not respected for at least one of the profiles.

The RATIO_MAX_MAX metric can be useful to detect regions with peaks with similar (or dissimilar) amplitude.

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(250, 17, "RATIO_MAX_MAX=0.95", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(380, 1000, "RATIO_MAX_MAX=0.06", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(290, 80, "RATIO_MAX_MAX=2.5", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(280, 225, "RATIO_MAX_MAX=0.12", cex=1.5, col="red")

 

**similarity(profile1, profile2, ratioMaxMaxThreshold = 1)**

metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_MAX_MAX

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                        demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_MAX_MAX

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                        demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_MAX_MAX

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                        demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_MAX_MAX


RATIO_INTERSECT

The RATIO_INTERSECT metric is the ratio between the intersection area and the total area of the two profiles. NA if minimal area threshold (ratioIntersectThreshold parameter, default = 1) is not respected for the intersection area.

The RATIO_INTERSECT metric can be useful to detect regions with possibily similar profiles.

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(242, 17, "RATIO_INTERSECT=0.63", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(375, 1000, "RATIO_INTERSECT=0.06", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(288, 87, "RATIO_INTERSECT=", cex=1.5, col="red")
text(288, 75, "0.43", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 225, "RATIO_INTERSECT=0.12", cex=1.5, col="red")

 

**similarity(profile1, profile2, ratioIntersectThreshold = 1)**

metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                      demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_INTERSECT

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                      demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_INTERSECT

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                      demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_INTERSECT

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                      demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_INTERSECT


RATIO_NORMALIZED_INTERSECT

The RATIO_NORMALIZED_INTERSECT metric is the ratio between the intersection area and the total area of the two normalized profiles. The profiles are normalized by divinding them by their average value (total area of the profile divided by profile lenght). NA if minimal area threshold (ratioNormalizedIntersectThreshold parameter, default = 1) is not respected for the intersection area.

The RATIO_NORMALIZED_INTERSECT metric can be useful to detect regions with possibily similar profiles even if their have different amplitude.

par(mar = c(6,4,2,2))
par(mfrow = c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac*
        length(demoProfiles$chr2.70360770.70361098$H3K27ac)/
        sum(demoProfiles$chr2.70360770.70361098$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 3),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1*
        length(demoProfiles$chr2.70360770.70361098$H3K4me1)/
        sum(demoProfiles$chr2.70360770.70361098$H3K4me1, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 3))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(245, 2, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(245, 1.75, "INTERSECT=0.63", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac*
        length(demoProfiles$chr8.43092918.43093442$H3K27ac)/
        sum(demoProfiles$chr8.43092918.43093442$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 12),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1*
        length(demoProfiles$chr8.43092918.43093442$H3K4me1)/
        sum(demoProfiles$chr8.43092918.43093442$H3K4me1, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 12))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(370, 8, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(370, 7, "INTERSECT=0.89", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac*
        length(demoProfiles$chr3.73159773.73160145$H3K27ac)/
        sum(demoProfiles$chr3.73159773.73160145$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 8),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1*
        length(demoProfiles$chr3.73159773.73160145$H3K4me)/
        sum(demoProfiles$chr3.73159773.73160145$H3K4me, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 8))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(288, 5.5, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(288, 4, "INTERSECT=0.78", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac*
         length(demoProfiles$chr19.27739373.27739767$H3K27ac)/
         sum(demoProfiles$chr19.27739373.27739767$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 12),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1*
        length(demoProfiles$chr19.27739373.27739767$H3K4me1)/
        sum(demoProfiles$chr19.27739373.27739767$H3K4me1, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 12))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 7, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(275, 6, "INTERSECT=0.84", cex=1.5, col="red")

 

**similarity(profile1, profile2, ratioNormalizedIntersectThreshold = 1)**

metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                      demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                      demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                      demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                      demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT


SPEARMAN_CORRELATION

The SPEARMAN_CORRELATION metric is the Spearman's rho statistic calculated usign the two profiles. NA if minimal standard deviation (spearmanCorrSDThreashold parameter, default = 1e-8) is not respected for at least one of the profiles.

The SPEARMAN_CORRELATION assesses how well the relationship between the two ChIP-Seq profiles can be described using a monotonic function. As the RATIO_NORMALIZED_INTERSECT metric, the SPEARMAN_CORRELATION metric can be useful to detect regions with possibily similar profiles even if their have different amplitude.

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(225, 17, "SPEARMAN_CORRELATION", cex=1.5, col="red")
text(225, 14, "=0.06", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(325, 1000, "SPEARMAN_CORRELATION", cex=1.5, col="red")
text(325, 850, "=0.82", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(288, 92, "SPEARMAN_", cex=1.5, col="red")
text(288, 78, "CORRELATION=0.95", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(270, 225, "SPEARMAN_CORRELATION", cex=1.5, col="red")
text(270, 192, "=0.62", cex=1.5, col="red")

 

**similarity(profile1, profile2, spearmanCorrSDThreashold = 1e-8)**

metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                      demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                      demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                      demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                      demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION


Using similaRpeak on real ChIP-Seq profiles

Highly active enhancer regions are thought to be important for the cell fate (Andersson et al. 2014, FANTOM5 consortium, Hnisz et al. 2013). Highly active enhancers regions have been selected in GM12878 cells. Similarity of ChIP-seq profiles has been tested using two histone post-transcriptional modifications linked to highly active enhancers H3K27ac (DCC accession: ENCFF000ASG) and H3K4me1 (DCC accession: ENCFF000ARY) from the Encyclopedia of DNA Elements (ENCODE) data (Dunham I et al. 2012). Accordingly with the literature, similarity between the profiles of these two histone marks has been identified.

First, the similaRpeak package must be loaded.

suppressMessages(library(similaRpeak))

A region, chr7:61968807-61969730, shows interesting profiles for both histones. Let's load the data for this region.

data(chr7Profiles)
str(chr7Profiles)

H3K27ac and H3K4me1 profiles have those shapes:

plot(chr7Profiles$chr7.61968807.61969730$H3K27ac, type="l", col="blue", 
        xlab="", ylab="", ylim=c(0, 700), main="chr7:61968807-61969730")
par(new=TRUE)
plot(chr7Profiles$chr7.61968807.61969730$H3K4me1, type="l", col="darkgreen", 
        xlab="Position", ylab="Coverage in reads per million (RPM)", 
        ylim=c(0, 700))
legend("topleft", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)

The metrics are calculated using the similarity function which takes as arguments the two ChIP-Seq profiles vectors and the threshold values.

metrics <- similarity(chr7Profiles$chr7.61968807.61969730$H3K27ac, 
                            chr7Profiles$chr7.61968807.61969730$H3K4me1, 
                            ratioAreaThreshold=5, 
                            ratioMaxMaxThreshold=2, 
                            ratioIntersectThreshold=5, 
                            ratioNormalizedIntersectThreshold=2,
                            diffPosMaxThresholdMinValue=10, 
                            diffPosMaxThresholdMaxDiff=100, 
                            diffPosMaxTolerance=0.01)

The similarity function returns a list which contains the general information about both ChIP-Seq profiles and a list of all calculated metrics.

metrics

Each specific information can be directly accessed. Some examples:

metrics$areaProfile1
metrics$areaProfile2
metrics$metrics$RATIO_INTERSECT

The RATIO_INTERSECT value of r round(metrics$metrics$RATIO_INTERSECT, 2) and the RATIO_MAX_MAX value of r round(metrics$metrics$RATIO_MAX_MAX, 2) are quite low. Both values can be explained by the large difference in coverage between profiles. Those values could be interpreted as two profiles with low level of similarity. However, the RATIO_NORMALIZED_INTERSECT of r round(metrics$metrics$RATIO_NORMALIZED_INTERSECT, 2) is much closer to 1. It could be a sign that the profiles, once normalized, are quite similar. This hypothesis can be validated by looking at a graph of the normalized profiles :

plot(chr7Profiles$chr7.61968807.61969730$H3K27ac*
        length(chr7Profiles$chr7.61968807.61969730$H3K27ac)/
        sum(chr7Profiles$chr7.61968807.61969730$H3K27ac, na.rm=TRUE), 
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 3.5))
par(new=TRUE)
plot(chr7Profiles$chr7.61968807.61969730$H3K4me1*
        length(chr7Profiles$chr7.61968807.61969730$H3K4me1)/
        sum(chr7Profiles$chr7.61968807.61969730$H3K4me1, na.rm=TRUE), 
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)", 
        ylim=c(0, 3.5))
legend("topleft", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)

Metrics calculation using a MetricFactory object

It is possible to create only one selected metric by using the MetricFactory object (with the possibility of specifying the thresholds) and by passing the name of the metric to create (RATIO_AREA, DIFF_POS_MAX, RATIO_MAX_MAX, RATIO_INTERSECT or RATIO_NORMALIZED_INTERSECT):

factory = MetricFactory$new(diffPosMaxTolerance=0.04)

The factory has to be iniatized only once and can be used as many times as necessary. It can calculate the same metrics but with different profiles or different metrics with same profiles as long as the thresholds values remain the same:

ratio_max_max <- factory$createMetric(metricType="RATIO_MAX_MAX", 
                        profile1=demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        profile2=demoProfiles$chr2.70360770.70361098$H3K4me1)

ratio_max_max

ratio_normalized_intersect <- factory$createMetric(
                        metricType="RATIO_NORMALIZED_INTERSECT",
                        profile1=demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        profile2=demoProfiles$chr2.70360770.70361098$H3K4me1)

ratio_normalized_intersect

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References

Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, et al. (2014) An atlas of active enhancers across human cell types and tissues. Nature, 507(7493), 455-461.

Dunham I, Kundaje A, Aldred SF, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74.

Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, et al. (2014) A promoter-level mammalian expression atlas. Nature, 507(7493):462-470.

Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-André V, et al. (2013) Super-enhancers in the control of cell identity and disease. Cell, 155(4), 934-947.



Try the similaRpeak package in your browser

Any scripts or data that you put into this service are public.

similaRpeak documentation built on Nov. 1, 2018, 3:46 a.m.