calculateCrossCorrelation: Calculate the cross correlation for a given GRanges object.

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

This method calculates the cross correlation, i.e. the Pearson correlation between the coverages of the positive and negative strand from a DNA sequencing experiment. The cross correlation can be used as a quality measure in ChIP-seq experiments (Kharchenko et al. 2008). Cross correlation can also be used to estimate the fragment size by determining the shift (given in base pairs) that maximizes the cross correlation.

Usage

1
calculateCrossCorrelation(object, shift=c(200,250,300), bin=10, mode="none", minReads=10000, chrs=NA, mc.cores=1)

Arguments

object

An GRanges object containing the aligned reads.

shift

The number of bases that the negative strand is shifted towards its three prime end. This can be a vector, if the correlation should be calculated for different shifts.

bin

If bin is larger than one, the coverage is calculated for bins of size bin and not for each single base. This speeds up calculations and might be beneficial in cases of low coverage. Note that shifting is performed after binning, so that the shift(s) should be a multiple of bin (otherwise, shift is rounded to the nearest multiple of bin).

mode

mode defines how bases (or bins) without reads are handled. both means that only bases covered on both strands are included when calculating the correlation. one means that the base has to be covered on at least one strand and none mean that all bases are included independent of their coverage.

minReads

If not at least minReads are mapped to a chromosome, the chromosome is omitted.

chrs

A character vector with the chromosomes that should be included into the calculation. NA means all chromosomes.

mc.cores

Number of cores to be used.

Details

Only 5 prime start positions of reads are used for calculating the coverage. Therefore, after removing duplicates in a single end sequencing experiment, the coverage can not be larger than one, if the bin size is set to one. (In this setting, mode both is meaningless.) If bin is larger than one, the coverage within a bin is aggregated. Then, the correlation is calculated for each shift. A shift (given in basepairs) should be multiple of the bin size (given in basepairs, too). If not, the binnend coverage is shifted by round(shift/bin) elements.

The different modes define whether regions without coverage or with only one covered strand should used. The original implementation in the package "spp" does not make use of regions without coverage. However, this seems to be a loss of information, since no coverage has also a biological meaning in a ChIP-seq experiment. If the fragment size is approximately 500bp, setting shift=seq(200, 800, 10), bin=10 and mode="none" should be a good setting.

After the cross correlation was calculated for each chromosome, the weighted mean correlation across all chromosomes is calculated. The weight for a specific chromosome equals the fraction of all reads that were aligned to that chromosome.

Value

A numeric vector with the cross correlation for each shift. The names of the vector correspond to the shifts.

Author(s)

Hans-Ulrich Klein (hklein@broadinstitute.org)

References

Kharchenko PV, Tolstorukov MY and Park PJ. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008, 26(12):1351-9

Landt SG et al., ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22(9):1813-31

See Also

GRanges-class

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
triangularKernel <- function(x, pos, h)  {
  res <- 1 - (abs(x - pos) / h)
  res[res < 0] <- 0
  return(res)
}
covPos <- round(triangularKernel(1:100, 60, 50) * 100)
covNeg <- round(triangularKernel(1:100, 65, 50) * 100)

reads <- GRanges(IRanges(start=c(rep(seq_along(covPos), covPos), rep(seq_along(covNeg), covNeg) - 9),
                         width=10),
                         strand=c(rep("+", sum(covPos)), rep("-", sum(covNeg))),
                         seqnames=rep("1", sum(covPos)+sum(covNeg)))

calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="none", minReads=1)
cor(covPos, covNeg)
cor(covPos[1:(length(covPos)-10)], covNeg[11:length(covNeg)])

calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="one", minReads=1)
cor(covPos[covPos != 0 | covNeg != 0], covNeg[covPos != 0 | covNeg != 0])

calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="both", minReads=1)
cor(covPos[covPos != 0 & covNeg != 0], covNeg[covPos != 0 & covNeg != 0])


covPos2 <- round(triangularKernel(1:100, 60, 50) * 50)
covNeg2 <- round(triangularKernel(1:100, 68, 50) * 50)
reads2 <- GRanges(IRanges(start=c(rep(seq_along(covPos2), covPos2), rep(seq_along(covNeg2), covNeg2) - 9),
                          width=10),
                          strand=c(rep("+", sum(covPos2)), rep("-", sum(covNeg2))),
                          seqnames=rep("2", sum(covPos2)+sum(covNeg2)))
seqlevels(reads2) <- c("1", "2")
allReads <- c(reads, reads2)

calculateCrossCorrelation(allReads, shift=5, minReads=1, bin=1, mode="none")
cor1 <- cor(covPos[1:(length(covPos)-5)], covNeg[6:length(covNeg)])
cor2 <- cor(covPos2[1:(length(covPos2)-5)], covNeg2[6:length(covNeg2)])
cor1 * (sum(c(covPos, covNeg))/length(allReads)) +
  cor2 * (sum(c(covPos2, covNeg2))/length(allReads))

epigenomix documentation built on Nov. 8, 2020, 5:24 p.m.