selectBinSize: Select optimal bin size based on Shimazaki formula

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

The function iteratively estimates the cost of increasing bin size within a defined range and finally selects the bin size with minimum cost.

Usage

1
2
selectBinSize(alignGR, minBinSize, maxBinSize = 1000, 
	increment = 5, getFullResults = FALSE)

Arguments

alignGR

GRanges object of alignments on a single chromosome

minBinSize

Minimum bin size to start with (Default: 5 * read length)

maxBinSize

Maximum bin size to end with (Default: 1000).

increment

Number of bases to increment the bin size in searching for the optimal bin size within the defined range (Default: 5).

getFullResults

Binary indicator. If TRUE, the optimal bin size (with the minimum cost), minimum cost, and all of the bin sizes considered and their costs are returned. If FALSE, only the optimal bin size is returned.

Details

Based on the preprocessed alignments for a chromosome, RIPSeeker divides the chromosome into bins of equal size b and compute the number of reads that b needs to be determined either empirically (e.g., based on the gel-selected length of the RNA fragment) or computationally. If the bin size is too small, the read counts fluctuates greatly, making it difficult to discern the underlying read count distribution. Additionally, input size to HMM increases as bin size decreases. A very small bin size results in a very long Markov chain of read counts to model, making the computation inefficient. On the other hand, if a bin size is too large, resolution becomes poor. Consequently, one cannot detect the local RIP region with subtle but intrinsic difference from the background, and the RIP regions tend to be too wide for designing specific primer for validation.

Intuitively, selecting an appropriate bin size for each chromosome is metaphorically equivalent to choosing an optimal intervals for building a histogram (Song, 2011). Here we implement the algorithm developed by Shimazaki and Shinomoto (2007), which is based on the goodness of the fit of the time histogram to estimate the rate of neural response of an animal to certain stimuli in a spike-in experiment. This approach has been successfully applied in a recently developed ChIP-seq program (Song and Smith, 2011). Algorithm 1 describes the pseudocode adapted from Shimazaki and Shinomoto (2007) that iteratively estimates the cost C of increasing bin size b within a defined range is outlined as follows.

For b = minBinSize to maxBinSize; do

1. Divide chromosome sequence into N bins of width b.

2. Count number of read counts x_i that enter the i'th bin.

3. Compute: \bar{x} = \frac{1}{N}∑_{i=1}^{N}x_{i} and v = \frac{1}{N}∑_{i=1}^{N}(x_{i} - \bar{x})^{2}.

4. Compute: C(b) = \frac{2\bar{x}-v}{b^{2}}

End For

Choose b that minimize C(b)

Value

When getFullResults is TRUE, return a list containing:

bestBinSize

the optimal bin size (with the minimum cost)

minCosts

cost of the optimal bin size

binSizes

all of the bin sizes considered

costs

all of the costs

When getFullResults is FALSE, only the optimal bin size (bestBinSize) is returned.

Author(s)

Yue Li

References

Hideaki Shimazaki and Shigeru Shinomoto. A method for selecting the bin size of a time histogram. Neural computation, 19(6):1503-1527, June 2007.

Qiang Song and Andrew D. Smith. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England), 27(6):870-871, March 2011.

See Also

evalBinSize, binCount

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# Retrieve system files
extdata.dir <- system.file("extdata", package="RIPSeeker") 

bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)

bamFiles <- grep("PRC2", bamFiles, value=TRUE)

alignGal <- getAlignGal(bamFiles[1], reverseComplement=TRUE, genomeBuild="mm9")

alignGR <- as(alignGal, "GRanges")

alignGRList <- GRangesList(as.list(split(alignGR, seqnames(alignGR))))

minBinSize <- 200

maxBinSize <- 1200

gr <- alignGRList$chrX

b <- selectBinSize(gr, minBinSize, maxBinSize, increment=100, getFullResults=TRUE)

plot(b$binSizes, b$costs)

chrname <- as.character(runValue(seqnames(gr)))

chrlen <- seqlengths(gr)[chrname]

legend("topright", box.lty=0,
		
		sprintf("%s: 1-%d;\nTotal mapped reads: %d;\nOptimal bin size = %d bp",
		
		chrname, chrlen, length(gr), b$bestBinSize))	

gorillayue/RIPSeeker documentation built on May 17, 2019, 7:59 a.m.