selectBinSize: Select optimal bin size based on Shimazaki formula
In RIPSeeker: RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments

Description Usage Arguments Details Value Author(s) References See Also Examples

The function iteratively estimates the cost of increasing bin size within a defined range and finally selects the bin size with minimum cost.

1 2	selectBinSize(alignGR, minBinSize, maxBinSize = 1000, increment = 5, getFullResults = FALSE)

`alignGR`	GRanges object of alignments on a single chromosome
`minBinSize`	Minimum bin size to start with (Default: 5 * read length)
`maxBinSize`	Maximum bin size to end with (Default: 1000).
`increment`	Number of bases to increment the bin size in searching for the optimal bin size within the defined range (Default: 5).
`getFullResults`	Binary indicator. If TRUE, the optimal bin size (with the minimum cost), minimum cost, and all of the bin sizes considered and their costs are returned. If FALSE, only the optimal bin size is returned.

Based on the preprocessed alignments for a chromosome, RIPSeeker divides the chromosome into bins of equal size b and compute the number of reads that b needs to be determined either empirically (e.g., based on the gel-selected length of the RNA fragment) or computationally. If the bin size is too small, the read counts fluctuates greatly, making it difficult to discern the underlying read count distribution. Additionally, input size to HMM increases as bin size decreases. A very small bin size results in a very long Markov chain of read counts to model, making the computation inefficient. On the other hand, if a bin size is too large, resolution becomes poor. Consequently, one cannot detect the local RIP region with subtle but intrinsic difference from the background, and the RIP regions tend to be too wide for designing specific primer for validation.

Intuitively, selecting an appropriate bin size for each chromosome is metaphorically equivalent to choosing an optimal intervals for building a histogram (Song, 2011). Here we implement the algorithm developed by Shimazaki and Shinomoto (2007), which is based on the goodness of the fit of the time histogram to estimate the rate of neural response of an animal to certain stimuli in a spike-in experiment. This approach has been successfully applied in a recently developed ChIP-seq program (Song and Smith, 2011). Algorithm 1 describes the pseudocode adapted from Shimazaki and Shinomoto (2007) that iteratively estimates the cost C of increasing bin size b within a defined range is outlined as follows.

For b = minBinSize to maxBinSize; do

1. Divide chromosome sequence into N bins of width b.

2. Count number of read counts x_i that enter the i'th bin.

3. Compute: \bar{x} = \frac{1}{N}∑_{i=1}^{N}x_{i} and v = \frac{1}{N}∑_{i=1}^{N}(x_{i} - \bar{x})^{2}.

4. Compute: C(b) = \frac{2\bar{x}-v}{b^{2}}

End For

Choose b that minimize C(b)

When getFullResults is TRUE, return a list containing:

`bestBinSize`	the optimal bin size (with the minimum cost)
`minCosts`	cost of the optimal bin size
`binSizes`	all of the bin sizes considered
`costs`	all of the costs

When getFullResults is FALSE, only the optimal bin size (bestBinSize) is returned.

Yue Li

Hideaki Shimazaki and Shigeru Shinomoto. A method for selecting the bin size of a time histogram. Neural computation, 19(6):1503-1527, June 2007.

Qiang Song and Andrew D. Smith. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England), 27(6):870-871, March 2011.

evalBinSize, binCount

# Retrieve system files
extdata.dir <- system.file("extdata", package="RIPSeeker") 

bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)

bamFiles <- grep("PRC2", bamFiles, value=TRUE)

alignGal <- getAlignGal(bamFiles[1], reverseComplement=TRUE, genomeBuild="mm9")

alignGR <- as(alignGal, "GRanges")

alignGRList <- GRangesList(as.list(split(alignGR, seqnames(alignGR))))

minBinSize <- 200

maxBinSize <- 1200

gr <- alignGRList$chrX

b <- selectBinSize(gr, minBinSize, maxBinSize, increment=100, getFullResults=TRUE)

plot(b$binSizes, b$costs)

chrname <- as.character(runValue(seqnames(gr)))

chrlen <- seqlengths(gr)[chrname]

legend("topright", box.lty=0,
		
		sprintf("%s: 1-%d;\nTotal mapped reads: %d;\nOptimal bin size = %d bp",
		
		chrname, chrlen, length(gr), b$bestBinSize))