# selectBinSize: Select optimal bin size based on Shimazaki formula In RIPSeeker: RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments

## Description

The function iteratively estimates the cost of increasing bin size within a defined range and finally selects the bin size with minimum cost.

## Usage

 1 2 selectBinSize(alignGR, minBinSize, maxBinSize = 1000, increment = 5, getFullResults = FALSE) 

## Arguments

 alignGR GRanges object of alignments on a single chromosome minBinSize Minimum bin size to start with (Default: 5 * read length) maxBinSize Maximum bin size to end with (Default: 1000). increment Number of bases to increment the bin size in searching for the optimal bin size within the defined range (Default: 5). getFullResults Binary indicator. If TRUE, the optimal bin size (with the minimum cost), minimum cost, and all of the bin sizes considered and their costs are returned. If FALSE, only the optimal bin size is returned.

## Details

Based on the preprocessed alignments for a chromosome, RIPSeeker divides the chromosome into bins of equal size b and compute the number of reads that b needs to be determined either empirically (e.g., based on the gel-selected length of the RNA fragment) or computationally. If the bin size is too small, the read counts fluctuates greatly, making it difficult to discern the underlying read count distribution. Additionally, input size to HMM increases as bin size decreases. A very small bin size results in a very long Markov chain of read counts to model, making the computation inefficient. On the other hand, if a bin size is too large, resolution becomes poor. Consequently, one cannot detect the local RIP region with subtle but intrinsic difference from the background, and the RIP regions tend to be too wide for designing specific primer for validation.

Intuitively, selecting an appropriate bin size for each chromosome is metaphorically equivalent to choosing an optimal intervals for building a histogram (Song, 2011). Here we implement the algorithm developed by Shimazaki and Shinomoto (2007), which is based on the goodness of the fit of the time histogram to estimate the rate of neural response of an animal to certain stimuli in a spike-in experiment. This approach has been successfully applied in a recently developed ChIP-seq program (Song and Smith, 2011). Algorithm 1 describes the pseudocode adapted from Shimazaki and Shinomoto (2007) that iteratively estimates the cost C of increasing bin size b within a defined range is outlined as follows.

For b = minBinSize to maxBinSize; do

1. Divide chromosome sequence into N bins of width b.

2. Count number of read counts x_i that enter the i'th bin.

3. Compute: \bar{x} = \frac{1}{N}∑_{i=1}^{N}x_{i} and v = \frac{1}{N}∑_{i=1}^{N}(x_{i} - \bar{x})^{2}.

4. Compute: C(b) = \frac{2\bar{x}-v}{b^{2}}

End For

Choose b that minimize C(b)

## Value

When getFullResults is TRUE, return a list containing:

 bestBinSize the optimal bin size (with the minimum cost) minCosts cost of the optimal bin size binSizes all of the bin sizes considered costs all of the costs

When getFullResults is FALSE, only the optimal bin size (bestBinSize) is returned.

Yue Li

## References

Hideaki Shimazaki and Shigeru Shinomoto. A method for selecting the bin size of a time histogram. Neural computation, 19(6):1503-1527, June 2007.

Qiang Song and Andrew D. Smith. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England), 27(6):870-871, March 2011.

evalBinSize, binCount
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 # Retrieve system files extdata.dir <- system.file("extdata", package="RIPSeeker") bamFiles <- list.files(extdata.dir, ".bam", recursive=TRUE, full.names=TRUE) bamFiles <- grep("PRC2", bamFiles, value=TRUE) alignGal <- getAlignGal(bamFiles[1], reverseComplement=TRUE, genomeBuild="mm9") alignGR <- as(alignGal, "GRanges") alignGRList <- GRangesList(as.list(split(alignGR, seqnames(alignGR)))) minBinSize <- 200 maxBinSize <- 1200 gr <- alignGRListchrX b <- selectBinSize(gr, minBinSize, maxBinSize, increment=100, getFullResults=TRUE) plot(b$binSizes, b$costs) chrname <- as.character(runValue(seqnames(gr))) chrlen <- seqlengths(gr)[chrname] legend("topright", box.lty=0, sprintf("%s: 1-%d;\nTotal mapped reads: %d;\nOptimal bin size = %d bp", chrname, chrlen, length(gr), b\$bestBinSize))