gcEffects: ChIP-seq GC Effects Estimation
In tengmx/gcapc: GC Aware Peak Caller


GC effects are estimated from effective GC content and read counts in genome-wide windows, using generalized linear mixture models. Windows are sampled across the genome, either at random or under supervision, in the given proportions. GC effects are estimated separately for background and foreground.

gcEffects(coverage, bdwidth, flank = NULL, plot = TRUE, sampling = c(0.05,
  1), supervise = GRanges(), gcrange = c(0.3, 0.8), emtrace = TRUE,
  model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0,
  theta1 = mu1, p = 0.02, converge = 0.001, genome = "hg19",
  gctype = c("ladder", "tricube"))

`coverage`	A list object returned by function `read5endCoverage`.
`bdwidth`	A non-negative integer vector with two elements specifying the ChIP-seq binding width and the peak detection half window size. Usually generated by function `bindWidth`. A poor estimate of `bdwidth` makes the downstream analysis meaningless.
`flank`	A non-negative integer specifying the flanking width of ChIP-seq binding. This parameter allows reads to appear in the flanking regions with a probability that decreases as the distance from the binding region increases, and it helps define the effective GC content calculation. Default is NULL, which means this parameter is calculated from `bdwidth`. If a custom value is provided, this parameter is not recalculated; instead, the second element of `bdwidth` is recalculated based on `flank`.
`plot`	A logical vector which, when TRUE (default), returns plots of intermediate results.
`sampling`	A numeric vector with length 2. The first number specifies the proportion of regions to be sampled for GC effects estimation. The second number specifies how many times the sampling is repeated. The default c(0.05,1) gives reasonably robust estimates for the human genome. Smaller genomes might need both a higher proportion and more repeats for robust estimation.
`supervise`	A GRanges object specifying peak regions in the studied data, such as peaks called by peak callers, e.g. MACS and SPP. These peak regions provide supervised window sampling for both mixture components in the generalized linear model. By default, no supervision is used. If the provided peak regions cover too few windows, supervised sampling is automatically replaced by random sampling.
`gcrange`	A non-negative numeric vector with length 2. This vector sets the range of GC content used to filter regions for GC effect estimation. For human, most regions have GC content between 0.3 and 0.8, which is set as the default. Regions with GC content outside this range are ignored. This range is critical when very few foreground regions are selected for mixture model fitting, since outliers could drive the regression lines. Thus, if possible, first make a scatter plot of counts against GC content to decide this parameter. Alternatively, select a narrower range, e.g. c(0.35,0.7), to avoid outlier effects from both high and low GC-content regions.
`emtrace`	A logical vector which, when TRUE (default), prints the trace of log likelihood changes across EM iterations.
`model`	A character specifying the distribution model to be used in generalized linear model fitting. The default is negative binomial (`nbinom`); `poisson` is also supported. In our tests on multiple datasets, poisson is usually a very good approximation of negative binomial and provides much faster model fitting.
`mu0`	A non-negative numeric giving the initial read count signal for background regions. It is used as the starting value of the background mean in poisson/nbinom fitting. Default is 1.
`mu1`	A non-negative numeric giving the initial read count signal for foreground regions. It is used as the starting value of the foreground mean in poisson/nbinom fitting. Default is 50.
`theta0`	A non-negative numeric giving the initial shape parameter of the negative binomial model for background regions. For more detail, see theta in the `glm.nb` function.
`theta1`	A non-negative numeric giving the initial shape parameter of the negative binomial model for foreground regions. For more detail, see theta in the `glm.nb` function.
`p`	A non-negative numeric specifying the proportion of foreground regions among all estimated regions. It is used as a starting value for the EM algorithm. Default is 0.02.
`converge`	A non-negative numeric specifying the termination condition of the EM algorithm. The EM algorithm stops when the ratio of the log likelihood increment to the whole log likelihood is less than or equal to `converge`.
`genome`	A BSgenome object containing the sequences of the reference genome that was used to align the reads, or the name of this reference genome specified in a way that is accepted by the `getBSgenome` function defined in the BSgenome software package. In that case the corresponding BSgenome data package needs to be already installed (see `?getBSgenome` in the BSgenome package for the details).
`gctype`	A character vector specifying the method used to calculate effective GC content. The default `ladder` is based on a uniform fragment distribution. A smoother method based on the tricube assumption is also available. However, tricube should not be used if the estimated peak half size is 3 or more times larger than the estimated binding width.
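
The roles of `mu0`, `mu1`, `p` and `converge` described above can be illustrated with a minimal EM sketch in base R. This is not the gcapc implementation: the simulated data, the log-linear `y ~ gc` term, the Poisson family and the iteration cap are all assumptions made for illustration. Only the starting values and the stopping rule follow the argument descriptions.

```r
set.seed(1)

# Simulated windows (assumption): GC content and read counts drawn from
# a two-component Poisson mixture with GC-dependent means.
n  <- 2000
gc <- runif(n, 0.3, 0.8)
fg <- rbinom(n, 1, 0.05)                      # true foreground indicator
y  <- rpois(n, ifelse(fg == 1, exp(2 + 3 * gc), exp(-1 + 2 * gc)))

# Starting values, named after the gcEffects arguments they mimic.
mu0 <- 1; mu1 <- 50; p <- 0.02; converge <- 0.001

lam0 <- rep(mu0, n)                           # background mean per window
lam1 <- rep(mu1, n)                           # foreground mean per window
loglik_old <- -Inf

for (iter in 1:100) {
  # E-step in log space: posterior probability of foreground per window.
  l0 <- log(1 - p) + dpois(y, lam0, log = TRUE)
  l1 <- log(p)     + dpois(y, lam1, log = TRUE)
  m  <- pmax(l0, l1)
  ll <- m + log(exp(l0 - m) + exp(l1 - m))    # per-window log likelihood
  z  <- exp(l1 - ll)
  loglik <- sum(ll)

  # Stop when the relative log likelihood increment is <= converge.
  if (is.finite(loglik_old) &&
      abs((loglik - loglik_old) / loglik) <= converge) break
  loglik_old <- loglik

  # M-step: weighted Poisson regressions of counts on GC content.
  fit0 <- glm(y ~ gc, family = poisson(), weights = 1 - z)
  fit1 <- glm(y ~ gc, family = poisson(), weights = z)
  lam0 <- fitted(fit0)
  lam1 <- fitted(fit1)
  p    <- mean(z)                             # updated foreground proportion
}
```

Because the stopping rule is relative to the whole log likelihood, a larger `converge` ends the loop after fewer iterations. Starting `mu1` well above `mu0` keeps the two components from swapping roles.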

A list of objects

`gc`	The GC contents at which GC effects are estimated.
`mu0`	Predicted background signals at GC content `gc`.
`mu1`	Predicted foreground signals at GC content `gc`.
`mu0med0`	Median of predicted background signals.
`mu1med1`	Median of predicted foreground signals.
`mu0med1`	Median of predicted background signals at GC content of foreground windows.
`mu1med0`	Median of predicted foreground signals at GC content of background windows.
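
The four medians differ only in which predicted curve is evaluated and at which windows' GC content. The sketch below is one reading of the descriptions above, not code taken from gcapc. The curves, the window GC values and the use of `approx` for interpolation are hypothetical toy choices.

```r
set.seed(2)

# Toy predicted curves on a GC grid (hypothetical, not gcapc output).
gc  <- seq(0.3, 0.8, by = 0.01)
mu0 <- exp(-1 + 2 * gc)      # predicted background signal at each GC value
mu1 <- exp( 2 + 3 * gc)      # predicted foreground signal at each GC value

# Toy GC content of windows belonging to each component (assumption).
gc_bg <- runif(500, 0.3, 0.6)
gc_fg <- runif(50,  0.5, 0.8)

# Evaluate a predicted curve at arbitrary GC values by linear interpolation.
predict_at <- function(curve, g) approx(gc, curve, xout = g)$y

mu0med0 <- median(predict_at(mu0, gc_bg))  # background curve, background windows
mu1med1 <- median(predict_at(mu1, gc_fg))  # foreground curve, foreground windows
mu0med1 <- median(predict_at(mu0, gc_fg))  # background curve, foreground windows
mu1med0 <- median(predict_at(mu1, gc_bg))  # foreground curve, background windows
```

In this toy setup the foreground windows have higher GC content and both curves increase with GC, so the cross medians (`mu0med1`, `mu1med0`) show how much of the signal difference comes from GC composition alone.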

bam <- system.file("extdata", "chipseq.bam", package="gcapc")
cov <- read5endCoverage(bam)
bdw <- bindWidth(cov)
gcb <- gcEffects(cov, bdw, sampling = c(0.15,1))