gcEffects: ChIP-seq GC Effects Estimation


Description

GC effects are estimated from effective GC content and read counts in genome-wide windows, using generalized linear mixture models. Genome-wide windows are sampled either randomly or under supervision, in given proportions. GC effects for background and foreground are estimated separately.
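The idea of fitting background and foreground separately inside a mixture can be sketched with a small EM loop in base R. This is only an illustration on simulated data, not the package's implementation: the log-linear dependence on GC content, the simulated windows, and all variable names are assumptions made for the example. The starting values mirror the defaults mu0 = 1, mu1 = 50 and p = 0.02.

```r
set.seed(1)
n  <- 2000
gc <- runif(n, 0.3, 0.8)                 # simulated effective GC content
fg <- rbinom(n, 1, 0.05) == 1            # hidden foreground indicator
y  <- rpois(n, ifelse(fg, exp(2 + 3 * gc), exp(-1 + 2 * gc)))

# Starting values, mirroring mu0 = 1, mu1 = 50, p = 0.02
p  <- 0.02
d0 <- (1 - p) * dpois(y, 1)
d1 <- p * dpois(y, 50)
z  <- d1 / (d0 + d1)                     # posterior probability of foreground

ll_old <- -Inf
for (iter in 1:100) {
  # M-step: one weighted Poisson regression per mixture component
  fit0 <- glm(y ~ gc, family = poisson, weights = 1 - z)
  fit1 <- glm(y ~ gc, family = poisson, weights = z)
  p    <- mean(z)

  # E-step: update posterior foreground probabilities
  d0 <- (1 - p) * dpois(y, fitted(fit0))
  d1 <- p * dpois(y, fitted(fit1))
  z  <- d1 / (d0 + d1)

  # Stop when the relative change in log likelihood is small
  ll <- sum(log(d0 + d1))
  if (is.finite(ll_old) && abs((ll - ll_old) / ll) <= 0.001) break
  ll_old <- ll
}

round(c(p = p, bg = coef(fit0), fg = coef(fit1)), 2)
```

The two fitted regressions play the role of the background and foreground GC effect curves; the estimated p is the foreground proportion among the simulated windows.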

Usage

gcEffects(coverage, bdwidth, flank = NULL, plot = TRUE, sampling = c(0.05,
  1), supervise = GRanges(), gcrange = c(0.3, 0.8), emtrace = TRUE,
  model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0,
  theta1 = mu1, p = 0.02, converge = 0.001, genome = "hg19",
  gctype = c("ladder", "tricube"))

Arguments

coverage

A list object returned by function read5endCoverage.

bdwidth

A non-negative integer vector with two elements specifying the ChIP-seq binding width and the peak detection half window size. Usually generated by the function bindWidth. A poor estimate of bdwidth makes the downstream analysis meaningless.

flank

A non-negative integer specifying the flanking width of ChIP-seq binding. This parameter allows reads to appear in the flanking regions with a probability that decreases as the distance from the binding region increases. It helps define how effective GC content is calculated. Default is NULL, which means this parameter will be calculated from bdwidth. However, if a custom value is provided, this parameter is not recalculated; instead, the second element of bdwidth is recalculated based on flank.

plot

A logical vector which, when TRUE (default), returns plots of intermediate results.

sampling

A numeric vector of length 2. The first number specifies the proportion of regions to be sampled for GC effects estimation. The second number specifies how many times the sampling is repeated. The default c(0.05,1) gives fairly robust estimates for the human genome. However, smaller genomes may need both a higher proportion and more repeats for robust estimation.
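How the two elements of sampling interact can be sketched in base R. The snippet below is a simplified illustration, not the package's internal sampling code; the window count and the values chosen for a hypothetical smaller genome are assumptions.

```r
set.seed(42)
n_windows <- 10000          # hypothetical number of genome-wide windows
sampling  <- c(0.15, 3)     # 15% of windows, sampling repeated 3 times

# One index set per repeat, each covering the requested proportion
idx <- replicate(sampling[2],
                 sample(n_windows, round(n_windows * sampling[1])),
                 simplify = FALSE)

lengths(idx)                # number of windows drawn in each repeat
```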

supervise

A GRanges object specifying peak regions in the studied data, such as peaks called by peak callers, e.g. MACS and SPP. These peak regions provide supervised window sampling for both mixtures in the generalized linear model. By default, no supervision is used. If the provided peak regions cover too few windows, supervised sampling is automatically replaced by random sampling.

gcrange

A non-negative numeric vector of length 2. This vector sets the range of GC content used to filter regions for GC effect estimation. For human, most regions have GC content between 0.3 and 0.8, which is set as the default. Regions with GC content outside this range are ignored. This range is critical when very few foreground regions are selected for mixture model fitting, since outliers could drive the regression lines. Thus, if possible, first make a scatter plot of counts against GC content to decide this parameter. Alternatively, select a narrower range, e.g. c(0.35,0.7), to avoid outlier effects from both high and low GC-content regions.
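The suggested scatter plot and range filter might look like the following base R sketch. The GC content and counts here are simulated purely for illustration; with real data they would come from the windows being analysed, and the variable names are assumptions.

```r
set.seed(7)
gc     <- rbeta(5000, 8, 10)                 # simulated window GC content
counts <- rpois(5000, exp(0.5 + 2 * gc))     # simulated read counts

gcrange <- c(0.3, 0.8)
keep    <- gc >= gcrange[1] & gc <= gcrange[2]

# Windows outside the range are drawn in red; dashed lines mark the range
plot(gc, counts, pch = ".", col = ifelse(keep, "black", "red"),
     xlab = "GC content", ylab = "Read count")
abline(v = gcrange, lty = 2)

mean(keep)                                   # fraction of windows retained
```

If the red points include isolated extreme counts, narrowing gcrange is one way to keep them from driving the fit.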

emtrace

A logical vector which, when TRUE (default), prints the trace of log-likelihood changes across EM iterations.

model

A character string specifying the distribution to be used in generalized linear model fitting. The default is negative binomial (nbinom); poisson is also currently supported. Based on our tests on multiple datasets, poisson is in most cases a very good approximation of negative binomial and provides much faster model fitting.
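One general reason the two models can agree is that the negative binomial variance is mu + mu^2/theta, so it approaches the Poisson as theta grows. The base R check below illustrates this property with arbitrary values of mu and theta; it is not a reproduction of the tests mentioned above.

```r
y  <- 0:20
mu <- 5

# Largest pointwise difference between the two probability mass functions
d_large_theta <- max(abs(dnbinom(y, size = 1e6, mu = mu) - dpois(y, mu)))
d_small_theta <- max(abs(dnbinom(y, size = 2,   mu = mu) - dpois(y, mu)))

c(large_theta = d_large_theta, small_theta = d_small_theta)
```

When the data are strongly overdispersed (small theta), the two distributions differ noticeably and nbinom may be the safer choice.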

mu0

A non-negative numeric giving the initial read count signal for background regions. This is used as the starting value of the background mean for poisson/nbinom fitting. Default is 1.

mu1

A non-negative numeric giving the initial read count signal for foreground regions. This is used as the starting value of the foreground mean for poisson/nbinom fitting. Default is 50.

theta0

A non-negative numeric giving the initial value of the shape parameter of the negative binomial model for background regions. For more detail, see theta in the glm.nb function.

theta1

A non-negative numeric giving the initial value of the shape parameter of the negative binomial model for foreground regions. For more detail, see theta in the glm.nb function.

p

A non-negative numeric specifying the proportion of foreground regions among all estimated regions. This is used as a starting value for the EM algorithm. Default is 0.02.

converge

A non-negative numeric specifying the termination condition of the EM algorithm. The EM algorithm stops when the ratio of the log-likelihood increment to the whole log likelihood is less than or equal to converge.
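The stopping rule can be written as a small helper. This is a sketch under assumptions: the function name has_converged is hypothetical, and taking the new log likelihood as the denominator and using absolute values are choices made here, since the description does not spell them out.

```r
# TRUE when the relative change in log likelihood is at most `converge`
has_converged <- function(ll_new, ll_old, converge = 0.001) {
  abs(ll_new - ll_old) / abs(ll_new) <= converge
}

has_converged(-10495, -10500)   # increment 5 relative to 10495
has_converged(-10400, -10500)   # increment 100 relative to 10400
```

With the default converge = 0.001, the first call returns TRUE (5/10495 is about 0.0005) and the second returns FALSE (100/10400 is about 0.0096).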

genome

A BSgenome object containing the sequences of the reference genome that was used to align the reads, or the name of this reference genome specified in a way that is accepted by the getBSgenome function defined in the BSgenome software package. In that case the corresponding BSgenome data package needs to be already installed (see ?getBSgenome in the BSgenome package for the details).

gctype

A character vector specifying the method used to calculate effective GC content. The default, ladder, is based on a uniform fragment distribution. A smoother method based on the tricube assumption is also available. However, tricube should not be used if the estimated peak half size is 3 or more times larger than the estimated binding width.

Value

A list of objects

gc

The GC contents at which GC effects are estimated.

mu0

Predicted background signals at GC content gc.

mu1

Predicted foreground signals at GC content gc.

mu0med0

Median of predicted background signals.

mu1med1

Median of predicted foreground signals.

mu0med1

Median of predicted background signals at GC content of foreground windows.

mu1med0

Median of predicted foreground signals at GC content of background windows.

Examples

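library(gcapc)
# Example BAM file shipped with the package
bam <- system.file("extdata", "chipseq.bam", package="gcapc")
cov <- read5endCoverage(bam)
# Binding width and peak detection half window size
bdw <- bindWidth(cov)
# GC effects estimated from 15% of windows, sampled once
gcb <- gcEffects(cov, bdw, sampling = c(0.15,1))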

tengmx/gcapc documentation built on May 31, 2019, 8:35 a.m.