bumphunter: Bumphunter

bumphunter

R Documentation

Bumphunter

Description

Estimate regions for which a genomic profile deviates from its baseline value. Originally implemented to detect differentially methylated genomic regions between two populations.

Usage

## S4 method for signature 'matrix'
bumphunter(object, design, chr = NULL, pos, cluster = NULL, coef = 2,
           cutoff = NULL, pickCutoff = FALSE, pickCutoffQ = 0.99, maxGap = 500,
           nullMethod = c("permutation", "bootstrap"), smooth = FALSE,
           smoothFunction = locfitByCluster, useWeights = FALSE,
           B = ncol(permutations), permutations = NULL, verbose = TRUE, ...)

bumphunterEngine(mat, design, chr = NULL, pos, cluster = NULL, coef = 2,
                 cutoff = NULL, pickCutoff = FALSE, pickCutoffQ = 0.99, maxGap = 500,
                 nullMethod = c("permutation", "bootstrap"), smooth = FALSE,
                 smoothFunction = locfitByCluster, useWeights = FALSE,
                 B = ncol(permutations), permutations = NULL, verbose = TRUE, ...)

## S3 method for class 'bumps'
print(x, ...)

Arguments

`object`	An object of class matrix.
`x`	An object of class `bumps`.
`mat`	A matrix with rows representing genomic locations and columns representing samples.
`design`	Design matrix with rows representing samples and columns representing covariates. Regression is applied to each row of mat.
`chr`	A character vector with the chromosomes of each location.
`pos`	A numeric vector representing the chromosomal position.
`cluster`	The clusters of locations that are to be analyzed together. In the case of microarrays, the clusters are often supplied by the manufacturer. If they are not available, the function `clusterMaker` can be used to cluster nearby locations.
`coef`	An integer denoting the column of the design matrix containing the covariate of interest. The hunt for bumps will only be done for the estimate of this coefficient.
`cutoff`	A numeric value. Values of the estimate of the genomic profile above the cutoff or below the negative of the cutoff will be used as candidate regions. It is possible to give two separate values (upper and lower bounds). If one value is given, the lower bound is minus the value.
`pickCutoff`	Should bumphunter attempt to pick a cutoff using the permutation distribution?
`pickCutoffQ`	The quantile used for picking the cutoff using the permutation distribution.
`maxGap`	If `cluster` is not provided, this maximum location gap will be used to define clusters via the `clusterMaker` function.
`nullMethod`	Method used to generate null candidate regions, must be one of ‘bootstrap’ or ‘permutation’ (defaults to ‘permutation’). However, if covariates in addition to the outcome of interest are included in the design matrix (ncol(design)>2), the ‘permutation’ approach is not recommended. See vignette and original paper for more information.
`smooth`	A logical value. If `TRUE`, the estimated profile will be smoothed with the smoother defined by `smoothFunction`.
`smoothFunction`	A function to be used for smoothing the estimate of the genomic profile. The default is `locfitByCluster`; the package also provides `loessByCluster` and `runmedByCluster`.
`useWeights`	A logical value. If `TRUE`, the standard errors of the point-wise estimates of the profile function will be used as weights in the loess smoother `loessByCluster`. If the `runmedByCluster` smoother is used, this argument is ignored.
`B`	An integer denoting the number of resamples to use when computing null distributions. This defaults to 0. If `permutations` is supplied, it defines the number of permutations/bootstraps and `B` is ignored.
`permutations`	A matrix with columns providing indexes to be used to scramble the data and create a null distribution when `nullMethod` is set to ‘permutation’. If the bootstrap approach is used, this argument is ignored. If this matrix is not supplied and `B`>0, these indexes are created using the function `sample`.
`verbose`	A logical value. If `TRUE`, messages indicating progress are written out. If `FALSE`, nothing is printed.
`...`	Further arguments to be passed to the smoother functions.
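
To illustrate what the `cluster` and `maxGap` arguments describe, the following base R sketch groups locations so that a new cluster starts on a chromosome change or when the gap to the previous position exceeds a maximum. The helper `gap_clusters` and the toy `chr`/`pos` vectors are hypothetical and exist only for this illustration; this is not the package's `clusterMaker`, whose exact rules may differ.

```r
# Hypothetical helper for illustration only; NOT bumphunter::clusterMaker.
gap_clusters <- function(chr, pos, max_gap = 500) {
  stopifnot(length(chr) == length(pos))
  # Work on locations sorted by chromosome, then by position
  ord <- order(chr, pos)
  chr_o <- chr[ord]
  pos_o <- pos[ord]
  # A cluster starts at the first location, at a chromosome change,
  # or when the distance to the previous location exceeds max_gap
  new_chr <- c(TRUE, chr_o[-1] != chr_o[-length(chr_o)])
  big_gap <- c(TRUE, diff(pos_o) > max_gap)
  id_sorted <- cumsum(new_chr | big_gap)
  # Return cluster IDs in the original input order
  id <- integer(length(pos))
  id[ord] <- id_sorted
  id
}

# Toy locations (assumed values)
chr <- c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2")
pos <- c(100, 300, 1000, 1200, 150, 900)
cluster <- gap_clusters(chr, pos, max_gap = 500)
cluster
```

A vector of this shape, with one cluster ID per row of the data matrix, is what the `cluster` argument expects.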

Details

This function performs the bumphunting approach described by Jaffe et al., International Journal of Epidemiology (2012). The main output is a table of candidate regions with permutation- or bootstrap-based family-wise error rates (FWER) and p-values assigned.

The general idea is that for each genomic location we have a value for several individuals. We also have covariates for each individual and perform a regression at each location. This gives us one estimate of the coefficient of interest per location (a common example is case versus control). These estimates are then (optionally) smoothed. The smoothing occurs in clusters of locations that are ‘close enough’. This gives us an estimate of a genomic profile that is 0 when uninteresting. We then take runs of values that exceed the cutoff in absolute value as candidate regions. Permutations or bootstraps can then be performed to create null distributions for the candidate regions.
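
The steps in this paragraph can be sketched in base R. The sketch below is a simplified illustration, not the package's implementation: it uses simulated data, skips smoothing, ignores clusters (so a region is just a run of consecutive rows), and all object names (`mat`, `design`, `profile`, `regions`) and the cutoff of 0.5 are assumptions chosen for the example.

```r
set.seed(42)
n_loc  <- 20                      # genomic locations (rows)
n_samp <- 10                      # samples (columns)
group  <- rep(c(0, 1), each = 5)  # control versus case
design <- cbind(intercept = 1, group = group)

# Simulated profile: noise everywhere, plus a bump at rows 8-12 in cases
mat <- matrix(rnorm(n_loc * n_samp, sd = 0.1), nrow = n_loc)
mat[8:12, group == 1] <- mat[8:12, group == 1] + 1

# Least-squares fit for every row at once: solve (X'X) beta = X' t(mat)
beta    <- solve(crossprod(design), crossprod(design, t(mat)))
profile <- beta[2, ]              # coefficient of interest (coef = 2)

# Mark locations above the cutoff (+1) or below its negative (-1)
cutoff    <- 0.5
direction <- ifelse(profile > cutoff, 1L, ifelse(profile < -cutoff, -1L, 0L))

# Candidate regions are runs of consecutive locations with the same sign
runs   <- rle(direction)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
keep   <- runs$values != 0
regions <- data.frame(
  start = starts[keep],
  end   = ends[keep],
  L     = runs$lengths[keep],
  value = mapply(function(s, e) mean(profile[s:e]), starts[keep], ends[keep])
)
regions
```

In the real function the runs are additionally restricted to stay within a cluster, and the profile may be smoothed before thresholding.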

The simplest way to use permutations or bootstraps to create a null distribution is to set B. If the number of samples is large, this can be set to a large number, such as 1000. Note that this will be slow, and we have therefore provided parallelization capabilities. In cases where the user wants to define the permutations or bootstraps, for example cases in which all possible permutations/bootstraps can be enumerated, these can be supplied via the permutations argument.
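
As an illustration of the index matrix described for the permutations argument (one column of sample indexes per resample), the base R sketch below builds such a matrix in two ways: by random draws with `sample`, and by full enumeration for a small number of samples. The helper `all_perms` and the sample counts are hypothetical; whether a given matrix suits an analysis depends on the design.

```r
set.seed(1)
n_samples <- 6   # assumed number of samples (columns of the data matrix)
B         <- 4   # assumed number of random permutations

# Random permutations: each column is a reordering of 1:n_samples
random_perms <- replicate(B, sample(n_samples))
dim(random_perms)

# Hypothetical helper: enumerate every permutation of a short vector,
# returning one permutation per column
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = length(v)))
  do.call(cbind, lapply(seq_along(v), function(i) {
    # Fix v[i] in the first row, permute the remaining elements below it
    rbind(v[i], all_perms(v[-i]))
  }))
}

enumerated_perms <- all_perms(1:4)
dim(enumerated_perms)
```

Either matrix has the shape described for `permutations`; when it is supplied, its number of columns takes the place of `B`.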

Uncertainty is assessed via permutations or bootstraps. Each of the B permutations/bootstraps will produce an estimated ‘null profile’ from which we can define ‘null candidate regions’. For each observed candidate region we determine how many null regions are ‘more extreme’ (longer and with a higher average value). The ‘p.value’ is the proportion of candidate regions obtained from the permutations/bootstraps that are as extreme as the observed region. These p-values should be interpreted with care, as their theoretical properties are not well understood. The ‘fwer’ is the proportion of permutations/bootstraps that had at least one region as extreme as the observed region. We compute p-values and FWER for the area of the regions (as opposed to length and value as a pair) as well. Note that for cases with more than one covariate the permutation approach is not generally recommended; the nullMethod argument is coerced to ‘bootstrap’ in this scenario. See the vignette and the original paper for more information.
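
The definitions of ‘p.value’ and ‘fwer’ above can be made concrete with a small base R sketch. It follows the wording of this paragraph only and is not the package's code, which may handle ties, counts, and edge cases differently. The tables `obs` and `null`, the column `perm`, and the helper `as_extreme` are made-up illustrations.

```r
# Observed candidate regions: length (L) and average profile value
obs <- data.frame(L = c(5, 2), value = c(0.9, 0.4))

# Null candidate regions pooled over B = 4 hypothetical resamples;
# 'perm' records which resample produced each region (resample 4 found none)
B <- 4
null <- data.frame(
  perm  = c(1, 1, 2, 3, 3, 3),
  L     = c(2, 6, 3, 1, 2, 5),
  value = c(0.5, 1.0, 0.3, 0.2, 0.45, 0.3)
)

# A null region is 'as extreme' if it is at least as long AND has at least
# as large an average value (in absolute terms) as the observed region
as_extreme <- function(L, value) {
  null$L >= L & abs(null$value) >= abs(value)
}

# p.value: proportion of all null regions that are as extreme
obs$p.value <- mapply(function(L, v) mean(as_extreme(L, v)), obs$L, obs$value)

# fwer: proportion of resamples with at least one region as extreme
obs$fwer <- mapply(
  function(L, v) length(unique(null$perm[as_extreme(L, v)])) / B,
  obs$L, obs$value
)
obs
```

The same comparison can be repeated with a single ‘area’ statistic per region in place of the (length, value) pair.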

Parallelization is implemented through the foreach package.

Value

An object of class bumps with the following components:

`tab`	The table with candidate regions and annotation for these.
`coef`	The single loci coefficients.
`fitted`	The estimated genomic profile used to determine the regions.
`pvaluesMarginal`	The marginal p-value for each genomic location.
`null`	The null distribution.
`algorithm`	Details on the algorithm.

Author(s)

Rafael A. Irizarry, Martin J. Aryee, Kasper D. Hansen, and Shan Andrews.

References

Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012) Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. International Journal of Epidemiology 41(1):200-9.

Examples

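library(bumphunter)
# Generate dummy data for the example
dat <- dummyData()
# Enable parallelization (registers a foreach backend with 2 workers)
library(doParallel)
registerDoParallel(cores = 2)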
# Find bumps
bumps <- bumphunter(dat$mat, design=dat$design, chr=dat$chr, pos=dat$pos,
                    cluster=dat$cluster, coef=2, cutoff= 0.28, nullMethod="bootstrap",
                    smooth=TRUE, B=250, verbose=TRUE,
                    smoothFunction=loessByCluster)
bumps
# cleanup, for Windows
bumphunter:::foreachCleanup()

rafalab/bumphunter documentation built on March 20, 2024, 6:22 a.m.