fdrEnrichedCounts: Posterior probability that a certain number of repeats are...

Description Usage Arguments Details Value References Examples

Description

Given a vector of number of repeats (e.g. there are 100 sequences appearing once, 50 sequences appearing twice etc.) the function computes the false discovery rate that each number of repeats is unusually high.

Usage

1
fdrEnrichedCounts(counts,use=1:10,components=0,mc.cores=1)

Arguments

counts

vector with observed frequencies. The vector must have names. tabDuplReads function can be used for this purpose.

use

number of repeats to be used when estimating the null distribution. The number of repeats expected if no unusually high repeats are present. The first 10 are used by default.

components

number of negative binomials that will be used to fit the null distribution. The default value is 1. This value has to be between 0 and 4. If 0 is given the optimal number of negative biomials is chosen using the Bayesian information criterion (BIC)

mc.cores

number of cores to be used to compute calculations. This parameter will be passed bt to mclappply

Details

The null distribution is a combination of n negative binomials where. n is assigned through the components parameter. If components is equal to 0 the optimal number of negative binomials is choosen using the Bayesian information criterion (BIC). The parameters of the null distribution are estimated from the number of observations with as many repeats as told in the use parameter. If use is 1:10 the null distribution will be estimated using repeats that appear 1 time, 2 times, ... or 10 times.

False discovery rate for usually high number of repeats is done following an empirical Bayes scheme similar to that in Efron et al. Let f0(x) be the null distribution, f(x) be the overall distribution and (1-pi0) the proportion of unusually high repeats. We assume the two component mixture f(x)= pi0 f0(x) + (1-pi0)f1(x). Essentially, f(x) is estimated from the data (imposing that f(x) must be monotone decreasing after its mode using isoreg from packabe base, to improve the estimate in the tails). Currently pi0 is set to 1, i.e. its maximum possible value, which provides an upper bound for the FDR. The estimated false discovery rate for enrichment is 1-pi0*(1-cumsum(f0(x)))/(1-cumsum(f(x))). A monotone regression (isoreg) is applied to remove small random fluctuations in the estimated FDR and to guarantee that it decreases with x.

Value

data.frame with the following columns:

pdfH0

vector with pdf under the null hypothesis of no enrichment

pdfOverall

vector with pdf for mixture distribution

fdrEnriched

vector with false discovery rate that each count is significantly enriched

References

Ji et al. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature Biotechnology, 2008, 26, 1293-1300.

Efron et al. Empirical Bayes Analysis of a Microarray Experiment, Journal of the American Statistical Association, 2001, 96, 1151-1160.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#Generate 1000 sequences repeated once, on the average
nrepeats <- c(rpois(10^4,1),rpois(10,10))
nrepeats <- nrepeats[nrepeats>0]
counts <- table(nrepeats)
barplot(counts) -> xaxis #observe bimodality around 10
fdrest <- fdrEnrichedCounts(counts,use=1:5,components=1)
cutoff <- xaxis[which(fdrest$fdrEnriched<0.95)[1]]
abline(v=cutoff,col=2)
text(cutoff,counts[1]/2,'cut-off',col=2)
head(fdrest)

Example output

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, basename, cbind, colMeans, colSums, colnames,
    dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
    rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: MASS
Loading required package: BSgenome
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: Biostrings
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit


Attaching package: 'BSgenome'

The following objects are masked from 'package:GenomeInfoDb':

    GenomeDescription, bsgenomeName, organism, provider,
    providerVersion, releaseDate, releaseName, species

The following objects are masked from 'package:BiocGenerics':

    organism, species

Warning: namespace 'rtracklayer' is not available and has been replaced
by .GlobalEnv when processing object ''
Warning: namespace 'rtracklayer' is not available and has been replaced
by .GlobalEnv when processing object ''
Warning: namespace 'rtracklayer' is not available and has been replaced
by .GlobalEnv when processing object ''
Warning: namespace 'rtracklayer' is not available and has been replaced
by .GlobalEnv when processing object ''
Warning messages:
1: multiple methods tables found for 'organism' 
2: multiple methods tables found for 'species' 
3: multiple methods tables found for 'seqinfo' 
4: multiple methods tables found for 'seqinfo<-' 
5: multiple methods tables found for 'seqnames' 
6: no function found corresponding to methods exports from 'DelayedArray' for: 'acbind', 'arbind' 
7: no function found corresponding to methods exports from 'SummarizedExperiment' for: 'acbind', 'arbind' 
8: replacing previous import 'BiocGenerics::path' by 'Rsamtools::path' when loading 'GenomicAlignments' 
9: multiple methods tables found for 'rglist' 
         pdfH0  pdfOverall fdrEnriched
1 0.5777667336 0.573284741   1.0000000
2 0.2924841567 0.299635557   0.9963087
3 0.0987099416 0.096022817   0.9963087
4 0.0249850776 0.022658850   0.9963087
5 0.0050593008 0.006179686   0.7208937
6 0.0008537271 0.000950721   0.4484370

htSeqTools documentation built on May 6, 2019, 3:39 a.m.