filterDuplReads: Detect and filter duplicated reads/sequences.


Description

filterDuplReads filters highly repeated sequences, i.e. reads with the same chromosome, start and end positions. Since many such sequences are likely due to over-amplification artifacts, this can be a useful pre-processing step for ultra high-throughput sequencing data. A false discovery rate is computed for each number of repeats being unusually high, and reads with a false discovery rate above the cutoff are removed. For more information on the false discovery rate calculation please read the fdrEnrichedCounts manual page.

tabDuplReads counts the number of reads with no duplications, duplicated once, twice, etc.
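A minimal sketch of the intended workflow, assuming reads is a hypothetical RangedData object of aligned read positions (a full runnable example is given in the Examples section):

tabDuplReads(reads)                                    # how often is each position repeated?
readsClean <- filterDuplReads(reads, fdrOverAmp=0.01)  # drop likely over-amplified reads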

Usage

filterDuplReads(x, maxRepeats, fdrOverAmp=0.01, negBinomUse=.999, components=0, mc.cores=1)

tabDuplReads(x,  minRepeats=1, mc.cores=1)

Arguments

x

Object containing the read locations. Currently methods are available for RangedData and list objects. Duplication is assessed based only on the space, start, end and x[['strand']] values, i.e. reads that differ only in other variables stored in values(x) are still considered duplicates, and only the first appearance is returned.

maxRepeats

Reads appearing maxRepeats or more times are excluded. If not specified, it is set automatically based on fdrOverAmp (see the sketch after this argument list).

fdrOverAmp

Reads whose false discovery rate of being over-amplified is greater than fdrOverAmp are excluded.

negBinomUse

Proportion of counts used to compute the null distribution. For example, using 1 - 1/1000 means that 99.9% of the reads are used; the reads with the highest numbers of repetitions are the ones excluded.

components

Number of negative binomials used to fit the null distribution. This value has to be between 0 and 4. If 0 is given (the default), the optimal number of negative binomials is chosen using the Bayesian Information Criterion (BIC).

mc.cores

Number of cores to be used in parallel computing (passed on to mclapply).

minRepeats

The table is only produced for reads with at least minRepeats repeats.
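As a sketch of the two filtering modes (again using a hypothetical RangedData object reads), the cutoff can either be given explicitly through maxRepeats or derived automatically from fdrOverAmp:

filterDuplReads(reads, maxRepeats=5)     # drop reads appearing 5 or more times
filterDuplReads(reads, fdrOverAmp=0.01)  # maxRepeats chosen automatically from the FDR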

Value

filterDuplReads returns x without the highly repetitive sequences, as determined by maxRepeats or fdrOverAmp.

tabDuplReads returns a table counting the number of sequences repeated once, twice, three times, etc.

Methods

Methods for filterDuplReads and tabDuplReads

signature(x = "RangedData")

Two reads are duplicated if they have the same space, start and end position.

signature(x = "list")

The method is applied separately to each RangedData element in the list.
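A minimal sketch of the list method, assuming sample1 and sample2 are RangedData objects (e.g. built as in the Examples section); each element is filtered separately:

samples <- list(s1=sample1, s2=sample2)
filtered <- filterDuplReads(samples, fdrOverAmp=0.01)
lapply(filtered, nrow)  # number of reads kept in each sample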

Author(s)

Evarist Planet, David Rossell, Oscar Flores

See Also

fdrEnrichedCounts to compute the posterior probability that a certain number of repeats is due to over-amplification.

Examples

library(htSeqTools)

#Simulate 1000 reads of length 39 on two chromosomes
set.seed(1)
st <- round(rnorm(1000,500,100))
strand <- rep(c('+','-'),each=500)
space <- sample(c('chr1','chr2'),size=length(st),replace=TRUE)
sample1 <- RangedData(IRanges(st,st+38),strand=strand,space=space)

#Add artificial repeats
st <- rep(400,20)
repeats <- RangedData(IRanges(st,st+38),strand='+',space='chr1')
sample1 <- rbind(sample1,repeats)

#Remove over-amplified reads
filterDuplReads(sample1)
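
#Possible follow-up (not part of the original example): inspect duplication
#counts and set the cutoff explicitly
tabDuplReads(sample1)
filterDuplReads(sample1, maxRepeats=5)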
