filterDNA: Filter Double Strand Sequences from a Bam File

Description Usage Arguments Details Value See Also Examples

View source: R/filterDNA.R

Description

Filter putative double strand DNA from a strand specific RNA-seq using a window sliding across the genome.

Usage

1
2
3
4
5
filterDNA(file, destination, statfile, sequences, mapqFilter = 0, paired,
yieldSize = 1e+06, winWidth = 1000, winStep = 100, readProp = 0.5,
threshold = 0.7, pvalueThreshold = 0.05, useCoverage = FALSE,
mustKeepRanges, getWin = FALSE, minCov = 0, maxCov = 0,
errorRate = 0.01)

Arguments

file

the input bam file to be filterd. Your bamfile should be sorted and have an index file located at the same path.

destination

The file path where the filtered output will be written

statfile

the file to write the summary of the results

sequences

the list of sequences to be filtered.

mapqFilter

every read that has mapping quality below mapqFilter will be removed before any analysis If missing, the entire bam file will be read.

paired

if TRUE then the input bamfile will be considered as paired end reads. If missing, 100 thousands first reads will be inspected to test if the input bam file in paired end or single end.

yieldSize

by default is 1e6, i.e. the bam file is read by block of records whose size is defined by this paramter. It is used to pass to same paramter of the scanBam function.

winWidth

the length of the sliding window, 1000 by default.

winStep

the step length to sliding the window, 100 by default.

readProp

A read is considered to be included in a window if at least readProp of it is in the window. Specified as a proportion. 0.5 by default.

threshold

the strand proportion threshold to test whether to keep a window or not. 0.7 by default

pvalueThreshold

the threshold for the p-value in the test of keeping windows. 0.05 by default

useCoverage

if TRUE, then the strand information in each window corresponds to the sum of coverage coming from positive/negative reads; and not the number of positive/negative reads as default.

mustKeepRanges

a GRanges object; all reads that map to those ranges will be kept regardless the strand proportion of the windows containing them.

getWin

if TRUE, the function will not only filter the bam file but also return a data frame containing the information of all windows of the original and filtered bam file.

minCov

if useCoverage=FALSE, every window that has less than minCov reads will be rejected regardless the strand proportion. If useCoverage=TRUE, every window has max coverage least than minCov will be rejected. 0 by default

maxCov

if useCoverage=FALSE, every window that has more than maxCov reads will be kept regardless the strand proportion. If useCoverage=TRUE, every window with max coverage more than maxCov will be kept. If 0 then it doesn't have effect on selecting window. 0 by default.

errorRate

the probability that an RNA read takes the false strand. 0.01 by default.

Details

filterDNA reads a bam file containing strand specific RNA reads, and filter reads coming from putative double strand DNA. Using a window sliding across the genome, we calculate the positive/negative proportion of reads in each window. We then use logistic regression to estimate the strand proportion of reads in each window, and calculate the p-value when comparing that to a given threshold. Let π be the strand proportion of reads in a window.

Null hypothesis for positive window: π ≤ threshold.

Null hypothesis for negative window: π ≥ 1-threshold.

Only windows with p-value <= pvalueThreshold are kept. For a kept positive window, each positive read in this window is kept with the probability (P-M)/P where P be the number of positive reads, and M be the number of negative reads. That is because those M negative reads are supposed to come from double-strand DNA, then there should be also M postive reads among the P positive reads come from double-strand DNA. In other words, there are only (P-M) positive reads come from RNA. Each negative read is kept with the probability equalling the rate that an RNA read of your sample has wrong strand, which is errorRate. Similar for kept negative windows.

Since each alignment can be belonged to several windows, then the probability of keeping an alignment is the maximum probability defined by all windows that contain it.

Value

if getWin is TRUE: a DataFrame object which could also be obtained by the function getWinFromBamFile

See Also

getWinFromBamFile, plotHist, plotWin

Examples

1
2
file <- system.file('extdata','s2.sorted.bam',package = 'strandCheckR')
filterDNA(file,sequences='10',destination='out.bam')

strandCheckR documentation built on Nov. 8, 2020, 7:02 p.m.