dba.count: Count reads in binding site intervals

View source: R/DBA.R

Description

Counts reads in binding site intervals. Files must be one of bam, bed, or gzip-compressed bed. File suffixes must be ".bam", ".bed", or ".bed.gz", respectively.

Usage

dba.count(DBA, peaks, minOverlap=2, score=DBA_SCORE_NORMALIZED,
          fragmentSize=DBA$config$fragmentSize, 
          summits=200, filter=1, bRemoveDuplicates=FALSE, bScaleControl=TRUE,
          bSubControl = is.null(DBA$greylist),
          mapQCth=DBA$config$mapQCth, filterFun=max, minCount=0, 
          bLog=FALSE, bUseSummarizeOverlaps=TRUE, 
          readFormat=DBA_READS_DEFAULT, bParallel=DBA$config$RunParallel) 

Arguments

DBA

DBA object

peaks

If GRanges, RangedData, data frame, or matrix, this parameter contains the intervals to use for counting. If a character string, it specifies a file containing the intervals to use (with the first three columns specifying chromosome, start position, and end position). If missing or a mask, a consensus peakset is generated using the minOverlap parameter (after applying the mask, if present). If NULL, the score, filter, and summits parameters are honored: for score and filter, the global binding matrix is updated without re-counting; for summits, reads are re-counted after the intervals are re-centered.
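A minimal sketch of two ways peaks can be supplied, assuming the tamoxifen object and read files used in the Examples section (the GRanges coordinates are purely illustrative):

## Not run: 
# explicit intervals supplied as a GRanges object (illustrative coordinates)
library(GenomicRanges)
myPeaks <- GRanges("chr18", IRanges(start=c(10000, 50000), end=c(10500, 50500)))
tamoxifen <- dba.count(tamoxifen, peaks=myPeaks)

# peaks=NULL: change the score of an already-counted object without re-counting
tamoxifen <- dba.count(tamoxifen, peaks=NULL, score=DBA_SCORE_READS)

## End(Not run)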

minOverlap

only include peaks present in at least this many peaksets when generating the consensus peakset (i.e. when the peaks parameter is missing). If minOverlap is between zero and one, a peak will be included if it is present in at least this proportion of the peaksets.
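For example, minOverlap=3 requires a peak to be present in at least three peaksets, while minOverlap=0.5 requires it to be present in at least half of them. A sketch using the tamoxifen dataset from the Examples (read files required):

## Not run: 
data(tamoxifen_peaks)
# consensus of peaks present in at least 3 peaksets
tamoxifen <- dba.count(tamoxifen, minOverlap=3)
# consensus of peaks present in at least half of the peaksets
tamoxifen <- dba.count(tamoxifen, minOverlap=0.5)

## End(Not run)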

score

which score to use in the binding affinity matrix. Note that all raw read counts are maintained for use by dba.analyze, regardless of how this is set. One of:

DBA_SCORE_NORMALIZED normalized reads, as set by dba.normalize
DBA_SCORE_READS raw read count for interval using only reads from ChIP
DBA_SCORE_CONTROL_READS raw read count for interval using only reads from Control
DBA_SCORE_READS_FOLD raw read count for interval from ChIP divided by read count for interval from control
DBA_SCORE_READS_MINUS raw read count for interval from ChIP minus read count for interval from control
DBA_SCORE_RPKM RPKM for interval using only reads from ChIP
DBA_SCORE_RPKM_FOLD RPKM for interval from ChIP divided by RPKM for interval from control
DBA_SCORE_RPKM_MINUS RPKM for interval from ChIP minus RPKM for interval from control
DBA_SCORE_SUMMIT summit height (maximum read pileup value)
DBA_SCORE_SUMMIT_ADJ summit height (maximum read pileup value), normalized to relative library size
DBA_SCORE_SUMMIT_POS summit position (location of maximum read pileup)
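
The score can be changed on an already-counted DBA object without re-reading the read files by passing peaks=NULL, as in this short example using the pre-counted data from the Examples section:

data(tamoxifen_counts)
tamoxifen <- dba.count(tamoxifen, peaks=NULL, score=DBA_SCORE_READS)
head(dba.peakset(tamoxifen, bRetrieve=TRUE))
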

fragmentSize

This value will be used as the length of the reads. Each read will be extended from its endpoint along the appropriate strand by this many bases. If set to zero, the read size indicated in the BAM/BED file will be used. fragmentSize may also be a vector of values, one for each ChIP sample plus one for each unique Control library.
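
For example, for an experiment with four ChIP samples sharing two unique Control libraries, a vector of six values could be supplied (a sketch with illustrative fragment lengths):

## Not run: 
tamoxifen <- dba.count(tamoxifen, fragmentSize=c(200, 220, 210, 190, 180, 180))

## End(Not run)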

summits

unless set to FALSE, summit heights (read pileup) and locations will be calculated for each peak. The values can be retrieved using dba.peakset. The summits can also be used as a read score in the global binding matrix (see score).

If the value of summits is TRUE (or 0), the summits will be calculated but the peaksets will be unaffected. If the value is greater than zero, all consensus peaks will be re-centered around a consensus summit, with the value of summits indicating how many base pairs to include upstream and downstream of the summit (so all consensus peaks will be of the same width, namely 2 * summits + 1).

Note that if summits is not FALSE on the first call, the counting procedure will take about twice as long.
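
For example, summits=250 re-centers every consensus peak on its summit and counts reads in 501bp intervals (2 * 250 + 1), which is how the pre-counted tamoxifen_counts object used in the Examples was generated:

## Not run: 
data(tamoxifen_peaks)
# count reads in 501bp intervals centered on consensus summits
tamoxifen <- dba.count(tamoxifen, summits=250)

## End(Not run)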

filter

value to use for filtering intervals with low read counts. The filterFun will be applied to the counts for each interval, and if it returns a value below the filter value, the interval will be removed from further analysis. If peaks is NULL, will remove sites from existing DBA object without recounting. If filter is a vector of values, dba.count will return a vector of the same length, indicating how many intervals will be retained for each specified filter level.

NB: the filtering will be based on RPKM values. If bSubControl is FALSE, this is the RPKM value of the read counts (equivalent to score=DBA_SCORE_RPKM). If bSubControl is TRUE, this is the RPKM value of the control counts subtracted from the RPKM value of the read counts (equivalent to score=DBA_SCORE_RPKM_MINUS).

bRemoveDuplicates

logical indicating if duplicate reads (ones that map to exactly the same genomic position) should be removed. If TRUE, any location where multiple reads map will be counted as a single read. Note that when counting with summarizeOverlaps (bUseSummarizeOverlaps=TRUE), duplicates need to have already been marked in all of the BAM files. The built-in counting code may not correctly handle certain cases when bRemoveDuplicates is set to TRUE, including paired-end data and datasets where the read length may differ within a single BAM file; in these cases, see the bUseSummarizeOverlaps parameter.

bScaleControl

logical indicating if the Control reads should be scaled based on relative library sizes. If TRUE, and there are more reads in the Control library than in the ChIP library, the number of Control reads for each peak will be multiplied by a scaling factor determined by dividing the total number of reads in the ChIP library by the total number of reads in the Control library. If the scaled value is not an integer, it will be rounded up to the next highest integer.
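
For example, if the ChIP library contains 15 million reads and the Control library contains 30 million, the scaling factor is 15/30 = 0.5, so a peak with 25 overlapping Control reads is scaled to ceiling(12.5) = 13. The sketch below only illustrates this arithmetic; it is not DiffBind's internal code:

chip.total    <- 15e6   # total reads in the ChIP library
control.total <- 30e6   # total reads in the Control library
control.reads <- 25     # Control reads overlapping one peak
# scaled count, rounded up to the next integer when fractional
ceiling(control.reads * chip.total / control.total)   # 13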

bSubControl

logical indicating whether Control read counts are subtracted for each site in each sample. If there are more overlapping control reads than ChIP reads, the count will be set to the minCount value specified when dba.count was called, or zero if no value is specified.

If bSubControl is not explicitly specified, it will be set to TRUE unless a greylist has been applied (see dba.blacklist).

mapQCth

for filtering reads by mapping quality (mapq). Only alignments with mapping quality scores of at least this value will be included. Only applicable to bam files when bUseSummarizeOverlaps=FALSE; when using summarizeOverlaps, set DBA$config$scanbamparam appropriately to filter on quality scores.
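
For example, to require a mapping quality of at least 30 (a sketch; the ScanBamParam call assumes the mapqFilter argument of the Rsamtools ScanBamParam constructor):

## Not run: 
# built-in counting code: use mapQCth directly
tamoxifen <- dba.count(tamoxifen, mapQCth=30, bUseSummarizeOverlaps=FALSE)

# summarizeOverlaps: filter on mapping quality via DBA$config$scanbamparam
tamoxifen$config$scanbamparam <- Rsamtools::ScanBamParam(mapqFilter=30)
tamoxifen <- dba.count(tamoxifen, bUseSummarizeOverlaps=TRUE)

## End(Not run)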

filterFun

function that will be invoked for each interval with a vector of scores for each sample. Returns a score that will be evaluated against the filter value (only intervals with a score at least as high as filter will be kept). Default is max, indicating that at least one sample should have a score of at least filter; other useful values include sum (indicating that all the scores added together should be at least filter) and mean (setting a minimum mean normalized count level). Users can supply their own function as well.
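
For example, a user-supplied function could require that at least three samples reach the filter threshold (a sketch using the pre-counted data from the Examples section):

data(tamoxifen_counts)
# keep an interval only if its third-highest sample score is at least 1
third.highest <- function(x) sort(x, decreasing=TRUE)[3]
tamoxifen <- dba.count(tamoxifen, peaks=NULL, filter=1, filterFun=third.highest)
tamoxifen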

minCount

minimum read count value. Any interval with fewer than this many overlapping reads will be set to have this count. Also applies to scores.

bLog

logical indicating whether log2 of score should be used (only applies to DBA_SCORE_RPKM_FOLD and DBA_SCORE_READS_FOLD).

bUseSummarizeOverlaps

logical indicating that summarizeOverlaps should be used for counting instead of the built-in counting code. This option is slower but uses the more standard counting function. If TRUE, all read files must be BAM (.bam extension), with associated index files (.bam.bai extension). The fragmentSize parameter must be absent.

See notes for when the bRemoveDuplicates parameter is set to TRUE, where the built-in counting code may not correctly handle certain cases and bUseSummarizeOverlaps should be set to TRUE.

Five additional parameters for summarizeOverlaps may be specified in DBA$config:

DBA$config$yieldSize yieldSize indicating how many reads to process at one time; default is 5000000. The lower this value, the less memory will be used, but the more time it will take to complete the count operation.
DBA$config$intersectMode mode indicating which overlap algorithm to use; default is "IntersectionNotEmpty"
DBA$config$singleEnd logical indicating if reads are single end; if NULL, status will be automatically detected.
DBA$config$fragments logical indicating how unmatched reads are counted; default is FALSE
DBA$config$scanbamparam ScanBamParam object to pass to summarizeOverlaps. If present, bRemoveDuplicates is ignored.
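
For example, these values can be set on the DBA object before calling dba.count (a sketch with illustrative settings):

## Not run: 
data(tamoxifen_peaks)
tamoxifen$config$yieldSize     <- 1000000   # process fewer reads at a time (less memory)
tamoxifen$config$intersectMode <- "Union"   # alternative summarizeOverlaps mode
tamoxifen$config$singleEnd     <- TRUE      # reads are single-end
tamoxifen <- dba.count(tamoxifen, bUseSummarizeOverlaps=TRUE)

## End(Not run)
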

readFormat

Specify the file type of the read files, overriding the file extension. Possible values:

DBA_READS_DEFAULT use file extension (.bam, .bed, .bed.gz) to determine file type
DBA_READS_BAM assume the file type is BAM, regardless of the file extension
DBA_READS_BED assume the file type is BED (or zipped BED), regardless of the file extension.

Note that if readFormat is anything other than DBA_READS_DEFAULT, all the read files must be of the same file type.
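
For example, to count from read files that lack a standard extension by forcing them all to be treated as BAM (a sketch, assuming every read file in the sample sheet is in fact BAM):

## Not run: 
tamoxifen <- dba.count(tamoxifen, readFormat=DBA_READS_BAM)

## End(Not run)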

bParallel

if TRUE, use multicore to get counts for each read file in parallel

Value

DBA object with binding affinity matrix based on read count scores.

Author(s)

Rory Stark and Gordon Brown

See Also

dba.analyze

Examples

# These won't run unless you have the reads available in a BAM or BED file
data(tamoxifen_peaks)
## Not run: tamoxifen <- dba.count(tamoxifen)


# Count using a peakset made up of only peaks in all responsive MCF7 replicates
data(tamoxifen_peaks)
mcf7Common <- dba.overlap(tamoxifen,tamoxifen$masks$MCF7&tamoxifen$masks$Responsive)
## Not run: tamoxifen <- dba.count(tamoxifen,peaks=mcf7Common$inAll)
tamoxifen

#First make consensus peaksets from each set of replicates, 
#then derive master consensus set for counting from those
data(tamoxifen_peaks)
tamoxifen <- dba.peakset(tamoxifen,consensus = -DBA_REPLICATE)
## Not run: tamoxifen <- dba.count(tamoxifen, peaks=tamoxifen$masks$Consensus)
tamoxifen

# Change binding affinity scores
data(tamoxifen_counts)
dba.peakset(tamoxifen, bRetrieve=TRUE) # default: DBA_SCORE_NORMALIZED
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_READS)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_RPKM_MINUS)
dba.peakset(tamoxifen, bRetrieve=TRUE)

# Plot effect of a range of filter values and then apply filter 
data(tamoxifen_counts)
rate.max <- dba.count(tamoxifen, peaks=NULL, filter=0:250)
rate.sum <- dba.count(tamoxifen, peaks=NULL, filter=0:250,filterFun=sum)
plot(0:250,rate.max/rate.max[1],type='l',xlab="Filter Value",ylab="Proportion Retained Sites")
lines(0:250,rate.sum/rate.sum[1],col=2)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,filter=125,filterFun=sum)
tamoxifen

# Calculate summits
data(tamoxifen_counts) 
# pre-counted with summits=250 or 501bp intervals
as.numeric(dba.show(tamoxifen)$FRiP)
## Not run: tamoxifen <- dba.count(tamoxifen,peaks=NULL,summits=50)
# re-counted with summits=50 or 101bp intervals
as.numeric(dba.show(tamoxifen)$FRiP)
