Count reads in binding site intervals

Description

Counts reads in binding site intervals. Files must be one of bam, bed and gzip-compressed bed. File suffixes must be ".bam", ".bed", or ".bed.gz" respectively.

Usage

1
2
3
4
5
6
7
8
dba.count(DBA, peaks, minOverlap=2, score=DBA_SCORE_TMM_MINUS_FULL, bLog=FALSE,
          fragmentSize=DBA$config$fragmentSize, 
          summits, filter=0, bRemoveDuplicates=FALSE, bScaleControl=TRUE,
          mapQCth=DBA$config$mapQCth, 
          filterFun=max, 
          bCorPlot=DBA$config$bCorPlot, 
          bUseSummarizeOverlaps=FALSE, readFormat=DBA_READS_DEFAULT,
          bParallel=DBA$config$RunParallel) 

Arguments

DBA

DBA object

peaks

If GRanges, RangedData, dataframe, or matrix, this parameter contains the intervals to use for counting. If character string, it specifies a file containing the intervals to use (with the first three columns specifying chromosome, startpos, endpos).If missing or a mask, generates a consensus peakset using minOverlap parameter (after applying the mask if present). If NULL, the score, filter, and summits parameters are honored, updating the global binding matrix without re-counting in the cases of score and filter, and only counting after re-centering in the case of summits.

minOverlap

only include peaks in at least this many peaksets when generating consensus peakset (i.e. when peaks parameter is missing). If minOverlap is between zero and one, peak will be included from at least this proportion of peaksets.

score

which score to use in the binding affinity matrix. Note that all raw read counts are maintained for use by dba.analyze, regardless of how this is set. One of:

DBA_SCORE_READS raw read count for interval using only reads from ChIP
DBA_SCORE_READS_FOLD raw read count for interval from ChIP divided by read count for interval from control
DBA_SCORE_READS_MINUS raw read count for interval from ChIP minus read count for interval from control
DBA_SCORE_RPKM RPKM for interval using only reads from ChIP
DBA_SCORE_RPKM_FOLD RPKM for interval from ChIP divided by RPKM for interval from control
DBA_SCORE_TMM_READS_FULL TMM normalized (using edgeR), using ChIP read counts and Full Library size
DBA_SCORE_TMM_READS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts and Effective Library size
DBA_SCORE_TMM_MINUS_FULL TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size
DBA_SCORE_TMM_MINUS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size
DBA_SCORE_TMM_READS_FULL_CPM same as DBA_SCORE_TMM_READS_FULL, but reported in counts-per-million.
DBA_SCORE_TMM_READS_EFFECTIVE_CPM same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts-per-million.
DBA_SCORE_TMM_MINUS_FULL_CPM same as DBA_SCORE_TMM_MINUS_FULL, but reported in counts-per-million.
DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM same as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reported in counts-per-million.
DBA_SCORE_SUMMIT summit height (maximum read pileup value)
DBA_SCORE_SUMMIT_ADJ summit height (maximum read pileup value), normalized to relative library size
DBA_SCORE_SUMMIT_POS summit position (location of maximum read pileup)
bLog

logical indicating whether log2 of score should be used (only applies to DBA_SCORE_RPKM_FOLD and DBA_SCORE_READS_FOLD).

fragmentSize

This value will be used as the length of the reads. Each read will be extended from its endpoint along the appropriate strand by this many bases. If set to zero, the read size indicated in the BAM/BED file will be used. fragmentSize may also be a vector of values, one for each ChIP sample plus one for each unique Control library.

summits

if present, summit heights (read pileup) and locations will be calculated for each peak. The values can retrieved using dba.peakset. The summits can also be used as a read score in the global binding matrix (see score).

If the value of summits is TRUE (or 0), the summits will be calculated but the peaksets will be unaffected. If the value is greater than zero, all consensus peaks will be re-centered around a consensus summit, with the value of summits indicating how many base pairs to include upstream and downstream of the summit (so all consensus peaks will be of the same width, namely 2 * summits).

Note that if summits is greater than zero, the counting procedure will take twice as long, and bUseSummarizeOverlaps must be FALSE.

filter

value to use for filtering intervals with low read counts. The filterFun will be applied to the scores for each interval, and if it returns a value below the filter value, the interval will be removed from further analysis. If peaks is NULL, will remove sites from existing DBA object without recounting. If filter is a vector of values, dba.count will return a vector of the same length, indicating how many intervals will be retained for each specified filter level.

bRemoveDuplicates

logical indicating if duplicate reads (ones that map to exactly the same genomic position) should be removed. If TRUE, any location where multiple reads map will be counted as a single read. Note that if bLowMem is set, duplicates needs to have been already marked in all of the BAM files. The built-in counting code may not correctly handle certain cases when the bRemoveDuplicates parameter is set to TRUE. These cases include paired-end data and datasets where the read length may differ within a single BAM file. In these cases, see the bUseSummarizeOverlaps parameter.

bScaleControl

logical indicating if the Control reads should be scaled based on relative library sizes. If TRUE, and there are more reads in the Control library than in the ChIP library, the number of Control reads for each peak will be multiplied by a scaling factor determined by dividing the total number of reads in the ChIP library by the total number of reads in the Control library. If this value is not an integer, the number of Control reads for each peak will be the next highest integer.

mapQCth

for filtering by mapping quality (mapqc). Only alignments with mapping scores of at least this value will be included. Only applicable for bam files when bUseSummarizeOverlaps=FALSE (setting DBA$config$scanbamparam appropriately to filter on quality scores when using summarizeOverlaps.)

filterFun

function that will be invoked for each interval with a vector of scores for each sample. Returns a score that will be evaluated against the filter value (only intervals with a score at least as high as filter will be kept). Default is max, indicating that at least one sample should have a score of at least filter; other useful values include sum (indicating that all the scores added together should be at least filter) and mean (setting a minimum mean normalized count level). Users can supply their own function as well.

bCorPlot

logical indicating whether to plot a correlation heatmap for the counted data

bUseSummarizeOverlaps

logical indicating that summarizeOverlaps should be used for counting instead of the built-in counting code. This option is slower but uses the more standard counting function. If TRUE, all read files must be BAM (.bam extension), with associated index files (.bam.bai extension). The insertLength parameter must absent.

See notes for when the bRemoveDuplicates parameter is set to TRUE, where the built-in counting code may not correctly handle certain cases and bUseSummarizeOverlaps should be set to TRUE.

Five additional parameters for summarizeOverlaps may be specified in DBA$config:

DBA$config$yieldSize yieldSize indicating how many reads to process at one time; default is 5000000. The lower this value, the less memory will be used, but the more time it will take to complete the count operation.
DBA$config$intersectMode mode indicating which overlap algorithm to use; default is "IntersectionNotEmpty"
DBA$config$singleEnd logical indicating if reads are single end; default is TRUE
DBA$config$fragments logical indicating how unmatched reads are counted; default is FALSE
DBA$config$scanbamparam ScanBamParam object to pass to summarizeOverlaps. If present, bRemoveDuplicates is ignored.
readFormat

Specify the file type of the read files, over-riding the file extension. Possible values:

DBA_READS_DEFAULT use file extension (.bam, .bed, .bed.gz) to determine file type
DBA_READS_BAM assume the file type is BAM, regardless of the file extension
DBA_READS_BED assume the file type is BED (or zipped BED), regardless of the file extension.

Note that if readFormat is anything other than DBA_READS_DEFAULT, all the read files must be of the same file type.

bParallel

if TRUE, use multicore to get counts for each read file in parallel

Value

DBA object with binding affinity matrix based on read count scores.

Author(s)

Rory Stark and Gordon Brown

See Also

dba.analyze

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# These won't run unless you have the reads available in a BAM or BED file
data(tamoxifen_peaks)
## Not run: tamoxifen <- dba.count(tamoxifen)


# Count using a peakset made up of only peaks in all responsive MCF7 replicates
data(tamoxifen_peaks)
mcf7Common <- dba.overlap(tamoxifen,tamoxifen$masks$MCF7&tamoxifen$masks$Responsive)
## Not run: tamoxifen <- dba.count(tamoxifen,peaks=mcf7Common$inAll)
tamoxifen

#First make consensus peaksets from each set of replicates, 
#then derive master consensus set for counting from those
data(tamoxifen_peaks)
tamoxifen <- dba.peakset(tamoxifen,consensus = -DBA_REPLICATE)
## Not run: tamoxifen <- dba.count(tamoxifen, peaks=tamoxifen$masks$Consensus)
tamoxifen

# Change binding affinity scores
data(tamoxifen_counts)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_READS)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_RPKM_FOLD)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_TMM_MINUS_FULL)
dba.peakset(tamoxifen, bRetrieve=TRUE)

# Plot effect of a range of filter values and then apply filter 
data(tamoxifen_counts)
rate.max <- dba.count(tamoxifen, peaks=NULL, filter=0:250)
rate.sum <- dba.count(tamoxifen, peaks=NULL, filter=0:250,filterFun=sum)
plot(0:250,rate.max/rate.max[1],type='l',xlab="Filter Value",ylab="Proportion Retained Sites")
lines(0:250,rate.sum/rate.sum[1],col=2)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,filter=125,filterFun=sum)
tamoxifen