coverage-methods: Coverage of a set of ranges

Description Usage Arguments Value Author(s) See Also Examples

Description

For each position in the space underlying a set of ranges, counts the number of ranges that cover it.

Usage

1
2
3
4
5
6
7
8
9
coverage(x, shift=0L, width=NULL, weight=1L, ...)

## S4 method for signature 'IntegerRanges'
coverage(x, shift=0L, width=NULL, weight=1L,
            method=c("auto", "sort", "hash", "naive"))

## S4 method for signature 'IntegerRangesList'
coverage(x, shift=0L, width=NULL, weight=1L,
            method=c("auto", "sort", "hash", "naive"))

Arguments

x

A IntegerRanges, Views, or IntegerRangesList object. See ?`coverage-methods` in the GenomicRanges package for coverage methods for other objects.

shift, weight

shift specifies how much each range in x should be shifted before the coverage is computed. A positive shift value will shift the corresponding range in x to the right, and a negative value to the left. NAs are not allowed.

weight assigns a weight to each range in x.

  • If x is an IntegerRanges or Views object: each of these arguments must be an integer or numeric vector parallel to x (will get recycled if necessary). Alternatively, each of these arguments can also be specified as a single string naming a metadata column in x (i.e. a column in mcols(x)) to be used as the shift (or weight) vector. Note that when x is an IPos object, each of these arguments can only be a single number.

  • If x is an IntegerRangesList object: each of these arguments must be a numeric vector or list-like object of the same length as x (will get recycled if necessary). If it's a numeric vector, it's first turned into a list with as.list. After recycling, each list element shift[[i]] (or weight[[i]]) must be an integer or numeric vector parallel to x[[i]] (will get recycled if necessary).

If weight is an integer vector or list-like object of integer vectors, the coverage vector(s) will be returned as integer-Rle object(s). If it's a numeric vector or list-like object of numeric vectors, the coverage vector(s) will be returned as numeric-Rle object(s).

width

Specifies the length of the returned coverage vector(s).

  • If x is an IntegerRanges object: width must be NULL (the default), an NA, or a single non-negative integer. After being shifted, the ranges in x are always clipped on the left to keep only their positive portion i.e. their intersection with the [1, +inf) interval. If width is a single non-negative integer, then they're also clipped on the right to keep only their intersection with the [1, width] interval. In that case coverage returns a vector of length width. Otherwise, it returns a vector that extends to the last position in the underlying space covered by the shifted ranges.

  • If x is a Views object: Same as for a IntegerRanges object, except that, if width is NULL then it's treated as if it was length(subject(x)).

  • If x is a IntegerRangesList object: width must be NULL or an integer vector parallel to x (i.e. with one element per list element in x). If not NULL, the vector must contain NAs or non-negative integers and it will get recycled to the length of x if necessary. If NULL, it is replaced with NA and recycled to the length of x. Finally width[i] is used to compute the coverage vector for x[[i]] and is therefore treated like explained above (when x is a IntegerRanges object).

method

If method is set to "sort", then x is sorted previous to the calculation of the coverage. If method is set to "hash" or "naive", then x is hashed directly to a vector of length width without previous sorting.

The "hash" method is faster than the "sort" method when x is large (i.e. contains a lot of ranges). When x is small and width is big (e.g. x represents a small set of reads aligned to a big chromosome), then method="sort" is faster and uses less memory than method="hash".

The "naive" method is a slower version of the "hash" method that has the advantage of avoiding floating point artefacts in the no-coverage regions of the numeric-Rle object returned by coverage() when the weights are supplied as a numeric vector of type double. See "FLOATING POINT ARITHMETIC CAN BRING A SURPRISE" section in the Examples below for more information.

Using method="auto" selects between the "sort" and "hash" methods, picking the one that is predicted to be faster based on length(x) and width.

...

Further arguments to be passed to or from other methods.

Value

If x is a IntegerRanges or Views object: An integer- or numeric-Rle object depending on whether weight is an integer or numeric vector.

If x is a IntegerRangesList object: An RleList object with one coverage vector per list element in x, and with x names propagated to it. The i-th coverage vector can be either an integer- or numeric-Rle object, depending on the type of weight[[i]] (after weight has gone thru as.list and recycling, like described previously).

Author(s)

H. Pagès and P. Aboyoun

See Also

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
## ---------------------------------------------------------------------
## A. COVERAGE OF AN IRanges OBJECT
## ---------------------------------------------------------------------
x <- IRanges(start=c(-2L, 6L, 9L, -4L, 1L, 0L, -6L, 10L),
             width=c( 5L, 0L, 6L,  1L, 4L, 3L,  2L,  3L))
coverage(x)
coverage(x, shift=7)
coverage(x, shift=7, width=27)
coverage(x, shift=c(-4, 2))  # 'shift' gets recycled
coverage(x, shift=c(-4, 2), width=12)
coverage(x, shift=-max(end(x)))

coverage(restrict(x, 1, 10))
coverage(reduce(x), shift=7)
coverage(gaps(shift(x, 7), start=1, end=27))

## With weights:
coverage(x, weight=as.integer(10^(0:7)))  # integer-Rle
coverage(x, weight=c(2.8, -10))  # numeric-Rle, 'shift' gets recycled

## ---------------------------------------------------------------------
## B. FLOATING POINT ARITHMETIC CAN BRING A SURPRISE
## ---------------------------------------------------------------------
## Please be aware that rounding errors in floating point arithmetic can
## lead to some surprising results when computing a weighted coverage:
y <- IRanges(c(4, 10), c(18, 15))
w1 <- 0.958
w2 <- 1e4
cvg <- coverage(y, width=100, weight=c(w1, w2))
cvg  # non-zero coverage at positions 19 to 100!

## This is an artefact of floating point arithmetic and the algorithm
## used to compute the weighted coverage. It can be observed with basic
## floating point arithmetic:
w1 + w2 - w2 - w1  # very small non-zero value!

## Note that this only happens with the "sort" and "hash" methods but
## not with the "naive" method:
coverage(y, width=100, weight=c(w1, w2), method="sort")
coverage(y, width=100, weight=c(w1, w2), method="hash")
coverage(y, width=100, weight=c(w1, w2), method="naive")

## These very small non-zero coverage values in the no-coverage regions
## of the numeric-Rle object returned by coverage() are not always
## present. But when they are, they can cause problems downstream or
## in unit tests. For example downstream code that relies on things
## like 'cvg != 0' to find regions with coverage won't work properly.
## This can be mitigated either by selecting the "naive" method (be aware
## that this can slow down things significantly) or by "cleaning" 'cvg'
## first e.g. with something like 'cvg <- round(cvg, digits)' where
## 'digits' is a carefully chosen number of digits:
cvg <- round(cvg, digits=3)

## Note that this rounding will also have the interesting side effect of
## reducing the memory footprint of the Rle object in general (because
## some runs might get merged into a single run as a consequence of the
## rounding).

## ---------------------------------------------------------------------
## C. COVERAGE OF AN IPos OBJECT
## ---------------------------------------------------------------------
pos_runs <- IRanges(c(1, 5, 9), c(10, 8, 15))
ipos <- IPos(pos_runs)
coverage(ipos)

## ---------------------------------------------------------------------
## D. COVERAGE OF AN IRangesList OBJECT
## ---------------------------------------------------------------------
x <- IRangesList(A=IRanges(3*(4:-1), width=1:3), B=IRanges(2:10, width=5))
cvg <- coverage(x)
cvg

stopifnot(identical(cvg[[1]], coverage(x[[1]])))
stopifnot(identical(cvg[[2]], coverage(x[[2]])))

coverage(x, width=c(50, 9))
coverage(x, width=c(NA, 9))
coverage(x, width=9)  # 'width' gets recycled

## Each list element in 'shift' and 'weight' gets recycled to the length
## of the corresponding element in 'x'.
weight <- list(as.integer(10^(0:5)), -0.77)
cvg2 <- coverage(x, weight=weight)
cvg2  # 1st coverage vector is an integer-Rle, 2nd is a numeric-Rle

identical(mapply(coverage, x=x, weight=weight), as.list(cvg2))

## ---------------------------------------------------------------------
## E. SOME MATHEMATICAL PROPERTIES OF THE coverage() FUNCTION
## ---------------------------------------------------------------------

## PROPERTY 1: The coverage vector is not affected by reordering the
## input ranges:
set.seed(24)
x <- IRanges(sample(1000, 40, replace=TRUE), width=17:10)
cvg0 <- coverage(x)
stopifnot(identical(coverage(sample(x)), cvg0))

## Of course, if the ranges are shifted and/or assigned weights, then
## this doesn't hold anymore, unless the 'shift' and/or 'weight'
## arguments are reordered accordingly.

## PROPERTY 2: The coverage of the concatenation of 2 IntegerRanges
## objects 'x' and 'y' is the sum of the 2 individual coverage vectors:
y <- IRanges(sample(-20:280, 36, replace=TRUE), width=28)
stopifnot(identical(coverage(c(x, y), width=100),
                    coverage(x, width=100) + coverage(y, width=100)))

## Note that, because adding 2 vectors in R recycles the shortest to
## the length of the longest, the following is generally FALSE:
identical(coverage(c(x, y)), coverage(x) + coverage(y))  # FALSE

## It would only be TRUE if the 2 coverage vectors that we add had the
## same length, which would only happen by chance. By using the same
## 'width' value when we computed the 2 coverages previously, we made
## sure they had the same length.

## Because of properties 1 & 2, we have:
x1 <- x[c(TRUE, FALSE)]  # pick up 1st, 3rd, 5th, etc... ranges
x2 <- x[c(FALSE, TRUE)]  # pick up 2nd, 4th, 6th, etc... ranges
cvg1 <- coverage(x1, width=100)
cvg2 <- coverage(x2, width=100)
stopifnot(identical(coverage(x, width=100), cvg1 + cvg2))

## PROPERTY 3: Multiplying the weights by a scalar has the effect of
## multiplying the coverage vector by the same scalar:
weight <- runif(40)
cvg3 <- coverage(x, weight=weight)
stopifnot(all.equal(coverage(x, weight=-2.68 * weight), -2.68 * cvg3))

## Because of properties 1 & 2 & 3, we have:
stopifnot(identical(coverage(x, width=100, weight=c(5L, -11L)),
                    5L * cvg1 - 11L * cvg2))

## PROPERTY 4: Using the sum of 2 weight vectors produces the same
## result as using the 2 weight vectors separately and summing the
## 2 results:
weight2 <- 10 * runif(40) + 3.7
stopifnot(all.equal(coverage(x, weight=weight + weight2),
                    cvg3 + coverage(x, weight=weight2)))

## PROPERTY 5: Repeating any input range N number of times is
## equivalent to multiplying its assigned weight by N:
times <- sample(0:10L, length(x), replace=TRUE)
stopifnot(all.equal(coverage(rep(x, times), weight=rep(weight, times)),
                    coverage(x, weight=weight * times)))

## In particular, if 'weight' is not supplied:
stopifnot(identical(coverage(rep(x, times)), coverage(x, weight=times)))

## PROPERTY 6: If none of the input range actually gets clipped during
## the "shift and clip" process, then:
##
##     sum(cvg) = sum(width(x) * weight)
##
stopifnot(sum(cvg3) == sum(width(x) * weight))

## In particular, if 'weight' is not supplied:
stopifnot(sum(cvg0) == sum(width(x)))

## Note that this property is sometimes used in the context of a
## ChIP-Seq analysis to estimate "the number of reads in a peak", that
## is, the number of short reads that belong to a peak in the coverage
## vector computed from the genomic locations (a.k.a. genomic ranges)
## of the aligned reads. Because of property 6, the number of reads in
## a peak is approximately the area under the peak divided by the short
## read length.

## PROPERTY 7: If 'weight' is not supplied, then disjoining or reducing
## the ranges before calling coverage() has the effect of "shaving" the
## coverage vector at elevation 1:
table(cvg0)
shaved_cvg0 <- cvg0
runValue(shaved_cvg0) <- pmin(runValue(cvg0), 1L)
table(shaved_cvg0)

stopifnot(identical(coverage(disjoin(x)), shaved_cvg0))
stopifnot(identical(coverage(reduce(x)), shaved_cvg0))

## ---------------------------------------------------------------------
## F. SOME SANITY CHECKS
## ---------------------------------------------------------------------
dummy_coverage <- function(x, shift=0L, width=NULL)
{
    y <- IRanges:::unlist_as_integer(shift(x, shift))
    if (is.null(width))
        width <- max(c(0L, y))
    Rle(tabulate(y,  nbins=width))
}

check_real_vs_dummy <- function(x, shift=0L, width=NULL)
{
    res1 <- coverage(x, shift=shift, width=width)
    res2 <- dummy_coverage(x, shift=shift, width=width)
    stopifnot(identical(res1, res2))
}
check_real_vs_dummy(x)
check_real_vs_dummy(x, shift=7)
check_real_vs_dummy(x, shift=7, width=27)
check_real_vs_dummy(x, shift=c(-4, 2))
check_real_vs_dummy(x, shift=c(-4, 2), width=12)
check_real_vs_dummy(x, shift=-max(end(x)))

## With a set of distinct single positions:
x3 <- IRanges(sample(50000, 20000), width=1)
stopifnot(identical(sort(start(x3)), which(coverage(x3) != 0L)))

Example output

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package:BiocGenericsThe following objects are masked frompackage:parallel:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked frompackage:stats:

    IQR, mad, sd, var, xtabs

The following objects are masked frompackage:base:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package:S4VectorsThe following object is masked frompackage:base:

    expand.grid

integer-Rle of length 14 with 6 runs
  Lengths: 2 2 4 1 3 2
  Values : 3 1 0 1 2 1
integer-Rle of length 21 with 10 runs
  Lengths: 3 1 2 1 2 2 4 1 3 2
  Values : 1 0 1 2 3 1 0 1 2 1
integer-Rle of length 27 with 11 runs
  Lengths: 3 1 2 1 2 2 4 1 3 2 6
  Values : 1 0 1 2 3 1 0 1 2 1 0
integer-Rle of length 14 with 4 runs
  Lengths: 1 9 1 3
  Values : 0 1 0 1
integer-Rle of length 12 with 4 runs
  Lengths: 1 9 1 1
  Values : 0 1 0 1
integer-Rle of length 0 with 0 runs
  Lengths: 
  Values : 
integer-Rle of length 10 with 5 runs
  Lengths: 2 2 4 1 1
  Values : 3 1 0 1 2
integer-Rle of length 21 with 5 runs
  Lengths: 3 1 7 4 6
  Values : 1 0 1 0 1
integer-Rle of length 27 with 6 runs
  Lengths: 3 1 7 4 6 6
  Values : 0 1 0 1 0 1
integer-Rle of length 14 with 6 runs
  Lengths:        2        2        4        1        3        2
  Values :   110001    10000        0      100 10000100      100
numeric-Rle of length 14 with 6 runs
  Lengths:    2    2    4    1    3    2
  Values : -4.4  2.8  0.0  2.8 -7.2  2.8
numeric-Rle of length 100 with 5 runs
  Lengths:           3           6           6           3          82
  Values : 0.00000e+00 9.58000e-01 1.00010e+04 9.58000e-01 5.38458e-13
[1] 5.384582e-13
numeric-Rle of length 100 with 5 runs
  Lengths:           3           6           6           3          82
  Values : 0.00000e+00 9.58000e-01 1.00010e+04 9.58000e-01 5.38458e-13
numeric-Rle of length 100 with 5 runs
  Lengths:           3           6           6           3          82
  Values : 0.00000e+00 9.58000e-01 1.00010e+04 9.58000e-01 5.38458e-13
numeric-Rle of length 100 with 5 runs
  Lengths:         3         6         6         3        82
  Values :     0.000     0.958 10000.958     0.958     0.000
integer-Rle of length 15 with 3 runs
  Lengths: 4 6 5
  Values : 1 2 1
RleList of length 2
$A
integer-Rle of length 12 with 7 runs
  Lengths: 1 1 1 2 5 1 1
  Values : 1 0 1 0 1 0 1

$B
integer-Rle of length 14 with 10 runs
  Lengths: 1 1 1 1 1 5 1 1 1 1
  Values : 0 1 2 3 4 5 4 3 2 1

RleList of length 2
$A
integer-Rle of length 50 with 8 runs
  Lengths:  1  1  1  2  5  1  1 38
  Values :  1  0  1  0  1  0  1  0

$B
integer-Rle of length 9 with 6 runs
  Lengths: 1 1 1 1 1 4
  Values : 0 1 2 3 4 5

RleList of length 2
$A
integer-Rle of length 12 with 7 runs
  Lengths: 1 1 1 2 5 1 1
  Values : 1 0 1 0 1 0 1

$B
integer-Rle of length 9 with 6 runs
  Lengths: 1 1 1 1 1 4
  Values : 0 1 2 3 4 5

RleList of length 2
$A
integer-Rle of length 9 with 5 runs
  Lengths: 1 1 1 2 4
  Values : 1 0 1 0 1

$B
integer-Rle of length 9 with 6 runs
  Lengths: 1 1 1 1 1 4
  Values : 0 1 2 3 4 5

RleList of length 2
$A
integer-Rle of length 12 with 8 runs
  Lengths:     1     1     1     2     3     2     1     1
  Values : 10000     0  1000     0   100    10     0     1

$B
numeric-Rle of length 14 with 10 runs
  Lengths:     1     1     1     1     1     5     1     1     1     1
  Values :  0.00 -0.77 -1.54 -2.31 -3.08 -3.85 -3.08 -2.31 -1.54 -0.77

[1] TRUE
[1] FALSE
Warning message:
In coverage(x) + coverage(y) :
  longer object length is not a multiple of shorter object length
cvg0
  0   1   2   3   4   5 
554 268  98  13   8   1 
shaved_cvg0
  0   1 
554 388 

IRanges documentation built on Dec. 14, 2020, 2 a.m.