Identify clusters in a collection of positions or intervals

Description

This function uses tools in the intervals package to quickly identify clusters: contiguous collections of positions or intervals, each separated from its neighbors on either side by no more than a given distance.

Usage

## S4 method for signature 'numeric'
clusters(x, w, which = FALSE, check_valid = TRUE)

## S4 method for signature 'Intervals_virtual'
clusters(x, w, which = FALSE, check_valid = TRUE)

Arguments

x

A numeric vector of positions, or an object inheriting from class "Intervals_virtual".

w

Maximum permitted distance between a cluster member and its neighbors to either side.

which

Should indices into the x object be returned instead of actual subsets?

check_valid

Should validObject be called before passing to compiled code? Also see interval_overlap and reduce.

Details

A cluster is defined to be a maximal collection, with at least two members, of components of x which are separated by no more than w. Note that when x represents intervals, an interval must actually contain a point at distance w or less from a neighboring interval to be assigned to the same cluster. If the ends of both intervals in question are open and exactly at distance w, they will not be deemed to be cluster co-members. See the example below.
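The endpoint rule can be spelled out as a small predicate. This is only a sketch of one reading of the paragraph above, not code from the intervals package: co_members is a hypothetical helper, and it assumes the two intervals are disjoint with the first lying entirely to the left of the second.

```r
# Hypothetical helper, not part of the intervals package. Decides whether two
# disjoint intervals would be cluster co-members, given the right end of the
# left-hand interval and the left end of the right-hand interval.
co_members <- function(right_end, right_closed, left_end, left_closed, w) {
  gap <- left_end - right_end
  if (gap < w) return(TRUE)    # strictly closer than w: always co-members
  if (gap > w) return(FALSE)   # strictly farther than w: never co-members
  # Gap is exactly w: at least one of the two facing endpoints must be closed
  right_closed || left_closed
}

co_members(2, FALSE, 3, TRUE,  w = 1)  # [1,2) and [3,4)
co_members(2, FALSE, 3, FALSE, w = 1)  # (1,2) and (3,4)
```

Under this reading the first call yields TRUE and the second FALSE, since only the second pair has both facing endpoints open at distance exactly w.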

Value

A list whose components are the clusters. Each component is thus a subset of x, or, if which == TRUE, a vector of indices into the x object. (The indices correspond to row numbers when x is of class "Intervals_virtual".)
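To make the shape of the return value concrete, here is a base-R sketch for the numeric case. clusters_sketch is a hypothetical stand-in written for illustration; it does not use the package's compiled code, and the ordering of elements within each cluster returned by the real clusters method may differ.

```r
# Hypothetical base-R stand-in for the numeric method, for illustration only.
clusters_sketch <- function(x, w, which = FALSE) {
  if (length(x) < 2) return(list())
  ord <- order(x)
  sorted <- x[ord]
  # Start a new group wherever the gap to the previous sorted value exceeds w
  group <- cumsum(c(TRUE, diff(sorted) > w))
  idx <- split(ord, group)
  # A cluster needs at least two members
  idx <- unname(idx[lengths(idx) >= 2])
  if (which) idx else lapply(idx, function(i) x[i])
}

x <- c(50, 1, 3, 20, 22, 4)
clusters_sketch(x, w = 2)                # subsets of x
clusters_sketch(x, w = 2, which = TRUE)  # indices into x
```

With which = FALSE each list component holds the clustered values themselves; with which = TRUE it holds their positions in the original, unsorted x.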

Note

Implementation is by a call to reduce followed by a call to interval_overlap. The clusters methods are included to illustrate the utility of the core functions in the intervals package, although they are also useful in their own right.

Examples

library( intervals )

# Numeric method
w <- 20
x <- sample( 1000, 100 )
c1 <- clusters( x, w )

# Check results
sapply( c1, function( x ) all( diff(x) <= w ) )
d1 <- diff( sort(x) )
all.equal(
          as.numeric( d1[ d1 <= w ] ),
          unlist( sapply( c1, diff ) )
          )

# Intervals method, starting with a reduced object so we know that all
# intervals are disjoint and sorted.
B <- 100
left <- runif( B, 0, 1e4 )
right <- left + rexp( B, rate = 1/10 )
y <- reduce( Intervals( cbind( left, right ) ) )

gaps <- function(x) x[-1,1] - x[-nrow(x),2]
hist( gaps(y), breaks = 30 )

w <- 200
c2 <- clusters( y, w )
head( c2 )
sapply( c2, function(x) all( gaps(x) <= w ) )

# Clusters and open end points. See "Details".
z <- Intervals(
               matrix( 1:4, 2, 2, byrow = TRUE ),
               closed = c( TRUE, FALSE )
               )
z
clusters( z, 1 )
closed(z)[1] <- FALSE
z
clusters( z, 1 )
