coalesce: Create an index that groups unique values together

coalesce    R Documentation

Description

coalesce makes sure that a given index vector is coalesced, i.e., identical values are grouped into contiguous blocks. This can be used as a much faster alternative to sort.list when the goal is to group identical values, but not necessarily in a pre-defined order. The algorithm is linear in the length of the vector.

Usage

  coalesce(x)

Arguments

x

character, integer or real vector to coalesce

Details

The current implementation takes two passes through the vector. In the first pass it creates a hash table for the values of x, counting the occurrences of each value in the process. In the second pass it assigns an index to every element based on the index stored in the hash table.
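The two passes can be illustrated with a minimal base-R sketch. It is not the package's implementation: coalesce_sketch is a hypothetical name, match() against unique() stands in for the hash table, and the order of elements within a group is a property of this sketch only.

```r
# Hypothetical pure-R sketch of a two-pass, counting-based grouping.
# It only illustrates the idea; the package itself is implemented differently.
coalesce_sketch <- function(x) {
  # Pass 1: map each element to a group id (groups numbered by first
  # occurrence) and count how many elements fall into each group.
  grp <- match(x, unique(x))
  counts <- tabulate(grp, nbins = max(grp, 0L))
  # First output slot of each group: cumulative counts shifted by one.
  nxt <- cumsum(c(1L, counts))[seq_along(counts)]
  # Pass 2: write each element's index into the next free slot of its group.
  out <- integer(length(x))
  for (k in seq_along(x)) {
    g <- grp[k]
    out[nxt[g]] <- k
    nxt[g] <- nxt[g] + 1L
  }
  out
}

x <- c("b", "a", "b", "c", "a")
p <- coalesce_sketch(x)
x[p]   # identical values are now contiguous
```

Both passes touch every element once, which is where the linear running time comes from.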

The order of the groups of unique values is defined by the first occurrence of each unique value, hence it is identical to the order of unique.
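The group order can be checked on a small vector. This is a sketch that assumes the fastmatch package is installed; the input values are made up for illustration.

```r
library(fastmatch)

x <- c(3L, 1L, 3L, 2L, 1L)
g <- x[coalesce(x)]

# The groups appear in order of first occurrence, not in sorted order.
identical(unique(g), unique(x))

# Each unique value forms exactly one contiguous run.
length(rle(g)$values) == length(unique(x))
```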

One common use of coalesce is to allow the use of arbitrary vectors in ctapply via ctapply(x[coalesce(x)], ...).
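A small sketch of this pattern follows, assuming fastmatch is installed. The vectors vals and keys are invented for illustration. The permutation is applied to both the values and the keys so that they stay aligned.

```r
library(fastmatch)

vals <- c(10, 1, 20, 2, 30)
keys <- c("a", "b", "a", "b", "a")   # identical keys are not contiguous

p <- coalesce(keys)                  # permutation that groups identical keys
res <- ctapply(vals[p], keys[p], sum)

# Compare with tapply(), which does not need contiguous groups.
res[order(names(res))]
tapply(vals, keys, sum)
```

Computing the permutation once and reusing it for both vectors avoids a second call to coalesce.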

Value

Integer vector with the resulting permutation. x[coalesce(x)] gives the elements of x reordered so that identical values are contiguous.

Author(s)

Simon Urbanek

See Also

unique, sort.list, ctapply

Examples

library(fastmatch)

i = rnorm(2e6)
names(i) = as.integer(rnorm(2e6))
## compare sorting and coalesce
system.time(o <- i[order(names(i))])
system.time(o <- i[coalesce(names(i))])

## a fairer comparison that takes the coalesce time (and the copy) into account
system.time(tapply(i, names(i), sum))
system.time({ o <- i[coalesce(names(i))]; ctapply(o, names(o), sum) })

## in fact, using ctapply() on a dummy vector is faster than table() ...
## believe it or not ... (and that is actually wasteful, since coalesce
## already computed the table internally anyway ...)
ftable <- function(x) {
   t <- ctapply(rep(0L, length(x)), x[coalesce(x)], length)
   t[sort.list(names(t))]
}
system.time(table(names(i)))
system.time(ftable(names(i)))

fastmatch documentation built on Aug. 18, 2023, 9:07 a.m.