Finding Unique or Duplicated Rows or Columns for Atomic Matrices

Description

These S3 methods are alternative, typically much faster, implementations of their base-package counterparts for atomic matrices.

unique.matrix returns a matrix with duplicated rows (or columns) removed.

duplicated.matrix returns a logical vector indicating which rows (or columns) are duplicated.

anyDuplicated.matrix returns an integer indicating the index of the first duplicate row (or column) if any, and 0L otherwise.

Usage

## S3 method for class 'matrix'
unique(x, incomparables = FALSE, MARGIN = 1,
       fromLast = FALSE, signif = Inf, ...)
## S3 method for class 'matrix'
duplicated(x, incomparables = FALSE, MARGIN = 1,
           fromLast = FALSE, signif = Inf, ...)
## S3 method for class 'matrix'
anyDuplicated(x, incomparables = FALSE, MARGIN = 1,
              fromLast = FALSE, signif = Inf, ...)

Arguments

x

an atomic matrix of mode "numeric", "integer", "logical", "complex", "character" or "raw". When x is not atomic or is not a matrix, the corresponding method in the base package (e.g., base::unique.matrix) is called instead.

incomparables

a vector of values that cannot be compared, as in base::unique.matrix. The code in the uniqueAtomMat package is used only when incomparables=FALSE; otherwise, the base version is called.

fromLast

a logical scalar indicating whether duplication should be considered from the reverse side (i.e., starting from the last row or column), as in base::unique.matrix.

...

arguments for particular methods.

MARGIN

a numeric scalar, the matrix margin to be held fixed, as in apply. For unique.matrix, only MARGIN=1 and MARGIN=2 are allowed; for duplicated.matrix and anyDuplicated.matrix, MARGIN=0 is also allowed. For all other cases, the implementation in the base package will be called.
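As a small illustration of the MARGIN values described above, the following sketch uses only the base methods on a hypothetical toy matrix m. The variable names are illustrative; for input like this, the uniqueAtomMat methods are documented to agree with base.

```r
# Toy integer matrix (hypothetical data): columns 1 and 3 are identical.
m <- rbind(c(1L, 2L, 1L),
           c(3L, 4L, 3L))

dup_cols  <- duplicated(m, MARGIN = 2)     # one flag per column
uniq_cols <- unique(m, MARGIN = 2)         # drops the repeated column
first_dup <- anyDuplicated(m, MARGIN = 2)  # index of the first duplicated column
dup_elem  <- duplicated(m, MARGIN = 0)     # element-wise flags, same shape as m

print(dup_cols)
print(uniq_cols)
print(first_dup)
print(dup_elem)
```

With MARGIN=0 each element is its own key, so the result keeps the dimensions of m rather than being one flag per row or column.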

signif

a numeric scalar, applicable only to numeric or complex x. If signif=NULL, x is first passed to the signif function with the number of significant digits set to the C constant DBL_DIG, as explained in as.character. If signif=Inf (the default), x is left untouched before duplicates are sought. Any other number specifies the number of significant digits passed to the signif function.
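To see why rounding before comparison can matter, here is a minimal base-R sketch on a plain numeric vector. It does not call the package; it only mimics the described preprocessing by applying signif by hand, and it assumes DBL_DIG is 15, as on IEEE 754 platforms.

```r
a <- 0.1 + 0.2   # not exactly the same double as 0.3
b <- 0.3
v <- c(b, a)

# Exact comparison, analogous to the default signif = Inf.
exact <- duplicated(v)

# Round to 15 significant digits first, analogous to signif = NULL
# (assuming DBL_DIG == 15).
rounded <- duplicated(signif(v, 15))

print(exact)
print(rounded)
```

Under exact comparison the two values are distinct; after rounding to 15 significant digits the second one is flagged as a duplicate.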

Details

These S3 methods are alternative implementations of counterparts in the base package for atomic matrices (i.e., double, integer, logical, character, complex and raw), built directly on the C++98 Standard Template Library (STL) std::set or the C++11 STL std::unordered_set. The implementation treats the whole row (or column) vector as the key, without the intermediate steps of converting the mode to character or collapsing the values into a scalar, as done in base. On systems where `R CMD config CXX1X` is empty, the C++98 STL std::set is used; it is typically implemented as a self-balancing tree (usually a red-black tree) and takes O(n log n) time to find all duplicates, where n=dim(x)[MARGIN]. On systems where `R CMD config CXX1X` is non-empty, the C++11 STL std::unordered_set is used, with average O(n) and worst-case O(n^2) performance.
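The idea of comparing whole rows as ordered keys can be approximated in plain R. The sketch below is not the package's C++ code: it swaps the std::set for a lexicographic sort followed by a neighbour comparison, which is also O(n log n). The function name dup_rows_sorted is hypothetical, and the sketch assumes a matrix with at least one column and no missing values.

```r
# Hypothetical helper: flag duplicated rows by sorting whole rows
# lexicographically and comparing each row with its predecessor.
# Assumes no NA/NaN values and at least one column.
dup_rows_sorted <- function(x) {
  stopifnot(is.matrix(x), is.atomic(x), ncol(x) >= 1L)
  n <- nrow(x)
  if (n < 2L) return(logical(n))
  # order() breaks ties column by column and leaves remaining ties in
  # their original order, so the first occurrence comes first.
  ord <- do.call(order, lapply(seq_len(ncol(x)), function(j) x[, j]))
  xs <- x[ord, , drop = FALSE]
  # A sorted row is a duplicate if it equals the row just before it.
  same_as_prev <- rowSums(xs[-1L, , drop = FALSE] != xs[-n, , drop = FALSE]) == 0
  out <- logical(n)
  out[ord[-1L]] <- same_as_prev
  out
}

x <- rbind(c(1, 2), c(3, 4), c(1, 2), c(3, 4), c(5, 6))
flags <- dup_rows_sorted(x)
print(flags)
```

No row is converted to a character string at any point; each comparison works on the original values, which is the property the paragraph above emphasises.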

Missing values are regarded as equal, but NaN is not equal to NA_real_.
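Base R's vector methods follow the same convention for doubles, which allows a quick illustration without the package. This is only a sketch on a plain vector; the matrix methods documented here are stated to apply the same rule to whole rows or columns.

```r
v <- c(NA_real_, NaN, NA_real_, NaN)

dup <- duplicated(v)             # NA matches NA, NaN matches NaN
n_distinct <- length(unique(v))  # NA and NaN count as two distinct values

print(dup)
print(n_distinct)
```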

Further, in contrast to the base counterparts, characters are compared directly based on their internal representations; that is, encodings are not taken into account. Complex values are compared by their real and imaginary parts separately.
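The practical consequence for character data can be sketched in base R. The example below builds the same text in two encodings: base comparison treats the strings as equal after translation, while their underlying bytes differ. Based on the description above, a comparison of internal representations would presumably treat such strings as distinct; the sketch does not call the package to confirm this.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"   # mark the byte 0xE7 as Latin-1
y <- enc2utf8(x)          # same text, re-encoded as UTF-8

same_text  <- (x == y)                                   # base compares after translation
same_bytes <- identical(charToRaw(x), charToRaw(y))      # raw bytes differ

print(same_text)
print(same_bytes)
```

If a character matrix may mix encodings, converting it with enc2utf8 beforehand is one way to make byte-wise and text-wise comparison agree.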

Value

unique.matrix returns a matrix with duplicated rows (if MARGIN=1) or columns (if MARGIN=2) removed.

duplicated.matrix returns a logical vector indicating which rows (if MARGIN=1) or columns (if MARGIN=2) are duplicated.

anyDuplicated.matrix returns an integer indicating the index of the first (if fromLast=FALSE) or last (if fromLast=TRUE) duplicate row (if MARGIN=1) or column (if MARGIN=2) if any, and 0L otherwise.
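The effect of fromLast on the returned values can be illustrated with the base methods on a hypothetical three-row matrix; the uniqueAtomMat methods are documented to return the same results for such input.

```r
# Rows 1 and 3 are identical (hypothetical data).
x <- rbind(c(1, 2), c(3, 4), c(1, 2))

first_dup <- anyDuplicated(x)                   # scanning forward, row 3 repeats row 1
last_dup  <- anyDuplicated(x, fromLast = TRUE)  # scanning backward, row 1 repeats row 3
kept_last <- unique(x, fromLast = TRUE)         # keeps the last copy of each row

print(first_dup)
print(last_dup)
print(kept_last)
```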

Warning

In contrast to the base counterparts, characters are compared directly based on their internal representations, without considering encoding issues. For numeric and complex matrices, the default signif is Inf; that is, floating-point values are compared directly without rounding. Long vectors are not yet supported.

See Also

base::duplicated, base::unique, signif, grpDuplicated

Examples

## prepare test data: 
set.seed(9992722L, kind="Mersenne-Twister")
x.double=model.matrix(~gl(5,8))[sample(40), ]

## typical uses
unique(x.double)
unique(x.double, fromLast=TRUE)
unique(t(x.double), MARGIN=2)
unique(t(x.double), MARGIN=2, fromLast=TRUE)
anyDuplicated(x.double)
anyDuplicated(x.double, fromLast = TRUE)


## additional atomic test data
x.integer=as.integer(x.double); attributes(x.integer)=attributes(x.double)
x.factor=as.factor(x.integer); dim(x.factor)=dim(x.integer); dimnames(x.factor)=dimnames(x.integer)
x.logical=as.logical(x.double); attributes(x.logical)=attributes(x.double)
x.character=as.character(x.double); attributes(x.character)=attributes(x.double)
x.complex=as.complex(x.double); attributes(x.complex)=attributes(x.double)
x.raw=as.raw(x.double); attributes(x.raw)=attributes(x.double)

## compare results with base:
stopifnot(identical(base::duplicated.matrix(x.double), 
                    uniqueAtomMat::duplicated.matrix(x.double)
))
stopifnot(identical(base::duplicated.matrix(x.integer, fromLast=TRUE), 
                    uniqueAtomMat::duplicated.matrix(x.integer, fromLast=TRUE)
))
stopifnot(identical(base::duplicated.matrix(t(x.logical), MARGIN=2L), 
                    uniqueAtomMat::duplicated.matrix(t(x.logical), MARGIN=2L) 
))
stopifnot(identical(base::duplicated.matrix(t(x.character), MARGIN=2L, fromLast=TRUE), 
                    uniqueAtomMat::duplicated.matrix(t(x.character), MARGIN=2L, fromLast=TRUE) 
))

stopifnot(identical(base::unique.matrix(x.complex), 
                    uniqueAtomMat::unique.matrix(x.complex) 
))
stopifnot(identical(base::unique.matrix(x.raw), 
                    uniqueAtomMat::unique.matrix(x.raw) 
))
stopifnot(identical(base::unique.matrix(x.factor), 
                    uniqueAtomMat::unique.matrix(x.factor) 
))
stopifnot(identical(base::duplicated.matrix(x.double, MARGIN=0), 
                    uniqueAtomMat::duplicated.matrix(x.double, MARGIN=0) 
))
stopifnot(identical(base::anyDuplicated.matrix(x.integer, MARGIN=0), 
                    uniqueAtomMat::anyDuplicated.matrix(x.integer, MARGIN=0) 
))


## benchmarking
if (require(microbenchmark)){
    print(microbenchmark(base::duplicated.matrix(x.double)))
    print(microbenchmark(uniqueAtomMat::duplicated.matrix(x.double)))

    print(microbenchmark(base::duplicated.matrix(x.character)))
    print(microbenchmark(uniqueAtomMat::duplicated.matrix(x.character)))
}else{
    print(system.time(replicate(5e3L, base::duplicated.matrix(x.double))))
    print(system.time(replicate(5e3L, uniqueAtomMat::duplicated.matrix(x.double))))

    print(system.time(replicate(5e3L, base::duplicated.matrix(x.character))))
    print(system.time(replicate(5e3L, uniqueAtomMat::duplicated.matrix(x.character))))
}
