dist: Matrix Distance/Similarity Computation


View source: R/dist.R

Description

These functions compute and return the auto-distance/similarity matrix between either rows or columns of a matrix/data frame, or a list, as well as the cross-distance matrix between two matrices/data frames/lists.

Usage

dist(x, y = NULL, method = NULL, ..., diag = FALSE, upper = FALSE,
     pairwise = FALSE, by_rows = TRUE, convert_similarities = TRUE,
     auto_convert_data_frames = TRUE)
simil(x, y = NULL, method = NULL, ..., diag = FALSE, upper = FALSE,
      pairwise = FALSE, by_rows = TRUE, convert_distances = TRUE,
      auto_convert_data_frames = TRUE)

pr_dist2simil(x)
pr_simil2dist(x)

as.dist(x, FUN = NULL)
as.simil(x, FUN = NULL)

## S3 method for class 'dist'
as.matrix(x, diag = 0, ...)
## S3 method for class 'simil'
as.matrix(x, diag = NA, ...)

Arguments

x

For dist and simil, a numeric matrix object, a data frame, or a list. A vector will be converted into a column matrix. For as.simil and as.dist, an object of class dist and simil, respectively, or a numeric matrix. For pr_dist2simil and pr_simil2dist, any numeric vector.

y

NULL, or an object of the same kind as x.

method

a function, a registry entry, or a mnemonic string referencing the proximity measure. A list of all available measures can be obtained using pr_DB (see examples). The default for dist is "Euclidean", and for simil "correlation".

diag

logical value indicating whether the diagonal of the distance/similarity matrix should be printed by print.dist/print.simil. Note that the diagonal values are never stored in dist objects.

In the context of as.matrix, the value to use on the diagonal, representing self-proximities. For similarities, this defaults to NA because there is no a priori upper bound, so the maximum similarity must be specified by the user.

upper

logical value indicating whether the upper triangle of the distance/similarity matrix should be printed by print.dist/print.simil

pairwise

logical value indicating whether distances should be computed for the pairs of x and y only.

by_rows

logical indicating whether proximities should be computed between rows (TRUE, the default) or between columns (FALSE).

convert_similarities, convert_distances

logical indicating whether distances should be automatically converted into similarities (and the other way round) if needed.

auto_convert_data_frames

logical indicating whether data frames should be converted to matrices if all variables are numeric, or all are logical, or all are complex.

FUN

optional function to be used by as.dist and as.simil. If NULL, it is looked up in the method registry. If there is none specified there, FUN defaults to pr_simil2dist and pr_dist2simil, respectively.

...

further arguments passed to the proximity function.
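The relationship between a cross-distance matrix and pairwise = TRUE can be sketched in base R without the package. The helpers cross_euclid and pairwise_euclid below are hypothetical names that only mimic the Euclidean case by rows; they are an illustration of the semantics, not the package's implementation.

```r
# Hypothetical helpers (not part of proxy): Euclidean distances by rows.
cross_euclid <- function(x, y) {
  # One entry for every (row of x, row of y) combination.
  out <- matrix(0, nrow(x), nrow(y))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(nrow(y))) {
      out[i, j] <- sqrt(sum((x[i, ] - y[j, ])^2))
    }
  }
  out
}

pairwise_euclid <- function(x, y) {
  # Only the matched pairs (x[i, ], y[i, ]); x and y need equal dimensions.
  sqrt(rowSums((x - y)^2))
}

x <- matrix(c(0, 0, 1, 1, 2, 2), ncol = 2, byrow = TRUE)
y <- matrix(c(3, 4, 1, 1, 2, 5), ncol = 2, byrow = TRUE)

cross <- cross_euclid(x, y)
pairs <- pairwise_euclid(x, y)

# The pairwise result is the diagonal of the cross-distance matrix.
all.equal(diag(cross), pairs)
```

Computing only the matched pairs avoids building the full matrix when just the diagonal is needed.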

Details

The interface is fashioned after dist in package stats, but can also compute cross-distances, and allows user extensions by means of a registry of all proximity measures (see pr_DB).

Missing values are allowed but are excluded from all computations involving the rows within which they occur. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used (compare dist in package stats).
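The proportional scaling can be checked against stats::dist, the function the paragraph above points to for comparison. This is a small sketch with made-up data; it exercises stats::dist only and assumes, as the text states, that the behaviour described here matches it.

```r
# Two rows, three columns; the third column is unusable for this pair.
x <- rbind(c(1, 2, NA),
           c(4, 6, 8))

# Euclidean: squared differences over the 2 usable columns are 9 + 16 = 25,
# scaled by 3 / 2 (all columns / usable columns) before the square root.
d_euclid <- stats::dist(x, method = "euclidean")
all.equal(as.numeric(d_euclid), sqrt(25 * 3 / 2))

# Manhattan: |1 - 4| + |2 - 6| = 7, scaled by 3 / 2.
d_manhattan <- stats::dist(x, method = "manhattan")
all.equal(as.numeric(d_manhattan), 7 * 3 / 2)
```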

Data frames are silently coerced to matrix if all columns are of (same) mode numeric or logical.

Distance measures can be used with simil, and similarity measures with dist. In these cases, the result is transformed accordingly using the specified coercion functions (default: pr_simil2dist(x) = 1 - abs(x) and pr_dist2simil(x) = 1 / (1 + x)). Objects of class simil and dist can be converted into each other using as.dist and as.simil, respectively.
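To see what the default conversions do to typical values, the two formulas can be written out in base R. The function names below are hypothetical stand-ins that follow the documented formulas; they are not the package's own functions.

```r
# Stand-ins for the documented defaults (hypothetical names, base R only).
simil2dist_sketch <- function(x) 1 - abs(x)
dist2simil_sketch <- function(x) 1 / (1 + x)

s <- c(-1, -0.5, 0, 0.5, 1)   # e.g. correlations
simil2dist_sketch(s)          # 0.0 0.5 1.0 0.5 0.0

d <- c(0, 1, 3)               # e.g. Euclidean distances
dist2simil_sketch(d)          # 1.00 0.50 0.25

# The two defaults are not inverses of each other:
dist2simil_sketch(simil2dist_sketch(0.5))   # 1 / 1.5, not 0.5
```

Two consequences follow from the formulas: the sign of a similarity is discarded (-1 and 1 both map to distance 0), and converting back and forth does not recover the original value. Where that matters, a custom FUN can be supplied to as.dist or as.simil.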

Distance and similarity objects can conveniently be subset (see examples). Note that duplicate indexes are silently ignored.

Value

Auto distances/similarities are returned as an object of class dist/simil and cross-distances/similarities as an object of class crossdist/crosssimil.
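The returned classes can be inspected directly. The snippet below is a minimal sketch that assumes the proxy package is installed (it is skipped otherwise) and uses made-up data; simil follows the same pattern with classes simil and crosssimil.

```r
# Assumes the proxy package is installed; skipped otherwise.
if (requireNamespace("proxy", quietly = TRUE)) {
  x <- matrix(c(0, 0, 3, 4, 6, 8), ncol = 2, byrow = TRUE)

  auto  <- proxy::dist(x)       # auto-distances among the rows of x
  cross <- proxy::dist(x, x)    # cross-distances between x and x

  print(class(auto))
  print(class(cross))
}
```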

Author(s)

David Meyer David.Meyer@R-project.org and Christian Buchta Christian.Buchta@wu-wien.ac.at

References

Anderberg, M.R. (1973), Cluster Analysis for Applications, 359 pp., Academic Press, New York, NY, USA.

Cox, T.F. and Cox, M.A.A. (2001), Multidimensional Scaling, Chapman and Hall.

Sokal, R.R. and Sneath, P.H.A. (1963), Principles of Numerical Taxonomy, W. H. Freeman and Co., San Francisco.

See Also

dist in package stats for compatibility information, and pr_DB for the proximity database.

Examples

### show available proximities
summary(pr_DB)

### get more information about a particular one
pr_DB$get_entry("Jaccard")

### binary data
x <- matrix(sample(c(FALSE, TRUE), 8, replace = TRUE), ncol = 2)
dist(x, method = "Jaccard")

### for real-valued data
dist(x, method = "eJaccard")

### for positive real-valued data
dist(x, method = "fJaccard")

### cross distances
dist(x, x, method = "Jaccard")

### pairwise (diagonal)
dist(x, x, method = "Jaccard", pairwise = TRUE)

### this is the same but less efficient
as.matrix(stats::dist(x, method = "binary"))

### numeric data
x <- matrix(rnorm(16), ncol = 4)

## test inheritance of names
rownames(x) <- LETTERS[1:4]
colnames(x) <- letters[1:4]
dist(x)
dist(x, x)

## custom distance function
f <- function(x, y) sum(x * y)
dist(x, method = f)

## working with lists
z <- unlist(apply(x, 1, list), recursive = FALSE)
(d <- dist(z))
dist(z, z)

## subsetting
d[[1:2]]
subset(d, c(1,3,4))
d[[c(1,2,2)]]	    # duplicate index gets ignored

## transformations and self-proximities
as.matrix(as.simil(d, function(x) exp(-x)), diag = 1)

## row and column indexes
row.dist(d)
col.dist(d)

Example output

Attaching package: 'proxy'

The following objects are masked from 'package:stats':

    as.dist, dist

The following object is masked from 'package:base':

    as.matrix

* Similarity measures:
Braun-Blanquet, Chi-squared, Cramer, Dice, Fager, Faith, Gower, Hamman,
Jaccard, Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
Pearson, Phi, Phi-squared, Russel, Simpson, Stiles, Tanimoto,
Tschuprow, Yule, Yule2, correlation, cosine, eDice, eJaccard, simple
matching

* Distance measures:
Bhjattacharyya, Bray, Canberra, Chord, Euclidean, Geodesic, Hellinger,
Kullback, Levenshtein, Mahalanobis, Manhattan, Minkowski, Podani,
Soergel, Wave, Whittaker, divergence, fJaccard, supremum

      names Jaccard, binary, Reyssac, Roux
        FUN R_bjaccard
   distance FALSE
     PREFUN pr_Jaccard_prefun
    POSTFUN NA
    convert pr_simil2dist
       type binary
       loop FALSE
      C_FUN TRUE
    PACKAGE proxy
       abcd FALSE
    formula a / (a + b + c)
  reference Jaccard, P. (1908). Nouvelles recherches sur la
            distribution florale. Bull. Soc. Vaud. Sci. Nat., 44, pp.
            223--270.
description The Jaccard Similarity (C implementation) for binary data.
            It is the proportion of (TRUE, TRUE) pairs, but not
            considering (FALSE, FALSE) pairs. So it compares the
            intersection with the union of object sets.
    1   2   3
2 0.5        
3 1.0 0.5    
4 1.0 1.0 1.0
    1   2   3
2 0.5        
3 1.0 0.5    
4 1.0 1.0 1.0
    1   2   3
2 0.5        
3 1.0 0.5    
4 1.0 1.0 1.0
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.5  1.0  1.0
[2,]  0.5  0.0  0.5  1.0
[3,]  1.0  0.5  0.0  1.0
[4,]  1.0  1.0  1.0  0.0
[1] 0 0 0 0
    1   2   3 4
1 0.0 0.5 1.0 1
2 0.5 0.0 0.5 1
3 1.0 0.5 0.0 1
4 1.0 1.0 1.0 0
         A        B        C
B 2.138168                  
C 2.927701 1.497771         
D 2.681559 3.605614 3.841411
         A        B        C        D
A 0.000000 2.138168 2.927701 2.681559
B 2.138168 0.000000 1.497771 3.605614
C 2.927701 1.497771 0.000000 3.841411
D 2.681559 3.605614 3.841411 0.000000
           A          B          C
B  1.8268886                      
C -0.7881506  1.6596316           
D  1.6878902 -1.9332338 -3.4264274
         A        B        C
B 2.138168                  
C 2.927701 1.497771         
D 2.681559 3.605614 3.841411
         A        B        C        D
A 0.000000 2.138168 2.927701 2.681559
B 2.138168 0.000000 1.497771 3.605614
C 2.927701 1.497771 0.000000 3.841411
D 2.681559 3.605614 3.841411 0.000000
         A
B 2.138168
         A        C
C 2.927701         
D 2.681559 3.841411
         A
B 2.138168
           A          B          C          D
A 1.00000000 0.11787062 0.05351994 0.06845635
B 0.11787062 1.00000000 0.22362807 0.02717075
C 0.05351994 0.22362807 1.00000000 0.02146330
D 0.06845635 0.02717075 0.02146330 1.00000000
[1] 2 3 4 3 4 4
[1] 1 1 1 2 2 3
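
The Jaccard entry printed above gives the formula a / (a + b + c) and a conversion to distances via pr_simil2dist. A hand computation in base R may make this concrete. The sketch assumes the usual 2 x 2 table convention (a counts (TRUE, TRUE) pairs, b and c count the two kinds of mismatches); the vectors are made up for illustration.

```r
# Two binary vectors (made up for illustration).
u <- c(TRUE, TRUE,  FALSE, FALSE)
v <- c(TRUE, FALSE, FALSE, FALSE)

a  <- sum(u & v)      # (TRUE, TRUE) pairs
b  <- sum(u & !v)     # (TRUE, FALSE) pairs
cc <- sum(!u & v)     # (FALSE, TRUE) pairs; named cc to avoid masking c()

jaccard_simil <- a / (a + b + cc)
jaccard_dist  <- 1 - abs(jaccard_simil)   # default conversion from Details

# stats::dist(method = "binary") reports the same dissimilarity.
all.equal(as.numeric(stats::dist(rbind(u, v), method = "binary")), jaccard_dist)
```

The (FALSE, FALSE) positions do not enter the formula, which is why the last two elements of u and v have no effect on the result.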

proxy documentation built on June 8, 2021, 1:06 a.m.