distance()
The distance()
function implemented in philentropy
is able to compute 46 different distances/similarities between probability density functions (see ?philentropy::distance
for details).
The distance()
function is implemented using the same logic as R's base function stats::dist()
and takes a matrix
or data.frame
object as input. The corresponding matrix
or data.frame
should store probability density functions (as rows) for which distance computations should be performed.
# define a probability density function P P <- 1:10/sum(1:10) # define a probability density function Q Q <- 20:29/sum(20:29) # combine P and Q as matrix object x <- rbind(P,Q)
Please note that when defining a matrix
from vectors, probability vectors should be combined as rows (rbind()
).
library(philentropy) # compute the Euclidean Distance with default parameters distance(x, method = "euclidean")
euclidean 0.1280713
For this simple case you can compare the results with R's base function to compute the euclidean distance stats::dist()
.
# compute the Euclidean Distance using R's base function stats::dist(x, method = "euclidean")
P Q 0.1280713
However, the R base function stats::dist()
only computes the following distance measures: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"
, whereas distance()
allows you to choose from 46 distance/similarity measures and when selecting the native distance functions underlying distance()
users can speed up their computations 3-5x.
To find out which method
s are implemented in distance()
you can consult
the getDistMethods()
function.
# names of implemented distance/similarity functions getDistMethods()
[1] "euclidean" "manhattan" "minkowski" "chebyshev" [5] "sorensen" "gower" "soergel" "kulczynski_d" [9] "canberra" "lorentzian" "intersection" "non-intersection" [13] "wavehedges" "czekanowski" "motyka" "kulczynski_s" [17] "tanimoto" "ruzicka" "inner_product" "harmonic_mean" [21] "cosine" "hassebrook" "jaccard" "dice" [25] "fidelity" "bhattacharyya" "hellinger" "matusita" [29] "squared_chord" "squared_euclidean" "pearson" "neyman" [33] "squared_chi" "prob_symm" "divergence" "clark" [37] "additive_symm" "kullback-leibler" "jeffreys" "k_divergence" [41] "topsoe" "jensen-shannon" "jensen_difference" "taneja" [45] "kumar-johnson" "avg"
Now you can choose any distance/similarity method
that serves you.
# compute the Jaccard Distance with default parameters distance(x, method = "jaccard")
jaccard 0.133869
Analogously, in case a probability matrix is specified the following output is generated.
# combine three probabilty vectors to a probabilty matrix ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39)) rownames(ProbMatrix) <- paste0("Example", 1:3) # compute the euclidean distance between all # pairwise comparisons of probability vectors distance(ProbMatrix, method = "euclidean")
#> Metric: 'euclidean'; comparing: 3 vectors. v1 v2 v3 v1 0.0000000 0.12807130 0.13881717 v2 0.1280713 0.00000000 0.01074588 v3 0.1388172 0.01074588 0.00000000
Alternatively, users can specify the argument use.row.names = TRUE
to maintain the
rownames of the input matrix and pass them as rownames and colnames to the output distance matrix.
# compute the euclidean distance between all # pairwise comparisons of probability vectors distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
#> Metric: 'euclidean'; comparing: 3 vectors. Example1 Example2 Example3 Example1 0.0000000 0.12807130 0.13881717 Example2 0.1280713 0.00000000 0.01074588 Example3 0.1388172 0.01074588 0.00000000
This output differs from the output of stats::dist()
.
# compute the euclidean distance between all # pairwise comparisons of probability vectors # using stats::dist() stats::dist(ProbMatrix, method = "euclidean")
1 2 2 0.12807130 3 0.13881717 0.01074588
Whereas distance()
returns a symmetric distance matrix, stats::dist()
returns only one part of the symmetric matrix.
However, users can also specify the argument as.dist.obj = TRUE
in philentropy::distance()
to retrieve a philentropy::distance()
output which is an object of type stats::dist()
.
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39)) rownames(ProbMatrix) <- paste0("test", 1:3) distance(ProbMatrix, method = "euclidean", use.row.names = TRUE, as.dist.obj = TRUE)
Metric: 'euclidean'; comparing: 3 vectors. test1 test2 test2 0.12807130 test3 0.13881717 0.01074588
Now let's compare the run times of base R and philentropy
. For this purpose you need to install the microbenchmark
package.
Note: Please make sure to insert vector objects (in our example P, Q) when directly running the low-level functions such as
euclidean()
etc. Otherwise, computational overheads are produced that significantly slow down computations when using large vectors.
# install.packages("microbenchmark") library(microbenchmark) microbenchmark( distance(x,method = "euclidean", test.na = FALSE), dist(x,method = "euclidean"), euclidean(P, Q, FALSE) )
Unit: microseconds expr min lq mean median uq max neval distance(x, method = "euclidean", test.na = FALSE) 26.518 28.3495 29.73174 29.2210 30.1025 62.096 100 dist(x, method = "euclidean") 11.073 12.9375 14.65223 14.3340 15.1710 65.130 100 euclidean(P, Q, FALSE) 4.329 4.9605 5.72378 5.4815 6.1240 22.510 100
As you can see, although the distance()
function is quite fast, the internal checks cause it to be 2x slower than the base dist()
function (for the euclidean
example). Nevertheless, in case you need to implement a faster version of the corresponding distance measure you can type philentropy::
and then TAB
allowing you to select the base distance computation functions (written in C++), e.g. philentropy::euclidean()
which is almost 3x faster than the base dist()
function.
The advantage of distance()
is that it implements 46 distance measures based on base C++ functions that can be accessed individually by typing philentropy::
and then TAB
. In future versions of philentropy
I will optimize the distance()
function so that internal checks for data type correctness and correct input data will take less termination time than the base dist()
function.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.