This function computes the distance/dissimilarity between two probability density functions.
distance( x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", epsilon = 1e-05, est.prob = NULL, use.row.names = FALSE, as.dist.obj = FALSE, diag = FALSE, upper = FALSE, mute.message = FALSE )
x 
a numeric matrix whose rows store probability vectors, or count vectors if est.prob is specified.
method 
a character string indicating the distance measure that should be computed.
p 
power of the Minkowski distance. 
test.na 
a boolean value indicating whether input vectors should be tested for NA values.
unit 
a character string specifying the logarithm unit that should be used to compute distances that depend on log computations. 
epsilon 
a small value to address cases in the distance computation where division by zero occurs. In
these cases, x / 0 or 0 / 0 will be replaced by epsilon.
est.prob 
method to estimate probabilities from input count vectors, i.e. non-probability vectors. Default: est.prob = NULL.
use.row.names 
a logical value indicating whether or not row names from
the input matrix shall be used as rownames and colnames of the output distance matrix. Default value is use.row.names = FALSE.
as.dist.obj 
shall the return value or matrix be an object of class dist? Default is as.dist.obj = FALSE.
diag 
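if as.dist.obj = TRUE, this value indicates whether the diagonal of the distance matrix should be printed. Default is diag = FALSE.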
upper 
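if as.dist.obj = TRUE, this value indicates whether the upper triangle of the distance matrix should be printed. Default is upper = FALSE.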
mute.message 
a logical value indicating whether or not messages printed by distance shall be muted. Default is mute.message = FALSE.
Here a distance is defined as a quantitative degree of how far two mathematical objects are apart from each other (Cha, 2007).
This function implements the following distance/similarity measures to quantify the distance between probability density functions:
L_p Minkowski family
Euclidean : d = sqrt( ∑ | P_i - Q_i |^2)
Manhattan : d = ∑ | P_i - Q_i |
Minkowski : d = ( ∑ | P_i - Q_i |^p)^(1/p)
Chebyshev : d = max | P_i - Q_i |
L_1 family
Sorensen : d = ∑ | P_i - Q_i | / ∑ (P_i + Q_i)
Gower : d = (1/n) * ∑ | P_i - Q_i | ; n = number of elements of P
Soergel : d = ∑ | P_i - Q_i | / ∑ max(P_i , Q_i)
Kulczynski d : d = ∑ | P_i - Q_i | / ∑ min(P_i , Q_i)
Canberra : d = ∑ | P_i - Q_i | / (P_i + Q_i)
Lorentzian : d = ∑ ln(1 + | P_i - Q_i |)
Intersection family
Intersection : s = ∑ min(P_i , Q_i)
Non-Intersection : d = 1 - ∑ min(P_i , Q_i)
Wave Hedges : d = ∑ | P_i - Q_i | / max(P_i , Q_i)
Czekanowski : d = ∑ | P_i - Q_i | / ∑ | P_i + Q_i |
Motyka : d = ∑ min(P_i , Q_i) / (P_i + Q_i)
Kulczynski s : d = 1 / ( ∑ | P_i - Q_i | / ∑ min(P_i , Q_i) )
Tanimoto : d = ∑ (max(P_i , Q_i) - min(P_i , Q_i)) / ∑ max(P_i , Q_i) ; equivalent to Soergel
Ruzicka : s = ∑ min(P_i , Q_i) / ∑ max(P_i , Q_i) ; equivalent to 1 - Tanimoto = 1 - Soergel
Inner Product family
Inner Product : s = ∑ P_i * Q_i
Harmonic mean : s = 2 * ∑ (P_i * Q_i) / (P_i + Q_i)
Cosine : s = ∑ (P_i * Q_i) / ( sqrt(∑ P_i^2) * sqrt(∑ Q_i^2) )
Kumar-Hassebrook (PCE) : s = ∑ (P_i * Q_i) / (∑ P_i^2 + ∑ Q_i^2 - ∑ (P_i * Q_i))
Jaccard : d = 1 - ∑ (P_i * Q_i) / (∑ P_i^2 + ∑ Q_i^2 - ∑ (P_i * Q_i)) ; equivalent to 1 - Kumar-Hassebrook
Dice : d = ∑ (P_i - Q_i)^2 / (∑ P_i^2 + ∑ Q_i^2)
Squared-chord family
Fidelity : s = ∑ sqrt(P_i * Q_i)
Bhattacharyya : d = - ln ∑ sqrt(P_i * Q_i)
Hellinger : d = 2 * sqrt( 1 - ∑ sqrt(P_i * Q_i))
Matusita : d = sqrt( 2 - 2 * ∑ sqrt(P_i * Q_i))
Squared-chord : d = ∑ ( sqrt(P_i) - sqrt(Q_i) )^2
Squared L_2 family (X^2 squared family)
Squared Euclidean : d = ∑ ( P_i - Q_i )^2
Pearson X^2 : d = ∑ ( (P_i - Q_i )^2 / Q_i )
Neyman X^2 : d = ∑ ( (P_i - Q_i )^2 / P_i )
Squared X^2 : d = ∑ ( (P_i - Q_i )^2 / (P_i + Q_i) )
Probabilistic Symmetric X^2 : d = 2 * ∑ ( (P_i - Q_i )^2 / (P_i + Q_i) )
Divergence : X^2 : d = 2 * ∑ ( (P_i - Q_i )^2 / (P_i + Q_i)^2 )
Clark : d = sqrt ( ∑ ( | P_i - Q_i | / (P_i + Q_i) )^2 )
Additive Symmetric X^2 : d = ∑ ( ((P_i - Q_i)^2 * (P_i + Q_i)) / (P_i * Q_i) )
Shannon's entropy family
Kullback-Leibler : d = ∑ P_i * log(P_i / Q_i)
Jeffreys : d = ∑ (P_i - Q_i) * log(P_i / Q_i)
K divergence : d = ∑ P_i * log(2 * P_i / (P_i + Q_i))
Topsoe : d = ∑ ( P_i * log(2 * P_i / (P_i + Q_i)) + Q_i * log(2 * Q_i / (P_i + Q_i)) )
Jensen-Shannon : d = 0.5 * ( ∑ P_i * log(2 * P_i / (P_i + Q_i)) + ∑ Q_i * log(2 * Q_i / (P_i + Q_i)) )
Jensen difference : d = ∑ ( (P_i * log(P_i) + Q_i * log(Q_i)) / 2 - ((P_i + Q_i) / 2) * log((P_i + Q_i) / 2) )
Combinations
Taneja : d = ∑ ((P_i + Q_i) / 2) * log( (P_i + Q_i) / (2 * sqrt(P_i * Q_i)) )
Kumar-Johnson : d = ∑ (P_i^2 - Q_i^2)^2 / ( 2 * (P_i * Q_i)^1.5 )
Avg(L_1, L_n) : d = ( ∑ | P_i - Q_i | + max{ | P_i - Q_i | } ) / 2
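To make the notation of the list above concrete, the L_p Minkowski family can be written in a few lines of base R. This is only a sketch: the helper names (euclidean_sketch, manhattan_sketch, minkowski_sketch, chebyshev_sketch) are hypothetical and not part of the package, and the package's own implementation may differ.

```r
# Base R sketch of the L_p Minkowski family.
# All *_sketch helpers are hypothetical illustrations, not package functions.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

euclidean_sketch <- function(P, Q) sqrt(sum(abs(P - Q)^2))
manhattan_sketch <- function(P, Q) sum(abs(P - Q))
minkowski_sketch <- function(P, Q, p) sum(abs(P - Q)^p)^(1 / p)
chebyshev_sketch <- function(P, Q) max(abs(P - Q))

# Minkowski reduces to Manhattan for p = 1 and to Euclidean for p = 2
all.equal(minkowski_sketch(P, Q, 1), manhattan_sketch(P, Q))
all.equal(minkowski_sketch(P, Q, 2), euclidean_sketch(P, Q))

# cross-check against stats::dist(), whose default method is "euclidean"
all.equal(as.numeric(dist(rbind(P, Q))), euclidean_sketch(P, Q))
```

The same pattern (an elementwise expression in P_i and Q_i, followed by sum or max) carries over to most of the other measures listed.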
In cases where x specifies a count matrix, the argument est.prob can be selected to first estimate probability vectors
from input count vectors and second compute the corresponding distance measure based on the estimated probability vectors.
The following probability estimation methods are implemented in this function:
est.prob = "empirical" : relative frequencies of counts.
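Relative frequencies of counts can be sketched in base R as below. The sketch assumes that each row of the count matrix is one count vector, as in the examples on this page; it illustrates the idea and is not the package's internal code.

```r
# Sketch: estimate probability vectors as relative frequencies of counts.
# Assumption: each row of CountMatrix is one count vector.
CountMatrix <- rbind(1:10, 20:29, 30:39)

# divide every row by its row sum
ProbSketch <- CountMatrix / rowSums(CountMatrix)

# every estimated probability vector sums to 1
rowSums(ProbSketch)

# the first row matches the probability vector used in the examples
all.equal(ProbSketch[1, ], 1:10 / sum(1:10))
```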
The following results are returned depending on the dimension of x:
in case nrow(x) = 2 : a single distance value.
in case nrow(x) > 2 : a distance matrix storing distance values for all pairwise probability vector comparisons.
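The two return shapes can be illustrated with a small base R sketch. The helper pairwise_sketch and its dist_fun argument are hypothetical names introduced here for illustration; the sketch only mirrors the behaviour described above and makes no claim about how the package builds its result.

```r
# Sketch: one value for two rows, a symmetric matrix for more than two rows.
# pairwise_sketch() is a hypothetical helper, not a package function.
pairwise_sketch <- function(x, dist_fun) {
  n <- nrow(x)
  if (n == 2) {
    return(dist_fun(x[1, ], x[2, ]))
  }
  out <- matrix(0, nrow = n, ncol = n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # fill both triangles, the diagonal stays 0
      out[i, j] <- out[j, i] <- dist_fun(x[i, ], x[j, ])
    }
  }
  out
}

manhattan <- function(P, Q) sum(abs(P - Q))
ProbMatrix <- rbind(1:10 / sum(1:10), 20:29 / sum(20:29), 30:39 / sum(30:39))

single_value <- pairwise_sketch(ProbMatrix[1:2, ], manhattan)  # length-1 numeric
dist_matrix  <- pairwise_sketch(ProbMatrix, manhattan)         # 3 x 3 matrix
```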
According to the reference, some distance measure computations can become invalid when dealing with 0 probabilities.
In these cases the following conventions are applied:
division by zero - case 0/0 : when the divisor and dividend become zero, 0/0 is treated as 0.
division by zero - case n/0 : when only the divisor becomes 0, the corresponding 0 is replaced by a small ε = 0.00001.
log of zero - case 0 * log(0) : is treated as 0.
log of zero - case log(0) : zero is replaced by a small ε = 0.00001.
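How these conventions could apply to a single measure is sketched below for the Kullback-Leibler formula. The helper kl_sketch is hypothetical and written only to illustrate the rules stated above; the package's internal handling may differ in detail.

```r
# Sketch: Kullback-Leibler divergence with the zero conventions applied.
# kl_sketch() is a hypothetical helper, not a package function.
kl_sketch <- function(P, Q, epsilon = 1e-05) {
  terms <- numeric(length(P))
  for (i in seq_along(P)) {
    if (P[i] == 0) {
      # cases 0/0 and 0 * log(0): the whole term is treated as 0
      terms[i] <- 0
    } else {
      # case n/0: a zero divisor is replaced by epsilon
      q <- if (Q[i] == 0) epsilon else Q[i]
      terms[i] <- P[i] * log(P[i] / q)
    }
  }
  sum(terms)
}

# zero in P: the third term drops out, the result equals log(2)
kl_zero_p <- kl_sketch(c(0.5, 0.5, 0), c(0.25, 0.25, 0.5))

# zero in Q: the result stays finite instead of becoming Inf
kl_zero_q <- kl_sketch(c(0.5, 0.5), c(1, 0))
```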
Hajk-Georg Drost
Sung-Hyuk Cha. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences 4: 1.
getDistMethods, estimate.probability, dist.diversity
# Simple Examples

# receive a list of implemented probability distance measures
getDistMethods()

## compute the euclidean distance between two probability vectors
distance(rbind(1:10/sum(1:10), 20:29/sum(20:29)), method = "euclidean")

## compute the euclidean distance between all pairwise comparisons of probability vectors
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
distance(ProbMatrix, method = "euclidean")

# compute distance matrix without testing for NA values in the input matrix
distance(ProbMatrix, method = "euclidean", test.na = FALSE)

# alternatively use the row names of the input data for the rownames and colnames
# of the output distance matrix
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)

# Specialized Examples

CountMatrix <- rbind(1:10, 20:29, 30:39)

## estimate probabilities from a count matrix
distance(CountMatrix, method = "euclidean", est.prob = "empirical")

## compute the euclidean distance for count data
## NOTE: some distance measures are only defined for probability values
distance(CountMatrix, method = "euclidean")

## compute the Kullback-Leibler Divergence with different logarithm bases:
### case: unit = log (Default)
distance(ProbMatrix, method = "kullback-leibler", unit = "log")

### case: unit = log2
distance(ProbMatrix, method = "kullback-leibler", unit = "log2")

### case: unit = log10
distance(ProbMatrix, method = "kullback-leibler", unit = "log10")
