glottodist: Calculate distances between languages

glottodist    R Documentation

Calculate distances between languages

Description

Calculate pairwise distances between languages, using either Gower distance or Anderberg dissimilarity.

Usage

glottodist(glottodata, metric = "gower")

Arguments

glottodata

glottodata or glottosubdata, either with or without structure table.

metric

Distance metric to use: either "gower" (the default) or "anderberg".

Value

An object of class "dist".
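
Because the return value is a standard "dist" object, it can be passed to base R functions that accept distances. The sketch below uses a small hand-made "dist" object as a stand-in for a glottodist() result; the language labels and distance values are hypothetical and only serve to illustrate the workflow.

```r
# Stand-in for a glottodist() result: a tiny "dist" object built by hand.
# The labels and values are made up for illustration.
labels <- c("lang_a", "lang_b", "lang_c")
m <- matrix(c(0.0, 0.2, 0.7,
              0.2, 0.0, 0.5,
              0.7, 0.5, 0.0),
            nrow = 3, dimnames = list(labels, labels))
d <- as.dist(m)

# Expand to a full symmetric matrix, e.g. to look up a single pair.
dm <- as.matrix(d)
dm["lang_a", "lang_c"]

# Hierarchical clustering works directly on the "dist" object.
hc <- hclust(d, method = "average")
hc$labels
```

The same calls should apply unchanged to the object returned by glottodist(), since they depend only on the "dist" class.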

Details

The function “glottodist” returns a “dist” object computed with either the Gower distance or the Anderberg dissimilarity. The Anderberg dissimilarity is defined as follows. Consider a categorical dataset L containing N objects X_1, \cdots, X_N described by d categorical features, where A_k denotes the k-th feature. The feature A_k takes n_k distinct values in the dataset, and the set of these values is denoted by \mathcal{A}_k. A missing value ('NA') is regarded as an additional value. The following notation is also used:

  • f_k(x): The number of times feature A_k takes the value x in the dataset L. If x\notin\mathcal{A}_k, f_k(x)=0.

  • \hat{p}_k(x): The sample frequency with which feature A_k takes the value x in the dataset L, that is, \hat{p}_k(x)=\frac{f_k(x)}{N}.
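
As a minimal sketch of this notation in base R, the quantities f_k, \hat{p}_k and n_k can be tabulated per feature. The toy data frame and the variable names (L, f, p_hat, n) are hypothetical, and counting NA towards n_k is one reading of "NA as a new value" rather than a statement about the package internals.

```r
# Toy categorical dataset L: N = 5 objects, d = 2 features (made-up values).
L <- data.frame(
  A1 = c("a", "a", "b", NA, "c"),
  A2 = c("x", "y", "x", "x", "y"),
  stringsAsFactors = FALSE
)
N <- nrow(L)

# f_k(x): how often feature A_k takes value x; NA is tabulated as its own value.
f <- lapply(L, function(col) table(col, useNA = "ifany"))

# p_hat_k(x) = f_k(x) / N
p_hat <- lapply(f, function(counts) counts / N)

# n_k: number of distinct values per feature (NA included, by assumption).
n <- vapply(f, length, integer(1))

f$A1      # counts for feature A1
p_hat$A1  # sample frequencies for feature A1
n         # number of values per feature
```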

The Anderberg dissimilarity between two objects X = X_i and Y = X_j, whose values for feature A_k are written X_k and Y_k, is defined as d(X_i, X_j)= \frac{D}{D+S}, where

D = \sum\limits_{k\in \{1\leq k \leq d: X_k \neq Y_k\}} w_k \cdot \delta^{(k)}_{ij} \cdot \tau_{ij}^{(k)}\left(\frac{1}{2\hat{p}_k(X_k)\hat{p}_k(Y_k)}\right)\frac{2}{n_k(n_k+1)},

and

S = \sum\limits_{k\in \{1\leq k \leq d: X_k = Y_k\}} w_k \cdot \delta^{(k)}_{ij}\left(\frac{1}{\hat{p}_k(X_k)}\right)^2\frac{2}{n_k(n_k+1)}.

The number w_k is the weight of the k-th feature, and \delta^{(k)}_{ij} is either 0 or 1. It equals 0 when the k-th feature is asymmetric binary and the values of both X_i and X_j are 0, or when the value of the k-th feature is missing for either object; otherwise it equals 1. When X_k \neq Y_k and the type of A_k is "ordered", \tau_{ij}^{(k)} equals the normalized difference between X_k and Y_k; otherwise \tau_{ij}^{(k)} equals 1.
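
The definition can be traced by hand with a short base R sketch. It is an illustration under simplifying assumptions, not the package's implementation: only nominal features are handled (so \tau_{ij}^{(k)} = 1), weights default to w_k = 1, and \delta^{(k)}_{ij} is 0 only when either value is missing. The function name anderberg_pair and the toy data are hypothetical.

```r
# Sketch of d(X_i, X_j) = D / (D + S) for nominal features only.
# Assumptions: w_k = 1 by default, tau = 1, delta = 0 only if a value is NA.
anderberg_pair <- function(data, i, j, w = rep(1, ncol(data))) {
  N <- nrow(data)
  D <- 0
  S <- 0
  for (k in seq_len(ncol(data))) {
    col <- data[[k]]
    x <- col[i]
    y <- col[j]
    if (is.na(x) || is.na(y)) next               # delta = 0: feature is skipped

    n_k <- length(table(col, useNA = "ifany"))   # distinct values, NA included
    p_x <- sum(col == x, na.rm = TRUE) / N       # p_hat_k(X_k)
    p_y <- sum(col == y, na.rm = TRUE) / N       # p_hat_k(Y_k)
    scale <- 2 / (n_k * (n_k + 1))

    if (x == y) {
      S <- S + w[k] * (1 / p_x)^2 * scale        # matching feature adds to S
    } else {
      D <- D + w[k] * (1 / (2 * p_x * p_y)) * scale  # mismatch adds to D
    }
  }
  D / (D + S)  # NaN if every feature was skipped
}

# Toy data: four objects, two nominal features (made-up values).
toy <- data.frame(
  A1 = c("a", "a", "b", "b"),
  A2 = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

anderberg_pair(toy, 1, 2)  # one match, one mismatch: S = 4/3, D = 2/3, d = 1/3
anderberg_pair(toy, 1, 1)  # identical objects: d = 0
anderberg_pair(toy, 1, 4)  # no matching feature: d = 1
```

In this toy case every value has frequency 0.5 and every feature has n_k = 2, so a match contributes 4/3 to S and a mismatch contributes 2/3 to D, which makes the three results easy to verify by hand.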

References

Anderberg M.R. (1973). Cluster analysis for applications. Academic Press, New York.

Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, pp. 243-254.

Examples

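# Anderberg dissimilarity on the demo glottodata
glottodata <- glottoget("demodata", meta = TRUE)
dist_anderberg <- glottodist(glottodata = glottodata, metric = "anderberg")

# Gower distance (the default metric) on the demo glottosubdata
glottosubdata <- glottoget("demosubdata", meta = TRUE)
dist_gower <- glottodist(glottodata = glottosubdata)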



SietzeN/glottospace documentation built on June 15, 2024, 10:45 p.m.