# gower.dist: Computes the Gower's Distance (StatMatch: Statistical Matching or Data Fusion)

 gower.dist R Documentation

## Computes the Gower's Distance

### Description

This function computes Gower's distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.

### Usage

gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE, var.weights = NULL, robcb=NULL)

### Arguments

data.x: A matrix or a data frame containing the variables to be used in computing the distance. Columns of mode numeric are treated as interval-scaled variables; columns of mode character or class factor as categorical nominal variables; columns of class ordered as categorical ordinal variables; and columns of mode logical as asymmetric binary variables (see Details for further information). Missing values (NA) are allowed. If only data.x is supplied, the dissimilarities between rows of data.x are computed.

data.y: A numeric matrix or data frame with the same variables, of the same type, as those in data.x. Dissimilarities between rows of data.x and rows of data.y are computed. If not provided, it defaults to data.x and only dissimilarities between rows of data.x are computed.

rngs: A vector with the ranges used to scale the variables. Its length must equal the number of variables in data.x. For nonnumeric variables, put 1 or NA in the corresponding position. When rngs=NULL (default), the range of a numeric variable is estimated by jointly considering its values in data.x and in data.y. For example, with rngs=NULL and a variable "X1": rngs["X1"] <- max(data.x[,"X1"], data.y[,"X1"]) - min(data.x[,"X1"], data.y[,"X1"]).

KR.corr: When TRUE (default), the extension of Gower's dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. When KR.corr=FALSE, Gower's (1971) original formula is used.

var.weights: By default (NULL), each variable has the same weight (value 1) when calculating the overall distance (a weighted average of the distances on single variables; see Details). The user can specify different weights by providing one numeric value for each variable contributing to the distance. In other words, var.weights should be a numeric vector whose length equals the number of variables used in calculating the distance. The supplied weights are rescaled to sum to 1.

robcb: NULL by default. If robcb="IQR", the Manhattan distance is scaled by the interquartile range. Alternatively, if robcb="IDR", the Manhattan distance is scaled by the interdecile range. In this case, scaled distances greater than 1 are set equal to 1. This option is suggested when the continuous variables contain outliers.
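The scaling quantities described for rngs, var.weights and robcb can be illustrated in base R. This is a minimal sketch and not the package's internal code: the toy data frames dat_x and dat_y are hypothetical, and pooling the two datasets before taking the interdecile range is an assumption made here by analogy with the joint range used for rngs=NULL.

```r
# Hypothetical toy data; dat_x and dat_y are illustrative names only.
dat_x <- data.frame(X1 = c(1.0, 4.0, 2.5))
dat_y <- data.frame(X1 = c(0.5, 6.0))

# Joint range of X1 over both datasets, as described for rngs = NULL.
rng_x1 <- max(dat_x[, "X1"], dat_y[, "X1"]) - min(dat_x[, "X1"], dat_y[, "X1"])

# User-supplied weights rescaled so that they sum to 1.
w_raw <- c(2, 1, 1)
w <- w_raw / sum(w_raw)

# Robust alternative: scale by the interdecile range (90th minus 10th
# percentile) and cap scaled distances at 1.
# Assumption: the quantiles are taken on the pooled values of both datasets.
pooled <- c(dat_x$X1, dat_y$X1)
idr <- unname(diff(quantile(pooled, probs = c(0.1, 0.9))))
d_rob <- pmin(abs(outer(dat_x$X1, dat_y$X1, "-")) / idr, 1)

print(rng_x1)
print(w)
print(d_rob)
```

In this sketch the pair (1.0, 6.0) has a scaled distance above 1 before capping, which shows why truncation matters once the full range is replaced by a narrower, outlier-resistant spread.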

### Details

This function computes distances between records when variables of different types (categorical and continuous) have been observed. To handle the different types of variables, Gower's dissimilarity coefficient (Gower, 1971) is used. By default (KR.corr=TRUE) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used.

The final dissimilarity between the ith and jth unit is obtained as a weighted average of the dissimilarities for each variable:

d(i,j) = sum_k(delta_ijk * d_ijk * w_k) / sum_k( delta_ijk * w_k)

In particular, d_ijk is the distance between the ith and jth unit computed on the kth variable, w_k is the weight assigned to variable k (1 for all variables by default, unless different weights are supplied by the user through the argument var.weights), and delta_ijk is a 0/1 indicator described further below. The distance d_ijk depends on the nature of the variable:

• logical columns are considered as asymmetric binary variables, for such case d_ijk = 0 if x_ik = x_jk = TRUE, 1 otherwise;

• factor or character columns are considered as categorical nominal variables and d_ijk = 0 if x_ik = x_jk, 1 otherwise;

• numeric columns are considered as interval-scaled variables and

d_ijk = abs(x_ik - x_jk) / R_k

where R_k is the range of the kth variable. The range is the one supplied with the argument rngs (rngs[k]) or the one computed on the available data (when rngs=NULL);

• ordered columns are considered as categorical ordinal variables and the values are replaced with the corresponding position index, r_ik, in the factor levels. When KR.corr=FALSE these position indexes (which differ from the output of the R function rank) are transformed in the following manner:

z_ik = (r_ik - 1)/(max(r_ik) - 1)

These new values, z_ik, are treated as observations of an interval scaled variable.
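The per-variable rules listed above can be sketched in base R. The helper names d_logical, d_nominal, d_numeric and z_ordinal are hypothetical and exist only for this illustration; they are not functions from StatMatch, and the sketch assumes complete data (no NA).

```r
# Asymmetric binary (logical): distance 0 only when both values are TRUE.
d_logical <- function(a, b) as.numeric(!(a & b))

# Categorical nominal: 0 if the categories match, 1 otherwise.
d_nominal <- function(a, b) as.numeric(a != b)

# Interval scaled: absolute difference divided by the range R_k.
d_numeric <- function(a, b, rng) abs(a - b) / rng

# Ordinal: position index in the factor levels, rescaled to [0, 1]
# following z_ik = (r_ik - 1) / (max(r_ik) - 1).
z_ordinal <- function(f) {
  r <- as.integer(f)
  (r - 1) / (max(r) - 1)
}

# Hypothetical ordered factor with three levels.
sizes <- factor(c("low", "high", "mid"),
                levels = c("low", "mid", "high"), ordered = TRUE)
z <- z_ordinal(sizes)

print(d_logical(TRUE, FALSE))
print(d_nominal("a", "b"))
print(d_numeric(2, 5, rng = 10))
print(z)
```

Once transformed, the z values can be passed to d_numeric like any other interval-scaled variable. Note that as.integer() on an ordered factor returns the position in the levels, which is generally not what rank() returns when categories repeat.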

As far as the weight delta_ijk is concerned:

• delta_ijk = 0 if x_ik = NA or x_jk = NA;

• delta_ijk = 0 if the variable is asymmetric binary and x_ik = x_jk = 0 or x_ik = x_jk = FALSE;

• delta_ijk = 1 in all the other cases.

In practice, NAs and pairs of cases with x_ik = x_jk = FALSE do not contribute to the distance computation.
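To see how delta_ijk and w_k enter the formula for d(i,j), the following base R sketch combines per-variable distances for a single pair of records. The function gower_pair and the numeric values are hypothetical and chosen only for illustration.

```r
# Weighted average of per-variable distances, skipping variables
# whose indicator delta is 0 (hypothetical helper, not part of StatMatch).
gower_pair <- function(d, w, delta) {
  sum(delta * d * w) / sum(delta * w)
}

# One pair of records (i, j) observed on four variables:
#   1. logical, both FALSE          -> delta = 0 (does not contribute)
#   2. nominal, "a" versus "b"      -> d = 1
#   3. numeric, 2 versus 5, R_k = 10 -> d = 0.3
#   4. numeric, NA versus 1         -> delta = 0 (does not contribute)
# Distances for excluded variables are set to 0 as placeholders,
# because NA * 0 is NA in R.
d     <- c(0, 1, 0.3, 0)
delta <- c(0, 1, 1, 0)
w     <- c(1, 1, 1, 1)

d_ij <- gower_pair(d, w, delta)
print(d_ij)  # (1 + 0.3) / 2 = 0.65
```

Only two of the four variables contribute, so the denominator is 2 rather than 4. If every delta is 0 the sketch divides by zero and returns NaN; how the package treats that case is not stated here.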

### Value

A matrix object with distances between rows of data.x and those of data.y.

### Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

### References

Gower, J. C. (1971), “A general coefficient of similarity and some of its properties”. Biometrics, 27, 623–637.

Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

### See Also

daisy, dist

### Examples

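library(StatMatch)

x1 <- as.logical(rbinom(10,1,0.5))                  # logical: asymmetric binary
x2 <- sample(letters, 10, replace=TRUE)             # character: nominal
x3 <- rnorm(10)                                     # numeric: interval scaled
x4 <- ordered(cut(x3, -4:4, include.lowest=TRUE))   # ordered: ordinal
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)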

# matrix of distances between observations in xx
dx <- gower.dist(xx)
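The arguments data.y and var.weights from the Usage section can be exercised in the same way. The sketch below assumes the StatMatch package is installed and skips the calls otherwise; the seed, the row split and the weights are arbitrary choices made for illustration.

```r
# Runs only if StatMatch is available (assumption: installed from CRAN).
if (requireNamespace("StatMatch", quietly = TRUE)) {
  set.seed(123)  # arbitrary seed, only for reproducibility
  x1 <- as.logical(rbinom(10, 1, 0.5))
  x2 <- sample(letters, 10, replace = TRUE)
  x3 <- rnorm(10)
  x4 <- ordered(cut(x3, -4:4, include.lowest = TRUE))
  xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

  # Distances between the first six rows and the remaining four.
  dxy <- StatMatch::gower.dist(data.x = xx[1:6, ], data.y = xx[7:10, ])
  print(dim(dxy))  # one row per row of data.x, one column per row of data.y

  # Same data, with unequal weights for the four variables.
  dw <- StatMatch::gower.dist(xx, var.weights = c(1, 2, 5, 2))
  print(round(dw[1:3, 1:3], 3))
}
```

Because var.weights is rescaled to sum to 1, only the relative sizes of the weights matter: c(1, 2, 5, 2) and c(2, 4, 10, 4) describe the same weighting.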