NominalDistances: Distances among individuals with nominal variables

NominalDistances    R Documentation

Distances among individuals with nominal variables

Description

This function computes several measures of distance (or similarity) among individuals from a nominal data matrix.

Usage

NominalDistances(X, method = 1, diag = FALSE, upper = FALSE, similarity = TRUE)

Arguments

X

Matrix or data.frame with the nominal variables.

method

An integer between 1 and 6 selecting the measure to use. See Details.

diag

A logical value indicating whether the diagonal of the distance matrix should be printed.

upper

A logical value indicating whether the upper triangle of the distance matrix should be printed.

similarity

A logical value indicating whether the similarity matrix, rather than the distance matrix, should be computed.

Details

Let X be the table of nominal data. All these distances are of the form d=\sqrt{1-s}, where s is a similarity coefficient.
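The relation between similarity and distance can be sketched as follows. The helper name sim_to_dist is hypothetical and not part of the package; the sketch assumes similarities already scaled to [0, 1].

```r
# Hypothetical helper (not part of MultBiplotR): d = sqrt(1 - s), element-wise.
sim_to_dist <- function(S) {
  stopifnot(all(S >= 0 & S <= 1))  # the formula assumes similarities in [0, 1]
  sqrt(1 - S)
}

# Toy similarity matrix for two individuals
S <- matrix(c(1.00, 0.75,
              0.75, 1.00), nrow = 2, byrow = TRUE)
D <- sim_to_dist(S)
print(D)  # zeros on the diagonal, 0.5 off the diagonal
```

A similarity of 1 (identical individuals) maps to a distance of 0, and a similarity of 0 maps to the maximum distance of 1.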

1 = Overlap method

The overlap measure simply counts the number of attributes that match in the two data instances.
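A minimal sketch of the overlap measure for one pair of individuals. It assumes the count of matches is divided by the number of attributes, as in Boriah et al. (2008), so that s lies in [0, 1]; overlap_sim is a hypothetical name and the package's internal code may differ.

```r
# Hypothetical helper: share of attributes on which two individuals coincide.
overlap_sim <- function(x, y) {
  mean(x == y)
}

# Toy nominal table: rows are individuals, columns are attributes
X <- rbind(c("red",  "round",  "S"),
           c("red",  "square", "S"),
           c("blue", "square", "L"))

s12 <- overlap_sim(X[1, ], X[2, ])  # 2 of 3 attributes match
d12 <- sqrt(1 - s12)                # corresponding distance
print(c(similarity = s12, distance = d12))
```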

2 = Eskin

Eskin et al. proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of \frac{2}{n_{k}^{2}} to mismatches; when adapted to similarity, this becomes a weight of \frac{n_{k}^{2}}{n_{k}^{2}+2}, where n_{k} is the number of values taken by attribute k. This measure gives more weight to mismatches that occur on attributes that take many values.
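The Eskin weighting can be illustrated with a short sketch. It assumes a per-attribute similarity of 1 for a match and n_k^2 / (n_k^2 + 2) for a mismatch, averaged over attributes, as in Boriah et al. (2008); eskin_sim and n_levels are hypothetical names.

```r
# Hypothetical helper: Eskin similarity between two individuals.
# n_levels[k] is the number of distinct values observed for attribute k.
eskin_sim <- function(x, y, n_levels) {
  per_attr <- ifelse(x == y, 1, n_levels^2 / (n_levels^2 + 2))
  mean(per_attr)
}

X <- rbind(c("red",   "round",  "S"),
           c("red",   "square", "S"),
           c("blue",  "square", "L"),
           c("green", "round",  "S"))

n_levels <- apply(X, 2, function(col) length(unique(col)))  # 3, 2, 2
s12 <- eskin_sim(X[1, ], X[2, ], n_levels)
print(s12)  # (1 + 4/6 + 1) / 3 = 8/9
```

The only mismatch is on the second attribute, which takes two values, so it contributes 4/6 rather than 0.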

3 = IOF (Inverse Occurrence Frequency)

This measure assigns lower similarity to mismatches on more frequent values. The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval, where it is used to signify the relative number of documents that contain a specific word.
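A sketch of the IOF measure, assuming the per-attribute definition in Boriah et al. (2008): 1 for a match and 1 / (1 + log f(x) log f(y)) for a mismatch, where f counts how often a value occurs in its column. The name iof_sim is hypothetical.

```r
# Hypothetical helper: IOF similarity between two individuals of table X.
iof_sim <- function(x, y, X) {
  per_attr <- vapply(seq_along(x), function(k) {
    if (x[k] == y[k]) return(1)
    fx <- sum(X[, k] == x[k])  # frequency of x's value in column k
    fy <- sum(X[, k] == y[k])  # frequency of y's value in column k
    1 / (1 + log(fx) * log(fy))
  }, numeric(1))
  mean(per_attr)
}

X <- rbind(c("red",   "round",  "S"),
           c("red",   "square", "S"),
           c("blue",  "square", "L"),
           c("green", "round",  "S"))

s12 <- iof_sim(X[1, ], X[2, ], X)
print(s12)
```

A mismatch involving a value that occurs only once contributes 1, because log(1) = 0; the penalty grows as both mismatching values become more frequent.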

4 = OF (Occurrence Frequency)

This measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity.
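A sketch of the OF measure, assuming the per-attribute definition in Boriah et al. (2008): 1 for a match and 1 / (1 + log(N / f(x)) log(N / f(y))) for a mismatch, with N the number of individuals. The name of_sim is hypothetical.

```r
# Hypothetical helper: OF similarity between two individuals of table X.
of_sim <- function(x, y, X) {
  N <- nrow(X)
  per_attr <- vapply(seq_along(x), function(k) {
    if (x[k] == y[k]) return(1)
    fx <- sum(X[, k] == x[k])  # frequency of x's value in column k
    fy <- sum(X[, k] == y[k])  # frequency of y's value in column k
    1 / (1 + log(N / fx) * log(N / fy))
  }, numeric(1))
  mean(per_attr)
}

X <- rbind(c("red",   "round",  "S"),
           c("red",   "square", "S"),
           c("blue",  "square", "L"),
           c("green", "round",  "S"),
           c("red",   "square", "L"))

s12 <- of_sim(X[1, ], X[2, ], X)  # mismatch only on the second attribute
print(s12)
```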

5 = Goodall3

This measure assigns a high similarity if the matching values are infrequent regardless of the frequencies of the other values.
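A sketch of Goodall3, assuming the per-attribute definition in Boriah et al. (2008): 1 - p2(x) for a match and 0 for a mismatch, where p2(x) = f(x)(f(x) - 1) / (N(N - 1)) estimates the probability of drawing the value x twice. The name goodall3_sim is hypothetical.

```r
# Hypothetical helper: Goodall3 similarity between two individuals of table X.
goodall3_sim <- function(x, y, X) {
  N <- nrow(X)
  per_attr <- vapply(seq_along(x), function(k) {
    if (x[k] != y[k]) return(0)
    f <- sum(X[, k] == x[k])          # frequency of the shared value
    1 - f * (f - 1) / (N * (N - 1))   # rarer shared values score higher
  }, numeric(1))
  mean(per_attr)
}

X <- rbind(c("red",   "round",  "S"),
           c("red",   "square", "S"),
           c("blue",  "square", "L"),
           c("green", "round",  "S"))

s12 <- goodall3_sim(X[1, ], X[2, ], X)
print(s12)  # (5/6 + 0 + 1/2) / 3 = 4/9
```

The match on "red" (2 of 4 individuals) contributes 5/6, while the match on the more common "S" (3 of 4) contributes only 1/2.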

6 = Lin

This measure gives higher weight to matches on frequent values, and lower weight to mismatches on infrequent values.
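A sketch of the Lin measure, assuming the definition in Boriah et al. (2008): a match contributes 2 log p(x), a mismatch contributes 2 log(p(x) + p(y)), and the total is divided by the sum over attributes of log p(x) + log p(y), where p is the sample proportion of a value. The name lin_sim is hypothetical, and the sketch does not guard against the degenerate case where every attribute is constant (zero denominator).

```r
# Hypothetical helper: Lin similarity between two individuals of table X.
lin_sim <- function(x, y, X) {
  N <- nrow(X)
  num <- 0
  den <- 0
  for (k in seq_along(x)) {
    px <- sum(X[, k] == x[k]) / N  # sample proportion of x's value
    py <- sum(X[, k] == y[k]) / N  # sample proportion of y's value
    num <- num + if (x[k] == y[k]) 2 * log(px) else 2 * log(px + py)
    den <- den + log(px) + log(py)
  }
  num / den
}

X <- rbind(c("red",   "round",  "S"),
           c("red",   "square", "S"),
           c("blue",  "square", "L"),
           c("green", "round",  "S"))

s12 <- lin_sim(X[1, ], X[2, ], X)
print(s12)
```

Both numerator and denominator are sums of logarithms of proportions, hence non-positive, so their ratio is non-negative and equals 1 when the two individuals coincide on every attribute.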

Value

An object of class distance

Author(s)

Jose L. Vicente-Villardon

References

Boriah, S., Chandola, V. & Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In Proceedings of the Eighth SIAM International Conference on Data Mining, pp. 243–254.

See Also

BinaryDistances, ContinuousDistances

Examples

## Not run: 
data(Env)
# Overlap measure (method = 1) on the Env data
Distance <- NominalDistances(Env, upper = TRUE, diag = TRUE, similarity = FALSE, method = 1)

## End(Not run)

MultBiplotR documentation built on Nov. 21, 2023, 5:08 p.m.