sim.nominal: Similarity-measures for nominal variables

Description

Nominal variables can be encoded as a combination of a sparse incidence matrix and an index matrix, as discussed at splitTable. The two functions documented here are easy-to-use shortcuts that use those sparse matrices to compute pairwise similarities, either between observations (sim.obs) or between attributes (sim.att).

Usage

sim.att(D, method = "chuprov", sparse = TRUE, ...)
sim.obs(D, method = "hamming", sparse = TRUE, ...)

Arguments

D

Dataframe with nominal attributes (‘variables’) as columns and observations as rows.

method

method to be used for similarity computation. See Details below.

sparse

All methods try to be as sparse as possible. Specifically, when there is no observed co-occurrence, nothing is computed. This might lead to slight deviations in the results for some methods. Set sparse = FALSE to force computation for all cells. This leads to non-sparse results, so use it with caution on large datasets.

...

Arguments passed internally to splitTable. This is especially useful for multi-valued cells, using the option split. Note that method = "hamming" will give unexpected results when comparing cells that are both multi-valued. Consider using method = "weighted" instead.
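To make the notion of a multi-valued cell concrete, the following base R sketch splits the cells of a single nominal variable on a separator and builds a dense observation-by-value incidence matrix. It is only an illustration of the idea and not the implementation of splitTable, which produces sparse matrices. The vector `cells` and all other names are hypothetical.

```r
# Hypothetical column of a nominal variable; the first cell holds two values
cells <- c("M1,M5", "M1", "M2", NA)

# Split every cell on the separator; a missing cell stays NA
values <- strsplit(cells, ",", fixed = TRUE)

# All distinct values that occur in the non-missing cells
levels_all <- sort(unique(unlist(values[!is.na(cells)])))

# One row per observation, one column per value, 1 = value present
incidence <- t(sapply(values, function(v) as.integer(levels_all %in% v)))
dimnames(incidence) <- list(paste0("obs", seq_along(cells)), levels_all)

incidence
```

In this sketch a multi-valued cell gets more than one entry in its row, whereas a missing cell gets none, which is why a simple count of matches per comparison can behave unexpectedly for such cells.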

Details

The functions sim.att and sim.obs are convenience wrappers around the basic cosRow, cosCol and assocRow, assocCol functions. The sim functions take a dataframe as input, internally call splitTable to turn the dataframe into sparse matrices, and then apply sparse matrix algebra to compute similarities efficiently. Currently only a few exemplary methods are implemented.

sim.att computes similarities between the different nominal variables. The method chuprov computes Chuprov's T (very similar to Cramer's V, but easier to compute efficiently). The method g computes the G-test from Sokal and Rohlf (1982), also known as Dunning's G from Dunning (1993). This G is closely related to Mutual Information (G = 2*N*MI, with N being the sample size). The method mutual returns the mutual information, and the method variation returns the so-called ‘variation of information’ (joint information - mutual information). Note that this last one is a metric, not a similarity. All these methods can be abbreviated, e.g. use "c", "g", "m", and "v".
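The relations between these measures can be checked on a small example. The sketch below uses only base R and dense tables, so it is not how sim.att computes its results internally. The toy vectors `x` and `y`, the use of natural logarithms, and the textbook normalization of Chuprov's T are assumptions made for illustration.

```r
# Two hypothetical nominal variables observed on the same eight cases
x <- c("a", "a", "b", "b", "c", "c", "a", "b")
y <- c("u", "u", "v", "v", "v", "u", "u", "v")

tab <- table(x, y)        # observed co-occurrence counts
N   <- sum(tab)           # sample size
p   <- tab / N            # joint relative frequencies
e   <- outer(rowSums(p), colSums(p))  # expected frequencies under independence
E   <- e * N              # expected counts
nz  <- tab > 0            # cells with zero co-occurrence contribute nothing

# Mutual information (natural logarithm)
MI <- sum(p[nz] * log(p[nz] / e[nz]))

# G statistic computed directly from observed and expected counts
G <- 2 * sum(tab[nz] * log(tab[nz] / E[nz]))

# Variation of information: joint entropy minus mutual information
H_joint <- -sum(p[nz] * log(p[nz]))
VI <- H_joint - MI

# Chuprov's T from Pearson's chi-squared statistic
chi2 <- sum((tab - E)^2 / E)
chuprov_T <- sqrt(chi2 / (N * sqrt((nrow(tab) - 1) * (ncol(tab) - 1))))

all.equal(G, 2 * N * MI)  # the identity G = 2*N*MI stated above
```

Note that Chuprov's T sums over all cells, including those with zero co-occurrence, while MI and G only need the non-zero cells.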

sim.obs computes similarities between the different observations for the nominal variables. The method hamming computes the relative Hamming similarity, i.e. the number of similarities divided by the number of comparisons made (Goebl 1984 calls this the ‘Relativer Identitaetswert’). The method weighted uses an inverse square root weighting on all similarities, i.e. rare similarities count more. This is very similar to Goebl's ‘Gewichteter Identitaetswert’, though note that his definition is slightly different from the one used here. Further, all methods defined for assocSparse can be used here, i.e. res, pmi, wpmi, poi, and new methods can be defined according to the explanations at assocSparse.
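As a minimal illustration of the relative Hamming similarity, the following base R sketch compares two observations attribute by attribute and skips attributes with missing data. The helper `rel_hamming` and the toy vectors are hypothetical; sim.obs itself works on sparse matrices for all pairs of observations at once.

```r
# Hypothetical helper: matches divided by the number of comparisons made
rel_hamming <- function(a, b) {
  both <- !is.na(a) & !is.na(b)        # attributes available for both observations
  if (!any(both)) return(NA_real_)     # no shared data, nothing to compare
  sum(a[both] == b[both]) / sum(both)
}

obs1 <- c("red", "round",  NA,     "small")
obs2 <- c("red", "square", "soft", "small")

# Three comparisons can be made, two of them are identical
rel_hamming(obs1, obs2)
```

Comparing an observation with itself gives 1 under this definition, as long as it has at least one non-missing attribute.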

Value

All methods return symmetric similarity matrices in the form dsCMatrix, which only specifies the upper triangle. The only exception is when sparse = FALSE is chosen; the result will then be in the form dsyMatrix.

Note

Note that these methods automatically take missing data into account. They also work with large amounts of missing data, but of course the validity of any similarity based on much missing data is problematic.

The sim.att and sim.obs methods use sparse computations by default, which leads (among other effects) to errors on the diagonal. The main diagonal should be one everywhere by definition, but this will only be the case with the option sparse = FALSE. The deviations with sparse = TRUE should be minimal in the non-diagonal entries, while computations should be faster and the results often take up less space.

Author(s)

Michael Cysouw

References

Goebl, Hans. 1984. Dialektometrische Studien: anhand italoromanischer, raetoromanischer und galloromanischer Sprachmaterialien aus AIS und AFL. (Beihefte zur Zeitschrift fuer Romanische Philologie). Tuebingen: Niemeyer.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational linguistics 19(1). 61-74.

Examples

# first a simple example using the farms dataset from MASS
library(qlcMatrix)
library(MASS)

# similarities between farms
s <- sim.obs(farms)
plot(hclust(as.dist(1-s), method = "ward.D"))

# similarities between attributes (`variables`)
s <- sim.att(farms)
plot(hclust(as.dist(1-s), method = "ward.D"))

# use the split option for multi-valued cells
farms2 <- as.matrix(farms)
farms2[1,1] <- "M1,M5"

s <- sim.obs(farms2, split = ",")
plot(hclust(as.dist(1-s), method = "ward.D"))

## Not run: 
# a larger example with lots of missing data: the WALS data as included here
# computations are reasonably quick
# (on 2566 observations and 131 attributes with 630 different values in total)
data(wals)
system.time(s <- sim.att(wals$data))
rownames(s) <- colnames(wals$data)
plot(hclust(as.dist(1-s), method = "ward.D"), cex = 0.5)

# Note that using sparse = TRUE speeds up computations because it
# ignores zero co-occurrences
# This leads to small errors in the computation of Chuprov's T
system.time( # faster
	chup.sparse <- sim.att(wals$data,  method = "chuprov", sparse = TRUE)
)
system.time( # slower
	chup.full <- sim.att(wals$data, method = "chuprov", sparse = FALSE)
)

# The sparse approach is almost identical to the full approach.
# The sparse approach slightly underestimates the real values for Chuprov's T
plot(as.dist(chup.sparse), as.dist(chup.full))

# some more similarities on the attributes
g <- sim.att(wals$data, method = "g") # Dunning's G
m <- sim.att(wals$data, method = "mutual") # Mutual Information
v <- sim.att(wals$data, method = "variation") # Variation of Information

# Note the strong differences between these approaches
pairs(~ as.dist(chup.sparse) + as.dist(m) + as.dist(g) + as.dist(v),
	labels=c("Chuprov's T","Mutual Information","G-statistic","Variation of Information"))
	
# Relative Hamming similarity on all observations (languages) in WALS
# time is not a problem, but the data is so sparse
# that for many language-pairs there is no shared data
system.time( s <- sim.obs(wals$data))

# select only the 168 languages with more than 80 datapoints
sel <- wals$data[apply(wals$data,1,function(x){sum(!is.na(x))})>80,]

# compare different similarities
w <- sim.obs(sel, "weighted")
h <- sim.obs(sel, "hamming")
r <- sim.obs(sel, "res")
p <- sim.obs(sel, "poi")
m <- sim.obs(sel, "wpmi")
i <- sim.obs(sel, "pmi")

pairs(~ as.dist(w) + as.dist(h) + as.dist(r) + as.dist(p) + as.dist(m) + as.dist(i),
	labels = c("weighted","hamming","residuals","poisson","weighted PMI","PMI"))

## End(Not run)

cysouw/qlcMatrix documentation built on Dec. 18, 2017, 9:12 a.m.