Description Usage Arguments Details Value Note Author(s) References Examples

Nominal variables can be encoded as a combination of a sparse incidence and index matrix, as discussed at `splitTable`

. The present two functions are easy-to-use shortcuts to use those sparse matrices to computes pairwise similarities, either between observations (`sim.obs`

) or attributes (`sim.att`

).

1 2 |

`D` |
Dataframe with nominal attributes (‘variables’) as columns and observations as rows. |

`method` |
method to be used for similarity computation. See Details below. |

`sparse` |
All methods try to be as sparse as possible. Specifically, when there are no observed co-occurrence, then nothing is computed. This might lead to slight deviations in the results for some methods. Set |

`...` |
Arguments passed internally to |

The function `sim.att`

and `sim.obs`

are convenience wrappers around the basic `cosRow, cosCol`

and `assocRow, assocCol`

functions. The `sim`

functions take a dataframe as input, internally calling `splitTable`

to turn the dataframe into sparse matrices, and then applying sparse matrix algebra to efficiently compute similarities. Currently only a few exemplary methods are encoded.

`sim.att`

computes similarities between the different nominal variables. The method `chuprov`

computes Chuprov's T (very similar to Cramer's V, but easier to compute efficiently). The method `g`

computes the G-test from Sokal and Rohlf (1982), also known as Dunning's G from Dunning (1993). This G is closely related to Mutual Information (G = 2*N*MI, with N being the sample size). The method `mutual`

returns the mutual information, and the method `variation`

returns the so-called ‘variation of information’ (join information - mutual information). Note that the this last one is a metric, not a similarity. All these methods can be abbreviated, e.g use "c", "g", "m", and "v".

`sim.obs`

computes similarities between the different observation for the nominal variables. The method `hamming`

computes the relative Hamming similarity, i.e. the number of similarities devided by the number of comparisons made (Goebl 1984 calls this the ‘Relativer Identitaetswert’). The method `weighted`

uses an inverse square root weighting on all similarities, i.e. rare similarities count more. This is very similar to Goebl's ‘Gewichteter Identitaetswert’, though note that his definition is slightly different from the one used here. Further, all methods as defined for `assocSparse`

can be used here, i.e. `res, pmi, wpmi, poi`

, and new methods can be defined according to the explanations as `assocSparse`

.

All methods return symmetric similarity matrices in the form `dsCMatrix`

, only specifying the upper triangle. The only exception is when `sparse=T`

is chose, then the result will be in the form `dsyMatrix`

.

Note that these methods automatically take missing data into account. They also work with large amount of missing data, but of course the validity of any similarity with much missing data is problematic.

The `sim.att`

and `sim.obs`

methods by default use sparse computations, which leads (among other effects) to errors on the diagonal. The main diagonal should be one everywhere by definition, but this will only be the case with the option `sparse = F`

. The deviations with `sparse = T`

should be minimal in the non-diagonal entries, but computations should be faster, and the results often take up less space.

Michael Cysouw

Goebl, Hans. 1984. *Dialektometrische Studien: anhand italoromanischer, raetoromanischer und galloromanischer Sprachmaterialien aus AIS und AFL.* (Beihefte zur Zeitschrift fuer Romanische Philologie). Tuebingen: Niemeyer.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. *Computational linguistics* 19(1). 61-74.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | ```
# first a simple example using the farms-dataset from MASS
library(MASS)
# similarities between farms
s <- sim.obs(farms)
plot(hclust(as.dist(1-s), method = "ward.D"))
# similarities between attributes (`variables`)
s <- sim.att(farms)
plot(hclust(as.dist(1-s), method = "ward.D"))
# use the split option for multi-valued cells
farms2 <- as.matrix(farms)
farms2[1,1] <- "M1,M5"
s <- sim.obs(farms2, split = ",")
plot(hclust(as.dist(1-s), method = "ward.D"))
## Not run:
# a larger example with lots of missing data: the WALS-data as included here
# computations go reasonably quick
# (on 2566 observations and 131 attributes with 630 different values in total)
data(wals)
system.time(s <- sim.att(wals$data))
rownames(s) <- colnames(wals$data)
plot(hclust(as.dist(1-s), method = "ward.D"), cex = 0.5)
# Note that using sparse=T speeds up computations because it
# ignores zero co-occurrences
# This leads to small errors in the computation of Chuprov's T
system.time( # faster
chup.sparse <- sim.att(wals$data, method = "chuprov", sparse = TRUE)
)
system.time( # slower
chup.full <- sim.att(wals$data, method = "chuprov", sparse = FALSE)
)
# The sparse approach is almost identical to the full approach.
# sparse sligtly underestimates the real values for Chuprov's T
plot(as.dist(chup.sparse), as.dist(chup.full))
# some more similarities on the attributes
g <- sim.att(wals$data, method = "g") # Dunning's G
m <- sim.att(wals$data, method = "mutual") # Mutual Information
v <- sim.att(wals$data, method = "variation") # Variation of Information
# Note the strong differences between these approaches
pairs(~ as.dist(chup.sparse) + as.dist(m) + as.dist(g) + as.dist(v),
labels=c("Chuprov's T","Mutual Information","G-statistic","Variation of Information"))
# Relative Hamming similarity on all observations (languages) in WALS
# time is not a problem, but the data is so sparse
# that for many language-pairs there is no shared data
system.time( s <- sim.obs(wals$data))
# select only the 168 language with more than 80 datapoints
sel <- wals$data[apply(wals$data,1,function(x){sum(!is.na(x))})>80,]
# compare different similarities
w <- sim.obs(sel, "weighted")
h <- sim.obs(sel, "hamming")
r <- sim.obs(sel, "res")
p <- sim.obs(sel, "poi")
m <- sim.obs(sel, "wpmi")
i <- sim.obs(sel, "pmi")
pairs(~ as.dist(w) + as.dist(h) + as.dist(r) + as.dist(p) + as.dist(m) + as.dist(i),
labels = c("weighted","hamming","residuals","poisson","weighted PMI","PMI"))
## End(Not run)
``` |

cysouw/qlcMatrix documentation built on April 22, 2018, 4:59 a.m.

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.