# Similarity matrices from wordlists

### Description

A few different approaches are implemented here to compute similarities from wordlists. `sim.lang` computes similarities between languages, assuming a harmonized orthography (i.e. symbols can be equated across languages). `sim.con` computes similarities between concepts, using only language-internal similarities. `sim.graph` computes similarities between graphemes (i.e. language-specific symbols) between languages, as a crude approximation of regular sound correspondences.

WARNING: All these methods are really very crude! If they seem to give expected results, then this should be a lesson to rethink more complex methods proposed in the literature. However, in most cases the methods implemented here should be taken as a proof-of-concept, showing that such high-level similarities can be computed efficiently for large datasets. For actual research, I strongly urge anybody to adapt the current methods, and fine-tune them as needed.

### Usage

```
sim.lang(wordlist,
  doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
  method = "parallel", assoc.method = res, weight = NULL, sep = "")

sim.con(wordlist,
  doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
  method = "bigrams", assoc.method = res, weight = NULL, sep = "")

sim.graph(wordlist,
  doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "TOKENS",
  method = "cooccurrence", assoc.method = poi, weight = NULL, sep = " ")
```

### Arguments

| Argument | Description |
| --- | --- |
| `wordlist` | Dataframe or matrix containing the wordlist data. Should have at least columns corresponding to languages (DOCULECT), meanings (CONCEPT) and translations (COUNTERPART). |
| `doculects, concepts, counterparts` | The name (or number) of the column of `wordlist` that holds the corresponding information. |
| `method` | Specific approach for the computation of the similarities. See Details below. |
| `assoc.method, weight` | Measures to be used internally (passed on to `assocSparse` or `cosSparse`). See Details below. |
| `sep` | Separator to be used to split strings. |

### Details

The following methods are currently implemented (all methods can be abbreviated):

For `sim.lang`:

- `global`: Global bigram similarity, i.e. ignoring the separation into concepts, and simply taking the bigram vector of all words per language. Probably best combined with `weight = idf`.
- `parallel`: The default. Computes a parallel bigram similarity, i.e. splitting the bigram vectors per language and per concept, and then simply making one long vector per language from all individual concept-bigram vectors. This approach seems to perform very similarly to (if not slightly better than) the widespread ‘average Levenshtein’ distance.
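To make the idea of a bigram-based language similarity concrete, here is a minimal base-R sketch of the `global` flavour on invented data. The toy words, the `bigrams` helper, the `#` boundary marker and the plain cosine are all assumptions made for illustration; the package itself works on sparse matrices and its internals may differ.

```r
# Hypothetical toy data: two words per language, harmonized orthography assumed
words <- list(
  langA = c("mano", "pata"),
  langB = c("mano", "bata"),
  langC = c("kisi", "tolu")
)

# Split one word into character bigrams, with "#" marking the word boundaries
bigrams <- function(word) {
  chars <- c("#", strsplit(word, "")[[1]], "#")
  paste0(head(chars, -1), tail(chars, -1))
}

# Count bigrams per language over all of its words (ignoring concepts)
all_bigrams <- sort(unique(unlist(lapply(unlist(words), bigrams))))
counts <- sapply(words, function(w) {
  table(factor(unlist(lapply(w, bigrams)), levels = all_bigrams))
})

# Cosine similarity between the language columns of the count matrix
norms <- sqrt(colSums(counts^2))
sim_toy <- crossprod(counts) / outer(norms, norms)
round(sim_toy, 2)
```

In this toy example `langA` and `langB` share 8 of their 10 bigrams, while `langC` shares none with either. The `parallel` variant would additionally keep the bigram counts separate per concept before concatenating them into one long vector per language.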

For `sim.con`:

- `colexification`: Simply counts the number of languages in which two concepts have at least one completely identical translation. No normalization is attempted, and `assoc.method` and `weight` are ignored (internally this just uses `tcrossprod` on the `CW` (concepts x words) sparse matrix). Because no splitting of strings is necessary, this method is very quick.
- `global`: Global bigram similarity, i.e. ignoring the separation into languages, and simply taking the bigram vector of all words per concept. Probably best combined with `weight = idf`.
- `bigrams`: The default. Computes the similarity between concepts by comparing bigraphs, i.e. language-specific bigrams. In that way, cross-linguistically recurrent partial similarities are uncovered. It is very interesting to compare this measure with `colexification` above.
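The colexification count can be sketched with the Matrix package alone. The toy wordlist and the way the concepts x words matrix is built here are assumptions for illustration; in the package this matrix comes out of the wordlist conversion step.

```r
library(Matrix)

# Hypothetical toy wordlist: "arm" and "hand" share a form in both languages
toy <- data.frame(
  DOCULECT    = c("langA", "langA", "langA", "langB", "langB", "langB"),
  CONCEPT     = c("arm",   "hand",  "foot",  "arm",   "hand",  "foot"),
  COUNTERPART = c("ruka",  "ruka",  "noga",  "te",    "te",    "wae")
)

# Make forms language-specific, so identical strings in different
# languages are treated as different words
concept <- factor(toy$CONCEPT)
word <- factor(paste(toy$DOCULECT, toy$COUNTERPART, sep = ":"))

# Concepts x words incidence matrix (1 if the word expresses the concept)
CW <- sparseMatrix(
  i = as.integer(concept), j = as.integer(word), x = 1,
  dims = c(nlevels(concept), nlevels(word)),
  dimnames = list(levels(concept), levels(word))
)

# Each word shared by two concepts adds 1 to their entry
colex <- tcrossprod(CW)
colex["arm", "hand"]
```

Note that this sketch counts shared words. That equals the number of languages only as long as each language contributes at most one shared word per concept pair, which holds for the toy data.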

For `sim.graph`:

- `cooccurrence`: Currently the only method implemented. Computes the co-occurrence statistics for all pairs of graphemes (e.g. between symbol x from language L1 and symbol y from language L2). See Prokic & Cysouw (2013) for an example using this approach.
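The raw counts behind such co-occurrence statistics can be sketched in base R as follows. The two tiny tokenized wordlists and the `presence` helper are invented for illustration, and the sketch stops at raw counts: turning them into an association score (the job of `assoc.method`) is not shown.

```r
# Hypothetical toy data: one space-separated tokenized word per concept
L1 <- c(hand = "m a n o", foot = "p a t a")
L2 <- c(hand = "m e n",   foot = "b e t")

# Concepts x graphemes presence matrix for one language
presence <- function(tokens) {
  split_tokens <- strsplit(tokens, " ")
  graphemes <- sort(unique(unlist(split_tokens)))
  m <- t(sapply(split_tokens, function(g) as.integer(graphemes %in% g)))
  dimnames(m) <- list(names(tokens), graphemes)
  m
}

P1 <- presence(L1)
P2 <- presence(L2)

# Entry [x, y]: number of concepts whose L1 word contains grapheme x
# and whose L2 word contains grapheme y
cooc <- crossprod(P1, P2)
cooc
```

Here `a` in L1 and `e` in L2 co-occur in both concepts, which is the kind of signal that hints at a recurrent correspondence, whereas `m` and `b` never co-occur.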

All these methods (except for `sim.con(method = "colexification")`) use either `assocSparse` or `cosSparse` for the computation of the similarities. For the different measures available, see the documentation there. Currently implemented are `res, poi, pmi, wpmi` for `assocSparse` and `idf, isqrt, none` for `cosWeight`. It is actually very easy to define your own measure.

When `weight = NULL`, `assocSparse` is used with the internal method as specified in `assoc.method`. When `weight` is specified, `cosSparse` is used with a Euclidean norm and the weighting as specified in `weight`; in that case, any specification of `assoc.method` is ignored.

### Value

A sparse similarity matrix of class `dsCMatrix`. The magnitude of the actual values in the matrices depends strongly on the methods chosen.

With `sim.graph` a list of two matrices is returned.

| Component | Description |
| --- | --- |
| `GG` | The grapheme-by-grapheme similarity matrix. |
| `GD` | A pattern matrix indicating which grapheme belongs to which language. |

### Author(s)

Michael Cysouw

### References

Prokic, Jelena and Michael Cysouw. 2013. Combining regular sound correspondences and geographic spread. *Language Dynamics and Change* 3(2). 147–168.

### See Also

Based on `splitWordlist` for the underlying conversion of the wordlist into sparse matrices. The actual similarities are mostly computed using `assocSparse` or `cosSparse`.

### Examples

```
# ----- load data -----

# an example wordlist, see help(huber) for details
data(huber)

# ----- similarity between languages -----

# most time is spent splitting the strings
# the rest does not really influence the time needed
system.time( sim <- sim.lang(huber, method = "p") )

# a simple distance-based UPGMA tree
plot(hclust(as.dist(-sim), method = "average"), cex = .7)

## Not run:
# ----- similarity between concepts -----

# similarity based on bigrams
system.time( simB <- sim.con(huber, method = "b") )
# similarity based on colexification. much easier to calculate
system.time( simC <- sim.con(huber, method = "c") )

# As an example, look at all adjectival concepts
adj <- c(1,5,13,14,28,35,40,48,67,89,105,106,120,131,137,146,148,
         171,179,183,188,193,195,206,222,234,259,262,275,279,292,
         294,300,309,341,353,355,359)

# show them as trees
# ("ward.D" is the current name of hclust's former "ward" method, R >= 3.1.0)
par(mfrow = c(1,2))
plot(hclust(as.dist(-simB[adj,adj]), method = "ward.D"),
     cex = .5, main = "bigrams")
plot(hclust(as.dist(-simC[adj,adj]), method = "ward.D"),
     cex = .5, main = "colexification")
par(mfrow = c(1,1))

# ----- similarity between graphemes -----

# this is a very crude approach towards regular sound correspondences
# when the languages are not too distantly related, it works rather nicely
# can be used as a quick first guess of correspondences for input in more advanced methods

# all 2080 graphemes in the data by all 2080 graphemes, from all languages
system.time( X <- sim.graph(huber) )

# throw away the low values
# select just one pair of languages for a quick visualisation
X$GG <- drop0(X$GG, tol = 1)
colnames(X$GG) <- rownames(X$GG)
correspondences <- X$GG[X$GD[,"bora"],X$GD[,"muinane"]]
heatmap(as.matrix(correspondences))

## End(Not run)
```