A few different approaches are implemented here to compute similarities from wordlists.
sim.lang computes similarities between languages, assuming a harmonized orthography (i.e. symbols can be equated across languages).
sim.con computes similarities between concepts, using only language-internal similarities.
sim.graph computes similarities between graphemes (i.e. language-specific symbols) across languages, as a crude approximation of regular sound correspondences.
WARNING: All these methods are really very crude! If they seem to give expected results, then this should be a lesson to rethink more complex methods proposed in the literature. However, in most cases the methods implemented here should be taken as a proof-of-concept, showing that such high-level similarities can be computed efficiently for large datasets. For actual research, I strongly urge anybody to adapt the current methods, and fine-tune them as needed.
sim.lang(wordlist,
  doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
  method = "parallel", assoc.method = res, weight = NULL, sep = "")

sim.con(wordlist,
  doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
  method = "bigrams", assoc.method = res, weight = NULL, sep = "")

sim.graph(wordlist,
  doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "TOKENS",
  method = "cooccurrence", assoc.method = poi, weight = NULL, sep = " ")
wordlist: Dataframe or matrix containing the wordlist data. Should have at least columns corresponding to languages (DOCULECT), meanings (CONCEPT) and translations (COUNTERPART).
doculects, concepts, counterparts: The name (or number) of the column of wordlist in which the respective information is found.
method: Specific approach for the computation of the similarities. See Details below.
assoc.method, weight: Measures to be used internally (passed on to assocSparse or cosSparse). See Details below.
sep: Separator to be used to split strings.
The following methods are currently implemented (all methods can be abbreviated):
For sim.lang:
global: Global bigram similarity, i.e. ignoring the separation into concepts, and simply taking the bigram vector of all words per language. Probably best combined with weight = idf.
parallel: The default. Computes a parallel bigram similarity, i.e. splitting the bigram vectors per language and per concept, and then simply making one long vector per language from all individual concept-bigram vectors. This approach seems to be very similar to (if not slightly better than) the widespread ‘average Levenshtein’ distance.
For sim.con:
colexification: Simply counts the number of languages in which two concepts have at least one completely identical translation. No normalization is attempted, and assoc.method and weight are ignored (internally this just uses tcrossprod on the CW (concepts x words) sparse matrix). Because no splitting of strings is necessary, this method is very quick.
global: Global bigram similarity, i.e. ignoring the separation into languages, and simply taking the bigram vector of all words per concept. Probably best combined with weight = idf.
bigrams: The default. Computes the similarity between concepts by comparing bigraphs, i.e. language-specific bigrams. In that way, cross-linguistically recurrent partial similarities are uncovered. It is very interesting to compare this measure with colexification.
For sim.graph:
cooccurrence: Currently the only method implemented. Computes the co-occurrence statistics for all pairs of graphemes (e.g. between symbol x from language L1 and symbol y from language L2). See Prokic & Cysouw (2013) for an example using this approach.
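The colexification count described above can be illustrated with a minimal base R sketch. The toy wordlist, the WORD helper column and the dense incidence matrix are assumptions made for illustration only; the package itself works on sparse matrices built from the real wordlist.

```r
# Toy wordlist; column names follow the defaults of the functions above.
wl <- data.frame(
  DOCULECT    = c("A", "A", "A", "B", "B", "B"),
  CONCEPT     = c("hand", "arm", "tree", "hand", "arm", "tree"),
  COUNTERPART = c("ma", "ma", "ki", "to", "to", "ki"),
  stringsAsFactors = FALSE
)

# Make word forms language-specific, so that identical strings in
# different languages are never equated (hypothetical helper column).
wl$WORD <- paste(wl$DOCULECT, wl$COUNTERPART, sep = ":")

# Dense concept x word incidence matrix (1 = word expresses concept).
tab <- table(wl$CONCEPT, wl$WORD)
CW <- matrix(as.numeric(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))

# Shared word forms per pair of concepts.
colex <- tcrossprod(CW)
diag(colex) <- 0
print(colex)
```

In this toy data each language has at most one shared form per concept pair, so the counts coincide with the number of colexifying languages. With several shared forms in one language the plain cross product would count forms rather than languages.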
All these methods (except for sim.con(method = "colexification")) use either assocSparse or cosSparse for the computation of the similarities. For the different measures available, see the documentation there. Currently implemented are res, poi, pmi, wpmi for assocSparse and idf, isqrt, none for cosWeight. It is actually very easy to define your own measure.
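As a hedged illustration of what a user-defined measure might look like, the sketch below assumes that an association measure is simply a function of observed (o) and expected (e) co-occurrence counts; the exact interface should be checked in the assocSparse documentation. The toy matrix, the residual function and my.measure are hypothetical, and everything is computed in base R rather than with the package.

```r
# Toy binary occurrence matrix: rows are observations, columns are items.
X <- matrix(c(1, 1, 0,
              1, 1, 0,
              0, 1, 1,
              0, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))
N <- nrow(X)

# Observed co-occurrence counts between items.
O <- crossprod(X)

# Expected counts under independence of the items.
E <- outer(colSums(X), colSums(X)) / N

# A residual-style measure (assumed form, written out here for comparison).
residual <- function(o, e) (o - e) / sqrt(e)

# A hypothetical user-defined measure: relative deviation from expectation.
my.measure <- function(o, e) (o - e) / e

print(round(residual(O, E), 3))
print(round(my.measure(O, E), 3))
```

Positive values indicate items that co-occur more often than expected, negative values items that co-occur less often than expected.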
When weight = NULL, then assocSparse is used with the internal method as specified in assoc.method. When weight is specified, then cosSparse is used with a Euclidean norm and the weighting as specified in weight. In that case, any specification of assoc.method is ignored.
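To make the weighted cosine concrete, here is a minimal base R sketch of a global bigram similarity between languages with a Euclidean norm and an optional weight. It is not the package's implementation: the "#" boundary padding, the toy words, the dense matrix and the weighting function idf.like are all assumptions made for illustration.

```r
# Split a word into character bigrams, padded with "#" as a boundary
# marker (the padding is an assumption of this sketch).
bigrams <- function(word) {
  chars <- c("#", strsplit(word, "")[[1]], "#")
  paste0(chars[-length(chars)], chars[-1])
}

# Toy data: a few words per language.
words <- list(
  L1 = c("mano", "pie"),
  L2 = c("mano", "pe"),
  L3 = c("hand", "foot")
)

# Dense bigram x language count matrix.
all.bigrams <- sort(unique(unlist(lapply(unlist(words), bigrams))))
B <- sapply(words, function(w) {
  as.vector(table(factor(unlist(lapply(w, bigrams)), levels = all.bigrams)))
})
rownames(B) <- all.bigrams

# Cosine similarity between columns, optionally weighting each bigram
# by a function of s (number of languages it occurs in) and N (languages).
cosine <- function(M, weight = NULL) {
  if (!is.null(weight)) {
    s <- rowSums(M > 0)
    M <- M * weight(s, ncol(M))   # scales row i by the weight of bigram i
  }
  norms <- sqrt(colSums(M^2))     # Euclidean norm per language
  crossprod(M) / outer(norms, norms)
}

# Hypothetical weight that downweights widespread bigrams.
idf.like <- function(s, N) log(1 + N / s)

S <- cosine(B)                      # unweighted
W <- cosine(B, weight = idf.like)   # weighted
print(round(S, 3))
print(round(W, 3))
```

The weighting reduces the contribution of bigrams that occur in many languages, which is the motivation for combining the global method with a weight.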
A sparse similarity matrix of class dsCMatrix. The magnitude of the actual values in the matrices depends strongly on the methods chosen.
With sim.graph a list of two matrices is returned:
GG: The grapheme by grapheme similarity matrix of class dsCMatrix.
GD: A pattern matrix indicating which grapheme belongs to which language.
Prokic, Jelena and Michael Cysouw. 2013. Combining regular sound correspondences and geographic spread. Language Dynamics and Change 3(2). 147–168.
See also splitWordlist for the underlying conversion of the wordlist into sparse matrices. The actual similarities are mostly computed using assocSparse and cosSparse.
# ----- load data -----

# an example wordlist, see help(huber) for details
data(huber)

# ----- similarity between languages -----

# most time is spent splitting the strings
# the rest does not really influence the time needed
system.time( sim <- sim.lang(huber, method = "p") )

# a simple distance-based UPGMA tree
plot(hclust(as.dist(-sim), method = "average"), cex = .7)

## Not run: 
# ----- similarity between concepts -----

# similarity based on bigrams
system.time( simB <- sim.con(huber, method = "b") )
# similarity based on colexification. much easier to calculate
system.time( simC <- sim.con(huber, method = "c") )

# As an example, look at all adjectival concepts
adj <- c(1,5,13,14,28,35,40,48,67,89,105,106,120,131,137,146,148,
  171,179,183,188,193,195,206,222,234,259,262,275,279,292,
  294,300,309,341,353,355,359)

# show them as trees
# ("ward.D" is the current name of the method formerly called "ward")
par(mfrow = c(1,2))
plot(hclust(as.dist(-simB[adj,adj]), method = "ward.D"),
  cex = .5, main = "bigrams")
plot(hclust(as.dist(-simC[adj,adj]), method = "ward.D"),
  cex = .5, main = "colexification")
par(mfrow = c(1,1))

# ----- similarity between graphemes -----

# this is a very crude approach towards regular sound correspondences
# when the languages are not too distantly related, it works rather nicely
# can be used as a quick first guess of correspondences for input in more advanced methods

# all 2080 graphemes in the data by all 2080 graphemes, from all languages
system.time( X <- sim.graph(huber) )

# throw away the low values
# select just one pair of languages for a quick visualisation
X$GG <- drop0(X$GG, tol = 1)
colnames(X$GG) <- rownames(X$GG)
correspondences <- X$GG[X$GD[,"bora"],X$GD[,"muinane"]]
heatmap(as.matrix(correspondences))

## End(Not run)