sim.wordlist: Similarity matrices from wordlists

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

A few different approaches are implemented here to compute similarities from wordlists. sim.lang computes similarities between languages, assuming a harmonized orthography (i.e. symbols can be equated across languages). sim.con computes similarities between concepts, using only language-internal similarities. sim.graph computes similarities between graphemes (i.e. language-specific symbols) between languages, as a crude approximation of regular sound correspondences.

WARNING: All these methods are really very crude! If they seem to give expected results, then this should be a lesson to rethink more complex methods proposed in the literature. However, in most cases the methods implemented here should be taken as a proof-of-concept, showing that such high-level similarities can be computed efficiently for large datasets. For actual research, I strongly urge anybody to adapt the current methods, and fine-tune them as needed.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
sim.lang(wordlist, 
	doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
	method = "parallel", assoc.method =  res, weight = NULL, sep = "")

sim.con(wordlist,
	doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
	method = "bigrams", assoc.method = res, weight = NULL, sep = "")

sim.graph(wordlist,
	doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "TOKENS",
	method = "cooccurrence", assoc.method = poi, weight = NULL, sep = " ")

Arguments

wordlist

Dataframe or matrix containing the wordlist data. Should have at least columns corresponding to languages (DOCULECT), meanings (CONCEPT) and translations (COUNTERPART).

doculects, concepts, counterparts

The name (or number) of the column of wordlist in which the respective information is to be found. The defaults are set to coincide with the naming of the example dataset included in this package. See huber.

method

Specific approach for the computation of the similarities. See Details below.

assoc.method, weight

Measures to be used internally (passed on to assocSparse or cosSparse). See Details below.

sep

Separator to be used to split strings. See link{splitStrings} for details.

Details

The following methods are currently implemented (all methods can be abbreviated):

For sim.lang:

For sim.con:

For sim.graph:

All these methods (except for sim.con(method = "colexification")) use either assocSparse or cosSparse for the computation of the similarities. For the different measures available, see the documentation there. Currently implemented are res, poi, pmi, wpmi for assocSparse and idf, isqrt, none for cosWeight. It is actually very easy to define your own measure.

When weight = NULL, then assocSparse is used with the internal method as specified in assoc.method. When weight is specified, then cosSparse is used with an Euclidean norm and the weighting as specified in weight. When weight is specified, and specification of assoc.method is ignored.

Value

A sparse similarity matrix of class dsCMatrix. The magnitude of the actual values in the matrices depend strongly on the methods chosen.

With sim.graph a list of two matrices is returned.

GG

The grapheme by grapheme similarity matrix of class dsCMatrix

GD

A pattern matrix of class indicating which grapheme belongs to which language.

Author(s)

Michael Cysouw

References

Prokic, Jelena and Michael Cysouw. 2013. Combining regular sound correspondences and geographic spread. Language Dynamics and Change 3(2). 147–168.

See Also

Based on splitWordlist for the underlying conversion of the wordlist into sparse matrices. The actual similarities are mostly computed using assocSparse or cosSparse.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# ----- load data -----

# an example wordlist, see help(huber) for details
data(huber)

# ----- similarity between languages -----

# most time is spend splitting the strings
# the rest does not really influence the time needed
system.time( sim <- sim.lang(huber, method = "p") )

# a simple distance-based UPGMA tree
plot(hclust(as.dist(-sim), method = "average"), cex = .7)

## Not run: 

# ----- similarity between concepts -----

# similarity based on bigrams
system.time( simB <- sim.con(huber, method = "b") )
# similarity based on colexification. much easier to calculate
system.time( simC <- sim.con(huber, method = "c") )

# As an example, look at all adjectival concepts
adj <- c(1,5,13,14,28,35,40,48,67,89,105,106,120,131,137,146,148,
	171,179,183,188,193,195,206,222,234,259,262,275,279,292,
	294,300,309,341,353,355,359)

# show them as trees
par(mfrow = c(1,2)) 
plot(hclust(as.dist(-simB[adj,adj]), method = "ward"), 
	cex = .5, main = "bigrams")
plot(hclust(as.dist(-simC[adj,adj]), method = "ward"), 
	cex = .5, main = "colexification")
par(mfrow = c(1,1))

# ----- similarity between graphemes -----

# this is a very crude approach towards regular sound correspondences
# when the languages are not too distantly related, it works rather nicely 
# can be used as a quick first guess of correspondences for input in more advanced methods

# all 2080 graphemes in the data by all 2080 graphemes, from all languages
system.time( X <- sim.graph(huber) )

# throw away the low values
# select just one pair of languages for a quick visualisation
X$GG <- drop0(X$GG, tol = 1)
colnames(X$GG) <- rownames(X$GG)
correspondences <- X$GG[X$GD[,"bora"],X$GD[,"muinane"]]
heatmap(as.matrix(correspondences))

## End(Not run)

qlcMatrix documentation built on May 2, 2019, 9:14 a.m.