etymology: Etymological age and regularity in Dutch

Description

Estimated etymological age for regular and irregular monomorphemic Dutch verbs, together with other distributional predictors of regularity.

Usage

data(etymology)

Format

A data frame with 285 observations on the following 14 variables.

Verb

a factor with the verbs as levels.

WrittenFrequency

a numeric vector of logarithmically transformed frequencies in written Dutch (as available in the CELEX lexical database).

NcountStem

a numeric vector for the number of orthographic neighbors.

MeanBigramFrequency

a numeric vector for mean log bigram frequency.

InflectionalEntropy

a numeric vector for Shannon's entropy calculated for the word's inflectional variants.

Auxiliary

a factor with levels hebben, zijn and zijnheb for the verb's auxiliary in the perfect tenses.

Regularity

a factor with levels irregular and regular.

LengthInLetters

a numeric vector of the word's orthographic length.

Denominative

a factor with levels Den and N specifying whether a verb is derived from a noun according to the CELEX lexical database.

FamilySize

a numeric vector for the number of types in the word's morphological family.

EtymAge

an ordered factor with levels Dutch, DutchGerman, WestGermanic, Germanic and IndoEuropean.

Valency

a numeric vector for the verb's valency, estimated by its number of argument structures.

NVratio

a numeric vector for the log-transformed ratio of the nominal and verbal frequencies of use.

WrittenSpokenRatio

a numeric vector for the log-transformed ratio of the frequencies in written and spoken Dutch.
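
As an illustration of how measures such as InflectionalEntropy, NVratio and WrittenSpokenRatio can be computed, here is a minimal sketch in base R. The counts, the helper names shannon_entropy and log_ratio, and the base-2 logarithm for the entropy are assumptions made for this example. This page does not state the exact procedure or logarithm base behind the values in the data frame.

```r
# Hypothetical frequency counts for the inflectional variants of one verb
# (illustrative numbers, not taken from CELEX or from the etymology data).
variant_counts <- c(stem = 120, third_sg = 60, plural = 40, past_sg = 20)

# Shannon entropy: H = -sum(p * log2(p)), with p the relative frequencies.
# Base 2 is an assumption; the data set may have been built with another base.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # variants with zero count contribute nothing to the sum
  -sum(p * log2(p))
}

inflectional_entropy <- shannon_entropy(variant_counts)

# Log-transformed ratio of two frequencies (natural log assumed here).
log_ratio <- function(numerator, denominator) {
  log(numerator / denominator)
}

# Hypothetical nominal and verbal counts for a single word form.
nv_ratio <- log_ratio(50, 200)
```

Under these assumptions, a verb whose variants are all equally frequent has the maximum entropy for its number of variants, and a negative log ratio means the denominator frequency (here the verbal one) is the larger of the two.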

References

Baayen, R. H. and Moscoso del Prado Martin, F. (2005) Semantic density and past-tense formation in three Germanic languages, Language, 81, 666-698.

Tabak, W., Schreuder, R. and Baayen, R. H. (2005) Lexical statistics and lexical processing: semantic density, information complexity, sex, and irregularity in Dutch, in Kepser, S. and Reis, M., Linguistic Evidence - Empirical, Theoretical, and Computational Perspectives, Berlin: Mouton de Gruyter, pp. 529-555.

Examples

## Not run: 
data(etymology)

# ---- EtymAge should be an ordered factor, set contrasts accordingly

etymology$EtymAge = ordered(etymology$EtymAge, levels = c("Dutch",
  "DutchGerman", "WestGermanic", "Germanic", "IndoEuropean"))
options(contrasts=c("contr.treatment","contr.treatment"))

library(rms)
etymology.dd = datadist(etymology)
options(datadist = 'etymology.dd')

# ---- EtymAge as additional predictor for regularity

etymology.lrm = lrm(Regularity ~ WrittenFrequency +
  rcs(FamilySize, 3) + NcountStem + InflectionalEntropy +
  Auxiliary + Valency + NVratio + WrittenSpokenRatio + EtymAge,
  data = etymology, x = TRUE, y = TRUE)
anova(etymology.lrm)

# ---- EtymAge as dependent variable

etymology.lrm = lrm(EtymAge ~ WrittenFrequency + NcountStem +
  MeanBigramFrequency + InflectionalEntropy + Auxiliary +
  Regularity + LengthInLetters + Denominative + FamilySize + Valency +
  NVratio + WrittenSpokenRatio, data = etymology, x = TRUE, y = TRUE)

# ---- model simplification 

etymology.lrm = lrm(EtymAge ~ NcountStem + Regularity + Denominative,
  data = etymology, x = TRUE, y = TRUE)
validate(etymology.lrm, bw = TRUE, B = 200)

# ---- plot partial effects and check the assumptions of ordinal regression

plot(Predict(etymology.lrm))
plot(etymology.lrm)
resid(etymology.lrm, 'score.binary', pl = TRUE)
plot.xmean.ordinaly(EtymAge ~ NcountStem, data = etymology)

## End(Not run)
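
The first step of the example sets both entries of the contrasts option to "contr.treatment". The sketch below shows, on a toy ordered factor, what that setting changes. It uses base R only. The object names (age_levels, toy_age and so on) are hypothetical, and the sketch assumes a fresh session in which the default contrasts for ordered factors are polynomial.

```r
# The five EtymAge levels, in the order given in the Format section.
age_levels <- c("Dutch", "DutchGerman", "WestGermanic",
                "Germanic", "IndoEuropean")

# A toy ordered factor (made-up values, not rows of the etymology data).
toy_age <- factor(c("Germanic", "Dutch", "IndoEuropean"),
                  levels = age_levels, ordered = TRUE)

# Default session settings: ordered factors get polynomial contrasts.
default_contrasts <- contrasts(toy_age)

# Switch both unordered and ordered factors to treatment contrasts,
# as the example does, and remember the previous setting.
old_options <- options(contrasts = c("contr.treatment", "contr.treatment"))
treatment_contrasts <- contrasts(toy_age)

# Restore the previous setting so later code is unaffected.
options(old_options)

# Each treatment column compares one level with the reference level "Dutch".
colnames(treatment_contrasts)
```

With treatment contrasts, each coefficient for EtymAge can be read as a comparison with the first level rather than as a linear, quadratic or higher-order trend across the levels.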

languageR documentation built on May 2, 2019, 10:02 a.m.