etymology: Etymological age and regularity in Dutch

etymology    R Documentation

Etymological age and regularity in Dutch

Description

Estimated etymological age for regular and irregular monomorphemic Dutch verbs, together with other distributional predictors of regularity.

Usage

data(etymology)

Format

A data frame with 285 observations on the following 14 variables.

Verb

a factor with the verbs as levels.

WrittenFrequency

a numeric vector of logarithmically transformed frequencies in written Dutch (as available in the CELEX lexical database).

NcountStem

a numeric vector for the number of orthographic neighbors.

MeanBigramFrequency

a numeric vector for mean log bigram frequency.

InflectionalEntropy

a numeric vector for Shannon's entropy calculated for the word's inflectional variants.

Auxiliary

a factor with levels hebben, zijn and zijnheb for the verb's auxiliary in the perfect tenses.

Regularity

a factor with levels irregular and regular.

LengthInLetters

a numeric vector of the word's orthographic length.

Denominative

a factor with levels Den (denominative) and N (not denominative), specifying whether the verb is derived from a noun according to the CELEX lexical database.

FamilySize

a numeric vector for the number of types in the word's morphological family.

EtymAge

an ordered factor with levels Dutch, DutchGerman, WestGermanic, Germanic and IndoEuropean.

Valency

a numeric vector for the verb's valency, estimated by its number of argument structures.

NVratio

a numeric vector for the log-transformed ratio of the nominal and verbal frequencies of use.

WrittenSpokenRatio

a numeric vector for the log-transformed ratio of the frequencies in written and spoken Dutch.
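
As a rough illustration of how two of the derived predictors above can be read, the sketch below computes Shannon's entropy over a set of inflectional-variant frequencies and a log-transformed frequency ratio. All counts are invented, the variant names are hypothetical, and the use of base-2 logarithms for the entropy and natural logarithms for the ratio is an assumption: this page does not state which bases or frequency estimates were used for the dataset.

```r
# Hypothetical counts for the inflectional variants of one verb (invented).
variant_freq <- c(stem = 50, third_sg = 25, plural = 15, past_part = 10)

# Shannon's entropy: H = -sum(p * log2(p)), with p the relative frequencies.
# Base 2 is an assumption made for this sketch.
shannon_entropy <- function(freq) {
  p <- freq / sum(freq)
  p <- p[p > 0]              # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

H <- shannon_entropy(variant_freq)

# A uniform distribution over four variants reaches the maximum, log2(4) bits.
H_uniform <- shannon_entropy(c(1, 1, 1, 1))

# Log-transformed ratio of two frequencies (invented counts, natural log assumed).
noun_freq <- 120
verb_freq <- 480
nv_ratio <- log(noun_freq / verb_freq)   # negative when the verbal use dominates
```

Under these assumptions, a verb whose use is spread evenly over its variants has higher entropy than one dominated by a single form, and a ratio below zero indicates that the denominator frequency is the larger of the two.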

References

Baayen, R. H. and Moscoso del Prado Martin, F. (2005) Semantic density and past-tense formation in three Germanic languages, Language, 81, 666-698.

Tabak, W., Schreuder, R. and Baayen, R. H. (2005) Lexical statistics and lexical processing: semantic density, information complexity, sex, and irregularity in Dutch, in Kepser, S. and Reis, M. (eds.), Linguistic Evidence - Empirical, Theoretical, and Computational Perspectives, Berlin: Mouton de Gruyter, pp. 529-555.

Examples

## Not run: 
data(etymology)

# ---- EtymAge should be an ordered factor, set contrasts accordingly

etymology$EtymAge = ordered(etymology$EtymAge,
  levels = c("Dutch", "DutchGerman", "WestGermanic", "Germanic", "IndoEuropean"))
options(contrasts=c("contr.treatment","contr.treatment"))

library(rms)
etymology.dd = datadist(etymology)
options(datadist = 'etymology.dd')

# ---- EtymAge as additional predictor for regularity

etymology.lrm = lrm(Regularity ~ WrittenFrequency +
  rcs(FamilySize, 3) + NcountStem + InflectionalEntropy +
  Auxiliary + Valency + NVratio + WrittenSpokenRatio + EtymAge,
  data = etymology, x = TRUE, y = TRUE)
anova(etymology.lrm)

# ---- EtymAge as dependent variable

etymology.lrm = lrm(EtymAge ~ WrittenFrequency + NcountStem +
  MeanBigramFrequency + InflectionalEntropy + Auxiliary +
  Regularity + LengthInLetters + Denominative + FamilySize + Valency +
  NVratio + WrittenSpokenRatio, data = etymology, x = TRUE, y = TRUE)

# ---- model simplification 

etymology.lrm = lrm(EtymAge ~ NcountStem + Regularity + Denominative,
  data = etymology, x = TRUE, y = TRUE)
validate(etymology.lrm, bw = TRUE, B = 200)

# ---- plot partial effects and check the assumptions of ordinal regression

plot(Predict(etymology.lrm))
# plot(etymology.lrm)   # older Design-package idiom; with rms use plot(Predict()) as above
resid(etymology.lrm, 'score.binary', pl = TRUE)
plot.xmean.ordinaly(EtymAge ~ NcountStem, data = etymology)

## End(Not run)
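
The first step of the examples recodes EtymAge as an ordered factor and switches ordered factors to treatment contrasts. A minimal base-R sketch of what those two calls do is given below. It uses a three-element toy vector (invented) rather than the dataset itself, so it runs without languageR or rms; the object names toy, age_levels and cm are hypothetical.

```r
# Level order from youngest to oldest etymological layer, as listed under Format.
age_levels <- c("Dutch", "DutchGerman", "WestGermanic", "Germanic", "IndoEuropean")

# Toy values (invented); ordered() fixes the ranking of the levels.
toy <- ordered(c("Germanic", "Dutch", "IndoEuropean"), levels = age_levels)

is.ordered(toy)     # the factor now carries an ordering
as.integer(toy)     # position of each value in age_levels
toy < "Germanic"    # comparisons follow the level order, not the alphabet

# By default R codes ordered factors with polynomial contrasts (contr.poly).
# Setting the second element of options(contrasts = ...) to "contr.treatment"
# gives 0/1 dummy coding against the first level instead.
old <- options(contrasts = c("contr.treatment", "contr.treatment"))
cm <- contrasts(toy)   # one row per level, one column per non-reference level
options(old)           # restore the previous setting
```

With treatment coding, each coefficient for EtymAge in a model with EtymAge as predictor is a contrast with the reference level Dutch, which is usually easier to read than linear, quadratic and higher-order trend terms.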

languageR documentation built on June 10, 2025, 9:08 a.m.