knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(LDLConvFunctions)
library(DT)
library(WpmWithLdl)
library(FNN)
library(dplyr)

LDLConvFunctions offers functions to compute and extract measures from linear discriminative learning (LDL) networks created with the WpmWithLdl package, as well as functions to reorder C matrices.

This vignette provides all necessary information on the LDLConvFunctions package, including installation instructions and examples for all functions.

All functions explicitly mentioned in this vignette that are not part of the LDLConvFunctions package belong to the WpmWithLdl package unless noted otherwise. Some measures computed and extracted by functions of this package were originally introduced by Chuang et al. (2020).

This package and its functions were first used in Schmitz et al. (2021). The Mean Word Support function is inspired by Stein & Plag (2021).


Installation

The preferred way to install this package is through devtools:

devtools::install_github("dosc91/LDLConvFunctions", upgrade_dependencies = FALSE)

The ALDC function requires the stringdist package; it is installed automatically unless it is installed already. The EDNN function additionally requires the FNN package.

In case the automatic installation does not work for whatever reason, please install the packages manually:

install.packages("stringdist")
install.packages("FNN")

Overview

This is a full list of all functions and data contained in LDLConvFunctions.

Functions to compute and extract measures:

- ALC - Average Lexical Correlation
- ALDC - Average Levenshtein Distance of Candidates
- DRC - Dual Route Consistency
- EDNN - Euclidean Distance from Nearest Neighbour
- MWS - Mean Word Support
- NNC - Nearest Neighbour Correlation
- SCPP - Semantic Correlation of Predicted Production

Functions to reorder C matrices:

- find_triphones
- reorder_pseudo_C_matrix

Additionally, the package provides a small sample data set:

- LDLConvFunctions_data

This data set will be used to illustrate the use of functions throughout this vignette.


Data {#Data}

The LDLConvFunctions_data.rda file, which is part of this package, contains a data frame with 100 rows and 4 variables.

# load data
data(LDLConvFunctions_data)
datatable(data)

The complete data set contains real words as well as pseudowords. To obtain the specific data sets for real words and pseudowords, follow this procedure:

# real word data set
real.data <- data[1:88,]

# pseudoword data set
pseudo.data <- data[89:100,]

The real word data set contains 88 rows, the pseudoword data set contains 12 rows.

# create C (cue) matrices for real words, pseudowords, and the combined data
real.cues = make_cue_matrix(data = real.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

pseudo.cues = make_cue_matrix(data = pseudo.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

combined.cues = make_cue_matrix(data = data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

# create the S (semantic) matrix for real words
real.Smat = make_S_matrix(data = real.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

# pseudo.Smat can be obtained by using the result of learn_comprehension for real words

# learn comprehension for real words
real.comp = learn_comprehension(cue_obj = real.cues, S = real.Smat)

## extract Hp
Hp = real.comp$F

## dimensions of Hp
dim(Hp)
# 461 461

## dimensions of pseudoword C matrix
dim(pseudo.cues$matrices$C)
# 12 31

## create an all-zero matrix with 12 rows and 461 - 31 = 430 columns, i.e. the columns the pseudoword C matrix lacks compared to Hp
add.rows <- matrix(0, 12, 430)
dim(add.rows)
# 12 430

pseudo.C.bigger <- cbind(pseudo.cues$matrices$C, add.rows)
dim(pseudo.C.bigger)
# 12 461

## make sure the new C matrix is considered a matrix
pseudo.C.bigger <- as.matrix(pseudo.C.bigger)

## using Hp and the bigger pseudoword C matrix, create the pseudoword S matrix
pseudo.Smat = pseudo.C.bigger %*% Hp
dim(pseudo.Smat)
# 12 461

## make sure the new S matrix is considered a matrix
pseudo.Smat <- as.matrix(pseudo.Smat)


#### combined S matrix ----

combined.Smat <- rbind(real.Smat, pseudo.Smat)

Functions to reorder C matrices

find_triphones {#find_triphones}

find_triphones finds all triphones contained in two C matrices, e.g. in a (smaller) pseudoword and a (bigger) real word cue matrix. All triphones contained in both matrices, as well as their column numbers, are returned.

# pseudo.cues = C matrix for pseudowords
# real.cues = C matrix for real words

found_triphones <- find_triphones(pseudo_C_matrix = pseudo.cues, real_C_matrix = real.cues)
datatable(found_triphones)

reorder_pseudo_C_matrix {#reorder_pseudo_C_matrix}

reorder_pseudo_C_matrix uses the information on triphones found by find_triphones to create a new version of the (smaller) pseudoword matrix with the same column order as the (bigger) real word matrix.

# pseudo.cues = C matrix for pseudowords
# real.cues = C matrix for real words

reordered_C_matrix <- reorder_pseudo_C_matrix(pseudo_C_matrix = pseudo.cues, real_C_matrix = real.cues, found_triphones = found_triphones)
all(rownames(real.cues$matrices$C) == rownames(reordered_C_matrix))

Functions to compute and extract measures

ALC - Average Lexical Correlation {#ALC}

The Average Lexical Correlation variable describes the mean correlation of a pseudoword's estimated semantic vector (as contained in $\hat{S}$) with each of the words' semantic vectors (as contained in $S$).

Higher ALC values indicate that a pseudoword vector is part of a denser semantic neighbourhood.

To use this function, you must create a real word S matrix as well as a pseudoword S matrix beforehand. Additionally, the data set to create the pseudoword S matrix is needed.

# pseudo.Smat = S matrix for pseudowords
# real.Smat = S matrix for real words

ALCframe <- ALC(pseudo_S_matrix = pseudo.Smat, real_S_matrix = real.Smat, pseudo_word_data = pseudo.data)
datatable(ALCframe)
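For intuition, the idea behind ALC can be sketched by hand (a minimal sketch, not the package implementation, assuming the S matrices created above):

# for each pseudoword, take the mean correlation of its estimated semantic
# vector with all real word semantic vectors
alc_sketch <- apply(pseudo.Smat, 1, function(shat) {
  mean(apply(real.Smat, 1, function(s) cor(shat, s)))
})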

ALDC - Average Levenshtein Distance of Candidates {#ALDC}

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. The Levenshtein distance is measured using the stringdist function of the stringdist package.
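For example, calling the stringdist package directly:

# Levenshtein distance between "kitten" and "sitting": 3 single-character edits
stringdist::stringdist("kitten", "sitting", method = "lv")
# 3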

The Average Levenshtein Distance of Candidates describes the mean Levenshtein Distance of all candidate productions of a word.

Higher ALDC values indicate that the candidate forms are very different from the intended pronunciation, pointing to a sparse form neighbourhood.

To use this function, you must run learn_production and evaluate its accuracy with accuracy_production beforehand. Additionally, the data set used to compute the production accuracy is needed.

# getting the mapping for comprehension
combined.comp = learn_comprehension(cue_obj = combined.cues, S = combined.Smat)


# evaluating comprehension accuracy
combined.comp_acc = accuracy_comprehension(m = combined.comp, data = data, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)

combined.comp_acc$acc #0.98

combined.comp_measures <- comprehension_measures(comp_acc = combined.comp_acc, Shat = combined.comp$Shat, S = combined.Smat)

# getting the mapping for production
combined.prod = learn_production(cue_obj = combined.cues, S = combined.Smat, comp = combined.comp)

# evaluating production accuracy
combined.prod_acc = accuracy_production(m = combined.prod, data = data, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)

# getting prod.measures
combined.prod_measures = production_measures(prod_acc = combined.prod_acc, S = combined.Smat, prod = combined.prod)


combined.prod_acc$acc #0.99
# combined.prod_acc = production accuracy as computed with 'accuracy_production'

ALDCframe <- ALDC(prod_acc = combined.prod_acc, data = data)
datatable(ALDCframe)

DRC - Dual Route Consistency {#DRC}

The Dual Route Consistency describes the correlation between the semantic vector estimated from the direct route and that from the indirect route.

Higher DRC values indicate that the semantic vectors produced by the two routes are more similar to each other.

To use this function, you must run learn_comprehension via the direct and the indirect route beforehand. A shortcut to obtain the $\hat{S}$ matrix of the indirect route is given below. Additionally, the underlying data set is needed.

## get cues for INDIRECT route
combined.cues.indirect = make_cue_matrix(data = data, formula = ~ Word + Affix, grams = 3, cueForm = "DISC", indirectRoute = TRUE)


# make sure row names are in the correct order
all(rownames(combined.cues.indirect$matrices$C) == rownames(combined.Smat))

combined.Smat2 <- combined.Smat[rownames(combined.cues.indirect$matrices$C), ]

all(rownames(combined.cues.indirect$matrices$C) == rownames(combined.Smat2))


## get mapping for comprehension with INDIRECT route cues
combined.comp.indirect = learn_comprehension(cue_obj = combined.cues.indirect, S = combined.Smat2)


## solve the following equations to obtain the indirect route Shat matrix
Tmat = combined.comp.indirect$matrices$T
X = MASS::ginv(t(combined.comp.indirect$matrices$C) %*% combined.comp.indirect$matrices$C) %*% t(combined.comp.indirect$matrices$C)
K = X %*% combined.comp.indirect$matrices$T
H = MASS::ginv(t(Tmat) %*% Tmat) %*% t(Tmat) %*% combined.Smat #IMPORTANT: this is the direct route Smat!!!
That = combined.comp.indirect$matrices$C %*% K
Shat = That %*% H
# comprehension = comprehension as obtained by 'learn_comprehension'
# Shat_matrix_indirect = Shat matrix of the indirect route

DRCframe <- DRC(comprehension = combined.comp, Shat_matrix_indirect = Shat, data = data)
datatable(DRCframe)

Shortcut to obtain the indirect route Shat matrix:

# combined.comp.indirect = comprehension as obtained by 'learn_comprehension' with the indirect route cues

Tmat = combined.comp.indirect$matrices$T

X = MASS::ginv(t(combined.comp.indirect$matrices$C) %*% combined.comp.indirect$matrices$C) %*% t(combined.comp.indirect$matrices$C)

K = X %*% combined.comp.indirect$matrices$T

H = MASS::ginv(t(Tmat) %*% Tmat) %*% t(Tmat) %*% combined.Smat #IMPORTANT: this is the direct route Smat!!!

That = combined.comp.indirect$matrices$C %*% K

Shat = That %*% H
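Conceptually, DRC is the per-word correlation of the two $\hat{S}$ estimates. A minimal sketch of this idea (assuming combined.comp$Shat holds the direct route estimates and Shat the indirect route matrix from above; an illustration, not the package implementation):

# row-wise correlation of direct and indirect route semantic estimates
drc_sketch <- sapply(seq_len(nrow(Shat)), function(i) cor(combined.comp$Shat[i, ], Shat[i, ]))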

EDNN - Euclidean Distance from Nearest Neighbour {#EDNN}

The Euclidean Distance from Nearest Neighbour describes the Euclidean distance from the semantic vector $\hat{s}$ produced by the direct route to its closest semantic word neighbour. The Euclidean distance is measured using the get.knnx function of the FNN package.

Higher EDNN values indicate that a word is more distant from the nearest word neighbour.

To use this function, you must run learn_comprehension beforehand. Additionally, the data set used for learn_comprehension is needed.

# comprehension = comprehension as obtained by 'learn_comprehension'

EDNNframe <- EDNN(comprehension = combined.comp, data = data)
datatable(EDNNframe)
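The underlying computation can be sketched with the FNN package directly (a minimal sketch, assuming combined.comp$Shat holds the estimated semantic vectors and combined.Smat the target semantic vectors):

# for each estimated vector, find the closest target vector and its distance
nn <- FNN::get.knnx(data = combined.Smat, query = combined.comp$Shat, k = 1)
ednn_sketch <- nn$nn.dist[, 1]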

MWS - Mean Word Support {#MWS}

The Mean Word Support describes the mean support for triphones in the winning production form of a word or pseudoword. It takes the path_sum as computed by the WpmWithLdl package and divides it by the number of triphones, i.e. $MWS = \frac{ \text{path sum}}{\text{number of triphones}}$.
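As a toy illustration of this division (hypothetical values, not package output):

# hypothetical path sum and triphone count for one winning candidate form
path_sum <- 3.2
n_triphones <- 4
path_sum / n_triphones
# 0.8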

Higher MWS values indicate a higher support of a predicted word form.

To use this function, you must run learn_production, evaluate it with accuracy_production, and compute production_measures beforehand.

# combined.prod_acc = production accuracy as computed with 'accuracy_production'
# prod_measures = production measures as obtained by 'production_measures'

MWSframe <- MWS(prod_acc = combined.prod_acc, prod_measures = combined.prod_measures)
datatable(MWSframe)

NNC - Nearest Neighbour Correlation {#NNC}

The Nearest Neighbour Correlation describes the maximum of the correlations between a pseudoword's estimated semantic vector and words' semantic vectors.

Higher NNC values indicate that a pseudoword's meaning is similar to that of a real word.

To use this function, you must create a real word S matrix as well as a pseudoword S matrix beforehand. Additionally, the data sets to compute both S matrices are needed.

# pseudo.Smat = S matrix for pseudowords
# real.Smat = S matrix for real words

NNCframe <- NNC(pseudo_S_matrix = pseudo.Smat, real_S_matrix = real.Smat, pseudo_word_data = pseudo.data, real_word_data = real.data)
datatable(NNCframe)
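The computation mirrors that of ALC, with the maximum taking the place of the mean (a minimal sketch, not the package implementation):

# for each pseudoword, take the maximum correlation of its estimated semantic
# vector with any real word semantic vector
nnc_sketch <- apply(pseudo.Smat, 1, function(shat) {
  max(apply(real.Smat, 1, function(s) cor(shat, s)))
})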

SCPP - Semantic Correlation of Predicted Production {#SCPP}

The Semantic Correlation of Predicted Production describes the maximum correlation between a word's predicted semantic vector and any of the semantic vectors of its candidate forms.

Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning.

To use this function, you must run learn_production and extract the production measures beforehand. Additionally, the data set used for learn_production is needed.

# prod_measures = production measures as obtained by 'production_measures'

SCPPframe <- SCPP(prod_measures = combined.prod_measures, data = data)
datatable(SCPPframe)
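As a toy illustration of the idea (hypothetical vectors; the actual candidate semantic vectors come from the production measures):

# hypothetical predicted semantic vector and candidate semantic vectors
set.seed(1)
shat <- rnorm(10)
cand_S <- matrix(rnorm(30), nrow = 3)
max(apply(cand_S, 1, function(s) cor(shat, s)))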

References

WpmWithLdl package:
Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270. doi.org/10.1075/ml.18010.baa

Linear Discriminative Learning:
Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891

Study introducing ALC, ALDC, DRC, EDNN, NNC, and SCPP:
Chuang, Y.-Y., Vollmer, M.-L., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w

First study using the LDLConvFunctions package:
Schmitz, D., Plag, I., Baer-Henney, D., & Stein, S. D. (2021). Durational differences of word-final /s/ emerge from the lexicon: Modelling morpho-phonetic effects in pseudowords with linear discriminative learning. Frontiers in Psychology. doi.org/10.3389/fpsyg.2021.680889

Study introducing the Mean Word Support measure:
Stein, S. D., & Plag, I. (2021). Morpho-phonetic effects in speech production: Modeling the acoustic duration of English derived words with linear discriminative learning. Frontiers in Psychology. doi.org/10.3389/fpsyg.2021.678712


Please message the author at contact@dominicschmitz.com in case of any questions, errors or ideas.


