```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
library(LDLConvFunctions)
library(DT)
library(WpmWithLdl)
library(FNN)
library(dplyr)
```
LDLConvFunctions offers functions to compute and extract measures from WpmWithLdl objects. This vignette provides all necessary information on the LDLConvFunctions package, including installation instructions and examples for all functions.
All functions explicitly mentioned in this vignette that are not part of the LDLConvFunctions package belong to the WpmWithLdl package unless noted otherwise. Some of the measures computed and extracted by functions of this package were originally introduced by Chuang et al. (2020). This package and its functions were first used in Schmitz et al. (2021). The Mean Word Support function is inspired by Stein & Plag (2021).
The preferred way to install this package is through devtools:
```r
devtools::install_github("dosc91/LDLConvFunctions", upgrade_dependencies = FALSE)
```
The ALDC function requires the stringdist package, and the EDNN function requires the FNN package. The stringdist package is installed automatically unless it is already installed. In case the automatic installation does not work for whatever reason, please install the packages manually:
```r
install.packages("stringdist")
install.packages("FNN")
```
This is a full list of all functions and data contained in LDLConvFunctions.

Functions to compute and extract measures:

- ALC
- ALDC
- DRC
- EDNN
- MWS
- NNC
- SCPP

Functions to reorder C matrices:

- find_triphones
- reorder_pseudo_C_matrix

Additionally, the package provides a small sample data set:

- LDLConvFunctions_data

This data set will be used to illustrate the use of functions throughout this vignette.
The LDLConvFunctions_data.rda file, which is part of this package, contains a data frame with 100 rows and 4 variables.
```r
# load data
data(LDLConvFunctions_data)

datatable(data)
```
The complete data set contains real words as well as pseudowords. To obtain the specific data sets for real words and pseudowords, follow this procedure:
```r
# real word data set
real.data <- data[1:88, ]

# pseudoword data set
pseudo.data <- data[89:100, ]
```
The real word data set contains 88 rows; the pseudoword data set contains 12 rows.
```r
real.cues = make_cue_matrix(data = real.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

pseudo.cues = make_cue_matrix(data = pseudo.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

combined.cues = make_cue_matrix(data = data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

real.Smat = make_S_matrix(data = real.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

# pseudo.Smat can be obtained by using the result of learn_comprehension for real words

# learn comprehension for real words
real.comp = learn_comprehension(cue_obj = real.cues, S = real.Smat)

## extract Hp
Hp = real.comp$F

## dimensions of Hp
dim(Hp) # 461 461

## dimensions of pseudoword C matrix
dim(pseudo.cues$matrices$C) # 12 31

## create an empty matrix, i.e. all cells contain zeros, with rows = 12 and columns = 461 - 31 = 430
add.rows <- matrix(0, 12, 430)
dim(add.rows) # 12 430

pseudo.C.bigger <- cbind(pseudo.cues$matrices$C, add.rows)
dim(pseudo.C.bigger) # 12 461

## make sure the new C matrix is considered a matrix
pseudo.C.bigger <- as.matrix(pseudo.C.bigger)

## using Hp and the bigger pseudoword C matrix, create the pseudoword S matrix
pseudo.Smat = pseudo.C.bigger %*% Hp
dim(pseudo.Smat) # 12 461

## make sure the new S matrix is considered a matrix
pseudo.Smat <- as.matrix(pseudo.Smat)

#### combined S matrix ----
combined.Smat <- rbind(real.Smat, pseudo.Smat)
```
find_triphones finds all triphones contained in two C matrices, e.g. in a (smaller) pseudoword cue matrix and a (bigger) real word cue matrix. All triphones contained in both matrices, as well as their column numbers, are returned.
```r
# pseudo.cues = C matrix for pseudowords
# real.cues = C matrix for real words

found_triphones <- find_triphones(pseudo_C_matrix = pseudo.cues, real_C_matrix = real.cues)
```
```r
datatable(found_triphones)
```
reorder_pseudo_C_matrix uses the information on triphones found by find_triphones to create a new version of the (smaller) pseudoword matrix with the same column order as the (bigger) real word matrix.
```r
# pseudo.cues = C matrix for pseudowords
# real.cues = C matrix for real words

reordered_C_matrix <- reorder_pseudo_C_matrix(pseudo_C_matrix = pseudo.cues, real_C_matrix = real.cues, found_triphones = found_triphones)

all(rownames(real.cues$matrices$C) == rownames(reordered_C_matrix))
```
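The effect of such a reordering can be illustrated with a minimal base R sketch (a toy example, not the package implementation): the columns of a small matrix are aligned with the column order of a larger matrix, and columns missing from the small matrix are padded with zeros.

```r
# Toy illustration: align the columns of a small matrix with the column
# order of a larger matrix, padding missing columns with zeros
big   <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("abc", "bcd", "cde")))
small <- matrix(7:10, nrow = 2, dimnames = list(NULL, c("cde", "abc")))

# empty matrix with the big matrix's column layout
aligned <- matrix(0, nrow = nrow(small), ncol = ncol(big),
                  dimnames = list(NULL, colnames(big)))

# copy the shared columns into their new positions
shared <- intersect(colnames(big), colnames(small))
aligned[, shared] <- small[, shared]

aligned  # columns now ordered abc, bcd, cde; "bcd" is all zeros
```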
The Average Lexical Correlation (ALC) describes the mean correlation of a pseudoword's estimated semantic vector (as contained in $\hat{S}$) with each of the real words' semantic vectors (as contained in $S$).
Higher ALC values indicate that a pseudoword vector is part of a denser semantic neighbourhood.
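The computation behind ALC can be sketched in a few lines of base R. This is a toy illustration with random vectors, not the package implementation; the actual function operates on the S matrices created earlier in this vignette.

```r
# Toy sketch of the ALC idea: the mean correlation of one pseudoword's
# estimated semantic vector with every real word's semantic vector
set.seed(1)
S_real <- matrix(rnorm(20), nrow = 4)  # 4 toy real words, 5 semantic dimensions
s_hat  <- rnorm(5)                     # toy estimated pseudoword vector

alc <- mean(apply(S_real, 1, function(s) cor(s, s_hat)))
alc  # a single value between -1 and 1
```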
To use this function, you must create a real word S matrix as well as a pseudoword S matrix beforehand. Additionally, the data set to create the pseudoword S matrix is needed.
```r
# pseudo.Smat = S matrix for pseudowords
# real.Smat = S matrix for real words

ALCframe <- ALC(pseudo_S_matrix = pseudo.Smat, real_S_matrix = real.Smat, pseudo_word_data = pseudo.data)
```
```r
datatable(ALCframe)
```
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. The Levenshtein distance is measured using the stringdist function of the stringdist package.
The Average Levenshtein Distance of Candidates (ALDC) describes the mean Levenshtein distance of all candidate productions of a word.

Higher ALDC values indicate that the candidate forms are very different from the intended pronunciation, i.e. that the word has a sparse form neighbourhood.
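The underlying edit-distance computation is easy to try directly; base R's adist() returns the same (unweighted) Levenshtein distance as stringdist. The candidate forms below are made up for illustration and are not package output.

```r
# Levenshtein distance: minimum number of single-character edits
adist("kitten", "sitting")  # 3 (two substitutions, one insertion)

# ALDC idea: mean Levenshtein distance between an intended form and
# its (hypothetical) candidate productions
intended   <- "walks"
candidates <- c("walks", "walk", "wakss")
mean(adist(intended, candidates))  # (0 + 1 + 2) / 3 = 1
```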
To use this function, you must run learn_production and evaluate its accuracy with accuracy_production beforehand. Additionally, the data set used to compute the production accuracy is needed.
```r
# getting the mapping for comprehension
combined.comp = learn_comprehension(cue_obj = combined.cues, S = combined.Smat)

# evaluating comprehension accuracy
combined.comp_acc = accuracy_comprehension(m = combined.comp, data = data, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)
combined.comp_acc$acc # 0.98

combined.comp_measures <- comprehension_measures(comp_acc = combined.comp_acc, Shat = combined.comp$Shat, S = combined.Smat)

# getting the mapping for production
combined.prod = learn_production(cue_obj = combined.cues, S = combined.Smat, comp = combined.comp)

# evaluating production accuracy
combined.prod_acc = accuracy_production(m = combined.prod, data = data, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)
combined.prod_acc$acc # 0.99

# getting production measures
combined.prod_measures = production_measures(prod_acc = combined.prod_acc, S = combined.Smat, prod = combined.prod)
```
```r
# combined.prod_acc = production accuracy as computed with 'accuracy_production'

ALDCframe <- ALDC(prod_acc = combined.prod_acc, data = data)
```
```r
datatable(ALDCframe)
```
The Dual Route Consistency (DRC) describes the correlation between the semantic vector estimated via the direct route and the one estimated via the indirect route.

Higher DRC values indicate that the semantic vectors produced by the two routes are more similar to each other.
To use this function, you must run learn_comprehension via both the direct and the indirect route beforehand. A shortcut to obtain the $\hat{S}$ matrix of the indirect route is given below. Additionally, the data set used to compute the production accuracy is needed.
```r
## get cues for INDIRECT route
combined.cues.indirect = make_cue_matrix(data = data, formula = ~ Word + Affix, grams = 3, cueForm = "DISC", indirectRoute = TRUE)

# make sure row names are in the correct order
all(rownames(combined.cues.indirect$matrices$C) == rownames(combined.Smat))

combined.Smat2 <- combined.Smat[rownames(combined.cues.indirect$matrices$C), ]
all(rownames(combined.cues.indirect$matrices$C) == rownames(combined.Smat2))

## get mapping for comprehension with INDIRECT route cues
combined.comp.indirect = learn_comprehension(cue_obj = combined.cues.indirect, S = combined.Smat2)

## solve the following equations to obtain the indirect route Shat matrix
Tmat = combined.comp.indirect$matrices$T

X = MASS::ginv(t(combined.comp.indirect$matrices$C) %*% combined.comp.indirect$matrices$C) %*% t(combined.comp.indirect$matrices$C)

K = X %*% combined.comp.indirect$matrices$T

H = MASS::ginv(t(Tmat) %*% Tmat) %*% t(Tmat) %*% combined.Smat # IMPORTANT: this is the direct route Smat!

That = combined.comp.indirect$matrices$C %*% K

Shat = That %*% H
```
```r
# comprehension = comprehension as obtained by 'learn_comprehension'
# Shat_matrix_indirect = Shat matrix of the indirect route

DRCframe <- DRC(comprehension = combined.comp, Shat_matrix_indirect = Shat, data = data)
```
```r
datatable(DRCframe)
```
Shortcut to obtain the indirect route Shat matrix:
```r
# combined.comp.indirect = cue matrix as obtained by 'make_cue_matrix'

Tmat = combined.comp.indirect$matrices$T
X = MASS::ginv(t(combined.comp.indirect$matrices$C) %*% combined.comp.indirect$matrices$C) %*% t(combined.comp.indirect$matrices$C)
K = X %*% combined.comp.indirect$matrices$T
H = MASS::ginv(t(Tmat) %*% Tmat) %*% t(Tmat) %*% combined.Smat # IMPORTANT: this is the direct route Smat!
That = combined.comp.indirect$matrices$C %*% K
Shat = That %*% H
```
The Euclidean Distance from Nearest Neighbour (EDNN) describes the Euclidean distance from the semantic vector $\hat{s}$ produced by the direct route to its closest semantic word neighbour. The Euclidean distance is measured using the get.knnx function of the FNN package.
Higher EDNN values indicate that a word is more distant from the nearest word neighbour.
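The idea can be sketched in base R on toy vectors. This is an illustration, not the package implementation; EDNN itself uses FNN::get.knnx for the nearest-neighbour search.

```r
# Toy sketch of the EDNN idea: Euclidean distance from an estimated
# semantic vector to its closest real word vector
S_real <- rbind(c(0, 0), c(3, 4), c(6, 8))  # toy semantic vectors, one row per word
s_hat  <- c(0, 1)                           # toy estimated semantic vector

# Euclidean distance from s_hat to each row of S_real
dists <- sqrt(rowSums((S_real - matrix(s_hat, nrow(S_real), ncol(S_real), byrow = TRUE))^2))
ednn  <- min(dists)
ednn  # 1: the nearest neighbour is (0, 0)
```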
To use this function, you must run learn_comprehension beforehand. Additionally, the data set used in learn_comprehension is needed.
```r
# comprehension = comprehension as obtained by 'learn_comprehension'

EDNNframe <- EDNN(comprehension = combined.comp, data = data)
```
```r
datatable(EDNNframe)
```
The Mean Word Support (MWS) describes the mean support for triphones in the winning production form of a word or pseudoword. It takes the path_sum as computed by the WpmWithLdl package and divides it by the number of triphones, i.e. $MWS = \frac{\text{path sum}}{\text{number of triphones}}$.
Higher MWS values indicate a higher support of a predicted word form.
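As a worked example of the formula (with made-up numbers, not package output):

```r
# MWS = path_sum / number of triphones
path_sum    <- 2.4  # summed triphone support of the predicted form (toy value)
n_triphones <- 4    # number of triphones in the predicted form (toy value)

MWS_value <- path_sum / n_triphones
MWS_value  # 0.6
```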
To use this function, you must run learn_production, evaluate it with accuracy_production, and compute production_measures beforehand.
```r
# combined.prod_acc = production accuracy as computed with 'accuracy_production'
# prod_measures = production measures as obtained by 'production_measures'

MWSframe <- MWS(prod_acc = combined.prod_acc, prod_measures = combined.prod_measures)
```
```r
datatable(MWSframe)
```
The Nearest Neighbour Correlation (NNC) describes the maximum of the correlations between a pseudoword's estimated semantic vector and the real words' semantic vectors.
Higher NNC values indicate that a pseudoword's meaning is similar to that of a real word.
To use this function, you must create a real word S matrix as well as a pseudoword S matrix beforehand. Additionally, the data sets to compute both S matrices are needed.
```r
# pseudo.Smat = S matrix for pseudowords
# real.Smat = S matrix for real words

NNCframe <- NNC(pseudo_S_matrix = pseudo.Smat, real_S_matrix = real.Smat, pseudo_word_data = pseudo.data, real_word_data = real.data)
```
```r
datatable(NNCframe)
```
The Semantic Correlation of Predicted Production (SCPP) describes the maximum correlation between a word's predicted semantic vector and any of the semantic vectors of its candidate forms.
Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning.
To use this function, you must run learn_production and extract the production measures beforehand. Additionally, the data set used in learn_production is needed.
```r
# prod_measures = production measures as obtained by 'production_measures'

SCPPframe <- SCPP(prod_measures = combined.prod_measures, data = data)
```
```r
datatable(SCPPframe)
```
WpmWithLdl package: Baayen, R. H., Chuang, Y. Y., & Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13(2), 232-270. doi.org/10.1075/ml.18010.baa

Linear Discriminative Learning: Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891

Study introducing ALC, ALDC, DRC, EDNN, NNC, and SCPP: Chuang, Y.-Y., Vollmer, M.-L., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w

First study using the LDLConvFunctions package: Schmitz, D., Plag, I., Baer-Henney, D., & Stein, S. D. (2021). Durational differences of word-final /s/ emerge from the lexicon: Modelling morpho-phonetic effects in pseudowords with linear discriminative learning. Frontiers in Psychology. doi.org/10.3389/fpsyg.2021.680889

Study introducing the Mean Word Support measure: Stein, S. D., & Plag, I. (2021). Morpho-phonetic effects in speech production: Modeling the acoustic duration of English derived words with linear discriminative learning. Frontiers in Psychology. doi.org/10.3389/fpsyg.2021.678712
Please message the author at contact@dominicschmitz.com in case of any questions, errors or ideas.