In jaytimm/lexvarsdatr:

output: md_document: variant: markdown_github

lexvarsdatr

An R package: some tools for investigating lexical variation from both behavioral and distributional perspectives. Including:

A collection of psycholinguistic/behavioral data sets, &
A few functions for extracting semantic associations and network structures from term-feature matrices.

Installation

library(devtools)
devtools::install_github("jaytimm/lexvarsdatr")
library(lexvarsdatr)

Usage

Behavioral data

Behavioral data included in the package: Response times in lexical decision & naming, concreteness ratings, age-of-acquisition (AoA) ratings, and word association norms. Sources are presented below:

Data <- c("Lexical decision and naming","Concreteness ratings","AoA ratings", "Word association")
Source <- c( "Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. *Behavior research methods*, 39(3), 445-459.","Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. *Behavior research methods*, 46(3), 904-911.","Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. *Behavior Research Methods*, 44(4), 978-990.","Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. *Behavior Research Methods, Instruments, & Computers*, 36(3), 402-407.")

knitr::kable(data.frame(Data=Data, Source=Source))

Response times in lexical decision/naming, concreteness ratings, and AoA ratings have been collated into a single data frame, lex_behav_data. Approximately 18K word forms are included in all three data sets.

library(tidyverse)
lexvarsdatr::lvdr_behav_data %>% na.omit %>% head()

The South Florida word association data can be accessed via lvdr_association. A description of variables included in the normed data set, as well as methodologies, can be found here. Word association data is also available as as sparse matrix, lvdr_association_sparse.

Functions

To demonstrate the utility of the functions included in the package, we first create a simple count-based term-feature co-occurrence matrix using US Presidential State of the Union (SOTU) addresses -- made available in TIF format via the sotu package. A fairly small corpus at ~2 million words.

Here, we work within the text2vec framework. Window size of co-occurrence, 5x5. For simplicity, we tokenize at the word-level.

library(sotu)
t2v_ents <- text2vec::itoken(sotu::sotu_text, 
                            preprocessor = toupper, 
                            tokenizer = text2vec::word_tokenizer, 
                            ids = 1:236)

vocab <- text2vec::create_vocabulary(t2v_ents, 
                                    stopwords = toupper(tm::stopwords())) 

pruned_vocab <- text2vec::prune_vocabulary(
  vocab, term_count_min = 10, doc_proportion_max = 0.95) %>%
  filter(!grepl('[0-9]', term))

tcm <- text2vec::create_tcm(t2v_ents, 
                           vectorizer = text2vec::vocab_vectorizer(pruned_vocab), 
                           skip_grams_window = 5L,
                           skip_grams_window_context = "symmetric",
                           weight = c(1,1,1,1,1)) #No weight

§ Build PPMI Matrix

The lvdr_calc_ppmi function transforms a count-based co-occurrence matrix to a positive-pointwise mutual information matrix, modified from this SO post.

tcm_ppmi <- tcm %>% 
  lexvarsdatr::lvdr_calc_ppmi(make_symmetric = TRUE)

§ Get collocates, neighbors, etc.

The lvdr_get_closest function can be used to extract the n highest scoring features associated with a term (or set of terms) from a term-feature matrix. Assumes a column-oriented matrix (dgCMatrix) as input. data.table dependency. Modified from the udpipe::as.cooccurrence() function.

Per the SOTU PPMI co-occurrence matrix created above, we extract the ten strongest collocates of the term VIOLENCE. Output is a simple data frame.

lexvarsdatr::lvdr_get_closest(tfm = tcm_ppmi, 
                              #lexvarsdatr::lvdr_association_sparse, 
                              target = 'VIOLENCE', 
                              n = 10) %>%
  knitr::kable(row.names = FALSE)

The function can also be used to extract nearest neighbors from a cosine similarity matrix. To demonstrate, we (1) consolidate feature set to 150 latent dimensions via singular-value decomposition, and then (2) construct cosine-based, term-term similarity matrix.

tcm_svd <-  irlba::irlba (tcm_ppmi, nv = 150)

tcm_svd1 <- as.matrix(data.matrix(tcm_svd$u))
dimnames(tcm_svd1) <- list(rownames(tcm_ppmi), 
                           c(1:length(tcm_svd$d)))

# Create cosine similarity matrix
cos_sim <- text2vec::sim2(x = tcm_svd1, 
                          method = 'cosine', 
                          norm = 'l2')

Per matrix, we extract the five nearest neighbors (ie, ~synonyms) for the terms TARIFF and SCIENCE.

#library(data.table)
lexvarsdatr::lvdr_get_closest(tfm = cos_sim, 
                              target = c('TARIFF','SCIENCE'), 
                              n = 5) %>%
  knitr::kable(row.names = FALSE)

§ Build network structure

The lvdr_extract_network function extracts the network structure for a term (or set of terms) from a term-feature matrix (again, as dgCMatrix). The function is built on lvdr_get_closest(). Output is a list that includes a node data frame and an edges data frame, structured to play nice with the tidygraph and ggraph plotting paradigms.

The number of nodes (per term) to include in the network is specified by the n parameter, ie, the n highest scoring features associated with a term from a term-feature matrix. Term-nodes and feature-nodes are distinguished in the output for visualization purposes. If multiple terms are specified, nodes are filtered to the strongest (ie, primary) term-feature relationships (to remove potential duplicates).

Edges include the n-highest scoring term-feature associations for specified terms, as well as the n most frequent node-node associations per node (term & feature).

network <- lexvarsdatr::lvdr_extract_network (tfm = tcm_ppmi, 
                                              target = toupper(c('enemy', 'ally', 
                                                                 'friend', 'partner')),
                                              n = 15)

Quick note: Algorithms like GloVe, SVD & word2vec abstract over the term-feature associations that underlie (distributionally-derived) semantic relationships. Visualizing the network structure of semantically related terms based in actual co-occurrence can help shed light on the sources of relatedness in ways that, eg, latent dimensions cannot.

The plot below illustrates the network structure (based on the PPMI term-feature matrix for the SOTU corpus) for a set of semantically related terms: ENEMY, ALLY, FRIEND, and PARTNER. Terms are identified as triangles; features as circles. Color is used to specify primary term-feature relationships. Circle size specifies the (relative) strength of association between primary term and feature.

set.seed(66)
network %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +

  ggraph::geom_edge_link(color = 'darkgray') + 
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +

  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +

  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  ggthemes::scale_color_stata()+
  ggtitle('sotu co-occurrence network') +
  theme(legend.position = "none")

Another take using the word association data set, lvdr_association_sparse:

network2 <- lexvarsdatr::lvdr_extract_network(
  tfm = lexvarsdatr::lvdr_association_sparse,
  target = toupper(c('enemy', 'ally', 
                     'friend', 'partner')), 
  n = 15)

set.seed(11)
network2 %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +

  ggraph::geom_edge_link(color = 'darkgray') + #alpha = 0.8
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +

  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +

  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  ggthemes::scale_color_stata()+
  ggtitle('word association norms network') +
  theme(legend.position = "none")