View source: R/match_item.R

match_item: R Documentation

Get suitable matches for a single item on one or several dimensions.

Description

Suggests items that are suitable matches for a target item, based on selected variables of a data frame. Note that, unlike functions in the generate pipeline (e.g. control_for()), match_item() lets you define tolerances for multiple variables in a single call.

Usage

match_item(
  df = LexOPS::lexops,
  target,
  ...,
  id_col = "string",
  filter = TRUE,
  standard_eval = FALSE
)

Arguments

df

A data frame to reorder, containing the target string (default = LexOPS::lexops).

target

The target string.

...

Should specify the variables and tolerances in the form Length = 0:0, Zipf.SUBTLEX_UK = -0.1:0.1, PoS.SUBTLEX_UK. Numeric variables can include tolerances (as elements 2:3 of a vector). Numeric variables with no tolerances will be matched exactly.

id_col

A character vector specifying the column identifying unique observations (e.g. in LexOPS::lexops, the id_col is "string").

filter

Logical. If TRUE, matches outside the tolerances specified in ... are removed. If FALSE, a new column, matchFilter, is calculated, indicating whether or not the string is within all variables' tolerances. (Default = TRUE.)

standard_eval

Logical; bypasses non-standard evaluation, so that the variables can be passed as an ordinary R list. If TRUE, ... should be a single list specifying the variables to match by and their tolerances, in the form list("numericVariable1Name", c("numericVariable2Name", -1.5, 3), "characterVariableName"). Default = FALSE.
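
The tolerance syntax described for ... can be illustrated with a small base R sketch. This is not the package's implementation: the data frame toy, its values, and the helper within_tol() are all invented for illustration, and the sketch assumes that a tolerance is applied relative to the target's own value (as the "within 0.2 Zipf either way" wording in the examples suggests). It covers numeric variables only and does not drop the target row.

```r
# Toy data: column names loosely mimic LexOPS::lexops, values are invented
toy <- data.frame(
  string = c("thicket", "blanket", "cricket", "thimble", "ticket"),
  Length = c(7, 7, 7, 7, 6),
  Zipf   = c(2.9, 4.1, 3.0, 2.8, 4.6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where x lies in [target_value + lo, target_value + hi]
within_tol <- function(x, target_value, lo = 0, hi = 0) {
  x >= target_value + lo & x <= target_value + hi
}

# Analogue of writing Length = 0:0, Zipf = -0.2:0.2 in match_item()
tolerances <- list(Length = c(0, 0), Zipf = c(-0.2, 0.2))

# The row holding the target's own values
target_row <- toy[toy$string == "thicket", ]

# A row is kept only if it is inside the tolerance of every variable
keep <- rep(TRUE, nrow(toy))
for (name in names(tolerances)) {
  tol <- tolerances[[name]]
  keep <- keep & within_tol(toy[[name]], target_row[[name]], tol[1], tol[2])
}

matches <- toy$string[keep]
print(matches)
```

Under these assumptions a tolerance of 0:0 amounts to an exact match, which is consistent with the statement that numeric variables with no tolerances are matched exactly.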

Value

Returns a data frame based on df. If filter == TRUE, it will only contain matches. If filter == FALSE, it will be the original df object, with a new column, "matchFilter".
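
The difference between the two return shapes can be sketched in base R. This is only an analogue, using an invented data frame toy and an invented Syllables column. It does not reproduce anything else match_item() may do to its result (such as reordering rows or handling the target row), since those details are not stated here.

```r
# Toy data: values are invented for illustration
toy <- data.frame(
  string    = c("thicket", "elephant", "cricket", "banana"),
  Syllables = c(2, 3, 2, 3),
  stringsAsFactors = FALSE
)

# Exact match on one numeric variable, relative to the target "thicket"
target_value <- toy$Syllables[toy$string == "thicket"]
in_tolerance <- toy$Syllables == target_value

# Analogue of filter = FALSE: every row is kept and flagged in a logical column
flagged <- toy
flagged$matchFilter <- in_tolerance

# Analogue of filter = TRUE: only rows inside the tolerance are kept
filtered <- toy[in_tolerance, ]

print(flagged)
print(filtered)
```

Keeping all rows with the flag column can be handy when you want to inspect why candidate items were excluded before committing to a filtered set.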

See Also

lexops for the default data frame and associated variables.

Examples


# Match by number of syllables exactly
lexops |>
  match_item("thicket", Syllables.CMU)

# Match by number of syllables exactly, but keep all entries in the original dataframe
lexops |>
  match_item("thicket", Syllables.CMU, filter = FALSE)

# Match by number of syllables exactly, and rhyme
lexops |>
  match_item("thicket", Syllables.CMU, Rhyme.CMU)

# Match by length exactly, and closely by frequency (within 0.2 Zipf either way)
lexops |>
  match_item("thicket", Length, Zipf.SUBTLEX_UK = -0.2:0.2)

# The syntax makes matching by multiple variables easy and readable
lexops |>
  match_item(
    "elephant",
    BG.SUBTLEX_UK = -0.005:0.005,
    Length = 0:0,
    Zipf.SUBTLEX_UK = -0.1:0.1,
    PoS.SUBTLEX_UK,
    RT.ELP = -10:10
  )

# Match using standard evaluation
lexops |>
  match_item("thicket", list("Length", c("Zipf.SUBTLEX_UK", -0.2, 0.2)), standard_eval = TRUE)

# Find matches within an orthographic Levenshtein distance of 5 from "thicket":
library(dplyr)
library(stringdist)
targ_word <- "thicket"
lexops |>
  mutate(old = stringdist(targ_word, string, method="lv")) |>
  match_item(targ_word, old = 0:5)

# Find matches within a phonological Levenshtein distance of 2 from "thicket":
# (note that this method requires 1-letter phonological transcriptions)
library(dplyr)
library(stringdist)
targ_word <- "thicket"
targ_word_pronun <- lexops |>
  filter(string == "thicket") |>
  pull(eSpeak.br_1letter)
lexops |>
  mutate(pld = stringdist(targ_word_pronun, eSpeak.br_1letter, method="lv")) |>
  match_item(targ_word, pld = 0:2)
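
One situation where standard_eval = TRUE may be convenient is when the variables to match on are held in ordinary R objects rather than typed into the call. The sketch below builds the list in the documented form and only calls match_item() if LexOPS is installed. The object names exact_vars, zipf_tol, and spec are illustrative assumptions.

```r
# Variables to match exactly, held as plain strings
exact_vars <- c("Length", "Syllables.CMU")

# One variable with a lower and upper tolerance, in the documented
# c("name", lower, upper) form (c() coerces the numbers to character)
zipf_tol <- c("Zipf.SUBTLEX_UK", -0.2, 0.2)

# Combine into the single list that standard_eval = TRUE expects
spec <- c(as.list(exact_vars), list(zipf_tol))
str(spec)

# Only run the match if the package (and its lexops data) is available
if (requireNamespace("LexOPS", quietly = TRUE)) {
  res <- LexOPS::match_item(
    LexOPS::lexops, "thicket", spec,
    standard_eval = TRUE
  )
  print(head(res))
}
```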


JackEdTaylor/LexOPS documentation built on Jan. 18, 2025, 10:37 a.m.