euc_dists: Calculate a word's Euclidean distance from other words

View source: R/euc_dists.R

euc_dists    R Documentation

Calculate a word's Euclidean distance from other words

Description

Calculates the Euclidean distance of a word from all other words in a data frame, on selected variables.

Usage

euc_dists(
  df = LexOPS::lexops,
  target,
  vars = "all",
  scale = TRUE,
  center = TRUE,
  weights = NA,
  standardise_weights = TRUE,
  id_col = "string",
  standard_eval = FALSE
)

Arguments

df

A data frame.

target

The target string (word) for which Euclidean distances are required.

vars

The variables to be used as the dimensions over which Euclidean distance is calculated. Can be a vector of variable names (e.g. c(Zipf.SUBTLEX_UK, Length)), or "all" to use all numeric variables in the data frame. The default is "all".

scale, center

How should variables be scaled and/or centred before calculating Euclidean distance? For options, see the scale and center arguments of scale. Default for both is TRUE. Scaling is useful when variables are measured on different scales, as otherwise the variable with the largest range dominates the distance.
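The effect of scaling can be seen in a small base-R sketch. The toy data frame, its values, and the helper `dist_from_first()` are hypothetical and only illustrate the idea; they are not part of LexOPS.

```r
# Hypothetical values: word length (small range) and a count-like variable (large range)
toy <- data.frame(
  string = c("a", "b", "c"),
  Length = c(3, 4, 9),
  Count  = c(1000, 5000, 1200)
)
vars <- c("Length", "Count")

raw    <- as.matrix(toy[, vars])
scaled <- scale(raw, center = TRUE, scale = TRUE)

# Hypothetical helper: Euclidean distance of every row from the first row
dist_from_first <- function(m) sqrt(rowSums(sweep(m, 2, m[1, ])^2))

raw_d    <- dist_from_first(raw)     # almost entirely determined by Count
scaled_d <- dist_from_first(scaled)  # Length and Count contribute in SD units

# With these toy values, the nearest neighbour of "a" changes:
# "c" is nearest on the raw values, "b" is nearest after scaling
toy$string[order(raw_d)]
toy$string[order(scaled_d)]
```

On the raw values, a difference of 4000 in Count swamps any difference in Length; after scaling, each variable is expressed in standard deviations, so neither dominates simply because of its units.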

weights

An (optional) list of weights, in the same order as vars. After any scaling is applied, the values will be multiplied by these weights. Default is NA, meaning no weights are applied.

standardise_weights

Logical; should the weights be standardised to average to 1 (i.e., sum to the length of vars)? If TRUE, weights=c(1, 3, 6) will be treated as weights=c(0.3, 0.9, 1.8). Setting standardise_weights=TRUE ensures that the overall scale of the space is unchanged when weights change. This means, for example, that the same tolerance can be used in control_for_euc(). Default is TRUE.
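As a minimal sketch of what "average to 1" means, the weights can be rescaled by length(w) / sum(w). The helper name `standardise()` is hypothetical and is not claimed to match the package's internal code.

```r
# Hypothetical helper: rescale weights so that they average to 1
standardise <- function(w) w * length(w) / sum(w)

w     <- c(1, 3, 6)
w_std <- standardise(w)

w_std        # 0.3 0.9 1.8
mean(w_std)  # 1
sum(w_std)   # 3, i.e. the number of variables

# Only the relative sizes of the weights matter after standardisation
standardise(c(10, 30, 60))  # same result as standardise(c(1, 3, 6))
```

Because proportional weight vectors map to the same standardised weights, changing the weights shifts emphasis between variables without inflating or shrinking all distances at once.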

id_col

The column containing the strings (default = "string").

standard_eval

Logical; bypasses non-standard evaluation, and allows more standard R objects in vars. If TRUE, vars should be a character vector referring to columns in df (e.g. c("Length", "Zipf.SUBTLEX_UK")). Default = FALSE.

Value

Returns a vector of Euclidean distances, in the order of rows in df.
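The shape of the returned value can be illustrated with a base-R sketch of the general technique: scale the selected columns, subtract the target's row, and take the root sum of squares per row. The toy data and the function `sketch_euc_dists()` are hypothetical, and this is not the package's implementation (it ignores weights and non-standard evaluation).

```r
# Toy data frame standing in for `df` (hypothetical values, not from LexOPS::lexops)
toy <- data.frame(
  string = c("cat", "dog", "horse", "zebra"),
  Length = c(3, 3, 5, 5),
  Zipf   = c(5.2, 5.4, 4.9, 3.8)
)

# Hypothetical helper illustrating the idea
sketch_euc_dists <- function(df, target, vars, id_col = "string") {
  # Centre and scale each selected column, as base::scale() does by default
  m <- scale(as.matrix(df[, vars, drop = FALSE]), center = TRUE, scale = TRUE)
  # Coordinates of the target word in the scaled space
  target_row <- m[df[[id_col]] == target, , drop = TRUE]
  # Subtract the target from every row, then take the root sum of squares
  diffs <- sweep(m, 2, target_row)
  sqrt(rowSums(diffs^2))
}

d <- sketch_euc_dists(toy, "dog", c("Length", "Zipf"))

length(d) == nrow(toy)  # one distance per row, in row order
d[2]                    # the target's distance from itself is 0
```

Since the result has one element per row of the data frame, in the same order, it can be assigned directly to a new column.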

Examples


# Get the distance of every entry in the `lexops` dataset from the word "thicket".
# (Note: This will be calculated using the dimensions of frequency, arousal, and size)
lexops |>
  euc_dists("thicket", c(Zipf.SUBTLEX_UK, AROU.Warriner, SIZE.Glasgow_Norms))

# no scaling or centering
lexops |>
  euc_dists(
    "thicket",
    c(Zipf.SUBTLEX_UK, AROU.Warriner, SIZE.Glasgow_Norms),
    scale = FALSE,
    center = FALSE
  )

# Add Euclidean distance as new column
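# (Also sort in ascending order of distance; "barbara" has a distance of 0 from itself, so it comes first)
# (The magrittr pipe `%>%` is used here so that `.` can refer to the piped data frame)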
lexops %>%
  dplyr::mutate(ed = euc_dists(., "barbara", c(Length, Zipf.SUBTLEX_UK, BG.SUBTLEX_UK))) |>
  dplyr::arrange(ed)

# bypass non-standard evaluation
lexops |>
  euc_dists(
    "thicket",
    c("Zipf.SUBTLEX_UK", "AROU.Warriner", "SIZE.Glasgow_Norms"),
    standard_eval = TRUE
  )

JackEdTaylor/LexOPS documentation built on Jan. 18, 2025, 10:37 a.m.