count_tokens: Find frequently occurring tokens within a hierarchical column

count_tokens    R Documentation

Description

Tokenized matching of hierarchical columns can yield false positives when the same tokens occur in many different hierarchical values (e.g. "South", "North", "City").

This is a helper function to find such frequently-occurring tokens, which can then be passed to the exclude argument of hmatch_tokens. The frequency calculated is the number of unique, string-standardized values in which a given token is found.
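The frequency described above can be sketched in base R as follows. This is only an illustration of the counting idea, not the package's implementation: token_freq_sketch() is a hypothetical helper, and lowercasing plus trimming is an assumed stand-in for string_std(), whose exact behaviour is not described here.

```r
# Hypothetical helper, not part of hmatch: for each token, count the number
# of unique standardized values in which it occurs.
token_freq_sketch <- function(x, split = "[-_[:space:]]+",
                              min_freq = 2, min_nchar = 3) {
  # Assumed stand-in for string_std(): lowercase and trim whitespace only
  values <- unique(trimws(tolower(x[!is.na(x)])))
  # Split each value into tokens, keeping each token once per value
  tokens <- lapply(strsplit(values, split), unique)
  freq <- table(unlist(tokens))
  counts <- setNames(as.integer(freq), names(freq))
  # Keep tokens that are long enough and frequent enough
  keep <- nchar(names(counts)) >= min_nchar & counts >= min_freq
  sort(counts[keep], decreasing = TRUE)
}

token_freq_sketch(c("North City", "South City", "north city", "Northfield"))
```

In this sketch "North City" and "north city" collapse to a single standardized value, so "north" is counted once and falls below min_freq, while "city" is found in two distinct values and is retained.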

Usage

count_tokens(
  x,
  split = "[-_[:space:]]+",
  min_freq = 2,
  min_nchar = 3,
  return_values = TRUE,
  std_fn = string_std,
  ...
)

Arguments

x

a character vector (generally a hierarchical column)

split

regex pattern used to split values into tokens. By default, splits on any sequence of one or more whitespace characters ("[:space:]"), dashes ("-"), and/or underscores ("_").

min_freq

minimum token frequency (i.e. number of unique values in which a given token occurs). Defaults to 2.

min_nchar

minimum token size in number of characters. Defaults to 3.

return_values

logical indicating whether to return the standardized values in which each token is found (TRUE), or only the number of unique standardized values (FALSE). Defaults to TRUE.

std_fn

function to standardize strings, as performed within all hmatch_ functions. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()
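To see what the default split pattern does on its own, it can be passed to base R's strsplit(). This is a small sketch that only demonstrates the regex; it does not call count_tokens, and the input strings are made up for illustration.

```r
# Default pattern from the Usage section: runs of dashes, underscores,
# and/or whitespace are treated as a single separator
split_default <- "[-_[:space:]]+"

tokens <- strsplit(c("Alpes-de-Haute-Provence", "New  South_Wales"),
                   split_default)
tokens

# Token lengths, for comparison against a min_nchar threshold
nchar(tokens[[1]])
```

The token "de" has only two characters, so with a minimum token size of 3 it would be expected to be dropped.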

Examples

french_departments <- c(
  "Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
  "Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
  "Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)

count_tokens(french_departments)


epicentre-msf/hmatch documentation built on Nov. 15, 2023, 1:47 a.m.