count_tokens: Find frequently occurring tokens within a hierarchical column
In epicentre-msf/hmatch: Tools for Cleaning and Matching Hierarchically-Structured Data

count_tokens

R Documentation

Description

Tokenized matching of hierarchical columns can yield false positives when the same token occurs in many distinct hierarchical values (e.g. "South", "North", "City").

This is a helper function to find such frequently-occurring tokens, which can then be passed to the exclude argument of hmatch_tokens. The frequency calculated is the number of unique, string-standardized values in which a given token is found.
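The frequency definition above can be illustrated with a minimal base-R sketch. It is not the package's implementation: `simple_std()` and `count_tokens_sketch()` are hypothetical names, and the standardization here (lowercasing and collapsing non-alphanumeric runs to "_") is only an assumed stand-in for `string_std`.

```r
# Hypothetical stand-in for string_std(): lowercase, then collapse any run of
# non-alphanumeric characters to "_" (an assumption, not the package's rules).
simple_std <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

# Sketch of the counting logic: a token's frequency is the number of unique
# standardized values in which it appears.
count_tokens_sketch <- function(x, split = "[-_[:space:]]+",
                                min_freq = 2, min_nchar = 3) {
  # Deduplicate after standardization so "North Kivu" and "north-kivu"
  # count as a single value
  values <- unique(simple_std(x[!is.na(x)]))

  # Tokenize each value; unique() ensures a token counts once per value
  tokens_by_value <- lapply(strsplit(values, split), unique)
  all_tokens <- unlist(tokens_by_value)

  # Drop tokens shorter than min_nchar, then tabulate
  all_tokens <- all_tokens[nchar(all_tokens) >= min_nchar]
  freq <- table(all_tokens)

  # Keep only tokens found in at least min_freq unique values
  freq <- freq[as.vector(freq) >= min_freq]

  data.frame(
    token = as.character(names(freq)),
    freq = as.integer(freq),
    stringsAsFactors = FALSE
  )
}

res <- count_tokens_sketch(
  c("North Kivu", "north-kivu", "South Kivu", "South Sudan", "North Darfur")
)
# "kivu", "north" and "south" each occur in 2 unique standardized values;
# "sudan" and "darfur" occur in only 1 and are dropped by min_freq = 2
print(res)
```

Because values are deduplicated after standardization, spelling variants of the same place name do not inflate a token's frequency.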

Usage

count_tokens(
  x,
  split = "[-_[:space:]]+",
  min_freq = 2,
  min_nchar = 3,
  return_values = TRUE,
  std_fn = string_std,
  ...
)

Arguments

`x`	a character vector (generally a hierarchical column)
`split`	regex pattern used to split values into tokens. By default splits on any sequence of one or more space characters ("[:space:]"), dashes ("-"), and/or underscores ("_").
`min_freq`	minimum token frequency (i.e. number of unique values in which a given token occurs). Defaults to `2`.
`min_nchar`	minimum token size in number of characters. Defaults to `3`.
`return_values`	logical indicating whether to return the standardized values in which each token is found (`TRUE`), or only the count of the number of unique standardized values (`FALSE`). Defaults to `TRUE`.
`std_fn`	function to standardize strings, as performed within all `hmatch_` functions. Defaults to `string_std`. Set to `NULL` to omit standardization. See also string_standardization.
`...`	additional arguments passed to `std_fn()`

Examples

french_departments <- c(
  "Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
  "Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
  "Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)

count_tokens(french_departments)
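The call above uses the defaults. As a hedged variation, the documented arguments can be adjusted to raise the frequency threshold, return counts only, or skip standardization. The sketch below assumes the hmatch package is installed (it is guarded so that it does nothing otherwise), and the output is not shown because it depends on the installed version.

```r
french_departments <- c(
  "Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
  "Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
  "Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)

# Only run if hmatch is available (assumption: installed from GitHub)
if (requireNamespace("hmatch", quietly = TRUE)) {
  # Keep only tokens found in at least 3 unique values, and return the
  # counts without the list of values in which each token occurs
  print(hmatch::count_tokens(
    french_departments,
    min_freq = 3,
    return_values = FALSE
  ))

  # Omit string standardization entirely by setting std_fn to NULL
  print(hmatch::count_tokens(french_departments, std_fn = NULL))
}
```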

epicentre-msf/hmatch documentation built on Nov. 15, 2023, 1:47 a.m.
