count_tokens | R Documentation |
Tokenized matching of hierarchical columns can yield false positives when there are tokens that occur frequently in multiple unique hierarchical values (e.g. "South", "North", "City", etc.).
This is a helper function to find such frequently-occurring tokens, which can then be passed to the exclude argument of hmatch_tokens. The frequency calculated is the number of unique, string-standardized values in which a given token is found.
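To make the frequency definition concrete, here is a rough base-R sketch of the counting idea. It is not the package's implementation: tolower() is only a stand-in for the real string standardization, and the helper name token_freq_sketch is hypothetical.

```r
# Hypothetical helper illustrating the frequency definition; not part of hmatch.
token_freq_sketch <- function(x, split = "[-_[:space:]]+") {
  # Stand-in standardization (assumption): lowercase, then drop duplicate values
  vals <- unique(tolower(x))
  # Split each unique value into tokens, keeping each token once per value
  tokens_by_value <- lapply(strsplit(vals, split), unique)
  # Frequency = number of unique values in which a token is found
  freq <- table(unlist(tokens_by_value))
  sort(freq, decreasing = TRUE)
}

x <- c("North City", "north city", "South City", "North-West")
freq <- token_freq_sketch(x)
freq
```

Here "North City" and "north city" collapse to one standardized value, so "north" and "city" each have a frequency of 2, while "south" and "west" each have a frequency of 1.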
count_tokens(
x,
split = "[-_[:space:]]+",
min_freq = 2,
min_nchar = 3,
return_values = TRUE,
std_fn = string_std,
...
)
x |
a character vector (generally a hierarchical column) |
split |
regex pattern used to split values into tokens. By default splits on any sequence of one or more space characters ("[:space:]"), dashes ("-"), and/or underscores ("_"). |
min_freq |
minimum token frequency (i.e. number of unique values in which a given token occurs). Defaults to 2. |
min_nchar |
minimum token size in number of characters. Defaults to 3. |
return_values |
logical indicating whether to return the standardized values in which each token is found. Defaults to TRUE. |
std_fn |
function to standardize strings. Defaults to string_std. |
... |
additional arguments passed to std_fn |
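As a quick illustration of the split and min_nchar arguments, the default pattern can be tried directly with base R's strsplit(). This sketch only mimics the splitting and length filtering; it assumes the input values are already standardized and does not call count_tokens itself.

```r
# Default split pattern, copied from the usage above
split <- "[-_[:space:]]+"

# Assumed already-standardized input values
vals <- c("ille-et-vilaine", "bouches_du  rhone")

# Split on runs of dashes, underscores, and/or whitespace
tokens <- strsplit(vals, split)
tokens[[1]]  # "ille" "et" "vilaine"
tokens[[2]]  # "bouches" "du" "rhone"

# Keep only tokens with at least min_nchar characters
min_nchar <- 3
kept <- lapply(tokens, function(tk) tk[nchar(tk) >= min_nchar])
kept
```

With min_nchar = 3, short connecting tokens such as "et" and "du" are dropped before any frequency is counted.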
french_departments <- c(
"Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
"Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
"Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)
count_tokens(french_departments)