str_normalize: Apply regular-expression-based text normalization to any...

View source: R/salim-GEN.R

Description

Applies a set of regular-expression-based text normalization rules to one or more files given in path. By default, changes are shown on the console only, without actually modifying any files. Set run_dry = FALSE to apply the changes.

Usage

str_normalize(
  path,
  rules = salim::regex_text_normalization,
  run_dry = TRUE,
  process_line_by_line = FALSE,
  n_context_chrs = 20L,
  verbose = TRUE
)

Arguments

path

Paths to the text files. A character vector.

rules

A tibble of regular expression patterns and replacements. It must have the columns pattern and replacement. pattern can optionally be a list column that assigns several patterns to the same replacement. Patterns are interpreted as regular expressions as described in the stringi::stringi-search-regex help topic. Replacements are interpreted as-is, except that references of the form \1, \2, etc. are replaced with the contents of the corresponding capture group (created in patterns using ()). Pattern-replacement pairs are processed in the order given, so earlier pairs are applied before later ones.
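
As a rough illustration of sequential processing and group references, the sketch below applies pattern-replacement pairs in order using base R's gsub(). It is not the package's implementation: apply_rules() and the example rules are hypothetical, a plain data.frame stands in for the tibble, and gsub(perl = TRUE) uses PCRE rather than the ICU engine behind stringi.

```r
# Hypothetical rules table; a base data.frame stands in for the tibble.
rules <- data.frame(
  pattern = c("\\s+", "(\\d+) ?%"),
  replacement = c(" ", "\\1 percent"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each pattern-replacement pair in row order.
apply_rules <- function(x, rules) {
  for (i in seq_len(nrow(rules))) {
    x <- gsub(rules$pattern[i], rules$replacement[i], x, perl = TRUE)
  }
  x
}

# Whitespace is collapsed first, then "\\1" reinserts the captured digits.
normalized <- apply_rules("growth  of 5%   per year", rules)

# Order matters: the same two pairs give different results when reversed.
chain <- data.frame(
  pattern = c("a", "b"),
  replacement = c("b", "c"),
  stringsAsFactors = FALSE
)
forward <- apply_rules("ab", chain)
backward <- apply_rules("ab", chain[2:1, ])
```

Here forward and backward differ because, in the forward order, the "b" produced by the first pair is picked up again by the second pair.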

run_dry

If TRUE, replacements are shown on the console only and no files are modified. Implies verbose = TRUE.

process_line_by_line

Whether to treat each line of a file as a separate string (TRUE) or the whole file as a single string (FALSE). The latter is faster, but you probably want the former if your patterns use "^" or "$".
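
The difference can be sketched with base R, under the assumption that anchors behave as they do by default in most regex engines, that is, "^" matches only at the very start of the string being searched. The text and pattern are made up for illustration, and gsub() stands in for the package's own replacement step.

```r
# Two lines that both start with "# ".
text <- "# title\n# subtitle"

# Whole file as one string: "^" matches only at the very beginning.
whole <- gsub("^# ", "", text, perl = TRUE)

# Line by line: each line is its own string, so "^" matches on every line.
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
by_line <- paste(gsub("^# ", "", lines, perl = TRUE), collapse = "\n")
```

In whole-file mode only the first line loses its prefix, while line-by-line processing strips it from both.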

n_context_chrs

The maximum number of characters displayed on each side of the matched string and its replacement. Because the number applies to a single side, at most 2 * n_context_chrs context characters are shown in total. Only relevant if verbose = TRUE.
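
A minimal sketch of how such a context window could be cut out, assuming a simple character-offset approach. show_context() is a hypothetical helper, not a function of the package, and the bracket markers are only for illustration.

```r
# Hypothetical helper: return the first match of `pattern` in `x`, wrapped in
# brackets, with up to `n_context_chrs` characters of context on each side.
show_context <- function(x, pattern, n_context_chrs = 20L) {
  m <- regexpr(pattern, x, perl = TRUE)
  if (m == -1L) return(NA_character_)
  start <- as.integer(m)
  end <- start + attr(m, "match.length") - 1L
  # Fewer characters are returned when the match sits near either end of `x`.
  before <- substr(x, max(1L, start - n_context_chrs), start - 1L)
  after <- substr(x, end + 1L, min(nchar(x), end + n_context_chrs))
  paste0(before, "[", substr(x, start, end), "]", after)
}

middle <- show_context("The quick brown fox jumps", "brown", n_context_chrs = 3L)
edge <- show_context("The quick brown fox jumps", "The", n_context_chrs = 3L)
```

The second call shows why the value is a maximum: a match at the start of the string has no characters before it, so only the right-hand context is returned.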

verbose

Whether to display replacements on the console.

Value

path, invisibly.

See Also

regex_text_normalization


salim-b/salim documentation built on April 6, 2021, 10:15 a.m.