str_unify_spacing: Whitespace normalization

View source: R/str_normalize_spacing.R

str_unify_spacing    R Documentation

Whitespace normalization

Description

str_unify_spacing - Normalizes whitespace by replacing everything between words and punctuation characters with a single space character. Boundaries are identified using ICU BreakIterators, with added exceptions for #hashtags, @screen_names, URLs, and <KLARTAGS> (as created by other functions of this package).
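
The boundary-based approach described above can be illustrated with a minimal sketch. It assumes the stringi package (R's interface to ICU) is installed and uses a hypothetical helper, unify_spacing_sketch, which is not part of klartext. The helper splits each string at ICU word boundaries, drops whitespace-only pieces, and rejoins the rest with single spaces. It does not implement the exceptions for #hashtags, @screen_names, URLs, or <KLARTAGS>, so plain ICU boundaries may split such tokens. It is not the package's actual implementation.

```r
library(stringi)

# Hypothetical helper for illustration only; not part of klartext.
unify_spacing_sketch <- function(x) {
  # Split every string at ICU word boundaries (UAX #29 rules).
  pieces <- stri_split_boundaries(x, type = "word")
  vapply(pieces, function(p) {
    # Keep only pieces that contain something other than whitespace.
    p <- p[nzchar(stri_trim_both(p))]
    # Rejoin the remaining words and punctuation with single spaces.
    paste(p, collapse = " ")
  }, character(1))
}

# Runs of spaces, tabs, and newlines collapse to single separators.
unify_spacing_sketch(c("a  b\n c", "x\t\ty"))
```

Handling the token types listed above would need additional logic on top of this, for example protecting those tokens before the split and restoring them afterwards.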

Usage

str_unify_spacing(.str, .tok_lock_regex = NULL)

Arguments

.str

Character vector to be normalized

.tok_lock_regex

...

Value

str_unify_spacing - Returns the normalized character vector.

References

https://www.unicode.org/reports/tr29/#Word_Boundaries

Examples

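## str_unify_spacing EXAMPLE:

# The inputs mix irregular whitespace with the token types that get
# special handling: @screen_names, #hashtags, <KLARTAGS>, and URLs.
str_unify_spacing(c(
  "This  @screen_name that\n #hash_tag, #1",
  "<not-A_KLARTAG> <A_KLARTAG>!?!? An URL",
  "www.example.com/test ..."
))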

m-pilarski/klartext documentation built on June 16, 2024, 1:35 p.m.