clean_text: Clean text strings automatically

cleanTextR Documentation

Clean text strings automatically

Description

cleanText: Clean character strings automatically. Options to keep ASCII characters only, keep certain characters, lower caps, title format, are available.

cleanNames: Resulting names are unique and consist only of the _ character, numbers, and ASCII letters. Capitalization preferences can be specified using the lower parameter.

Usage

cleanText(
  text,
  spaces = TRUE,
  keep = "",
  lower = TRUE,
  ascii = TRUE,
  title = FALSE
)

cleanNames(df, num = "x", keep = "_", ...)

Arguments

text

Character Vector

spaces

Boolean. Keep spaces? If character input, spaces will be transformed into passed argument.

keep

Character. String (concatenated or as vector) with all characters that are accepted and should be kept, in addition to alphanumeric.

lower

Boolean. Transform all to lower case?

ascii

Boolean. Only ASCII characters?

title

Boolean. Transform to title format (upper case on first letters).

df

data.frame/tibble.

num

Add character before only-numeric names.

...

Additional parameters passed to cleanText().

Details

Inspired by janitor::clean_names.

Value

Character vector with transformed strings.

data.frame/tibble with transformed column names.

See Also

Other Data Wrangling: balance_data(), categ_reducer(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohe_commas(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), textTokenizer(), vector2text(), year_month(), zerovar()

Other Text Mining: ngrams(), remove_stopwords(), replaceall(), sentimentBreakdown(), textCloud(), textFeats(), textTokenizer(), topics_rake()

Examples

cleanText("Bernardo Lares 123")
cleanText("Bèrnärdo LáreS 123", lower = FALSE)
cleanText("Bernardo Lare$", spaces = ".", ascii = FALSE)
cleanText("\\@®ì÷å   %ñS  ..-X", spaces = FALSE)
cleanText(c("maría", "€", "núñez_a."), title = TRUE)
cleanText("29_Feb-92()#", keep = c("#", "_"), spaces = FALSE)

# For a data.frame directly:
df <- dft[1:5, 1:6] # Dummy data
colnames(df) <- c("ID.", "34", "x_2", "Num 123", "Nòn-äscì", "  white   Spaces  ")
print(df)
cleanNames(df)
cleanNames(df, lower = FALSE)

laresbernardo/lares documentation built on Jan. 14, 2025, 2:22 a.m.