clean_variables: Clean variable labels and fix spelling according to a...

View source: R/clean_variables.R

clean_variables    R Documentation

Clean variable labels and fix spelling according to a wordlist

Description

Clean variable labels and fix spelling according to a wordlist

Usage

clean_variables(
  x,
  sep = "_",
  wordlists = NULL,
  spelling_vars = 3,
  sort_by = NULL,
  protect = FALSE,
  classes = NULL,
  warn_spelling = FALSE
)

Arguments

x

a data.frame

sep

The separator used between words. Defaults to the underscore (_).

wordlists

a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in x (see spelling_vars).

spelling_vars

character or integer. If wordlists is a data frame, this column of wordlists defines the columns in x corresponding to each section of the wordlists data frame. This defaults to 3, indicating that the third column is used.

sort_by

a character: the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.

protect

a logical or numeric vector defining the columns to protect from any manipulation. Note: columns in protect will override any columns in either force_Date or guess_dates (arguments of clean_data()).

classes

a vector of class definitions for each of the columns. If this is not provided, the classes will be read from the columns themselves. Practically, this is used in clean_data() to mark columns as protected.

warn_spelling

if TRUE, errors and warnings from clean_spelling() will be aggregated and presented for each column that issues them. The default value is FALSE, which means that all errors and warnings will be ignored.
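
The two accepted shapes of wordlists can be easier to picture with a small base R sketch. The code below builds a wordlist in data-frame form and derives a named list from it with split(). It assumes that the named-list form holds one two-column data frame per column of x, keyed by column name; the gender rows are hypothetical and only there to show a second group. This illustrates the data layout, not how the package processes wordlists internally.

```r
# A wordlist in data-frame form: columns 1 and 2 hold the values to replace
# and their replacements; column 3 names the column of x each row applies to.
wordlist <- data.frame(
  from     = c("hopsital", "feild", "m", "f"),
  to       = c("hospital", "field", "male", "female"),
  variable = c("location", "location", "gender", "gender"),  # hypothetical columns
  stringsAsFactors = FALSE
)

# Assumption: the equivalent named-list form has one two-column data frame
# per column of x. split() is base R, not a linelist function.
wordlist_list <- split(wordlist[c("from", "to")], wordlist$variable)

names(wordlist_list)
wordlist_list$location
```

With the data-frame form, spelling_vars would point at the variable column (by name or by position 3); with the list form, the list names carry the same information.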

Author(s)

Zhian N. Kamvar

See Also

clean_variable_labels() to standardise text, clean_variable_spelling() to correct spelling with a wordlist.

Examples


## make toy data
toy_data <- messy_data(20)

# location data with mis-spellings, French, and English.
messy_locations <- c("hopsital", "h\u00f4pital", "hospital", 
                     "m\u00e9dical", "clinic", 
                     "feild", "field")
toy_data$location <- sample(messy_locations, 20, replace = TRUE)

## show data
toy_data

# clean labels
clean_variables(toy_data) # by default, this is the same as clean_variable_labels()

# add a wordlist
wordlist <- data.frame(
  from  = c("hopsital", "hopital",  "medical", "feild"),
  to    = c("hospital", "hospital", "clinic",  "field"),
  variable = rep("location", 4),
  stringsAsFactors = FALSE
)

clean_variables(toy_data, 
                wordlists     = wordlist,
                spelling_vars = "variable"
               )
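
To see what a wordlist lookup amounts to for a single column, here is a minimal base R sketch using match(). It is not the package's implementation; the location vector is a hypothetical stand-in for the column after label cleaning (accents already removed), and the from and to vectors mirror the wordlist above.

```r
# Hypothetical stand-in for one column of x after label cleaning.
location <- c("hopsital", "hopital", "hospital", "medical",
              "clinic", "feild", "field")

# The first two columns of the wordlist from the example above.
from <- c("hopsital", "hopital",  "medical", "feild")
to   <- c("hospital", "hospital", "clinic",  "field")

# match() gives the position of each value in `from`, or NA when absent.
idx <- match(location, from)

# Replace matched values; leave values with no wordlist entry unchanged.
corrected <- ifelse(is.na(idx), location, to[idx])
corrected
```

Values that do not appear in from, such as "hospital" and "clinic", pass through untouched in this sketch.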

reconhub/linelist documentation built on Jan. 1, 2023, 9:39 p.m.