clean_data: Clean a data.frame


clean_data    R Documentation

Clean a data.frame

Description

This function applies several cleaning procedures to an input data.frame: it standardises variable names, standardises the labels used in categorical variables (characters or factors), and sets dates to Date objects. Optionally, an intelligent date search can be used on character strings to extract dates from various formats mixed with other text. See details for more information.

Usage

clean_data(
  x,
  sep = "_",
  force_Date = TRUE,
  guess_dates = FALSE,
  error_tolerance = 0.5,
  wordlists = NULL,
  spelling_vars = 3,
  sort_by = NULL,
  warn_spelling = FALSE,
  protect = FALSE,
  ...
)

Arguments

x

a data.frame

sep

The separator used between words; defaults to the underscore _.

force_Date

a logical or integer vector indicating the columns to be converted. If logical, it indicates whether POSIXct and POSIXlt objects should be converted to Date objects; defaults to TRUE. Use this if your dates are only precise to the day (i.e. no time information within days).

guess_dates

a logical or integer vector indicating which columns should have their dates guessed, assuming these columns store character strings or factors. This feature is experimental; see guess_dates() for more information.

error_tolerance

a number between 0 and 1 indicating the tolerated proportion of entries that cannot be identified as dates; if this proportion is exceeded, the original vector is returned and a message is issued. In clean_data() this defaults to 0.5 (50 percent), as shown in Usage.

wordlists

a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in x (see spelling_vars).

spelling_vars

character or integer. If wordlists is a data frame, then this column in wordlists defines the columns in x corresponding to each section of the wordlists data frame. This defaults to 3, indicating that the third column is to be used.

sort_by

a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.

warn_spelling

if TRUE, errors and warnings from clean_spelling() will be aggregated and presented for each column that issues them. The default value is FALSE, which means that all errors and warnings will be ignored.

protect

a logical or numeric vector defining the columns to protect from any manipulation. Note: columns in protect will override any columns in either force_Date or guess_dates.

...

further arguments passed on to guess_dates()
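
To illustrate the error_tolerance rule described above, here is a minimal base R sketch. It is not the linelist implementation: tolerant_as_date() is a hypothetical helper, and it assumes a single fixed date format ("%Y-%m-%d") instead of the flexible search performed by guess_dates().

```r
# Hypothetical helper (not part of linelist): convert to Date only if the
# proportion of unparseable entries stays within the tolerance.
tolerant_as_date <- function(x, error_tolerance = 0.5) {
  # Assumption: one fixed format; entries that do not match become NA
  parsed <- as.Date(x, format = "%Y-%m-%d")
  # Proportion of non-missing entries that could not be read as dates
  failed <- mean(is.na(parsed) & !is.na(x))
  if (failed > error_tolerance) {
    message("Too many entries could not be parsed; returning the original vector")
    return(x)
  }
  parsed
}

x <- c("2020-01-01", "2020-02-15", "unknown", "2020-03-10")
class(tolerant_as_date(x, error_tolerance = 0.5))  # 1 in 4 failed: converted
class(tolerant_as_date(x, error_tolerance = 0.1))  # 1 in 4 failed: left as is
```

With one unparseable entry out of four, the vector is converted when the tolerance is 0.5 and returned unchanged when it is 0.1, which mirrors the difference between the two error_tolerance values used in the Examples.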

Value

A data.frame with standardised labels for characters and factors.

Note

Creating your wordlist

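When creating the wordlist for clean_variable_spelling(), remember that the data will first be cleaned with clean_variable_labels(), which removes capitalisation and accents, and replaces all punctuation and spaces with "_". The values to be replaced in your wordlist should therefore be written in their cleaned form (e.g. "hopital", without the accent or capital letter).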
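
As a rough illustration of why this matters, the base R sketch below approximates the kind of label cleaning described in this note. approx_clean_label() is a hypothetical helper, not the code used by clean_variable_labels(): it only handles a small hand-picked set of accented letters, and collapsing runs of punctuation or spaces into a single "_" is an assumption.

```r
# Hypothetical approximation of label cleaning (not the linelist code)
approx_clean_label <- function(x) {
  # Illustrative subset of accented letters and their plain replacements
  accented <- c("\u00f4", "\u00e9", "\u00e8", "\u00d4", "\u00c9")
  plain    <- c("o",      "e",      "e",      "O",      "E")
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }
  x <- tolower(x)                          # remove capitalisation
  gsub("[[:punct:][:space:]]+", "_", x)    # punctuation and spaces become "_"
}

approx_clean_label(c("H\u00f4pital", "M\u00e9dical", "Field Site"))
```

Under these assumptions the three labels come out as "hopital", "medical" and "field_site", which is the form the first column of a wordlist would need to match.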

Author(s)

Thibaut Jombart, Zhian N. Kamvar

See Also

This function wraps three other functions: clean_variable_names() - to handle variable names, clean_variables() - to handle character/factor variables, clean_dates() - to handle dates.

Examples


## make toy data
toy_data <- messy_data(20)

## show data
toy_data


## clean the data, store in new object, show results
## (the result is named clean_data1 to avoid masking the clean_data() function)
clean_data1 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1)
clean_data1

clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8)
clean_data2

## clean the data, but protect our "messy/dates" column from any changes
to_protect <- names(toy_data) %in% "messy/dates"
clean_data3 <- clean_data(toy_data, 
                          guess_dates = TRUE,
                          error_tolerance = 0.8,
                          protect = to_protect
                         )
clean_data3

## Using a wordlist  -------------------------------

# location data with mis-spellings, French, and English.
messy_locations <- c("hopsital", "h\u00f4pital", "hospital", 
                     "m\u00e9dical", "clinic", 
                     "feild", "field")
toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE))

# show data 
toy_data$location


# add a wordlist
wordlist <- data.frame(
  from  = c("hopsital", "hopital",  "medical", "feild"),
  to    = c("hospital", "hospital", "clinic",  "field"),
  variables = rep("location", 4),
  stringsAsFactors = FALSE
)

clean_data4 <- clean_data(toy_data, 
                          wordlists     = wordlist,
                          spelling_vars = "variables"
                         )
clean_data4
clean_data4$location
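
The wordlists argument is also documented as accepting a named list of data frames. The base R sketch below shows one way such a list could be derived from a three-column wordlist by splitting on the variable column. The "gender" entries are hypothetical additions for illustration, and the assumption that the list names should match column names in x is inferred from the argument description, not confirmed here.

```r
# A three-column wordlist covering two variables ("gender" rows are hypothetical)
wordlist2 <- data.frame(
  from      = c("hopsital", "hopital",  "medical", "feild", "f",      "m"),
  to        = c("hospital", "hospital", "clinic",  "field", "female", "male"),
  variables = c(rep("location", 4), rep("gender", 2)),
  stringsAsFactors = FALSE
)

# Split the from/to columns into one data frame per variable
wordlist_list <- split(wordlist2[c("from", "to")], wordlist2$variables)

names(wordlist_list)      # one list element per variable
wordlist_list$location    # the from/to pairs for the location column

# Assumed usage, requires linelist and a data frame with matching columns:
# clean_data(toy_data, wordlists = wordlist_list)
```

Keeping everything in one data frame with a spelling_vars column is easier to maintain in a spreadsheet, while the list form makes it explicit which replacements apply to which column.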

reconhub/linelist documentation built on Jan. 1, 2023, 9:39 p.m.