clean_data: Clean a data.frame
In reconhub/linelist: Tools to Import and Tidy Case Linelist Data

clean_data

R Documentation

Clean a data.frame

Description

This function applies several cleaning procedures to an input data.frame: it standardises variable names and the labels used in categorical variables (characters or factors), and sets dates to Date objects. Optionally, an intelligent date search can be used on character strings to extract dates from various formats mixed with other text. See details for more information.

Usage

clean_data(
  x,
  sep = "_",
  force_Date = TRUE,
  guess_dates = FALSE,
  error_tolerance = 0.5,
  wordlists = NULL,
  spelling_vars = 3,
  sort_by = NULL,
  warn_spelling = FALSE,
  protect = FALSE,
  ...
)

Arguments

`x`	a `data.frame`
`sep`	The separator used between words; defaults to the underscore `_`.
`force_Date`	a `logical` or `integer` vector indicating the columns in which `POSIXct` and `POSIXlt` objects should be converted to `Date` objects; defaults to `TRUE`; you should use this if your dates are only precise to the day (i.e. no time information within days).
`guess_dates`	a `logical` or `integer` vector indicating the columns in which dates should be guessed, assuming these columns store character strings or `factors`; this feature is experimental; see `guess_dates()` for more information.
`error_tolerance`	a number between 0 and 1 giving the tolerated proportion of entries that cannot be identified as dates; if this proportion is exceeded, the original vector is returned and a message is issued; the signature shown under Usage sets the default to 0.5 (50 percent)
`wordlists`	a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in `x` (see `spelling_vars`).
`spelling_vars`	character or integer. If `wordlists` is a data frame, then this column defines the columns in `x` corresponding to each section of the `wordlists` data frame. This defaults to `3`, indicating the third column is to be used.
`sort_by`	a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.
`warn_spelling`	if `TRUE`, errors and warnings from `clean_spelling()` will be aggregated and presented for each column that issues them. The default value is `FALSE`, which means that all errors and warnings will be ignored.
`protect`	a logical or numeric vector defining the columns to protect from any manipulation. Note: columns in `protect` will override any columns in either `force_Date` or `guess_dates`.
`...`	further arguments passed on to `guess_dates()`
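
The tolerance rule described for `error_tolerance` can be sketched in base R. This is only an illustration, not the package's implementation: `apply_tolerance()` is a hypothetical helper, and computing the proportion over non-missing entries is an assumption.

```r
# Hypothetical helper, NOT part of linelist: illustrates the rule
# "return the original vector if too many entries fail to parse".
apply_tolerance <- function(original, parsed, error_tolerance = 0.5) {
  n_present <- sum(!is.na(original))
  if (n_present == 0) {
    return(parsed)
  }
  # Entries that were present in the input but could not be parsed as dates
  failed <- !is.na(original) & is.na(parsed)
  prop_failed <- sum(failed) / n_present  # assumption: missing inputs are not counted
  if (prop_failed > error_tolerance) {
    message("Too many entries could not be parsed; returning the original vector")
    return(original)
  }
  parsed
}

raw    <- c("2018-01-02", "2018-01-03", "not a date", "unknown")
parsed <- as.Date(raw, format = "%Y-%m-%d")  # NA where the string is not a date

strict  <- apply_tolerance(raw, parsed, error_tolerance = 0.1)  # half fail: input kept
lenient <- apply_tolerance(raw, parsed, error_tolerance = 0.8)  # half fail: tolerated
```

With a low tolerance the column stays as character strings; with a high tolerance it becomes a `Date` vector in which unparseable entries are `NA`.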

Value

A data.frame with standardised labels for characters and factors.

Note

Creating your wordlist

When creating the wordlist for clean_variable_spelling(), it's important to remember that the data will first be cleaned with clean_variable_labels(), which will remove any capitalisation and accents, and replace all punctuation and spaces with "_".
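
As a rough guide to what the `from` entries of a wordlist need to match, the cleaning described above can be approximated in base R. `normalise_label()` is a hypothetical helper written for this note; its accent mapping is deliberately incomplete, and the trimming of leading and trailing underscores is an assumption rather than documented behaviour.

```r
# Hypothetical helper, NOT the linelist implementation: approximates the
# label cleaning so that wordlist keys can be written in their cleaned form.
normalise_label <- function(x) {
  x <- tolower(x)
  # Replace a handful of accented letters (assumed, incomplete mapping)
  x <- chartr("\u00e0\u00e2\u00e7\u00e8\u00e9\u00ea\u00ee\u00f4\u00fb",
              "aaceeeiou", x)
  # Collapse runs of punctuation and spaces into a single underscore
  x <- gsub("[^a-z0-9]+", "_", x)
  # Assumption: drop underscores left at either end
  gsub("^_+|_+$", "", x)
}

keys <- normalise_label(c("H\u00f4pital", "M\u00e9dical  Centre!"))
keys
```

This is why a wordlist would list `"hopital"` rather than `"h\u00f4pital"` in its `from` column: the accent has already been removed by the time the spelling corrections are applied.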

Author(s)

Thibaut Jombart, Zhian N. Kamvar

Examples


## make toy data
toy_data <- messy_data(20)

## show data
toy_data


## clean variable names, store in new object, show results
clean_data1 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1)
clean_data1

clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8)
clean_data2

## clean variable names, but keep our "messy/dates" column
to_protect <- names(toy_data) %in% "messy/dates"
clean_data3 <- clean_data(toy_data, 
                          guess_dates = TRUE,
                          error_tolerance = 0.8,
                          protect = to_protect
                         )
clean_data3

## Using a wordlist  -------------------------------

# location data with mis-spellings, French, and English.
messy_locations <- c("hopsital", "h\u00f4pital", "hospital", 
                     "m\u00e9dical", "clinic", 
                     "feild", "field")
toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE))

# show data 
toy_data$location


# add a wordlist
wordlist <- data.frame(
  from  = c("hopsital", "hopital",  "medical", "feild"),
  to    = c("hospital", "hospital", "clinic",  "field"),
  variables = rep("location", 4),
  stringsAsFactors = FALSE
)

clean_data4 <- clean_data(toy_data, 
                          wordlists     = wordlist,
                          spelling_vars = "variables"
                         )
clean_data4
clean_data4$location

reconhub/linelist documentation built on Jan. 1, 2023, 9:39 p.m.