clean_data | R Documentation |
This function applies several cleaning procedures to an input data.frame by standardising variable names and the labels used in categorical variables (characters or factors), and by setting dates to Date objects. Optionally, an intelligent date search can be used on character strings to extract dates from various formats mixed with other text. See details for more information.
clean_data( x, sep = "_", force_Date = TRUE, guess_dates = FALSE, error_tolerance = 0.5, wordlists = NULL, spelling_vars = 3, sort_by = NULL, warn_spelling = FALSE, protect = FALSE, ... )
x |
a |
sep |
The separator used between words; defaults to the underscore ("_") |
force_Date |
a |
guess_dates |
a |
error_tolerance |
a number between 0 and 1 giving the tolerated proportion of entries that cannot be identified as dates; if this proportion is exceeded, the original vector is returned and a message is issued; defaults to 0.5 (50 percent), as shown in the usage |
wordlists |
a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the wordlists by column in |
spelling_vars |
character or integer. If |
sort_by |
a character: the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted. |
warn_spelling |
if |
protect |
a logical or numeric vector defining the columns to protect
from any manipulation. Note: columns in |
... |
further arguments passed on to |
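The error_tolerance rule can be illustrated with a minimal base R sketch. The helper dates_within_tolerance() and its format argument are hypothetical and not part of the package. The sketch assumes a single date format and assumes missing values are excluded from the proportion; the actual date guessing may behave differently.

```r
# Hypothetical helper, not part of the package: converts a character vector
# to Date only if the share of unparseable entries is within the tolerance.
dates_within_tolerance <- function(x, error_tolerance = 0.5,
                                   format = "%Y-%m-%d") {
  parsed <- as.Date(x, format = format)  # entries that fail to parse become NA
  n <- sum(!is.na(x))                    # assumption: ignore already-missing entries
  if (n == 0) {
    return(parsed)
  }
  prop_failed <- sum(is.na(parsed) & !is.na(x)) / n
  if (prop_failed > error_tolerance) {
    message("Too many entries could not be read as dates; ",
            "returning the original vector")
    return(x)
  }
  parsed
}

x <- c("2018-01-02", "2018-02-03", "not a date", "2018-04-05")

# one entry in four fails: above a tolerance of 0.1, so x is returned unchanged
res_strict <- dates_within_tolerance(x, error_tolerance = 0.1)

# the same failure rate is within a tolerance of 0.5, so a Date vector is returned
res_loose <- dates_within_tolerance(x, error_tolerance = 0.5)
```

Under these assumptions, a low tolerance keeps mostly-text columns as characters, while a high tolerance converts them and turns the unreadable entries into NA.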
A data.frame with standardised labels for characters and factors.
When creating the wordlist for clean_variable_spelling(), it's important to remember that the data will first be cleaned with clean_variable_labels(), which removes any capitalisation and accents and replaces all punctuation and spaces with "_". The "from" entries of a wordlist should therefore be written in this cleaned form.
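The following base R sketch shows why wordlist entries need to match the cleaned labels rather than the raw ones. approx_clean_label() is a hypothetical stand-in that only approximates the cleaning described above (lower case, a small illustrative set of accents, punctuation and spaces replaced by "_"). It is not the implementation of clean_variable_labels(), and the lookup with match() is likewise an assumption about how a wordlist could be applied.

```r
# Hypothetical approximation of the label cleaning described above;
# the real clean_variable_labels() may differ in its details.
approx_clean_label <- function(x) {
  x <- tolower(x)
  # strip a few common accents (illustrative subset only)
  x <- chartr("\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00ee\u00f4\u00fb",
              "aaceeeiou", x)
  # replace runs of punctuation and spaces with a single underscore
  x <- gsub("[^a-z0-9]+", "_", x)
  # drop underscores left at either end
  gsub("^_+|_+$", "", x)
}

# wordlist written against the cleaned labels: no accents, no capitals
wordlist <- data.frame(
  from = c("hopsital", "hopital", "medical", "feild"),
  to   = c("hospital", "hospital", "clinic", "field"),
  stringsAsFactors = FALSE
)

raw     <- c("H\u00f4pital", "hopsital", "M\u00e9dical", "Feild", "clinic")
cleaned <- approx_clean_label(raw)

# look up each cleaned label; labels missing from the wordlist are kept as-is
hit     <- match(cleaned, wordlist$from)
recoded <- ifelse(is.na(hit), cleaned, wordlist$to[hit])
recoded
```

A wordlist entry such as "h\u00f4pital" would never match in this sketch, because the accent is removed before the lookup takes place.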
Thibaut Jombart, Zhian N. Kamvar
This function wraps three other functions:
clean_variable_names() - to handle variable names,
clean_variables() - to handle character/factor variables,
clean_dates() - to handle dates.
## make toy data
toy_data <- messy_data(20)

## show data
toy_data

## clean variable names, store in new object, show results
clean_data <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1)
clean_data

clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8)
clean_data2

## clean variable names, but keep our "messy/dates" column
to_protect <- names(toy_data) %in% "messy/dates"
clean_data3 <- clean_data(toy_data,
                          guess_dates = TRUE,
                          error_tolerance = 0.8,
                          protect = to_protect
                         )
clean_data3

## Using a wordlist -------------------------------

# location data with mis-spellings, French, and English.
messy_locations <- c("hopsital", "h\u00f4pital", "hospital",
                     "m\u00e9dical", "clinic",
                     "feild", "field")
toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE))

# show data
toy_data$location

# add a wordlist
wordlist <- data.frame(
  from = c("hopsital", "hopital", "medical", "feild"),
  to = c("hospital", "hospital", "clinic", "field"),
  variables = rep("location", 4),
  stringsAsFactors = FALSE
)

clean_data4 <- clean_data(toy_data,
                          wordlists = wordlist,
                          spelling_vars = "variables"
                         )
clean_data4
clean_data4$location