| clean_data | R Documentation |
This function applies several cleaning procedures to an input data.frame:
it standardises variable names and the labels used in categorical variables
(characters or factors), and converts dates to Date objects. Optionally, an
intelligent date search can be used on character strings to extract dates
from various formats mixed with other text. See details for more information.
clean_data( x, sep = "_", force_Date = TRUE, guess_dates = FALSE, error_tolerance = 0.5, wordlists = NULL, spelling_vars = 3, sort_by = NULL, warn_spelling = FALSE, protect = FALSE, ... )
x |
a |
sep |
The separator used between words; defaults to the underscore
|
force_Date |
a |
guess_dates |
a |
error_tolerance |
a number between 0 and 1 indicating the proportion of entries which cannot be identified as dates to be tolerated; if this proportion is exceeded, the original vector is returned and a message is issued; defaults to 0.5 (50 percent), as given in the usage |
wordlists |
a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the wordlists by column in |
spelling_vars |
character or integer. If |
sort_by |
a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted. |
warn_spelling |
if |
protect |
a logical or numeric vector defining the columns to protect
from any manipulation. Note: columns in |
... |
further arguments passed on to |
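The error_tolerance rule described above can be sketched in base R. This is only an illustration of the rule as documented, not the package's implementation: the helper name parse_dates_with_tolerance(), the single fixed date format, and the choice to count every entry (including missing values) in the proportion are all assumptions made for the example.

```r
# Hypothetical helper, not part of the package: illustrates the tolerance rule only.
parse_dates_with_tolerance <- function(x, error_tolerance, format = "%Y-%m-%d") {
  # as.Date() with an explicit format returns NA for entries it cannot parse
  parsed <- as.Date(x, format = format)

  # Proportion of entries not identified as dates (assumption: all entries count)
  prop_failed <- if (length(x) > 0) mean(is.na(parsed)) else 0

  if (prop_failed > error_tolerance) {
    message("Too many entries could not be identified as dates; ",
            "returning the original vector.")
    return(x)
  }
  parsed
}

x <- c("2018-01-01", "2018-02-15", "unknown", "2018-03-03")
parse_dates_with_tolerance(x, error_tolerance = 0.1)  # 25% failed: x is returned as-is
parse_dates_with_tolerance(x, error_tolerance = 0.5)  # within tolerance: a Date vector
```

With a strict tolerance the column is left untouched, which is safer when a column only occasionally contains date-like text; a looser tolerance converts the column and leaves unparseable entries as NA.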
A data.frame with standardised labels for characters and
factors.
When creating the wordlist for clean_variable_spelling(), it's important
to remember that the data will first be cleaned with
clean_variable_labels(), which removes any capitalisation and accents,
and replaces all punctuation and spaces with "_". Wordlist keys should
therefore be written in this cleaned form.
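To see what a wordlist key needs to look like after this cleaning step, the transformation can be approximated in base R. The sketch below is not the code used by clean_variable_labels(); the helper name approx_clean_label() is hypothetical, the accent table is deliberately tiny, and how runs of consecutive punctuation are handled is an assumption.

```r
# Hypothetical helper (not the package's implementation): approximates the
# label cleaning described above so wordlist keys can be written in cleaned form.
approx_clean_label <- function(x) {
  # Replace a few accented letters by hand; illustrative, not exhaustive
  accented <- c("\u00f4", "\u00e9", "\u00e8", "\u00e0")
  plain    <- c("o", "e", "e", "a")
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }
  x <- tolower(x)                            # remove capitalisation
  x <- gsub("[[:punct:][:space:]]", "_", x)  # punctuation and spaces become "_"
  x
}

approx_clean_label(c("H\u00f4pital", "M\u00e9dical centre", "Field-Site 2"))
```

A raw value such as "H\u00f4pital" would therefore need the key "hopital" in the wordlist, not the accented or capitalised spelling.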
Thibaut Jombart, Zhian N. Kamvar
This function wraps three other functions:
clean_variable_names() - to handle variable names,
clean_variables() - to handle character/factor variables,
clean_dates() - to handle dates.
## make toy data
toy_data <- messy_data(20)
## show data
toy_data
## clean the data, store in new object, show results
clean_data <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1)
clean_data
clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8)
clean_data2
## clean variable names, but keep our "messy/dates" column
to_protect <- names(toy_data) %in% "messy/dates"
clean_data3 <- clean_data(toy_data,
  guess_dates = TRUE,
  error_tolerance = 0.8,
  protect = to_protect
)
clean_data3
## Using a wordlist -------------------------------
# location data with mis-spellings, French, and English.
messy_locations <- c("hopsital", "h\u00f4pital", "hospital",
                     "m\u00e9dical", "clinic",
                     "feild", "field")
toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE))
# show data
toy_data$location
# add a wordlist
wordlist <- data.frame(
  from = c("hopsital", "hopital", "medical", "feild"),
  to = c("hospital", "hospital", "clinic", "field"),
  variables = rep("location", 4),
  stringsAsFactors = FALSE
)
clean_data4 <- clean_data(toy_data,
  wordlists = wordlist,
  spelling_vars = "variables"
)
clean_data4
clean_data4$location
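The effect of a from/to wordlist on already-cleaned values can be approximated in base R with match(). This is a sketch for intuition only and makes no claim about how the package performs the lookup; the helper recode_with_wordlist() and the object lookup are hypothetical names, and leaving unmatched values unchanged is an assumption.

```r
# Hypothetical helper, not the package's code: recode values through a
# from/to lookup table, leaving values without an entry unchanged.
recode_with_wordlist <- function(x, wordlist) {
  x <- as.character(x)
  idx <- match(x, wordlist$from)           # NA where the value has no entry
  ifelse(is.na(idx), x, wordlist$to[idx])
}

lookup <- data.frame(
  from = c("hopsital", "hopital", "medical", "feild"),
  to = c("hospital", "hospital", "clinic", "field"),
  stringsAsFactors = FALSE
)

recode_with_wordlist(c("hopsital", "hospital", "feild", "clinic"), lookup)
```

Because the keys in from are matched exactly, they need to be spelled the way the values look after label cleaning (lower case, no accents).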