Proverb:
Data cleaning code cannot be clean. It's a sort of sin eater.
These functions will never be able to make your munge code look beautiful. However, they should make it easier to understand when you have to come back to it!
library(brocks)
data(test_data)
Even with healthy commenting, much factor-munge code can be difficult to read, and transformations which look appropriate can be overridden by changes made further down a script. If something breaks and you need to know how a factor's been recoded, often there isn't a single place to look (or command to fail).
Things get more complicated the more levels a factor has, but even with just two levels the code can be cumbersome. Here's an example using a dichotomised classification of gender:
# Kill off blanks
test_data$gender[test_data$gender == ""] <- NA

# Code all males as 'male', and females as 'female'. For our purposes,
# everything else should be NA

# Tidy up Males. Luckily, these all begin with an 'm'!
test_data$gender[grepl("^m", test_data$gender, ignore.case = TRUE)] <- "male"

# Tidy up Females
# No-one has entered 'female', so convert gender to character to avoid factor
# level warning
test_data$gender <- as.character(test_data$gender)

# Munge levels
test_data$gender[test_data$gender %in% c("Female", "f", "F", "Woman")] <- "female"

# Make everything that isn't male or female missing
test_data$gender[!test_data$gender %in% c("male", "female")] <- NA
test_data$gender <- factor(test_data$gender)
The code above is perfectly functional, but headache-inducing in a number of ways. First, while regular expressions keep the code short, they can also produce spurious data if a new value appears in the data set (e.g. in the above, anything beginning with m, like "mixed", "multiple", or "mind your own business!", would be recoded as "male").
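To make the regex problem concrete, here is a minimal base R sketch. The `responses` vector is made up for illustration and is not taken from test_data.

```r
# Hypothetical survey responses (illustrative only, not from test_data)
responses <- c("male", "M", "Man", "mixed", "mind your own business!", "Female")

# The same pattern used above: anything starting with 'm' or 'M'
is_m <- grepl("^m", responses, ignore.case = TRUE)

# Every response flagged here would be overwritten with "male"
responses[is_m]
```

Only "Female" escapes the pattern, so "mixed" and "mind your own business!" would be recoded along with the genuine male entries.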
Second, if "male" drops out of the dataset (or the code is run on a subset), the code will raise a factor level warning.
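The warning arises whenever a value that is not an existing level is assigned into a factor. A small base R sketch, using an invented `gender` vector rather than test_data, shows it being captured:

```r
# A factor whose levels do not include "male" (illustrative values)
gender <- factor(c("f", "F", "Woman"))

# Attempt the assignment and capture the warning message instead of printing it
msg <- tryCatch(
  {
    gender[1] <- "male"
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
msg
```

Outside of `tryCatch`, the assignment would go ahead and leave an NA in place of the intended value, which is easy to miss in a long script.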
However, we can make things easier on our future selves by writing factor munge code that's offensively simple:
# Refactor Gender ---------------------------------------------------------
new_vals <- list(
  # FROM      TO
  c("",       NA      ),
  c("<NA>",   NA      ),
  c("F",      "female"),
  c("Female", "female"),
  c("m",      "male"  ),
  c("M",      "male"  ),
  c("Male",   "male"  ),
  c("Man",    "male"  ),
  c("Woman",  "female"),
  c("n/a",    NA      )
)
test_data$gender <- refactor(test_data$gender, new_vals)
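For readers curious how a FROM/TO list can drive a recode, the following is a rough base R sketch of the general lookup idea. It is not the brocks implementation: `recode_lookup` and `example_vals` are hypothetical names introduced here, and the real refactor function may behave differently, particularly for values missing from the list.

```r
# Hypothetical helper (not part of brocks): recode values via a FROM/TO list
recode_lookup <- function(x, new_vals) {
  # Split each c(FROM, TO) pair into two parallel character vectors
  from <- vapply(new_vals, function(p) as.character(p[1]), character(1))
  to   <- vapply(new_vals, function(p) as.character(p[2]), character(1))

  # Look up each value's position in FROM; values not listed give NA
  idx <- match(as.character(x), from)

  # Map positions to TO values and build a factor from the result
  factor(to[idx])
}

# Illustrative input, including a value ("mixed") that is absent from the list
example_vals <- list(
  c("",      NA),
  c("F",     "female"),
  c("m",     "male"),
  c("Woman", "female")
)
raw <- factor(c("F", "m", "Woman", "", "mixed"))
recode_lookup(raw, example_vals)
```

In this sketch, unlisted values such as "mixed" silently become NA. Each source value appears exactly once in the list, so there is a single place to check how it was recoded.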
This approach has several advantages:
throw_error = TRUE
)
The main disadvantage of this code is that it looks no fun to write. That's where the refactor_list function comes in, automatically generating the R code for arbitrary factor levels. All you have to do is copy and paste it into your script, and make your edits to the 'TO' column.
refactor_list(test_data$gender)
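As an illustration of the code-generation idea, here is a hedged base R sketch that builds a FROM/TO template as text. `recode_template` is a hypothetical helper written for this example; the output format of the real refactor_list may differ.

```r
# Hypothetical helper (not part of brocks): build FROM/TO template code
recode_template <- function(x) {
  # Distinct observed values, sorted; sort() drops NA
  vals <- sort(unique(as.character(x)))
  # Quote and escape each value so it is valid R source
  quoted <- encodeString(vals, quote = "\"")
  # One c(FROM, TO) row per value, with TO initially equal to FROM
  rows <- sprintf("  c(%s, %s)", quoted, quoted)
  paste0("new_vals <- list(\n", paste(rows, collapse = ",\n"), "\n)")
}

# Print the template so it can be pasted into a script and edited
template <- recode_template(c("F", "m", "Female", "F"))
cat(template, "\n")
```

The printed text is ordinary R code, so the only manual step left is editing the second element of each pair.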
You can even make refactor_list take a first pass at those edits for you, by setting the consolidate parameter to TRUE. This will run the 'TO' factor levels through the consolidate_values function, which uses a few simple rules to try and reduce variations (mainly case, spacing, and punctuation; see ?consolidate_values).
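The kind of rules described (case, spacing, punctuation) can be sketched in a few lines of base R. This is only an approximation under assumed rules: `simple_consolidate` is a hypothetical function, and the actual rules used by consolidate_values are documented in its help page.

```r
# Hypothetical helper (not part of brocks): normalise case, punctuation, spacing
simple_consolidate <- function(x) {
  x <- tolower(x)                  # ignore differences in case
  x <- gsub("[[:punct:]]", "", x)  # drop punctuation characters
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)                        # remove leading and trailing spaces
}

# Illustrative variants of the same answer
variants <- c("Female", " female ", "FEMALE.")
simple_consolidate(variants)
```

All three variants collapse to the same string here, which is the sort of first pass that saves editing the 'TO' column by hand.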
consolidate_values
scale_strip
refactor
refactor_list
library(brocks)
Does this work? r html_tri(1:-1)
Does it?
Does this work? r html_tri(1:-1, symbols = c("up" = "😊", "down" = "😞", "nochange" = "😐"))
Does it?
# This will output 'raw' HTML. To see the final result in an HTML markdown
# document, see the package vignette; vignette("brocks")
html_tri(1:-1)

# You could use other HTML symbols, even emojis if you like!
# These are HTML decimal codes (only unicode allowed in R packages), but
# you could use any valid characters (e.g. copy and paste)
html_tri(1:-1, symbols = c("up" = "😊", "down" = "😞", "nochange" = "😐"))