In brendan-r/brocks: The Personal R Functions of Brendan Rocks

Data Munging Functions

Proverb:

Data cleaning code cannot be clean. It's a sort of sin eater.

--- \@StatFact 25 Jul 2014

These functions will never be able to make your munge code look beautfiul. However, they should make it easier to understand when you have to come back to it!

library(brocks)
data(test_data)

Even with healthy commenting, much factor-munge code can be difficult to read, and transformations which look appropriate can be overridden by changes made further down a script. If something breaks and you need to know how a factor's been recoded, often there isn't a single place to look (or command to fail).

Things get more complicated the more levels a factor has, but even with just two levels, things can still be cumbersome. Here's an example using a dichotimised classification of gender:

Haphazard refactoring code

# Kill off blanks
test_data$gender[test_data$gender == ""] <- NA

# Code all males as 'males', and females as 'females'. For our purposes, 
# everything else should be NA

# Tidy up Males. Luckily, these all being with an 'm'!
test_data$gender[grepl("^m", test_data$gender, ignore.case = TRUE)] <- "male"

# Tidy up Females
# No-one has entered 'female', so convert gender to character to avoid factor 
# level warning
test_data$gender <- as.character(test_data$gender)

# Munge levels
test_data$gender[test_data$gender %in% c("Female", "f", "F", "Woman")] <- "female"

# Make everything that isn't male or female missing
test_data$gender[!test_data$gender %in% c("male", "female")] <- NA

test_data$gender <- factor(test_data$gender)

The code above is perfectly functional, but headache inducing in a number of ways. First, while regular expressions have been used to keep the code short, they can also produce spurious data if a new pattern is added to the data set (e.g. in the above, anything beginning with m, like "mixed", "multiple", or "mind your own business!").

Secondly if "male" drops out of the dataset (or the code is run on a subset), then the code will fail with a factor levels warning.

Clearer refactoring code

However, we can make things easier on our future selves by writing factor munge code that's offensively simple:

# Refactor Gender ---------------------------------------------------------

new_vals <- list(
   # FROM      TO
   c("",        NA     ),
   c("<NA>",    NA     ),
   c("F",      "female"),
   c("Female", "female"),
   c("m",      "male"  ),
   c("M",      "male"  ),
   c("Male",   "male"  ),
   c("Man",    "male"  ),
   c("Woman",  "female"),
   c("n/a",     NA     )
 )

test_data$gender <- refactor(test_data$gender, new_vals)

This approach has several advantages:

Visually distinctive and complete
you don't have to scour a file to work out where all the changes to a factor level (should) happen
It's so intellectually impoverished
That you can understand what's happening in a few seconds (not minutes, or worse)
Trivial to edit
(for almost anyone)
Can be set to fail if data's not as expected
(throw_error = TRUE)

The main disadvantage of this code is that it looks no fun to write. That's where the refactor_list function comes in, automatically generating the R code for arbitary factor levels. All you have to do is copy and paste it in to your script, and make a your edits to the 'TO' column.

refactor_list(test_data$gender)

You can even make refactor_list take a first pass at those edits for you, by setting the consolidate parameter to TRUE. This will run the 'TO' factor levels through the consolidate_values function, which uses a few simple rules to try and reduce variations (mainly cases, spacing, and punctuation, see ?consolidate_values).

consolidate_values

scale_strip

refactor

refactor_list

Number Formatting

library(brocks)

Does this work? r html_tri(1:-1) Does it?

Does this work? r html_tri(1:-1, symbols = c("up" = "😊", "down" = "😞", "nochange" = "😐")) Does it?

# This will output 'raw' HTML. To see the final result in an HTML markdown
# document, see the package vignette; vignette("brocks")

html_tri(1:-1)

# You could use other HTML symbols, even emojis if you like!
# These are HTML decimal codes (only unicode allowed in R packages), but
# you could use any valid characters (e.g. copy and paste)

html_tri(1:-1, symbols = c("up" = "&#128522;", "down" = "&#128542;",
  "nochange" = "&#128528;"))