require(DataComputing)
options(TRY_interactive = FALSE)

Data wrangling commands are often composed of a sequence of data very operations connected by pipes. For example, here's a sequence with just two steps, involving the verbs group_by() and summarise().

YearlyAverages <-
  BabyNames %>% 
  group_by(sex) %>%
  summarise(mean(count))

It's easy to make mistakes when composing such commands. As a teacher, I find that much time is spent trying to find mistakes in students' code. For students, the time they spend trying to fix mistakes on their own or waiting for me to get around to helping them is frustrating and, for students who can't find their own mistakes, shakes their confidence.

I want to encourage the beginning student to try to identify the origin of errors and fix things. Although I'm happy to help students, it's better for them not to have to rely on an expert. Instead, they should become experts themselves.

To illustrate a mistake that a student (or anyone else!) might make, consider this wrangling command intended to find the yearly average number of births in the US, broken down by sex. Following current fashion, the student wants to refer to sex with the word "gender."

YearlyAverages <- 
  BabyNames %>%
  rename(sex = gender) %>%
  group_by(gender) %>%
  summarise(mean(count))

The error message indicates that there's a problem with gender, which is very helpful. TRY(), part of the DataComputing package, attempts to be more helpful.

The TRY() function examines a data-verb step and checks for common errors. When it finds an error (or a construction that could be problematic), TRY() writes a diagnostic message and asks the user if he or she wants to move on. It also displays the input data table and the output of the step. (In this document, such interactivity is impossible. Just the diagnostic messages will be show.)

YearlyAverages <- 
  BabyNames %>%
  TRY( rename(sex = gender) ) %>%
  TRY( group_by(gender) )  %>%
  TRY( summarise(mean(count)) )

First, note that TRY() indicates that there was a problem with the rename() step. It also gives a better hint about what the problem is: "Old variable names go to the right of =." This might be enough to help the user fix things, so ...

YearlyAverages <- 
  BabyNames %>%
  TRY( rename(gender = sex) ) %>%
  TRY( group_by(gender) )  %>%
  TRY( summarise(mean(count)) )

So long as TRY() doesn't encounter an error, the result of the command is passed to the next step (if any) in the chain of commands. Here, both the rename() and group_by() steps worked find. The problem identified by TRY() in the substitute() step isn't fatal, so the chain proceeds to a conclusion. Nevertheless, TRY() points out some possible problems.

YearlyAverages <- 
  BabyNames %>%
  TRY( rename(gender = sex) ) %>%
  TRY( group_by(gender) )  %>%
  TRY( summarise(ave = mean(count, na.rm=TRUE)) )

Examining a step

You can suspend the processing of the command chain by adding a waypoint with the WAYPOINT() function. Once suspended, the input to the previous step, and the output from that step are displayed in the TRY INPUT and TRY OUTPUT tabs of the editor. This gives you a chance to explore what the step did (and whether it did what you were expecting). You can have more than one waypoint in a command, so it's best to give each of them a name so you know where you are at.

YearlyAverages <- 
  BabyNames %>%
  TRY( rename(gender = sex) ) %>% WAYPOINT(first) %>%
  TRY( group_by(gender) )  %>% WAYPOINT(second) %>%
  TRY( summarise(ave = mean(count, na.rm=TRUE)) )

You need to try the above commands interactively to see how this works.

Once you are satisfied with a step, you can delete the following waypoint, or comment it out to deactivate it, like this:

TRY( rename(gender = sex) ) %>% ## WAYPOINT(first) %>%

Multiple steps in a single TRY( )

Putting multiple steps into a single TRY() will stop at the first step that displays problems or causes an error. That might be helpful in screening a long command for problems, but I don't think it's better than arranging one TRY() for each step.

Overall <- 
  KidsFeet %>% 
  TRY(
    group_by(sexx) %>% 
      tally() %>% 
      arrange(nn)
    )

TRY( ) with ggplot

TRY() has a few diagnostics for ggplot() and various geoms. For instance, TRY() checks whether aesthetics inside aes() are mapped to variables and whether aestheetics outside aes() are set to constants. It also checks the names of aesthetics and whether variables used for mapping are contained in the data table used by a layer.

Since ggplot() uses + to connect the various commands constructing layers, facets, themes, etc., you need to pack all of the ggplot-related commands into a single TRY(). For example:

KidsFeet %>% 
  TRY(
    ggplot(aes(x=width, y=length, shape=sex)) + 
      geom_point(color= 'red') + 
      geom_path(aes(fill = sex))
    )

You cannot use WAYPOINT() inside of TRY().

If you want to break up the ggplot commands into separate TRY() statements so that you can set a waypoint, you will have to connect the TRY() with the %>% pipe, not +.

KidsFeet %>% 
  TRY( ggplot(aes(x=width, y=length, shape=sex)) ) %>%
  TRY( geom_point(color= 'red') ) %>% WAYPOINT(dots) %>%
  TRY( geom_path(aes(fill = sex)) )

Regrettably, when removing the TRY() after the statement has been debugged, you will need to pay attention to changing the %>% into +. This is prone to mistakes and such mistakes lead to an cryptic error message from ggplot().

Working with Rmd files

It's a good practice to develop your commands within an Rmd file. When you do this, you will want TRY() and WAYPOINT() to be interactive when you execute the commands from the Rmd to the console, but you want them non-interactive when compiling the Rmd.

To accomplish this, set the TRY_interactive option to FALSE in a chunk of the Rmd document:

options(TRY_interactive = FALSE)

Don't run that chunk in the console.

[In draft: test whether knitr is running to disable interactivity?]

Extending TRY( )

The diagnostics in TRY() are heuristics aimed to identify specific, common mistakes. It's possible to add new heuristics tuned for a specific dplyr or ggplot2 function (or potentially, functions from other packages). If you identify a common mistake that TRY() doesn't handle, send a note describing it to kaplan@macalester.edu, or, if you know how, add it into TRY() yourself and send a pull request.



DataComputing/DataComputing documentation built on May 6, 2019, 1:39 p.m.