knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

Purpose and Overview

Most TNTP analysts use the package dplyr to clean data and perform basic analyses. And for good reason, it's a fantastic set of functions that makes this work substantially easier. However, to determine the extent to which the dplyr functions accomplished one's intended goals, analysts often have to write additional code to check their work, or eyeball their manipulated data to ensure correctness. For example, to see how many rows were lost after filtering data, one would have to remember the number of original rows in the data and then manually subtract the rows in the newly filtered data; to see how many rows were dropped for each group would take additional code. To see how many rows actually joined when performing a merge, or to see if the newly joined data contains more rows than expected because of duplicated matches, an analyst would have to write code before and after implementing a dplyr join function, like left_join or full_join.

The reviewr package makes performing these checks and others easy to implement and read. By simply adding the suffix _qc to the end of several core dplyr functions, you will get the exact same result as you would had you used the normal dplyr functions, but additional information about what just happened will automatically print. For example, using filter_qc instead of filter will filter the data exactly as filter will, but will also report how many (and what percent) of rows were dropped. This makes using reviewr easy and allows you to quickly toggle back and forth between regular dplyr functions and their reviewr counterparts. No wrappers around your code or any other set-up; just add _qc.

reviewr, therefore, allows you to get under the hood of data cleaning with dplyr. It does not attempt or intend to tell if you dplyr "worked", but rather whether how you used the dplyr function(s) matched what you intended. (Does it make sense that you dropped 20 rows when performing the filter? Did you expect the join to not be a perfect 1-1 join? etc.)

Why use reviewr?

Some of the checks performed by reviewr are simple and are easy to do without the package. Determining how many rows were dropped after a filter is often just a matter of subtraction. Although reviewr saves you the step, it has the additional value of being more easily used in a series of piped functions. For example, the following chunk of code performs filters at two steps. To determine how many rows were dropped at each step would require the analyst to break apart the code and test both. But with reviewr, you can keep the pipe as is, and simply add _qc.

library(reviewr)
library(dplyr)
library(magrittr)
normal_output <- mtcars %>% 
  filter(mpg < 25) %>% 
  group_by(vs) %>% 
  filter(hp > mean(hp))
reviewr_output <- mtcars %>% 
  filter_qc(mpg < 25) %>% 
  group_by(vs) %>% 
  filter_qc(hp > mean(hp))

The two outputs are identical; reviewr simply prints more information.

all.equal(normal_output, reviewr_output)

Because reviewr was built on the premise of toggling back and forth between dplyr and reviewr functions when needed, it is meant to particularly assist those performing Quality Control checks, as the checker can simply add _qc at the end of the existing code to dive more deeply into what the code did.

Working with grouped data

If you simply add _qc to the end of dplyr code, reviewr could be helpful. But it also includes an option to learn more about what happened to each group in your data if you have a grouped structure. For example, how many rows in each group were dropped when filtering? Which groups, if any, had NA values created when doing a group_by() and summarize() combination, and on which variables? Many of the reviewr functions have an option to print this information:

reviewr_output <- mtcars %>% 
  filter_qc(mpg < 25) %>% 
  group_by(vs) %>% 
  filter_qc(hp > mean(hp), .group_check = T)

Again, the two outputs are identical; reviewr simply prints more information.

all.equal(normal_output, reviewr_output)

reviewr Functions

There are currently four sets of dplyr functions to which you can add _qc suffix and get a reviewr equivalent

Although, some reviewr functions have additional options, all reviewr functions will work by simply using them as you would dplyr. This means you can start using them now: if you know how to use their dplyr versions, the reviewr functions will be implemented identically.

Below are examples for each set of reviewr functions and a brief discussion of additional options you can use.

Filtering

The set of reviewr filtering functions report the number of rows that were dropped after performing the filter.

filter_qc(mtcars, gear == 4, mpg > 30)

If the data is grouped, you can set the option .group_check = T to see how many rows were dropped per group. Note that this could print a big table if you have a lot of groups. The default is .group_check = F

result <- mtcars %>%
  group_by(vs) %>%
  filter_qc(gear == 4, mpg > 30, .group_check = T)

reviewr has _qc versions of the scoped filter functions, too.

filter_at_qc(mtcars, vars(vs, am), all_vars(. == 1))

All of the scoped filter functions have an option for group_check

Mutating

The set of reviewr mutate functions report the number of NA or INF values created in the newly mutated variable(s).

output <- mtcars %>% 
  mutate_qc(
    mpg_c21 = mpg - 21,
    inv_mpg_c21 = 1 / mpg_c21
  ) %>%
  select(mpg, mpg_c21, inv_mpg_c21)

head(output)

With grouped data and with .group_check = T, you can find out the number of missing rows per group and on which variables. With this option, a table will be printed with a columns called "var_name" and "n_missing", indicating how many missing values there are for each newly mutated variable for each group. Note that this could print a big table if you have a lot of groups.

result_normal <- mtcars %>% 
  group_by(vs) %>%
  mutate(
    mpg_c21 = mpg - 21,
    inv_mpg_c21 = 1 / mpg_c21,
    inv_am = 1 / am
  )

result_withOption <- mtcars %>% 
  group_by(vs) %>%
  mutate_qc(
    mpg_c21 = mpg - 21,
    inv_mpg_c21 = 1 / mpg_c21,
    inv_am = 1 / am,
    .group_check = T
  )

Whether or not group_check = T, the returned mutated data is identical. .group_check = T simply prints more information. The default is .group_check = F

all.equal(result_normal, result_withOption)

reviewr has _qc versions of the scoped mutate functions, too.

output <- mutate_all_qc(mtcars, funs(inv = 1 / .))

All of the scoped mutate functions have an option for .group_check. reviewr has _qc versions of transmute() and all of its scoped variants, as well.

Summarizing

The set of reviewr summarize functions report the number of NA or INF values created in the newly summarized variable(s). Though available, this feature is typically not useful when using summarize() to create small summary tables - it will be pretty obvious in these cases when something is NA - but could be helpful when using summarize() on grouped data to create new variables you'll later use.

mtcars %>% 
  group_by(vs) %>%
  summarize_qc(mean_inv_mpg_c21 = mean(1 / (mpg - 21)))

In this case, the .group_check = T option prints a table with each group with at least one missing value, and a single variable listing the newly summarized variables for the group that are missing. (Because summarize() by design only creates one row per group, there is no need to indicate the number of missing values, just flag which variables are missing.) Note that only groups with at least one missing value are printed.

result <- mtcars %>% 
  group_by(vs) %>%
  summarize_qc(mean_inv_mpg_c21 = mean(1 / (mpg - 21)), .group_check = T)

Whether or not group_check = T, the returned summarized data is identical. .group_check = T simply prints more information. The default is .group_check = F.

reviewr has _qc versions of the scoped summarize functions, too.

result <- mtcars %>%
  group_by(am) %>%
  summarize_at_qc(vars(mpg, hp, vs), funs(inv = mean(1 / .)))

All of the scoped summarize functions have an option for .group_check.

reviewr has _qc versions of summarize() and summarise() and all of their scoped variants. But this is America, so use a "z"!

Joins

The set of reviewr join functions report several things:

As with all the reviewr functions, simply add a _qc to the end of a dplyr join function:

data_A <- 
  data.frame(
    id = 1:10, 
    var_A = 11:20
)
data_B <- 
  data.frame(
    id = c(5, 5, 5, 5, 6, 7, 7, 9, 10, 11),
    id_A = c(1:10),
    var_B = 21:30
  )

joined_data_normal <- left_join(data_A, data_B, by = "id")
joined_data_reviewr <- left_join_qc(data_A, data_B, by = "id")

If you do not use any of the additional options in the reviewr join functions (discussed below), the returned object is identical to dplyr.

all.equal(joined_data_normal, joined_data_reviewr)

reviewr automatically detects the type of join:

joined_data <- left_join_qc(data_A, data_B, by = c("id" = "id_A"))

There are two options available. .merge and .extra. Set .merge equal to a name you want to use for a new variable that will indicate, for each row, whether it was a match, from the left only, or from the right only.

joined_data <- full_join_qc(data_A, data_B, .merge = "name_me")

print.data.frame(joined_data)

Set .extra equal to a name you want to use for a new variable that will indicate, for each row, whether it was a a combination of IDs that was duplicated from the left data, from the right data, or from both. These only occur in one to many, many to one, or many to many joins. In the example below, rows with the id of 5 from the left data was duplicated multiple times because 5 appeared on multiple rows in the right data. .extra flags these rows as "extra_on_left". The same happened with rows that had an id of 7.

joined_data <- full_join_qc(data_A, data_B, .extra = "name_me_too")

print.data.frame(joined_data)

The defaults for both .merge and .extra are NULL. If either option is used, note that the returned object is no longer identical to dplyr because we've created new variables, though everything else beyond that is the same.

There are no group options for the join functions. As in dplyr, groups are ignored for the purposes of joining, but the groups from the first data set are preserved.

What else?

I'd love to know what other types of quick QCs you find yourself doing repeatedly when manipulating data, especially those anchored to the dplyr functions. I expect and hope to widen this package going forward.



adamMaier/reviewr documentation built on Nov. 5, 2023, 7:21 a.m.