knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Most TNTP analysts use the package dplyr
to clean data and perform basic analyses. And for good reason, it's a fantastic set of functions that makes this work substantially easier. However, to determine the extent to which the dplyr
functions accomplished one's intended goals, analysts often have to write additional code to check their work, or eyeball their manipulated data to ensure correctness. For example, to see how many rows were lost after filtering data, one would have to remember the number of original rows in the data and then manually subtract the rows in the newly filtered data; to see how many rows were dropped for each group would take additional code. To see how many rows actually joined when performing a merge, or to see if the newly joined data contains more rows than expected because of duplicated matches, an analyst would have to write code before and after implementing a dplyr
join function, like left_join
or full_join
.
The reviewr
package makes performing these checks and others easy to implement and read. By simply adding the suffix _qc
to the end of several core dplyr
functions, you will get the exact same result as you would had you used the normal dplyr
functions, but additional information about what just happened will automatically print. For example, using filter_qc
instead of filter
will filter the data exactly as filter
will, but will also report how many (and what percent) of rows were dropped. This makes using reviewr
easy and allows you to quickly toggle back and forth between regular dplyr
functions and their reviewr
counterparts. No wrappers around your code or any other set-up; just add _qc
.
reviewr
, therefore, allows you to get under the hood of data cleaning with dplyr
. It does not attempt or intend to tell if you dplyr
"worked", but rather whether how you used the dplyr
function(s) matched what you intended. (Does it make sense that you dropped 20 rows when performing the filter? Did you expect the join to not be a perfect 1-1 join? etc.)
Some of the checks performed by reviewr
are simple and are easy to do without the package. Determining how many rows were dropped after a filter is often just a matter of subtraction. Although reviewr saves you the step, it has the additional value of being more easily used in a series of piped functions. For example, the following chunk of code performs filters at two steps. To determine how many rows were dropped at each step would require the analyst to break apart the code and test both. But with reviewr
, you can keep the pipe as is, and simply add _qc
.
library(reviewr) library(dplyr) library(magrittr)
normal_output <- mtcars %>% filter(mpg < 25) %>% group_by(vs) %>% filter(hp > mean(hp))
reviewr_output <- mtcars %>% filter_qc(mpg < 25) %>% group_by(vs) %>% filter_qc(hp > mean(hp))
The two outputs are identical; reviewr
simply prints more information.
all.equal(normal_output, reviewr_output)
Because reviewr
was built on the premise of toggling back and forth between dplyr
and reviewr
functions when needed, it is meant to particularly assist those performing Quality Control checks, as the checker can simply add _qc
at the end of the existing code to dive more deeply into what the code did.
If you simply add _qc
to the end of dplyr
code, reviewr
could be helpful. But it also includes an option to learn more about what happened to each group in your data if you have a grouped structure. For example, how many rows in each group were dropped when filtering? Which groups, if any, had NA values created when doing a group_by()
and summarize()
combination, and on which variables? Many of the reviewr
functions have an option to print this information:
reviewr_output <- mtcars %>% filter_qc(mpg < 25) %>% group_by(vs) %>% filter_qc(hp > mean(hp), .group_check = T)
Again, the two outputs are identical; reviewr
simply prints more information.
all.equal(normal_output, reviewr_output)
There are currently four sets of dplyr
functions to which you can add _qc
suffix and get a reviewr
equivalent
filter_qc
, filter_all_qc
, filter_at_qc
, and filter_if_qc
mutate_qc
, mutate_all_qc
, mutate_at_qc
, and mutate_if_qc
. And transmute_qc
, transmute_all_qc
, transmute_at_qc
, and transmute_if_qc
summarize_qc
, summarize_all_qc
, summarize_at_qc
, and summarize_if_qc
full_join_qc
, inner_join_qc
, left_join_qc
, right_join_qc
, semi_join_qc
, and anti_join_qc
Although, some reviewr
functions have additional options, all reviewr
functions will work by simply using them as you would dplyr
. This means you can start using them now: if you know how to use their dplyr
versions, the reviewr
functions will be implemented identically.
Below are examples for each set of reviewr
functions and a brief discussion of additional options you can use.
The set of reviewr
filtering functions report the number of rows that were dropped after performing the filter.
filter_qc(mtcars, gear == 4, mpg > 30)
If the data is grouped, you can set the option .group_check = T
to see how many rows were dropped per group. Note that this could print a big table if you have a lot of groups. The default is .group_check = F
result <- mtcars %>% group_by(vs) %>% filter_qc(gear == 4, mpg > 30, .group_check = T)
reviewr
has _qc
versions of the scoped filter functions, too.
filter_at_qc(mtcars, vars(vs, am), all_vars(. == 1))
All of the scoped filter functions have an option for group_check
The set of reviewr
mutate functions report the number of NA or INF values created in the newly mutated variable(s).
output <- mtcars %>% mutate_qc( mpg_c21 = mpg - 21, inv_mpg_c21 = 1 / mpg_c21 ) %>% select(mpg, mpg_c21, inv_mpg_c21) head(output)
With grouped data and with .group_check = T
, you can find out the number of missing rows per group and on which variables. With this option, a table will be printed with a columns called "var_name" and "n_missing", indicating how many missing values there are for each newly mutated variable for each group. Note that this could print a big table if you have a lot of groups.
result_normal <- mtcars %>% group_by(vs) %>% mutate( mpg_c21 = mpg - 21, inv_mpg_c21 = 1 / mpg_c21, inv_am = 1 / am ) result_withOption <- mtcars %>% group_by(vs) %>% mutate_qc( mpg_c21 = mpg - 21, inv_mpg_c21 = 1 / mpg_c21, inv_am = 1 / am, .group_check = T )
Whether or not group_check = T
, the returned mutated data is identical. .group_check = T
simply prints more information. The default is .group_check = F
all.equal(result_normal, result_withOption)
reviewr
has _qc
versions of the scoped mutate functions, too.
output <- mutate_all_qc(mtcars, funs(inv = 1 / .))
All of the scoped mutate functions have an option for .group_check
. reviewr
has _qc
versions of transmute()
and all of its scoped variants, as well.
The set of reviewr
summarize functions report the number of NA or INF values created in the newly summarized variable(s). Though available, this feature is typically not useful when using summarize()
to create small summary tables - it will be pretty obvious in these cases when something is NA
- but could be helpful when using summarize()
on grouped data to create new variables you'll later use.
mtcars %>% group_by(vs) %>% summarize_qc(mean_inv_mpg_c21 = mean(1 / (mpg - 21)))
In this case, the .group_check = T
option prints a table with each group with at least one missing value, and a single variable listing the newly summarized variables for the group that are missing. (Because summarize()
by design only creates one row per group, there is no need to indicate the number of missing values, just flag which variables are missing.) Note that only groups with at least one missing value are printed.
result <- mtcars %>% group_by(vs) %>% summarize_qc(mean_inv_mpg_c21 = mean(1 / (mpg - 21)), .group_check = T)
Whether or not group_check = T
, the returned summarized data is identical. .group_check = T
simply prints more information. The default is .group_check = F
.
reviewr
has _qc
versions of the scoped summarize functions, too.
result <- mtcars %>% group_by(am) %>% summarize_at_qc(vars(mpg, hp, vs), funs(inv = mean(1 / .)))
All of the scoped summarize functions have an option for .group_check
.
reviewr
has _qc
versions of summarize()
and summarise()
and all of their scoped variants. But this is America, so use a "z"!
The set of reviewr
join functions report several things:
Type of Join Joins are classified as one to one, one to many, many to one, or many to many. These definitions describe the extent to which there are duplicated rows of unique combinations of the by
variable(s).
Match Distribution in Joined Data The number of percent of rows in the newly joined data that were from the left only, right only, or matched.
As with all the reviewr
functions, simply add a _qc
to the end of a dplyr
join function:
data_A <- data.frame( id = 1:10, var_A = 11:20 ) data_B <- data.frame( id = c(5, 5, 5, 5, 6, 7, 7, 9, 10, 11), id_A = c(1:10), var_B = 21:30 ) joined_data_normal <- left_join(data_A, data_B, by = "id") joined_data_reviewr <- left_join_qc(data_A, data_B, by = "id")
If you do not use any of the additional options in the reviewr
join functions (discussed below), the returned object is identical to dplyr
.
all.equal(joined_data_normal, joined_data_reviewr)
reviewr
automatically detects the type of join:
joined_data <- left_join_qc(data_A, data_B, by = c("id" = "id_A"))
There are two options available. .merge
and .extra
. Set .merge
equal to a name you want to use for a new variable that will indicate, for each row, whether it was a match, from the left only, or from the right only.
joined_data <- full_join_qc(data_A, data_B, .merge = "name_me") print.data.frame(joined_data)
Set .extra
equal to a name you want to use for a new variable that will indicate, for each row, whether it was a a combination of IDs that was duplicated from the left data, from the right data, or from both. These only occur in one to many, many to one, or many to many joins. In the example below, rows with the id
of 5 from the left data was duplicated multiple times because 5 appeared on multiple rows in the right data. .extra
flags these rows as "extra_on_left". The same happened with rows that had an id
of 7.
joined_data <- full_join_qc(data_A, data_B, .extra = "name_me_too") print.data.frame(joined_data)
The defaults for both .merge
and .extra
are NULL
. If either option is used, note that the returned object is no longer identical to dplyr
because we've created new variables, though everything else beyond that is the same.
There are no group options for the join
functions. As in dplyr
, groups are ignored for the purposes of joining, but the groups from the first data set are preserved.
I'd love to know what other types of quick QCs you find yourself doing repeatedly when manipulating data, especially those anchored to the dplyr
functions. I expect and hope to widen this package going forward.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.