Example Data Checks with **inspectr**

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(inspectr)
data("dataset")

Package format

The inspectr package is designed to perform basic data checks without the user needing to understand the intricacies of apply functions.

The functions can be grouped into two categories: column checks and basic fidelity checks.

Column check functions

These are the core functions used to run checks: col_check, two_col_check, and three_col_check. Each one checks a column for data fidelity, either on its own or against one or two additional columns. A data frame and a column name (or names) go in; a filtered set of records exhibiting issues comes out, either as a data frame or as an .xlsx document (your choice!).

Basic fidelity check functions

These functions are designed to be used with the column check functions. They perform basic checks on the data, like ensuring that all data in a column are of the same type or ensuring that all values in column 1 are less than their corresponding values in column 2.
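The division of labor between the two categories can be sketched in base R. This is not inspectr's source code: `sketch_col_check` and `not_missing` are hypothetical names, and the toy data frame is invented for illustration. The sketch assumes that a fidelity check returns a logical vector (TRUE where a value passes) and that the column check keeps only the rows that fail.

```r
# Hypothetical fidelity check: TRUE where a value is present.
not_missing <- function(x) {
  !is.na(x)
}

# Hypothetical column-check wrapper (not inspectr's code): apply a check
# to one column and keep only the rows that fail it.
sketch_col_check <- function(colname, data, fun, ...) {
  passed <- fun(data[[colname]], ...)
  data[!passed, , drop = FALSE]
}

# Invented toy data; the second row has a missing score.
toy <- data.frame(id = c(1, 2, 3), score = c(10, NA, 30))

failing <- sketch_col_check("score", toy, not_missing)
print(failing)
```

Keeping the check separate from the wrapper is what lets the same wrapper accept any check, including ones you write yourself.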

Checking an example dataset

To illustrate how to use these functions, let's apply some basic checks to a sample dataset:

knitr::kable(dataset, format = "markdown", align = "l", row.names = FALSE)

Single-column checks

The col_check function is designed to check a single column of data for fidelity to a given check. Several of the basic fidelity check functions are appropriate for the single-column check: numeric_check, character_check, character_blanks_check, date_check, and val_check.

Numeric checks

The numeric_check function checks that every value in the column can be coerced to a numeric value by as.numeric. In the example dataset, the goal is to ensure that all of the IDs in the ID_var column are numeric.

When checking the example dataset with this function, the results show that there are two records that have non-numeric characters in their ID variables. By setting output = FALSE in the arguments, the function returns a dataframe containing only the records with errors.

col_check("ID_var", dataset, numeric_check, output = FALSE)
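To see what "can be coerced by as.numeric" means in practice, here is a minimal base R sketch. The `ids` vector is invented, and the logic is an assumption about the general idea rather than the package's actual implementation. In this sketch a missing value also counts as a failure; whether numeric_check treats NA the same way is not stated here.

```r
# Invented IDs: two contain letters, one is missing.
ids <- c("101", "102", "10B", "1O4", NA)

# as.numeric() returns NA (with a warning) for strings it cannot parse,
# so a non-NA result means the value is numeric-like.
coerced <- suppressWarnings(as.numeric(ids))
is_numeric <- !is.na(coerced)

# Values that would be flagged under this sketch.
flagged <- ids[!is_numeric]
print(flagged)
```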

Character checks

The character_check and character_blanks_check functions ensure that all of the values in the column can be coerced into character strings by as.character. While character_check does not tolerate blank values, character_blanks_check allows blanks as acceptable values for the purposes of the check. This difference is illustrated by the different results each check yields from the sample dataset:

col_check("FName", dataset, character_check)

col_check("FName", dataset, character_blanks_check)

As you can see, neither of these checks tolerates NA values.
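The difference between the strict and the blank-tolerant check can be sketched in base R as follows. The `fnames` vector is invented, and treating "blank" as the empty string `""` is an assumption made for this sketch; the package may define blanks differently.

```r
# Invented first names: one blank, one missing.
fnames <- c("Ada", "", NA, "Grace")

# Strict sketch: a value passes only if it is present and non-empty.
strict_ok <- !is.na(fnames) & nzchar(fnames)

# Blank-tolerant sketch: empty strings pass, missing values still fail.
lenient_ok <- !is.na(fnames)

print(strict_ok)
print(lenient_ok)
```

Only the blank entry changes status between the two vectors; the NA entry fails both.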

Value check

The val_check function allows the user to input their own values to set the parameters of the check. The user supplies a vector of accepted values to the values argument, and the check ensures that all values in the column are within that set of accepted values. Blank values and NA values are not tolerated by default, though they can be included in the vector of accepted values.

##Need to resolve three dots issues for date_check
col_check("Perf_Lvl", dataset, val_check, values = c("Basic", "Intermediate", "Advanced"))

col_check("Var1", dataset, val_check, values = c(1:25))
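Set membership of this kind is what base R's `%in%` operator does, so the behavior described above can be sketched without the package. The `levels_seen` vector is invented, and this is an illustration of the idea, not val_check itself.

```r
# Invented column values, including a blank, a missing value, and a typo-like level.
levels_seen <- c("Basic", "Advanced", "", NA, "Expert")
accepted <- c("Basic", "Intermediate", "Advanced")

# Blank and NA are not in the accepted set, so they fail.
in_set <- levels_seen %in% accepted

# Adding "" and NA to the accepted set makes them pass.
in_set_lenient <- levels_seen %in% c(accepted, "", NA)

print(in_set)
print(in_set_lenient)
```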

Date check

The date_check function allows the user to input a beginning and end date to set the parameters of the check. The check ensures that all values in the column are equal to or between the specified beginning and end dates and returns all values that do not fall within the given range.

col_check("dates", dataset, date_check, begin = "06/02/1982", end = "11/11/1986")
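An inclusive date-range test can be sketched in base R with `as.Date`. The month/day/year format string below is an assumption (a string like "06/02/1982" is ambiguous on its own), and the `dates` vector is invented; this shows the range logic only, not how date_check parses its input.

```r
# Assumed format: month/day/four-digit year.
fmt <- "%m/%d/%Y"

# Invented dates: the last two fall outside the range.
dates <- c("06/02/1982", "01/15/1984", "12/25/1986", "05/30/1982")
parsed <- as.Date(dates, format = fmt)

begin <- as.Date("06/02/1982", format = fmt)
end <- as.Date("11/11/1986", format = fmt)

# Inclusive on both ends; unparseable dates (NA) also fail.
in_range <- !is.na(parsed) & parsed >= begin & parsed <= end

out_of_range <- dates[!in_range]
print(out_of_range)
```

Note that the first date equals the begin date and still passes, matching the "equal to or between" rule described above.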

Two-column checks

The two_col_check function is designed to check one column of data against values in another column. Several of the basic fidelity check functions are appropriate for the two-column check: less_than, less_than_equalto, greater_than, and greater_than_equalto.

two_col_check("Var1", "Var2", dataset, less_than)

two_col_check("Var1", "Var2", dataset, less_than_equalto)

The greater_than and greater_than_equalto functions work similarly. Notice that for these checks, the order of the input columns is reversed; Var2 is the column being checked for fidelity, and Var1 is the reference column.

two_col_check("Var2", "Var1", dataset, greater_than)

two_col_check("Var2", "Var1", dataset, greater_than_equalto)
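The relationship between the four comparisons can be sketched with plain vectorized operators. The toy data frame is invented and the code does not call inspectr; it only illustrates why swapping the column order and switching from "less than" to "greater than" flags the same rows.

```r
# Invented data: row 2 is a tie, row 3 has Var1 larger than Var2.
toy <- data.frame(Var1 = c(1, 5, 7, 3), Var2 = c(2, 5, 4, 9))

lt <- toy$Var1 < toy$Var2    # strict: the tie in row 2 fails
lte <- toy$Var1 <= toy$Var2  # non-strict: the tie in row 2 passes

# Reversing the roles of the columns gives the same pass/fail pattern.
gt_reversed <- toy$Var2 > toy$Var1

# Rows that fail each comparison.
print(toy[!lt, ])
print(toy[!lte, ])
```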

Three-column checks

As of version `r packageVersion("inspectr")`, inspectr does not include any basic fidelity check functions that are designed to work with three_col_check. However, you are encouraged to write your own and plug them in! The example below shows a function written to check the Perf_Lvl column against Var1 and Var2 as reference columns. To pass the check, the value of Perf_Lvl must be "Basic", "Intermediate", or "Advanced"; alternatively, if Perf_Lvl is NA, then Var2 must be even and Var1 must be odd.

This is sort of a silly check, but it illustrates the way a user-defined function can be used with three_col_check. Of course, you can also use user-defined functions with col_check and two_col_check.

three_col_check(colname1 = "Perf_Lvl", colname2 = "Var1", colname3 = "Var2",
                data = dataset, fun = function(col1, col2, col3) {
                  # Pass if Perf_Lvl is an accepted level, or if it is NA
                  # while Var2 (col3) is even and Var1 (col2) is odd.
                  col1 %in% c("Basic", "Intermediate", "Advanced") |
                    (is.na(col1) & (col3 %% 2 == 0) & (col2 %% 2 == 1))
                })
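Before plugging a user-defined check into a column-check function, it can help to give it a name and try it on a few hand-built vectors. The sketch below does that for the rule above; `perf_check` is a hypothetical name and the three input vectors are invented.

```r
# The same rule as above, written as a named function so it can be tested alone.
perf_check <- function(col1, col2, col3) {
  col1 %in% c("Basic", "Intermediate", "Advanced") |
    (is.na(col1) & (col3 %% 2 == 0) & (col2 %% 2 == 1))
}

# Invented inputs covering each branch of the rule.
perf <- c("Basic", NA, NA, "Expert")
var1 <- c(2, 3, 4, 1)
var2 <- c(5, 8, 8, 2)

# Row 1 passes on its level; row 2 passes via the NA branch (Var2 even, Var1 odd);
# row 3 fails because Var1 is even; row 4 fails because "Expert" is not accepted.
result <- perf_check(perf, var1, var2)
print(result)
```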



inspectr documentation built on May 2, 2019, 5:15 a.m.