knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(inspectr)
data("dataset")
The inspectr package is designed to perform basic data checks without the user needing to understand the intricacies of apply functions.
The functions can be grouped into two categories: column checks and basic fidelity checks.
The column check functions (col_check, two_col_check, and three_col_check) are the core functions used to perform checks. Each one checks a single column for data fidelity, optionally against one or two additional reference columns. A data frame and a column name (or names) go in; a filtered set of records exhibiting issues comes out, either as a data frame or as an .xlsx document, your choice.
The basic fidelity check functions are designed to be passed to the column check functions. They perform basic checks on the data, such as ensuring that all values in a column are of the same type, or that every value in column 1 is less than the corresponding value in column 2.
To illustrate how to use these functions, let's apply some basic checks to a sample dataset:
knitr::kable(dataset, format = "markdown", align = "l", row.names = FALSE)
The col_check function is designed to check a single column of data for fidelity to a given check. Several of the basic functions are appropriate for the single column check: numeric_check, character_check, character_blanks_check, date_check, and val_check.
The numeric_check function checks that all of the values in the column can be coerced into numeric values by as.numeric. For example, in the example dataset, the goal is to ensure that all of the IDs in the ID_var column are numeric.
When checking the example dataset with this function, the results show that two records have non-numeric characters in their ID variables. By setting output = FALSE in the arguments, the function returns a data frame containing only the records with errors.
col_check("ID_var", dataset, numeric_check, output = FALSE)
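The idea behind a coercion-based numeric check can be sketched in base R. This is an illustration only, not inspectr's internal code: the helper name is_numeric_like and the vector ids are made up for the example.

```r
# Hypothetical helper (not part of inspectr): TRUE where a value
# survives coercion by as.numeric().
is_numeric_like <- function(x) {
  # as.numeric() yields NA (with a warning) for strings it cannot parse
  !is.na(suppressWarnings(as.numeric(as.character(x))))
}

ids <- c("1001", "1002", "10O3", "ID1004")  # made-up IDs; two are malformed
bad_ids <- ids[!is_numeric_like(ids)]
print(bad_ids)
```

Note that a coercion test of this kind also flags values that are already NA, since they cannot be told apart from failed conversions.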
The character_check and character_blanks_check functions ensure that all of the values in the column can be coerced into character strings by as.character. While character_check does not tolerate blank values, character_blanks_check allows blanks as acceptable values for the purposes of the check. This difference is illustrated by the different results each check yields from the sample dataset:
col_check("FName", dataset, character_check)
col_check("FName", dataset, character_blanks_check)
As you can see, neither of these checks tolerates NA values.
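The difference between a strict and a blank-tolerant check can be mimicked in base R as follows. This is a rough sketch rather than the package's implementation; the vector fnames is made up, and treating whitespace-only strings as blank is an assumption of the sketch.

```r
# Made-up first names: one blank string and one missing value
fnames <- c("Ada", "", "Grace", NA)

# Strict variant: blanks and NA both fail
# (assumption: whitespace-only strings count as blank)
strict_ok <- !is.na(fnames) & nzchar(trimws(fnames))

# Lenient variant: blanks pass, NA still fails
lenient_ok <- !is.na(fnames)

strict_flagged  <- which(!strict_ok)   # positions failing the strict variant
lenient_flagged <- which(!lenient_ok)  # positions failing the lenient variant
print(strict_flagged)
print(lenient_flagged)
```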
The val_check function allows the user to input their own values to set the parameters of the check. The user supplies a vector of accepted values to the values argument, and the check ensures that all values in the column are within that set of accepted values. Blank values and NA values are not tolerated by default, though they can be included in the vector of accepted values.
col_check("Perf_Lvl", dataset, val_check, values = c("Basic", "Intermediate", "Advanced"))
col_check("Var1", dataset, val_check, values = c(1:25))
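A set-membership check of this kind behaves much like base R's %in% operator. The sketch below uses made-up data and is not taken from the package source; it shows why blanks and NA fail unless they are added to the accepted set.

```r
# Made-up performance levels, including a blank, an NA, and a stray label
perf <- c("Basic", "Advanced", "", NA, "Expert")
accepted <- c("Basic", "Intermediate", "Advanced")

# Values outside the accepted set; "" and NA are flagged too
flagged <- perf[!(perf %in% accepted)]

# Adding NA to the accepted set makes missing values pass
flagged_na_ok <- perf[!(perf %in% c(accepted, NA))]

print(flagged)
print(flagged_na_ok)
```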
The date_check function allows the user to input a beginning and end date to set the parameters of the check. The check ensures that all values in the column are equal to or between the specified beginning and end dates and returns all values that do not fall within the given range.
col_check("dates", dataset, date_check, begin = "06/02/1982", end = "11/11/1986")
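An inclusive date-range check can be sketched with as.Date. The month/day/year format string below is an assumption (the dates above do not make the format unambiguous), the sample dates are made up, and none of this is inspectr's own code.

```r
# Made-up dates; the "%m/%d/%Y" format is an assumption
fmt   <- "%m/%d/%Y"
dates <- c("06/02/1982", "01/15/1985", "12/25/1986", "05/30/1982")

parsed <- as.Date(dates, format = fmt)
begin  <- as.Date("06/02/1982", format = fmt)
end    <- as.Date("11/11/1986", format = fmt)

# Inclusive on both ends; unparseable dates (NA) count as failures
in_range <- !is.na(parsed) & parsed >= begin & parsed <= end
out_of_range <- dates[!in_range]
print(out_of_range)
```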
The two_col_check function is designed to check one column of data against values in another column. Several of the basic functions are appropriate for the two column check: less_than, less_than_equalto, greater_than, and greater_than_equalto.
two_col_check("Var1", "Var2", dataset, less_than)
two_col_check("Var1", "Var2", dataset, less_than_equalto)
The greater_than and greater_than_equalto functions work similarly. Notice that for these checks, the order of the input columns is reversed: Var2 is the column being checked for fidelity, and Var1 is the reference column.
two_col_check("Var2", "Var1", dataset, greater_than)
two_col_check("Var2", "Var1", dataset, greater_than_equalto)
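The row-wise comparison behind these checks, and the effect of reversing the column order, can be illustrated in base R. The data frame df is made up, and the sketch does not use inspectr itself.

```r
# Made-up data: row 2 has equal values, row 3 has Var1 > Var2
df <- data.frame(Var1 = c(1, 5, 7, 3), Var2 = c(2, 5, 4, 9))

# Rows failing a strict "Var1 less than Var2" comparison
fails_lt  <- which(!(df$Var1 < df$Var2))

# Rows failing "Var1 less than or equal to Var2"; ties now pass
fails_lte <- which(!(df$Var1 <= df$Var2))

# Swapping the columns and the direction flags the same rows
fails_gt  <- which(!(df$Var2 > df$Var1))

print(df[fails_lt, ])
print(df[fails_lte, ])
```

With no missing values, checking Var1 < Var2 and checking Var2 > Var1 flag exactly the same rows; the choice only changes which column is reported as the one under test.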
As of the installed version (see packageVersion("inspectr")), inspectr does not include any basic fidelity check functions designed to work with three_col_check. However, you are encouraged to write your own and plug them in! The example below shows a function written to check the Perf_Lvl column against Var1 and Var2 as reference columns. To pass the check, the value of Perf_Lvl must be "Basic", "Intermediate", or "Advanced"; or, if Perf_Lvl is NA, Var2 must be even and Var1 must be odd.
This is a somewhat silly check, but it illustrates how a user-defined function can be used with three_col_check. You can also use user-defined functions with col_check and two_col_check.
three_col_check(
  colname1 = "Perf_Lvl",
  colname2 = "Var1",
  colname3 = "Var2",
  data = dataset,
  fun = function(col1, col2, col3) {
    # Pass if Perf_Lvl is an accepted level, or if it is NA while
    # Var2 (col3) is even and Var1 (col2) is odd
    col1 %in% c("Basic", "Intermediate", "Advanced") |
      (is.na(col1) & (col3 %% 2 == 0) & (col2 %% 2 == 1))
  }
)
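Before plugging a user-defined function into a column check, it can help to try it on a few hand-made vectors. The sketch below gives the predicate above a name (perf_or_parity, chosen here for illustration) and runs it on made-up values; it assumes, as the example above suggests, that a check function returns TRUE for rows that pass.

```r
# The predicate from the example above, named so it can be tested alone
perf_or_parity <- function(col1, col2, col3) {
  col1 %in% c("Basic", "Intermediate", "Advanced") |
    (is.na(col1) & (col3 %% 2 == 0) & (col2 %% 2 == 1))
}

# Made-up rows: an accepted level, NA with odd/even, NA with even/even,
# and an unaccepted level
perf <- c("Basic", NA, NA, "Expert")
var1 <- c(2, 3, 4, 1)
var2 <- c(5, 8, 8, 2)

passes <- perf_or_parity(perf, var1, var2)
print(passes)
```

Rows 3 and 4 fail: row 3 has an NA level but an even Var1, and row 4 has a level outside the accepted set.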