knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The dataqualitycontrol package helps the user in performing data quality checks within a dataset
get all your files in csv format first:
IMPORTANT NOTE: your data and questionnaire must adhere to the standard xml style output from kobo otherwise this will either not work or produce wrong results.
- column headers unchanged
- questionnaire names unchanged
- xml values (NOT labeled)
- select_multiple questions have one column with all responses concatenated (separated by blank space " "), and one column for each response named [question name].[choice name]
This line you only have to run once when using dataqualitycontrol for the first time, or to update to a new version.
remotes::install_github("agualtieri/dataqualitycontrol", build_opts = c())
(the build_opts = c()
makes sure the package includes all extra help pages & documentation)
For this step only, you need a more or less stable internet connection.
If while installing R asks you to update certain packages, select "None" and continue the installation process. If the installation process fails, feel free to contact us at: alberto.gualtieri@reach-initiative.org
library("dataqualitycontrol")
Before loading your data this is the list of libraries that are required to run a data quality check process:
library("tidyverse") library("readr")
read.csv
takes two argumenta file
which is the path to the csv file and stringAsFactors
which is set up as FALSE
assessment_data <- read.csv("./vignettes/example/test_dataset.csv", stringsAsFactors = F)
Conditions:
The questionnaire depends on the question and the choices sheet as a csv.
questions <- read.csv("./vignettes/example/questionnaire_questions.csv", stringsAsFactors = F) choices <- read.csv("./vignettes/example/questionnaire_choices.csv", stringsAsFactors = F)
Conditions:
The conditions list file has to be prepared by the user and must contain two columns:
conditions_list <- read.csv("./vignettes/example/conditions_list.csv", stringsAsFactors = F)
The first step is to produce a simple cleaning loq showing how many times each data quality issue appears in the dataset.
You do that using the run_checks_from_dataframe
function which requires four arguments:
df
: the assessment dataframeconditions_df
: the list of conditionscondition.colum
: the column in the conditions list including the quality check codestest.name.column
: the column in the conditions list including the quality check namesmeta_to_keep
: the vector of variable you want to included in the final cleaning logcleaning_log <- run_checks_from_dataframe(df = assessment_data, conditions_df = conditions_list, condition.column = "conditions", test.name.column = "check_names", meta_to_keep = c("uuid", "A1_Metadata", "A2_Metadata", "A3_Metadata"))
The cleaning log produyced with run_checks_from_dataframe
is useful to have an overview of how many times a quality issues presents itself overall in the assessment dataset but does not help with the data cleaning. To produce a more useful data cleaning log we need to transpose the initial cleaning log divide each quality checks into one row per variable. This process requires three intermediate steps.
You can do so with the quality_checks_log_to_long_format
function. This function takes in two factors:
data
: the initial cleaning log datasetmeta_not_to_transform
: the matadata you want to keep in wide formatcleaning_log_melt <- quality_checks_log_to_long_format(data = cleaning_log, meta_not_to_transform = c("uuid", "A1_Metadata", "A2_Metadata", "A3_Metadata"))
For the final version fo the cleanign log, it would be useful to have the quality check names into the dataframe. To join the them from the condition list dataframe we need to do a number of operations.
## Add quality checks to dataframe cleaning_log_melt$quality_checks <- conditions_list$conditions[match(cleaning_log_melt$variable, conditions_list$check_names)] ## Rename column variable and delete the value column clog_separated <- reshape::rename(cleaning_log_melt, c(variable = "description")) clog_separated$value <- NULL
Since each quality check is made of one ore multiple individual variable checked against each other we need to separate them into a one row per variable format so that we are able to identify which variable will need to be cleaned.
To reformat the cleaning log you can use the separate_on_multiple
function that takes in the following arguments:
data
: the separated cleaning log created before using the quality_checks_log_to_long_format
with the conditions names addedvar_to_separate
: the column you want to separate into a one row per variable formatsep1
: the first separator charactersep2
: if present, the second separator charactersep3
: of present, the third separator characger## Separate reformatted quality checks into three variable to allow for easier data cleaning clog_reformatted <- separate_on_multiple(clog_separated, "quality_checks", sep1 = "&", sep2 = "|") ## Create an "old value" variable by splitting the "quality checks" column clog_reformatted$qchecks_sep <- as_tibble(str_replace_all(clog_reformatted$quality_checks, "[= | !=]", " ")) var_split <- str_split_fixed(clog_reformatted$qchecks_sep, " ", 2) cleaning_log_final <- cbind(clog_reformatted, var_split) cleaning_log_final <- reshape::rename(cleaning_log_final, c(V1 = "variable_name", V2 = "old_value")) ## Add a "new value" colimn cleaning_log_final$new_value <- NA ## Delete unnecessary columns cleaning_log_final$quality_checks <- NULL cleaning_log_final$qchecks_sep <- NULL # Clean the "old values" codes from quotes cleaning_log_final$old_value<- gsub("\"", "", cleaning_log_final$old_value) cleaning_log_final$old_value <- gsub("\ ", "", cleaning_log_final$old_value)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.