Supervising the quality of data collection is not straightforward. Survey questionnaires are often quite long and have systematic control is worth automatizing.
HighFrequencyCheck
can be used to detect programming errors, surveyor errors, data fabrication, poorly understood questions, and other issues. The results of these checks can also be useful in improving your survey, identifying enumerator effects, and assessing the reliability of your outcome measures. It allows teams to catch survey issues and collection mistakes in time to correct them and ensure high-quality data
The HighFrequencyCheck
package is a translation in in R of the Stata package based on best practices from Innovations for Poverty Action. High Frequency Checks are also recommend by the World Bank. It can be installed from github with devtools::install_github("unhcr/HighFrequencyChecks")
.
The package brings a series of convenience functions to monitor data quality during the data collection when running a survey with KoboToolbox (or any xlsform
compatible platform).
Those are the basis of a feedback mechanism with enumerators and can be performed periodically during the data collection process to check for possible errors and provide meaningful inputs to the enumerators. All these functions do not have to be ran at the same period of time. They are provided there to help data supervisor to build reports:
A wrapper
function is included to generate directly an final data quality assessment Rmd Report
A ShinyApp
Interface is also included to provide a live monitoring dashboard to be run locally.
Data collection quality monitoring includes 4 different dimensions
Ideally the data collection monitoring dashboard should be known to all enumerators so that they are aware of the data quality metrics that will be used to assess the quality of their work and potentially some incentive can be offered for the enumerators performing on the quality metrics (It is good to recall that each records in household survey cost between 15 to 50 USD). Some of those indicators can support some remedial supervision interventions, such as calling individually the enumerator and point some specific issues that were detected.
It is important to prepare high frequency checks (code and instructions), as part of the data quality assurance plan, before before starting with field data collection.
Below are the required configuration and an illustration of those indicators based on a demo dataset included in the package.
library(knitr) library(gsubfn) library(dplyr) library(data.table) library(HighFrequencyChecks) library(kableExtra) library(DT) library(ggplot2) library(plotly)
In a production environment, it is possible to connect this a live API (kobotoolbox, ODK , etc.)
ds <- HighFrequencyChecks::sample_dataset # correction for uppercase/lowercase in the site name ds$union_name <- tolower(ds$union_name)
The sampling plan is defined through the sampling strategy. It includes for each enumerator the sufficient details for the enumerator to reach out to respondent satisfying the sampling target definition (i.e. name, location, phone number)
SampleSize <- HighFrequencyChecks::SampleSize # correction for uppercase/lowercase in the site name SampleSize$Union <- tolower(SampleSize$Union) # name as a string of the field where the number of points # generated is stored in the sampling frame sf_nbpts="TotPts" # name as a string of the field in the dataset where the site is stored sf_site="Union" # name as a string of the field in the sampling frame where the site is stored sf_target="SS"
Sampling targets shall also be defined
# formulas as a list of string used to compute the final number # of eligible surveys and the variance from the target # (C('formula1','formula2')). # the values/fields available are: done and the ones generated # according the survey consent values (one per value) formul = c("done-no-not_eligible-deleted", "done-no-not_eligible-deleted-SS") # column names as a list of string to order the colums in the result # the columns available are: site, done, final, variance and # the ones generated according the survey consent values (one per value) colorder = c("site", "SS", "TotPts", "done", "not_eligible", "no", "deleted", "yes", "final", "variance")
Often sampling strategy includes a geographic coverage.
#This can be overview through either: # a defined polygon, aka area or admin unit adm <- HighFrequencyChecks::admin # Unique key with the survey dataset is changed to all small cap for further join adm$Union <- tolower(adm$Union) # OR a sampling point, around which enumerators are supposed to randomly interview persons. pts <- HighFrequencyChecks::SamplePts
When it comes to back checks, variables can be divided into the following different categories:
Type 1 variables are based on straightforward questions where there is very little possibility of error. For example, age and education. If there is an error in these variables, it means there is a serious problem with enumerator, or with the questions.
Type 2 variables are based on questions where a risk of error is possible. For example, questions based on sensitive topics, or questions that involve calculations or classification with possibility of 'or other'. If there is an error in these variables, there might be a need to provide further training for enumerators.
Type 3 variables are based on questions about the survey instrument, and errors in this case provide feedback which can help improve the survey instrument itself. This is often the case for questions including or other, please specify question type.
Below are initialized variables to perform data quality control for this specific dataset
# Name of variables for geographic coordinates as recorded by data collection device GPS df_coord <- c("X_gps_reading_longitude","X_gps_reading_latitude") # Name of location (__matching external geodata__) df_site <- "union_name" #Name of unique key in polygon file admin_site <- "Union"
# Variable recording initial consent survey_consent <- "survey_consent" # Variable recording multiples screening questions questions <- c("consent_received.shelter_nfi.non_food_items[.]", "consent_received.food_security.main_income[.]", "consent_received.child_protection.boy_risk[.]", "consent_received.child_protection.girl_risk[.]") # Variable to be checked reportingcol <- c("enumerator_id","X_uuid")
# unique ID for each record UniqueID <- "X_uuid" # dates dates <- c("survey_start","end_survey") ## Official date for start of data collection start_collection <- "11/11/2018" <<<<<<< HEAD surveydate <- "survey_date" ======= ```r list[dts,err,var3,var4] <- chk1di_GIS_site(adm, sample_dataset, df_site, df_coord, admin_site, consent, reportcol, TRUE) sample_dataset <- dts >>>>>>> ff115024ba0c0d240709b337cd49478e53f134fa # Variable used to record enumerator identifiers enumeratorID <- "enumerator_id"
# What is the minimum survey duration in minutes (when using all skip logic)? minduration <- 10 # minans answers per specific questions minans <- 3 # Standard value sdvalue <- 2 #Size of the buffer in meters around the assigned data collection location points buffer <- 10
otherpattern <- "_other$" dateformat <- "%m/%d/%Y" delete <- FALSE correct <- TRUE
These checks are designed to ensure that responses are consistent for a particular survey instrument, and that the responses fall within a particular range. For example, checks to ensure that all variables are standardized, and there are no outliers in the data. Share a daily log of such errors, and check if it is an issue with enumerators, in which case there might be a need to re-train enumerators.
Missing data: Are some questions skipped more than others? Are there questions that no respondents answered? This may indicate a programming error.
Categorical variables: Are respondents selecting the given categories or are many respondents selecting “None of the above”, or “other”? If conducting a survey, you may want to add categories or modify your existing ones.
Too many similar responses: Is there a question where all respondents answer in the same way?
Respondent IDs: Are there duplicates of your unique identifiers? If so, does the reason why make sense? (e.g., one circumstance in which there may be duplicates of unique IDs is when surveyors have to end and restart an interview.) Are there blank or invalid IDs? This might be a sign your surveyors are not interviewing the correct respondent.
list_unique_id <- chk2b_unique_id(ds, UniqueID, survey_consent, reportingcol, delete) ds <- list_unique_id[[1]] if(nrow(list_unique_id[[2]])>0){ DT::datatable(list_unique_id[[2]], caption = "Detected records with errors: Duplicate respondent ID") } else { cat(">__No errors__: All records have a unique repondent ID") }
list_missing_id <- chk2a_missing_id(ds, UniqueID, survey_consent, reportingcol, delete) ds <- list_missing_id[[1]] if(nrow(list_missing_id[[2]] )>0){ DT::datatable(list_missing_id[[2]], caption = "Detected records with errors: Missing respondent ID") } else { cat(">__No errors__: All records have an ID") }
list_date_mistake <- chk3a_date_mistake(ds, survey_consent, dates, reportingcol, delete) ds <- list_date_mistake[[1]] if(nrow(list_date_mistake[[2]])>0){ DT::datatable(list_date_mistake[[2]], caption = "Detected records with errors") } else { cat(">__No errors__: All interviews ended on the same day as they started") }
list_date_mistake2 <- chk3b_date_mistake(ds, survey_consent, dates, reportingcol, delete) #ds <- list_date_mistake2[[1]] if(nrow(list_date_mistake2[[2]])>0){ DT::datatable(list_date_mistake2[[2]], caption = "Detected records with errors: interviews ended before they start") } else { cat(">__No errors__: All interviews ended before they start") }
list_date_mistake3 <- chk3d_date_mistake(ds, survey_consent, dates, reportingcol, delete) #ds <- list_date_mistake3[[1]] if(nrow(list_date_mistake3[[2]])>0){ DT::datatable(list_date_mistake3[[2]], caption = "Detected records with errors - date are not in the future") } else { cat(">__No errors__: records date are not in the future") }
list_date_mistake4 <- chk3c_date_mistake(ds, dates, survey_consent, start_collection, reportingcol, delete) ds <- list_date_mistake4[[1]] if(nrow(list_date_mistake4[[2]])>0){ DT::datatable(list_date_mistake4[[2]], caption = "Detected records with errors - records occured after the official beginning of data collection") } else { cat(paste0(">__No errors__: all records occured after the official beginning of data collection on ", start_collection)) }
list_site <- chk1di_GIS_site(adm, ds, df_site, df_coord, admin_site, survey_consent, reportingcol, correct) ds <- list_site[[1]] if(nrow(list_site[[2]])>0){ DT::datatable(list_site[[2]], caption = "Detected records with errors - location name not matching with GPS") } else { cat(">__No errors__: all records location name not matching with GPS") }
r buffer
meter buffer from a sample pointlist_sitept <- chk1dii_GIS_Xm(ds, pts, df_coord, buffer, survey_consent, reportingcol, TRUE) ds <- list_sitept[[1]] if(nrow(list_sitept[[2]])>0){ DT::datatable(list_sitept[[2]], caption = "Detected records with errors - recorded location too far from sampling points") } else { cat(">__No errors__: all records location in accpetable distance from sampling points") }
r minduration
minuteslist_duration_Xmin <- chk5b_duration_Xmin(ds, survey_consent, dates, reportingcol, minduration, TRUE) ds <- list_duration_Xmin[[1]] if(nrow(list_duration_Xmin[[2]])>0){ DT::datatable(list_duration_Xmin[[2]], caption = paste0("Detected records with errors - Interviews duration shorter than ", minduration)) } else { cat(paste0(">__No errors__: No interviews duration shorter than ", minduration)) }
Test for target number: since surveys are submitted in daily waves, keep track of the numbers of surveys submitted and the target number of surveys needed for an area to be completed.
trackingSheet <- chk7bii_tracking(ds, SampleSize, df_site, sf_site, survey_consent, sf_target, sf_nbpts, formul, colorder) DT::datatable(trackingSheet, caption = "trackingSheet")
These are designed to check if data shared by any particular enumerator is significantly different from the data shared by other enumerators. Some parameters to check enumerator performance include percentage of "Don't know" responses, or average interview duration. In the first case, there might be a need to re-draft the questions, while in the second case, there might be a need to re-train enumerators.
Outliers: Are some respondents reporting values drastically higher or lower than the average response? Do these variables need to be top or bottom coded? Many outlier checks can be directly programmed into the survey, either to flag responses or bar responses that are outside the acceptable range.
Beware that Interviews with potential errors on the dates are not marked for deletion which can lead to weird duration
list_dur <- chk5a_duration(ds, dates) # ds <- dts cat("> The total time of data collection is ", list_dur[[1]], " minutes and the average time per survey is ", list_dur[[2]], " minutes")
r minans
answers per specific questionsreportlog_less_X_answers <- chk6g_question_less_X_answers(ds, enumeratorID, questions, minans) if(nrow(reportlog_less_X_answers)>0){ DT::datatable(reportlog_less_X_answers, caption = paste0("Detected records with errors - Enumerators who pick up less than ",minans, " answers per specific questions")) }
reportlog_others_values <- chk4biv_others_values(ds, otherpattern, enumeratorID, TRUE) if(nrow(reportlog_others_values)>0){ DT::datatable(reportlog_others_values, caption = paste0("Detected of other distinct values")) }
reportlog_productivity <- chk7ai_productivity(ds, surveydate, dateformat, survey_consent) if(nrow(reportlog_productivity)>0){ DT::datatable(reportlog_productivity, caption = paste0("Completed interview per day")) }
chk7aii_productivity_hist(ds,
surveydate,
dateformat,
survey_consent)
reportlog_nb_status <- chk7bi_nb_status(ds, surveydate, dateformat, survey_consent) if(nrow(reportlog_nb_status)>0){ DT::datatable(reportlog_nb_status, caption = paste0("attempted interview per day and obtained consent")) }
reportlog_refusal <- chk6a_refusal(ds, survey_consent, enumeratorID) if(nrow(reportlog_refusal)>0){ DT::datatable(reportlog_refusal, caption = paste0("Percentage of survey per consent status by enumerator")) }
reportlog_duration <- chk6b_duration(ds, dates, enumeratorID) if(nrow(reportlog_others_values)>0){ DT::datatable(reportlog_duration, caption = paste0("Average interview duration by enumerator")) }
reportlog_nb_survey <- chk6c_nb_survey(ds, surveydate, enumeratorID) if(nrow(reportlog_nb_survey)>0){ DT::datatable(reportlog_nb_survey, caption = paste0("Number of surveys per day by enumerator")) }
reportlog_productivity <- chk6f_productivity(ds, enumeratorID, surveydate, sdvalue) if(nrow(reportlog_productivity)>0){ DT::datatable(reportlog_productivity, caption = paste0("Enumerators with productivity significantly different from the average")) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.