In gnoblet/impactR: Ease IMPACT's data and database officers woRk

knitr::opts_chunk$set(echo = TRUE)

Why?

This document aims at providing the workflow that may arise from and through the use of impactR. Let's load the package.

# If you haven't done so 
# devtools::install_github("impactR)
library(impactR)

Below, let's use the 'airports' dataset from package nycflights13

# Load dataset in environment
data(data)

# Show the first lines and types of airports' tibble
data <- data |> tibble::as_tibble()
data

Most of impactR's functions are written with the assumption that the provided data may be coerced to a tibble, since it extensively use the tidyverse.

Let's sat we have a survey sheet composed as:

data(survey)
survey <- survey |> tibble::as_tibble(survey)

First thing to do with 'survey' as it will be used elsewhere: separate the type column.

# All are already defaults apart from 'col_to_split'
survey <- survey |>  
  split_survey(
    col_to_split = "type",
    into = c("type", "list_name"),
    sep = " ",
    fill = "right")

Monitor data collection

Check for outliers

In order to obtain the so-called outliers (for numerical variables), you can use the make_log_outlier function which allows you to obtain two types of outliers: the deviation from the mean using the outliers_sd function and the interquartile range using the outliers_iqr function.

# Helper to get numeric columns
# numeric_cols(data) # Get all numeric columns: it includes for example numeric enumerator ids or gps locations
# numeric_cols(data, survey) # Check for numeric columns using the survey sheet.

# Get IQR outliers (1.5 default rule)
# outliers_iqr(data, col = i_enquete_age,times = 1.5,  id_col = uuid)  # id_col is usually uuid with Kobo

# Get standard deviation outliers
# outliers_sd(data, i_enquete_age, times = 3, id_col = uuid)

# Create the full log of outliers of all numeri columns
log_outliers <- make_log_outlier(data, survey, id_col = uuid, today, i_enum_id, i_zad)
log_outliers

Check for others

This functions needs the Kobo question with "other" answer to be defined as "variable" for the parent question and "other_variable" for the parent question. "other_" maybe different and is defined in the following question by the 'other' arg, being the character pattern. In the example that follows, it is "autre_".

# Get other answers
other_cols <- other_cols(data, "autre_", id_col = uuid)

# Get other parent answers
other_parent_cols(data, other_cols, "autre_", id_col = uuid)

# Fabriquer le journal de nettoyage des "autres"
log_others <- make_log_other(data, survey, "autre_", uuid, i_enum_id, i_zad) 
log_others

Cleaning log based on a checklist

All you need to do is to have produced an Excel file of logical checks in advance, which can be modified during the collection process.

data(check_list)
check_list <- check_list |> tibble::as_tibble()
check_list

The check excel spreadsheet must follow a few rules in order to be read by the impactR [add list] functions. For example, it is necessary that all variables present in the logical tests (column 'logical_test' of the check table) also exist in the data, object data. For this, we can use the check_check_list function, which aims to validate or not a logical check table [note: it is not yet robust, but it already allows to do a number of checks].

For example, below the column survey_duration does not exist in the data. However, there are logical checks that take it into account in the logical checks spreadsheet check_list. If we run the following command, we would get an error:

check_check_list(check_list, data)
# following column/s from `question_name` is/are missing in `.tbl`: survey_duration, survey_duration, survey_duration

So we will add the survey_duration column using the survey_duration function. We take the opportunity to add the time difference between two surveys per interviewer:

data <- data |> 
  survey_duration(start, end, new_colname = "survey_duration") #|>
  # NOT RUN!
  # survey_difftime(start, end, new_colname = "survey_difftime", i_enum_id)
data$survey_duration

We can check the logical check spreadsheet again:

check_check_list(check_list, data)

This time, it's ok, the function gives TRUE, so we can proceed with the production of the cleaning log.

log_check_list <- make_log_from_check_list(data, survey, check_list, uuid, today, i_enum_id, i_zad)

log_check_list

Full cleaning log and export

Finally, we just need to combine all these cleaning logs into one. We can then export it to an excel file.

log <- list(log_outliers, log_check_list, log_others) |> 
  purrr::map(~ .x |> dplyr::mutate(dplyr::across(.fns = as.character))) |> 
  dplyr::bind_rows() |> 
  readr::type_convert()

It is also possible to use the make_all_logs function which combines these three functions and outputs a single cleanup log.

log <- make_all_logs(data, survey, check_list, "autre_", uuid, today, i_enum_id, i_zad)
log

Finally, there are several ways to export. Here we give the example with the package writexl:

# Not run! The simplest one
# writexl::write_xlsx(log, "output/log.xlsx)

# You can add the current date to allow tracking
# writexl::write_xlsx(log, paste0("output/log_", Sys.Date(), ".xlsx))

gnoblet/impactR documentation built on March 20, 2023, 2:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com