In kamclean/collaborator: Scalable multi-centre research using R and REDCap

knitr::opts_chunk$set(collapse = FALSE)
library(collaborator);library(dplyr)

Collaborator: Generating Missing Data Reports

Ensuring high levels of completeness within research projects is an important task for ensuring the highest quality dataset for subsequent analyses. However, determining what data is missing within a REDCap project, particularly accounting for appropriately missing data (such as in the case of unfulfilled branching logic) can be a challenging and time-consuming task to produce in real-time.

The report_miss() function is designed to easily produce a high quality and informative report of missing data at a data_access_group and individual record level. This report highlights all missing data within a REDCap project (delineating between appropriately missing and true missing data), while removing all other data (so this can be shared in line with duties of data protection).

Requirements:

The report_miss() function is designed to be simple from the point of use - the only requirements are a valid URI (Uniform Resource Identifier) and API (Application Programming Interface) for the REDCap project.

There is a high degree of customisability, with the following able to be specified to focus on a subset of the dataset:

Variables (columns): Modified using the "var_include" and "var_exclude" parameters.
Records / DAGs (rows): Modified using the "record_include"/"dag_include" and "record_exclude"/"dag_exclude".

Limitations:

This function has not yet been tested on REDCap projects with multiple events.

Output:

Record level report

Example of a record level report of missing data. This not only quantifies the missing data within the record, but also highlights it's location within the dataset.

1. Record level summary

miss_n is the number of missing data fields ("M").
fields_n is the number of all data fields (excluding appropriately missing data).
miss_prop / miss_pct are respective proportions and percentages of data that are missing for each record.
miss_threshold is a yes/no variable indicating if the variable has over the specified missing data threshold (default = 5%).

2. Missing data locations (column 8 onwards)

"NA" fields represent appropriately missing data (e.g. secondary to unfulfilled branching logic). Therefore, these are excluded from the missing data count entirely.
"M" fields represent 'true' missing data (which may require follow up), and so all counts of missing data are based these.

collaborator::report_miss(redcap_project_uri = Sys.getenv("collaborator_test_uri"),
                          redcap_project_token = Sys.getenv("collaborator_test_token"))$record %>%
  View()
  head(15) %>% # first 15 records
  knitr::kable()

Data access group level report

Example of a data access group (DAG) level report of missing data (summarising missing data for all records within the DAG).

n_pt is the number of patients within the data_access_group.
n_threshold is the number of patients over the specified missing data threshold (default = 5%).
cen_miss_n is the number of missing data fields ("M") within the data_access_group.
fields_n is the number of all data fields within the data_access_group (excluding appropriately missing data).
cen_miss_prop / cen_miss_pct are respective proportions and percentages of data that are missing for each data_access_group.

report_miss(redcap_project_uri = Sys.getenv("collaborator_test_uri"),
            redcap_project_token = Sys.getenv("collaborator_test_token"), missing_threshold = 0.2)$group %>%
  knitr::kable()