knitr::opts_chunk$set(collapse = TRUE)
library(collaborator); library(dplyr)

REDCap Data for R

REDCap is a fantastic database, however the ability to export data is limited to the "raw" data (e.g. factors stored as numbers) or "labelled" data (e.g. factors stored as characters). While code is able to be obtained to convert data into the appropriate format, this the unwieldy and needs to be refershed if the underlying project design is changed.

The redcap_data() function provides a simple way to export, clean, and format data ready for analysis in R. This utilities data available in the metadata of the project to ensure numeric data is numeric, factors are factors in the appropriate order, dates are dates objects, etc. It remains aligned to the data on REDCap despite any changes in the project design.

redcap <- collaborator::redcap_data(redcap_project_uri = Sys.getenv("collaborator_test_uri"),
                          redcap_project_token = Sys.getenv("collaborator_test_token"),
                          include_original = T)

There are 3 potential outputs from this function:

  1. data: The cleaned and formatted REDCap dataset:
knitr::kable(redcap$data %>% head(5))
  1. metadata: The metadata used to create data.
knitr::kable(redcap$metadata %>% head(5))
  1. original: An optional output showing the raw dataset extracted from REDCap (included if include_original = T):
knitr::kable(redcap$original %>% head(5))

Handling Repeating Instruments

You may note above that the structure of redcap$data and redcap$original are different, and that there are multiple rows with the record ID 1. This is because the project includes repeating instruments (forms that can be completed repeatedly to facilitate longitudinal data collection) - this add complexity to how the data is structured / has to be handled.

The default for both the original and formatted data is to provide the data in a "long" format - one row per repeating instrument:

knitr::kable(redcap$data %>% dplyr::select(record_id,pt_age:pt_sex, redcap_repeat_instance:last_col())%>% head(5))

However if you require 1 record per row for analysis (the majority of cases), but wish to keep data from ALL repeating instruments you can easily change the data structure by applying redcap_format_repeat() to redcap_data()$data or specify in redcap_data()upfront using the format argument. This has 2 options instead of "long" format:

  1. wide: The repeating instruments are all transposed and numbered accordingly. The consistent naming scheme allows re-conversion to a long format using tidyr::pivot_longer().
redcap$data %>%
  redcap_format_repeat(format = "wide") %>%
  dplyr::select(record_id, contains("instance")) %>%
  knitr::kable()
  1. list: The repeating instruments are all stored as a nested list for each record (more efficent storage of data). This can be unnested at a later point using tidyr::unnest().
redcap$data %>%
  redcap_format_repeat(format = "list") %>%
  dplyr::select(record_id, redcap_repeat_instance:last_col()) %>%
  knitr::kable()

Generating a Simple, Easily-Shareable Data Dictionary

The function data_dict() can be used to generate an easily sharable and informative data dictionary for an R dataframe. Unlike the str() function typically used to display the internal structure of dataframes in R, this produces a dataframe alongside summarising information relevant to the class of variable, and the proportion of missing data (NA) within each variable.

This can be useful in quickly understanding how data is structured within the dataset, and in assessing data quality (e.g. outlying and incorrect or quantity of missing values). This can be easily exported from R and shared as a spreadsheet.

Requirements

The data_dict() function can be applied to any dataframe object. At present, it supports the following classes (other classes will be shown as "Class not supported" in the values column):

Output

The data_dict() function produces a dataframe which identifies the class, summarised values, and proportion of missing data for each variable in the original dataframe.

The output can be easily converted to a spreadsheet file (e.g. csv file) and exported for sharing. Let's use the data extracted above.

data <- redcap$data %>% redcap_format_repeat(format = "wide")

data_dict(data) %>%
  knitr::kable()

Through summarising the variables, data will not necessarily be linkable to individual patients (bar in the circumstance where variable(s) contain a direct patient identifier e.g. Community Health Index (CHI) Number, hospital numbers, etc).

However, should any variable(s) (such as a direct patient identifier) be desirable to exclude from the output, this can be achieved using the "var_exclude" parameter.

knitr::kable(collaborator::data_dict(data, var_exclude = c("id_num","sex")))


kamclean/collaborator documentation built on Nov. 17, 2023, 3:52 a.m.