Retrieving Longitudinal and Repeating Structures

#| include = FALSE
library(knitr)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy    = FALSE
)

knit_print.data.frame <- function(x, ...) {
  # See https://cran.r-project.org/package=knitr/vignettes/knit_print.html
  . <- NULL # Avoid lintr check
  x %>%
    # rmarkdown::print.paged_df() %>%
    kable(
      col.names = gsub("_", "<br>", colnames(.)),
      # col.names = paste0("<code>", gsub("_", "<br>", colnames(.)), "</code>"),
      # col.names = gsub("_", " ", colnames(.)),
      escape = FALSE,
      format = "html"
    ) %>%
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      full_width        = FALSE
    ) %>%
    c("", "", .) %>%
    paste(collapse = "\n") %>%
    asis_output()
}

# register the method
registerS3method("knit_print", "data.frame", knit_print.data.frame)

Background

This vignette pertains to reading REDCap records from a project that (a) has longitudinal events or (b) has a repeating measure. The first section conceptually discusses how REDCap stores complex structures. The remaining sections describe how to best retrieve complex structures with the REDCapTidyieR and REDCapR packages.

If your REDCap project is longitudinal or contains repeating measures, a single call to the API (or a single export through the browser) will return a dataset that is not readily analyzed. Instead, the dataset will resemble Table 5. This isn't because of a software bug, but because you haven't told the software how you would like the data structured. There isn't a good way to jam this multidimensional space into a rectangle of points. Our advice for querying REDCap is the same as querying any database system: request separate datasets that have a natural "grain" and assemble them as to fit your analyses.

Illustration of How Data Points are Structured

Possible Table Structures

Suppose you have two patients (i.e., "1" and "2") with three intake variables (i.e., height, weight, and bmi). If you record this on a piece of paper, it would probably look like Table 1. The table's grain is "patient", because each row represents a distinct patient. Understanding the grain of each structure below will help you understand how the structures are re-expressions of the same set of observations.

Table 1: patient grain

| pt
id | height | weight | bmi | | :------: | :----: | :----: | :---: | | 1 | 1.0 | 11.0 | 111.0 | | 2 | 2.0 | 22.0 | 222.0 |

This patient-grain structure is how the data points are most comfortably inputted by humans into REDCap, and it is the default when exported through the browser and API. However it is stored differently by REDCap's internal database.

REDCap's flexibility is a driver of its success. Once a research team learns REDCap, it can reuse the knowledge to capture anything from leukemia to lunch orders. But to achieve this flexibility in the world of REDCap and EMRs, data are stored along the observation grain. In computer science, this is commonly called an EAV structure (which stands for entity-attribute-value). The patient's ID is the entity, the variable type is the attribute, and the observed point is the value. It can also be thought of as a "key-value store" nested within a patient (where "key" is a synonym of "attribute"). Notice that the two wider rows have morphed into six skinnier rows --one row per observation. If you are a curious database administrator, peek at the the structure and rows of the redcap_data table. It is the most important table in the database.

Table 2: observation grain for intake instrument

REDCap and EMR databases store observations in their underlying table. This table is a simplification of the redcap_data table, which is the heart of the REDCap's internal database.

| pt
id | key | value | | :------: | :----- | ----: | | 1 | height | 1.0 | | 1 | weight | 11.0 | | 1 | bmi | 111.0 | | 2 | height | 2.0 | | 2 | weight | 22.0 | | 2 | bmi | 222.0 |

If the investigation gains a longitudinal or repeating component, it becomes necessary to include the dimension of time. Suppose the protocol specifies five time points; the blood pressure instrument is captured at times 1, 2, & 3 while the laboratory instrument is captured at times 1 & 2. If you record this stage on paper, it will likely resemble Tables 3a & 3b: one for vitals and one for labs.

Table 3a: patient-time grain for blood_pressure instrument

| pt
id | time | sbp | dbp | | :------: | :--: | :--: | :--: | | 1 | 1 | 1.1 | 11.1 | | 1 | 2 | 1.2 | 11.2 | | 1 | 3 | 1.3 | 11.3 | | 2 | 1 | 2.1 | 22.1 | | 2 | 2 | 2.2 | 22.2 | | 2 | 3 | 2.3 | 22.3 |

Table 3b: patient-time grain for laboratory instrument

| pt
id | time | lab | dose | | :------: | :--: | :--: | :----: | | 1 | 1 | aa1 | 1.1 mg | | 1 | 2 | aa2 | 1.2 mg | | 2 | 1 | bb1 | 2.1 mg | | 2 | 2 | bb2 | 2.2 mg |

When these measurements are added to REDCap's observation table, it resembles Table 4. Two new columns are required to uniquely distinguish the instrument and its ordinal position. Notice the first six rows are copied from Table 2; they have empty values for the repeating structure.

Table 4: observation grain for intake, blood_pressure, and laboratory instruments

| pt
id | repeat
instrument | repeat
instance | key | value | | :------: | :------------------- | :----------------: | :----- | ------: | | 1 | -- | -- | height | 1.0 | | 1 | -- | -- | weight | 11.0 | | 1 | -- | -- | bmi | 111.0 | | 2 | -- | -- | height | 2.0 | | 2 | -- | -- | weight | 22.0 | | 2 | -- | -- | bmi | 222.0 | | 1 | blood_pressure | 1 | sbp | 1.1 | | 1 | blood_pressure | 1 | dbp | 11.1 | | 1 | blood_pressure | 2 | sbp | 1.2 | | 1 | blood_pressure | 2 | dbp | 11.2 | | 1 | blood_pressure | 3 | sbp | 1.3 | | 1 | blood_pressure | 3 | dbp | 11.3 | | 1 | laboratory | 1 | lab | aa1 | | 1 | laboratory | 1 | conc | 1.1 ppm | | 1 | laboratory | 2 | lab | aa2 | | 1 | laboratory | 2 | conc | 1.2 ppm | | 2 | blood_pressure | 1 | sbp | 2.1 | | 2 | blood_pressure | 1 | dbp | 22.1 | | 2 | blood_pressure | 2 | sbp | 2.2 | | 2 | blood_pressure | 2 | dbp | 22.2 | | 2 | blood_pressure | 3 | sbp | 2.3 | | 2 | blood_pressure | 3 | dbp | 22.3 | | 2 | laboratory | 1 | lab | bb1 | | 2 | laboratory | 1 | conc | 2.1 ppm | | 2 | laboratory | 2 | lab | bb2 | | 2 | laboratory | 2 | conc | 2.2 ppm |

As mentioned above, there isn't a universally good way to coerce Tables 1, 3a, and 3b into a single rectangle because the rows represent different things. Or from REDCap's perspective, there's not a good transformation of redcap_data (i.e., Table 4) that is appropriate for most statistical programs.

When forced to combine the different entities, the best option is probably Table 5. We call this a "block dataset", borrowing from linear algebra's block matrix term. You can see the mishmash of tables masquerading as a unified dataset. The rows lack the conceptual coherency of Tables 1, 3a, & 3b.

Table 5: mishmashed grain

| pt
id | repeat
instrument | repeat
instance | height | weight | bmi | sbp | dbp | lab | conc | | :------: | :------------------- | :----------------: | :----: | :----: | :---: | :--: | :--: | :--: | :-----: | | 1 | -- | -- | 1.0 | 11.0 | 111.0 | -- | -- | -- | -- | | 1 | blood_pressure | 1 | -- | -- | -- | 1.1 | 11.1 | -- | -- | | 1 | blood_pressure | 2 | -- | -- | -- | 1.2 | 11.2 | -- | -- | | 1 | blood_pressure | 3 | -- | -- | -- | 1.3 | 11.3 | -- | -- | | 1 | laboratory | 1 | -- | -- | -- | -- | -- | aa1 | 1.1 ppm | | 1 | laboratory | 2 | -- | -- | -- | -- | -- | aa2 | 1.2 ppm | | 2 | -- | -- | 2.0 | 22.0 | 222.0 | -- | -- | -- | -- | | 2 | blood_pressure | 1 | -- | -- | -- | 2.1 | 22.1 | -- | -- | | 2 | blood_pressure | 2 | -- | -- | -- | 2.2 | 22.2 | -- | -- | | 2 | blood_pressure | 3 | -- | -- | -- | 2.3 | 22.3 | -- | -- | | 2 | laboratory | 1 | -- | -- | -- | -- | -- | bb1 | 2.1 ppm | | 2 | laboratory | 2 | -- | -- | -- | -- | -- | bb2 | 2.2 ppm |

A block dataset is not inherently bad. After all, Table 5 can be unambiguously transformed to and from Table 4.

Table 5's primary limitation is that a block dataset is not understood by analysis software used in conventional medical research. At best, the dataset always will require additional preparation. At worst, the analyst will model the rows inappropriately, which will produce misleading conclusions.

Table 5's secondary limitation is inefficiency. The empty cells aren't computationally free. Every cell must be queried from the database and concatenated in REDCap's web server in order to return Table 5 in the plain-text csv, json, or xml format. In our simple example, more than half of the block dataset's cells are wasted. The emptiness frequently exceeds 90% in real-world REDCap projects (because they tend to have many more variables and repeating instances). The emptiness always exceeds 99.9% in real-world EMRs.

For this reason, REDCap and EMR design their observation table to resemble the computational structure of a sparse matrix. (The only important difference is that REDCap's unspecified cells are interpreted as null/empty, while a sparse matrix's unspecified cells are interpreted as zero.)

In the case of a sparse matrix, substantial memory requirement reductions can be realized by storing only the non-zero entries. Depending on the number and distribution of the non-zero entries, different data structures can be used and yield huge savings in memory when compared to the basic approach. The trade-off is that accessing the individual elements becomes more complex and additional structures are needed to be able to recover the original matrix unambiguously. (source: Wikipedia: Sparse matrix - storage)

Terminology

observation

The term "observation" in the world of medical databases has a different and more granular meaning than it does in the tidyverse literature. In REDCap and medical databases, an observation is typically a single point (such as a heart rate or systolic blood pressure) with contextual variables (such as the the associated date, unit, visit ID, and patient ID); see Tables 2 and 4 above. In the tidyverse publications, an observation is roughly equivalent to a REDCap instrument (which is a collection of associated values); see Tables 1, 3a, and 3b.

(We use the medical terminology in this vignette. We'd love to hear if someone has another term that's unambiguous.)

| Concept | REDCap & Medical World | Tidyverse Literature | | :--- | :--------------------- | :------------------- | | A single measured point | observation | value | | A collection of associated points | instrument | observation |

Retrieving from REDCap

Many new REDCap users will submit a single API call and unintentionally obtain something like Table 5; they then try to extract something resembling Tables 1, 3a, & 3b. Although this can be successful, we strongly discourage it. The code is difficult to maintain and is not portable to REDCap projects with different instruments. (The code is really slow and ugly too.)

Our advice is to start before Table 5 is assembled --retrieve the information in a better way. Like other database systems, request the three tables separately from the server and then combine them on your desktop to fit your analyses if necessary.

Two approaches are appropriate for most scenarios:

  1. multiple calls to REDCapR's redcap_read(), or
  2. a single call to REDCapTidieR's redcap_read_tidy().

The code in the vignette requires the magrittr package for the %>% (alternatively you can use |> if you're using R 4.0.2 or later).

The vignette uses these credentials to retrieve the practice/fake dataset. This is not appropriate for datasets containing PHI or other sensitive information. Please see Part 2 - Retrieve Protected Token of the Typical REDCap Workflow for a Data Analyst vignette for secure approaches.

# Support pipes
library(magrittr)

# Retrieve token
path_credential <- system.file("misc/example.credentials", package = "REDCapR")
credential  <- REDCapR::retrieve_credential_local(
  path_credential = path_credential,
  project_id      = 3181
)

One REDCapR Call for Each Table

The tidy datasets represented in Tables 1, 3a, and 3b can be obtained by calling REDCapR three times --one call per table. Using the forms parameter, pass "intake" to get Table 1, "blood_pressure" to get Table 3a, and "laboratory" to get Table 3b.

Although it is not required, we recommend specifying a readr::cols() object to ensure the desired variable data types.

Retrieve patient-level table (corresponding to Table 1)

col_types_intake <-
  readr::cols_only(
    record_id                 = readr::col_integer(),
    height                    = readr::col_double(),
    weight                    = readr::col_double(),
    bmi                       = readr::col_double()
  )

ds_intake <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri, # From the previous code snippet.
    token       = credential$token,
    forms       = c("intake"),
    col_types   = col_types_intake,
    verbose     = FALSE,
  )$data

ds_intake

Retrieve patient-time-level tables (corresponding to Tables 3a & 3b)

col_types_blood_pressure <-
  readr::cols(
    record_id                 = readr::col_integer(),
    redcap_repeat_instrument  = readr::col_character(),
    redcap_repeat_instance    = readr::col_integer(),
    sbp                       = readr::col_double(),
    dbp                       = readr::col_double(),
    blood_pressure_complete   = readr::col_integer()
  )

ds_blood_pressure <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri,
    token       = credential$token,
    forms       = c("blood_pressure"),
    col_types   = col_types_blood_pressure,
    verbose     = FALSE
  )$data

ds_blood_pressure %>%
  tidyr::drop_na(redcap_repeat_instrument)

col_types_laboratory  <-
  readr::cols(
    record_id                 = readr::col_integer(),
    redcap_repeat_instrument  = readr::col_character(),
    redcap_repeat_instance    = readr::col_integer(),
    lab                       = readr::col_character(),
    conc                      = readr::col_character(),
    laboratory_complete       = readr::col_integer()
  )

ds_laboratory  <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri,
    token       = credential$token,
    forms       = c("laboratory"),
    col_types   = col_types_laboratory,
    verbose     = FALSE
  )$data

ds_laboratory %>%
  tidyr::drop_na(redcap_repeat_instrument)

Retrieve block tables (corresponding to Table 5)

If for some reason you need the block dataset through the API, one call will retrieve it.

ds_block <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri,
    token       = credential$token,
    col_types   = readr::cols(.default = readr::col_character()),
    verbose     = FALSE,
  )$data

ds_block

One REDCapTidieR Call for All Tables

REDCapTidieR's initial motivation is to facilitate longitudinal analyses and promote tidy data hygiene.

{Stephan Kadauke & Richard Hanna, please represent your package as you wish. Tell me if I've positioned it differently than you would have.}

Choosing between the Approaches

When retrieving data from REDCap, we recommend calling REDCapTidieR in many scenarios, such as:

However we recommend calling REDCapR in other scenarios. It could be worth calling REDCapR multiple times if:

If in doubt, start with REDCapTidieR. Escalate to REDCapR if your download time is too long and might be decreased by reducing the information retrieved from the server and transported across the network.

And of course many scenarios are solved best with a combination of both packages, such as (a) REDCapR populates the initial demographics in REDCap, (b) research staff enter measures collected from patients over time, (c) REDCapTidieR retrieves the complete longitudinal dataset, (d) dplyr joins the tibbles, and finally (e) lme4 tests hypotheses involving patient trajectories over time.

Escalating to REDCapR

Even if you think you'll need REDCapR's low-level control, consider starting with REDCapTidieR anyway. ...particularly if you are unsure how to specify the grain of each table. The structure of REDCapTidieR's tables easily compatible with conventional analyses. If you need the performance of REDCapR but are unsure how the tables should look, simply execute something like REDCapTidieR::redcap_read_tidy(url, project_token) and study its output. Then try to mimic it exactly with REDCapR::redcap_read() calls.

Finally, cull unwanted cells using the parameters of REDCapR::redcap_read(). These data points will not even leave the REDCap instance, which will improve performance. Some possible strategies include passing arguments to

Note that the efficiency gain from moving from the block dataset to REDCapTidieR is different than the gain from moving from REDCapTidieR to REDCapR. When moving to from Table 5 to a REDCapTidieR Supertibble, you are eliminating empty cells that will never contain worthwhile data. When moving from a REDCapTidieR Supertibble call to a collection of REDCapR calls, you are eliminating cells that contain data, but may not be relevant to your analysis (such as a patient's name or the time a lab specimen was collected). {This paragraph needs work.}

Advanced

Longitudinal

{TODO: Generalize the argument to longitudinal events}

| pt | repeat
instrument | repeat
instance | height | weight | bmi | sbp | dbp | lab | conc | | ---- | :----------------------: | ---------------------: | -------: | -------: | ----: | ----: | ----: | ----: | ------: | | 1 | -- | -- | 1.0 | 11.0 | 111.0 | -- | -- | -- | -- | | 1 | blood_pressure | 1 | -- | -- | -- | 1.1 | 11.1 | -- | -- | | 1 | blood_pressure | 2 | -- | -- | -- | 1.2 | 11.2 | -- | -- | | 1 | blood_pressure | 3 | -- | -- | -- | 1.3 | 11.3 | -- | -- | | 1 | laboratory | 3 | -- | -- | -- | -- | -- | aa3 | 1.3 ppm | | 1 | laboratory | 4 | -- | -- | -- | -- | -- | aa4 | 1.4 ppm | | 2 | -- | -- | 2.0 | 22.0 | 222.0 | -- | -- | -- | -- | | 2 | blood_pressure | 1 | -- | -- | -- | 2.1 | 22.1 | -- | -- | | 2 | blood_pressure | 2 | -- | -- | -- | 2.2 | 22.2 | -- | -- | | 2 | blood_pressure | 3 | -- | -- | -- | 2.3 | 22.3 | -- | -- | | 2 | laboratory | 3 | -- | -- | -- | -- | -- | bb3 | 2.3 ppm | | 2 | laboratory | 4 | -- | -- | -- | -- | -- | bb4 | 2.4 ppm |

Caching Results to Improve Performance

If escalating to REDCapR didn't decrease the duration enough, consider the strategy of calling REDCap only once per day (with either package) and saving the results to a secured disk. This will be efficient when the analyses require a large dataset, but not a real-time dataset.

In many retrospective projects, multiple analyses during the day can reuse the same dataset retrieved the night before (with a function like readr::write_rds()). This has helped our complicated investigations when multiple statisticians frantically tackle overlapping aspects before a funder's deadline.

Essentially you are transferring security responsibilities from REDCap to the file server. To balance its advantages, the approach's drawbacks include:

Caching Results for other Languages

Packages for other programming languages have been developed that access REDCap's API, such PyCap and PhpCap. Please see a more complete list at https://redcap-tools.github.io/projects/. The caching strategy described above may also benefit your investigation if:

Session Information

For the sake of documentation and reproducibility, the current report was rendered in the following environment. Click the line below to expand.

Environment

if (requireNamespace("sessioninfo", quietly = TRUE)) {
  sessioninfo::session_info()
} else {
  sessionInfo()
}



Try the REDCapR package in your browser

Any scripts or data that you put into this service are public.

REDCapR documentation built on Sept. 11, 2024, 7:46 p.m.