vignettes/vignette_data.md

REDCap Data for R

REDCap is a fantastic database, however the ability to export data is limited to the “raw” data (e.g. factors stored as numbers) or “labelled” data (e.g. factors stored as characters). While code is able to be obtained to convert data into the appropriate format, this the unwieldy and needs to be refershed if the underlying project design is changed.

The redcap_data() function provides a simple way to export, clean, and format data ready for analysis in R. This utilities data available in the metadata of the project to ensure numeric data is numeric, factors are factors in the appropriate order, dates are dates objects, etc. It remains aligned to the data on REDCap despite any changes in the project design.

redcap <- collaborator::redcap_data(redcap_project_uri = Sys.getenv("collaborator_test_uri"),
                          redcap_project_token = Sys.getenv("collaborator_test_token"),
                          include_original = T)

There are 3 potential outputs from this function:

  1. data: The cleaned and formatted REDCap dataset:
knitr::kable(redcap$data %>% head(5))
record\_id redcap\_data\_access\_group dmy\_hms enrol\_tf enrol\_signature pt\_age pt\_sex smoking\_status body\_mass\_index pmh\_\_\_1 pmh\_\_\_2 pmh\_\_\_3 asa\_grade pt\_ethnicity pt\_ethnicity\_other adm\_date adm\_vas time2op op\_urgency op\_procedure\_code follow\_up follow\_up\_readm follow\_up\_mort file redcap\_repeat\_instance crp\_yn crp\_value day hb\_value 1 hospital\_a NA NA FALSE 45 Male Current smoker 22 Ischaemic Heart Disease (IHD) Chronic Obstructive Pulmonary Disease (COPD) NA V White NA 2018-07-29 NA NA Elective 0D9J00Z Yes Yes Yes TRUE 1 Yes 120 1 100 1 hospital\_a NA NA FALSE 45 Male Current smoker 22 Ischaemic Heart Disease (IHD) Chronic Obstructive Pulmonary Disease (COPD) NA V White NA 2018-07-29 NA NA Elective 0D9J00Z Yes Yes Yes TRUE 2 Yes 100 2 110 1 hospital\_a NA NA FALSE 45 Male Current smoker 22 Ischaemic Heart Disease (IHD) Chronic Obstructive Pulmonary Disease (COPD) NA V White NA 2018-07-29 NA NA Elective 0D9J00Z Yes Yes Yes TRUE 3 NA NA 3 NA 1 hospital\_a NA NA FALSE 45 Male Current smoker 22 Ischaemic Heart Disease (IHD) Chronic Obstructive Pulmonary Disease (COPD) NA V White NA 2018-07-29 NA NA Elective 0D9J00Z Yes Yes Yes TRUE 4 NA NA 4 140 2 hospital\_a NA NA FALSE 23 Female NA NA NA Chronic Obstructive Pulmonary Disease (COPD) NA V Black / African / Caribbean / Black British NA 2018-07-30 NA NA Emergency 0D9J0ZZ Yes Yes Yes TRUE 1 NA NA NA NA
  1. metadata: The metadata used to create data.
knitr::kable(redcap$metadata %>% head(5))
variable\_name class form\_name matrix\_name variable\_type variable\_validation variable\_validation\_min variable\_validation\_max branch\_logic variable\_identifier variable\_required variable\_label select\_choices\_or\_calculations factor\_level factor\_label arm redcap\_event\_name form\_repeat record\_id character example\_data NA text NA NA NA NA No No Record ID NA NULL NULL NA NA No redcap\_repeat\_instrument NA NA NA NA NA NA NA NA NA NA redcap\_repeat\_instrument NA NULL NULL NULL NULL No redcap\_repeat\_instance NA NA NA NA NA NA NA NA NA NA redcap\_repeat\_instance NA NULL NULL NULL NULL No redcap\_data\_access\_group factor NA NA NA NA NA NA NA NA NA REDCap Data Access Group NA NULL NULL NULL NULL No dmy\_hms datetime example\_data NA text datetime\_seconds\_dmy NA NA NA No Yes Time of entry NA NULL NULL NA NA No
  1. original: An optional output showing the raw dataset extracted from REDCap (included if include_original = T):
knitr::kable(redcap$original %>% head(5))
record\_id redcap\_repeat\_instrument redcap\_repeat\_instance redcap\_data\_access\_group dmy\_hms enrol\_tf enrol\_signature pt\_age pt\_sex smoking\_status body\_mass\_index pmh\_\_\_1 pmh\_\_\_2 pmh\_\_\_3 asa\_grade pt\_ethnicity pt\_ethnicity\_other adm\_date adm\_vas time2op op\_urgency op\_procedure\_code follow\_up follow\_up\_readm follow\_up\_mort crp\_yn crp\_value day hb\_value file 1 NA NA hospital\_a NA NA NA 45 1 1 22 1 1 0 5 4 NA 2018-07-29 NA NA 1 0D9J00Z 1 1 1 NA NA NA NA 1\_result.csv 1 crp 1 hospital\_a NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 120 NA NA NA 1 crp 2 hospital\_a NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 100 NA NA NA 1 hb 1 hospital\_a NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 100 NA 1 hb 2 hospital\_a NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 110 NA

Handling Repeating Instruments

You may note above that the structure of redcap$data and redcap$original are different, and that there are multiple rows with the record ID 1. This is because the project includes repeating instruments (forms that can be completed repeatedly to facilitate longitudinal data collection) - this add complexity to how the data is structured / has to be handled.

The default for both the original and formatted data is to provide the data in a “long” format - one row per repeating instrument:

knitr::kable(redcap$data %>% dplyr::select(record_id,pt_age:pt_sex, redcap_repeat_instance:last_col())%>% head(5))
record\_id pt\_age pt\_sex redcap\_repeat\_instance crp\_yn crp\_value day hb\_value 1 45 Male 1 Yes 120 1 100 1 45 Male 2 Yes 100 2 110 1 45 Male 3 NA NA 3 NA 1 45 Male 4 NA NA 4 140 2 23 Female 1 NA NA NA NA

However if you require 1 record per row for analysis (the majority of cases), but wish to keep data from ALL repeating instruments you can easily change the data structure by applying redcap_format_repeat() to redcap_data()$data or specify in redcap_data()upfront using the format argument. This has 2 options instead of “long” format:

  1. wide: The repeating instruments are all transposed and numbered accordingly. The consistent naming scheme allows re-conversion to a long format using tidyr::pivot_longer().
redcap$data %>%
  redcap_format_repeat(format = "wide") %>%
  dplyr::select(record_id, contains("instance")) %>%
  knitr::kable()
record\_id crp\_yn\_instance1 crp\_yn\_instance2 crp\_yn\_instance3 crp\_yn\_instance4 crp\_value\_instance1 crp\_value\_instance2 crp\_value\_instance3 crp\_value\_instance4 day\_instance1 day\_instance2 day\_instance3 day\_instance4 hb\_value\_instance1 hb\_value\_instance2 hb\_value\_instance3 hb\_value\_instance4 1 Yes Yes NA NA 120 100 NA NA 1 2 3 4 100 110 NA 140 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA 1 NA 3 NA 110 NA 120 NA 4 Yes No Yes NA NA NA 120 NA NA NA NA NA NA NA NA NA 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 12 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 16 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 17 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 18 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 19 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 20 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 21 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 22 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 23 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 24 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 26 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 27 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 28 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 29 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 30 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 31 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 32 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 33 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 34 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 35 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 36 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 37 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 38 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 39 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 40 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 41 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 42 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 43 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 44 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 45 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 46 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 47 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 48 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 49 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 50 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
  1. list: The repeating instruments are all stored as a nested list for each record (more efficent storage of data). This can be unnested at a later point using tidyr::unnest().
redcap$data %>%
  redcap_format_repeat(format = "list") %>%
  dplyr::select(record_id, redcap_repeat_instance:last_col()) %>%
  knitr::kable()
record\_id redcap\_repeat\_instance crp\_yn crp\_value day hb\_value 1 1, 2, 3, 4 2, 2, NA, NA 120, 100, NA, NA 1, 2, 3, 4 100, 110, NA , 140 2 1 NA NA NA NA 3 1, 2, 3 NA, NA, NA NA, NA, NA 1 , NA, 3 110, NA , 120 4 1, 2, 3 2, 1, 2 NA, NA, 120 NA, NA, NA NA, NA, NA 5 1 NA NA NA NA 6 1 NA NA NA NA 7 1 NA NA NA NA 8 1 NA NA NA NA 9 1 NA NA NA NA 10 1 NA NA NA NA 11 1 NA NA NA NA 12 1 NA NA NA NA 13 1 NA NA NA NA 14 1 NA NA NA NA 15 1 NA NA NA NA 16 1 NA NA NA NA 17 1 NA NA NA NA 18 1 NA NA NA NA 19 1 NA NA NA NA 20 1 NA NA NA NA 21 1 NA NA NA NA 22 1 NA NA NA NA 23 1 NA NA NA NA 24 1 NA NA NA NA 25 1 NA NA NA NA 26 1 NA NA NA NA 27 1 NA NA NA NA 28 1 NA NA NA NA 29 1 NA NA NA NA 30 1 NA NA NA NA 31 1 NA NA NA NA 32 1 NA NA NA NA 33 1 NA NA NA NA 34 1 NA NA NA NA 35 1 NA NA NA NA 36 1 NA NA NA NA 37 1 NA NA NA NA 38 1 NA NA NA NA 39 1 NA NA NA NA 40 1 NA NA NA NA 41 1 NA NA NA NA 42 1 NA NA NA NA 43 1 NA NA NA NA 44 1 NA NA NA NA 45 1 NA NA NA NA 46 1 NA NA NA NA 47 1 NA NA NA NA 48 1 NA NA NA NA 49 1 NA NA NA NA 50 1 NA NA NA NA

Generating a Simple, Easily-Shareable Data Dictionary

The function data_dict() can be used to generate an easily sharable and informative data dictionary for an R dataframe. Unlike the str() function typically used to display the internal structure of dataframes in R, this produces a dataframe alongside summarising information relevant to the class of variable, and the proportion of missing data (NA) within each variable.

This can be useful in quickly understanding how data is structured within the dataset, and in assessing data quality (e.g. outlying and incorrect or quantity of missing values). This can be easily exported from R and shared as a spreadsheet.

Requirements

The data_dict() function can be applied to any dataframe object. At present, it supports the following classes (other classes will be shown as “Class not supported” in the values column):

Output

The data_dict() function produces a dataframe which identifies the class, summarised values, and proportion of missing data for each variable in the original dataframe.

The output can be easily converted to a spreadsheet file (e.g. csv file) and exported for sharing. Let’s use the data extracted above.

data <- redcap$data %>% redcap_format_repeat(format = "wide")

data_dict(data) %>%
  knitr::kable()
variable class value na\_pct record\_id character 50 Unique: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 0.0% redcap\_data\_access\_group factor 8 Levels: hospital\_a, hospital\_b, hospital\_c, hospital\_d, hospital\_e, hospital\_f, hospital\_g, hospital\_h 0.0% enrol\_tf logical NA, NA, NA, NA, NA, NA, NA, NA, NA, NA 100.0% enrol\_signature logical FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE 0.0% pt\_age numeric Mean: 46; Median: 44.5; Range: 15 to 79 0.0% pt\_sex factor 2 Levels: Male, Female 4.0% smoking\_status factor 3 Levels: Current smoker, Ex-smoker, Non-smoker 4.0% body\_mass\_index numeric Mean: 30.3; Median: 30; Range: 21 to 47 22.0% pmh\_\_\_1 factor 1 Levels: Ischaemic Heart Disease (IHD) 48.0% pmh\_\_\_2 factor 1 Levels: Chronic Obstructive Pulmonary Disease (COPD) 62.0% pmh\_\_\_3 factor 1 Levels: Diabetes Mellitus 56.0% asa\_grade factor 5 Levels: I, II, III, IV, V 6.0% pt\_ethnicity factor 5 Levels: Asian / Asian British, Black / African / Caribbean / Black British, Mixed / Multiple ethnic groups, White, Other ethnic group 0.0% pt\_ethnicity\_other character 1 Unique: NA 100.0% adm\_date Date Range: 2018-07-29 to 2018-08-11 0.0% adm\_vas numeric Mean: NaN; Median: NA; Range: Inf to -Inf 100.0% time2op numeric Mean: NaN; Median: NA; Range: Inf to -Inf 100.0% op\_urgency factor 2 Levels: Elective, Emergency 0.0% op\_procedure\_code character 20 Unique: 0D9J00Z, 0D9J0ZZ, 0D9J40Z, 0D9J4ZZ, 0DQJ0ZZ, 0DQJ4ZZ, 0DTJ0ZZ, 0DTJ4ZZ, 0F140D3, 0F140D5 0.0% follow\_up factor 2 Levels: No, Yes 0.0% follow\_up\_readm factor 2 Levels: No, Yes 30.0% follow\_up\_mort factor 2 Levels: No, Yes 32.0% file logical TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE 0.0% crp\_yn\_instance1 factor 2 Levels: No, Yes 96.0% crp\_yn\_instance2 factor 2 Levels: No, Yes 96.0% crp\_yn\_instance3 factor 2 Levels: No, Yes 98.0% crp\_yn\_instance4 factor 2 Levels: No, Yes 100.0% crp\_value\_instance1 numeric Mean: 120; Median: 120; Range: 120 to 120 98.0% crp\_value\_instance2 numeric Mean: 100; Median: 100; Range: 100 to 100 98.0% crp\_value\_instance3 numeric Mean: 120; Median: 120; Range: 120 to 120 98.0% crp\_value\_instance4 numeric Mean: NaN; Median: NA; Range: Inf to -Inf 100.0% day\_instance1 character 2 Unique: 1, NA 96.0% day\_instance2 character 2 Unique: 2, NA 98.0% day\_instance3 character 2 Unique: 3, NA 96.0% day\_instance4 character 2 Unique: 4, NA 98.0% hb\_value\_instance1 character 3 Unique: 100, NA, 110 96.0% hb\_value\_instance2 character 2 Unique: 110, NA 98.0% hb\_value\_instance3 character 2 Unique: NA, 120 98.0% hb\_value\_instance4 character 2 Unique: 140, NA 98.0%

Through summarising the variables, data will not necessarily be linkable to individual patients (bar in the circumstance where variable(s) contain a direct patient identifier e.g. Community Health Index (CHI) Number, hospital numbers, etc).

However, should any variable(s) (such as a direct patient identifier) be desirable to exclude from the output, this can be achieved using the “var_exclude” parameter.

knitr::kable(collaborator::data_dict(data, var_exclude = c("id_num","sex")))
variable class value na\_pct record\_id character 50 Unique: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 0.0% redcap\_data\_access\_group factor 8 Levels: hospital\_a, hospital\_b, hospital\_c, hospital\_d, hospital\_e, hospital\_f, hospital\_g, hospital\_h 0.0% enrol\_tf logical NA, NA, NA, NA, NA, NA, NA, NA, NA, NA 100.0% enrol\_signature logical FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE 0.0% pt\_age numeric Mean: 46; Median: 44.5; Range: 15 to 79 0.0% pt\_sex factor 2 Levels: Male, Female 4.0% smoking\_status factor 3 Levels: Current smoker, Ex-smoker, Non-smoker 4.0% body\_mass\_index numeric Mean: 30.3; Median: 30; Range: 21 to 47 22.0% pmh\_\_\_1 factor 1 Levels: Ischaemic Heart Disease (IHD) 48.0% pmh\_\_\_2 factor 1 Levels: Chronic Obstructive Pulmonary Disease (COPD) 62.0% pmh\_\_\_3 factor 1 Levels: Diabetes Mellitus 56.0% asa\_grade factor 5 Levels: I, II, III, IV, V 6.0% pt\_ethnicity factor 5 Levels: Asian / Asian British, Black / African / Caribbean / Black British, Mixed / Multiple ethnic groups, White, Other ethnic group 0.0% pt\_ethnicity\_other character 1 Unique: NA 100.0% adm\_date Date Range: 2018-07-29 to 2018-08-11 0.0% adm\_vas numeric Mean: NaN; Median: NA; Range: Inf to -Inf 100.0% time2op numeric Mean: NaN; Median: NA; Range: Inf to -Inf 100.0% op\_urgency factor 2 Levels: Elective, Emergency 0.0% op\_procedure\_code character 20 Unique: 0D9J00Z, 0D9J0ZZ, 0D9J40Z, 0D9J4ZZ, 0DQJ0ZZ, 0DQJ4ZZ, 0DTJ0ZZ, 0DTJ4ZZ, 0F140D3, 0F140D5 0.0% follow\_up factor 2 Levels: No, Yes 0.0% follow\_up\_readm factor 2 Levels: No, Yes 30.0% follow\_up\_mort factor 2 Levels: No, Yes 32.0% file logical TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE 0.0% crp\_yn\_instance1 factor 2 Levels: No, Yes 96.0% crp\_yn\_instance2 factor 2 Levels: No, Yes 96.0% crp\_yn\_instance3 factor 2 Levels: No, Yes 98.0% crp\_yn\_instance4 factor 2 Levels: No, Yes 100.0% crp\_value\_instance1 numeric Mean: 120; Median: 120; Range: 120 to 120 98.0% crp\_value\_instance2 numeric Mean: 100; Median: 100; Range: 100 to 100 98.0% crp\_value\_instance3 numeric Mean: 120; Median: 120; Range: 120 to 120 98.0% crp\_value\_instance4 numeric Mean: NaN; Median: NA; Range: Inf to -Inf 100.0% day\_instance1 character 2 Unique: 1, NA 96.0% day\_instance2 character 2 Unique: 2, NA 98.0% day\_instance3 character 2 Unique: 3, NA 96.0% day\_instance4 character 2 Unique: 4, NA 98.0% hb\_value\_instance1 character 3 Unique: 100, NA, 110 96.0% hb\_value\_instance2 character 2 Unique: 110, NA 98.0% hb\_value\_instance3 character 2 Unique: NA, 120 98.0% hb\_value\_instance4 character 2 Unique: 140, NA 98.0%

kamclean/collaborator documentation built on Nov. 17, 2023, 3:52 a.m.