Home

/

GitHub

/

In stopsack/prostateredcap: Load and check the MSK Prostate REDCap Prostate clinical database

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Install the prostateredcap package

This step is only required once. To install (or update) the prostateredcap package from GitHub, use the remotes package:

install.packages("remotes")  # skip if 'remotes' package is already installed
remotes::install_github("stopsack/prostateredcap")

An example dataset

The prostateredcap R package contains an example dataset of the prostate cancer database in the same format as it would be exported from REDCap as a "labeled CSV." All data in the example dataset are designed to mimick real clinical data but do not correspond to any real patients.

First, load the dplyr package for data handling, and take a look at the raw example dataset provided as part of the prostateredcap package.

library(dplyr)

raw_data <- system.file("extdata",
                        "SampleGUPIMPACTDatab_DATA_LABELS_2021-05-26.csv",
                        package = "prostateredcap")

readr::read_csv(file = raw_data) %>% 
  print(max_extra_cols = 0)  # do not print all other columns

The dataset, as a typical REDCap export, contains multiple rows per person, with each of the REDCap "forms" (baseline data, sample data, ...) in a separate row and blank values for variables not part of that "form."

Loading the data

We will load the prostateredcap library, read in the same dataset again, and display its contents.

library(prostateredcap)

pts_smp <- load_prostate_redcap(raw_data)

Warnings that the example data, which has data on 8 patients, does not contain all tumor/stage combinations are expected.

load_prostate_redcap() has returned a list with two separate data elements:

pts, the data frame with patient-level data.
smp, the sample-level data frame. smp can have multiple rows per patient that can be merged with the patient-level data pts using the ptid variable present in both datasets.

The data in pts_smp is preprocessed. For example, rather than containing data on date of birth and date of diagnosis, the pts dataset contains age at diagnosis (age_dx, in years).

pts_smp$pts

pts_smp$smp

By default, the argument deidentify = TRUE is set in load_prostate_redcap(). Thus, any identifiers except the sample IDs, which are needed to merge in molecular data and are shared on cBioPortal, have been removed from the returned datasets.

Performing quality control

To help ensure data quality, the prostateredcap package contains the function check_prostate_redcap(), which further processes the output of load_prostate_redcap() (in our example, pts_smp):

A set of internal consistency checks is run on the pts and on the smp dataset.
For each step, the number of records that pass and that are being excluded based on each criterion are recorded.
The pts and smp datasets are returned, excluding samples that do not pass a given level of internal consistency checks. Exclusion of samples that do not pass checks can be disabled altogether.
To obtain only those derived variables that are recommended for analyses, use check_prostate_redcap(recommended_only = TRUE).

Passing the data to check_prostate_redcap() with default parameters and reviewing the number of records that do not pass checks:

pts_smp_qcd <- pts_smp %>%
  check_prostate_redcap(recommended_only = TRUE)

pts_smp_qcd$qc_pts

Accessing pts_smp_qcd$qc_pts shows that 1 record failed on criterion 3 that filtered for records with missing data of birth or missing date of diagnosis. This record is excluded from the final "quality-controlled" return dataset, pts_smp_qcd$pts. Instead of r nrow(pts_smp$pts) records before quality control, this dataset only includes records on r nrow(pts_smp_qcd$pts) patients.
For the sample dataset smp, QC results are accessible as pts_smp_qcd$qc_smp and the final version via pts_smp_qcd$smp. The first step for sample-level data (with index == 2) is to check whether corresponding patient-level passed quality control.
The criteria used by check_prostate_redcap() are defined in qc_criteria_pts() and qc_criteria_smp() and can be modified as needed.
By running check_prostate_redcap(qc_level_pts = 1, qc_level_smp = 1), no exclusions will be performed. Provide different levels than 1 to define the last QC criterion to use for exclusions. qc_pts and qc_smp will still display what effect of all steps on the data would be.
The sequential exclusions and respective counts available in qc_pts and qc_smp can be used in study flowcharts of patient inclusion/exclusion.

Running analyses

The data are now ready to be used for analyses. For example, the sample data and patient data can be merged into one data frame.

inner_join(pts_smp_qcd$pts,
           pts_smp_qcd$smp,
           by = "ptid") %>%
  rmarkdown::paged_table()  # print formatted version

See the data dictionary of all derived variables recommended for analyses.

stopsack/prostateredcap documentation built on June 3, 2023, 12:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

stopsack/prostateredcap
Load and check the MSK Prostate REDCap Prostate clinical database

In stopsack/prostateredcap: Load and check the MSK Prostate REDCap Prostate clinical database

Install the prostateredcap package

An example dataset

Loading the data

Performing quality control

Running analyses

R Package Documentation

Browse R Packages

We want your feedback!

stopsack/prostateredcap Load and check the MSK Prostate REDCap Prostate clinical database

In stopsack/prostateredcap: Load and check the MSK Prostate REDCap Prostate clinical database

Install the prostateredcap package

An example dataset

Loading the data

Performing quality control

Running analyses

R Package Documentation

Browse R Packages

We want your feedback!

stopsack/prostateredcap
Load and check the MSK Prostate REDCap Prostate clinical database