```{r}
knitr::opts_chunk$set(comment = "##",
                      tidy = FALSE, # set to "styler" to use styler::style_text() to reformat code
                      tidy.opts = list(blank = FALSE, width.cutoff = 40),
                      echo = TRUE,
                      eval = TRUE,
                      cache = TRUE,
                      cache.path = "cache/",
                      child = NULL, # file/s to knit and then include
                      collapse = FALSE, # collapse all output into a single block
                      error = TRUE, # display error messages in doc. FALSE stops render when an error is thrown
                      fig.align = "center", # left, right, center, or default
                      include = TRUE, # include chunk?
                      message = TRUE, # display code messages?
                      warning = TRUE, # include warnings?
                      results = "markup"
                        # "asis": passthrough results
                        # "hide": do not display results
                        # "hold": put all results below all code
                      )
```

```{r}
library(tidyverse)
```

## Introduction

To adequately assess the data quality of a clinical dataset, the following overarching attributes about the data as a whole should be taken into account:

(@) Cohort definition: the population being studied, which is characterized by questions such as:

* _Adult or pediatric population?_
* _What are the inclusion criteria?_
* _What are the exclusion criteria?_

An example of a cohort definition is "All adult HIV-positive patients who were admitted to the ICU at MSK".

(@) Timeline: the timeframe that the study entails. Extending the example above, the timeline for this study could be the 1-year period following each patient's first HIV-related hospitalization, over which clinical outcomes are assessed.

Cohort definition and timeline are the bare minimum knowledge required for a data-driven research study, and confirming both should be the first step in the QA process.
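
As a minimal sketch, the cohort definition and timeline can be recorded as study-level metadata so that they travel alongside the QA results; the structure and field names below are illustrative assumptions, not part of the framework.

```r
# Hypothetical study-level metadata capturing the cohort definition and
# timeline before any field-level QA begins (names and values are illustrative)
study_metadata <-
  list(
    cohort_definition  = "All adult HIV-positive patients admitted to the ICU at MSK",
    inclusion_criteria = c("age >= 18 years", "HIV-positive", "ICU admission"),
    exclusion_criteria = c("age < 18 years"),
    timeline           = "1 year following the first HIV-related hospitalization"
  )

study_metadata
```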

## Principles

### Datatype QA Matrix

Once the preliminary research study characterization is established, the Framework takes the R class of each field in the dataset; each R class is associated with a unique data quality pipeline based on the permitted values within the given field. For the purposes of this brief demonstration, rules are applied to the limited set of datatypes category, text, float, integer, date, and datetime. In reality, many additional datatypes may exist, such as time for a timestamp without a date, or large integers that could be considered the equivalent of bigint in SQL.

In addition to the datatypes, there are two types of flags:

* **Hard Flag**: a clinically impossible value. For example, a birthdate that takes place in the future is clinically impossible.
* **Soft Flag**: a clinically improbable value, such as an extremely high white blood cell count of 30000.
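
As a minimal sketch of how the two flag types might be evaluated (the helper names and the cutoff below are illustrative assumptions, not rules from the matrix):

```r
# Illustrative flag checks; the WBC cutoff is an assumption for demonstration only
flag_birth_date <- function(birth_date) {
  # Hard flag: a birthdate in the future is clinically impossible
  ifelse(as.Date(birth_date) > Sys.Date(), "hard_flag", "pass")
}

flag_wbc <- function(wbc, upper_limit = 30000) {
  # Soft flag: an extremely high white blood cell count is improbable but not
  # impossible, so the value is only marked for review
  ifelse(wbc >= upper_limit, "soft_flag", "pass")
}

flag_birth_date(c("1985-06-01", "2150-01-01"))
flag_wbc(c(7000, 45000))
```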

```{r, echo=FALSE}
datatype_matrix <-
  tibble::tribble(
    ~datatype, ~r_class, ~rule, ~soft_flag, ~hard_flag,
    "identifier", "factor", "value in established valueset", "", "value not in valueset",
    "category", "factor", "value in established constraints", "value not in constraints", "",
    "text", "character", "", "", "",
    "float", "c('numeric', 'double')", "only numeric characters with maximum 1 decimal point", "greater or less than 2.5 standard deviations from the mean", "",
    "integer", "integer", "only numeric characters with 0 decimal point", "greater or less than 2.5 standard deviations from the mean", "",
    "date", "Date", "maximum 8 and minimum 4 numeric characters, maximum 2 punctuation characters", "greater or less than 2.5 standard deviations from the mean", "future date",
    "datetime", "POSIXct", "maximum 6 numeric characters, maximum 2 punctuation characters", "", "times greater than or equal to 24:00")

kableExtra::kable(
  x = datatype_matrix,
  caption = "Matrix that maps a `datatype` of a field in a dataframe to the R object class (`r_class`). Each field has a single `datatype` that is constrained to a specific R object class and a set of additional rules associated with the datatype (`rule`). A `soft_flag` represents clinically improbable values that require confirmation that the value is not due to an error such as a typo. A `hard_flag` is a flag where the value is clinically impossible and must be corrected, such as a date value that occurs in the future. Note that a matrix such as this one would require regular refinement to accommodate nuances, such as a dataset that includes a field like date of next appointment, which will contain dates that may legitimately occur in the future. This is an example of where a new datatype allotting future dates may need to be introduced to avoid hard flags for acceptable values.")
```

<br>  

It is important to note that the **category** datatype requires special
handling. Each field of this datatype is defined by a vector of permissible
values, also known as a *valueset*. Therefore, a prerequisite for a quality
process for this datatype is knowing the valueset that constrains the range of
possible values. From a maintenance perspective, *category* fields require
iterative updates as new allowable values are added to the valueset.
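
As a minimal sketch of that maintenance cycle (the valueset and the added value below are illustrative assumptions):

```r
# Illustrative valueset maintenance for a category field
gender_valueset <- c("FEMALE", "MALE")

# When review approves a new allowable value, the valueset is extended rather
# than rebuilt, keeping earlier QA runs reproducible
newly_approved <- c("UNKNOWN")
gender_valueset <- union(gender_valueset, newly_approved)

gender_valueset
```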

The purpose of applying these fundamental rules is to confirm the data integrity
before moving forward with more complex data quality rules. For example, a more
complex rule may be one where the *date of death* must be preceded by the *date
of birth*. However, it is the foundational rule on the datatype *date* that
preliminarily confirms that all *date of death* and *date of birth* data are in
a parseable format, that values falling outside the 95th percentile in either
direction have been reviewed and confirmed, and that particular attention is
paid to any *date of death* or *date of birth* occurring in the future.
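
A minimal sketch of such a complex rule, written against hypothetical `birth_date` and `death_date` columns rather than fields from the sample data, might look like:

```r
# Illustrative cross-field rule, applied only after the foundational date
# checks have confirmed that both fields parse as valid, non-future dates
library(dplyr)

patients <-
  tibble::tibble(
    person_id  = 1:3,
    birth_date = as.Date(c("1950-03-01", "1972-11-15", "1988-07-04")),
    death_date = as.Date(c("2019-05-20", NA, "1985-01-01"))  # last row is an error
  )

# Rows where the date of death precedes the date of birth violate the rule
patients %>%
  filter(!is.na(death_date), death_date < birth_date)
```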


### Surveying Source Data

Following the matrix above, each field in the clinical dataset should be
surveyed and assigned one of the datatypes. Mapping a source data field to a
datatype can be demonstrated using the 100 sample records from the `Condition
Occurrence` table in **Polyester**, a database of synthetic clinical data
generated by **Synthea** and ETL'd into the **OMOP Common Data Model** ([learn
more](https://meerapatelmd.github.io/polyester2/)).

```{r}
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
str(condition_occurrence)
```

The survey above serves as a guide for assigning the appropriate datatype to each column. A few notes on the challenges faced when applying data quality rules to real-world clinical data:

  1. Though the `gender` field was read into the R environment as the character class, the result of our survey indicates that it should be of the factor class and thus be assigned the category datatype.
  2. A similar, debatable example is whether to consider `condition_source` a text or category datatype. This depends on whether the field came from an abstraction or from structured EHR data capture. Here, I have chosen to treat it as a category.

```{r}
condition_occurrence_metadata <-
  tibble::tribble(
    ~field, ~datatype,
    "person_id", "identifier",
    "gender", "category",
    "condition_start_date", "date",
    "condition_end_date", "date",
    "condition_source", "category")

kableExtra::kable(x = condition_occurrence_metadata,
                  caption = "Map between the fields in the sample of 100 records from the `Condition Occurrence` table in **Polyester** and the assigned datatype.")
```


## Application

To apply the data quality rules, the fields can be grouped by assigned datatype, and each grouping can be sent to its respective data quality pipeline as defined by the rule in the datatype matrix.

```{r}
condition_occurrence_datatypes <-
  split(condition_occurrence_metadata, condition_occurrence_metadata$datatype) %>%
  map(select, -datatype) %>%
  map(unlist) %>%
  map(unname)

condition_occurrence_datatypes
```

### Datatype: Identifier

#### Hard Flag

Fields of datatype identifier should contain unique values whose count equals the expected sample size. The programmatic rule confirms that all of the expected unique identifiers are found in the data. In this example, the unique identifiers are the sequence from 1 to 100 (`1:100`).

```{r}
condition_occurrence <-
  readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")

# Report identifiers found in the data but not in the expected valueset (1:100),
# along with expected identifiers missing from the data
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$identifier)) %>%
  rubix::tibble_to_list() %>%
  map(as.character) %>%
  map(function(x) c(x[!(x %in% as.character(1:100))],
                    as.character(1:100)[!(as.character(1:100) %in% x)])
      )
```

#### Results

Condition Occurrence failed the hard flag that dictates that all values within the identifier datatype must be found within a predefined valueset. In this case, `person_id` 54 is missing from the data, triggering a hard flag to ensure that 54 is brought back into the source dataset.

### Datatype: Date

#### Hard Flag

Fields of datatype date should not have any dates that occur in the future.

```{r}
condition_occurrence <-
  readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")

# Return any dates that occur in the future
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$date)) %>%
  rubix::tibble_to_list() %>%
  map(as.Date) %>%
  map(~ .[centipede::no_na(.) > Sys.Date()])
```

#### Results

Condition Occurrence did not fail the hard flag that dictates that fields of the date datatype cannot contain any dates that occur in the future.

#### Soft Flag

Values of datatype date that fall outside the 95th percentile in either direction should be reviewed for accuracy.

```{r}
condition_occurrence <-
  readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")

condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$date)) %>%
  rubix::tibble_to_list() %>%
  map(as.Date) %>%
  map(~ summary(.))
```

#### Results

Condition Occurrence passed the soft flag that dictates that all values outside the 95th percentile should be reviewed for accuracy.

### Datatype: Category

The category datatype requires the most user feedback in this framework because it depends on a set of predetermined constraints that ultimately determine the degree of flagging that will occur. In this sample data, there are 2 fields of this datatype: `gender` and `condition_source`. Each requires a vector of constraints for the QA process.

```{r}
constraints <-
  list(gender = c("FEMALE", "MALE"),
       condition_source = c("Fatigue",
                            "Otitis media",
                            "Loss of taste",
                            "Streptococcal sore throat",
                            "Acute bacterial sinusitis",
                            "Cough",
                            "Viral sinusitis",
                            "Fever",
                            "Acute bronchitis",
                            "Acute viral pharyngitis"))
```

#### Hard Flag

Fields of datatype category should only contain values that fall within their constraints.

```{r}
condition_occurrence <-
  readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")

# Survey the distinct values present in each category field
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$category)) %>%
  rubix::tibble_to_list() %>%
  map(as.factor) %>%
  map(levels) %>%
  map(unique)

# Report any values that fall outside the predefined constraints
list(constraints = constraints,
     condition_occurrence = condition_occurrence) %>%
  transpose() %>%
  map(function(x) x$condition_occurrence[!(x$condition_occurrence %in% x$constraints)]) %>%
  map(unique) %>%
  keep(~ length(.) > 0)
```

