knitr::opts_chunk$set(comment = "##", tidy = FALSE, #`styler` to use styler:style_text() to reformat code tidy.opts = list(blank = FALSE, width.cutoff = 40), echo = TRUE, eval = TRUE, cache = TRUE, cache.path = "cache/", child = NULL, #file/s to knit and then include, collapse = FALSE, #collapse all output into a single block, error = TRUE, #display error messages in doc. FALSE stops render when error is thrown fig.align = "center", #left, right, center, or default include = TRUE, #include chunk? message = TRUE, #display code messages? tidy = TRUE, #tidy code warning = TRUE, #include warnings? results = "markup" # "asis": passthrough results # "hide": do not display results # "hold": put all results below all code ) library(tidyverse)
To adequately assess the data quality of a clinical dataset, the following overarching attributes about the data as a whole should be taken into account:
(@) Cohort definition: that population that is being studied, which are answered by questions such as:
* _Adult or pediatric population?_ * _What is the inclusion criteria?_ * _What is the exclusion criteria?_
An example of a cohort definition is "All the adult HIV-positive patients that were admitted to the ICU at MSK".
(@) Timeline: the timeframe that the study entails. Extending the example given, the timeline for this study can be the clinical outcomes of HIV-positive patients 1 year after their first HIV-related hospitalization.
Cohort Definition and Timeline serve as the bare minimum knowledge for a data-driven research study and confirmation of both should be the first step in the QA process.
Once the preliminary research study characterization is established, the Framework itself will take the R class of each field in the dataset, and each R class is associated with a unique data quality pipeline related to the permitted values within the given field. For the purposes of this brief demonstration, rules are applied on the limited set of datatypes of category, text, float, integer, date, and datetime. In reality, many additional datatypes may exist, such as time for a timestamp without a date, or large integers that could be considered the equivalent to bigint in SQL.
In addition to the datatypes, there are two types of flags:
Hard Flag: clinically impossible value. For example, a birthdate that takes
place in the future is clinically impossible.
Soft Flag: clinically improbable value, such as an extremely high white
blood cell count of 30000.
`` {r, echo=FALSE}
datatype_matrix <-
tibble::tribble(
~datatype, ~r_class, ~rule, ~soft_flag,~hard_flag,
"identifier", "factor", "value in established valueset", "", "value not in valueset",
"category", "factor", "value in established constraints", "value not in constraints", "",
"text", "character", "", "", "",
"float", "c('numeric', 'double')", "only numeric characters with maximum 1 decimal point", "greater or less than 2.5 standard deviations from the mean", "",
"integer", "integer", "only numeric characters with 0 decimal point", "greater or less than 2.5 standard deviations from the mean", "",
"date", "Date", "maximum 8 and minimum 4 numeric characters, maximum 2 punctuation characters", "greater or less than 2.5 standard deviations from the mean", "future date",
"datetime", "POSIXct", "maximum 6 numeric characters, maximum 2 punctuation characters", "", "times greater than or equal to 24:00")
kableExtra::kable(
x = datatype_matrix,
caption = "Matrix that maps a
datatypeof a field in a dataframe to the R object class
(
r_class). Each field has a single
datatypethat is constrained to a
specific R object class and a set of additional rules associated with the
datatype (
rule). A
soft_flagrepresents clinically improbable values that
will require confirmation that the value is not due to error such as a typo. A
hard_flag` is a flag where the value is clinically impossible and must be
corrected such as a date value that occurs in the future. Note that a matrix
such as this one would require regular refinement to accommodate for nuances
such as if a dataset includes a field such as date of next appointment, which
will contain dates that may occur in the future. This is an example of where a
new datatype alloting future dates may need to be introduced to avoid any hard
flags for acceptable values.")
<br> It is important to note that the **category** datatype requires special handling. Each field of this datatype is defined by a vector of permissible values, also known as a *valueset*. Therefore a prequisite for a quality process for this datatype is knowing the valueset that constrains the range of possible values. From a maintenance perspective, *category* fields require iterative updates with new allowable values added to the valueset. The purpose of applying these fundamental rules is to confirm the data integrity before moving forward with more complex data quality rules. For example, a more complex rule may be one where the *date of death* must be preceded by a *date of birth*. However, it is the foundational rule on the datatype *date* that preliminarily confirms that all of *the date of death* and *date of birth* data are in a parseable format, values that fall above the 95th percentile in either direction have been reviewed and confirmed, and particular attention is paid to any data that shows that *date of death* and *date of birth* occurring in the future. ### Surveying Source Data Following the matrix above, each field in the clinical dataset should be surveyed and assigned one of the datatypes. Mapping a source data field to a datatype can be demonstrated using the 100 sample records from the `Condition Occurrence` table in **Polyester**, a database of synthetic clinical data generated by **Synthea** and ETL'd into the **OMOP Common Data Model** ([learn more](https://meerapatelmd.github.io/polyester2/)). ```r condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv") str(condition_occurrence)
The survey above serves a guide to assign the appropriate datatype to each column. A few notes representing the challenges faced when applying data quality rules to real world clinical data:
gender
field data was read into the R environment as a
character
class, the result of our survey indicates that it should be of
the factor
class and thus, be assigned the category
datatype. condition_source
a text
or category
datatype. This would depend on
whether this field came from an abstraction or from a structured EHR data
capture. Here, I have chosen to treat it as a category. condition_occurrence_metadata <- tibble::tribble( ~field, ~datatype, "person_id", "identifier", "gender", "category", "condition_start_date", "date", "condition_end_date", "date", "condition_source", "category") kableExtra::kable(x = condition_occurrence_metadata, caption = "Map between the fields in the sample of 100 records from the `Condition Occurrence` table in **Polyester** and the assigned datatype.")
To apply the data quality rules, the fields can be grouped by assigned datatype, and each grouping can be sent to its respective data quality pipeline as defined by the rule
in the datatype matrix
.
condition_occurrence_datatypes <- split(condition_occurrence_metadata, condition_occurrence_metadata$datatype) %>% map(select, -datatype) %>% map(unlist) %>% map(unname) condition_occurrence_datatypes
Fields of datatype identifier
should have unique values that equivalent to the
length that is expected sample size. The programmatic rule used is confirming
that all the expected unique identifiers are found in the data. In this example
the unique identifiers are the sequence from 1 to 100 (1:100
).
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv") condition_occurrence %>% select(all_of(condition_occurrence_datatypes$identifier)) %>% rubix::tibble_to_list() %>% map(as.character) %>% map(function(x) c(x[!(x %in% as.character(1:100))], as.character(1:100)[!(as.character(1:100) %in% x)]) )
Condition Occurrence
failed the hard flag
that dictates that all the values
within the identifier
datatype are to be found within a predefined valueset.
In this case, the person_id
54 is missing from the data, triggering a hard
flag to ensure that 54 is brought back into the source dataset.
Fields of datatype date
should not have any dates that occur in the future.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv") condition_occurrence %>% select(all_of(condition_occurrence_datatypes$date)) %>% rubix::tibble_to_list() %>% map(as.Date) %>% map(~ .[centipede::no_na(.) > Sys.Date()])
Condition Occurrence
did not fail the hard flag
that dictates that all the
values within the date
datatype cannot contain any dates that occur in the
future from today.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv") condition_occurrence %>% select(all_of(condition_occurrence_datatypes$date)) %>% rubix::tibble_to_list() %>% map(as.Date) %>% map(~ summary(.))
Condition Occurrence
did assed the soft flag
that dictates that all the
values outside the 95th percentile should be reviewed for accuracy.
The category
datatype requires the most user feedback for this framework since
it requires a set of predetermined constraints that will ultimately determine
the degree of flagging that will occur. In this sample data, there are 2 fields
of this datatype: gender
and condition_source
. Each will require a vector of
constraints for the QA process
constraints <- list(gender = c("FEMALE", "MALE"), condition_source = c( 'Fatigue', 'Otitis media', 'Loss of taste', 'Streptococcal sore throat', 'Acute bacterial sinusitis', 'Cough', 'Viral sinusitis', 'Fever', 'Acute bronchitis', 'Acute viral pharyngitis'))
Fields of datatype category
should fall within their constraints.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv") condition_occurrence %>% select(all_of(condition_occurrence_datatypes$category)) %>% rubix::tibble_to_list() %>% map(as.factor) %>% map(levels) %>% map(unique)
list(constraints = constraints, condition_occurrence = condition_occurrence) %>% transpose() %>% map(function(x) x$condition_occurrence[!(x$condition_occurrence %in% x$constraints)]) %>% map(unique) %>% keep(~ length(.) > 0)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.