knitr::opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  cache = TRUE,
  cache.path = "reports/cache/",
  child = NULL,          # file(s) to knit and then include
  collapse = FALSE,      # collapse all output into a single block
  error = TRUE,          # display error messages in doc; FALSE stops render when an error is thrown
  fig.align = "center",  # left, right, center, or default
  include = TRUE,        # include chunk?
  message = TRUE,        # display code messages?
  tidy = TRUE,           # tidy code
  warning = TRUE,        # include warnings?
  results = "markup"     # "asis": passthrough results; "hide": do not display results; "hold": put all results below all code
)
library(tidyverse)
To adequately assess the data quality of a clinical dataset, the following overarching attributes about the data as a whole should be taken into account:
Cohort definition: the population that is being studied, which is characterized by questions such as:
- Adult or pediatric population?
- What are the inclusion criteria?
- What are the exclusion criteria?
An example of a cohort definition is "All the adult HIV-positive patients that were admitted to the ICU at MSK".
Cohort Definition and Timeline serve as the bare minimum knowledge for a data-driven research study, and confirmation of both should be the first step in the QA process.
Once the preliminary research study characterization is established, the Framework takes the R class of each field in the dataset; each R class is associated with a unique data quality pipeline governing the permitted values within the given field. For the purposes of this brief demonstration, rules are applied to a limited set of datatypes: category, text, float, integer, date, and datetime. In reality, many additional datatypes may exist, such as time for a timestamp without a date, or large integers that could be considered the equivalent of bigint in SQL.
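As a minimal sketch of this class-to-datatype mapping, a hypothetical helper (the function name and toy dataframe below are illustrative assumptions, not part of the Framework) could infer a datatype from each field's R class:

```r
# Hypothetical helper: infer the framework datatype from a field's R class
infer_datatype <- function(x) {
  if (inherits(x, "factor")) "category"
  else if (inherits(x, "Date")) "date"
  else if (inherits(x, "POSIXct")) "datetime"
  else if (is.integer(x)) "integer"
  else if (is.double(x)) "float"
  else "text"
}

# Toy dataframe for demonstration
toy <- tibble::tibble(
  gender     = factor(c("FEMALE", "MALE")),
  age        = c(34L, 57L),
  weight_kg  = c(61.3, 82.0),
  admit_date = as.Date(c("2019-01-04", "2019-02-11"))
)
purrr::map_chr(toy, infer_datatype)
```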
datatype_matrix <- tibble::tribble(
  ~datatype,    ~r_class,                 ~rule,                                                                           ~soft_flag,                                                   ~hard_flag,
  "identifier", "factor",                 "value in established valueset",                                                 "",                                                           "value not in valueset",
  "category",   "factor",                 "value in established constraints",                                              "value not in constraints",                                   "",
  "text",       "character",              "",                                                                              "",                                                           "",
  "float",      "c('numeric', 'double')", "only numeric characters with maximum 1 decimal point",                          "greater or less than 2.5 standard deviations from the mean", "",
  "integer",    "integer",                "only numeric characters with 0 decimal points",                                 "greater or less than 2.5 standard deviations from the mean", "",
  "date",       "Date",                   "maximum 8 and minimum 4 numeric characters, maximum 2 punctuation characters",  "greater or less than 2.5 standard deviations from the mean", "future date",
  "datetime",   "POSIXct",                "maximum 6 numeric characters, maximum 2 punctuation characters",                "",                                                           "times greater than or equal to 24:00"
)
kableExtra::kable(
  x = datatype_matrix,
  caption = "Matrix that maps a `datatype` of a field in a dataframe to the R object class (`r_class`). Each field has a single `datatype` that is constrained to a specific R object class and a set of additional rules associated with the datatype (`rule`). A `soft_flag` represents clinically improbable values that will require confirmation that the value is not due to error such as a typo. A `hard_flag` is a flag where the value is clinically impossible and must be corrected, such as a date value that occurs in the future. Note that a matrix such as this one would require regular refinement to accommodate nuances such as a dataset that includes a field like _date of next appointment_, which will contain dates that may occur in the future. This is an example of where a new datatype allowing future dates may need to be introduced to avoid any _hard flags_ for acceptable values."
)
The purpose of applying these fundamental rules is to confirm data integrity before moving on to more complex data quality rules. For example, a more complex rule may be one where the date of death must be preceded by a date of birth. However, it is the foundational rule on the date datatype that preliminarily confirms that all date of death and date of birth data are in a parseable format, that values falling outside the 95th percentile in either direction have been reviewed and confirmed, and that particular attention is paid to any date of death or date of birth that occurs in the future.
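As a minimal sketch of this layering (the `dob` and `dod` vectors below are hypothetical), the foundational checks must pass before the relative rule is meaningful:

```r
library(lubridate)

# Hypothetical date-of-birth and date-of-death vectors
dob <- ymd(c("1950-03-02", "1962-07-15", "2091-01-01"))
dod <- ymd(c("2018-11-30", "1960-01-01", NA))

# Foundational hard flag: dates occurring in the future
which(dob > Sys.Date() | dod > Sys.Date())

# More complex relative rule: date of death must not precede date of birth
which(!is.na(dod) & dod < dob)
```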
Following the matrix above, each field in the clinical dataset should be surveyed and assigned one of the datatypes. Mapping a source data field to a datatype can be demonstrated using 100 sample records from the Condition Occurrence table in Polyester, a database of synthetic clinical data generated by Synthea and ETL'd into the OMOP Common Data Model.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
str(condition_occurrence)
The survey above serves as a guide for assigning the appropriate datatype to each column. A few notes illustrate the challenges faced when applying data quality rules to real-world clinical data:
- `gender`: the field data was read into the R environment as the `character` class, but the result of our survey indicates that it should be of the `factor` class and thus be assigned the `category` datatype.
- `condition_source`: could be assigned either a `text` or `category` datatype, depending on whether this field came from an abstraction or from structured EHR data capture. Here, I have chosen to treat it as a `category`.

condition_occurrence_metadata <- tibble::tribble(
  ~field,                 ~datatype,
  "person_id",            "identifier",
  "gender",               "category",
  "condition_start_date", "date",
  "condition_end_date",   "date",
  "condition_source",     "category"
)
kableExtra::kable(
  x = condition_occurrence_metadata,
  caption = "Map between the fields in the sample of 100 records from the `Condition Occurrence` table in **Polyester** and the assigned datatype."
)
To apply the data quality rules, the fields can be grouped by assigned datatype, and each grouping can be sent to its respective data quality pipeline as defined by the `rule` in the datatype matrix.
condition_occurrence_datatypes <-
  split(condition_occurrence_metadata, condition_occurrence_metadata$datatype) %>%
  map(select, -datatype) %>%
  map(unlist) %>%
  map(unname)
condition_occurrence_datatypes
Fields of the `identifier` datatype should have unique values whose count equals the expected sample size. The programmatic rule confirms that all the expected unique identifiers are found in the data. In this example, the unique identifiers are the sequence from 1 to 100 (`1:100`).
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$identifier)) %>%
  rubix::tibble_to_list() %>%
  map(as.character) %>%
  # Symmetric difference against the expected valueset 1:100: values in the
  # data but not in the valueset, plus values in the valueset but not in the data
  map(function(x) c(x[!(x %in% as.character(1:100))],
                    as.character(1:100)[!(as.character(1:100) %in% x)]))
`Condition Occurrence` failed the hard flag that dictates that all values within the `identifier` datatype are to be found within a predefined valueset. In this case, the `person_id` 54 is missing from the data, triggering a hard flag to ensure that 54 is brought back into the source dataset.
Fields of the `date` datatype should not have any dates that occur in the future.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$date)) %>%
  rubix::tibble_to_list() %>%
  map(as.Date) %>%
  map(~ .[centipede::no_na(.) > Sys.Date()])
`Condition Occurrence` did not fail the hard flag that dictates that values within the `date` datatype cannot contain any dates that occur in the future from today.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$date)) %>%
  rubix::tibble_to_list() %>%
  map(as.Date) %>%
  map(~ summary(.))
`Condition Occurrence` passed the soft flag that dictates that all values outside the 95th percentile should be reviewed for accuracy.
The `category` datatype requires the most user feedback in this framework, since it requires a set of predetermined constraints that ultimately determine the degree of flagging that will occur. In this sample data, there are 2 fields of this datatype: `gender` and `condition_source`. Each requires a vector of constraints for the QA process.
constraints <- list(
  gender = c("FEMALE", "MALE"),
  condition_source = c(
    "Fatigue",
    "Otitis media",
    "Loss of taste",
    "Streptococcal sore throat",
    "Acute bacterial sinusitis",
    "Cough",
    "Viral sinusitis",
    "Fever",
    "Acute bronchitis",
    "Acute viral pharyngitis"
  )
)
Fields of the `category` datatype should fall within their constraints.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$category)) %>%
  rubix::tibble_to_list() %>%
  map(as.factor) %>%
  map(levels) %>%
  map(unique)
# Pair each category field with its constraint vector and keep any values
# observed in the data that are absent from the constraints
list(constraints = constraints,
     condition_occurrence = condition_occurrence) %>%
  transpose() %>%
  map(function(x) x$condition_occurrence[!(x$condition_occurrence %in% x$constraints)]) %>%
  map(unique) %>%
  keep(~ length(.) > 0)
Dataframe characterization entails surveying the dataset as a whole, and then diving deeper into the data by assigning a datatype to each field:
str(INPUT_DATA)
The distribution of values for each variable is also surveyed:
summary(INPUT_DATA %>% select(-PKEY) %>% mutate_all(as.factor))
We then weed out duplicate observations from the input data (the INPUT_DATA object); the duplicates are added to a dataframe of flagged data (the OUTPUT_FLAGS object) that retains the original structure with the addition of FLAG_REASON and FLAG_TYPE variables. The FLAG_REASON value for this data is "duplicate".
A 5-observation sample of the flagged data is:
INPUT_DATA_TOTAL <- INPUT_DATA
INPUT_DATA_FINAL <- INPUT_DATA_TOTAL %>%
  distinct(PATIENT_ID, PATIENT_DIAGNOSIS, INSURANCE_TYPE, AGE, SEX,
           VISIT_DATE, PT_SCHEDULED_APPT, PT_ARRIVE_TIME, PT_START_TIME,
           PT_END_TIME, PROVIDER_NAME, .keep_all = TRUE)
OUTPUT_FLAGS_01 <- INPUT_DATA_TOTAL[!(INPUT_DATA_TOTAL$PKEY %in% INPUT_DATA_FINAL$PKEY), ] %>%
  mutate(FLAG_REASON = "duplicate")
print(OUTPUT_FLAGS_01 %>% sample_n(5))
For this exercise, the possible datatypes are limited to four: category, number, date, and time. In practice, there are additional types such as datetime, string, and boolean. Each datatype follows its assigned rules; an overview of the datatypes and their associated rules can be seen in Table 1. These rules are called Standalone Rules because they assess data quality in isolation rather than relative to other variables in the data, such as visit dates. Relative Rules are implemented downstream with increasing complexity, which will be explained later.
library(kableExtra)  # provides kable_styling() and column_spec()
load("01_dataframe_characterization.RData")
kable(STANDALONE_RULES, caption = "Standalone Rules") %>%
  kable_styling(position = "left", font_size = 8) %>%
  column_spec(column = c(2, 4, 5), width = "3cm") %>%
  column_spec(column = c(1), width = "2.5cm") %>%
  column_spec(column = c(6), width = ".75cm") %>%
  column_spec(column = c(3), width = "3.5cm")
Other important definitions include:
Hard Flag: clinically impossible value. For example, a birthdate that takes place in the future is clinically impossible.
Soft Flag: clinically improbable value, such as an extremely high white blood cell count of 30000.
Each variable is assigned a datatype and joined with the rules to create a SCRIPT_MAP object that guides the remainder of the framework. It is a list that divides the column names by datatype so that scripts can be executed against each group. Each datatype has its own downstream pipeline, indicated by the QA_STEP variable, which directs the data flow for Step 02 of the framework.
load("01_dataframe_characterization.RData")
print(SCRIPT_MAP)
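A script map along these lines could be assembled by splitting a field-to-datatype table on datatype, as in this hypothetical sketch (the column names and QA step labels are assumptions based on the printed object, not the actual Step 01 script):

```r
# Hypothetical sketch: one list element per datatype, each carrying the
# column names and the QA step that will process them in Step 02
field_metadata <- tibble::tribble(
  ~COL_NAME,       ~DATATYPE,
  "PATIENT_ID",    "category",
  "AGE",           "number",
  "VISIT_DATE",    "date",
  "PT_START_TIME", "time"
)
qa_steps <- c(category = "02a", number = "02b", date = "02c", time = "02d")

script_map <- split(field_metadata, field_metadata$DATATYPE) %>%
  purrr::imap(~ dplyr::mutate(.x, QA_STEP = qa_steps[[.y]]))
script_map
```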
source("~/R/MSK/02a_category_qa.R")
load("02a_category_qa.RData")
A category datatype is defined as a vector of possible values that the content of the column can fall under. Therefore, a prerequisite for a quality process on this datatype is having the necessary control data to compare against.
The script map for this step is customized to guide this effort:
load("02a_category_qa.RData") print(SCRIPT_MAP)
Though the PKEY and PATIENT_ID variables are categorical, they serve as identifiers in the data and do not require controls. The remaining variables require an associated control vector, the object name of which is in the CONTROL_OBJ_NAME variable. In this exercise, I've assumed that all the unique values for each column are the control. However, in more realistic cases, the control vectors are iteratively updated with new values, or error values can be mapped to an existing control value. The controls are as follows:
PATIENT_DIAGNOSIS_CONTROL <- tolower(c("Prostate Cancer", "Kidney Cancer",
                                       "Bladder Cancer", "Testicular Cancer"))
INSURANCE_TYPE_CONTROL <- tolower(c("Private Insurance", "Medicaid",
                                    "Self Insured", "Medicare", ""))
SEX_CONTROL <- tolower(c("M", "F"))
PROVIDER_NAME_CONTROL <- tolower(c("L. Svenson", "I. Petrov", "E. Ahuja",
                                   "J. Smith", "N. Fulano", "M. Dupont",
                                   "S. Moreau", "W. Plinge", "C. S. Ming",
                                   "C. Siu Ming"))
If there is a mismatch, the mismatched observations are compiled in an OUTPUT_FLAGS_02A object and are removed from the input data. It is also noted at this point that there are two providers "C. S. Ming" and "C. Siu Ming" that may potentially be duplicates. In real-life circumstances, I would check to see if they are the same provider, in which case the control set PROVIDER_NAME_CONTROL would not include one of the values. The unmatched value would be flagged and mapped to the correct control value.
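As a minimal sketch of how this mismatch check might be expressed (the toy values and flag labels are assumptions; the actual logic lives in 02a_category_qa.R), values absent from a control vector are flagged:

```r
# Hypothetical sketch: flag observations whose lowercased SEX value is not
# found in the control vector, mirroring how OUTPUT_FLAGS_02A is assembled
sex_values <- tolower(c("M", "F", "Female", "U", "M"))
tibble::tibble(SEX = sex_values) %>%
  dplyr::filter(!(SEX %in% SEX_CONTROL)) %>%
  dplyr::mutate(FLAG_REASON = "value not in control data",
                FLAG_TYPE   = "soft")
```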
source("~/R/MSK/02b_number_qa.R")
load("02b_number_qa.RData")
The customized script map for the number datatype is:
load("02b_number_qa.RData") print(SCRIPT_MAP)
For each variable, a sequential check on the number of digits, decimal points, other punctuation characters, and letters is used to flag values that do not follow the rule of only numbers and no more than one decimal point. In this exercise, only one column is designated the number datatype, and analysis of the character counts returns no observations to flag: all values in the AGE variable consist of exactly two digits.
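A sketch of what these sequential character-count checks could look like with stringr (the helper name, toy values, and output columns are assumptions):

```r
library(stringr)

# Hypothetical helper: count digits, decimal points, and letters, and flag
# values violating the "only numbers, at most one decimal point" rule
check_number <- function(x) {
  x <- as.character(x)
  tibble::tibble(
    value    = x,
    digits   = str_count(x, "[0-9]"),
    decimals = str_count(x, fixed(".")),
    letters  = str_count(x, "[A-Za-z]"),
    flag     = str_count(x, fixed(".")) > 1 | str_count(x, "[A-Za-z]") > 0
  )
}
check_number(c("42", "3.14", "1.2.3", "12a"))
```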
source("~/R/MSK/02c_date_qa.R")
load("02c_date_qa.RData")
Like the previous datatype, a series of sequential checks was conducted on the counts of different character types. The rules for the date datatype in this exercise flagged values with more than 2 punctuation marks, more than 8 digits, fewer than 4 digits, values that did not parse using the lubridate::ymd function, and parsed dates that occur in the future.
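A sketch of these date checks under the same assumptions (the values and tibble layout are illustrative, not the 02c script):

```r
library(lubridate)
library(stringr)

# Toy values: valid date, 2-digit year, unparseable fragment, future date
raw_dates <- c("2019-03-02", "19-03-02", "3/2", "2091-01-01")
parsed <- suppressWarnings(ymd(raw_dates))
tibble::tibble(
  value       = raw_dates,
  n_digits    = str_count(raw_dates, "[0-9]"),
  n_punct     = str_count(raw_dates, "[[:punct:]]"),
  parse_fail  = is.na(parsed),
  future_date = !is.na(parsed) & parsed > Sys.Date()
)
```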
The script map is very similar to the one for number datatype:
load("02c_date_qa.RData") print(SCRIPT_MAP)
Applying the rules above to the single column returns no values to flag. All values in the VISIT_DATE variable follow the format "YYYY-MM-DD", and none of the dates occur in the future.
source("~/R/MSK/02d_time_qa.R")
load("02d_time_qa.RData")
Like the previous 2 datatypes, a series of sequential checks was conducted on the counts of different character types. The rules for the time datatype in this exercise flagged values with more than 2 punctuation marks, more than 6 digits, fewer than 3 digits, and values that did not parse using the lubridate::hm function.
The script map contains 4 columns this time:
load("02d_time_qa.RData") print(SCRIPT_MAP)
Applying the rules above to all 4 columns returns no values to flag: all data in the 4 columns contain valid hour and minute values as determined by the clinical rules applied.
The final output is final_standalone_flag_output.xlsx. The final validated data is de-duplicated, and the duplicates and flags from Steps 02a through 02d are combined into a single output flags dataframe. Additionally, any overlap between the final validated data and the flagged data is designated in an overlap tab. More detailed definitions for the three tabs follow, along with a sketch of writing the workbook after the list:
VALIDATED_DATA: all observations that have been de-duplicated and have passed the standalone datatype quality checks.
OUTPUT_FLAGS: observations that have been flagged with the reason in the FLAG_REASON variable and flag severity (hard, soft, or NA) in the FLAG_TYPE variable.
OVERLAP_INPUT_FLAGS: ideally, each observation is accounted for in either the validated data or the flagged data, but an overlap may inevitably occur. In this exercise, the overlap occurred because blank values are flagged by default, but a blank value was also included in the control data for the INSURANCE_TYPE variable. This overlap will require deciding whether blank values should be considered a unique control value, imputed with another value, or treated as missing for this specific variable.
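A minimal sketch of writing the three tabs to a single workbook, assuming the writexl package and that the three objects exist under the names used in the tab definitions above:

```r
# Hypothetical sketch: writexl writes one sheet per named list element
writexl::write_xlsx(
  list(
    VALIDATED_DATA      = VALIDATED_DATA,
    OUTPUT_FLAGS        = OUTPUT_FLAGS,
    OVERLAP_INPUT_FLAGS = OVERLAP_INPUT_FLAGS
  ),
  path = "final_standalone_flag_output.xlsx"
)
```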
Conducting a quality check past the standalone flags is beyond the scope of this assignment. However, for a more thorough quality check, soft flags (such as values greater or less than 2.5 standard deviations from the mean for enumerated datatypes) and relative flags would be employed with increasing complexity.
Soft flags are a means to ensure that outliers are indeed outliers rather than technical or human error. All flags for category data are typically soft because the control data is dynamically updated and can, in theory, have an infinite number of control values. However, the highest and lowest values in enumerated datatypes such as number and date may be fed into a pipeline involving patient record checks to ensure validity and reliability. The only exception would be the time datatype, where the data value is circular rather than linear.
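A minimal sketch of the 2.5-standard-deviation soft flag on a numeric vector (the helper name and toy data are assumptions):

```r
# Hypothetical helper: return indices of values beyond 2.5 SD from the mean
flag_sd_outliers <- function(x, n_sd = 2.5) {
  mu    <- mean(x, na.rm = TRUE)
  sigma <- sd(x, na.rm = TRUE)
  which(abs(x - mu) > n_sd * sigma)
}

ages <- c(31:50, 120)  # toy ages with one improbable value
flag_sd_outliers(ages)
#> [1] 21
```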
These flags can be relative to the datatype, patient, provider, clinic, diagnosis, and so on. For example, a relative flag for the time datatype in this data would ensure that the PT_START_TIME variable occurs prior to the PT_END_TIME variable, as sketched below. On the other hand, a flag relative to the patient record could check whether a reported appointment date falls between the patient's date of birth and date of death. These flags can scale in the number of variables being compared against each other at many levels: patient-level, provider-level, diagnosis-level, and so on.
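A sketch of the start-before-end relative rule (the flag labels are assumptions; INPUT_DATA_FINAL holds the de-duplicated data with character time values):

```r
library(lubridate)

# Hypothetical sketch of a relative rule: PT_START_TIME must precede
# PT_END_TIME within the same observation
INPUT_DATA_FINAL %>%
  dplyr::filter(hm(PT_START_TIME) > hm(PT_END_TIME)) %>%
  dplyr::mutate(FLAG_REASON = "start time after end time",
                FLAG_TYPE   = "soft")
```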
After adjudicating on the flagged and overlap data, the final validated data would be ready for analysis. Assumptions made to this effect include that blank values in the INSURANCE_TYPE variable are valid control values and are included in the data used for analysis.
The validated data has `r nrow(INPUT_DATA_FINAL)` unique observations. The original data had `r nrow(INPUT_DATA)` observations, but `r nrow(INPUT_DATA) - nrow(INPUT_DATA_FINAL)` were found to be duplicates and excluded.
The variables of the final validated data are converted to the appropriate data class using the script map created in Step 01 Dataframe Characterization, which designates the datatype.
load("01_dataframe_characterization.RData")
SCRIPT_MAP <- lapply(SCRIPT_MAP, call_mr_clean)
print(bind_rows(SCRIPT_MAP) %>% select(-QA_STEP))
INPUT_DATA_ANALYSIS <- INPUT_DATA_FINAL
Category datatype columns are kept as the character class, while number is converted to the double class, date to the Date class, and time to a period class via lubridate::hm:
# Convert date columns with lubridate::ymd
for (i in 1:nrow(SCRIPT_MAP$date)) {
  x <- SCRIPT_MAP$date$COL_NAME[i]
  x <- enquo(x)
  INPUT_DATA_ANALYSIS <- INPUT_DATA_ANALYSIS %>%
    mutate_at(vars(!!x), list(~ ymd(.)))
}
# Convert number columns to double
for (i in 1:nrow(SCRIPT_MAP$number)) {
  x <- SCRIPT_MAP$number$COL_NAME[i]
  x <- enquo(x)
  INPUT_DATA_ANALYSIS <- INPUT_DATA_ANALYSIS %>%
    mutate_at(vars(!!x), list(~ as.double(.)))
}
# Convert time columns with lubridate::hm
for (i in 1:nrow(SCRIPT_MAP$time)) {
  x <- SCRIPT_MAP$time$COL_NAME[i]
  x <- enquo(x)
  INPUT_DATA_ANALYSIS <- INPUT_DATA_ANALYSIS %>%
    mutate_at(vars(!!x), list(~ hm(.)))
}
The patient population can be captured with a histogram plotting the distribution of patient characteristics. The distribution of age is as follows:
hist(INPUT_DATA_ANALYSIS$AGE,
     col = "green4",
     breaks = 50,
     xlim = c(15, 95),
     main = "Frequency Distribution of Age",
     xlab = "Age",
     ylab = "Frequency")
It is a common assumption that the more complex the patient, the more clinic time is needed to treat the patient. In this dataset, a patient with multiple primary cancers may be considered a complex patient. However, summary data indicates that all patients in this dataset have one primary cancer diagnosis.
INPUT_DATA_PATIENT <- INPUT_DATA_ANALYSIS %>%
  group_by(PATIENT_ID) %>%
  summarize(CANCER_TYPE_NUMBER = length(unique(PATIENT_DIAGNOSIS))) %>%
  arrange(desc(CANCER_TYPE_NUMBER))
We can also delve into assessing provider performance. Again assuming that "C. S. Ming" and "C. Siu Ming" are two different providers, we can look at the patient population each provider serves. This bar chart was meant to stack patients based on cancer type, but upon visualization, it is clear that each provider treats one cancer type at this clinic.
library(ggplot2)
INPUT_DATA_PROVIDER <- INPUT_DATA_ANALYSIS %>%
  group_by(PROVIDER_NAME, PATIENT_DIAGNOSIS) %>%
  summarise(COUNT = length(PATIENT_DIAGNOSIS)) %>%
  arrange(PATIENT_DIAGNOSIS) %>%
  ungroup() %>%
  mutate(PROVIDER_NAME = factor(PROVIDER_NAME, levels = PROVIDER_NAME))
ggplot(INPUT_DATA_PROVIDER,
       aes(x = PROVIDER_NAME, y = COUNT, fill = PATIENT_DIAGNOSIS)) +
  geom_bar(stat = "identity") +
  ggtitle("Patient Panel Size and Cancer Type for Each Provider") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The total clinic time in minutes was calculated from the PT_START_TIME and PT_END_TIME variables. Relatively flagged data was identified when the PT_START_TIME variable took place after the PT_END_TIME variable. An assumption was made that these values were interchanged, so they were swapped to calculate the results rather than excluded.
The frequency distribution was then visualized for clinic time for all appointments.
INPUT_DATA_CANCER <- suppressMessages(
  INPUT_DATA_ANALYSIS %>%
    mutate(TOTAL_CLINIC_TIME = period_to_seconds(PT_END_TIME - PT_START_TIME) / 60) %>%
    # Negative durations indicate interchanged start/end times; take the absolute value
    mutate(TOTAL_CLINIC_TIME = ifelse(TOTAL_CLINIC_TIME < 0,
                                      TOTAL_CLINIC_TIME * (-1),
                                      TOTAL_CLINIC_TIME)))
hist(INPUT_DATA_CANCER$TOTAL_CLINIC_TIME,
     col = "gray",
     main = "Frequency Distribution of All Clinic Visits",
     xlab = "Total Clinic Time (Minutes)",
     xlim = c(0, 50))
While the above histogram provides a general idea of the distribution of time dedicated to patients at this clinic, the cancer types represented in this dataset (testicular, kidney, prostate, bladder) vary widely in life expectancy, treatment pathways, and prevalent patient characteristics. For example, prostate cancer is less deadly than the other cancers listed, and many patients may die of other age-related causes before the cancer progresses. Testicular cancer tends to affect younger populations, and its long-term survival rate is also higher than that of the other cancers. Bladder and kidney cancers are much more likely to affect life expectancy and require prompt, aggressive treatment.
Sieving the clinic time distribution by cancer type shows a different story. Bladder cancer appointments tend to require longer clinic times. Kidney cancer also has a higher density at clinic times greater than 30 minutes, but its distribution has the largest range of the four cancers. As expected, prostate and testicular cancers fall on the lower end of the spectrum, since these visits are more focused on surveillance than aggressive treatment.
ggplot(INPUT_DATA_CANCER, aes(TOTAL_CLINIC_TIME, fill = PATIENT_DIAGNOSIS)) +
  geom_density(alpha = 0.2) +
  ggtitle("Density of Clinic Times by Cancer Type") +
  labs(y = "Density", x = "Clinic Time (Minutes)")