knitr::opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  cache = TRUE,
  cache.path = "reports/cache/",
  child = NULL,          # file(s) to knit and then include
  collapse = FALSE,      # collapse all output into a single block
  error = TRUE,          # display error messages in doc; FALSE stops render when an error is thrown
  fig.align = "center",  # left, right, center, or default
  include = TRUE,        # include chunk?
  message = TRUE,        # display code messages?
  tidy = TRUE,           # tidy code
  warning = TRUE,        # include warnings?
  results = "markup"     # "asis": passthrough results; "hide": do not display results; "hold": put all results below all code
)
library(tidyverse)
To adequately assess the data quality of a clinical dataset, the following overarching attributes about the data as a whole should be taken into account:
Cohort definition: the population that is being studied, which is characterized by questions such as:
- Adult or pediatric population?
- What are the inclusion criteria?
- What are the exclusion criteria?
An example of a cohort definition is "All the adult HIV-positive patients that were admitted to the ICU at MSK".
Cohort Definition and Timeline serve as the bare minimum knowledge for a data-driven research study, and confirmation of both should be the first step in the QA process.
Once the preliminary research study characterization is established, the Framework takes the R class of each field in the dataset; each R class is associated with a unique data quality pipeline governing the permitted values within the given field. For the purposes of this brief demonstration, rules are applied to a limited set of datatypes: category, text, float, integer, date, and datetime. In reality, many additional datatypes may exist, such as time for a timestamp without a date, or large integers that could be considered the equivalent of bigint in SQL.
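As a minimal sketch of this class-to-datatype mapping, a hypothetical helper (the function name and toy dataframe below are illustrative assumptions, not part of the Framework) could infer a datatype from each field's R class:

```r
# Hypothetical helper: infer the framework datatype from a field's R class
infer_datatype <- function(x) {
  if (inherits(x, "factor")) "category"
  else if (inherits(x, "Date")) "date"
  else if (inherits(x, "POSIXct")) "datetime"
  else if (is.integer(x)) "integer"
  else if (is.double(x)) "float"
  else "text"
}

# Toy dataframe for demonstration
toy <- tibble::tibble(
  gender     = factor(c("FEMALE", "MALE")),
  age        = c(34L, 57L),
  weight_kg  = c(61.3, 82.0),
  admit_date = as.Date(c("2019-01-04", "2019-02-11"))
)
purrr::map_chr(toy, infer_datatype)
```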
datatype_matrix <- tibble::tribble(
  ~datatype,    ~r_class,                 ~rule,                                                                           ~soft_flag,                                                   ~hard_flag,
  "identifier", "factor",                 "value in established valueset",                                                 "",                                                           "value not in valueset",
  "category",   "factor",                 "value in established constraints",                                              "value not in constraints",                                   "",
  "text",       "character",              "",                                                                              "",                                                           "",
  "float",      "c('numeric', 'double')", "only numeric characters with maximum 1 decimal point",                          "greater or less than 2.5 standard deviations from the mean", "",
  "integer",    "integer",                "only numeric characters with 0 decimal points",                                 "greater or less than 2.5 standard deviations from the mean", "",
  "date",       "Date",                   "maximum 8 and minimum 4 numeric characters, maximum 2 punctuation characters",  "greater or less than 2.5 standard deviations from the mean", "future date",
  "datetime",   "POSIXct",                "maximum 6 numeric characters, maximum 2 punctuation characters",                "",                                                           "times greater than or equal to 24:00"
)
kableExtra::kable(
  x = datatype_matrix,
  caption = "Matrix that maps a `datatype` of a field in a dataframe to the R object class (`r_class`). Each field has a single `datatype` that is constrained to a specific R object class and a set of additional rules associated with the datatype (`rule`). A `soft_flag` represents clinically improbable values that will require confirmation that the value is not due to error such as a typo. A `hard_flag` is a flag where the value is clinically impossible and must be corrected, such as a date value that occurs in the future. Note that a matrix such as this one would require regular refinement to accommodate nuances such as a dataset that includes a field like _date of next appointment_, which will contain dates that may occur in the future. This is an example of where a new datatype allowing future dates may need to be introduced to avoid any _hard flags_ for acceptable values."
)
The purpose of applying these fundamental rules is to confirm data integrity before moving on to more complex data quality rules. For example, a more complex rule may be one where the date of death must be preceded by a date of birth. However, it is the foundational rule on the date datatype that preliminarily confirms that all date of death and date of birth data are in a parseable format, that values falling outside the 95th percentile in either direction have been reviewed and confirmed, and that particular attention is paid to any date of death or date of birth that occurs in the future.
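As a minimal sketch of this layering (the `dob` and `dod` vectors below are hypothetical), the foundational checks must pass before the relative rule is meaningful:

```r
library(lubridate)

# Hypothetical date-of-birth and date-of-death vectors
dob <- ymd(c("1950-03-02", "1962-07-15", "2091-01-01"))
dod <- ymd(c("2018-11-30", "1960-01-01", NA))

# Foundational hard flag: dates occurring in the future
which(dob > Sys.Date() | dod > Sys.Date())

# More complex relative rule: date of death must not precede date of birth
which(!is.na(dod) & dod < dob)
```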
Following the matrix above, each field in the clinical dataset should be surveyed and assigned one of the datatypes. Mapping a source data field to a datatype can be demonstrated using 100 sample records from the Condition Occurrence table in Polyester, a database of synthetic clinical data generated by Synthea and ETL'd into the OMOP Common Data Model.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
str(condition_occurrence)
The survey above serves as a guide for assigning the appropriate datatype to each column. A few notes illustrate the challenges faced when applying data quality rules to real-world clinical data:
- `gender`: the field data was read into the R environment as the `character` class, but the result of our survey indicates that it should be of the `factor` class and thus be assigned the `category` datatype.
- `condition_source`: could be assigned either a `text` or `category` datatype, depending on whether this field came from an abstraction or from structured EHR data capture. Here, I have chosen to treat it as a `category`.

condition_occurrence_metadata <- tibble::tribble(
  ~field,                 ~datatype,
  "person_id",            "identifier",
  "gender",               "category",
  "condition_start_date", "date",
  "condition_end_date",   "date",
  "condition_source",     "category"
)
kableExtra::kable(
  x = condition_occurrence_metadata,
  caption = "Map between the fields in the sample of 100 records from the `Condition Occurrence` table in **Polyester** and the assigned datatype."
)
To apply the data quality rules, the fields can be grouped by assigned datatype, and each grouping can be sent to its respective data quality pipeline as defined by the `rule` in the datatype matrix.
condition_occurrence_datatypes <-
  split(condition_occurrence_metadata, condition_occurrence_metadata$datatype) %>%
  map(select, -datatype) %>%
  map(unlist) %>%
  map(unname)
condition_occurrence_datatypes
Fields of the `identifier` datatype should have unique values whose count equals the expected sample size. The programmatic rule confirms that all the expected unique identifiers are found in the data. In this example, the unique identifiers are the sequence from 1 to 100 (`1:100`).
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$identifier)) %>%
  rubix::tibble_to_list() %>%
  map(as.character) %>%
  # Symmetric difference against the expected valueset 1:100: values in the
  # data but not in the valueset, plus values in the valueset but not in the data
  map(function(x) c(x[!(x %in% as.character(1:100))],
                    as.character(1:100)[!(as.character(1:100) %in% x)]))
`Condition Occurrence` failed the hard flag that dictates that all values within the `identifier` datatype are to be found within a predefined valueset. In this case, the `person_id` 54 is missing from the data, triggering a hard flag to ensure that 54 is brought back into the source dataset.
Fields of the `date` datatype should not have any dates that occur in the future.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$date)) %>%
  rubix::tibble_to_list() %>%
  map(as.Date) %>%
  map(~ .[centipede::no_na(.) > Sys.Date()])
`Condition Occurrence` did not fail the hard flag that dictates that values within the `date` datatype cannot contain any dates that occur in the future from today.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$date)) %>%
  rubix::tibble_to_list() %>%
  map(as.Date) %>%
  map(~ summary(.))
`Condition Occurrence` passed the soft flag that dictates that all values outside the 95th percentile should be reviewed for accuracy.
The `category` datatype requires the most user feedback in this framework, since it requires a set of predetermined constraints that ultimately determine the degree of flagging that will occur. In this sample data, there are 2 fields of this datatype: `gender` and `condition_source`. Each requires a vector of constraints for the QA process.
constraints <- list(
  gender = c("FEMALE", "MALE"),
  condition_source = c(
    "Fatigue",
    "Otitis media",
    "Loss of taste",
    "Streptococcal sore throat",
    "Acute bacterial sinusitis",
    "Cough",
    "Viral sinusitis",
    "Fever",
    "Acute bronchitis",
    "Acute viral pharyngitis"
  )
)
Fields of the `category` datatype should fall within their constraints.
condition_occurrence <- readr::read_csv("data-raw/CONDITION_OCCURRENCE.csv")
condition_occurrence %>%
  select(all_of(condition_occurrence_datatypes$category)) %>%
  rubix::tibble_to_list() %>%
  map(as.factor) %>%
  map(levels) %>%
  map(unique)
# Pair each category field with its constraint vector and keep any values
# observed in the data that are absent from the constraints
list(constraints = constraints,
     condition_occurrence = condition_occurrence) %>%
  transpose() %>%
  map(function(x) x$condition_occurrence[!(x$condition_occurrence %in% x$constraints)]) %>%
  map(unique) %>%
  keep(~ length(.) > 0)
Dataframe characterization entails surveying the dataset as a whole, and then diving deeper into the data by assigning a datatype to each field:
str(INPUT_DATA)
The distribution of values for each variable is also surveyed:
summary(INPUT_DATA %>% select(-PKEY) %>% mutate_all(as.factor))
We then weed out duplicate observations from the input data (the INPUT_DATA object); the duplicates are added to a dataframe of flagged data (the OUTPUT_FLAGS object) that retains the original structure with the addition of FLAG_REASON and FLAG_TYPE variables. The FLAG_REASON value for this data is "duplicate".
A 5-observation sample of the flagged data is:
INPUT_DATA_TOTAL <- INPUT_DATA
INPUT_DATA_FINAL <- INPUT_DATA_TOTAL %>%
  distinct(PATIENT_ID, PATIENT_DIAGNOSIS, INSURANCE_TYPE, AGE, SEX,
           VISIT_DATE, PT_SCHEDULED_APPT, PT_ARRIVE_TIME, PT_START_TIME,
           PT_END_TIME, PROVIDER_NAME, .keep_all = TRUE)
OUTPUT_FLAGS_01 <- INPUT_DATA_TOTAL[!(INPUT_DATA_TOTAL$PKEY %in% INPUT_DATA_FINAL$PKEY), ] %>%
  mutate(FLAG_REASON = "duplicate")
print(OUTPUT_FLAGS_01 %>% sample_n(5))
For this exercise, the possible datatypes are limited to four: category, number, date, and time. In practice, there are additional types such as datetime, string, and boolean. Each datatype follows its assigned rules; an overview of the datatypes and their associated rules can be seen in Table 1. These rules are called Standalone Rules because they assess data quality in isolation rather than relative to other variables in the data, such as visit dates. Relative Rules are implemented downstream with increasing complexity, which will be explained later.
library(kableExtra)  # provides kable_styling() and column_spec()
load("01_dataframe_characterization.RData")
kable(STANDALONE_RULES, caption = "Standalone Rules") %>%
  kable_styling(position = "left", font_size = 8) %>%
  column_spec(column = c(2, 4, 5), width = "3cm") %>%
  column_spec(column = c(1), width = "2.5cm") %>%
  column_spec(column = c(6), width = ".75cm") %>%
  column_spec(column = c(3), width = "3.5cm")
Other important definitions include:
Hard Flag: clinically impossible value. For example, a birthdate that takes place in the future is clinically impossible.
Soft Flag: clinically improbable value, such as an extremely high white blood cell count of 30000.
Each variable is assigned a datatype and joined with the rules to create a SCRIPT_MAP object that guides the remainder of the framework. It is a list that divides the column names by datatype so that scripts can be executed against each group. Each datatype has its own downstream pipeline, indicated by the QA_STEP variable, which directs the data flow for Step 02 of the framework.
load("01_dataframe_characterization.RData")
print(SCRIPT_MAP)
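A script map along these lines could be assembled by splitting a field-to-datatype table on datatype, as in this hypothetical sketch (the column names and QA step labels are assumptions based on the printed object, not the actual Step 01 script):

```r
# Hypothetical sketch: one list element per datatype, each carrying the
# column names and the QA step that will process them in Step 02
field_metadata <- tibble::tribble(
  ~COL_NAME,       ~DATATYPE,
  "PATIENT_ID",    "category",
  "AGE",           "number",
  "VISIT_DATE",    "date",
  "PT_START_TIME", "time"
)
qa_steps <- c(category = "02a", number = "02b", date = "02c", time = "02d")

script_map <- split(field_metadata, field_metadata$DATATYPE) %>%
  purrr::imap(~ dplyr::mutate(.x, QA_STEP = qa_steps[[.y]]))
script_map
```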
source("~/R/MSK/02a_category_qa.R")
load("02a_category_qa.RData")
A category datatype is defined as a vector of possible values that the content of the column can fall under. Therefore, a prerequisite for a quality process on this datatype is having the necessary control data to compare against.
The script map for this step is customized to guide this effort:
load("02a_category_qa.RData") print(SCRIPT_MAP)
Though the PKEY and PATIENT_ID variables are categorical, they serve as identifiers in the data and do not require controls. The remaining variables require an associated control vector, the object name of which is in the CONTROL_OBJ_NAME variable. In this exercise, I've assumed that all the unique values for each column are the control. However, in more realistic cases, the control vectors are iteratively updated with new values, or error values can be mapped to an existing control value. The controls are as follows:
PATIENT_DIAGNOSIS_CONTROL <- tolower(c("Prostate Cancer", "Kidney Cancer",
                                       "Bladder Cancer", "Testicular Cancer"))
INSURANCE_TYPE_CONTROL <- tolower(c("Private Insurance", "Medicaid",
                                    "Self Insured", "Medicare", ""))
SEX_CONTROL <- tolower(c("M", "F"))
PROVIDER_NAME_CONTROL <- tolower(c("L. Svenson", "I. Petrov", "E. Ahuja",
                                   "J. Smith", "N. Fulano", "M. Dupont",
                                   "S. Moreau", "W. Plinge", "C. S. Ming",
                                   "C. Siu Ming"))
If there is a mismatch, the mismatched observations are compiled in an OUTPUT_FLAGS_02A object and are removed from the input data. It is also noted at this point that there are two providers "C. S. Ming" and "C. Siu Ming" that may potentially be duplicates. In real-life circumstances, I would check to see if they are the same provider, in which case the control set PROVIDER_NAME_CONTROL would not include one of the values. The unmatched value would be flagged and mapped to the correct control value.
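As a minimal sketch of how this mismatch check might be expressed (the toy values and flag labels are assumptions; the actual logic lives in 02a_category_qa.R), values absent from a control vector are flagged:

```r
# Hypothetical sketch: flag observations whose lowercased SEX value is not
# found in the control vector, mirroring how OUTPUT_FLAGS_02A is assembled
sex_values <- tolower(c("M", "F", "Female", "U", "M"))
tibble::tibble(SEX = sex_values) %>%
  dplyr::filter(!(SEX %in% SEX_CONTROL)) %>%
  dplyr::mutate(FLAG_REASON = "value not in control data",
                FLAG_TYPE   = "soft")
```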
source("~/R/MSK/02b_number_qa.R")
load("02b_number_qa.RData")
The customized script map for the number datatype is:
load("02b_number_qa.RData") print(SCRIPT_MAP)
For each variable, a sequential check on the number of digits, decimal points, other punctuation characters, and letters is used to flag values that do not follow the rule of only numbers and no more than one decimal point. In this exercise, only one column is designated the number datatype, and analysis of the character counts returns no observations to flag: all values in the AGE variable consist of exactly two digits.
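A sketch of what these sequential character-count checks could look like with stringr (the helper name, toy values, and output columns are assumptions):

```r
library(stringr)

# Hypothetical helper: count digits, decimal points, and letters, and flag
# values violating the "only numbers, at most one decimal point" rule
check_number <- function(x) {
  x <- as.character(x)
  tibble::tibble(
    value    = x,
    digits   = str_count(x, "[0-9]"),
    decimals = str_count(x, fixed(".")),
    letters  = str_count(x, "[A-Za-z]"),
    flag     = str_count(x, fixed(".")) > 1 | str_count(x, "[A-Za-z]") > 0
  )
}
check_number(c("42", "3.14", "1.2.3", "12a"))
```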
source("~/R/MSK/02c_date_qa.R")
load("02c_date_qa.RData")
Like the previous datatype, a series of sequential checks was conducted on the counts of different character types. The rules for the date datatype in this exercise flagged values with more than 2 punctuation marks, more than 8 digits, fewer than 4 digits, values that did not parse using the lubridate::ymd function, and parsed dates that occur in the future.
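A sketch of these date checks under the same assumptions (the values and tibble layout are illustrative, not the 02c script):

```r
library(lubridate)
library(stringr)

# Toy values: valid date, 2-digit year, unparseable fragment, future date
raw_dates <- c("2019-03-02", "19-03-02", "3/2", "2091-01-01")
parsed <- suppressWarnings(ymd(raw_dates))
tibble::tibble(
  value       = raw_dates,
  n_digits    = str_count(raw_dates, "[0-9]"),
  n_punct     = str_count(raw_dates, "[[:punct:]]"),
  parse_fail  = is.na(parsed),
  future_date = !is.na(parsed) & parsed > Sys.Date()
)
```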
The script map is very similar to the one for number datatype:
load("02c_date_qa.RData") print(SCRIPT_MAP)
Applying the rules above to the single column returns no values to flag. All values in the VISIT_DATE variable follow the format "YYYY-MM-DD", and none of the dates occur in the future.
source("~/R/MSK/02d_time_qa.R")
load("02d_time_qa.RData")
Like the previous 2 datatypes, a series of sequential checks was conducted on the counts of different character types. The rules for the time datatype in this exercise flagged values with more than 2 punctuation marks, more than 6 digits, fewer than 3 digits, and values that did not parse using the lubridate::hm function.
The script map contains 4 columns this time:
load("02d_time_qa.RData") print(SCRIPT_MAP)
Applying the rules above to all 4 columns returns no values to flag: all data in the 4 columns contain valid hour and minute values as determined by the clinical rules applied.
The final output is final_standalone_flag_output.xlsx. The final validated data is de-duplicated, and the duplicates and flags from Steps 02a through 02d are combined into a single output flags dataframe. Additionally, any overlap between the final validated data and the flagged data is designated in an overlap tab. More detailed definitions for the three tabs follow, along with a sketch of writing the workbook after the list:
VALIDATED_DATA: all observations that have been de-duplicated and have passed the standalone datatype quality checks.
OUTPUT_FLAGS: observations that have been flagged with the reason in the FLAG_REASON variable and flag severity (hard, soft, or NA) in the FLAG_TYPE variable.
OVERLAP_INPUT_FLAGS: ideally, each observation is accounted for in either the validated data or the flagged data, but an overlap may inevitably occur. In this exercise, the overlap occurred because blank values are flagged by default, but a blank value was also included in the control data for the INSURANCE_TYPE variable. This overlap will require deciding whether blank values should be considered a unique control value, imputed with another value, or treated as missing for this specific variable.
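A minimal sketch of writing the three tabs to a single workbook, assuming the writexl package and that the three objects exist under the names used in the tab definitions above:

```r
# Hypothetical sketch: writexl writes one sheet per named list element
writexl::write_xlsx(
  list(
    VALIDATED_DATA      = VALIDATED_DATA,
    OUTPUT_FLAGS        = OUTPUT_FLAGS,
    OVERLAP_INPUT_FLAGS = OVERLAP_INPUT_FLAGS
  ),
  path = "final_standalone_flag_output.xlsx"
)
```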
Conducting a quality check past the standalone flags is beyond the scope of this assignment. However, for a more thorough quality check, soft flags (such as values greater or less than 2.5 standard deviations from the mean for enumerated datatypes) and relative flags would be employed with increasing complexity.
Soft flags are a means to ensure that outliers are indeed outliers rather than technical or human error. All flags for category data are typically soft because the control data is dynamically updated and can, in theory, have an infinite number of control values. However, the highest and lowest values in enumerated datatypes such as number and date may be fed into a pipeline involving patient record checks to ensure validity and reliability. The only exception would be the time datatype, where the data value is circular rather than linear.
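A minimal sketch of the 2.5-standard-deviation soft flag on a numeric vector (the helper name and toy data are assumptions):

```r
# Hypothetical helper: return indices of values beyond 2.5 SD from the mean
flag_sd_outliers <- function(x, n_sd = 2.5) {
  mu    <- mean(x, na.rm = TRUE)
  sigma <- sd(x, na.rm = TRUE)
  which(abs(x - mu) > n_sd * sigma)
}

ages <- c(31:50, 120)  # toy ages with one improbable value
flag_sd_outliers(ages)
#> [1] 21
```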
These flags can be relative to the datatype, patient, provider, clinic, diagnosis, and so on. For example, a relative flag for the time datatype in this data would ensure that the PT_START_TIME variable occurs prior to the PT_END_TIME variable, as sketched below. On the other hand, a flag relative to the patient record could check whether a reported appointment date falls between the patient's date of birth and date of death. These flags can scale in the number of variables being compared against each other at many levels: patient-level, provider-level, diagnosis-level, and so on.
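A sketch of the start-before-end relative rule (the flag labels are assumptions; INPUT_DATA_FINAL holds the de-duplicated data with character time values):

```r
library(lubridate)

# Hypothetical sketch of a relative rule: PT_START_TIME must precede
# PT_END_TIME within the same observation
INPUT_DATA_FINAL %>%
  dplyr::filter(hm(PT_START_TIME) > hm(PT_END_TIME)) %>%
  dplyr::mutate(FLAG_REASON = "start time after end time",
                FLAG_TYPE   = "soft")
```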
After adjudicating on the flagged and overlap data, the final validated data would be ready for analysis. Assumptions made to this effect include that blank values in the INSURANCE_TYPE variable are valid control values and are included in the data used for analysis.
The validated data has `r nrow(INPUT_DATA_FINAL)` unique observations. The original data had `r nrow(INPUT_DATA)` observations, but `r nrow(INPUT_DATA) - nrow(INPUT_DATA_FINAL)` were found to be duplicates and excluded.
The variables of the final validated data are converted to the appropriate data class using the script map created in Step 01 Dataframe Characterization, which designates the datatype.
load("01_dataframe_characterization.RData")
SCRIPT_MAP <- lapply(SCRIPT_MAP, call_mr_clean)
print(bind_rows(SCRIPT_MAP) %>% select(-QA_STEP))
INPUT_DATA_ANALYSIS <- INPUT_DATA_FINAL
Category datatype columns are kept as the character class, while number is converted to the double class, date to the Date class, and time to a period class via lubridate::hm:
# Convert date columns with lubridate::ymd
for (i in 1:nrow(SCRIPT_MAP$date)) {
  x <- SCRIPT_MAP$date$COL_NAME[i]
  x <- enquo(x)
  INPUT_DATA_ANALYSIS <- INPUT_DATA_ANALYSIS %>%
    mutate_at(vars(!!x), list(~ ymd(.)))
}
# Convert number columns to double
for (i in 1:nrow(SCRIPT_MAP$number)) {
  x <- SCRIPT_MAP$number$COL_NAME[i]
  x <- enquo(x)
  INPUT_DATA_ANALYSIS <- INPUT_DATA_ANALYSIS %>%
    mutate_at(vars(!!x), list(~ as.double(.)))
}
# Convert time columns with lubridate::hm
for (i in 1:nrow(SCRIPT_MAP$time)) {
  x <- SCRIPT_MAP$time$COL_NAME[i]
  x <- enquo(x)
  INPUT_DATA_ANALYSIS <- INPUT_DATA_ANALYSIS %>%
    mutate_at(vars(!!x), list(~ hm(.)))
}
The patient population can be captured with a histogram plotting the distribution of patient characteristics. The distribution of age is as follows:
hist(INPUT_DATA_ANALYSIS$AGE,
     col = "green4",
     breaks = 50,
     xlim = c(15, 95),
     main = "Frequency Distribution of Age",
     xlab = "Age",
     ylab = "Frequency")
It is a common assumption that the more complex the patient, the more clinic time is needed to treat the patient. In this dataset, a patient with multiple primary cancers may be considered a complex patient. However, summary data indicates that all patients in this dataset have one primary cancer diagnosis.
INPUT_DATA_PATIENT <- INPUT_DATA_ANALYSIS %>%
  group_by(PATIENT_ID) %>%
  summarize(CANCER_TYPE_NUMBER = length(unique(PATIENT_DIAGNOSIS))) %>%
  arrange(desc(CANCER_TYPE_NUMBER))
We can also delve into assessing provider performance. Again assuming that "C. S. Ming" and "C. Siu Ming" are two different providers, we can look at the patient population each provider serves. This bar chart was meant to stack patients based on cancer type, but upon visualization, it is clear that each provider treats one cancer type at this clinic.
library(ggplot2)
INPUT_DATA_PROVIDER <- INPUT_DATA_ANALYSIS %>%
  group_by(PROVIDER_NAME, PATIENT_DIAGNOSIS) %>%
  summarise(COUNT = length(PATIENT_DIAGNOSIS)) %>%
  arrange(PATIENT_DIAGNOSIS) %>%
  ungroup() %>%
  mutate(PROVIDER_NAME = factor(PROVIDER_NAME, levels = PROVIDER_NAME))
ggplot(INPUT_DATA_PROVIDER,
       aes(x = PROVIDER_NAME, y = COUNT, fill = PATIENT_DIAGNOSIS)) +
  geom_bar(stat = "identity") +
  ggtitle("Patient Panel Size and Cancer Type for Each Provider") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The total clinic time in minutes was calculated from the PT_START_TIME and PT_END_TIME variables. Relatively flagged data was identified when the PT_START_TIME variable took place after the PT_END_TIME variable. An assumption was made that these values were interchanged, so they were swapped to calculate the results rather than excluded.
The frequency distribution was then visualized for clinic time for all appointments.
INPUT_DATA_CANCER <- suppressMessages(
  INPUT_DATA_ANALYSIS %>%
    mutate(TOTAL_CLINIC_TIME = period_to_seconds(PT_END_TIME - PT_START_TIME) / 60) %>%
    # Negative durations indicate interchanged start/end times; take the absolute value
    mutate(TOTAL_CLINIC_TIME = ifelse(TOTAL_CLINIC_TIME < 0,
                                      TOTAL_CLINIC_TIME * (-1),
                                      TOTAL_CLINIC_TIME)))
hist(INPUT_DATA_CANCER$TOTAL_CLINIC_TIME,
     col = "gray",
     main = "Frequency Distribution of All Clinic Visits",
     xlab = "Total Clinic Time (Minutes)",
     xlim = c(0, 50))
While the above histogram provides a general idea of the distribution of time dedicated to patients at this clinic, the cancer types represented in this dataset (testicular, kidney, prostate, bladder) vary widely in life expectancy, treatment pathways, and prevalent patient characteristics. For example, prostate cancer is less deadly than the other cancers listed, and many patients may die of other age-related causes before the cancer progresses. Testicular cancer tends to affect younger populations, and its long-term survival rate is also higher than that of the other cancers. Bladder and kidney cancers are much more likely to affect life expectancy and require prompt, aggressive treatment.
Sieving the clinic time distribution by cancer type shows a different story. Bladder cancer appointments tend to require longer clinic times. Kidney cancer also has a higher density at clinic times greater than 30 minutes, but its distribution has the largest range of the four cancers. As expected, prostate and testicular cancers fall on the lower end of the spectrum, since these visits are more focused on surveillance than aggressive treatment.
ggplot(INPUT_DATA_CANCER, aes(TOTAL_CLINIC_TIME, fill = PATIENT_DIAGNOSIS)) +
  geom_density(alpha = 0.2) +
  ggtitle("Density of Clinic Times by Cancer Type") +
  labs(y = "Density", x = "Clinic Time (Minutes)")