knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(Ramses)

Introduction

This vignette demonstrates how to clean and validate hospital records and load them into a data warehouse.

Ramses is provided simulated datasets consisting of records of r formatC(length(unique(inpatient_episodes$encounter_id)), big.mark = ",") hospitalisations for r formatC(length(unique(inpatient_episodes$patient_id)), big.mark = ",") patients, complete with prescriptions, diagnoses and care episodes records.

Setup

First, packages must be loaded.

library(Ramses)
library(dplyr)    # to facilitate data manipulations

Second, a relational database server is needed.

For the purposes of this vignette, one may use on a small DuckDB database on the local hard drive. This database will behave in a similar way as a dedicated relational database server. This method should only be used for training with simulated data as it is not a secure method of handling actual patient data.

ramses_db <- connect_local_database("ramses-db.duckdb")

Alternatively, one may use a dedicated PostgreSQL server (other SQL databases are not currently supported). In this case, we use the DBI and RPostgres packages to initiate the connection:

ramses_db <- DBI::dbConnect(
  RPostgres::Postgres(), 
  host = 'example_server_name' ,
  dbname = 'example_database_name',
  user = 'username', 
  password = 'password', 
  port = 5432, 
  # Do not forget to specify a time zone to correctly collect datetime records
  timezone = "Europe/London")

# Specify the name of the schema Ramses should use on the database 
# (if not, Ramses will use 'public' by default)
DBI::dbExecute(ramses_db, "SET search_path = ramses_schema, public;")

Whatever method is used, the database connection information is stored in an object (here ramses_db) which must be provided to most functions in the Ramses package.

We can now begin preparing records and loading them into the database. This involves three steps:

  1. Transform: adding metadata and transforming records to conform with Ramses data validation constraints
  2. Validate: using dedicated functions (eg: validate_inpatient_episodes()) to verify that all data conform to Ramses data validation constraints
  3. Load: loading records into a SQL database.

Medication records

Drug prescriptions

We begin with prescription records (you may choose to start with other records).

str(drug_prescriptions)

Prescriptions records must conform to data validation requirements listed in the documentation for validate_prescriptions()

We first map drug concepts using the as.ab() function provided by the AMR package. This function takes a character variable as input and returns a three-letter code identifying antifungals and antibacterials. These give access to the AMR package's wide range of analytical methods. Although the AMR package does not currently provide equivalent mapping for antiparasitic and antiviral drugs, these can be mapped to any substance-level terminology of your choice. RxNorm Ingredients and SNOMED CT UK Edition Virtual Therapeutic Moieties are both good options.

# attempting to map drug name using AMR package
drug_prescriptions$drug_code <- AMR::as.ab(drug_prescriptions$tr_DESC)

Some manual editing may be required when drugs are not successfully mapped, as shown in the example below.

# editing drug names
drug_prescriptions$drug_code <- gsub("Vancomycin protocol", 
                                   "Vancomycin",
                                   drug_prescriptions$tr_DESC)
# mapping drug name using AMR package
drug_prescriptions$drug_code <- AMR::as.ab(drug_prescriptions$drug_code)
drug_prescriptions$drug_name <- AMR::ab_name(drug_prescriptions$drug_code)

We also map the route of administration to the World Health Organisation Anatomical Therapeutic Chemical Classification System (ATC) codes:

# recoding route of administration
drug_prescriptions <- mutate(drug_prescriptions, 
  ATC_route = 
    case_when(
      route %in% c(NULL) ~ "Implant", 
      route %in% c("NEB", "INHAL") ~ "Inhal", 
      route %in% c("TOP", "EYE", "EYEL", "EYER", "EYEB", 
                   "EAR", "EARL", "EARR", "EARB") ~ "Instill", 
      route %in% c("NASAL", "NOST", "NOSTL", "NOSTR", "NOSTB") ~ "N", 
      route %in% c("ORAL", "NAS", "PEG") ~ "O", 
      route %in% c("IV", "IVB", "IVI", "IMI", "IT", "IVT") ~ "P", 
      route %in% c("PR") ~ "R", 
      route %in% c("BUCC", "SB", "OROM", "SUBL") ~ "SL", 
      route %in% c("SC", "ID") ~ "TD", 
      route %in% c("PV") ~ "V", 
      TRUE ~ NA_character_
    ))

Using the AMR package, it is possible to infer important properties of antibacterial/antifungal drugs:

# extracting the ATC code (eg J01CR02) group (eg Beta-lactam antibacterials, penicillins)
drug_prescriptions$ATC_code <- AMR::ab_atc(drug_prescriptions$drug_code, only_first = TRUE)
drug_prescriptions$ATC_group <- AMR::ab_atc_group1(drug_prescriptions$drug_code)
# removing antifungal drugs, if applicable
drug_prescriptions <- filter(drug_prescriptions, 
                             ATC_group != "Antimycotics for systemic use")

Working out the prescription DDDs can be complex in the case of compound medications. The ATC provides reference DDDs with respect to the main substance in a given medication. For example:

paste0(AMR::ab_ddd("co-amoxiclav", "oral"), 
       AMR::ab_ddd("co-amoxiclav", "oral", unit = T))

In order to obtain the correct DDDs for compound medications, it is recommended to map data to a suitable dictionary of medicines such as the RxNorm or the SNOMED CT UK Drug Extension and extract the corresponding ATC codes and DDDs.

In this example, however, we use a look-up table for the strength of compound medications.

# prepare DDD extraction
compound_strength_lookup <- data.frame(list(
  drug_code = c("AMC", "AMC", "TZP", "SMX"),
  route = c("oral", "oral", "oral", "oral"),
  dose = c(625, 1.2, 4.5, 480),
  units = c("mg", "g", "g", "mg"),
  strength = c(500, 1, 4, 400),
  basis_of_strength = c("AMX", "AMX", "PIP", "SMX")
), stringsAsFactors = F)

drug_prescriptions <- merge(drug_prescriptions, 
                            compound_strength_lookup, 
                            all.x = T)

drug_prescriptions <- drug_prescriptions %>% 
  mutate(strength = if_else(is.na(strength), dose, strength),  
         basis_of_strength = if_else(is.na(basis_of_strength),
                                     as.character(drug_code),
                                     basis_of_strength))

Ramses provides another look-up table Ramses::reference_drug_frequency to help convert the frequency character variable into a numeric variable indicating the daily frequency of administration.

drug_prescriptions <- merge(drug_prescriptions, 
                            reference_drug_frequency, by = "frequency", all.x = T)

We can now compute the prescription DDD based on the ATC reference DDDs and the daily frequency of administration:

drug_prescriptions <- drug_prescriptions %>% 
  mutate(daily_dose = strength * daily_frequency) %>% 
  mutate(DDD = compute_DDDs(
    ATC_code = AMR::ab_atc(basis_of_strength, only_first = TRUE),
    ATC_administration = ATC_route,
    dose = daily_dose,
    unit = units)) %>% 
  mutate(duration_days = if_else(
    daily_frequency == -1,
    "one-off", 
    if_else(
      round(difftime(prescription_end, 
                     prescription_start, 
                     units = "days")) == 1,
      "1 day", 
      paste(round(difftime(prescription_end, 
                           prescription_start, 
                           units = "days")),
            "days")
      )
  ))
drug_prescriptions <- drug_prescriptions %>% 
  transmute(patient_id,
            prescription_id,
            # prescription summary for displaying in interfaces
            prescription_text = paste0(
              drug_name, " ", 
              route, " ", 
              dose, units, " ", 
              duration_days
            ),
            drug_code,
            drug_name = drug_name,
            drug_display_name = drug_name,
            # in this example dataset, only antibacterial drugs are included
            antiinfective_type = "antibacterial", 
            ATC_code,
            ATC_group,
            ATC_route,
            authoring_date,
            prescription_start,
            prescription_end,
            prescription_status = "completed",
            prescription_context = "inpatient",
            dose,
            unit = units,
            route,
            frequency,
            daily_frequency,
            DDD)

Drug administrations

An identical process is followed for drug administration, with the difference that DDDs refer to the dose of a single administration, and thus do not involve a frequency variable.

drug_administrations$drug_code <- gsub("Vancomycin protocol", 
                                     "Vancomycin", 
                                     drug_administrations$tr_DESC)
drug_administrations$drug_code <- AMR::as.ab(drug_administrations$drug_code)
drug_administrations$drug_name <- AMR::ab_name(drug_administrations$drug_code)
drug_administrations$drug_group <- AMR::ab_group(drug_administrations$drug_code)
# recoding route of administration
drug_administrations <- mutate(drug_administrations, 
  ATC_route = 
    case_when(
      route %in% c(NULL) ~ "Implant", 
      route %in% c("NEB", "INHAL") ~ "Inhal", 
      route %in% c("TOP", "EYE", "EYEL", "EYER", "EYEB", 
                   "EAR", "EARL", "EARR", "EARB") ~ "Instill", 
      route %in% c("NASAL", "NOST", "NOSTL", "NOSTR", "NOSTB") ~ "N", 
      route %in% c("ORAL", "NAS", "PEG") ~ "O", 
      route %in% c("IV", "IVB", "IVI", "IMI", "IT", "IVT") ~ "P", 
      route %in% c("PR") ~ "R", 
      route %in% c("BUCC", "SB", "OROM", "SUBL") ~ "SL", 
      route %in% c("SC", "ID") ~ "TD", 
      route %in% c("PV") ~ "V", 
      TRUE ~ "NA_character_"
    ))
drug_administrations$ATC_code <- AMR::ab_atc(drug_administrations$drug_code, only_first = TRUE)
drug_administrations <- filter(drug_administrations, 
                               drug_group != "Antifungals/antimycotics")

drug_administrations <- merge(drug_administrations, compound_strength_lookup, all.x = T)
drug_administrations <- drug_administrations %>% 
  mutate(strength = if_else(is.na(strength), dose, strength),
         basis_of_strength = if_else(is.na(basis_of_strength),
                                     as.character(drug_code),
                                     basis_of_strength))

drug_administrations <- drug_administrations %>% 
  mutate(DDD = compute_DDDs(
    ATC_code = AMR::ab_atc(basis_of_strength, only_first = TRUE),
    ATC_administration = ATC_route,
    dose = dose,
    unit = units 
  )) 

# Create an identifier
# Note: An alternative is encryption using openssl::sha256()

drug_administrations <- drug_administrations %>%
  group_by(patient_id, drug_code, route,
           dose, units, administration_date) %>% 
  mutate(administration_id = dplyr::cur_group_id()) %>% 
  ungroup()

drug_administrations <- drug_administrations %>% 
    transmute(
      patient_id,
      administration_id = as.character(administration_id),
      prescription_id,
      administration_text = paste0(
          drug_name, " ", route, " ", dose, units),
      drug_code,
      drug_name,
      drug_display_name = drug_name,
      drug_group,
      # in this example dataset, only antibacterial drugs are included
      antiinfective_type = "antibacterial", 
      ATC_code,
      ATC_route,
      dose,
      unit = units,
      route,
      administration_date,
      administration_status = "completed",
      DDD
    )

This data frame can now be validated, before being loaded into the database.

validate_administrations(drug_administrations)

load_medications(conn = ramses_db, 
                 prescriptions = drug_prescriptions, 
                 administrations = drug_administrations, 
                 overwrite = TRUE,
                 silent = FALSE)

Inpatient care records

Admissions

We now turn to records of patient demographics, admissions, discharges and transfers. Ramses is designed around the notion of hospital encounters identified by variable encounter_id, which correspond to the window from the time a patient is admitted to the time a patient is discharged.

In addition, Ramses use subdivisions of hospital encounters called episodes of care, which designate a continuous period of inpatient care under the responsibility of a department, specialty, or clinical team. This corresponds to the 'finished consultant episode' in the UK NHS Admitted Patient Care data model [@Herbert2017]. All hospital encounters consist of at least one episode of care. Episodes of care are used to calculate bed-days as well as rates of infection, prescribing and other stewardship metrics for individual specialties and teams of clinicians.

Hospital encounters and episodes of care are recorded in a single table in which every row corresponds to one episode of care.

str(inpatient_episodes)

Dates of admissions and episode start/end must be valid:

validate_inpatient_episodes(patients = patients,
                            episodes = inpatient_episodes)

Optionally, Ramses can use data on ward movements. These records must be validated together with episodes of care

validate_inpatient_episodes(patients = patients,
                            episodes = inpatient_episodes, 
                            wards = inpatient_wards)

The data can now be loaded into the database.

load_inpatient_episodes(conn = ramses_db,
                        patients_data = patients,
                        episodes_data = inpatient_episodes,
                        wards_data =  inpatient_wards)

Diagnosis codes (ICD-10)

In order to exploit clinical diagnoses coded by doctors during hospital admissions, Ramses relies on a range of look up tables which must be built to suit the precise ICD-10 version used by the data. Ramses uses three types of look up tables:

ICD-10 reference table

This example uses the Ramses::inpatient_diagnoses example dataset. First, we download the ICD-10-CM reference look up table

icd10cm <- download_icd10cm()
str(icd10cm)

We verify that all ICD-10 codes in the dataset can be mapped to the reference table. The first validation fails, because some of the icd_code in inpatient_diagnoses are missing (all diagnoses that are not infections). These are removed before validation is attempted again. In the present example, the dataset is simulated using ICD-10 5th Edition, rather than ICD-10-CM. The match is therefore not complete on the second validation attempt. Although the validation produces a warning, it passes (returns TRUE). This means it will not prevent loading the data into the warehouse.

validate_inpatient_diagnoses(inpatient_diagnoses, icd10cm)
inpatient_diagnoses <- filter(inpatient_diagnoses, !is.na(icd_code))
validate_inpatient_diagnoses(inpatient_diagnoses, icd10cm)

Next, we classify the reference look-up table to other classifications:

Although this is done automatically when loading diagnoses into the warehouse, this step can be useful to first check that the ICD-10 reference table is successfully mapped to other classifications.

icd_infection_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_infections_abx_indications(df = ., icd_column = "icd_code")

comorbidity_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_charlson_comorbidities(df = ., icd_column = "icd_code")

ccsr_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_ICD10_CCSR(df = ., icd_column = "icd_code")

ccs_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_ICD10_CCS(df = ., icd_column = "icd_code")

head(icd_infection_lkup)

Data may now be loaded into the database. Diagnoses and their reference look-up are loaded in one step.

load_inpatient_diagnoses(conn = ramses_db, 
                         diagnoses_data = inpatient_diagnoses, 
                         diagnoses_lookup = icd10cm, 
                         overwrite = TRUE)

Investigation records

Microbiology culture isolates and susceptibilities

A small sample of records of microbial isolates is provided along with susceptibility testing results:

str(Ramses::inpatient_microbiology)

Before validating and loading these records into the database, some transformations will be required to create three separate data frames:

  1. specimens contains one row per microbial sample sent to the laboratory
  2. isolates contains zero, one, or multiple rows for every specimen -- as many rows as the number of microorganisms isolated from the specimen, whatever the technique (eg mass spectrometry, PCR, bacterial culture)
  3. susceptibility contains results of susceptibility testing: one row per substance tested.

For this, we create a list of data frames named microbiology_data, and commence with cleaning the raw data. We employ the AMR package to code microorganism species names (AMR::as.mo()) and antimicrobial names (AMR::as.ab()).

microbiology_data <- list()

# Start by cleaning the raw data
microbiology_data$raw <- inpatient_microbiology
microbiology_data$raw <- microbiology_data$raw %>% 
  mutate(
    organism_code = AMR::as.mo(if_else(
      organism_display_name == "No growth",
      NA_character_,
      organism_display_name)),
    agent_code = AMR::as.ab(agent_display_name)) %>% 
  mutate(organism_name = AMR::mo_name(organism_code),
         agent_name = AMR::ab_name(agent_code))

As for specimen types (eg bood, bronchoalveolar lavage fluid, etc.), they must be coded using SNOMED CT specimen concepts. A complete list of concepts from the SNOMED CT International Edition can be found in Ramses::reference_specimen_type.

Susceptibility testing results in the example dataset are already interpreted as Resistant/Increased exposure/Susceptible (RSI). Where needed, minimum inhibitory concentrations and disk diffusion diameters can be interpreted as RSI using the as.sir() function provided by the AMR package.

If possible, use the most specific concept available (eg 309176002 | Bronchial brushings sample (specimen) |) over a more generic term (eg 258415003 | Biopsy sample (specimen) |). Specific concepts can always be reclassified to higher-level concepts using SNOMED CT inference (in this case, not just 258415003 | Biopsy sample (specimen) |, but also 258603007 | Respiratory sample (specimen) |). The snomedizer package can be used to query SNOMED CT concept parents and descendants.

microbiology_data$raw <- microbiology_data$raw %>% 
  mutate(specimen_type_code = case_when(
    specimen_type_display == "Blood Culture" ~ 
      "446131002", # Blood specimen obtained for blood culture
    specimen_type_display == "Faeces" ~ 
      "119339001", # Stool specimen
    specimen_type_display == "MRSA Screen" ~ 
      "697989009", # Anterior nares swab 
    specimen_type_display == "Urine" ~ 
      "122575003", # Urine specimen
    TRUE ~ NA_character_
  )) %>% 
  left_join(transmute(reference_specimen_type,
                      specimen_type_code = conceptId,
                      specimen_type_name = pt_term))

The three data frames can now be created.

microbiology_data$specimens <- microbiology_data$raw %>% 
  transmute(specimen_id,
            patient_id,
            status = "available",
            specimen_datetime,
            specimen_type_code,
            specimen_type_name,
            specimen_type_display) %>% 
  distinct() # Removing duplicates created by multiple isolates and susceptibility testing

microbiology_data$isolates <- microbiology_data$raw %>% 
  transmute(isolate_id,
            specimen_id,
            patient_id,
            organism_code,
            organism_name,
            organism_display_name,
            isolation_datetime) %>% 
  distinct() # Removing duplicates created by susceptibility testing

microbiology_data$susceptibilities <- microbiology_data$raw %>% 
  filter(!is.na(organism_code)) %>%  # Remove no growth
  transmute(isolate_id,
            specimen_id,
            patient_id,
            organism_code,
            organism_name,
            organism_display_name,
            agent_code,
            agent_name,
            agent_display_name,
            rsi_code,
            concept_code = NA_character_) %>% 
  distinct()

An optional step is to classify isolates by multidrug resistance phenotype [@Magiorakos2012]. The AMR package function AMR::mdro() provides classification based on:

In this vignette, we adopt intrinsic resistances as defined by EUCAST guidelines [@EUCASTExpertRules32]. The tidyr package is used to pivot susceptibility test results into wide format (antimicrobials arranged as columns).

mdr_classification <- microbiology_data$susceptibilities  %>% 
  select(isolate_id,
         organism_code,
         agent_code,
         rsi_code) %>% 
  tidyr::pivot_wider(id_cols = c("isolate_id", "organism_code"),
                     names_from = agent_code,
                     values_from = rsi_code)

mdr_classification$multidrug_resistance <- 
  as.character(AMR::mdro(
    x = mdr_classification,
    guideline = "EUCAST",
    col_mo = "organism_code"))

# Merging back into the isolates dataset
microbiology_data$isolates <- left_join(
  microbiology_data$isolates,
  select(mdr_classification, isolate_id, multidrug_resistance)
)

The data may now be validated, and loaded.

validate_microbiology(specimens = microbiology_data$specimens,
                      isolates = microbiology_data$isolates,
                      susceptibilities = microbiology_data$susceptibilities)
load_inpatient_microbiology(
  conn = ramses_db,
  specimens = microbiology_data$specimens,
  isolates = microbiology_data$isolates,
  susceptibilities = microbiology_data$susceptibilities,
  overwrite = TRUE
)

Clinical investigations

Finally, we import all other observation and test results, from basic vitals such as respiration rate and blood pressure, to urine antigen test results. Analysis of such data requires the use of standard test vocabularies.

The international standard code system for laboratory data is LOINC. A sample of important codes may be found in dataset Ramses::reference_loinc. Other observable entity concepts or evaluation procedure concepts may be imported from SNOMED CT.

data(inpatient_investigations)
str(inpatient_investigations)

One particular set of validation constraints apply:

Many important units in medicine are not listed in the UCUM. For instance:

By default, these three custom units are passed to the custom_units argument of validate_investigations(). Any other custom unit in the dataset must be handled with this argument.

validate_investigations(inpatient_investigations, 
                        custom_units = c("breaths", "beats", "U"))

Data may now be loaded into the database.

load_inpatient_investigations(
  conn = ramses_db,
  investigations_data = inpatient_investigations, 
  overwrite = TRUE
)

Final steps

There is one final step to create ancillary tables:

# Create bridge tables
bridge_tables(conn = ramses_db, overwrite = TRUE)

# Create reference AWaRe table (see ?reference_aware)
dplyr::copy_to(dest = ramses_db,
               df = filter(reference_aware, version == "England" & year == "2019"),
               name = "reference_aware",
               temporary = FALSE,
               overwrite = TRUE)

The database is now ready.

The main vignette (getting started) demonstrates how to use it.

Once you are done

Always remember close database connections when you are finished.

DBI::dbDisconnect(ramses_db, shutdown = TRUE)
file.remove("ramses-db.duckdb")

References



ramses-antibiotics/ramses-package documentation built on Feb. 13, 2024, 1:01 p.m.