Pharmacoepidemiology with doseminer

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(knitr.kable.NA = '')

This vignette presents an example analysis that might resemble a real-world study in pharmacoepidemiology. For a quick look at the functions and utilities available in doseminer, see the Introduction to doseminer vignette.

Extract dosages from free-text prescriptions

Let's import an example dataset containing prescriptions in free-text form. The data include product codes (prodcode) identifying the drugs prescribed; patient identifiers (patid); the date of the prescription start (event_date); the total quantity of drug prescribed (qty) and the actual free text (text) containing the dosage instructions for the medication.

Strictly speaking, doseminer uses only the last of these, the free text. Combined with the other variables, however, the extracted dosages let us make inferences about patients' drug exposure.

data(cprd, package = 'doseminer')
str(cprd)

Extract dosage information from the text. To avoid redundant computation, we remove duplicates, so each unique text string is only processed once. The results can then be joined back with the original prescriptions data, using the raw column from the output data frame.

The doseminer function extract_from_prescription() currently accepts only a character vector as input (not a single-column data frame), so you should extract the text column as a vector before processing, for example with pull() or, as below, with().

library(doseminer)

free_text <- with(cprd, text[!duplicated(text) & nchar(text) > 0])
extracted <- extract_from_prescription(free_text)
head(extracted)

Now, we can relate the extracted prescription information back to the original dataset.

dosages <- merge(extracted, cprd, by.x = 'raw', by.y = 'text', all.x = TRUE)
head(dosages)

Estimating drug exposure

The original data provided the total quantity of drug and the start date, but not an end date. Using the information that doseminer infers about daily dose, we can estimate the number of days the patient can go at that average dose before they run out of medication. Hence we estimate a window of time that a patient was taking (exposed to) the drug, which can be used to determine if adverse events (e.g. fractures, given as separate data) occurred during drug exposure or not.
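
As a minimal sketch of that last step, the base R code below flags whether each adverse event falls inside a patient's exposure window. The data are invented for illustration, and the column names start, end and fracture_date are assumptions, not part of the cprd dataset or of doseminer.

```r
# Toy exposure windows: one row per patient (hypothetical data)
exposure <- data.frame(
  patid = c(1, 2),
  start = as.Date(c("2020-01-01", "2020-03-01")),
  end   = as.Date(c("2020-01-15", "2020-03-29"))
)

# Toy adverse events, supplied as separate data (hypothetical)
events <- data.frame(
  patid = c(1, 2),
  fracture_date = as.Date(c("2020-01-10", "2020-05-01"))
)

# Attach each event to the patient's window, then test whether it
# happened between the start and end of exposure (inclusive)
both <- merge(events, exposure, by = "patid")
both$exposed <- both$fracture_date >= both$start &
  both$fracture_date <= both$end
both
```

With several prescriptions per patient, the merge would yield one row per event and window pair, so the flags would need to be combined per event, for instance with any().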

Missing values

You might notice that some values are missing, either because they are not explicitly mentioned in the prescription text or because the text itself was missing. In general, there is a range of methods one might use to impute or exclude such values. The topic is beyond the scope of doseminer, but it is the focus of an upcoming package called DrugPrepCPRD, which explores the 'multiverse' of possible imputation decisions.

For now, we will either (a) ignore incomplete prescriptions (complete case analysis) or (b) replace missing values with the mean for that patient and drug.
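
The two options can be sketched in base R as follows. The data frame rx and its daily_dose column are made up for illustration. Note that a patient and drug combination with no observed values at all has no mean to borrow, so it stays missing under option (b).

```r
# Toy prescriptions with some missing daily doses (hypothetical data)
rx <- data.frame(
  patid      = c(1, 1, 1, 2, 2),
  prodcode   = c("a", "a", "a", "b", "b"),
  daily_dose = c(2, NA, 4, NA, NA)
)

# (a) Complete case analysis: drop prescriptions with a missing value
complete <- rx[!is.na(rx$daily_dose), ]

# (b) Mean imputation within each patient and drug
group_mean <- ave(rx$daily_dose, rx$patid, rx$prodcode,
                  FUN = function(v) mean(v, na.rm = TRUE))
imputed <- rx
missing <- is.na(imputed$daily_dose)
imputed$daily_dose[missing] <- group_mean[missing]
imputed
```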

In other scenarios, you might see a range for the dose, frequency or interval: for example "take 1-2" or "every 2-3 hours". Again, you can choose how to summarise these values: taking the minimum, maximum or mean. If a dose is optional, you might want to include the value zero in this range. You should ensure your results are robust to this decision (again: see DrugPrepCPRD).
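
One possible way to make this choice explicit is a small helper like the one below. The function summarise_range() is hypothetical and not part of doseminer. It assumes ranges are written as two numbers separated by a hyphen, and it takes the "mean" of a range to be its midpoint.

```r
# Summarise a range string such as "1-2" as a single number.
# `how` picks the summary; `optional = TRUE` adds zero to the range.
summarise_range <- function(x, how = c("mean", "min", "max"),
                            optional = FALSE) {
  how <- match.arg(how)
  bounds <- lapply(strsplit(as.character(x), "-", fixed = TRUE), as.numeric)
  vapply(bounds, function(b) {
    # Empty or unparseable input stays missing
    if (length(b) == 0 || anyNA(b)) return(NA_real_)
    if (optional) b <- c(0, b)
    switch(how,
           mean = mean(range(b)),  # midpoint of the range
           min  = min(b),
           max  = max(b))
  }, numeric(1))
}

summarise_range(c("1-2", "2", NA))
summarise_range("1-2", how = "min", optional = TRUE)
```

Re-running an analysis with each value of how is a simple robustness check on this decision.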

Drug exposure time

The length of a prescription, in days, is defined as the total quantity of drug (qty) divided by the average number of units administered per day. In turn, the average number of units per day is calculated as the dose in each sitting, multiplied by the daily frequency (freq) and divided by the interval between 'dose-days' (itvl).
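
As a worked example of this definition, consider a prescription for 56 tablets, taken two at a time, twice a day, every day. The helper prescription_days() below is a hypothetical illustration, not a doseminer function.

```r
# Length of a prescription in days, following the definition above:
# total quantity divided by the average number of units per day
prescription_days <- function(qty, dose, freq, itvl = 1) {
  daily_dose <- dose * freq / itvl
  qty / daily_dose
}

# 56 tablets, 2 per sitting, 2 sittings per day, every day: 14 days
prescription_days(qty = 56, dose = 2, freq = 2, itvl = 1)

# The same prescription taken only every other day lasts twice as long
prescription_days(qty = 56, dose = 2, freq = 2, itvl = 2)
```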

Here is one way of estimating drug exposure windows for these data.

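library(dplyr)
library(tidyr)
library(ggplot2)

dosages %>%
  # Split dose ranges such as "1-2" into lower and upper bounds;
  # single values leave max_dose as NA
  separate(dose, c('min_dose', 'max_dose'), sep = '-',
           convert = TRUE, fill = 'right') %>%
  # Use the midpoint of a dose range, or the single value if there is no range;
  # treat a missing interval as one day, i.e. the drug is taken every day
  mutate(dose = coalesce((min_dose + max_dose) / 2, min_dose),
         itvl = replace_na(as.numeric(itvl), 1),
         freq = as.numeric(freq),
         daily_dose = freq * dose / itvl,
         end_date = date + qty / daily_dose) %>%
  # Draw one horizontal bar per prescription, from start date to estimated end
  ggplot() +
  aes(y = as.factor(patid), xmin = date, xmax = end_date) +
  geom_errorbarh(height = .5) +
  ylab('patient ID')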