knitr::opts_chunk$set( eval = FALSE, collapse = TRUE, comment = "#>" )
UKB primary care presciption data are included in the Primary Care Linked data, accessed through the UKB Data Portal. The data can be explored online with SQL queries, the results of which can be downloaded. Alternatively, the full table can be downloaded.
For each KCL UKB application with access to primary care prescription
data, we have downloaded the full table and converted it to an R
disk.frame for fast local queries with ukbkings bio_record
and
bio_record_map
. (See the article
Access UKB data on Rosalind/CREATE
for a detailed description of the use of bio_record
and bio_record_map
to query all
record-level data available to a particular KCL UKB application)
bio_record
returns a diskframe, an object which can be queried in
chunks without reading the data into R.
library(ukbkings) library(dplyr) library(stringr) library(readxl) library(lubridate) project_dir <- "<path_to_project_directory>" diskf <- bio_record(project_dir, record = "gp_scripts") head(diskf)
eid data_provider issue_date read_2 bnf_code dmd_code drug_name quantity
1: XXXXXXX X XX/XX/XXXX bi33. 02050501
You can use dplyr verbs to filter
rows, select
columns, etc., and
finally collect
to retrieve the resulting data into a local dataframe.
Below is an example of using a combination of bio_record
and dplyr
verbs to retrieve antidepressant prescriptions.
The primary care prescription data include 3 prescription coding systems
| code | meaning | |---|---| | bnf_code | British National Formulary (BNF) is a reference book containing the standard list of medicines used in UK | | dmd_code | Dictionary of Medicines and Devices (dm+d), the catalogue of things that can be prescribed in the NHS | | read_2 | Read Codes (version 2 (v2)) are a coded thesaurus of clinical terms (used in the NHS since 1985). There is also a version 3 (CTV3 or v3) |
To retrieve prescriptions for a particular drug or drug class, search all three drug code systems and drug_name which includes name and dosage information combined. Below is an example retrieving antidepressant prescriptions.
bnf <- bio_record(project_dir, record = "gp_scripts") %>% filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>% collect()
Here I'm using bio_code_primary_care
to read
the UKB-supplied primary care code lookup tables to find Read 2 code
equivalents of BNF antidepressant codes.
read2_from_bnf <- bio_code_primary_care(project_dir, "read_v2_drugs_bnf") %>% filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>% pull(read_code) # Make a regular expression for search filter read2_rgx <- read2_from_bnf %>% str_replace_all("\\.", "\\\\.") %>% str_c("^", ., collapse = "|") %>% regex() read2 <- bio_record(project_dir, record = "gp_scripts") %>% filter(str_detect(read_2, read2_rgx)) %>% collect() %>% mutate(read_2 = str_replace(read_2, "00$", ""))
The mutate
drops a trailing "00" on some read2 codes (a UKB artifact?).
Note. Fabbri et al. (2021) Genetic and clinical characteristics of treatment-resistant depression using primary care records in two UK cohorts use an augmented list of read 2 codes, which includes the additional codes: du6z. du61. gdez. gdey. gde2. gde1.
This list is in supplementary material table 1, available from the UKB resources folder on Rosalind/CREATE. (See the paper's code for treatment-resistant depression phenotype derivation, as well as antidepressant extraction)
# -curated antidepressants - PMID33753889 cf_path <- file.path( dirname(project_dir), "resources/pmid33753889_medrxiv_supplementary_table1.xlsx" ) excel_sheets(cf_path)
[1] "readv2_mood_psych_sub" "readv3_mood_psych_sub" "ADs_read_2_codes" "bnf_drug_names" "dmd_drug_names" "read2_drug_names"
Chiara's augmented list of read 2 codes is in the sheet "ADs_read_2_codes".
# CF Read 2 code list cf_path <- file.path( dirname(project_dir), "resources/pmid33753889_medrxiv_supplementary_table1.xlsx" ) read2_from_cf <- read_excel(cf_path, sheet = "ADs_read_2_codes") %>% pull(read2_code) read2_rgx <- read2_from_cf %>% str_replace_all("\\.", "\\\\.") %>% str_c("^", ., collapse = "|") %>% regex() read2 <- bio_record(project_dir, record = "gp_scripts") %>% filter(str_detect(read_2, read2_rgx)) %>% collect() %>% mutate(read_2 = str_replace(read_2, "00$", ""))
You could also search for known dm+d codes in column dmd_code, and/or
known drug names in drug_name, to retrieve any prescriptions with
missing BNF and Read 2 codes. For example, Chiara's list of curated
antidepressant drug names is included in the ukbkings dataset
drug_dmd_antidep
.
drug_name_rgx <- str_c(drug_dmd_antidep$dmd_name) %>% str_c(collapse = "|") %>% regex(ignore_case = TRUE) other <- bio_record(project_dir, record = "gp_scripts") %>% filter(bnf_code == "" & read_2 == "") %>% filter(str_detect(drug_name, drug_name_rgx)) %>% collect()
The filter
on empty strings (missing values) in bnf_code and read_2
code searches among prescriptions not already captured by a bnf or
read 2 code.
Merge prescriptions
px <- bind_rows(bnf, read2, other) %>% distinct() %>% as_tibble()
Replace empty strings with NA
, fix column types, and drop
prescriptions without an associated date (if appropriate).
px <- px %>% mutate( across(.fns = ~ na_if(.x, "")), across(read_2:quantity, as.character), issue_date = as_date(issue_date, format = "%d/%m/%Y"), ) %>% filter(!is.na(issue_date))
The drug_name and quantity columns include a mix of incomplete information (e.g., drug, dosage, form (i.e., capsule, tablet, etc), count, company) and style (upper/lower case, integer/floating point, same form expressed in many ways (e.g., tab, tablet, tablets, tablet(s))). A random sample of 20 rows shows some the above
eid data_provider issue_date read_2 bnf_code dmd_code drug_name quantity <int> <int> <date> <chr> <chr> <chr> <chr> <chr>
1 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Citalopram 20mg tablets 28 tablets
2 XXXXXXX X XXXX-XX-XX NA 04.03.04.00.00 NA Venlafaxine 75mg modified-release capsules 56 capsule(s) - 75 mg
3 XXXXXXX X XXXX-XX-XX NA 0403010N0AAAEAE NA IMIPRAMINE HYDROCHLORIDE TABLETS 25MG 60.000
4 XXXXXXX X XXXX-XX-XX da91. NA NA NA NA
5 XXXXXXX X XXXX-XX-XX da93. NA NA NA NA
6 XXXXXXX X XXXX-XX-XX NA 040301 NA Dosulepin Hydrochloride CAPS 25MG 168.000
7 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Prozac 20mg capsules (Eli Lilly and Company Ltd) 1 OP
8 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Fluoxetine 20mg capsules 60 capsule
9 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Fluoxetine 20mg capsules 60 capsules
10 XXXXXXX X XXXX-XX-XX NA 04.03.04.00.00 NA Venlafaxine 75mg modified-release capsules 84 capsule(s)
11 XXXXXXX X XXXX-XX-XX NA 04.03.04.00.00 NA Mirtazapine 45mg tablets 28 tablet
12 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Sertraline 100mg tablets 7 tablet
13 XXXXXXX X XXXX-XX-XX da41. 04030300 NA Fluoxetine 20mg capsules 30.000
14 XXXXXXX X XXXX-XX-XX d712. NA 321746008 AMITRIPTYLINE HCl tabs 25mg 56.000
15 XXXXXXX X XXXX-XX-XX NA 04.03.01.00.00 NA Amitriptyline 25mg tablets 168 tablet(s) - 25 mg
16 XXXXXXX X XXXX-XX-XX daBz. NA NA NA NA
17 XXXXXXX X XXXX-XX-XX NA 04.03.01.00.00 NA Amitriptyline 10mg tablets 28 tablets
18 XXXXXXX X XXXX-XX-XX NA 04.03.01.00.00 NA Clomipramine 50mg capsules 168 capsules - 50 mg
19 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Sertraline 100mg tablets 14 tablets
20 XXXXXXX X XXXX-XX-XX d79z. NA NA NA NA
In this subset of prescriptions, there are 1,655 and 4,830 unique entries for drug_name and quantity respectively.
px %>% distinct(drug_name) %>% nrow() px %>% distinct(quantity) %>% nrow()
[1] 1655
[1] 4830
A first pass tidy up of these two columns significantly reduces the number of unique entries.
tidy_rgx <- "[\\*\\-\\&\\'\\,]|\\(.*[\\)\\.{3}]|\\.{3}|\\.000$|\\[.*\\]|[\\[\\]]" # \\/ part of dosage px <- px %>% mutate( across(c(drug_name, quantity), ~ { tolower(.x) %>% str_replace_all(tidy_rgx, " ") %>% str_trim() %>% str_squish() }) ) px %>% distinct(drug_name) %>% nrow() px %>% distinct(quantity) %>% nrow()
[1] 1116
[1] 3568
This data will require further tidying, some of which will be specific to the subset of prescriptions you collect. In the following, I've created a series of regular expressions to capture alternatives for the same descriptor, then added a new columns for drug, dose, and form.
tab_rgx <- c("tablets", "tablet", "tabs", "tab", "soltab") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() cap_rgx <- c("capsules", "capsule", "caps", "cap", "ca", "cps", "capsules\\d+mg") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() sol_rgx <- c("solution", "soln") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() drp_rgx <- c("drop", "drops") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() sus_rgx <- c("suspension", "susp", "sus") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() syr_rgx <- c("syrup", "syrp", "syr") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() liq_rgx <- c("liquid", "liq") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() eli_rgx <- c("elixir", "eli") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() mix_rgx <- c("mixture", "mix") %>% str_c("\\b", ., "\\b", collapse = "|") %>% regex() px <- px %>% mutate( drug = word(drug_name, 1), dose = str_extract(drug_name, "\\d*\\ *mg|\\d*\\ *microgram") %>% str_trim(), form = case_when( str_detect(drug_name, tab_rgx) | str_detect(quantity, tab_rgx) ~ "tablet", str_detect(drug_name, cap_rgx) | str_detect(quantity, cap_rgx) ~ "capsule", str_detect(drug_name, sol_rgx) | str_detect(quantity, sol_rgx) ~ "solution", str_detect(drug_name, drp_rgx) | str_detect(quantity, drp_rgx) ~ "drops", str_detect(drug_name, sus_rgx) | str_detect(quantity, sus_rgx) ~ "suspension", str_detect(drug_name, syr_rgx) | str_detect(quantity, syr_rgx) ~ "syrup", str_detect(drug_name, liq_rgx) | str_detect(quantity, liq_rgx) ~ "liquid", str_detect(drug_name, eli_rgx) | str_detect(quantity, eli_rgx) ~ "elixir", str_detect(drug_name, mix_rgx) | str_detect(quantity, mix_rgx) ~ "mixture" ) )
This is not an exhautive clean of the data. You could continue to clean this data by iteratively running the above chunk, then inspecting the remaining unique entries in drug_name for missing values in drug/ form/ dose as follows
px %>% filter(is.na(form)) %>% distinct(drug_name) %>% print(n = Inf)
Add further terms to the regular expressions above, and repeat until the data are sufficiently clean for your purposes.
|data|source| |---|---| | drug_pharmgkb | PharmGKB curated drug list (https://www.pharmgkb.org/downloads, drugs.zip) | | drug_gwas | ukb.gwas.medication.supplementary.data.1.xlsx - Genome-wide association study of medication-use and associated disease in the UK Biobank, Nature Communications | | drug_dmd_antidep | A curated list of dm+d antidepressant drug names from Fabbri et al. |
NHS prescription drugs - https://openprescribing.net/bnf/ - https://www.thedatalab.org/blog/161/prescribing-data-bnf-codes/ - https://digital.nhs.uk/services/terminology-and-classifications/read-codes - https://digital.nhs.uk/data-and-information/areas-of-interest/prescribing/practice-level-prescribing-in-england-a-summary
ATC codes - https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm - https://bioportal.bioontology.org/ontologies/ATC/?p=summary
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.