CREATE HPC

knitr::opts_chunk$set(
    eval = FALSE,
    collapse = TRUE,
    comment = "#>"
)

UKB primary care presciption data are included in the Primary Care Linked data, accessed through the UKB Data Portal. The data can be explored online with SQL queries, the results of which can be downloaded. Alternatively, the full table can be downloaded.

For each KCL UKB application with access to primary care prescription data, we have downloaded the full table and converted it to an R disk.frame for fast local queries with ukbkings bio_record and bio_record_map. (See the article Access UKB data on Rosalind/CREATE for a detailed description of the use of bio_record and bio_record_map to query all record-level data available to a particular KCL UKB application)

Query prescription data

bio_record returns a diskframe, an object which can be queried in chunks without reading the data into R.

library(ukbkings)
library(dplyr)
library(stringr)
library(readxl)
library(lubridate)

project_dir <- "<path_to_project_directory>"

diskf <- bio_record(project_dir, record = "gp_scripts")
head(diskf)

       eid data_provider issue_date read_2 bnf_code dmd_code                                    drug_name quantity
1: XXXXXXX             X XX/XX/XXXX  bi33. 02050501                           Lisinopril 10mg tablets   56.000
2: XXXXXXX             X XX/XX/XXXX  blb1. 02060200                            Amlodipine 5mg tablets   56.000
3: XXXXXXX             X XX/XX/XXXX  bxd5. 02120400                          Simvastatin 40mg tablets   56.000
4: XXXXXXX             X XX/XX/XXXX  l412. 12010151      Otosporin ear drops (GlaxoSmithKline UK Ltd)   10.000
5: XXXXXXX             X XX/XX/XXXX  bi33. 02050501                           Lisinopril 10mg tablets   28.000
6: XXXXXXX             X XX/XX/XXXX  blb1. 02060200                            Amlodipine 5mg tablets   28.000

You can use dplyr verbs to filter rows, select columns, etc., and finally collect to retrieve the resulting data into a local dataframe. Below is an example of using a combination of bio_record and dplyr verbs to retrieve antidepressant prescriptions.

The primary care prescription data include 3 prescription coding systems

| code | meaning | |---|---| | bnf_code | British National Formulary (BNF) is a reference book containing the standard list of medicines used in UK | | dmd_code | Dictionary of Medicines and Devices (dm+d), the catalogue of things that can be prescribed in the NHS | | read_2 | Read Codes (version 2 (v2)) are a coded thesaurus of clinical terms (used in the NHS since 1985). There is also a version 3 (CTV3 or v3) |

To retrieve prescriptions for a particular drug or drug class, search all three drug code systems and drug_name which includes name and dosage information combined. Below is an example retrieving antidepressant prescriptions.

Antidepressant prescriptions

Search by BNF code

bnf <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>%
    collect()

Search by Read 2 code

Here I'm using bio_code_primary_care to read the UKB-supplied primary care code lookup tables to find Read 2 code equivalents of BNF antidepressant codes.

read2_from_bnf <- bio_code_primary_care(project_dir, "read_v2_drugs_bnf") %>%
    filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>%
    pull(read_code)

# Make a regular expression for search filter
read2_rgx <- read2_from_bnf %>%
    str_replace_all("\\.", "\\\\.") %>%
    str_c("^", ., collapse = "|") %>%
    regex()

read2 <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(str_detect(read_2, read2_rgx)) %>%
    collect() %>%
    mutate(read_2 = str_replace(read_2, "00$", ""))

The mutate drops a trailing "00" on some read2 codes (a UKB artifact?).

Note. Fabbri et al. (2021) Genetic and clinical characteristics of treatment-resistant depression using primary care records in two UK cohorts use an augmented list of read 2 codes, which includes the additional codes: du6z. du61. gdez. gdey. gde2. gde1.

This list is in supplementary material table 1, available from the UKB resources folder on Rosalind/CREATE. (See the paper's code for treatment-resistant depression phenotype derivation, as well as antidepressant extraction)

# -curated antidepressants - PMID33753889
cf_path <- file.path(
    dirname(project_dir),
    "resources/pmid33753889_medrxiv_supplementary_table1.xlsx"
)

excel_sheets(cf_path)

[1] "readv2_mood_psych_sub" "readv3_mood_psych_sub" "ADs_read_2_codes"      "bnf_drug_names"        "dmd_drug_names"        "read2_drug_names"

Chiara's augmented list of read 2 codes is in the sheet "ADs_read_2_codes".

# CF Read 2 code list
cf_path <- file.path(
    dirname(project_dir),
    "resources/pmid33753889_medrxiv_supplementary_table1.xlsx"
)

read2_from_cf <- read_excel(cf_path, sheet = "ADs_read_2_codes") %>%
    pull(read2_code)

read2_rgx <- read2_from_cf %>%
    str_replace_all("\\.", "\\\\.") %>%
    str_c("^", ., collapse = "|") %>%
    regex()

read2 <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(str_detect(read_2, read2_rgx)) %>%
    collect() %>%
    mutate(read_2 = str_replace(read_2, "00$", ""))

Search by dm+d code and/or drug name

You could also search for known dm+d codes in column dmd_code, and/or known drug names in drug_name, to retrieve any prescriptions with missing BNF and Read 2 codes. For example, Chiara's list of curated antidepressant drug names is included in the ukbkings dataset drug_dmd_antidep.

drug_name_rgx <- str_c(drug_dmd_antidep$dmd_name) %>%
    str_c(collapse = "|") %>%
    regex(ignore_case = TRUE)

other <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(bnf_code == "" & read_2 == "") %>%
    filter(str_detect(drug_name, drug_name_rgx)) %>%
    collect()

The filter on empty strings (missing values) in bnf_code and read_2 code searches among prescriptions not already captured by a bnf or read 2 code.

Merge prescriptions

px <- bind_rows(bnf, read2, other) %>%
    distinct() %>%
    as_tibble()

Initial clean and format

Replace empty strings with NA, fix column types, and drop prescriptions without an associated date (if appropriate).

px <- px %>%
    mutate(
        across(.fns = ~ na_if(.x, "")),
        across(read_2:quantity, as.character),
        issue_date = as_date(issue_date, format = "%d/%m/%Y"),
    ) %>%
    filter(!is.na(issue_date))

The drug_name and quantity columns include a mix of incomplete information (e.g., drug, dosage, form (i.e., capsule, tablet, etc), count, company) and style (upper/lower case, integer/floating point, same form expressed in many ways (e.g., tab, tablet, tablets, tablet(s))). A random sample of 20 rows shows some the above

A tibble: 20 × 8

   eid data_provider issue_date read_2 bnf_code        dmd_code  drug_name                                        quantity             
 <int>         <int> <date>     <chr>  <chr>           <chr>     <chr>                                            <chr>

1 XXXXXXX X XXXX-XX-XX NA 2 XXXXXXX X XXXX-XX-XX NA 3 XXXXXXX X XXXX-XX-XX NA 4 XXXXXXX X XXXX-XX-XX da91. NA 5 XXXXXXX X XXXX-XX-XX da93. NA 6 XXXXXXX X XXXX-XX-XX NA 7 XXXXXXX X XXXX-XX-XX NA 8 XXXXXXX X XXXX-XX-XX NA 9 XXXXXXX X XXXX-XX-XX NA 10 XXXXXXX X XXXX-XX-XX NA 11 XXXXXXX X XXXX-XX-XX NA 12 XXXXXXX X XXXX-XX-XX NA 13 XXXXXXX X XXXX-XX-XX da41. 14 XXXXXXX X XXXX-XX-XX d712. NA 15 XXXXXXX X XXXX-XX-XX NA 16 XXXXXXX X XXXX-XX-XX daBz. NA 17 XXXXXXX X XXXX-XX-XX NA 18 XXXXXXX X XXXX-XX-XX NA 19 XXXXXXX X XXXX-XX-XX NA 20 XXXXXXX X XXXX-XX-XX d79z. NA

04.03.03.00.00 NA Citalopram 20mg tablets 28 tablets
04.03.04.00.00 NA Venlafaxine 75mg modified-release capsules 56 capsule(s) - 75 mg 0403010N0AAAEAE NA IMIPRAMINE HYDROCHLORIDE TABLETS 25MG 60.000
NA NA NA
NA NA NA
040301 NA Dosulepin Hydrochloride CAPS 25MG 168.000
04.03.03.00.00 NA Prozac 20mg capsules (Eli Lilly and Company Ltd) 1 OP
04.03.03.00.00 NA Fluoxetine 20mg capsules 60 capsule
04.03.03.00.00 NA Fluoxetine 20mg capsules 60 capsules
04.03.04.00.00 NA Venlafaxine 75mg modified-release capsules 84 capsule(s)
04.03.04.00.00 NA Mirtazapine 45mg tablets 28 tablet
04.03.03.00.00 NA Sertraline 100mg tablets 7 tablet
04030300 NA Fluoxetine 20mg capsules 30.000
321746008 AMITRIPTYLINE HCl tabs 25mg 56.000
04.03.01.00.00 NA Amitriptyline 25mg tablets 168 tablet(s) - 25 mg NA NA NA
04.03.01.00.00 NA Amitriptyline 10mg tablets 28 tablets
04.03.01.00.00 NA Clomipramine 50mg capsules 168 capsules - 50 mg 04.03.03.00.00 NA Sertraline 100mg tablets 14 tablets
NA NA NA

In this subset of prescriptions, there are 1,655 and 4,830 unique entries for drug_name and quantity respectively.

px %>%
    distinct(drug_name) %>%
    nrow()
px %>%
    distinct(quantity) %>%
    nrow()

[1] 1655
[1] 4830

A first pass tidy up of these two columns significantly reduces the number of unique entries.

tidy_rgx <- "[\\*\\-\\&\\'\\,]|\\(.*[\\)\\.{3}]|\\.{3}|\\.000$|\\[.*\\]|[\\[\\]]"
# \\/ part of dosage

px <- px %>%
    mutate(
        across(c(drug_name, quantity), ~ {
            tolower(.x) %>%
                str_replace_all(tidy_rgx, " ") %>%
                str_trim() %>%
                str_squish()
        })
    )

px %>%
    distinct(drug_name) %>%
    nrow()
px %>%
    distinct(quantity) %>%
    nrow()

[1] 1116
[1] 3568

Additional tidying

This data will require further tidying, some of which will be specific to the subset of prescriptions you collect. In the following, I've created a series of regular expressions to capture alternatives for the same descriptor, then added a new columns for drug, dose, and form.

tab_rgx <- c("tablets", "tablet", "tabs", "tab", "soltab") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

cap_rgx <- c("capsules", "capsule", "caps", "cap", "ca", "cps", "capsules\\d+mg") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

sol_rgx <- c("solution", "soln") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

drp_rgx <- c("drop", "drops") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

sus_rgx <- c("suspension", "susp", "sus") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

syr_rgx <- c("syrup", "syrp", "syr") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

liq_rgx <- c("liquid", "liq") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

eli_rgx <- c("elixir", "eli") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

mix_rgx <- c("mixture", "mix") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

px <- px %>%
    mutate(
        drug = word(drug_name, 1),
        dose = str_extract(drug_name, "\\d*\\ *mg|\\d*\\ *microgram") %>% str_trim(),
        form = case_when(
            str_detect(drug_name, tab_rgx) |
                str_detect(quantity, tab_rgx) ~ "tablet",
            str_detect(drug_name, cap_rgx) |
                str_detect(quantity, cap_rgx) ~ "capsule",
            str_detect(drug_name, sol_rgx) |
                str_detect(quantity, sol_rgx) ~ "solution",
            str_detect(drug_name, drp_rgx) |
                str_detect(quantity, drp_rgx) ~ "drops",
            str_detect(drug_name, sus_rgx) |
                str_detect(quantity, sus_rgx) ~ "suspension",
            str_detect(drug_name, syr_rgx) |
                str_detect(quantity, syr_rgx) ~ "syrup",
            str_detect(drug_name, liq_rgx) |
                str_detect(quantity, liq_rgx) ~ "liquid",
            str_detect(drug_name, eli_rgx) |
                str_detect(quantity, eli_rgx) ~ "elixir",
            str_detect(drug_name, mix_rgx) |
                str_detect(quantity, mix_rgx) ~ "mixture"
        )
    )

This is not an exhautive clean of the data. You could continue to clean this data by iteratively running the above chunk, then inspecting the remaining unique entries in drug_name for missing values in drug/ form/ dose as follows

px %>%
    filter(is.na(form)) %>%
    distinct(drug_name) %>%
    print(n = Inf)

Add further terms to the regular expressions above, and repeat until the data are sufficiently clean for your purposes.

3. Other medication reference data

|data|source| |---|---| | drug_pharmgkb | PharmGKB curated drug list (https://www.pharmgkb.org/downloads, drugs.zip) | | drug_gwas | ukb.gwas.medication.supplementary.data.1.xlsx - Genome-wide association study of medication-use and associated disease in the UK Biobank, Nature Communications | | drug_dmd_antidep | A curated list of dm+d antidepressant drug names from Fabbri et al. |

4. Resources

NHS prescription drugs - https://openprescribing.net/bnf/ - https://www.thedatalab.org/blog/161/prescribing-data-bnf-codes/ - https://digital.nhs.uk/services/terminology-and-classifications/read-codes - https://digital.nhs.uk/data-and-information/areas-of-interest/prescribing/practice-level-prescribing-in-england-a-summary

ATC codes - https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm - https://bioportal.bioontology.org/ontologies/ATC/?p=summary

kenhanscombe/ukbkings documentation built on April 28, 2023, 12:47 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com