In rmgpanw/ukbwranglr: Exploring UKB Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(ukbwranglr)
library(dplyr)
library(tidyr)
library(tidyselect)
library(readr)
library(tibble)

Overview

The basic workflow is as follows:

Create a data dictionary for your main UK Biobank dataset with make_data_dict().
Read selected variables into R with read_ukb().
Summarise continuous variables with summarise_numerical_variables().
Tidy clinical events data with tidy_clinical_events() or make_clinical_events_db(), and extract outcomes of interest with extract_phenotypes().
Analyse.

These steps are now illustrated with a dummy dataset included with ukbwranglr:

# path to dummy data
ukb_main_path <- get_ukb_dummy("dummy_ukb_main.tsv",
                                     path_only = TRUE)

# raw data
read_tsv(ukb_main_path)

You can try out the steps either locally by installing ukbwranglr on your own machine, or online by clicking on the following link to RStudio Cloud[^1] and navigating to this Rmd file in the 'vignettes' directory:

[^1]: You will be asked to sign up for a free account if you do not have one already.

Basic workflow

1. Create data dictionary

First download a copy of the UK Biobank data dictionary and codings files from the UK Biobank data showcase website. Dummy versions are used here:

# get required metadata
ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv")
ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")

Then create a data dictionary using make_data_dict():

# make data dictionary
data_dict <- make_data_dict(ukb_main_path, 
                            ukb_data_dict = ukb_data_dict)

data_dict

2. Read selected variables into R

Read a main UK Biobank dataset into R using read_ukb():

read_ukb(path = ukb_main_path,
         ukb_data_dict = ukb_data_dict,
         ukb_codings = ukb_codings) %>% 
  # (convert to tibble for concise print method)
  as_tibble()

A UK Biobank main dataset file is typically too large to fit into memory on a personal computer. Often however, only a subset of the data is required. To read a selection of variables, filter the data dictionary created with make_data_dict() for a subset of variables and supply this to read_ukb():[^2]

[^2]: It may be preferable at this stage to write the data dictionary to a csv file (using readr::write_csv()), manually filter in a text editor like Microsoft Excel, then reload into R (using readr::read_csv()).

# Filter data dictionary for sex, year of birth, body mass index, systolic blood pressure, self-reported non-cancer illness and summary hospital diagnoses (ICD10) fields
data_dict_selected <- data_dict %>%
  filter(FieldID %in% c(
    # Participant ID
    "eid",

    # Sex
    "31",

    # Year of birth
    "34",

    # Body mass index
    "21001",

    # Systolic blood pressure
    "4080",

    # Self-reported non-cancer medical conditions
    "20002",
    "20008",

    # Summary hospital diagnoses (ICD10)
    "41270",
    "41280"
  ))

# Read selected variables into R
ukb_main <- read_ukb(path = ukb_main_path,
         data_dict = data_dict_selected,
         ukb_data_dict = ukb_data_dict,
         ukb_codings = ukb_codings)

# Convert to tibble for concise print method
as_tibble(ukb_main)

3. Summarise continuous variables

Some variables such as body mass index and systolic blood pressure will have been measured on more than one occasion. In these cases it may be desirable to calculate a summary value (e.g. mean). Use summarise_numerical_variables():

# calculate the mean value across all repeated continuous variable measurements
ukb_main_numerical_vars_summarised <- summarise_numerical_variables(
  ukb_main = ukb_main,
  ukb_data_dict = ukb_data_dict,
  summary_function = "mean", 
  .drop = TRUE
) %>% 
  # reorder variables
  select(eid,
         sex_f31_0_0,
         year_of_birth_f34_0_0,
         mean_body_mass_index_bmi_x21001,
         mean_systolic_blood_pressure_automated_reading_x4080)

as_tibble(ukb_main_numerical_vars_summarised)

4. Tidy clinical events data and extract outcomes of interest

Use tidy_clinical_events() to reshape clinical events fields (such as self-reported non-cancer medical conditions) to long format:

# tidy clinical events
clinical_events <- tidy_clinical_events(
  ukb_main = ukb_main, 
  ukb_data_dict = ukb_data_dict, 
  ukb_codings = ukb_codings,
  clinical_events_sources = c("self_report_non_cancer",
                              "summary_hes_icd10")
)

# returns a named list of data frames
clinical_events

# combine with dplyr
clinical_events <- dplyr::bind_rows(clinical_events)

clinical_events

To identify participants with a condition of interest, first decide which codes will capture this. For example, the following includes a (non-exhaustive) list of clinical codes for diabetes:[^3]

[^3]: The codemapper package includes functions to facilitate building clinical code lists.

example_clinical_codes()

Supply this to extract_phenotypes() to filter for participants who have any matching clinical codes in their records. By default, only the earliest date is extracted:

# extract phenotypes
diabetes_cases <- extract_phenotypes(clinical_events = clinical_events,
                                     clinical_codes = example_clinical_codes())

diabetes_cases

5. Analyse

Merge the output from steps 4 and 5:

ukb_main_processed <-
  # first summarise `diabetes_cases` - for each eid, get the earliest date
  diabetes_cases %>%
  group_by(eid,
           disease) %>%
  summarise(diabetes_min_date = min(date, na.rm = TRUE)) %>%
  ungroup() %>%

  # create indicator column for diabetes
  mutate(diabetes_indicator = case_when(!is.na(diabetes_min_date) ~ "Yes",
                                        TRUE ~ "No")) %>%

  # join with `ukb_main_numerical_vars_summarised`
  dplyr::full_join(ukb_main_numerical_vars_summarised,
                   by = "eid")

ukb_main_processed

Describe:

ukb_main_processed %>%
  select(-eid) %>%
  group_by(diabetes_indicator) %>%
  summarise(pct_female = sum(sex_f31_0_0 == "Female", na.rm = TRUE) / n(),
            across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

Setup for multiple projects

The following setup is recommended to reduce duplicated steps between multiple projects that draw on the same datasets:

Download the UK Biobank data dictionary and codings files (available from the UK Biobank data showcase) and for each new R project, place the following .Renviron file in the project root directory (replacing PATH/TO with the correct file paths):
```
UKB_DATA_DICT=/PATH/TO/Data_Dictionary_Showcase.tsv

UKB_CODINGS=/PATH/TO/Codings.tsv
```
Functions with arguments ukb_data_dict and ukb_codings use get_ukb_data_dict() and get_ukb_codings() by default, which will automatically search for environmental variables UKB_DATA_DICT and UKB_CODINGS and read the files from these locations.
Create a clinical events database using make_clinical_events_db(). This function includes the option to incorporate primary care data.[^4] Having connected to the database, phenotypes may be extracted with extract_phenotypes():

[^4]: It will take \\~1 hour to run if including both the primary care clinical event records and prescription records datasets.

# build dummy clinical events SQLite DB in tempdir
ukb_db_path <- tempfile(fileext = ".db")

make_clinical_events_db(ukb_main_path = ukb_main_path,
                        ukb_db_path = ukb_db_path, 
                        gp_clinical_path = get_ukb_dummy("dummy_gp_clinical.txt",
                                                         path_only = TRUE),
                        gp_scripts_path = get_ukb_dummy("dummy_gp_scripts.txt",
                                                         path_only = TRUE), 
                        ukb_data_dict = ukb_data_dict,
                        ukb_codings = ukb_codings)

# Connect to the database
con <- DBI::dbConnect(RSQLite::SQLite(), 
                      ukb_db_path)

# Convert to a named list of dbplyr::tbl_dbi objects
ukbdb <- ukbwranglr::db_tables_to_list(con)

# Value columns (from `gp_clinical.txt`) and prescription names/quantities (from `gp_scripts.txt`) are stored separately from the main `clinical_events` table
ukbdb

# extract phenotypes
diabetes_cases <-
  extract_phenotypes(clinical_events = ukbdb$clinical_events,
                     clinical_codes = example_clinical_codes(),
                     verbose = FALSE)

diabetes_cases