knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(ukbwranglr) library(dplyr) library(tidyr) library(tidyselect) library(readr) library(tibble)
The basic workflow is as follows:
make_data_dict()
.read_ukb()
.summarise_numerical_variables()
.tidy_clinical_events()
or make_clinical_events_db()
, and extract outcomes of interest with extract_phenotypes()
.These steps are now illustrated with a dummy dataset included with ukbwranglr:
# path to dummy data ukb_main_path <- get_ukb_dummy("dummy_ukb_main.tsv", path_only = TRUE) # raw data read_tsv(ukb_main_path)
You can try out the steps either locally by installing ukbwranglr on your own machine, or online by clicking on the following link to RStudio Cloud[^1] and navigating to this Rmd file in the 'vignettes' directory:
[^1]: You will be asked to sign up for a free account if you do not have one already.
First download a copy of the UK Biobank data dictionary and codings files from the UK Biobank data showcase website. Dummy versions are used here:
# get required metadata ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv") ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")
Then create a data dictionary using make_data_dict()
:
# make data dictionary data_dict <- make_data_dict(ukb_main_path, ukb_data_dict = ukb_data_dict) data_dict
Read a main UK Biobank dataset into R using read_ukb()
:
read_ukb(path = ukb_main_path, ukb_data_dict = ukb_data_dict, ukb_codings = ukb_codings) %>% # (convert to tibble for concise print method) as_tibble()
A UK Biobank main dataset file is typically too large to fit into memory on a personal computer. Often however, only a subset of the data is required. To read a selection of variables, filter the data dictionary created with make_data_dict()
for a subset of variables and supply this to read_ukb()
:[^2]
[^2]: It may be preferable at this stage to write the data dictionary to a csv file (using readr::write_csv()
), manually filter in a text editor like Microsoft Excel, then reload into R (using readr::read_csv()
).
# Filter data dictionary for sex, year of birth, body mass index, systolic blood pressure, self-reported non-cancer illness and summary hospital diagnoses (ICD10) fields data_dict_selected <- data_dict %>% filter(FieldID %in% c( # Participant ID "eid", # Sex "31", # Year of birth "34", # Body mass index "21001", # Systolic blood pressure "4080", # Self-reported non-cancer medical conditions "20002", "20008", # Summary hospital diagnoses (ICD10) "41270", "41280" )) # Read selected variables into R ukb_main <- read_ukb(path = ukb_main_path, data_dict = data_dict_selected, ukb_data_dict = ukb_data_dict, ukb_codings = ukb_codings) # Convert to tibble for concise print method as_tibble(ukb_main)
Some variables such as body mass index and systolic blood pressure will have been measured on more than one occasion. In these cases it may be desirable to calculate a summary value (e.g. mean). Use summarise_numerical_variables()
:
# calculate the mean value across all repeated continuous variable measurements ukb_main_numerical_vars_summarised <- summarise_numerical_variables( ukb_main = ukb_main, ukb_data_dict = ukb_data_dict, summary_function = "mean", .drop = TRUE ) %>% # reorder variables select(eid, sex_f31_0_0, year_of_birth_f34_0_0, mean_body_mass_index_bmi_x21001, mean_systolic_blood_pressure_automated_reading_x4080) as_tibble(ukb_main_numerical_vars_summarised)
Use tidy_clinical_events()
to reshape clinical events fields (such as self-reported non-cancer medical conditions) to long format:
# tidy clinical events clinical_events <- tidy_clinical_events( ukb_main = ukb_main, ukb_data_dict = ukb_data_dict, ukb_codings = ukb_codings, clinical_events_sources = c("self_report_non_cancer", "summary_hes_icd10") ) # returns a named list of data frames clinical_events # combine with dplyr clinical_events <- dplyr::bind_rows(clinical_events) clinical_events
To identify participants with a condition of interest, first decide which codes will capture this. For example, the following includes a (non-exhaustive) list of clinical codes for diabetes:[^3]
[^3]: The codemapper package includes functions to facilitate building clinical code lists.
example_clinical_codes()
Supply this to extract_phenotypes()
to filter for participants who have any matching clinical codes in their records. By default, only the earliest date is extracted:
# extract phenotypes diabetes_cases <- extract_phenotypes(clinical_events = clinical_events, clinical_codes = example_clinical_codes()) diabetes_cases
Merge the output from steps 4 and 5:
ukb_main_processed <- # first summarise `diabetes_cases` - for each eid, get the earliest date diabetes_cases %>% group_by(eid, disease) %>% summarise(diabetes_min_date = min(date, na.rm = TRUE)) %>% ungroup() %>% # create indicator column for diabetes mutate(diabetes_indicator = case_when(!is.na(diabetes_min_date) ~ "Yes", TRUE ~ "No")) %>% # join with `ukb_main_numerical_vars_summarised` dplyr::full_join(ukb_main_numerical_vars_summarised, by = "eid") ukb_main_processed
Describe:
ukb_main_processed %>% select(-eid) %>% group_by(diabetes_indicator) %>% summarise(pct_female = sum(sex_f31_0_0 == "Female", na.rm = TRUE) / n(), across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
The following setup is recommended to reduce duplicated steps between multiple projects that draw on the same datasets:
Download the UK Biobank data dictionary and codings files (available from the UK Biobank data showcase) and for each new R project, place the following .Renviron
file in the project root directory (replacing PATH/TO
with the correct file paths):
UKB_DATA_DICT=/PATH/TO/Data_Dictionary_Showcase.tsv UKB_CODINGS=/PATH/TO/Codings.tsv
Functions with arguments ukb_data_dict
and ukb_codings
use get_ukb_data_dict()
and get_ukb_codings()
by default, which will automatically search for environmental variables UKB_DATA_DICT
and UKB_CODINGS
and read the files from these locations.
Create a clinical events database using make_clinical_events_db()
. This function includes the option to incorporate primary care data.[^4] Having connected to the database, phenotypes may be extracted with extract_phenotypes()
:
[^4]: It will take \\~1 hour to run if including both the primary care clinical event records and prescription records datasets.
# build dummy clinical events SQLite DB in tempdir ukb_db_path <- tempfile(fileext = ".db") make_clinical_events_db(ukb_main_path = ukb_main_path, ukb_db_path = ukb_db_path, gp_clinical_path = get_ukb_dummy("dummy_gp_clinical.txt", path_only = TRUE), gp_scripts_path = get_ukb_dummy("dummy_gp_scripts.txt", path_only = TRUE), ukb_data_dict = ukb_data_dict, ukb_codings = ukb_codings)
# Connect to the database con <- DBI::dbConnect(RSQLite::SQLite(), ukb_db_path) # Convert to a named list of dbplyr::tbl_dbi objects ukbdb <- ukbwranglr::db_tables_to_list(con) # Value columns (from `gp_clinical.txt`) and prescription names/quantities (from `gp_scripts.txt`) are stored separately from the main `clinical_events` table ukbdb
# extract phenotypes diabetes_cases <- extract_phenotypes(clinical_events = ukbdb$clinical_events, clinical_codes = example_clinical_codes(), verbose = FALSE) diabetes_cases
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.