read_ukb: Read a UK Biobank main dataset file
In rmgpanw/ukbwranglr: Exploring UKB Data

Description

Reads a UK Biobank main dataset file into R using either fread or read_dta. Optionally renames variables with descriptive names, adds variable labels, and labels coded values of type character as factors.

Usage

read_ukb(
  path,
  delim = "auto",
  data_dict = NULL,
  ukb_data_dict = get_ukb_data_dict(),
  ukb_codings = get_ukb_codings(),
  descriptive_colnames = TRUE,
  label = TRUE,
  max_n_labels = 30,
  na.strings = c("", "NA"),
  nrows = Inf,
  ...
)

Arguments

`path`	The path to a UK Biobank main dataset file.
`delim`	Delimiter for the UKB main dataset file. Default is "auto" (see `data.table::fread()`). Ignored if the file name ends with `.dta` (i.e. is a Stata file).
`data_dict`	A data dictionary specific to the UKB main dataset file, generated by `make_data_dict`. To load only a selection of columns, supply a filtered copy of this data dictionary containing only the required variables. If `NULL` (default) then all fields will be read.
`ukb_data_dict`	The UKB data dictionary (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type `character`.
`ukb_codings`	The UKB codings file (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type `character`.
`descriptive_colnames`	If `TRUE`, rename columns with longer descriptive names derived from the UK Biobank's data dictionary 'Field' column.
`label`	If `TRUE`, apply variable labels and label coded values as factors.
`max_n_labels`	Coded variables whose number of associated value labels is less than or equal to this threshold will be labelled as factors. If `NULL`, all value labels will be applied regardless of how many there are. Default value is 30.
`na.strings`	A character vector of strings to be interpreted as `NA` values. By default, `,,` is read as `NA` for columns of all types, including type `character`, for consistency. `,"",` is unambiguous and read as an empty string. To read `,NA,` as `NA`, set `na.strings="NA"`. To read `,,` as a blank string `""`, set `na.strings=NULL`. Strings listed in `na.strings` should not appear quoted in the file, since quoting is how the string literal `,"NA",` is distinguished from `,NA,` (for example, when `na.strings="NA"`).
`nrows`	The maximum number of rows to read. Unlike `read.table`, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by `fread` almost instantly using the large sample of lines. `nrows=0` returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
`...`	Additional arguments passed on to either `fread` or `read_dta`.
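
The `na.strings` and `nrows` descriptions above follow `data.table::fread()`. As a minimal sketch, assuming these arguments are forwarded to `fread` unchanged, the snippet below calls `fread` directly on a tiny made-up tab-separated table. The column names `sex` and `bmi` and all values are invented for illustration and are not real UKB column names.

```r
library(data.table)

# Made-up, tab-separated stand-in for a UKB main dataset file
toy_tsv <- "eid\tsex\tbmi\n1\tFemale\t24.5\n2\t\t31.2\n3\tNA\t\n"

# With na.strings = c("", "NA") (the read_ukb() default), both empty
# fields and unquoted NA are read as missing values
full <- fread(text = toy_tsv, sep = "\t", na.strings = c("", "NA"))
print(full)

# nrows = 0 is a dry run: only column names and guessed types are returned
dry <- fread(text = toy_tsv, sep = "\t", nrows = 0)
print(names(dry))
```

A dry run like this can be a cheap way to check the column names of a large file before committing to a full read.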

Details

Note that na.strings is not recognised by read_dta. After reading a Stata file, character columns may therefore need checking for empty strings that should be converted to NA.
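
One way to do this check after reading a Stata file is sketched below in base R. The helper `empty_to_na()` is hypothetical and not part of ukbwranglr, and the example data frame is made up. It is written for a plain data frame; for a large data.table, an in-place approach such as `data.table::set()` would likely be preferable.

```r
# Hypothetical helper (not part of ukbwranglr): replace empty strings
# with NA in every character column, leaving other columns untouched
empty_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  for (nm in names(df)[is_chr]) {
    x <- df[[nm]]
    x[!is.na(x) & x == ""] <- NA_character_
    df[[nm]] <- x
  }
  df
}

# Made-up data imitating a Stata import with an empty string
stata_like <- data.frame(
  eid = 1:3,
  free_text = c("a", "", NA),
  stringsAsFactors = FALSE
)

cleaned <- empty_to_na(stata_like)
print(cleaned)
```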

Value

A UK Biobank phenotype dataset as a data table with human-readable variable labels and data values.
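
To illustrate what human-readable data values mean in combination with `max_n_labels`, here is a minimal base R sketch of the threshold rule as described in the Arguments section. The helper `label_if_few()` is hypothetical and is not how ukbwranglr is necessarily implemented. The codes and meanings shown are those of the UKB sex field (0 = Female, 1 = Male).

```r
# Hypothetical helper illustrating the threshold rule; not ukbwranglr's code
label_if_few <- function(x, codes, meanings, max_n_labels = 30) {
  # More value labels than the threshold: leave coded values untouched
  if (!is.null(max_n_labels) && length(codes) > max_n_labels) {
    return(x)
  }
  # Otherwise convert the coded values to a labelled factor
  factor(x, levels = codes, labels = meanings)
}

sex_raw <- c("0", "1", "1", NA)

# Two value labels is below the default threshold of 30, so a factor results
sex <- label_if_few(sex_raw, codes = c("0", "1"), meanings = c("Female", "Male"))
print(sex)

# With a threshold of 1 the same variable is returned unlabelled
unchanged <- label_if_few(sex_raw, c("0", "1"), c("Female", "Male"),
                          max_n_labels = 1)
print(unchanged)
```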

Examples

library(magrittr)
# get dummy UKB data dictionary and codings
dummy_ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv")
dummy_ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")

# file path to dummy UKB main dataset
dummy_ukb_main_path <- get_ukb_dummy("dummy_ukb_main.tsv", path_only = TRUE)

# read dummy UKB main dataset into R
read_ukb(
  path = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
) %>%
  # (convert to tibble for concise print method)
  tibble::as_tibble()

# to read only a subset of variables, create a data dictionary and filter
# for selected variables, then supply to `read_ukb()`
data_dict_selected <- make_data_dict(
  ukb_main = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict
) %>%
  dplyr::filter(FieldID %in% c("eid", "31", "34", "21001"))

read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
)

# set `descriptive_colnames` and `label` to FALSE to read the raw dataset as is
read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings,
  descriptive_colnames = FALSE,
  label = FALSE
)

rmgpanw/ukbwranglr documentation built on April 30, 2024, 7:47 a.m.