read_ukb: Read a UK Biobank main dataset file

Description

Reads a UK Biobank main dataset file into R using either fread or read_dta. Optionally renames variables with descriptive names, adds variable labels, and labels coded values of type character as factors.

Usage

read_ukb(
  path,
  delim = "auto",
  data_dict = NULL,
  ukb_data_dict = get_ukb_data_dict(),
  ukb_codings = get_ukb_codings(),
  descriptive_colnames = TRUE,
  label = TRUE,
  max_n_labels = 30,
  na.strings = c("", "NA"),
  nrows = Inf,
  ...
)

Arguments

path

The path to a UK Biobank main dataset file.

delim

Delimiter for the UKB main dataset file. Default is "auto" (see data.table::fread()). Ignored if the file name ends with .dta (i.e. is a STATA file).

data_dict

A data dictionary specific to the UKB main dataset file, generated by make_data_dict. To load only a selection of columns, supply a filtered copy of this data dictionary containing only the required variables. If NULL (default) then all fields will be read.

ukb_data_dict

The UKB data dictionary (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.

ukb_codings

The UKB codings file (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.

descriptive_colnames

If TRUE, rename columns with longer descriptive names derived from the UK Biobank's data dictionary 'Field' column.

label

If TRUE, apply variable labels and label coded values as factors.

max_n_labels

Coded variables whose number of associated value labels is less than or equal to this threshold will be labelled as factors. If NULL, all value labels will be applied. Default value is 30.

na.strings

A character vector of strings to be interpreted as NA values. By default, ",," is read as NA for columns of all types, including type character, for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. Strings listed in na.strings should not appear quoted in the file, since quoting is how the string literal ,"NA", is distinguished from ,NA, (for example, when na.strings="NA").

nrows

The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by fread almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.

...

Additional parameters, passed on to either fread or read_dta.
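
The dry-run behaviour described for nrows can be tried directly with data.table::fread(), which the Description names as one of the underlying readers. The following is a minimal sketch, assuming the data.table package is installed. The inline tab-delimited text and its column names are made up for illustration and are not real UK Biobank data.

```r
library(data.table)

# Hypothetical two-row, tab-delimited extract (illustrative values only)
txt <- "eid\t31-0.0\t21001-0.0\n1\t0\t24.5\n2\t1\t27.1\n"

# nrows = 0 returns the header and empty columns without reading data rows
dry_run <- fread(text = txt, nrows = 0)

nrow(dry_run)   # number of data rows read
names(dry_run)  # column names taken from the header line
```

Since nrows is listed as an argument of read_ukb(), the same value can presumably be supplied there to inspect a large main dataset file before a full read. That has not been verified here.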

Details

Note that na.strings is not recognised by read_dta. Reading in a STATA file may therefore require careful checking for empty strings that need converting to NA.
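
One way to do the checking mentioned above is to convert empty strings to NA after the file has been read. The following is a minimal base R sketch. The data frame df stands in for a dataset read from a STATA file, and its columns and values are hypothetical.

```r
# Stand-in for a data frame read from a .dta file
# (hypothetical columns and values, for illustration only)
df <- data.frame(
  eid = c(1, 2, 3),
  sex = c("Female", "", "Male"),
  stringsAsFactors = FALSE
)

# Identify character columns; other column types are left untouched
is_chr <- vapply(df, is.character, logical(1))

# Replace empty strings with NA in character columns only
df[is_chr] <- lapply(df[is_chr], function(x) {
  x[x %in% ""] <- NA_character_
  x
})

# Count missing values per column to confirm the conversion
colSums(is.na(df))
```

Using `%in%` rather than `==` keeps the replacement safe when a column already contains NA values.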

Value

A UK Biobank phenotype dataset as a data table with human-readable variable labels and data values.

Examples

library(magrittr)
# get dummy UKB data dictionary and codings
dummy_ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv")
dummy_ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")

# file path to dummy UKB main dataset
dummy_ukb_main_path <- get_ukb_dummy("dummy_ukb_main.tsv", path_only = TRUE)

# read dummy UKB main dataset into R
read_ukb(
  path = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
) %>%
  # (convert to tibble for concise print method)
  tibble::as_tibble()

# to read only a subset of variables, create a data dictionary and filter
# for selected variables, then supply to `read_ukb()`
data_dict_selected <- make_data_dict(
  ukb_main = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict
) %>%
  dplyr::filter(FieldID %in% c("eid", "31", "34", "21001"))

read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
)

# set `descriptive_colnames` and `label` to FALSE to read the raw dataset as is
read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings,
  descriptive_colnames = FALSE,
  label = FALSE
)

rmgpanw/ukbwranglr documentation built on April 30, 2024, 7:47 a.m.