knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.width = 7, 
  fig.height = 7, 
  fig.align = "center")
library(tidyverse)
library(ggforestplot)

Overview

In this tutorial we will go through the basic steps of importing blood metabolomics data into R, joining them with phenotypes or endpoint data, and performing basic epidemiological analysis. The tutorial uses simulated demo datasets from Nightingale's NMR platform.

Prerequisites

This tutorial uses R, which is free, compiles and runs in most operating systems. If you don't have it already installed in your system, you may install it by following the instructions here. Additionally, we recommend using the RStudio IDE (Integrated Development Environment), which can be downloaded and installed following the instructions here for the free "RStudio Desktop (Open Source License)".

You should be able to reproduce easily the code in this tutorial, even with very little knowledge in programming. If you have never programmed in R before and are intrested to learn more, you may find the following book useful, Hands on Programming with R by Garrett Grolemund.

Throughout the tutorial, we will be using the collection of tidyverse R packages and try to abide by their philosophy. You can install the complete tidyverse by typing the following line of code in the R console:

# Install the tidyverse packages
install.packages("tidyverse")

# Attach the tidyverse packages into the current session
library(tidyverse)

We also note the usage of the pipe %>% symbol throughout the code of this tutorial. The pipe is a way to write a series of operations on an R object, e.g. a dataset, in an easy-to-read way. As an example, the operation x %>% f(y) effectivly means f(x, y). You can read more on the pipe here.

Overview of Nightingale blood biomarker data

The NMR biomarker concentrations are provided in xlsx, csv and tsv formats. R can read either but in this tutorial we will use the csv format.

For the purposes of this tutorial we will use an example csv file. Download the zipped folder here and unzip it. It contains two files, one with metabolomics data and one with corresponding clinical data that we will use later on.

Type the commands below in your R session to load the metabolomics data and view the available biomarkers. Mind to put the correct path in the file parameter below.

# Read the biomarker concentration file
df_nmr_results <- readr::read_csv(
  # Enter the correct location for your file below
  file = "/path/to/file/12345-Results.csv",
  # Set not only NA but TAG string as <NA> 
  na = c("NA", "TAG"), 
  col_types = cols(.default = col_double(), 
                   Sample_id = col_character())
)
# Read the biomarker concentration file
df_nmr_results <- readr::read_csv(
  file = "data/12345-Results.csv",
  # Set not only NA but TAG string as <NA> 
  na = c("NA", "TAG"), 
  col_types = cols(.default = col_double(), 
                   Sample_id = col_character())
)

Print the names of all the variables in the dataset for inspection.

names(df_nmr_results)

The first column, Sample_id, contains the sample identifiers. Columns 2 to 19 contain information on different tags that a sample may contain. Explanation on any tags present in an actual result file is given along with the analysis report but here all possible tag columns are provided for demonstarational purposes. The rest of the columns correspond to machine readable abbreviations of the NMR-quantified blood biomarkers.

Let us drop the tag columns as they are not used in this demo.

df_nmr_results <- 
  df_nmr_results %>%
  dplyr::select(
    .data$Sample_id,
    tidyselect::any_of(
      ggforestplot::df_NG_biomarker_metadata$machine_readable_name
    )
  )

Rename alternative biomarker names

If you have a Nightingale result file with alternative biomarker names, for example XXL-VLDL-C_% instead of XXL_VLDL_C_pct or LA/FA instead of LA_pct, you may utilize the variable alternative_ids in the ggforestplot::df_NG_biomarker_metadata dataset to convert from one type to the other. An example is provided below, where we transform a dataset with alternative biomarker name (as long as they are found in the ggforestplot::df_NG_biomarker_metadata$alternative_ids list) to their ggforestplot::df_NG_biomarker_metadata$machine_readable_name.

# Assume that your data frame, containing alternative biomarker names, is called 
# df_nmr_results_alt_names

alt_names <- 
  names(df_nmr_results)

new_names <- 
  alt_names %>% 
  purrr::map_chr(function(id) {
    # Look through the alternative_ids
    hits <-
      purrr::map_lgl(
        df_NG_biomarker_metadata$alternative_names,
        ~ id %in% .
      )

    # If one unambiguous hit, return it.
    if (sum(hits) == 1L) {
      return(df_NG_biomarker_metadata$machine_readable_name[hits])
      # If not found, give a warning and pass through the input.
    } else {
      warning("Biomarker not found: ", id, call. = FALSE)
      return(id)
    } 
  })

# Name the vector with the new names  
names(alt_names) <- new_names

# Rename your result data frame with machine_readable_names 
df_nmr_results_alt_names <- 
  df_nmr_results %>% 
  rename(!!alt_names)

Association analysis

Join the blood biomarker data with other variables

Next we join the blood biomarker data above with other variables we may have, such as patient basic information (e.g. gender, age, e.t.c.) or patient outcomes (e.g. a disease event).

The second dataset you downloaded earlier (from here), called clinical_data.csv, has additional to the NMR information corresponding to the same fictional individuals. It has information on gender, age, body mass index (BMI) as well as incident T2D diabetes.

We will join the two datasets using the IDs, i.e. column Sample_id in the NMR data and column identifier in the clinical data. Except for the columns having different names in the two files, the IDs themselves are not the same either (which is also often the case in real life). Sample_id in the NMR is a character column with the prefix ResearchInstitutionProjectLabId_ while identifier in clinical data is a character consisting of 4 numbers.

Below we read the clinical data and join the two datasets after a few adjustments.

# Read the clinical_data.csv data file
df_clinical_data <- readr::read_csv(
  # Enter the correct location for your file below
  file = "/path/to/file/clinical_data.csv") %>%
  # Rename the identifier column to Sample_id as in the NMR result file
  rename(Sample_id = identifier) %>% 
  # Harmonize the id entries in clinical data with the ids in the NMR data 
  mutate(Sample_id = paste0(
    "ResearchInstitutionProjectLabId_", 
    as.numeric(Sample_id)
  ))
# Read the clinical_data.csv data file
df_clinical_data <- readr::read_csv(
  file = "data/clinical_data.csv"
) %>%
  # Rename the identifier column to Sample_id as in the NMR result file
  rename(Sample_id = identifier) %>% 
  # Harmonize the id entries in clinical data with the ids in the NMR data 
  mutate(Sample_id = paste0(
    "ResearchInstitutionProjectLabId_", 
    as.numeric(Sample_id)
  ))
# Inspect the first 10 entries
print(df_clinical_data)

# Join NMR result file with the clinical data using column "Sample_id"
df_full_data <- dplyr::left_join(
  x = df_nmr_results,
  y = df_clinical_data,
  by = "Sample_id"
)

Linear regression

We will first plot the distribution for a specific biomarker to inspect the data more closely. We choose glycoprotein acetyls (abbrev. GlycA) which reflects the amount of N-acetyl groups in circulating glycoproteins, many of which are involved in the acute-phase inflammatory response.

We first add a column to the dataset, marking the subjects as obese or not-obese. We then plot the GlycA distributions for the two sets as boxplots.

df_full_data %>% 
  # Add a column to mark obese/non-obese subjects
  mutate(obesity = ifelse(BMI >= 30, yes = "Obese", no = "Not obese")) %>% 
  # Filter out subjects with missing values (if any)
  filter(!is.na(obesity)) %>% 
  # Box plots for each group
  ggplot(aes(y = GlycA, x = obesity, color = obesity)) +
  geom_boxplot() + 
  # Remove x axis title 
  theme(axis.title.x = element_blank())

In the box plot above, the lower, middle and higher hinges correspond to 25%, 50% and 75% quantiles, respectively. Therefore, we see that the GlycA distribution for obese subjects is clearly shifted to higher values as compared to non-obese subjects, indicating that this marker is likely positively associated with obesity.

We will now estimate associations of each biomarker to BMI via linear regression:

\begin{align} y = \beta * x + \alpha \end{align}

where the blood biomarker is the outcome, $y$, and BMI is the exposure $x$. The association of $x$ with $y$ refers to the beta coefficient ($\beta$).

Below, we use function ggforestplot::discovery_regression(), with input parameter model set to 'lm' to estimate multiple, adjusted, linear associations, for all the biomarkers. Note that we scale the biomarkers so that the magnitude of associations are directly comparable across the different biomarkers. Log-transformation is done in order to satisfy the assumption of linear regression for normal distribution of the residuals.

# Extract names of NMR biomarkers
nmr_biomarkers <- dplyr::intersect(
  ggforestplot::df_NG_biomarker_metadata$machine_readable_name,
  colnames(df_nmr_results)
)

# NMR biomarkers here should be 250
stopifnot(length(nmr_biomarkers) == 250)

# Select only variables to be used for the model and collapse to a long data 
# format
df_long <-
  df_full_data %>%
  # Select only model variables
  dplyr::select(tidyselect::all_of(nmr_biomarkers), gender, baseline_age, BMI) %>%
  # Log-transform and scale biomarkers
  dplyr::mutate_at(
    .vars = dplyr::vars(tidyselect::all_of(nmr_biomarkers)),
    .funs = ~ .x %>% log1p() %>% scale %>% as.numeric()
  ) %>%
  # Collapse to a long format
  tidyr::gather(
    key = biomarkerid,
    value = biomarkervalue,
    tidyselect::all_of(nmr_biomarkers)
  )

# Estimate sex- and age-adjusted associations of metabolite to BMI
df_assoc_per_biomarker_bmi <-
  ggforestplot::discovery_regression(
    df_long = df_long,
    model = "lm",
    formula =
      formula(
       biomarkervalue ~ BMI + factor(gender) + baseline_age
      ),
    key = biomarkerid,
    predictor = BMI
  ) %>% 
  # Join this dataset with the grouping data in order to choose a different 
  # biomarker naming option
  left_join(
    select(
      df_NG_biomarker_metadata, 
      name,
      biomarkerid = machine_readable_name
    ), 
    by = "biomarkerid")

head(df_assoc_per_biomarker_bmi)

The data frame df_assoc_per_biomarker_bmi above, contains 250 rows, one for each blood biomarker, and 5 variables (columns): the biomarkerid (same as variable machine_readable_name in df_NG_biomarker_metadata) which has the biomarker abbreviations in the same format as in the input csv file, the biomarker name which is a bit more descriptive than biomarkerid, and the variables estimate, se and pvalue that correspond to the linear regression coefficient $\beta$, the standard error and the p-value, respectively.

Forest plot

Let's now visualize the results. For the purposes of this demo, we will plot a selected subset of the biomarkers in a forestplot layout. At the end of the tutorial, you will find an example of how to plot associations for all Nightingale biomarkers in a pdf (see also vignette("ggforestplot") for more details on the usage of ggforestplot::forestplot()).

The ggforestplot package includes a data frame, called ggforestplot::df_NG_biomarker_metadata, with metadata on the Nightingale blood biomarkers, such as different naming options, descriptions and group/sugroup information. We will use the group (or sugroup) information in this data frame to decide which groups of biomarkers and in what order we want to visualize here.

# Display blood biomarker groups
ggforestplot::df_NG_biomarker_metadata %>% 
  pull(group) %>% 
  unique()

# # You may wish, alternatively, to display blood biomarker subgroups and plot 
# # according to that. You can see which subgroups are available in the 
# # grouping data by uncommenting the following lines. In several cases, subgroup
# # is identical to group. 
# ggforestplot::df_NG_biomarker_metadata %>% 
#   pull(subgroup) %>% 
#   unique()

# Choose the groups you want to plot and define the order with which group 
# categories will appear in the plot
group_order <- c(
  "Branched-chain amino acids",
  "Aromatic amino acids",
  "Amino acids",
  "Fluid balance",
  "Inflammation",
  "Fatty acids",
  "Triglycerides",
  "Other lipids",
  "Cholesterol"
)

# Extract a subset of the df_NG_biomarker_metadata, with the desired group
# order, to set the order of biomarkers in the forestplot later on
df_with_groups <- 
  ggforestplot::df_NG_biomarker_metadata %>% 
  # Select subset of variables
  select(name = name,
         group) %>% 
  # Filter and arrange for the wanted groups
  filter(group %in% group_order) %>%
  arrange(factor(group, levels = group_order))
Statistical significance for multiple testing

We'd also like to add to the forrestplot a Bonferroni correction to account for multiple testing. Here we have a comparison of 250 biomarkers. If we suppose the common significance threshold $\alpha = 0.05$, the Bonferroni correction for each individual hypothesis, assuming the 250 tests are independent, is $\alpha = 0.05 / 250 \approx 0.0002$. However, for biologically correlated measures this correction is too strict; Here, instead of correcting for the total number of biomarkers we will use the number of principal components that explain 99% of the variance in the data.

# Perform principal component analysis on metabolic data
df_pca <- 
  df_full_data %>% 
  select(tidyselect::all_of(nmr_biomarkers)) %>% 
  nest(data = dplyr::everything()) %>% 
  mutate(
    # Perform PCA on the data above, after scaling and centering to 0 
    pca = map(data, ~ stats::prcomp(.x, center = TRUE, scale = TRUE)),
    # Augment the original data with columns containing each observation's 
    # projection into the PCA space. Each PCA-space dimension is signified 
    # with the .fitted prefix in its name 
    pca_aug = map2(pca, data, ~broom::augment(.x, data = .y))
  )

# Estimate amount of variance explained by each principal component
df_pca_variance <- 
  df_pca %>% 
  unnest(pca_aug) %>% 
  # Estimate variance for the PCA projected variables
  summarize_at(.vars = vars(starts_with(".fittedPC")), .funs = ~ var(.x)) %>% 
  # Gather data in a long format 
  gather(key = PC, value = variance) %>% 
  # Estimate cumulative normalized variance
  mutate(cumvar = cumsum(variance / sum(variance)),
         PC = str_replace(PC, ".fitted", ""))

# Find number of principal components that explain 99% 
pc_99 <- 
  df_pca_variance %>% 
  filter(cumvar <= 0.99) %>% 
  nrow()

print(pc_99)

# Corrected significance threshold
psignif <- signif(0.05 / pc_99, 1)

Therefore, our significance threshold here will be $\alpha = 0.05 / r pc_99 \approx r psignif$

Below we plot the results. In the forestplot function, our main input is the data frame df_assoc_per_biomarker_bmi, while we use the previously constructed data frame df_with_groups and its variable group to impose the grouping and the order of biomarkers in the plot. For the statistical significance, we input the significance threshold as the parameter psignif =r psignif` and we define explicitly the variable name indf_assoc_per_biomarker_bmi` that contains the p-values of the linear regression, in this case the column pvalue.

# Join the association data frame with group data
df_to_plot <-
  df_assoc_per_biomarker_bmi %>%
  # use right_join, with df_grouping on the right, to preserve the order of 
  # biomarkers it specifies. 
  dplyr::right_join(., df_with_groups, by = "name") %>%
  dplyr::mutate(
    group = factor(.data$group, levels = group_order)
  ) %>%
  tidyr::drop_na(.data$estimate)

# Draw a forestplot of cross-sectional, linear associations.
forestplot(
  df = df_to_plot,
  pvalue = pvalue,
  psignif = psignif,
  xlab = "1-SD increment in biomarker concentration\nper unit increment in BMI"
) +
  ggforce::facet_col(
    facets = ~group,
    scales = "free_y",
    space = "free"
  )

The figure shows that BMI is significantly associated with several of the plotted blood biomarkers. Specifically, higher values of branched chain amino acids, aromatic amino acids, omega-6 and glycoprotein acetylation are significantly associated with obesity, as has been previously seen in Würtz P et al. Metabolic Signatures of Adiposity in Young Adults: Mendelian Randomization Analysis and Effects of Weight Change., PLoS Med. 2014.

Cox proportional hazards regression

Similar analysis may be performed for other phenotypic traits. Two other very common regression approaches in epidemiology are logistic regression for studying the association to binary outcomes and Cox proportional hazards for time-to-event outcomes.

In similar spirit as above, we utilize ggforestplot::discovery_regression() to demonstrate an example of fitting Cox proportional hazards in all of the biomarkers to estimate the gender-, age- and BMI- adjusted hazard ratios for type 2 diabetes.

# Select only variables to be used for the model and collapse to a long data 
# format
df_long <-
  df_full_data %>%
  # Select only model variables
  dplyr::select(tidyselect::all_of(nmr_biomarkers), gender, baseline_age, BMI, age_at_diabetes, incident_diabetes) %>%
  # Log-transform and scale biomarkers
  dplyr::mutate_at(
    .vars = dplyr::vars(tidyselect::all_of(nmr_biomarkers)),
    .funs = ~ .x %>% log1p() %>% scale %>% as.numeric()
  ) %>%
  # Collapse to a long format
  tidyr::gather(
    key = biomarkerid,
    value = biomarkervalue,
    tidyselect::all_of(nmr_biomarkers)
  )

# Estimate sex-adjusted associations of metabolite to T2D
# Notice that parameter model below is set to 'coxph'
df_assoc_per_biomarker_diab <-
  discovery_regression(
    df_long = df_long, 
    model = "coxph",
    formula =
      formula(
        survival::Surv(
          time = baseline_age,
          time2 = age_at_diabetes,
          event = incident_diabetes
        ) ~ biomarkervalue + factor(gender) + BMI
      ),
    key = biomarkerid,
    predictor = biomarkervalue
  ) %>% 
  # Join this dataset with the grouping data in order to choose a different 
  # biomarker naming option
  left_join(
    select(
      df_NG_biomarker_metadata, 
      name,
      biomarkerid = machine_readable_name
    ), 
    by = "biomarkerid")

head(df_assoc_per_biomarker_diab)

Plot results for fatty acids.

# Filter df_NG_biomarker_metadata for only fatty acids
df_grouping <-
  df_NG_biomarker_metadata %>% 
  filter(group %in% "Fatty acids") 

# Join the association data frame with group data
df <-
  df_assoc_per_biomarker_diab %>%
  # use right_join, with df_grouping on the right, to preserve the order of 
  # biomarkers it specifies. 
  dplyr::right_join(., df_grouping, by = "name") 

# Draw a forestplot of time-to-event associations for t2d. 
ggforestplot::forestplot(
  df = df, 
  name = name,
  se = se,
  pvalue = pvalue,
  psignif = 0.001,
  xlab = "Hazards ratio for incident type 2 diabetes (95% CI)\nper 1−SD increment in metabolite concentration",
  title = "Fatty acids and risk of future T2D",
  logodds = TRUE
) 

Notice that the main difference in the latter forestplot as compared to the forestplot for linear associations above, is parameter logodds = TRUE. In this case, the beta values, here log hazard ratios, are exponentiated inside the forestplot to obtain hazard ratios and plotted in a logarithmic scale. The confidence intervals for the odds ratios are estimated using the standard errors of the log hazard ratios, as follows $\text{CI}\text{low} = \exp(\beta - 1.96 * \text{SE})$ and $\text{CI}\text{high} = \exp(\beta + 1.96 * \text{SE})$. If you wish to use some confidence interval other than the default 95%, you may specify this in the parameter ci of the function.

A forestplot for all Nightingale blood biomarkers

Finally, the ggforestplot package provides a custom function that plots and prints all Nightingale blood biomarkers in a single pdf file, called plot_all_NG_biomarkers(). There are two layouts available for 2016 and 2020 versions of the biomarker platform. Alternatively, one can also input a custom layout. Its usage is straightforward.

# Plot in all biomarkers in a two-page pdf file
plot_all_NG_biomarkers(
  df = df_assoc_per_biomarker_bmi, 
  machine_readable_name = biomarkerid,
  name = name, 
  estimate = estimate, 
  se = se, 
  pvalue = pvalue, 
  xlab = "1-SD increment in biomarker concentration\nper unit increment in BMI",
  filename = "all_ng_associations_bmi.pdf"
)
# ------------> Evaluate this only when you want to refresh the pdf plots in 
# ------------> figures/ Don't update them everytime if not necessary as git 
# ------------> saves multiple copies. 
# Plot in all biomarkers in a two-page pdf file
plot_all_NG_biomarkers(
  df = df_assoc_per_biomarker_bmi, 
  machine_readable_name = biomarkerid,
  name = name, 
  estimate = estimate, 
  se = se, 
  pvalue = pvalue, 
  xlab = "1-SD increment in biomarker concentration\nper unit increment in BMI",
  filename = "figures/all_ng_associations_bmi.pdf"
)

You may view the output of this function here.

Final remarks

The purpose of this demo is to familiarize the user with the Nightingale Health biomarker dataset and to showcase simple association analysis using linear and Cox proportional hazards regression.

Further details on data analysis approaches may be found in the following example publications.

References



NightingaleHealth/ggforestplot documentation built on April 10, 2020, 7:01 p.m.