# This is needed to wrap the code for the pdf rm(list=ls.str()) library(knitr) library(survival) library(forestplotNMR) opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE)
ForestplotNMR is an R package to plot associations of diseases and phenotypes in the form a forest plot (a.k.a. blobbogram). Specifically, it offers built-in naming and grouping features to assist visualizations of epidimiological analysis with NMR metabolomics quantified biomarkers (see section "Overview of NMR data"). The associations may be linear (linear regression), odds ratios (logistic regression) or hazard ratios (e.g. Cox Proportional hazards).
To install, first open an R session in the parent directory of the forestplot/
folder. You will need the package devtools
, if not already installed, install it by typing:
# Install install.packages("devtools")
Then install and load forestplotNMR
by typing:
# (Make sure you are in the parent directory of forestplot) # Install devtools::install(pkg = "forestplotNMR") # Load in session library(forestplotNMR)
Hopefully, dependencies will be installed automatically. If not let me know. (See also pdf manual for list of dependencies.)
The biomarker concentration data are often available in both tsv and xlsx formats. R can read both but in this tutorial we will focus on tsv format. We will be using the collection of the tidyverse
packages which you need to have installed. Naturally, the previous step installed also tidyverse
since it is a dependency for forestplotNMR
. If tidyverse
is not installed, install it by typing >install.packages("tidyverse")
and then load it by typing >library(tidyverse)
.
Below, we load two example tsv files in R and view the available biomarkers (the files can be found in the demos
folder):
# Read the biomarker concentration file metabol_data_1 <- read_tsv(file = "example_nmr_1.tsv", na = c("NA", "NDEF", "TAG"), col_types = cols(.default = col_double(), sampleid = col_character())) metabol_data_2 <- read_tsv(file = "example_nmr_2.tsv", na = c("NA", "NDEF", "TAG"), col_types = cols(.default = col_double(), sampleid = col_character())) class(metabol_data_1) print(head(metabol_data_1))
Print and inspect all the biomarker abbreviations
print(names(metabol_data_1)) # You may also see their full names by printing the built-in dataset head(forestplotNMR::biomarkers)
We will now read the phenotype data that correspond to the blood biomarker data and join nmr with phenotype datasets.
# Read the biomarker concentration file pheno_data_1 <- read_tsv(file = "example_pheno_1.tsv", col_types = cols(.default = col_double(), sampleid = col_character(), VMALE=col_factor(levels = c(0,1)))) pheno_data_2 <- read_tsv(file = "example_pheno_2.tsv", col_types = cols(.default = col_double(), sampleid = col_character(), VMALE=col_factor(levels = c(0,1)))) metabol_n_pheno_1 <- left_join(x = metabol_data_1, y = pheno_data_1, by = "sampleid") metabol_n_pheno_2 <- left_join(x = metabol_data_2, y = pheno_data_2, by = "sampleid")
We will now look at the association of BMI with each metabolite via linear regression:
\begin{align} y = \beta * x + \alpha \end{align}
Where metabolite is the outcome , $y$, and BMI is the exposure $x$. The association of $x$ with $y$ refers to the beta coefficient ($\beta$).
# We will use the built-in dataframe `biomarkers`, which contains all possible # biomarkers of the NMR platform under discussion. (The dataframe contains # standard abbreviation, spelled-out name and forestplot display name and one # suggestive categorization of the biomarkers.) # The biomarkers we will estimate the associations for are the intersection # between the column biomarkers$abbrev and the column names of the metabol_n_pheno # and the column bmrs <- biomarkers$abbrev %>% intersect(., colnames(metabol_n_pheno_1)) %>% intersect(., colnames(metabol_n_pheno_2)) # No of bmr nobmr <- length(bmrs)
# FIRST COHORT ANALYSIS # Log transform the metabolites metabol_n_pheno_1[,bmrs] <- metabol_n_pheno_1[,bmrs] %>% apply(2, log1p) # Initialize data_frames that will store beta, SE and p-values beta_cohort1 <- se_cohort1 <- pval_cohort1 <- data_frame(abbrev=character(nobmr), BMI=double(nobmr)) ## Sex- and age-adjusted associations of cohort to BMI cohort1 <- metabol_n_pheno_1 %>% gather(., key = bmr, value = bmr_value, bmrs) %>% split(.$bmr) %>% map(~ lm(scale(bmr_value) ~ BMI+VMALE+AGE, data = .)) %>% map(summary) %>% map("coefficients") # Assign values to beta, se and pval dataframe beta_cohort1$BMI <- cohort1 %>% sapply(., function(x) x["BMI","Estimate"]) beta_cohort1$abbrev <- cohort1 %>% names se_cohort1$BMI <- cohort1 %>% sapply(., function(x) x["BMI","Std. Error"]) se_cohort1$abbrev <- cohort1 %>% names pval_cohort1$BMI <- cohort1 %>% sapply(., function(x) x["BMI","Pr(>|t|)"]) pval_cohort1$abbrev <- cohort1 %>% names
# SECOND COHORT ANALYSIS # Log transform the metabolites metabol_n_pheno_2[,bmrs] <- metabol_n_pheno_2[,bmrs] %>% apply(2, log1p) # Initialize data_frames that will store beta, SE and p-values beta_cohort2 <- se_cohort2 <- pval_cohort2 <- data_frame(abbrev=character(nobmr), BMI=double(nobmr)) ## Sex- and age-adjusted associations of cohort to BMI cohort2 <- metabol_n_pheno_2 %>% gather(., key = bmr, value = bmr_value, bmrs) %>% split(.$bmr) %>% map(~ lm(scale(bmr_value) ~ BMI+VMALE+AGE, data = .)) %>% map(summary) %>% map("coefficients") # Assign values to beta, se and pval dataframe beta_cohort2$BMI <- cohort2 %>% sapply(., function(x) x["BMI","Estimate"]) beta_cohort2$abbrev <- cohort2 %>% names se_cohort2$BMI <- cohort2 %>% sapply(., function(x) x["BMI","Std. Error"]) se_cohort2$abbrev <- cohort2 %>% names pval_cohort2$BMI <- cohort2 %>% sapply(., function(x) x["BMI","Pr(>|t|)"]) pval_cohort2$abbrev <- cohort2 %>% names
Let's now attempt to plot all the biomarkers with the forestplot_nmr function.
# We first merge the betas, se and pval dataframes beta <- full_join(x = beta_cohort1, y = beta_cohort2, by = "abbrev") se <- full_join(x = se_cohort1, y = se_cohort2, by = "abbrev") pval <- full_join(x = pval_cohort1, y = pval_cohort2, by = "abbrev") names(beta) <- names(se) <- names(pval) <- c("abbrev", "cohort1", "cohort2") # Create a grouping (in this case we just use one of the built in options) # Function bmr_selected_grouping() takes one argument "bmr_grouping_choice", # which must be one of the following four strings: "onepage_forestplot", # "serum_all", "edta_plasma_all", "short"). bmr_all_grouped <- bmr_selected_grouping(bmr_grouping_choice = "serum_all") forestplot_nmr(beta=beta, se=se, pval=pval, biomarker_groups_as_list=bmr_all_grouped, filename='plot_linear_comparison.pdf', plot_title="Linear associations to BMI", is_log_odds_ratio=F, xlabel="SD difference (95% CI)", signif_cutoff=0.05, legend_vars=names(beta)[2:3], biomarker_name_option = 1, height = 12, width = 9)
Notice that when only one cohort (or study) is plotted, the beta values are displayed
on the right y-axis. For one-study plots the default color is black, but this is customizable, see ?forestplot
option plotcolors
.
Notice also below that we are free to input other parameter specifications that will be fed into the pdf
device used for the plot. In this case below I use "a4", which I highly recommend for the cases where all biomarkers are plotted.
forestplot_nmr(beta=beta[,1:2], se=se[,1:2], pval=pval[,1:2], biomarker_groups_as_list=bmr_all_grouped, filename='plot_linear_cohort1.pdf', plot_title="Linear associations to BMI", is_log_odds_ratio=F, xlabel="SD difference (95% CI)", signif_cutoff=0.05, biomarker_name_option = 1, paper="a4", height = 12, width = 9)
Let's now plot a smalled group of biomarkers and see how we can customize the looks of the output. We can build our own grouping but we will start by using the existing example in bmr_selected_grouping()
function.
# A grouping with smaller number of biomarkers bmr_all_grouped <- bmr_selected_grouping(bmr_grouping_choice = "example_short") forestplot_nmr(beta=beta, se=se, pval=pval, biomarker_groups_as_list=bmr_all_grouped, filename='plot_linear_cohort1_short.pdf', plot_title="Linear associations to BMI", is_log_odds_ratio=F, xlabel="SD difference (95% CI)", signif_cutoff=0.05, legend_vars=names(beta)[2:3], biomarker_name_option = 1, paper="a4", height = 12, width = 9)
If you inspect the last pdf file, you'll notice that the layout needs some further customizing. Let's start by moving the y-axis labels closer to the
We will perform a logistic regression with a dummy event in order to plot also with the
is_log_odds_ratio
flag on.
# Create a random event metabol_n_pheno_1 <- metabol_n_pheno_1 %>% mutate(., event=BMI>30) # Initialize data_frames that will store beta, SE and p-values beta_cohort1 <- se_cohort1 <- pval_cohort1 <- data_frame(abbrev=character(nobmr), event=double(nobmr)) ## Sex- and age-adjusted associations of cohort to event cohort1 <- metabol_n_pheno_1 %>% gather(., key = bmr, value = bmr_value, bmrs) %>% split(.$bmr) %>% map(~ glm(event ~ scale(bmr_value) + VMALE+AGE, data = ., family = "binomial")) %>% map(summary) %>% map("coefficients") # Assign values to beta, se and pval dataframe beta_cohort1$event <- cohort1 %>% sapply(., function(x) x["scale(bmr_value)","Estimate"]) beta_cohort1$abbrev <- cohort1 %>% names se_cohort1$event <- cohort1 %>% sapply(., function(x) x["scale(bmr_value)","Std. Error"]) se_cohort1$abbrev <- cohort1 %>% names pval_cohort1$event <- cohort1 %>% sapply(., function(x) x["scale(bmr_value)","Pr(>|z|)"]) pval_cohort1$abbrev <- cohort1 %>% names
# We first merge the betas, se and pval dataframes beta <- beta_cohort1 se <- se_cohort1 pval <- pval_cohort1 names(beta) <- names(se) <- names(pval) <- c("abbrev", "event") # Create a grouping (in this case we just use one of the built in options) # Function bmr_selected_grouping() takes one argument "bmr_grouping_choice", # which must be one of the following four strings: "onepage_forestplot", # "serum_all", "edta_plasma_all", "short"). bmr_all_grouped <- bmr_selected_grouping(bmr_grouping_choice = "serum_all") forestplot_nmr(beta=beta, se=se, pval=pval, biomarker_groups_as_list=bmr_all_grouped, filename='plot_logistic_cohort1.pdf', plot_title="Odds ratios", is_log_odds_ratio=T, xlabel="SD-scaled odds ratios (95% CI)", signif_cutoff=0.05, plotpointshape = 23, legend_vars=names(beta)[2], biomarker_name_option = 1, paper="a4", height = 12, width = 9)
Below there is also an example with a small grouping that is built from scratch by the user. Here we will also make use of the indices
parameter, a list of numeric vectors, that has either 1, 2 or 4 components corresponding to 1, 2 or 4 columns respectively and each containing the indices of the biomarker_groups_as_list
to be plotted in each column.
# A grouping with smaller number of biomarkers metabo_groups_custom <- list("Branched-chain amino acids"=c("Ile", "Leu", "Val"), "Aromatic amino acids"=c("Phe", "Tyr", "His"), "Other amino acids"=c("Ala", "Gly", "Gln", NA), "Glycolysis-related metabolites"=c("Lac", "Pyr", "Glol"), "Ketone bodies"=c("AcAce", "bOHBut"), "Inflammation"=c("Gp"), "Miscellaneous"=c("Crea", "Alb", "Ace", "Cit", NA), "Fatty acid ratios"=c("TotFA", "SFA/FA","MUFA/FA","PUFA/FA", "FAw6/FA","LA/FA","FAw3/FA","DHA/FA","UnSat"), "Phospholipids"=c("SM", "TotCho", "PC", "TotPG", "TG/PG")) forestplot_nmr(beta=beta, se=se, pval=pval, biomarker_groups_as_list=metabo_groups_custom, indices=list(c(1:22), c(23:44)), filename='plot_logistic_cohort1_small.pdf', plot_title="Odds ratios", is_log_odds_ratio=T, xlabel="SD-scaled odds ratios (95% CI)", signif_cutoff=0.05, biomarker_name_option = 1, width=10, height=5)
So for example, in the indices
parameter, indices 1 to 4 correspond to "Branched-chain amino acids", "Ile", "Leu", and "Val" from metabo_groups_custom
. Mind that you would naturally prefer to equally sepqrate the biomarkers over the two columns, which is why here we plot the first 22 in the left column and the second 22 in the second column. However, equal number of biomarkers per column is not a requirement. Finally, notice that by adding NA values you can add empty space wherever you want in order to visually separate the groups or to better align the two columns.
Finally, below there is a demo using the option 2 of biomarker_name_option. The main difference between the two options of this parameter is how the names of the lipoprotein subclasses are displayed. For example, option 2 will display XXL-VLDL-TG % for the ratio of triglycerides in XXL VLDL particles, whereas option 1 assumes that plotting will be done according to lipid type, e.g. all triglycerides plotted in the same subgroup (as in the previous examples), therefore it would only display "Extremely large VLDL" (under the category "Triglycerides in lipoproteins").
# Create a grouping (in this case we just use one of the built in options) # Function bmr_selected_grouping() takes one argument "bmr_grouping_choice", # which must be one of the following four strings: "onepage_forestplot", # "serum_all", "edta_plasma_all", "short"). bmr_all_grouped <- bmr_selected_grouping(bmr_grouping_choice = "serum_all") forestplot_nmr(beta=beta, se=se, pval=pval, biomarker_groups_as_list=bmr_all_grouped, filename='plot_logistic_cohort1_option2_naming.pdf', plot_title="Odds ratios", is_log_odds_ratio=T, xlabel="SD-scaled odds ratios (95% CI)", signif_cutoff=0.05, plotpointshape = 23, legend_vars=names(beta)[2], biomarker_name_option = 2, paper="a4", height = 12, width = 9)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.