Software

Methods

We aim to use best practices of scientific computing [@pmid24415924]. We report all software used in the research. Briefly, we use the statistical computing language R with the RStudio IDE and key external packages including the so-called tidyverse suite of general data science tools, WGCNA and mixOmics for omics data -specialized analyses, rmarkdown for literate programming, and many more [@rref, @tidyverserref, @rmarkdownref, @wgcnaref, @mixomicsref]. We organize the project as an open-source R package which lets us leverage the language's software development infrastructure, including tools for project and dependency management, documentation, and version control.

Results

We set the following global R and RMarkdown code chunk options.

# Global R options
options(scipen=999) # = don't print scientific notation, e.g., 1e-9
set.seed(65934) # = reproducible pseudo-random number generation

# RMarkdown code chunk options
knitr::opts_chunk$set(
    error = TRUE, # keep rendered report clean
    warning = FALSE, # keep rendered report clean 
    message = FALSE,  # keep rendered report clean
    eval = TRUE, # use eval = FALSE to not evaluate chunk
    fig.align = "center", 
    fig.height = 5, 
    fig.width = 7,
    cache = TRUE, # TRUE to speed up rendering
    dev = "png" # change to "svg" for easier post-modification (avoid manual modification though)
    # and all other defaults...
)

Before setting up the software environment below, one should note the following. Both CRAN- and BioConductor-hosted packages are needed. Watch out for breaking other analyses. Use isolated environments (e.g. conda, docker, packrat) or other strategies. If many required packages are missing (or old versions), installation can take a very long time. System/OS-level software is needed and they're not installed automatically. The installation returns an error if one is missing. The missing dependencies are reported in the log and error messages. Avoid namespace conflict bugs by loading only the most necessary packages into the global environment.

You may need to install the following packages on an Ubuntu/Debian -system.

sudo apt install libcurl4-openssl-dev libgit2-dev libssl-dev libssh2-1-dev libxml2-dev

We set up the development software.

# install.packages(c("devtools", "tidyverse", "janitor"))
library(devtools)

load_all()

tvsproject::install_dependencies()
tvsproject::load_dependencies()

Alternatively, install and load everything via the built library. This will be the primary method eventually.

# devtools::install_github("eteppo/tvs-project")
# library(tvsproject)

Finally, the current software environment is shown below.

tvsproject::session_platform()
tvsproject::session_packages()

Reporting

Methods

We aim to adhere to open science best practice and make our research as fully reported, transparent, findable, accessible, interoperable, reproducible, and reusable as possible.

We use relevant reporting guidelines, including the STROBE guideline for observational epidemiology and the PRISMA guideline for systematic reviews. We aim to publish primary reports also in freely accessible forums, such as medical pre-print archives. Sensitive data may be securely shared only on reasonable request since sharing the raw data publicly compromises research participant privacy and consent under our legislation. We use Git and GitHub version control utilities for all non-sensitive information.

Results

The git-tracked public project repository without sensitive information can be accessed at github.com/eteppo/tvs-project. This document has been rendered at r lubridate::now().

Review

In this section, we describe the current state of atherosclerosis research in general and with a focus on the space of blood lipidomics, small metabolites, and transcriptomics. The goal of the review is to describe the most plausible model of atherosclerosis given the current state of evidence on atherosclerosis. This is the source for all of the project's introduction and background sections in research plans, applications, and manuscripts.

Methods

Even though this is not a formal systematic review, we aim to describe the review processes as outlined in the PRISMA checklist for reporting systematic reviews @pmid19631508.

General review

As the amount of information on atherosclerosis in general is extremely large for a small project, we need to introduce many limitations to the general review as follows.

A standard way to describe atherosclerosis on the tissue-level was developed by the Committee on Vascular Lesions of the Council on Arteriosclerosis, of the American Heart Association, in the early 1990's based on a review of the available publications and the expertise of the council [@pmid7648691; @pmid8181179; @pmid1728483].

In 2017, the European Atherosclerosis Society formed a consensus panel that aimed to review all of the published research on the causal relationship between low-density lipoproteins and atherosclerotic cardiovascular disease (ASCVD) events [@pmid28444290]. Notably, they described the effects with the following models where $RRR$ is the expected relative risk reduction in ASCVD events, $\Delta c$ is the reduction in plasma LDL-C (mmol/l), and $\Delta t$ is the duration of the reduction.

$$E[RRR] = 1 - \frac{R_1}{R_0}$$ $$E[RRR_{5}] = 1 - 0.78^{\Delta c}$$ $$E[RRR_{40}] = 1 - 0.46^{\Delta c}$$ $$E[RRR_{\Delta t}] = 1 - e^{\Delta c \left( −0.0152\times(\Delta t − 5) − 0.249 \right)}$$

plot_ldlc_ascvd_eas <- function() {

  rrr_dt <- crossing(
    dt = seq(5, 40, length.out = 100), 
    dc = seq(0, 5, length.out = 100)
    ) %>%
    mutate(rrr = 1 - exp((-0.0152*(dt - 5) - 0.249) * dc))

  plot <- rrr_dt %>%
    ggplot(aes(x = dt, y = dc, fill = rrr, z = rrr)) +
      geom_raster(interpolate = TRUE) +
      stat_contour(
        breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0), 
        color = "white", 
        size = 0.1
      ) +
      scale_fill_gradient(
        low = "white", 
        high = "black", 
        limits = c(0, 1),
        breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)
      ) +
      scale_x_continuous(breaks = seq(5L, 40L, by = 5)) +
      theme_minimal() +
      labs(
        x = "Duration (years)", 
        y = "Reduction in plasma LDL-C (mmol/l)", 
        fill = str_wrap("Expected RRR for ASCVD events", 10)
      ) +
      theme(panel.grid = element_blank(), text = element_text(size = 10))

  return(plot)
}
plot_ldlc_ascvd_eas()
ggsave(
  plot = plot_ldlc_ascvd_eas(), 
  filename = "ldlc-ascvd-eas-ference-2017.pdf", 
  path = here::here("analysis", "images"),
  width = 12,
  height = 8,
  units = "cm"
)

Weber and Noels (2011) reviewed the pathogenesis of atherosclerosis and suggest therapeutic options @pmid22064431. Back et al. (2019) synthesized the current state of research on the inflammatory system in atherosclerosis, with an emphasis on mechanisms of failed inflammation resolution @pmid30846875. The Global Burden of Disease project calculates state-of-the-art global distributions of disease, disability, and death. We collect the 2017 results of the variables most relevant to atherosclerosis and compare the quantities to other classes of health problems [@pmid30496104, @pmid30496103].

here("raw-data", "metadata", "review-numbers.csv") %>%
  read_csv(col_types = "cccccnnnc")

We also review the current state of clinical practice guidelines of atherosclerotic conditions in this section for a view on the current state of clinical atherosclerosis medicine [@10.1093/eurheartj/ehx095; @10.1093/eurheartj/ehw106; @pmid23996286].

The visual representation of the extracted variables and causal relationships is in progress below

here("raw-data", "metadata", "review-nodes.csv") %>%
  read_csv(col_names = "name", col_types = "c") %>%
  arrange(name) %>%
  distinct_all() # %>%
  # as_tbl_graph() %>%
  # plot_causal_network(
  #   geom = "text", 
  #   str_wrap_width = 15
  # )

Finally, methods used to measure variables and effects relevant to atherosclerosis are also being collected below to gain a better overall picture of the evidence base.

here("raw-data", "metadata", "review-methods.csv") %>%
  read_csv(col_names = "method", col_types = "c") %>%
  plot_strings(label = method)

Plasma lipidomics and NMR metabolomics review

The review is not based on a pre-defined protocol but the process is documented in the corresponding results section.

The search query lipidom* AND review in PubMed gave 830 results. After screening for eligibility, 71 results remain. We also found 7 additional eligible results from other related searches, giving 78 results.

"29165427 28859259 27469345 28752333 27663237 28340309 28457845 26703190 24606142 26507169 28822794 27070078 27816901 29655461 25376488 29534960 29446847 27286762 30625374 16052242 29959947 28216054 28238863 25409862 30926037 28408341 22070478 22316558 31141713 15389848 28302589 30128447 26856187 27068982 26109865 23432956 29166560 20606693 30928338 26212444 23948799 24995945 29665359 26394722 25694038 26064851 23830914 25445283 26358168 20931646 19126783 24921707 30032454 28341148 26231889 21318352 29654757 21708282 21632539 28161582 17052686 17153937 21939287 15892569 16099399 25309011 24957365 17954217 25516624 24957366 26190681 24055840 18041669 27756782 22178427 21673308 20425241 19408941"

A search in PubMed targeting original lipidomics results in cardiovascular disease, (atherosclerosis OR cardiovascular OR coronary artery OR peripheral artery) AND (lipidome OR lipidomics) NOT review, gave 334 results. After screening titles and abstracts for eligibility, 40 results remain.

"31190850 31131674 30999225 30944221 30870703 30711004 30710063 29982602 29913478 29858423 29366988 29306452 28814398 28728066 28714336 28650171 28527370 28472752 27765765 27756783 27756781 27543723 26545014 26523994 26486974 26162515 26142205 26093887 25855318 25517033 25528424 25444678 25363705 24622385 24448739 23967253 23868910 23241314 22795978 21511877 31270131"
here("raw-data", "metadata", "review-lipid-class-edges.csv") %>%
  read_delim(delim = ";", col_types = "cc") %>%
  as_tbl_graph() %>%
  plot_causal_network(
    layout = "gem",
    text_size = 4,
    edge_width = 0.5,
    arrow_length = 2,
    clip = "on"
  )
ggsave(
  plot = p,
  filename = "review-lipid-class-network.png",
  path = here("analysis", "images"),
  width = 30,
  height = 30,
  units = "cm",
  dpi = 320
)
knitr::include_graphics(
  path = here("analysis", "images", "review-lipid-class-network.png"),
  dpi = 650
)

Results

General review

Atherosclerosis is a histopathological state of the arterial wall characterized by lipid accumulation and crystallization, chronic inflammation, cell death, fibrosis, calcification, structural deformation, rupture, hemorrhage, and thrombosis [@pmid7648691; @pmid8181179; @pmid1728483].

Present causal hypotheses of atherosclerosis are complex, including hundreds of environmental, hemodynamical, immunological, lipidomic, biogerontological, and other variables [@pmid30846875; @pmid22064431]. At the same time, the key effect of the plasma concentration of apoprotein B (apoB) -containing and low-density lipoprotein (LDL) particles on atherosclerotic cardiovascular disease has become ever more certain [@pmid28444290] (Figure 1).

knitr::include_graphics(
  path = here("analysis", "images", "ldlc-ascvd-eas-ference-2017.png"),
  dpi = 320
)

The harmful effects of atherosclerosis are currently unmatched in occurence, including disease states of the major arteries and valves and their effects on symptoms, disability, death, and the society at large (Table 1). For example, almost the same proportion of deaths globally in 2017 have been attributed to ischaemic heart disease (16 %) – most of which wouldn't exist without atheroslcerosis – as to all cancers (17 %). Similarly, 7 % of disability-adjusted life years (DALYs) were related to ischaemic heart disease when, for example, all mental disorders were estimated to have caused 5 %. [@pmid30496104; @pmid30496103]

options(knitr.kable.NA = '')

here("raw-data", "metadata", "review-numbers.csv") %>%
  read_csv(col_types = "cccccnnnc") %>%
  select(-time, -location, -pmid) %>%
  knitr::kable(
    format = "latex",
    caption = "Major effects of atherosclerosis with global 2017 estimates based on the Global Burden of Disease 2017 results.",
    booktabs = TRUE,
    format.args = list(big.mark = ","),
    col.names = c("Cause", "Effect", "Measure", "Low", "Point", "High")
  ) %>%
  kableExtra::kable_styling(font_size = 7, position = "center") %>%
  kableExtra::column_spec(1:2, width = "3cm") %>%
  kableExtra::column_spec(3, width = "2cm") %>%
  kableExtra::column_spec(4:6, width = "1cm") %>%
  kableExtra::add_header_above(c(" " = 3, "Estimate" = 3)) %>%
  kableExtra::collapse_rows(1:3, valign = "top")

Diagnostic and therapeutic solutions to the problem of atherosclerotic disease are outlined in practice guidelines that are updated regularly by various expert panels around the world – using various methods – although actual practice may vary much more widely [@10.1093/eurheartj/ehx095; @10.1093/eurheartj/ehw106; @pmid23996286]. In summary, the main interventions currently include physical exercise, nutritional methods, smoking cessation, acetylsalisylic acid, clopidogrel, statins, angiotensin convertase inhibitors, nitrates, beta blockers, calcium channel blockers, angioplasty, stenting, and bypass. The main diagnostic methods used to select and configure the interventions include clinical history and exam, rest and stress electrocardiography, stress ultrasound, stress nuclear imaging, computed tomography, and angiography.

Given the magnitude of the effects of atherosclerosis we still experience today, we are in dire need of better predictive and therapeutic interventions as well as implementation of interventions.

Plasma lipidomics review

It has been estimated that almost three fourths of the biological human plasma mass are lipids – consisting of up to hundreds of thousands of distinct, chemically diverse molecules. Out of the almost 600 species that were measureable in 2010, almost all of the total concentration are sterols, glycerophosholipids (160 species), or glycerolipids, with increasingly smaller molar contributions from sphingolipids (204 species), fatty acyls, and prenol lipids. [@pmid22070478] Lipids are also constantly haphazardly modified in oxidative reactions. All individual lipid structures are classified in the LIPIDMAPS database [@pmid19098281]. The lipids species are closely connected in the their metabolic network (Figure 2). At the higher level, plasma lipids are either free, protein-bound, or packed into various lipoprotein particles and lipid structures.

knitr::include_graphics(
  path = here("analysis", "images", "review-lipid-class-network.png"),
  dpi = 650
)

Plasma lipidomics epidemiology should be viewed as hypothesis-generating in its ability to provide evidence for a particular causal model – like other omics studies – even though the dimensionality is not yet as large. In lipidomics studies, the simultaneous measurement of hundreds of lipids from prepared samples is based on mass spectrometry. Mass spectrometry can be run with different partitioning, ionization, acceleration, sensing, and computational methods – ultimately leading to different performance profiles and sources of error. [@pmid27286762]

Major lipidomics results for cardiovascular diseases were reviewed in 2018 [@pmid29665359]. Dozens of lipid species and their subclasses have been associated with features from plaque stability to mortality, and the literature is growing for the eventual make-or-break systematic review, meta-analysis, and validation.

Objectives

Methods

A simplified causal diagram of the variables available in the study is in progress below (the hypothesis).

"
name
sudden cardiac death
cardiovascular death
all-cause death
vascular diseases (4)
stenosis
ejection fraction
plaque aha phenotype
plaque stability
plasma lipidomics state (529)
plasma lipoprotein state (89)
plasma small metabolite state (22)
whole blood transcriptomics state (27252)
blood monocyte transcriptomics state (27252)
clinical blood chemistry state (10)
disease state (11)
smoking
alcohol consumption
blood pressure
drugs (4 groups)
age
sex
weight
height
percutaneous transluminal angioplasty
coronary artery bypass grafting
" %>%
  read_csv() %>%
  arrange(name) %>%
  mutate(index = row_number())
"
from,to
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
1,11
1,13
1,14
1,15
1,16
1,17
1,18
1,21
1,22
1,23
1,24
1,25
2,3
2,5
2,6
2,9
2,14
2,15
2,20
2,21
2,22
2,23
4,3
4,6
4,9
4,14
4,15
4,22
4,23
5,3
5,6
5,9
5,10
5,11
5,14
5,15
5,21
5,22
5,23
7,3
7,6
7,9
7,10
7,14
7,15
7,16
7,17
7,21
7,22
7,23
8,3
8,6
8,11
8,22
8,23
9,3
"

Results

In this research project, we aim to contribute valueable evidence on how 1) the plasma lipidome, 2) the serum NMR metabolome, and 3) the whole blood and monocyte transcriptomes relate to atherosclerotic lesions, vascular events, and mortality, as well as many traditional causes and predictors of atherosclerosis, in the Tampere Vascular Study cohort.

This evidence will contribute to the crucial ongoing effort to produce an accurate systems-wide model of atherosclerosis at the molecular level in order to design optimal predictive and therapeutic technology in the future.

Data collection

In this section, we describe the selection and measurement methods of the study. The goal is to organize this metadata in a structured manner and write a summary that serves as the source for most of the methods sections in other reports.

Methods

Design

The Tampere Vascular Study (TVS) cohort consists of two consecutive series of patients scheduled for vascular surgery or exercise stress testing who are followed up for vascular events and measured for an extensive set of samples and variables. Both studies have been described in publications previously [@pmid24122613; @pmid16515696].

Setting

Patients have been recruited in two departments of the Tampere University Hospital in Tampere, Finland, from 10/2001 to 2009 – more specifically in the Division of Vascular Surgery and the Heart Center during 2005-2009 and in the Division of Clinical Physiology during 10/2001-1/2008.

In the exercise stress testing series, venous blood samples have been drawn, prepared, and stored in 2008 by the FimLab Laboratories (Tampere, Finland) personnel. In the vascular surgery series, arterial samples were collected at the time of participation under the supervision of a senior vascular surgeon in the Division of Vascular Surgery and Heart Center (Tampere University Hospital, Tampere, Finland) and prepared and stored by the FimLab Laboratories (Tampere, Finland) personnel.

All patients have filled a questionnaire at the time and place of participation. Health record -based information has been collected by members of the research group during the study. Clinical chemistry data access reaches back to 10/1998 and other EHR-based data have no time constraints. The microscopy of the arterial tissue slides have been done by a senior pathologist of the FimLab Laboratories (Tampere, Finland). NMR metabolomics measurements were done in 2010 by the NMR Metabolomics Laboratory (University of Kuopio) and the Computational Medicine Research Group (University of Oulu). Lipidomics measurements have been done in 2017 by Zora Biosciences Oy (Espoo, Finland).

Participants

Recruiters asked vascular surgery patients to participate if they were receiving a carotid endarterectomy for over 70 % carotid stenosis, a coronary artery bypass for symptomatic coronary artery disease, or a femoral or aortic endarterectomy with aortoiliac or aortobifemoral bypass for symptomatic peripheral arterial disease. Similarly, recruiters asked exercise stress test patients to participate if they were receiving an exercise stress test or a spiroergometer test, and from this population, patients who had subsequently received a coronary angiography were selected to the present cohort. Patients who didn't give informed consent to participate are excluded.

Variables

The complete study database includes data about the patients and their whole blood, monocytes, plasma, serum, and arterial samples. Main sets of variables include DNA (single nucleotide polymorphisms, copy number variations), RNA, micro RNA, bacteria, lipids, lipoproteins, small metabolites, histology, stenosis, hemodynamics, mortality, as well as relevant diagnoses, symptoms, treatments, environmental exposures, and clinical chemistry.

Measurement

Questionnaire

All participants have filled a questionnaire asking about basics including age, sex, weight, and height, exposures including cigarette or pipe smoking history and current alcohol consumption, current diagnoses including diabetes, hypercholesterolemia, hypertension, cardiac insufficiency, cancer, lung disease, liver disease, rheumatic disease, kidney disease, thyroid gland disease, or dementia, as well as current treatments including hormone replacement therapy, diabetes drugs, statins, and blood pressure drugs.

Health records and registries

Mortality data, including the date and cause of death, are collected from the national cause of death register or the electronic health records for each analysis. Similarly, vascular states and events, including coronary artery disease, myocardial infarction, coronary heart disease, arteriosclerosis obliterans, cerebrovascular disease, and peripheral vascular disease as well as symptoms according to the NYHA and CCS classification systems have been assessed from the electronic health records of the Tampere University Hospital.

For all participants, clinical chemistry variables including the maximum plasma total, LDL, and HDL cholesterol and triglyceride concentrations, and minimum CRP, leukocyte, and kreatinine concentrations, as well as hematocrit, hemoglobin, and platelet concentrations have been collected from the participants' electronic health records in the Tampere University Hospital. The variables have been measured as part of the participants' routine healthcare since with a Cobas Integra 700 automatic analyser with reagents and calibrators recommended by the manufacturer (Hoffmann-La Roche Ltd., Basel, Switzerland) and the Friedewald formula (LDL-C).

Physiological data on the extent of arterial stenosis, blood pressure, and ankle-brachial index is also collected via the health records, as well as treatment data on the history of percutaneous transluminal coronary angioplasty (PCTA) and coronary artery bypass grafting (CABG).

Arterial samples

For all patients selected to the vascular surgery series, endarterectomy specimens were collected. The coronary artery bypass patients were a source of lesion-free longitudinal internal thoracic artery (LITA) samples, and others were a source of atherosclerotic samples from the diseased artery.

The arterial tissue samples going for RNA measurements were stabilized with RNALater (Ambion Inc., Austin, TX, USA), and purified with the Trizol reagent (Invitrogen, Carlsbad, CA, USA) and the RNAeasy Kit with DNAase Set (Qiagen, Valencia, CA, USA). The RNA concentrations were checked with the BioPhotometer (Eppendorf, Wesseling-Berzdorf, Germany) and the samples were stored in \SI{-80} {\celsius}.

The RNA was quantified from the stored samples using the following protocol. First, RNA quantity in the samples were validated with the Nanodrop ND-1000 spectrophotometer (Nanodrop Technologies, Wilmington, DE, USA). Then for each RNA sample, 200 ng was reverse-transcribed with Illumina RNA Amplification kit (Ambion, Inc., Austin, TX, USA) (cat. no I1755) and cDNA-cRNA-transcribed in vitro for 14 hours with biotin-11-deoxy uridine triphosphate for labeling (PerkinElmer Life And Analytical Sciences, Inc., Boston, MA, USA). The RNA quantity was again validated with Nanodrop ND-1000 spectrophotometer (Nanodrop Technologies, Wilmington, DE, USA) and the RNA quality was checked with the Experion Automated Electrophoresis System and RNA StdSens Analysis Kit (BioRad Laboratories, Inc., Hercules, CA, USA). Then 1500 ng of cRNA was hybridized to the BeadChip for 18 hours in \SI{55}{\celsius} in the Sentrix Human-8 Expression BeadChip arrays (Illumina, San Diego, CA, USA). The cRNA was labelled with 1 µg/ml Cyanine-3-streptavidine (Amersham Biosciences, Pistacataway, NJ, USA) and finally the labelled cRNA was scanned with the Illumina BeadArray Reader (Illumina, San Diego, CA, USA).

Arterial samples going for the histological measurements were fixed, dissected, sliced, and stained following a standard clinical laboratory protocol and microscopically classified by a pathologist, following the AHA atherosclerotic lesion classification system [@pmid7648691; @pmid8181179; @pmid1728483]. Some histological samples have also undergone targeted immunohistochemical staining in previous studies to assess localization of molecules.

Whole blood and monocyte samples

Whole blood samples going for the gene expression analyses went through RNA isolation with PAXgene tubes (BD, Franklin lakes, NJ, USA) with PAXgene Blood RNA Kit with DNAase Set (Qiagen). The RNA quantity was checked with the BioPhotometer (Eppendorf, Wesseling-Berzdorf, Germany) and the samples were stored in \SI{-80} {\celsius}. Monocyte fractions were acquired by Ficoll-Paque density-gradient centrifugation (Amershan Pharmacia Biotech UK Limited, Buckinghamshire, England), the RNA was isolated from the fractions with the RNeasy Mini Kit (Qiagen), RNA quantities were checked with the BioPhotometer (Eppendorf, Wesseling-Berzdorf, Germany), and the samples were stored in \SI{-80} {\celsius}.

RNA quantification from the stored RNA samples was done using the same method as for the arterial RNA samples described above except that the Illumina HumanHT-12 v3 Expression BeadChip with Illumina iScan system (Illumina, San Diego, CA, USA) was used.

Plasma and serum samples

Lipid extraction for lipidomic measurement was done for all available plasma and serum samples according to the following protocol. First 10 µl of 10 mM 2,6-di-tert-butyl-4-methylphenol methanol, 20 µl of internal standards (Avanti Polar Lipids Inc., Alabaster, AL), and 300 µl of chloroform:methanol (2:1, v:v) (Sigma- Aldrich GmbH, Steinheim, Germany) was added to 10 µl of sample. The sample was mixed and sonicated in water for 10 min, incubated for 40 minutes, and centrifugated for 15 min in 5700 g. The upper phase was transferred and evaporated using nitrogen. The lipid phase was suspended again in 100 µl of water saturated butanol, sonicated in water for 5 minutes, added 100 µl of methanol, centrifuged for 5 min in 3500 g, and finally the supernatant was extracted for mass spectrometry. The extraction method has been described in more detail previously [@pmid27044508].

Lipids were measured from the prepared lipidomic samples with a hybrid triple quadrupole/linear ion trap mass spectrometer (QTRAP 5500, AB Sciex, Concord, Canada) equipped with an ultra-high-performance liquid chromatography (Nexera-X2, Shimadzu, Kyoto, Japan) on Acquity BEH C18, 2.1x50 mm id. 1.7 µm columns (Waters Corporation, Milford, MA, USA) and a scheduled multiple reaction monitoring algorithm. Data preprocessing was done with the Analyst and MultiQuant 3.0 software (AB Sciex). The measurement method has been described in more detail previously [@pmid29262533].

All available serum samples have been also assessed for lipoproteins and small metabolites as follows. The samples were first thawed from storage to \SI{+4} {\celsius} overnight, mixed, centrifuged in 3400 g, and 300 µl of sample was mixed with 300 µl of sodium phosphate buffer solution – all by a Gilson Liquid Handler. The prepared samples are then kept at \SI{+6} {\celsius} until being preheated to \SI{+37.5} {\celsius} for proton nuclear magnetic resonance spectroscopy with the Bruker AVANCE III spectrometer (Bruker BioSpin) with two configurations, the LIPO window for lipoproteins and the LMWM window for small metabolites. Finally, the metabolomic variables are estimated from the measured spectra with regression models, iterative lineshape fitting analysis, and the extended Friedewald method. The method has been described in more detail previously [@pmid19684899].

Bias

TODO

Study size

The size of the cohort hasn't been originally designed for omics but for hypothesis-driven analysis and the presently planned analyses have low statistical power on their own. This will be taken into account in the analysis and the interpretation of results as the need for meta-analysis and validation to achieve reasonable levels of certainty is clear in all omics research.

Results

Raw data files are securely stored, documented, and updated for follow up variables for new analysis.

Data import & cleaning

Methods

In this section, we prepare the collected raw data for analysis by applying the following operations on the archived raw data files:

We generate an RData-file cache of cleaned R dataset objects and load the cache file into R memory when needed. The cache will be re-generated when the import-functions are updated.

Results

Data import and cleaning steps have been abstracted out. Please see the source code in the R-directory of the project repository when needed.

# TODO lots
tvsproject::generate_cache(
  data_directory = here::here("raw-data", "data"),
  metadata_directory = here::here("raw-data", "metadata"),
  cache_directory = here::here("raw-data", "cache")
)

Load the latest cache into the global environment.

load(here("raw-data", "cache", "tvs-objects-2019-07-29.RData"))

Data analysis

Methods

We detect error, inform modelling decisions, and describe the study population and internal structure within datasets by first exploring univariate distributions and patterns with visualizations and statistical summarization and then by exploring multivariate patterns with four main methods, including topological overlap matrix (TOM) networks, agglomerative hierarchical clustering, principal component analysis (PCA), and uniform manifold approximation and projection (UMAP). We study the composition of potential hidden features and their relationships to other variables with simple regression models.

We use regression modelling techniques, including survival models, to study both the crude associations and adjusted causal effects between bloodomics and the outcomes. For causal inference, we minimize counfounding with plausible causal assumptions, target trial emulation, and a suitable g-method, such as inverse probability weighting or g-computation. We also study the predictive power of the bloodomics variables by using flexible blackbox learning methods, such as stacked emsembles.

Preprocessing methods, including the handling of outliers, missing values, scale, dispersion, and shape of distributions and feature engineering, are an inherent part of each analytical pipeline. The same applies to the post-processing methods – or how the observation- and model-level results such as features, parameters, and evaluation measures are further analyzed, interpreted, and presented. Pre- and post-processing steps are highly data-dependent and made explicit on a case by case basis.

As a general principle, when there is little reason to select a particular version of a pipeline apriori (e.g. hyperparameters), we run multiple versions as a sensitivity analysis, and interpret the multiplicity of results accordingly.

Results

Participants

The study contains a total of 290 patients; 134 (46 %) are from the vascular surgery series and 156 (54 %) from the exercise testing series.

Descriptive data

tabulate_descriptive <- function() {

  options(knitr.kable.NA = '')

  # TODO: compute directly from data – this is just a test
  "
  Variable,Summary,Missing
  Age,63 years; range 22-91,NA
  Sex,70 % male; 30 % female,NA
  Coronary artery disease,70 %,50 (17 %)
  Coronary heart disease,67 %,23 (8 %)
  Myocardial infarction,35 %,8 (3 %)
  Peripheral artery disease,16 %,4 (1 %)
  Asymptomatic,3 %,NA
  Idiopathic symptoms,1 %,NA
  Claudication,8 %,NA
  Acute limb ischaemia,0.3 %,NA
  Critical limb ischaemia,3 %,NA
  Cerebrovascular disease,22 %,4 (1 %)
  Asymptomatic,1 %, NA
  Idiopathic symptoms,3 %, NA
  Amaurosis fugax,5 %, NA 
  Transient ischaemic attack,6 %, NA
  Infarctus cerebri, 8 %, NA
  Cancer,6%, 2 (1 %)
  Cardiac insufficiency,19 %,2 (1 %)
  Dementia,1 %,7 (2 %)
  Diabetes,22%,2 (1 %)
  Type I,1 %,NA
  Type II,17 %,NA
  Hypercholesterolemia,71 %,5 (2 %)
  Hypertension,91 %,3 (1 %)
  Kidney disease,8 %,2 (1 %)
  Rheumatic disease,5 %,2 (1 %)
  Thyroid disease,7 %,2 (1 %)
  Alcohol consumption,66 %,61 (21 %)
  Rarely,50 %,NA
  Moderately,7 %,NA
  Unknown quantity,7 %,NA
  Smoking history,63 %,6 (2 %)
  Quit,36 %,NA
  Current,27 %,NA
  Smoking duration,median 28 years; range 1-60 years,NA
  Smoking amount,NA,56 (19 %)
  0 / day,45 %,NA
  1-4 / day,15 %,NA
  5-14 / day,11 %,NA
  15-25 / day,17 %,NA
  > 25 / day,3 %,NA
  " %>%
    read_csv() %>%
    kableExtra::kable(
      format = "html", 
      # caption = "Table 2. Baseline characteristics of the Tampere Vascular Study cohort.",
      align = "lcc"
    ) %>%
    kableExtra::kable_styling("condensed") %>%
    kableExtra::column_spec(1:3, width = 2) %>%
    kableExtra::add_indent(7:11) %>%
    kableExtra::add_indent(13:17) %>%
    kableExtra::add_indent(22:23) %>%
    kableExtra::add_indent(30:32) %>%
    kableExtra::add_indent(34:36) %>%
    kableExtra::add_indent(38:42)
}

tabulate_descriptive()

Histology

Vascular surgery and exercise stress testing series can be seen clearly in the histological data. (TODO: unstable probably has just pseudo missing values, fix them.)

plot_missing(histo)

A total of 290 unique patients, out of which 134 (46 %) are from the vascular series and 156 (54 %) from the exercise testing series.

histo %>%
  distinct(patient_id, .keep_all = TRUE) %>%
  mutate(has_sample_id = if_else(is.na(sample_id), "No", "Yes")) %>%
  group_by(has_sample_id) %>%
  summarise(n = n()) %>%
  mutate(prop = n / 290)
histo %>%
  select(-contains("_id")) %>%
  plot_variables(geom = "bar")

Treatments

plot_missing(treatments)
treatments %>%
    select(-patient_id) %>%
    summarise_variables()
treatments %>%
    select(-patient_id) %>%
    plot_variables(geom = "bar")

Diagnoses

plot_missing(diagnoses)
diagnoses %>%
    select(-patient_id) %>%
    summarise_variables()
diagnoses %>%
  select(-patient_id) %>%
  plot_variables(geom = "bar")

Symptoms

Symptom-data seem to contain too many genuine missing values to be useful in our study. This is probably because the measurement of these variables is retrospective via electronic health records.

plot_missing(symptoms)

Clinical chemistry

plot_missing(clinchem)
clinchem %>%
    select(-patient_id) %>%
    summarise_variables()
clinchem %>%
  select(-patient_id) %>%
  plot_variables(geom = "bar")

Exposures

plot_missing(exposures)
exposures %>%
    select(-patient_id) %>%
    summarise_variables()
exposures %>%
  select(-patient_id) %>%
  plot_variables(geom = "bar")

What is the statistical summary of smoking duration when restricted to patients with smoking history?

exposures %>%
  filter(q_smoking_dur != 0) %>%
  summarise(
    q_smoking_median = median(q_smoking_dur, na.rm = TRUE),
    q_smoking_min = min(q_smoking_dur, na.rm = TRUE),
    q_smoking_max = max(q_smoking_dur, na.rm = TRUE)
  ) %>%
  gather("measure", "value_years")

Macroscopic features

Stenosis information is systematically missing for ascending aorta samples, and only a few abdominal aorta samples have stenosis information.

plot_missing(macro)

The patients were 69 % male and a median of 63 (range 22-91) years old at the start of follow up.

macro %>%
    select(-patient_id) %>%
    summarise_variables()
macro %>%
    select(-patient_id) %>%
    plot_variables(geom = "bar")

Physiology

Ankle-brachial-index (ABI) is specific to atherosclerosis of the lower extremities so it has mostly missing values.

plot_missing(physiology)
physiology %>%
    select(-patient_id) %>%
    summarise_variables()
physiology %>%
  select(-patient_id) %>%
  plot_variables(geom = "bar")

Deaths

plot_missing(deaths_patients)
plot_missing(deaths_events)
deaths_patients %>%
  select(-patient_id, -follow_up_start, -death_date) %>%
  summarise_variables()
deaths_events %>%
  select(-patient_id, -date) %>%
  summarise_variables()
deaths_patients %>%
  select(-patient_id, -follow_up_start, -death_date) %>%
  plot_variables(geom = "bar")
deaths_events %>%
    mutate(date = case_when(
        date == "Not applicable" ~ NA_character_, 
        TRUE ~ date
    )) %>%
    mutate(date = ymd(date)) %>%
    filter(!(cause_of_death != "Not applicable" & event == "death_last_measurement")) %>%
    ggplot() +
        geom_point(aes(x = date, y = patient_id, color = event)) +
        geom_line(aes(x = date, y = patient_id, group = patient_id)) +
        theme_minimal() +
        theme(legend.position = "top", axis.text.y = element_blank()) +
        scale_x_date(position = "top")
deaths_patients %>%
    select(patient_id, death_status_last, follow_up_days) %>%
    # Kaplan-Meier estimate of cumulative hazard
    mutate(death_status_last = case_when(
        death_status_last == "Alive" ~ FALSE,
        is.na(death_status_last) ~ NA,
        # otherwise dead at last measurement date
        TRUE ~ TRUE
    )) %$% # makes variables accessible with bare variable names anywhere, like df$var_name
    survival::survfit(
        survival::Surv(time = follow_up_days, event = death_status_last) ~ 1
    ) %>%
    broom::tidy() %>%
    mutate(mortality = 1 - estimate) %>%
    # ----------------------------------
    ggplot() +
        geom_point(aes(y = mortality, x = time)) +
        scale_y_continuous(limits = c(0, 1)) +
        labs(
            x = "Follow up time (days)",
            y = "Cumulative proportion of deaths",
            caption = "Kaplan-Meier for right censoring"
        ) +
        theme_minimal()

RNAs

mrna_artery_genes %>%
    select(sample(colnames(.), size = 1000)) %>%
    plot_missing()

mrna_blood_genes %>%
    select(sample(colnames(.), size = 1000)) %>%
    plot_missing()

# mrna_mono_genes %>%
#     select(sample(colnames(.), size = 1000)) %>%
#     plot_missing()
mrna_genes %>%
    select(sample(colnames(.), size = 5000)) %>%
    summarise_missingness() %>%
    ggplot() +
        geom_histogram(aes(x = missing_proportion), bins = 20) +
        scale_x_continuous(limits = c(0, 1)) +
        theme_minimal() +
        labs(caption = "The count of missing values in a sample of 5000 probe variables")

The distribution of all probe values has two peaks and a long right tail.

mrna_all_probes %>%
    magrittr::extract( , -(1:3)) %>%
    # ugly hack via class converters, for speed
    as.matrix() %>%
    as.numeric() %>%
    tibble(all_probe_values = .) %>%
    ggplot() +
        geom_histogram(aes(x = all_probe_values), bins = 50) +
        theme_minimal() +
        scale_y_continuous(labels = scales::number_format())
mrna_all_genes %>%
    select(sample(colnames(.), size = 1000)) %>%
    plot_missing()

Lipoproteins

plot_missing(lipopro_samples)
# TODO: unit is "Particle class", not "Particle"
plot_missing(lipopro_particles)
lipopro_samples %>%
    select(-contains("_id")) %>%
    summarise_variables()
lipopro_particles %>%
    select(-contains("_id")) %>%
    summarise_variables()
lipopro_samples %>%
    select(-(patient_id:sample_type)) %>%
    map_dfr(function(x) x + 1e-9) %>%
    plot_variables(geom = "density") +
        scale_x_log10(name = "value + 1e-9") +
        theme(axis.text.x = element_text(size = 8))
lipopro_particles %>%
    select(-patient_id, -sample_id) %>%
    group_by(unit) %>%
    nest() %$%
    map(data, plot_variables, geom = "bar")

LMWM

plot_missing(lmwm)
lmwm %>%
    select(-contains("_id")) %>%
    summarise_variables()
lmwm %>%
    select(-sample_type, -contains("_id")) %>%
    map_dfr(function(x) x + 1) %>%
    plot_variables(geom = "density") +
        scale_x_log10(
            name = "value + 1", 
            labels = function(x) format(x, scientific = FALSE)
        ) +
        theme(axis.text.x = element_text(size = 8))

Lipids

A small proportion of the whole study population is systematically (by identifier) missing lipids data. Describe why.

plot_missing(lipids) + 
  theme(axis.text.y = element_blank())
lipids %>%
    gather("lipid", "lipid_quantity", -(patient_id:sample_type)) %>%
    summarise_variables()

The distribution of all lipid values is approximately symmetric on a logaritmic scale. The values distribute identically in plasma and serum samples.

lipids %>%
    gather("lipid", "lipid_quantity", -(patient_id:sample_type)) %>%
    filter(!is.na(lipid_quantity)) %>%
    mutate(log2_lipid_quantity = log(lipid_quantity, base = 2)) %>%
    select(log2_lipid_quantity, sample_type) %>%
    plot_variables(geom = "bar")

We also explore sums of lipids in their structural classes (as defined in the LIPIDMAPS database).

plot_missing(lipids_classes)

Cholesterol esters are the most abundant class in the data, followed by glycerophospholipids and -cholines, and sphingolipids.

lipids_classes %>%
    select(-contains("_id")) %>%
    summarise_variables()
lipids_classes %>%
    select(-(patient_id:sample_type)) %>%
    mutate_all(~ . + 1) %>%
    plot_variables(geom = "density") +
        scale_x_log10(name = "value + 1", labels = scales::number_format())

The twenty most abundant individual lipids are consequently cholesterol esters and phosphocholines. The quantity of different lipids species spans multiple orders of magnitude. This is also the case within some structural classes.

lipids %>%
    select(-contains("_id")) %>%
    summarise_variables()
lipids %>%
    select(-(patient_id:sample_type)) %>%
    plot_variables(geom = "density") +
        scale_x_log10(
            position = "top",
            labels = scales::scientific_format(),
            breaks = c(1e-4, 1e-2, 1e0, 1e2, 1e4)
        )

We visualize a few random pairs of lipids to get some idea of the raw lipid data space.

lipids %>%
    select(sample(colnames(.), size = 7)) %>%
    GGally::ggpairs(
      lower = list(continuous = GGally::wrap("points", alpha = 0.25)), 
      progress = FALSE
    ) +
    theme_void()

Start by computing the singular value decomposition (PCA) defined above (learn_pca_svd).

lipids_pca_results <- learn_pca_svd(lipids, id_variable_names = c("patient_id", "sample_id", "sample_type"))

Then we proceed to postprocessing by visualizing samples in the output feature (principal component) space.

lipids_pca_results %>%
  pluck("observation_level") %$%
  qplot(pc1, pc2, geom = "point") +
    theme_minimal()

Then we visualize the lipids in the loading (coefficient, weight) space.

lipids_pca_results %>%
  pluck("model_level") %>%
  filter(str_detect(feature, "loading")) %>%
  mutate(value = as.numeric(value)) %>%
  spread(feature, value) %$%
  qplot(x = pc1_loading, y = pc2_loading, geom = "point") +
    theme_minimal()

Statistics of the PCA model fit can be seen below.

lipids_pca_results %>%
  pluck("model_level") %>%
  filter(unit %in% c("pc1", "pc2"))

We can conclude that there is no simple linear structure in the lipids data space and no clear clustering can be seen in this projection. A few outliers exist and may be worth a closer look.

lipids_umap_results <- learn_umap(
  lipids, 
  id_variable_names = c("patient_id", "sample_id", "sample_type"),
  # focus more on the local structures
  n_neighbors = c(2, 4, 8, 16, 32, 64)
)

Again for postprocessing, we visualize the lipid samples in the UMAP output feature spaces.

lipids_umap_results %>%
  pluck("observation_level") %>%
  ggplot() +
    geom_point(aes(x = umap1, y = umap2), alpha = 0.5) +
    facet_wrap(vars(n_neighbors), scales = "free", labeller = "label_both") +
    theme_minimal()

At the most local levels, we see a few interesting clusters but noise is a possible complicating factor. Otherwise UMAP learned a uniform structure.

Finally, we compute TOM lipid networks.

lipids_tom_results <- learn_tom(
  lipids, 
  id_variable_names = c("patient_id", "sample_id", "sample_type"), 
  power = seq(1, 8, by = 1)
)
lipids_tom_results %>%
  pluck("model_level") %>%
  ggplot() +
    geom_histogram(aes(x = tom_similarity), bins = 30) +
    scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
    facet_wrap(vars(power), labeller = "label_both", ncol = 2) +
    theme_minimal() +
    theme(panel.grid.minor.y = element_blank())

We visualize the networks by assigning the weakest of the edge weights (TOM similarities) to zero and mapping the rest to the transparency property. A network layout algorithm computes the node positions based on the edge weights (TOM similarities). Unfortunately, this visualization is computational expensive and can only be used to explore just a few of the strongest connections.

We see from the TOM similarity distributions above that only a tiny proportion of similarities survive the correlation power shrinkage so that the similarity stays larger than about 0.1.

lipids_tom_results %>%
    pluck("model_level") %>%
    as_tbl_graph(directed = FALSE) %>%
    activate(edges) %>%
    filter(tom_similarity > 0.10 & power == 8) %>%
    ggraph(layout = "with_fr") +
        geom_edge_link(aes(alpha = tom_similarity)) +
        geom_node_point(alpha = 0.3) +
        # scales
        scale_edge_alpha_continuous(range = c(0.10, 0.01), guide = FALSE) +
        # theme
        theme_graph() +
        theme(legend.position = "top") +
        labs(subtitle = "power = 8")

# ggsave(
#   filename = "lipids-tom-network-power-8.pdf",
#   plot = last_plot(),
#   path = here::here("analysis", "images"),
#   width = 4,
#   height = 4,
#   dpi = 300
# )
lipids_tom_results %>%
    pluck("model_level") %>%
    as_tbl_graph(directed = FALSE) %>%
    activate(edges) %>%
    filter(tom_similarity > 0.10 & power == 6) %>%
    ggraph(layout = "with_fr") +
        geom_edge_link(aes(alpha = tom_similarity)) +
        geom_node_point(alpha = 0.3) +
        # scales
        scale_edge_alpha_continuous(range = c(0.10, 0.01), guide = FALSE) +
        # theme
        theme_graph() +
        theme(legend.position = "top") +
        labs(subtitle = "power = 6")

We continue TOM network postprocessing by looking at the potential clusters from another perspective by clustering the lipids TOM matrix agglomeratively and visualizing the clusters as a dendrogram.

lipids_tom_clusters <- lipids_tom_results %>%
  pluck("model", 8) %>%
  learn_cluster_hierarchy()
lipids_tom_clusters %>%
  pluck("model") %>%
  plot(main = "Lipid TOM cluster hierarchy (power = 8)", which.plot = 2, labels = FALSE)

References



eteppo/tvs-project documentation built on Aug. 13, 2019, 8:53 a.m.