Prior to any analysis, the clinical data and the raw expression data have to be cleaned: sample names have to be consistent, spaces removed, and every sample needs correct metadata. Below is a rough description of the different steps; a complete overview, with code to go from the Excel source to the cleaned data, can be found via these links:
View the code for the preprocessing of the metadata of the plasma and the serum.
View the code for the preprocessing of the experimental data of the plasma and the serum.
The metadata for the separate subgroups/diseases was combined into a single metadata dataframe per tissue. Column names were cleaned and made consistent between the different subgroups to allow for easier handling in R. Further cleaning of the dates was performed in R using the lubridate package and annotations such as age, time in freezer, BVAS score and site of sample origin were added as well.
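The combining and cleaning steps described above could look roughly like the sketch below. It is not the project's actual code: the column names, the sample values and the `analysis_date` are hypothetical, and base R date handling is used here as a stand-in for the lubridate functions so the example runs without extra packages.

```r
# Hypothetical raw metadata for two subgroups with inconsistent column names.
gpa <- data.frame(
  "Sample ID"     = c("gpa_1", "gpa_2"),
  "Birth Date"    = c("1950-03-14", "1962-11-02"),
  "Sampling Date" = c("2012-05-01", "2013-09-20"),
  check.names = FALSE, stringsAsFactors = FALSE
)
sle <- data.frame(
  "sample.id"     = "sle_1",
  "birth_date"    = "1980-01-30",
  "sampling date" = "2015-02-10",
  check.names = FALSE, stringsAsFactors = FALSE
)

# Lower-case the names and replace runs of non-alphanumerics by one underscore.
clean_names <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_|_$", "", x)
}
names(gpa) <- clean_names(names(gpa))
names(sle) <- clean_names(names(sle))

# With identical column names the subgroups can be stacked into one data frame.
meta <- rbind(gpa, sle)

# Parse the dates and derive annotations (in years).
meta$birth_date    <- as.Date(meta$birth_date)
meta$sampling_date <- as.Date(meta$sampling_date)
analysis_date      <- as.Date("2020-01-01")  # hypothetical measurement date

days_between <- function(from, to) as.numeric(difftime(to, from, units = "days"))
meta$age_at_sampling  <- days_between(meta$birth_date, meta$sampling_date) / 365.25
meta$years_in_freezer <- days_between(meta$sampling_date, analysis_date) / 365.25
```

Normalizing the column names before stacking is what makes `rbind()` possible here, since it matches data frame columns by name.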
Samples without Olink expression data were removed from the metadata of plasma ("lun1065") and serum ("lin113", "sle_121/a" and "sle_156/a").
The resulting metadata for plasma and serum are saved as Results/Plasma/plasma_metadata.csv and Results/Serum/serum_metadata.csv, respectively.
The sample names in the expression data were cleaned and modified to correspond to the metadata. Three samples that were present in the expression data but had no metadata were removed ("lun1059", "iga2156" and "umu78").
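Matching the metadata and the expression data in both directions comes down to a few set operations on the sample identifiers. The sketch below is an illustration under assumed names: the `metadata` and `npx` objects, their layout (samples as rows of the expression matrix) and the sample IDs are hypothetical.

```r
set.seed(1)

# Hypothetical metadata and NPX matrix (samples in rows, proteins in columns).
metadata <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  group     = c("GPA", "GPA", "HC", "SLE"),
  stringsAsFactors = FALSE
)
npx <- matrix(
  rnorm(4 * 3), nrow = 4,
  dimnames = list(c("s1", "s2", "s4", "s5"), c("IL6", "TNF", "OPG"))
)

# Samples present on one side only.
no_expression <- setdiff(metadata$sample_id, rownames(npx))  # metadata only
no_metadata   <- setdiff(rownames(npx), metadata$sample_id)  # expression only

# Keep the shared samples and put both tables in the same order.
shared   <- intersect(metadata$sample_id, rownames(npx))
metadata <- metadata[match(shared, metadata$sample_id), , drop = FALSE]
npx      <- npx[shared, , drop = FALSE]
```

Reporting `no_expression` and `no_metadata` before subsetting keeps a record of which samples were dropped and why.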
Four additional plasma proteins were measured on Luminex, and their log2-normalized concentrations were added to the plasma Olink data. Merging different types of data can introduce bias, but since the Luminex data are on a log2 scale similar to the Olink NPX data and the univariate analysis is performed per protein, this is considered acceptable. As a sanity check, a PCA of the centered and scaled expression levels of all plasma proteins (see Figure \@ref(fig:sourcePca)) shows that the proteins measured by Luminex generally group with the Olink proteins.
knitr::include_graphics(c("Results/Plasma/QC/plasma_olink_lum_pca.pdf"))
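A sanity check of this kind can be sketched as follows: each protein is centered and scaled across samples, and the PCA is then run with proteins as observations so that every point in the score plot is a protein. The data are simulated and all object names are hypothetical; the exact settings used for the figure above are not stated in this report.

```r
set.seed(1)
n_samples <- 40

# Simulated Olink NPX and Luminex log2 values (samples in rows).
olink <- matrix(
  rnorm(n_samples * 20, mean = 5, sd = 1.5), nrow = n_samples,
  dimnames = list(paste0("s", seq_len(n_samples)), paste0("olink_", 1:20))
)
luminex <- matrix(
  rnorm(n_samples * 4, mean = 8, sd = 2), nrow = n_samples,
  dimnames = list(rownames(olink), paste0("lum_", 1:4))
)

# Add the Luminex proteins as extra columns, aligned on sample ID.
combined <- cbind(olink, luminex[rownames(olink), , drop = FALSE])

# Center and scale each protein so platform-specific offsets are removed.
scaled <- scale(combined, center = TRUE, scale = TRUE)

# Transpose so that proteins are the observations of the PCA.
pca <- prcomp(t(scaled))

scores <- data.frame(
  pca$x[, 1:2],
  platform = ifelse(grepl("^lum_", rownames(pca$x)), "Luminex", "Olink")
)
```

Coloring `scores` by `platform` in a PC1 vs PC2 scatter plot then shows whether the Luminex proteins separate from the Olink proteins.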
In addition, all data and metadata can also be accessed via R:
# install.packages("devtools")
devtools::install_github("vincent-van-hoef/Analysis5204")
library("Analysis5204")
data(list = c("plasma_metadata", "plasma_npx", "serum_metadata", "serum_npx", "gsea_sets", "lum_dat"), package = "Analysis5204")
One additional sample ("vaska637") is filtered out beforehand due to technical concerns (failed sample according to the Olink QC report).
Sample "umu63" is filtered out because of a GPA label of "2".
Three proteins (MCP-1, OPG and uPA) are measured in both the Cardiovascular and the Inflammation panel. The expression values correlate well between the panels so only the Inflammation data was kept for these proteins.
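The handling of proteins measured on both panels can be illustrated with the sketch below. The data are simulated, the object names are hypothetical, and Pearson correlation is an assumption; the report does not state which correlation measure was used.

```r
set.seed(2)
n <- 30
dup <- c("MCP-1", "OPG", "uPA")

# Simulated shared signal plus independent measurement noise per panel.
signal <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, dup))
inflammation   <- signal + rnorm(n * 3, sd = 0.2)
cardiovascular <- cbind(signal + rnorm(n * 3, sd = 0.2), CV_only = rnorm(n))

# Per-protein correlation between the two panels.
panel_cor <- sapply(dup, function(p) {
  cor(inflammation[, p], cardiovascular[, p], method = "pearson")
})

# Keep the Inflammation measurement for duplicated proteins by dropping
# those columns from the Cardiovascular panel before merging.
keep_cv <- setdiff(colnames(cardiovascular), colnames(inflammation))
merged  <- cbind(inflammation, cardiovascular[, keep_cv, drop = FALSE])
```

Inspecting `panel_cor` before dropping one copy confirms that the two measurements agree well enough for the choice of panel not to matter much.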
:::fyi
The results shown in this section are for the plasma samples. The QC of the serum samples is available in the folder /Serum/QC/
:::
Figure \@ref(fig:plDistrBox) gives a general overview of the data distribution per plasma sample and per panel, with samples ordered by their median value on that panel. Each plot shows one boxplot per sample; the boxplots can be colored by any feature, but by default they are colored by whether the sample passed technical QC (the “QC_Warning” status as defined by Olink). The things to look for in these plots are odd distributions, deviant medians and patterns among the outlier assays. Overall, the distributions are fairly similar and the medians are comparable. Two samples in the INFLAMMATION panel are flagged as having failed the technical QC; whether to remove them from the dataset is examined further below.
knitr::include_graphics(c("Results/Plasma/QC/plasma_CVIII_dist.pdf", "Results/Plasma/QC/plasma_INF_dist.pdf"))
The distribution plot can become difficult to read when there are many samples or many panels. Plotting the interquartile range (IQR) against the median gives an alternative view of the data distributions (Figure \@ref(fig:plIQR)). Most samples cluster together, while some have a slightly more extreme IQR. Even though two samples are marked with QC warnings in the boxplots, their data distributions (and PCA plot, data not shown) look normal. According to the Olink data analysis tutorial, such samples can be kept, so they were retained for downstream analysis.
knitr::include_graphics(c("Results/Plasma/QC/plasma_qc.pdf"))
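The per-sample summary behind an IQR vs median plot is simple to compute. The sketch below uses simulated NPX values for a single panel; the object names and the 3 SD flagging threshold are assumptions made for illustration, not settings taken from this analysis.

```r
set.seed(3)

# Simulated NPX values for one panel (samples in rows, assays in columns).
npx <- matrix(
  rnorm(50 * 92, mean = 6, sd = 2), nrow = 50,
  dimnames = list(paste0("s", 1:50), paste0("p", 1:92))
)

# One median and one IQR per sample, computed across all assays.
qc <- data.frame(
  sample_id = rownames(npx),
  median    = apply(npx, 1, median, na.rm = TRUE),
  iqr       = apply(npx, 1, IQR, na.rm = TRUE)
)

# Flag samples far from the bulk on either axis (hypothetical 3 SD cutoff).
is_extreme <- function(x, k = 3) abs(x - mean(x)) > k * sd(x)
qc$flag <- is_extreme(qc$median) | is_extreme(qc$iqr)
```

A scatter plot of `qc$iqr` against `qc$median`, with flagged samples highlighted, gives a compact view of all samples even when there are too many for readable boxplots.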
Using Principal Component Analysis (PCA), the structure in the data was visualized by projecting the samples onto a two-dimensional graph using the first two principal components (PCs). These components are linear combinations of the proteins that explain the most variation between the samples. These kinds of plots can be used to examine the samples for outliers, sample swaps, batch effects and other relationships. When normalization has successfully removed technical or batch artefacts, the relative distances between samples should be biologically interpretable. A plot of PC1 vs PC2 using the 100 most variable proteins, colored by biological/experimental group, is shown in Figure \@ref(fig:plPCA). Each point corresponds to one sample. The biological groups do seem to differ in this plot, especially along PC1. The healthy control samples also cluster more tightly than the active or other control samples, suggesting they are more homogeneous. From this PCA plot one can infer that differences exist between the groups and that there are no apparent major outliers among the samples.
knitr::include_graphics(c("Results/Plasma/QC/disease_pca_100.pdf"))
In addition, the QC folder contains a PCA plot of PC1 vs PC3, a scree plot showing the explained variation per PC and a barplot showing the top and bottom proteins contributing to the variation on PC1.
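The quantities behind these PCA plots (sample scores, explained variation per PC and the PC1 loadings) can be obtained along the lines of the sketch below. The data are simulated and the group labels and object names are hypothetical; scaling the proteins to unit variance is an assumption, as the report does not state whether this was done.

```r
set.seed(4)

# Simulated NPX matrix (samples in rows) with three hypothetical groups.
groups <- rep(c("active", "healthy_control", "other_control"), each = 20)
npx <- matrix(
  rnorm(length(groups) * 150), nrow = length(groups),
  dimnames = list(paste0("s", seq_along(groups)), paste0("p", 1:150))
)
# Shift the first 30 proteins in one group to mimic a biological difference.
npx[groups == "active", 1:30] <- npx[groups == "active", 1:30] + 2

# Select the 100 proteins with the highest variance across samples.
protein_var <- apply(npx, 2, var)
top <- names(sort(protein_var, decreasing = TRUE))[1:100]

# PCA with samples as observations.
pca <- prcomp(npx[, top], center = TRUE, scale. = TRUE)

# Scores for PC1 vs PC2 and PC1 vs PC3 plots, labeled by group.
scores <- data.frame(pca$x[, 1:3], group = groups)

# Fraction of variance explained per PC (the scree plot values).
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Proteins with the largest positive and negative loadings on PC1.
loadings_pc1 <- sort(pca$rotation[, "PC1"], decreasing = TRUE)
top_bottom   <- c(head(loadings_pc1, 5), tail(loadings_pc1, 5))
```

Restricting the PCA to the most variable proteins reduces the influence of assays that are essentially constant across samples, which otherwise contribute mostly noise once scaled.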