```{=latex} \definecolor{codegray}{HTML}{efefef} \let\textttOrig\texttt \renewcommand{\texttt}[1]{\textttOrig{\colorbox{codegray}{#1}}}
<p style="text-align:center;"><img src="https://user-images.githubusercontent.com/43075653/166063689-45b4e34e-7129-42c3-b109-3a34afdc24e2.png" class="center" width="433" height="178"/></p> # Introduction The aim of this `VisomX`'s proteomics section is to provide an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. This section is build on the foundation of the R package `DEP` [@zhang2018]. `VisomX` was designed to be usable with minimal prior knowledge of the R programming language or programming in general. You will need to be familiar with the idea of running commands from a console or writing basic scripts. For R beginners, [this](https://moderndive.netlify.app/1-getting-started.html) is a great starting point, there are some good resources [here](https://education.rstudio.com/learn/beginner/) and we suggest using the [RStudio application](https://posit.co/products/open-source/rstudio/). It provides an environment for writing and running R code. With consideration for `R` novices, `VisomX` establishes a framework in which a complete, detailed proteomics data analysis can be performed in two simple steps: 1. *Read data* from a tabular input with protein abundances, as generated by quantitative analysis software of raw mass spectrometry data, and apply several filters. 2. *Run workflow*, including missing value imputation, normalization, log2 transformation, statistical analysis, differential expression analysis, pathway enrichment analysis, and rendering of a report that summarizes the results. All computational results of a workflow are stored in a data container (list) and can be visualized by passing them to a set of dedicated plotting functions. \pagebreak # Installation Install the most current version with package `devtools`: ``` r install.packages("BiocManager") BiocManager::install("NicWir/VisomX")
Load the package:
# Load the VisomX package library(VisomX)
# Load the VisomX package suppressPackageStartupMessages(library(VisomX))
The proteomics test dataset was generated by culturing Pseudomonas putida KT2440 in 500-mL shaken flasks filled with 50 mL media at 30°C and 250 rpm. The media were either synthetic de Bont medium supplemented with 20 mM glucose or 30 mM acetate, or LB medium. Samples were taken in the mid-exponential phase, and cells were harvested by centrifugation at 17,000 x g for 2 min at 4°C. After removal of the supernatant, cell pellets were frozen and kept at −80°C until proteomics analysis was performed as described previously [@rennig2019]. No normalization of protein abundances was performed prior to this analysis.
The path to the example dataset can be extracted with:
system.file("prot_KT2440_glc_ac_LB.txt", package = "VisomX")
VisomX
requires a tabular input, either in the form of an R dataframe,
or stored in an XLS, XLSX, CSV, TSV, or TXT file. Three types
of columns are required:
A column containing protein identifiers. These can be Uniprot identifiers, Ensembl Gene IDs, Entrez Gene IDs, etc.
A column containing names. Gene Symbols are commonly used for this purpose.
Columns containing protein abundances of the different samples in the experiment. These columns should have a common prefix (e.g., abundance.) used for their identification within the table. Furthermore, replicates are indicated by an underscore followed by a number. For example, abundance.KT2440_Glc_1, abundance.KT2440_Glc_2, abundance.KT2440_Glc_3, and abundance.KT2440_Glc_4 indicate four replicates for the condition KT2440_Glc.
Optional: Columns containing + and - indicating contaminant proteins and decoy database hits that should be filtered out.
The function prot.read_data()
accepts either an R dataframe or the
path to a table file with extension '.xlsx', '.xls', '.csv', '.tsv', or
'.txt'. It creates a SummarizedExperiment
object and performs feature
filtering based on indicated contaminants or unspecific identification,
a defined relative standard deviation threshold, or the presence of
missing values. For details about this function, run help("prot.read_data", "VisomX")
. For instructions on how to work with SummarizedExperiment
objects, see the Bioconductor manual
Here, we filter proteins that are not present in at least 6 out of 9
samples.
# Read data, incl. filtering data <- prot.read_data( data = system.file("prot_KT2440_glc_ac_LB.txt", package = "VisomX"), pfx = "abundance.", id = "Ensembl Gene ID", name = "Gene Symbol", filt_type = "fraction", filt_min = 0.66)
The wrapper function prot.workflow()
runs a complete analysis workflow
and exports the results as tabular TXT files as well as (optionally) a
report in PDF and HTML format. The following settings usually need
closer consideration:
We use the "SampMin" method for missing value imputation
(imp_fun
), which has been shown to be among the most reliable
methods for both missing-not-at-random (MNAR) and
missing-completely-at-random types of missing values [@liu2021].
For the types of comparisons to perform (type
), we choose "all"
to cover all possible contrasts. Alternatively, one can compare
every condition with a single control condition by choosing
type = 'control'
and defining the control condition with
control
. If only very specific contrasts are of interest,
type = 'manual'
can be chosen and the desired contrasts can be
defined as a vector of strings in the form
contrast = c('ConditionA_vs_ConditionB', ConditionC_vs_ConditionD')
.
The _vs_
substring is hereby essential!
We set the significance threshold for adjusted p values (alpha
) at
0.05. VisomX
tests for differential expression of proteins based
on protein-wise linear models and\
empirical Bayes statistics using limma
. False Discovery Rates are
estimated using fdrtool
.
We define 1 as the relevance threshold for log2 fold changes
(lfc
), corresponding to fold changes of 2 and 0.5, to be
considered significant.
We choose pathway_enrichment = TRUE
to perform pathway enrichment
analysis. We test for enriched KEGG pathways by setting
pathway_kegg = TRUE
and defining a kegg_organism
. We also test
for enriched pathways extracted from the BioCyc database as a
tab-delimited TXT file included in VisomX
. To test against a set
of pathways other than those listed in KEGG, we provide a dataframe
object with columns 'Pathway' and 'Accession'. Important:
The names or IDs used for proteins/genes listed under 'Accession'
need to correspond to the IDs initially used to create the
SummarizedExperiment with prot.read_data()
.
Lastly, we choose whether a report shall be generated and exported into a subfolder of the current working directory (omitted because of the long run time associated).
# Read TXT file with BioCyc pathways for P. putida KT2440 custom_df <- read.table( system.file("BioCyc_pathways_KT2440.txt", package = "VisomX"), sep = "\t", header = T, stringsAsFactors = F, fill = T, na.strings = "", quote = "", comment.char = "", check.names = F ) # Run workflow and export results results <- prot.workflow(se = data, imp_fun = "SampMin", type = "all", alpha = 0.05, lfc = 1, pathway_enrichment = TRUE, pathway_kegg = TRUE, kegg_organism = "ppu", custom_pathways = custom_df, volcano.add_names = FALSE, # show protein labels in volcano report = FALSE, out.dir = tempdir() )
If further customization of certain computational steps are required,
all steps included within prot.workflow()
can also be performed
manually. The following steps are required to end up with the same
results list object as obtained with the workflow function:
# 1. Perform data normalization: prot_norm <- prot.normalize_vsn(se) # 2. Missing value imputation # a. with imp_fun == "MinProb": prot_imp <- prot.impute(prot_norm, fun = imp_fun, q = q) # b. with imp_fun == "knn": prot_imp <- prot.impute(prot_norm, fun = imp_fun, rowmax = knn.rowmax) # c. for all other methods: prot_imp <- prot.impute(prot_norm, fun = imp_fun) # 3. Principal component analysis prot_pca <- prot.pca(SummarizedExperiment::assay(prot_imp)) # 4. Test for differential expression by empirical Bayes moderation of a linear model and defined contrasts: prot_diff <- prot.test_diff(prot_imp, type = type, control = control, test = contrast) # 5. Denote significantly differential proteins: prot_dep <- prot.add_rejections(prot_diff, alpha = alpha, lfc = lfc) # 6. Generate a results table: prot_res <- prot.get_results(prot_dep) # 7. Pathway enrichment analysis # a. Get vector of tested contrasts: contrasts <- SummarizedExperiment::rowData(prot_dep) %>% data.frame(check.names = FALSE) %>% select(ends_with("_diff")) %>% colnames() %>% str_replace_all("_diff", "") # b. Perform pathway enrichment analysis res.pathway <- enrich_pathways(prot_dep, contrasts, alpha_pathways = alpha_pathways, pathway_kegg = pathway_kegg, kegg_organism = kegg_organism, custom_pathways = custom_pathways) # 8. Assemble results into list results <- list(data = SummarizedExperiment::rowData(se), se = se, norm = prot_norm, imputed = prot_imp, pca = prot_pca, diff = prot_diff, dep = prot_dep, results = results, param = param)
Most plots available in VisomX
are generated automatically within the
report. Additionally, the figures are automatically exported as PNG and
PDF files if argument export = TRUE
when running prot.workflow()
.
To customize specific plot or generate them post-hoc, this section provides an overview of available plotting functions.
Protein identifications per sample:
pg_width = ncol(data) / 2 if (pg_width > 10) { pg_width = 10 } suppress_ggrepel <- function(w) { if (any(grepl("ggrepel", w))) invokeRestart("muffleWarning") }
prot.plot_numbers(data, plot = T, export = F)
Gene coverage in all samples:
plot_coverage(data)
The purpose of the following graphs is to determine if there is a systematic trend in the standard deviation of the data as a function of overall expression. These graphs are based on the assumption that most proteins are not differentially expressed, so the running median is a reasonable estimate of the standard deviation of the data at the feature level, as a function of the mean. After vsn normalization, the running median should be approximately a horizontal line. While a completely flat curve of the square root of variance over the mean may seem like the goal of such transformations, this may be unreasonable in the case of datasets with many true differences due to the experimental conditions. There may be random fluctuations, but there should not be a general trend. If this is not the case, it usually indicates a data quality problem or is the result of inadequate data preprocessing.
SD-Rank(mean) plot before normalization:
meanSdPlot(data, ranks = TRUE, plot = T, xlab = "Rank(mean)", ylab = "SD")
SD-Rank(mean) plot after normalization:
meanSdPlot(results$norm, plot = T, xlab = "Rank(mean)", ylab = "SD")
norm_height = ncol(data) / 2.2 if (norm_height > 9.5) {norm_height = 9.5}
Box plots of the distributions before and after normalization:
prot.boxplot_intensity(data, results$norm, plot = T, export = F)
Missing values in the data set must be imputed. The downstream analysis can be strongly influenced by the imputation. Therefore, the selection of an appropriate method is crucial. The data can be missing at random (MAR), for example if proteins are quantified in some replicates but not in others, without any bias towards low intensities. This class of missing values can arise when the peptide sequence is mapped incorrectly or software erroneously assigns shared peptides to precursors leading to misidentification in some samples and missing values in others.\ MNAR may result from experimental effects such as (1) enzyme miscleavages, (2) true presence/absence in the biological samples and (3) instrumentation effects (when peptide measurements are low in abundance compared to background noise or constitute low ionization efficiency). Because values are missing because of low abundant nature of the respective proteins, this category of missing values is considered left-censored, i.e., the distribution of values (if present in the data) would fall on the left tail of the total observations in the dataset. A detailed treatise about missing values in the context of proteomics data can be found here: Missing value handling
To asses the type of missing data (random or not), the following heat map of missing values and density distributions of proteins with/without missing values provide a visual aid.\ If data is randomly missing, use the k-nearest neighbor ("knn") or maximum likelihood ("MLE") options for imputation. If the missing data is biased to certain samples (e.g. controls) which are expected to be depleted of certain proteins and/or if there is a clear bias of missing values towards low abundances, use the "QRILC", "MinProb", "SampMin", or "man" options for imputation. For a more detailed description of different imputation methods, see the MSnbase vignette and in particular the impute function description. The "SampMin" method was proposed by Liu and Dongre (2020)) and shown to outperform other common methods for MNAR, left-censored datasets.
prot.plot_missval(results$norm, fontsize = 12, export = F)
To check whether missing values are biased toward low-intensity proteins (i.e., 'left-censored'), densities and cumulative proportions are plotted for proteins with and without missing values.
prot.plot_detect(results$norm, basesize = 10, plot = T, export = F)
The effects of data normalization and imputation on the distributions are visualized via density distributions:
prot.plot_imputation( data, results$norm, results$imp, basesize = 13, plot = T, export = F )
To get a high-level overview of the data, the unsupervised method principal component analysis (PCA) reduces the data dimensionality (i.e., number of proteins included in the analysis) while retaining most of the data information. PCA is an unsupervised method for reducing the dimensionality of high-dimensional data. The goal of PCA is to find a new set of variables, called "principal components," that capture most of the data information while reducing the number of dimensions. This is achieved by finding a new coordinate system that aligns with the directions of maximum variance in the data. The first principal component captures the most variation in the data, the second component captures the second-most variation, and so on. By retaining only the first few principal components, the data can be reduced to a lower-dimensional representation while retaining most of the important information. This can be useful for visualizing the data, reducing noise, or improving the performance of machine learning algorithms.
A scree plot in principal component analysis (PCA) is a graphical representation of the eigenvalues of the principal components. It helps to determine the number of significant components in a PCA analysis by plotting the eigenvalues against the component number. The "scree" refers to the downward-sloping "elbow" shape that often appears in these plots, which is used as a visual aid for determining the number of important components. The idea is to keep components with large eigenvalues and discard the others, as they are likely to be noise or have little contribution to the data variance.
prot.plot_screeplot( results$pca, axisLabSize = 16, titleLabSize = 17, plot = T, export = F, components = PCAtools::getComponents(pca)[1:ncomp] )
From the Scree Plot, we see that the first two PCs explain about two-thirds of the observed variation in the dataset. Further components provide little explanatory value.
A 2D PCA plot is a visualization of a dataset after undergoing principal component analysis (PCA) and reducing the data to two dimensions. In a 2D PCA plot, each data point is represented as a dot on a 2D graph, with the x-axis and y-axis representing the first and second principal components, respectively. These new coordinates, or principal components (PCs), are not real variables generated by the system and therefore, applying PCA to a dataset loses its interpretability.
However, despite the loss of interpretability, PCA can be very useful in identifying batch effects and assessing which original samples are similar and different from each other. By plotting the data in the 2D PCA space, it can be easier to visualize and identify clusters of similar samples, and distinguish samples that are different from each other. This information can then be used to better understand the underlying structure of the data, and improve the accuracy of downstream analyses and applications.
The function prot.plot_pca
internally performs PCA analysis and
requires a SummarizedExperiment object. Thus, the imp
rather than the
pca
object is used as input.
prot.plot_pca( results$imp, x = 1, y = 2, point_size = 4, basesize = 12, plot = T, export = F, title = "PC Scores - PC2 vs. PC1" )
prot.plot_pca( results$imp, x = 1, y = 3, point_size = 4, basesize = 12, plot = T, export = F, title = "PC Scores - PC3 vs. PC1" )
We can see that the three conditions cluster well in PC1, less in PC2, and not at all in PC3 which explains only 9.5% variation within the dataset.
PCA loadings are the coefficients of the linear combination of the original variables (proteins) from which the principal components (PCs) are generated. They describe how much each gene contributes to a particular principal component. Large loadings (positive or negative) indicate that a particular variable has a strong relationship to a particular principal component. The sign of a loading indicates whether a variable and a principal component are positively or negatively correlated.
The prot.plot_loadings()
function accept a pca
object generated with
prot.pca()
.
prot.plot_loadings(results$pca,labSize = 3, plot = T, export = F)
High loadings on PCs that explain the observed variations to a high degree are a direct indicator that the respective proteins play an important role in distinguishing the tested conditions. In PC1, the LB group clustered at a great distance to the Glc and Ac groups. The high loading of BkdA1 on PC1 suggests that differences in the abundance of this protein are a significant factor distinguishing the conditions. This makes sense, since it is a subunit of the branched-chain alpha-keto acid dehydrogenase complex, involved in the degradation of branched chain amino acid, an abundand carbon source in LB medium that is absent in the other conditions.
A correlation matrix is a type of data matrix that shows the pairwise relationships between variables in a dataset. In quality control, a correlation matrix can be used to assess the quality of the data by examining the relationships between samples. By plotting the samples in a correlation matrix, it is possible to identify groups of samples that are highly similar to each other, and to differentiate samples that are not similar. This can be useful for identifying and correcting batch effects, which can affect the accuracy of the data analysis. For example, if some samples are highly correlated within a batch, but not correlated with samples from another batch, this may indicate that the samples from different batches have been processed differently, and that further adjustments or normalization may be necessary to ensure accurate analysis. A correlation matrix is usually represented as a heatmap, where each cell in the matrix represents the pairwise correlation between two samples. Samples that are highly correlated will appear in similar colors in the heatmap, while samples that are not correlated will appear in different colors.
The Pearson coefficient, also known as Pearson's correlation coefficient, is a statistic that measures the strength and direction of the linear relationship between two continuous variables.
The Pearson coefficient ranges from -1 to 1, with a value of 1 indicating a perfect positive linear relationship, a value of -1 indicating a perfect negative linear relationship, and a value of 0 indicating no linear relationship. Positive values of the Pearson coefficient indicate that as one variable increases, the other variable also increases. Negative values of the Pearson coefficient indicate that as one variable increases, the other variable decreases.
As most proteins within a proteomics dataset usually have little variation, the pearson coefficients observed are usually high. Thus, we set a lower threshold for the applied color scale to 0.8.
prot.plot_corrheatmap( results$dep, lower = 0.8, upper = 1, font_size = 12, significant = FALSE )
The following plots are a visual representation of differential expression (DE) analysis. DE analysis is the process of comparing gene or protein expression levels between two or more groups of samples in order to identify genes or proteins that are differentially expressed between the groups. The aim of DE analysis is to identify genes or proteins that show statistically significant differences in expression levels between the different groups.
In the following correlation plot, we consider only significant proteins:
prot.plot_corrheatmap( results$dep, lower = 0.8, upper = 1, font_size = 12, significant = T )
Heat maps are either used to visualize the expression level of each protein in each sample or the log2 fold changes for each tested contrast. The heat maps give an overview of all proteins with significantly different abundances (rows) in all samples (bars). This allows general trends to be identified, for example when one sample or replicate differs from others. In addition, clustering of samples (columns) can point to more similar samples and clustering of proteins (rows) can indicate proteins that behave in a similar way. By considering only significant proteins we can get a better overview of the expression levels of proteins that are relevant to the comparison of the conditions.
width = ncol(data) / 2 if (width < 7) { width = 7 } if (width > 12) { width = 12 } len = nrow(results$dep[SummarizedExperiment::rowData(results$dep)$significant, ]) / 13 if (len < 5) { len = 5 }
prot.plot_heatmap( results$dep, type = "contrast", kmeans = TRUE, k = 4, show_row_names = T, row_font_size = 5, plot = T, export = F )
prot.plot_heatmap( results$dep, type = "centered", kmeans = TRUE, k = 4, show_all = T, show_row_names = T, row_font_size = 5, indicate = "condition", col_limit = 2, plot = T, export = F )
Volcano plots can be used to visualize a particular contrast (comparison between two samples). This allows to examine the gene enrichment between two samples (x-axis) and their corresponding (adjusted) p-values (y-axis).
prot.plot_volcano( results$dep, contrast = "Glc_vs_Ac", add_names = TRUE, label_size = 2.5, adjusted = FALSE, plot = TRUE, export = FALSE, lfc = 1, alpha = 0.05 )
Pathway overrepresentation analysis is a method used to identify pathways that are significantly enriched in a set of genes or proteins. It is a useful tool for interpreting the results of gene or protein expression studies, as it can identify pathways that are likely to be involved in the studied biological process.
We use lollipop charts to visualize significantly enriched pathways. In these plots, each enriched pathway is represented as a horizontal line, with the length indicating what fraction of the total genes constituting the pathway were found to be enriched (e.g, if a pathway consists of 10 genes of which 6 were found to be overrepresented, the Rich factor is 0.6). The circle size at the end of the 'lollipops' indicates the absolute number of enriched genes in the pathway, while its color indicates the adjusted p value associated with the finding.
Instead of Venn diagrams to illustrate the relationship between different gene sets and how they overlap, we use UpSet plots [@lex2014][@conway2017]. In a Venn diagram, the intersections between sets are represented as overlapping circles, with the area of the overlap representing the number of elements in the intersection. However, Venn diagrams become less effective as the number of sets increases, since the overlapping circles can become complex and difficult to interpret. In contrast, UpSet plots are more effective for larger numbers of sets, as they represent the intersection between sets as bars across the x-axis, with the height of the bars representing the number of elements in the intersection. The matrix below the bar chart provides details about the set intersections. This makes it easier to compare and visualize the relationships between sets, even when there are many sets. Further instructions on how to interpret UpSet plots can be found at https://upset.app/.
In the following plot, we have used the KEGG pathways to visualize the significant proteins that are differentially expressed within the tested contrast.
We can generate the plot showing the overrepresented KEGG pathways for a specific contrast:
w_pora <- 11 h_pora <- 7
# Lollipop chart prot.plot_enrichment( results$pora_kegg_up[["Glc_vs_Ac"]], title = "Upregulated pathways - KEGG", subtitle = "Glc_vs_Ac", plot = T, export = F ) # UpSet plot prot.plot_upset(results$pora_kegg_up[["Glc_vs_Ac"]], line.size = 0.9, mb.ratio = c(0.6, 0.4) )
Further details about the enriched pathways can be extracted via
converting the enrichResult
objects into a dataframe:
df <- as.data.frame(results$pora_kegg_up[["Glc_vs_Ac"]]) # Inspect the structure of the dataframe glimpse(df)
No KEGG pathways were found to be UNDERrepresented for the contrast "Glc_vs_Ac":
df <- as.data.frame(results$pora_kegg_dn[["Glc_vs_Ac"]]) # Inspect the structure of the dataframe glimpse(df)
In the following plot, we have used the BioCyc pathways to visualize the significant proteins that are differentially expressed within the tested contrast.
w_pora <- 11 h_pora <- 9
# Plot overrepresented KEGG pathways ## Lollipop chart prot.plot_enrichment( results$pora_custom_up[["Glc_vs_Ac"]], title = "Upregulated pathways - BioCyc", subtitle = "Glc_vs_Ac", plot = T, export = F ) ## UpSet plot prot.plot_upset(results$pora_custom_up[["Glc_vs_Ac"]], line.size = 0.9, mb.ratio = c(0.6, 0.4) )
w_pora <- 11 h_pora <- 6
# Plot underrepresented Custom pathways ## Lollipop chart prot.plot_enrichment( results$pora_custom_dn[["Glc_vs_Ac"]], title = "Upregulated pathways - BioCyc", subtitle = "Glc_vs_Ac", plot = T, export = F ) ## UpSet plot prot.plot_upset(results$pora_custom_dn[["Glc_vs_Ac"]], line.size = 0.9, mb.ratio = c(0.6, 0.4) )
Often, a specific set of proteins is of particular interest for an analysis. The prot.plot_bar()
allows visualization of protein abundances across different conditions or their log2-fold ratio for selected contrasts in a column chart.
For example, let's have a look at the proteins constituting the TCA cycle. For this, we can store all KEGG pathways in a list object using the prot.get_kegg_pathways
and then extract the respective genes to create the plot. Since prot.get_kegg_pathways
returns locus IDs and not gene names, we convert the IDs into names by accessing the primary information-containing dataframe within a SummarizedExperiment object with rowData()
from package SummarizedExperiment
:
# Load the SummarizedExperiment package library(SummarizedExperiment) # Get all KEGG pathways and their respective genes kegg_pathways <- prot.get_kegg_pathways("ppu") # Extract genes constituting the TCA cycle TCA_genes <- kegg_pathways[["Citrate cycle (TCA cycle) - Pseudomonas putida KT2440"]] # Get the names instead of IDs for the genes TCA_names <- rowData(results$dep)[rowData(results$dep)$ID %in% TCA_genes, "name"] # Plot TCA cycle protein abundances prot.plot_bar( dep = results$dep, type = "centered", combine = T, proteins = TCA_names, col.id = "name", export = F, plot = T )
Lastly, we plot the log2 fold change values of TCA cycle proteins for one of the tested contrasts:
# Plot TCA cycle protein abundances prot.plot_bar( dep = results$dep, type = "contrast", contrast = "Glc_vs_Ac", combine = T, convert_name = T, proteins = TCA_names, col.id = "name", export = F, plot = T )
Other type
option of this function include reference
, in which the log2(abundance) of all shown proteins are centered around a reference protein, and abundance
, which shows the absolute log(abundance) values.
::: {#refs} :::
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.