
The goal of mzRAPP is to allow reliability assessment of non-targeted data pre-processing (NPP) in the realm of liquid chromatography high-resolution mass spectrometry (LC-HRMS). mzRAPP's approach is based on the increasing popularity of merging non-targeted with targeted metabolomics, meaning that both types of data evaluation are often performed on the same data set. Following this assumption, mzRAPP can utilize user-provided information on a set of molecules (ideally > 50) with known retention behavior. Using the provided peak boundaries, mzRAPP extracts and validates chromatographic peaks for all (enviPat-predicted) isotopologues of those target molecules directly from the mzML files. The resulting benchmark data set is then used to derive different performance metrics for NPP performed on the same mzML files.

Example

To run an example, you first need to download some exemplary files:

  1. Download all 30 mzML files (at least five if you do not want to process that many) ending in "_POS.mzML" from the repository MTBLS267 on MetaboLights.
  2. Download the csv files "SampleGroups_MTBLS267.csv" and "Target_File_MTBLS267.csv" from ucloud for benchmark generation.
  3. (optional) If you do not want to do the benchmark-generation part of the example, download the already prepared benchmark "Benchmark.csv" from ucloud.
  4. Download the XCMS and MZmine 2 output files from ucloud. Alternatively, you can process the mzML files downloaded before yourself. If you do so, make sure to follow the instructions given in the Readme section "Exporting NPP outputs from different tools".
  5. Do not forget to unzip/extract all downloaded csv files. Otherwise, you won't be able to select them in the file-selection dialog windows! (A scripted way of doing this is sketched after this list.)
  6. If you have not installed mzRAPP yet, please follow the instructions in the mzRAPP Readme (e.g., on GitHub) to do so.
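
If you prefer to script these setup steps, here is a minimal R sketch. It is hedged: the GitHub repository path is taken from this documentation's source, and the zip archives are matched by a generic pattern rather than by their actual (unknown here) file names.

# Install mzRAPP from GitHub if it is not available yet (see the Readme
# for the official installation instructions).
if (!requireNamespace("mzRAPP", quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
  remotes::install_github("YasinEl/mzRAPP")
}

# Extract every downloaded zip archive into the working directory so the
# contained csv files can be picked in the file-selection dialogs.
for (zipfile in list.files(pattern = "\\.zip$")) {
  unzip(zipfile, exdir = ".")
}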

You can now start mzRAPP using:

library(mzRAPP)
callmzRAPP()

Benchmark generation

In the Generate Benchmark tab, you have to select all necessary files and set the instrument/resolution used.

  1. Select mzML files: downloaded mzML files
  2. Select sample-group file: SampleGroups_MTBLS267.csv
  3. Select target file: Target_File_MTBLS267.csv
  4. Select instrument & resolution: OrbitrapXL,Velos,VelosPro_R60000@400 (see the sketch after this list)
  5. Select the adducts M+NH4, M+Na, and M+K
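
The instrument & resolution string corresponds to an entry in the resolution list shipped with the enviPat package, which mzRAPP uses for isotopologue prediction. As a small sketch, you can list the available identifiers directly in R:

# List the instrument/resolution identifiers known to enviPat; the
# selection above should appear among them.
library(enviPat)
data(resolution_list)
names(resolution_list)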

Afterward, a number of parameters have to be set:

  1. Lowest isotopologue to be considered: 0.05 (see the sketch after this list)
  2. Min. # of scans per peak: 6
  3. mz precision [ppm]: 6
  4. mz accuracy [ppm]: 5
  5. Processing plan: according to your computational resources (generally, multiprocess will be faster)
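
To get a feeling for the lowest-isotopologue threshold, here is a hedged sketch using enviPat's isopattern() function. Note that interpreting mzRAPP's 0.05 on enviPat's scale (percent relative to the most abundant isotopologue) is an assumption.

# Predict the isotope pattern of glucose as an example molecule.
# Isotopologues below 0.05% of the most abundant one are dropped
# (enviPat's threshold argument is given in percent).
library(enviPat)
data(isotopes)
pattern <- isopattern(isotopes, "C6H12O6", threshold = 0.05, charge = 1)
head(pattern[["C6H12O6"]])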

After that, benchmark generation can be started by clicking the blue button at the bottom of the page. Please note that this can take up to an hour if you selected all files.

When mzRAPP is done, it will automatically export the finished Benchmark as a csv file to your working directory and switch to the View Benchmark tab, where you get key metrics as well as two interactive plots on the generated Benchmark:

  1. If you processed all 30 mzML files, you should have generated a benchmark containing 47 different molecules with 157 different features (including all adducts and isotopologues), resulting in 2870 peaks in total.

  2. In the provided histogram, you can choose to plot the height of detected benchmark peaks. This shows that your benchmark peaks span several orders of magnitude. Sufficient quality of low-intensity peaks is ensured by removing isotopologues that do not satisfy criteria on peak shape (peak-shape correlation with the most abundant isotopologue) and abundance (isotopologue ratio bias < 30%). A sketch for reproducing such a histogram outside the app follows this list.

  3. (optional) You can inspect the Benchmark in greater detail by exporting it to Skyline, which you can download from the Skyline website. To do this, click the button Export Skyline Transition list and peak boundaries. Afterward, follow the instructions in the txt file exported to your working directory. It is also possible to remove peaks before proceeding by filtering on some of the provided variables or by deleting selected rows from the csv file.
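
Here is a minimal sketch for finding the exported csv and reproducing the peak-height histogram outside the app. The peak-height column name is an assumption; check the exported file for the actual variable names.

# The benchmark is exported to the working directory; list the csv files
# there to find it.
getwd()
list.files(pattern = "\\.csv$")

# Hypothetical: "peak_height" stands in for whatever the exported csv
# actually calls the peak-height column.
bm <- read.csv("Benchmark.csv")
# Heights span several orders of magnitude, so plot them on a log scale.
hist(log10(bm$peak_height),
     xlab = "log10(peak height)",
     main = "Benchmark peak heights")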

Non-targeted data pre-processing assessment

Going forward, the generated Benchmark (BM) can be used to assess non-targeted data pre-processing (NPP) outputs. To do this, go to the Setup NPP assessment tab. You can then assess the performance of NPP runs we have performed via XCMS (the script for their generation is available on ucloud). Please note that the percentage values reported by us below could vary slightly on your machine, since we estimate confidence intervals using bootstrapping, which implies an element of randomness; a minimal sketch of this effect follows.
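
To illustrate where this randomness comes from, here is a minimal bootstrap sketch for a confidence interval on a found-peak rate. The 2660/2870 counts are taken from the first XCMS check below; the resampling scheme is a generic illustration, not mzRAPP's exact implementation.

# Bootstrap a 95% confidence interval for the found-peak proportion.
found <- c(rep(1, 2660), rep(0, 2870 - 2660))  # 1 = found, 0 = missed
boot_rates <- replicate(1000, mean(sample(found, replace = TRUE)))
quantile(boot_rates, c(0.025, 0.975))  # resampling makes this vary slightly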

Performance check (XCMS) 1

  1. Non-targeted tool used: XCMS
  2. Select unaligned file(s): XCMS_unaligned_run1.csv
  3. Select aligned file: XCMS_aligned_run1.csv
  4. Select benchmark file: the exported Benchmark as csv (if you did not process all 30 files or skipped the benchmark generation step, please use the Benchmark provided in ucloud)

Afterward, the assessment can be started by clicking the blue button at the bottom of the page.

When the assessment is done, mzRAPP will automatically switch to the View NPP assessment tab. Key performance metrics can be inspected in the three overview boxes at the top.

  1. It always makes sense to check the final output in the Post Alignment box first (if the final output looks good, we generally do not need to spend time on understanding whether the intermediate steps worked well). There we can see that about 83-94% of peaks have been detected, which is acceptable. However, the isotopologue ratio (IR) metric reports 28-53% degenerated IR, which we would consider quite a lot. Hence, we need to look at the previous steps.

  2. The Peak Picking box shows that 2660/2870 peaks have been found, which should correspond to about 83-90% of all raw-data peaks. This is about the same range as in the final output (we would not consider that problematic). The IR metric looks quite good, with only up to 3% of IR being degenerated. However, up to 12% of matches led to split peaks, which is quite a lot. This is problematic since split peaks tend to cause trouble for the downstream alignment process; more specifically, they often compete with more "properly" picked peaks during alignment. Check Figure 1 in the Readme to find out how mzRAPP decides which peak is 'best'.

  3. In the Alignment box, 19-35% of the peaks matched during peak picking are reported to no longer appear in the aligned file. This can happen when the alignment algorithm selects only one peak per time unit and file, discarding all other peaks within the same time unit. We consider 19-35% to be quite a lot. As mentioned above, this could be due to the high number of split peaks, when split peaks are retained for the aligned output instead of 'true' peaks.

We consider the high number of degenerated IR in the aligned output unacceptable. As mentioned, the peak picking step and its high number of split peaks might have been the problem. We will therefore try to improve that step.
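
For reference, the comparison behind the IR metric can be sketched as below. The 30% cut-off is the one used during benchmark generation; the function itself is a hypothetical illustration, not mzRAPP's implementation.

# Hypothetical isotopologue ratio (IR) bias check.
# ir_npp: isotopologue ratio computed from NPP-reported abundances
# ir_bm:  the same ratio taken from the benchmark
ir_bias <- function(ir_npp, ir_bm) abs(ir_npp - ir_bm) / ir_bm
ir_bias(0.28, 0.40)  # 0.30, i.e. right at a 30% bias
ir_bias(0.38, 0.40)  # 0.05, well within the criterion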

Performance check (XCMS) 2

  1. Non-targeted tool used: XCMS
  2. Select unaligned file(s): XCMS_unaligned_run2.csv
  3. Select aligned file: XCMS_aligned_run2.csv
  4. Select benchmark file: the exported Benchmark as csv (if you did not process all 30 files or skipped the benchmark generation step, please use the Benchmark provided in ucloud)

Afterward, the assessment can be started by clicking the blue button at the bottom of the page.

  1. Looking at the final output in Post Alignment, we see that the Found peaks metric did not change much. The IR metric improved slightly, but not by much.

  2. In the Peak Picking box, neither of those metrics changed much either. However, we have fewer split peaks now (1-3%), which is going in the right direction. The number of found peaks decreased slightly, which is not problematic: the fill-gaps algorithm seems to have picked them up, since the number of found peaks in the Post Alignment box is still high.

  3. The alignment step also improved slightly, but not by much.

Since the final output did not improve much, we will now work on the alignment and fill-gaps steps.

Performance check (XCMS) 3

  1. Non-targeted tool used: XCMS
  2. Select unaligned file(s): XCMS_unaligned_run3.csv
  3. Select aligned file: XCMS_aligned_run3.csv
  4. Select benchmark file: the exported Benchmark as csv (if you did not process all 30 files or skipped the benchmark generation step, please use the Benchmark provided in ucloud)

Afterward, the assessment can be started by clicking the blue button at the bottom of the page.

  1. In the Post Alignment box, we see that now about 93-99% of peaks have been detected, which is quite an improvement. Also, the proportion of degenerated IR decreased to 3-20%. The Missing peaks classification shows that no classification was possible. This is most likely because entire features have gone missing rather than random individual peaks. This can be confirmed by scrolling down to the Nature of missing values section, setting the slider to after Alignment and turning the plotting function on.

  2. We did not change anything about the peak picking step, so everything stayed the same.

  3. The alignment improved quite a bit to about 4-10% lost peaks.

Although this was much better than the first run, there is still room for improvement, which might be achievable by optimizing the parameters even further. Below, we provide one more example for MZmine 2.

Performance check (MZmine 2) 1

  1. Non-targeted tool used: MZmine
  2. Select unaligned file(s): all 30 csv files in the folder MZmine_unaligned
  3. Select aligned file: MZmine_aligned.csv
  4. Select benchmark file: the exported Benchmark as csv (if you did not process all 30 files or skipped the benchmark generation step, please use the Benchmark provided in ucloud)

Afterward, the assessment can be started by clicking the blue button at the bottom of the page.

  1. In the Post Alignment box, we see that about 82-92% of peaks have been detected, which is, in our opinion, not bad but improvable. The proportion of degenerated IR is 1-9%, which we consider quite good. It is also worth noting that the line plot in the section Quality of reported NPP peak abundances shows that the IR which are degenerated are not too severe. The Missing peaks classification shows that most of the undetected peaks are of relatively low abundance compared to the detected peaks in the same respective feature (please check the Readme for a more comprehensive explanation).

  2. In the Peak Picking box, we can see that about the same number of peaks have been detected before and after the alignment. However, the sunburst plot in the section Distribution of found/not found peaks shows that the peaks missed before and after alignment are not the same ones. This can be explained by the alignment problems reported in the Alignment box.

  3. The Alignment Step box shows that not all (but most) BM divergences can be identified as alignment errors (this difference is explained in the Readme, e.g., Figure 5). Under the assumption that the alignment in the Benchmark was done correctly, the number of errors amounts to up to about 7%. As can be seen in the sunburst plot in the section Distribution of found/not found peaks, this might be why some of the peaks found during peak detection were no longer found after alignment: they might have been put into a different aligned feature.

If you want, you can try to do better using MZmine, XCMS, or another tool (all tools listed in the Readme on GitHub are supported by mzRAPP).

Please note that although you can use mzRAPP for non-targeted parameter optimization, that is not mzRAPP's primary purpose. mzRAPP is meant to give you an idea of how well your non-targeted pre-processing went. It is up to the user to decide whether that result is good enough.


