In lauzikaite/massFlowR: LC-MS Data Pre-Processing

Package: massFlowR
Authors: Elzbieta Lauzikaite

BiocStyle::markdown() 
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
Biocpkg <- function (pkg) {
    sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}
## suppressing progress bar
options(kpb.suppress_noninteractive = TRUE)

## load libraries quietly to avoid printing messages in the vignette
suppressPackageStartupMessages(library(massFlowR))
suppressPackageStartupMessages(library(xcms))
out_directory <- getwd()
url_m <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/doc/massFlowR.html"
url_a <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/doc/annotation.html"

Overview

This document provides an overview of LC-MS data pre-processing with massFlowR.

massFlowR aligns and annotates structurally-related spectral peaks across LC-MS experiment samples. The pipeline consists of five stages:

Processing of individual LC-MS samples
Peak alignment across samples
Alignment validation
Peak filling
Automatic peak annotation

Installation

Install the development version of the package directly from GitHub with:

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("lauzikaite/massFlowR")

Data import

Import of LC-MS data in mzML and NetCDF formats is supported and is implemented via the Bioconductor package mzR.

In this document, the functionality of the package is demonstrated using data from the Bioconductor package faahKO. Raw LC-MS data files (in NetCDF format) are provided for spinal cord samples taken from six knock-out (KO) and six wild-type (WT) mice. Each datafile contains centroided data acquired in positive ionization mode, recorded at 200-600 m/z and 2500-4500 seconds.

Load the package and locate the raw CDF files within the faahKO package:

library(massFlowR)
## Get the full paths to the CDF files
# faahKO_files <- system.file(c('cdf/WT/wt15.CDF', 'cdf/WT/wt16.CDF'), package = "faahKO")
faahKO_files <- dir(system.file("cdf/WT", package = "faahKO"), full.names = TRUE)
faahKO_files

Individual samples processing

The first stage in the pipeline is chromatographic peak detection and peak grouping via function groupPEAKS.

Chromatographic peak detection

Peaks are detected using the centWave algorithm from the Bioconductor package xcms (see [@Tautenhahn2008]). Appropriate parameters for the LC-MS experiment must be selected; for advice on this, please see the official xcms manual. The selected parameters must be built into a CentWaveParam class object:

## xcms parameters for peak-picking
cwt_param <- xcms::CentWaveParam(ppm = 25,
                                 snthresh = 10,
                                 noise = 1000,
                                 prefilter =  c(3, 100),
                                 peakwidth = c(30, 80),
                                 mzdiff = 0)

Chromatographic peak grouping

Detected peaks are put into groups, which comprise peaks originating from the same chemical compound: adducts and isotopes. For each peak in a sample, function groupPEAKS:

Finds all co-eluting peaks
Performs extracted ion chromatogram (EIC) correlation between all co-eluting peaks
Builds a network of peaks with high EIC correlation
Detects communities of peaks within the correlation network (using a community detection algorithm from the igraph package, see [@Raghavan2007])

Peaks in the detected communities form a single peak-group. Only groups with more than one peak are retained in the final peak table.
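
The grouping steps above can be illustrated with a minimal base R sketch on simulated data. It is not the groupPEAKS implementation: the EICs are synthetic Gaussians, the co-elution window (3 scans) and correlation threshold (0.9) are assumed values, and connected components of the thresholded network are used as a simple stand-in for the igraph community detection.

```r
## Simulated EICs for six peaks (hypothetical data, not from faahKO)
scans <- 1:60
gauss_eic <- function(apex, width, height) {
  height * exp(-0.5 * ((scans - apex) / width)^2)
}
set.seed(1)
eics <- cbind(
  p1 = gauss_eic(20, 3, 1e5),
  p2 = gauss_eic(20, 3, 2e4),
  p3 = gauss_eic(21, 3, 5e3),
  p4 = gauss_eic(40, 4, 8e4),
  p5 = gauss_eic(40, 4, 1e4),
  p6 = gauss_eic(52, 3, 3e4)
)
eics <- eics + rnorm(length(eics), sd = 50)  # small random noise

## Step 1: co-eluting peaks (apex within an assumed window of 3 scans)
apex <- apply(eics, 2, which.max)
coeluting <- abs(outer(apex, apex, "-")) <= 3

## Steps 2-3: EIC correlation and network of highly correlated peaks
correlated <- cor(eics) >= 0.9  # assumed threshold
adj <- coeluting & correlated

## Step 4 (simplified): connected components by propagating the
## smallest peak index through the network until nothing changes
groups <- seq_len(ncol(eics))
repeat {
  updated <- vapply(seq_along(groups),
                    function(i) min(groups[adj[i, ]]),
                    integer(1))
  if (identical(updated, groups)) break
  groups <- updated
}

## Only groups with more than one peak are retained
retained <- ave(groups, groups, FUN = length) > 1
data.frame(peak = colnames(eics), group = groups, retained = retained)
```

In this toy example p1 to p3 and p4 to p5 form two groups, while the isolated peak p6 is dropped.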

Implementation

Function groupPEAKS processes every LC-MS datafile independently and can therefore be run in parallel, or during sample acquisition on the machine linked to the LC-MS instrument. The paths to the LC-MS datafiles (in mzML/NetCDF format) are passed to groupPEAKS through a metadata table, together with the CentWaveParam class object, the path to the output directory and the parameters for parallelisation.

groupPEAKS writes a separate csv file with the detected and grouped peaks for each LC-MS sample into the selected directory. The names of the generated csv files are needed for the next stage in the pipeline: each starts with the original raw LC-MS filename and ends with "peakgrs.csv".

The massFlowR pipeline requires a metadata table with the following columns for each sample:

filename specifies the basename of the raw LC-MS file.
run_order specifies the acquisition order for the corresponding LC-MS sample.
raw_filepath specifies the absolute path to the raw LC-MS file (netCDF/mzML).

## define where processed datafiles should be written
# out_directory <- "absolute_path_to_output_directory"

## create metadata table with required columns 'filename', 'run_order' and 'raw_filepath'
metadata <-
  data.frame(
  filename = sub("\\.CDF$", "", basename(faahKO_files)),
  run_order = seq_along(faahKO_files),
  raw_filepath = faahKO_files
  )
write.csv(metadata, file = file.path(out_directory, "metadata.csv"), row.names = FALSE)
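
Before starting a long run it can help to verify that the metadata table has the required columns and that the raw files exist. The helper below, check_metadata, is hypothetical and not part of massFlowR; it is a minimal sketch shown here on temporary placeholder files.

```r
## Hypothetical helper (not a massFlowR function): basic metadata checks
check_metadata <- function(metadata,
                           required = c("filename", "run_order", "raw_filepath")) {
  missing_cols <- setdiff(required, colnames(metadata))
  if (length(missing_cols) > 0) {
    stop("metadata is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  paths <- as.character(metadata$raw_filepath)
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("raw file(s) not found: ", paste(not_found, collapse = ", "))
  }
  invisible(TRUE)
}

## Demonstration with empty placeholder files in a temporary directory
demo_files <- file.path(tempdir(), c("demo1.CDF", "demo2.CDF"))
file.create(demo_files)
demo_metadata <- data.frame(
  filename = c("demo1", "demo2"),
  run_order = 1:2,
  raw_filepath = demo_files
)
ok <- check_metadata(demo_metadata)
```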

kableExtra::kable_styling(
  knitr::kable(x = metadata,
             format = "html",
             row.names = FALSE),
  full_width = TRUE, bootstrap_options = "striped", position = "left")

## run peak detection and grouping for the listed faahKO files with three workers
groupPEAKS(file = file.path(out_directory, "metadata.csv"), out_dir = out_directory, cwt = cwt_param, ncores = 3)

Peak alignment

To align peaks across the samples of an LC-MS experiment, massFlowR implements an alignment algorithm that preserves the structural spectral information.

Peaks are aligned by taking samples in the order of raw sample acquisition and matching them against a template. The template is a list of all previously aligned peaks, which is updated with each sample by:

adding new peaks
averaging the m/z and rt values of matching peaks between the sample and the template.

The template therefore stores the moving averages of the m/z and rt values.
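
As a rough illustration of such a moving average, the sketch below updates one template peak with matching peaks from two further samples. It assumes a plain arithmetic mean over all matched samples; the function update_template_peak and the counter n are hypothetical and may differ from how massFlowR tracks the averages internally.

```r
## Hypothetical helper: update the running mean of m/z and rt
update_template_peak <- function(tmp_peak, sample_peak) {
  n <- tmp_peak$n  # number of samples already averaged
  tmp_peak$mz <- (tmp_peak$mz * n + sample_peak$mz) / (n + 1)
  tmp_peak$rt <- (tmp_peak$rt * n + sample_peak$rt) / (n + 1)
  tmp_peak$n <- n + 1
  tmp_peak
}

## Template peak built from the first sample
tmp_peak <- list(mz = 301.1000, rt = 2800, n = 1)

## Matching peaks found in the second and third samples
tmp_peak <- update_template_peak(tmp_peak, list(mz = 301.1004, rt = 2803))
tmp_peak <- update_template_peak(tmp_peak, list(mz = 301.1002, rt = 2802))
tmp_peak
```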

For each peak in a sample, the alignment algorithm:

Finds all template peaks within an m/z and rt window.
Identifies the true match by comparing the spectral similarity between the peak-group of the peak-of-interest and all matching template's peak-groups.
Merges the selected template's peak-group with the peak-group of the peak-of-interest. It updates template's m/z and rt values for the matching peaks across the template and the sample.
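
The first of these steps, the window search, can be sketched as follows. The template table and the function find_candidates are hypothetical, and the sketch assumes that mz_err and rt_err are absolute, symmetric tolerances around the peak-of-interest.

```r
## Hypothetical template with four previously aligned peaks
template_peaks <- data.frame(
  peakid = 1:4,
  mz = c(301.100, 301.105, 450.200, 301.300),
  rt = c(2800, 2806, 2800, 2801)
)

## Return all template peaks inside the m/z and rt window of one sample peak
find_candidates <- function(peak_mz, peak_rt, template_peaks,
                            mz_err = 0.01, rt_err = 10) {
  in_window <- abs(template_peaks$mz - peak_mz) <= mz_err &
    abs(template_peaks$rt - peak_rt) <= rt_err
  template_peaks[in_window, ]
}

hits <- find_candidates(peak_mz = 301.102, peak_rt = 2803, template_peaks)
hits
```

Here two template peaks fall inside the window, so the spectral similarity of their peak-groups would decide which one is the true match.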

Spectral similarity is measured as the cosine of the angle between two vectors, each representing the m/z and intensity values of one PCS.
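
A minimal sketch of a cosine comparison between spectra is given below. How massFlowR builds the vectors is not specified here, so the sketch assumes that intensities are placed on a shared m/z grid with 0.01 bins; the helpers mz_key and to_vector and all peak values are hypothetical.

```r
## Hypothetical helpers: map m/z values to 0.01-wide bins and
## sum the intensities that fall into each bin of a shared grid
mz_key <- function(mz) as.integer(round(mz * 100))
to_vector <- function(pcs, bins) {
  keys <- mz_key(pcs$mz)
  vapply(bins, function(b) sum(pcs$into[keys == b]), numeric(1))
}

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

## Peak-group of the peak-of-interest and two candidate template peak-groups
sample_pcs <- data.frame(mz = c(301.100, 302.103, 323.082),
                         into = c(1e5, 2e4, 5e3))
template_a <- data.frame(mz = c(301.101, 302.104, 323.083),
                         into = c(9e4, 1.8e4, 6e3))
template_b <- data.frame(mz = c(301.099, 317.096),
                         into = c(5e4, 4e4))

bins <- sort(unique(mz_key(c(sample_pcs$mz, template_a$mz, template_b$mz))))
sample_vec <- to_vector(sample_pcs, bins)
sims <- c(
  A = cosine_similarity(sample_vec, to_vector(template_a, bins)),
  B = cosine_similarity(sample_vec, to_vector(template_b, bins))
)
best <- names(which.max(sims))
```

Template group A shares all three peaks with the sample in similar proportions and therefore scores higher than group B, which shares only one.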

Implementation

To enable peak alignment, the metadata table created earlier must contain an additional column:

proc_filepath specifies the absolute path to a csv file, generated by the groupPEAKS function.

## update previous metadata table
processed_files <- list.files(out_directory, pattern = "peakgrs\\.csv$", full.names = TRUE)
metadata$proc_filepath <- processed_files
write.csv(metadata, file.path(out_directory, "metadata_grouped.csv"), row.names = FALSE)

kableExtra::kable_styling(
  knitr::kable(x = metadata,
             format = "html",
             row.names = FALSE),
  full_width = TRUE, bootstrap_options = "striped", position = "left")

To initiate peak alignment, use function buildTMP, which constructs a massFlowTemplate class object. The massFlowTemplate object stores sample alignment and annotation data and is updated with every sample. Define the desired error windows for m/z and rt (in seconds) values, which will be used for the whole experiment. mz_err = 0.01 and rt_err = 2 are recommended for high-resolution UPLC-MS data; mz_err = 0.01 and rt_err = 10 are suitable for the faahKO package data.

## initiate template
template <- buildTMP(file = file.path(out_directory, "metadata_grouped.csv"), out_dir = out_directory, mz_err = 0.01, rt_err = 10)

The massFlowTemplate object stores the most up-to-date template within the @tmp slot. Function buildTMP creates a template using the first sample in the experiment. To review the samples that are in the experiment, use slot @samples.

## review generated template using slot "tmp"
head(template@tmp)

## review samples in the experiment using slot "samples"
template@samples

To align peaks across all samples in the study, apply method alignPEAKS. alignPEAKS updates the massFlowTemplate object:

Selects next sample to be aligned and checks whether it was already peak-picked and grouped (waits until the corresponding csv file is written).
Matches every peak in the sample against the template.
Selects best matches using spectral similarity comparison.
Updates template with sample's peaks: adds new and averages matching peaks.

Parameter ncores allows a quicker implementation using the parallel backend that is available on the user's machine (i.e. multicore on Unix/Mac and snow on Windows). Select the desired number of parallel workers.

## align peaks across all remaining samples
template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)
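
The choice between a fork-based backend on Unix/Mac and a socket-based backend on Windows can be sketched with the base parallel package. This only illustrates the idea behind ncores under that assumption; it is not the backend code used by massFlowR, and run_parallel is a hypothetical helper.

```r
library(parallel)

## Hypothetical helper: apply a function to each element with ncores workers
run_parallel <- function(x, fun, ncores = 2) {
  if (.Platform$OS.type == "windows") {
    ## socket cluster (snow-style), required on Windows
    cl <- makeCluster(ncores)
    on.exit(stopCluster(cl))
    parLapply(cl, x, fun)
  } else {
    ## forked workers (multicore-style) on Unix/Mac
    mclapply(x, fun, mc.cores = ncores)
  }
}

res <- run_parallel(1:4, function(i) i^2, ncores = 2)
unlist(res)
```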

The aligned peak tables are stored within the @data slot as a list with one table per sample.

## review alignment results for an individual sample, e.g. the second, using slot "data"
head(template@data[[2]])

## render the same table as HTML for display in the vignette
dt <- head(template@data[[2]])
kableExtra::kable_styling(
  knitr::kable(x = dt,
             format = "html",
             row.names = FALSE),
  full_width = TRUE, bootstrap_options = c("striped"), position = "left")

Alignment validation

Once peaks are aligned across all samples, the obtained peak-groups are validated. Intensity values for each peak in a group are correlated across all samples. The correlation estimates are then used to build networks of peaks that behave similarly across all samples. Peaks exhibiting a different pattern in their intensities are put into a new peak-group.

Each validated peak-group is a pseudo chemical spectrum (PCS), comprising peaks that exhibit consistent behaviour across samples.
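
The idea of splitting a peak-group by intensity correlation can be sketched in base R. The intensities below are made up, and the sketch uses single-linkage clustering cut at 1 - cor_thr, which yields the connected components of the network with edges at correlation >= cor_thr; this is an assumed simplification, not the validPEAKS implementation.

```r
## Hypothetical intensities of three peaks from one aligned peak-group
## across six samples; peak3 follows a different pattern
conc_a <- c(1, 2, 3, 4, 5, 6)
conc_b <- c(4, 1, 6, 2, 5, 3)
into <- cbind(
  peak1 = 1e4 * conc_a,
  peak2 = 2e3 * conc_a + c(50, -30, 20, -40, 10, -10),
  peak3 = 5e3 * conc_b
)

cor_thr <- 0.5
cor_mat <- cor(into)

## Peaks joined by a chain of correlations >= cor_thr end up together
hc <- hclust(as.dist(1 - cor_mat), method = "single")
pcs <- cutree(hc, h = 1 - cor_thr)
pcs
```

peak1 and peak2 stay in the same group, while peak3 is moved into a group of its own.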

Alignment validation requires the metadata table and the final template file, both of which were written by the alignPEAKS function to the selected directory. A massFlowTemplate class object is first created from them.

## get the absolute paths to the updated metadata file and the final template
m_file <- file.path(out_directory, "aligned.csv")
tmp_file <- file.path(out_directory, "template.csv")

## initiate validation by first loading aligned samples into a massFlowTemplate object
template <- loadALIGNED(file = m_file, template = tmp_file, rt_err = 10)

Peak-group validation is performed by applying the method validPEAKS to the massFlowTemplate class object. Validation can be run in parallel using the ncores parameter.

validPEAKS returns a massFlowTemplate class object with validated pseudo chemical spectra and writes peak tables for the obtained PCS:

intensity_data.csv (intensity values for every peak in PCS in every sample)
peaks_data.csv
sample_data.csv

## Start validation using a massFlowTemplate object
template <- validPEAKS(template, out_dir = out_directory, ncores = 2, cor_thr = 0.5)

Peak filling

The final step before annotation is to re-integrate intensity from the raw LC-MS files for peaks that were not detected by centWave. In contrast to the xcms package, the m/z and rt values used for intensity integration are estimated for each sample separately, by interpolation with a cubic smoothing spline.

## Fill peaks using validated massFlowTemplate object
template <- fillPEAKS(template, out_dir = out_directory, ncores = 2)
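
The spline-based estimation can be illustrated with stats::smooth.spline. The sketch assumes that the rt of a peak is modelled against run order and predicted for the samples in which the peak was not detected; the values are invented and the actual predictors used by fillPEAKS may differ.

```r
## Hypothetical rt values of one peak across 12 samples (NA = not detected)
run_order <- 1:12
rt_obs <- c(2800.0, 2800.6, NA, 2801.9, 2802.4, NA,
            2803.8, 2804.1, 2805.0, NA, 2806.2, 2806.9)
detected <- !is.na(rt_obs)

## Fit a cubic smoothing spline to the detected peaks only
fit <- smooth.spline(x = run_order[detected], y = rt_obs[detected])

## Estimate rt for the samples where the peak is missing
rt_filled <- rt_obs
rt_filled[!detected] <- predict(fit, x = run_order[!detected])$y
round(rt_filled, 1)
```

The same approach could be applied to m/z, giving a sample-specific window from which intensity is then integrated.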

Peak annotation

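If an in-house chemical reference database is available, PCS can be annotated against it. For more details on how to build a database file, see the vignette "annotation using database" (https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/doc/annotation.html).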

References

lauzikaite/massFlowR documentation built on April 29, 2020, 9:45 a.m.
