Package: massFlowR
Authors: Elzbieta Lauzikaite
Date: r date()

BiocStyle::markdown() 
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
Biocpkg <- function (pkg) {
    sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}
## supressing progress bar
options(kpb.suppress_noninteractive = TRUE) 
## load libraries quietly to avoid printing messages in the vignette
suppressWarnings(library(massFlowR))
out_directory <- getwd()
url_m <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/vignettes/massFlowR.html"
url_a <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/vignettes/annotation.html"

Introduction

This documents provides an overview of the LC-MS data pre-processing with massFlowR using dataset faahKO.

Data import

LC-MS data in mzML/NetCDF format import is supported. Data import is implemented via r Biocpkg("mzR") package.

In this document, functionality of the package will be demonstrated using data from r Biocpkg("faahKO") package. Raw LC-MS data files (in NetCDF format) are provided for spinal cords samples taken from six knock-out (KO) and six wild-type (WT) mice. Each datafile contains centroided data acquired in positive ionization mode, with data recorded at 200-600 m/z and 2500-4500 seconds.

Load the package and locate the raw CDF files within the faahKO package:

library(massFlowR)
## Get the full path to the CDF files
faahKO_files <- dir(system.file("cdf/WT", package = "faahKO"), full.names = TRUE)
kableExtra::kable_styling(
  knitr::kable(x = faahKO_files,
             format = "html",
             row.names = FALSE, col.names = ""),
  full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")

Individual samples processing

The first stage in the pipeline is chromatographic peak detection and grouping via function groupPEAKS.

Chromatographic peak detection

Peaks are detected using the centWave algorithm from r Biocpkg("xcms") package (see [@Tautenhahn2008]). Appropriate parameters for the LC-MS experiment must be selected. For advise on this, please see the official xcms manual. Selected parameters must be built into a CentWaveParam class object:

## xcms parameters for peak-picking
cwt_param <- xcms::CentWaveParam(ppm = 25,
                                 snthresh = 10,
                                 noise = 1000,
                                 prefilter =  c(3, 100),
                                 peakwidth = c(30, 80),
                                 mzdiff = 0)

Chromatographic peak grouping

Detected peaks are put into groups, which comprise peaks originating from the same chemical compound: adducts and isotopes. For each peak in a sample, function groupPEAKS:

Peaks that group into a community form a pseudo chemical spectra. Only communities with more than one peak are retained for further processing.

Implementation

Function groupPEAKS processes every LC-MS datafile independently and thus can be implemented in parallel, or during sample acquisition on the machine linked to the LC-MS. A list of paths to LC-MS datafiles (in mzML/NetCDF format) is feeded to groupPEAKS, together with the CentWaveParam class object, path to output directory and parameters for parallelisation.

groupPEAKS writes a csv with detected and grouped peaks in the selected directory for each LC-MS sample separately. The filenames of the generated csv files will be needed for the next stage in the pipeline. The filename starts with the original raw LC-MS filename and ends with "peakgrs.csv".

massFlowR pipeline requires a metadata table with the following columns for each sample:

## define where processed datafiles should be written
# out_directory <- "absolute_path_to_output_directory"

## create metadata table with required columns 'filename', 'run_order' and 'raw_filepath'
metadata <-
  data.frame(
  filename = gsub(".CDF", "", basename(faahKO_files)),
  run_order = seq(length(faahKO_files)),
  raw_filepath = faahKO_files,
  stringsAsFactors = FALSE
  )
write.csv(metadata, file = file.path(out_directory, "metadata.csv"), row.names = FALSE)
kableExtra::kable_styling(
  knitr::kable(x = metadata,
             format = "html",
             row.names = FALSE),
  full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
## run peak detection and grouping for the listed faahKO files with two workers
groupPEAKS(file = file.path(out_directory, "metadata.csv"), out_dir = out_directory, cwt = cwt_param, ncores = 2)

Peak alignment

To align structurally-related peaks as a group across samples in LC-MS experiment, an alignment algorithm, which preserves the structural spectral information, is implemented.

Peaks are aligned by taking samples in the order of raw sample acquisition and matching them against a template. Template is list of all previously aligned peaks, which is updated with each sample by:

Therefore, template stores the moving averages of m/z and rt values.

For each peak in a sample, alignment algorithm:

  1. Finds all template peaks within a m/z and rt window.
  2. Identifies the true match by comparing the spectral similarity between the peak-group of the peak-of-interest and all matching template's peak-groups.
  3. Merges the selected template's peak-group with the peak-group of the peak-of-interest. It updates template's m/z and rt values for the matching peaks across the template and the sample.

Spectral similarity is measured by obtaining the cosine of the angle between two 2D vectors, representing each PCSs m/z and intensity values.

Implementation

To enable peak alignment, previous metadata table has to contain additional column:

##  update previous metadata table and add paths to generated csv files
processed_files <- list.files(out_directory, "peakgrs.csv", full.names = TRUE)
metadata$proc_filepath <- processed_files
write.csv(metadata, file.path(out_directory, "metadata_grouped.csv"), row.names = FALSE)
kableExtra::kable_styling(
  knitr::kable(x = metadata,
             format = "html",
             row.names = FALSE),
  full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")

To initiate peak alignment, use function buildTMP, which constructs a massFlowTemplateclass object. massFlowTemplate object stores sample alignment and annotation data and is updated with every sample. Define the desired error window for m/z and rt (seconds) values, which will be used for the whole experiment. mz_err = 0.01 and rt_err = 2 are recommended for high-resolution UPLC-MS data. mz_err = 0.01 and rt_err = 10 are suitable for the faahKOpackage data.

Desired spectral similarity threshold can be selected with cutoff parameter (0-1, with 1 being perfectly identical). cutoff = 0.5 is recommended for high-resolution UPLC-MS data

## initiate template
template <- buildTMP(file = file.path(out_directory, "metadata_grouped.csv"), out_dir = out_directory, mz_err = 0.01, rt_err = 10, cutoff = 0.5)

To review the samples that are in the experiment, use slot @samples.

## review samples in the experiment using slot "samples"
template@samples
## review alignment results for an individual sample, e.g. the second, using slot "data"
dt <- template@samples
kableExtra::kable_styling(
  knitr::kable(x = dt,
             format = "html",
             row.names = F),
  full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")

massFlowTemplate object stores the most up-to-date template in the @tmp slot. Function buildTMP creates a template using the first sample in the experiment.

## review gnerated template using slot "tmp"
head(template@tmp, 20)
## review alignment results for an individual sample, e.g. the second, using slot "data"
dt <- head(template@tmp, 20)
kableExtra::kable_styling(
  knitr::kable(x = dt,
             format = "html",
             row.names = F),
  full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")

To align peaks across all samples in the study, apply method alignPEAKS. alignPEAKS updates the massFlowTemplate object:

  1. Selects next sample to be aligned and checks whether it was already peak-picked and grouped (waits until the corresponding csv file is written, see).
  2. Matches every peak in the sample against the template.
  3. Selects best matches using spectral similarity comparison.
  4. Updates template with sample's peaks: adds new and averages matching peaks.

Parameter ncores allows a quicker implementation using the parallel backend that is available on the user's machine (i.e. multicore on Unix/Mac and snow on Windows). Select the desired number of parallel workers.

## align peaks across all remaining samples
template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)

Aligned sample's peak tables are stored within the @data slot, which lists tables for each sample separately.

## review alignment results for an individual sample, e.g. the second, using slot "data"
head(template@data[[2]])
## review alignment results for an individual sample, e.g. the second, using slot "data"
dt <- head(template@data[[2]])
kableExtra::kable_styling(
  knitr::kable(x = dt,
             format = "html",
             row.names = F),
  full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")

Alignment re-initiation

If alignment was stopped before all samples in the experiment could be aligned (for example, because of a computer issue), alignmment can be re-initiated by loading the samples that had already been processed. Function loadALIGNED requires the absolute paths to the updated metadata file and the final template writen by alignPEAKS. Note that same parameters must be provided as for the buildTMP function.

## get the absolute paths to the updated metadata file and the final template writen by "alignPEAKS" 
m_file <- file.path(out_directory, "aligned.csv")
tmp_file <- file.path(out_directory, "template.csv")

## load already aligned samples into a massFlowTemplate object
template <- loadALIGNED(file = m_file, template = tmp_file, mz_err = 0.01, rt_err = 10, cutoff = 0.5)
## if some of the samples in the experiment have not been aligned yet, re-start alignment
template@samples$aligned
if (any(template@samples$aligned == FALSE)) {
  template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)
}

Realtime implementation

Sample alignment can be performed in real-time during data acquisition by setting the buildTMP function parameter to realtime = TRUE. The function will wait for the intermediate peak-groups csv files to be written instead of stopping when a file that does not exist is encountered. Make sure, however, that correct filepaths are provided in the metadata csv file's column proc_filepath.

template <- buildTMP(file = file.path(out_directory, "metadata_grouped.csv"), out_dir = out_directory, mz_err = 0.01, rt_err = 10, cutoff = 0.5, realtime = TRUE)
template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)

Alignment validation

Once peaks are aligned across all samples, the obtained peak-groups are validated. Intensity values for each peak in a group are correlated across all samples. Correlation estimates are then used to build networks of peaks, that behave similarly across all samples. Peaks exhibiting a different pattern in their intensities are put into a new peak-group.

Each peak-group is a pseudo chemical spectra (PCS), which comprised peaks exhibiting consistent behaviour across samples.

To enable alignment validation, a metadata table and final template file that both were written by the alignPEAKS function in the selected directory, must be used. A massFlowTemplateclass object, returned by the alignPEAKS function will be needed. If alignment validation will be performed in a new R session, re-initiate the massFlowTemplateclass objec using loadALIGNED function as before.

## get the absolute paths to the updated metadata file and the final template writen by "alignPEAKS" 
m_file <- file.path(out_directory, "aligned.csv")
tmp_file <- file.path(out_directory, "template.csv")

## initiate validation by first loading aligned samples into a massFlowTemplate object
template <- loadALIGNED(file = m_file, template = tmp_file)

Peak-group validation is enabled by applying the method validPEAKS on the massFlowTemplateclass object. Validation can be implemented in parallel using ncores parameter.

validPEAKS will return a massFlowTemplateclass object with validated pseudo chemical spectra, as well as write peak tables for the obtained PCS:

## Start validation using a massFlowTemplate object
template <- validPEAKS(template, out_dir = out_directory, ncores = 2, cor_thr = 0.5)

Peak filling

Final step in the pipeline is to re-integrate intensity values for peaks that were not detected by the centWave using raw LC-MS files. In contrast to xcms package, m/z and rt values for intensity integration are estimated for each sample separetely. m/z and rt values are modelled using a cubic smoothing spline.

## Fill peaks using validated massFlowTemplate object
template <- fillPEAKS(template, out_dir = out_directory, ncores = 2)

Final output

fillPEAKS writes file 'filled_intensity_data.csv', which contains features metadata (including pseudo chemical spectra number) and intensity values for each sample in the experiment.

final_dt <- read.csv(file.path(out_directory, "filled_intensity_data.csv"), header = TRUE, stringsAsFactors = FALSE)
ggplot2::ggplot(final_dt,
             ggplot2::aes(x = rt, y = mz, color = as.factor(pcs))) +
      ggplot2::geom_point(size = 3, alpha = 0.8) +
  ggplot2::xlab("Retention time") +
  ggplot2::ylab("m/z") +
  ggplot2::scale_color_viridis_d(name = "Pseudo chemical spectra") +
  ggplot2::theme_bw(base_size = 14) +
  ggplot2::theme(
    legend.position = "bottom",
    panel.grid.major = ggplot2::element_blank(),
    panel.grid.minor = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(size = 0.1)
  )

Peak annotation

If in-house chemical reference database is available, PCS are annotated. For more details how to build a database file, see annotation using database).

See also

unlink(list.files(path = out_directory, pattern = ".csv", full.names = T))
unlink(list.files(path = out_directory, pattern = ".RDS", full.names = T))

References



lauzikaite/massFlowR documentation built on April 29, 2020, 9:45 a.m.