Package: massFlowR
Authors: Elzbieta Lauzikaite
Modified: r file.info("massFloWR.Rmd")$mtime
Compiled: r date()
BiocStyle::markdown() knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) Biocpkg <- function (pkg) { sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg) } ## supressing progress bar options(kpb.suppress_noninteractive = TRUE)
## load libraries quietly to avoid printing messages in the vignette suppressWarnings(library(massFlowR)) suppressWarnings(library(xcms)) out_directory <- getwd() url_m <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/doc/massFlowR.html" url_a <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/doc/annotation.html"
This documents provides an overview of the LC-MS data pre-processing with massFlowR
.
massFlowR
aligns and annotates structurally-related spectral peaks across LC-MS experiment samples. The pipeline consists of five stages:
Install the development version of the package directly from GitHub with:
if (!require("devtools")) install.packages("devtools") devtools::install_github("lauzikaite/massflowR")
LC-MS data in mzML/NetCDF format import is supported. Data import is implemented via r Biocpkg("mzR")
package.
In this document, functionality of the package will be demonstrated using data from r Biocpkg("faahKO")
package. Raw LC-MS data files (in NetCDF format) are provided for spinal cords samples taken from six knock-out (KO) and six wild-type (WT) mice. Each datafile contains centroided data acquired in positive ionization mode, with data recorded at 200-600 m/z and 2500-4500 seconds.
Load the package and locate the raw CDF files within the faahKO
package:
library(massFlowR) ## Get the full path to the CDF files # faahKO_files <- system.file(c('cdf/WT/wt15.CDF', 'cdf/WT/wt16.CDF'), package = "faahKO")) faahKO_files <- dir(system.file("cdf/WT", package = "faahKO"), full.names = T) faahKO_files
The first stage in the pipeline is chromatographic peak detection and peak grouping via function groupPEAKS
.
Peaks are detected using the centWave algorithm from r Biocpkg("xcms")
package (see [@Tautenhahn2008]). Appropriate parameters for the LC-MS experiment must be selected. For advise on this, please see the official xcms
manual. Selected parameters must be built into a CentWaveParam class object:
## xcms parameters for peak-picking cwt_param <- xcms::CentWaveParam(ppm = 25, snthresh = 10, noise = 1000, prefilter = c(3, 100), peakwidth = c(30, 80), mzdiff = 0)
Detected peaks are put into groups, which comprise peaks originating from the same chemical compound: adducts and isotopes. For each peak in a sample, function groupPEAKS
:
igraph
package algorithm, see [@Raghavan2007])Peaks in the detected communities form a single peak-group. Only groups with more than one peak are retained in the final peak table.
Function groupPEAKS
processes every LC-MS datafile independently and thus can be implemented in parallel, or during sample acquisition on the machine linked to the LC-MS. A list of paths to LC-MS datafiles (in mzML/NetCDF format) is feeded to groupPEAKS
, together with the CentWaveParam class object, path to output directory and parameters for parallelisation.
groupPEAKS
writes a csv with detected and grouped peaks in the selected directory for each LC-MS sample separately. The filenames of the generated csv files will be needed for the next stage in the pipeline. The filename starts with the original raw LC-MS filename and ends with "peakgrs.csv".
massFlowR
pipeline requires a metadata table with the following columns for each sample:
filename
specifies the basename of the raw LC-MS file.run_order
specifies the acquisition order for the corresponding LC-MS sample.raw_filepath
specifies the absolute path to the raw LC-MS file (netCDF/mzML).## define where processed datafiles should be written # out_directory <- "absolute_path_to_output_directory" ## create metadata table with required columns 'filename', 'run_order' and 'raw_filepath' metadata <- data.frame( filename = gsub(".CDF", "", basename(faahKO_files)), run_order = 1:length(faahKO_files), raw_filepath = faahKO_files ) write.csv(metadata, file = file.path(out_directory, "metadata.csv"), row.names = FALSE)
kableExtra::kable_styling( knitr::kable(x = metadata, format = "html", row.names = F), full_width = TRUE, bootstrap_options = "striped", position = "left")
## run peak detection and grouping for the listed faahKO files with two workers groupPEAKS(file = file.path(out_directory, "metadata.csv"), out_dir = out_directory, cwt = cwt_param, ncores = 3)
To align peaks across samples in LC-MS experiment, an alignment algorithm, which preserves the structural spectral information, is implemented.
Peaks are aligned by taking samples in the order of raw sample acquisition and matching them against a template. Template is list of all previously aligned peaks, which is updated with each sample by:
Therefore, template stores the moving averages of m/z and rt values.
For each peak in a sample, alignment algorithm:
Spectral similarity is measured by obtaining the cosine of the angle between two 2D vectors, representing each PCSs m/z and intensity values.
To enable peak alignment, previous metadata table has to contain additional column:
proc_filepath
specifies the absolute path to a csv file, generated by the groupPEAKS
function.## update previous metadata table processed_files <- list.files(out_directory, "peakgrs.csv", full.names = T) metadata$proc_filepath <- processed_files write.csv(metadata, file.path(out_directory, "metadata_grouped.csv"), row.names = F)
kableExtra::kable_styling( knitr::kable(x = metadata, format = "html", row.names = F), full_width = TRUE, bootstrap_options = "striped", position = "left")
To initiate peak alignment, use function buildTMP
, which constructs a massFlowTemplate
class object. massFlowTemplate
object stores sample alignment and annotation data and is updated with every sample. Define the desired error window for m/z and rt (seconds) values, which will be used for the whole experiment. mz_err = 0.01
and rt_err = 2
are recommended for high-resolution UPLC-MS data. mz_err = 0.01
and rt_err = 10
are suitable for the faahKO
package data.
## initiate template template <- buildTMP(file = file.path(out_directory, "metadata_grouped.csv"), out_dir = out_directory, mz_err = 0.01, rt_err = 10)
massFlowTemplate
object stores the most up-to-date template within the @tmp
slot. Function buildTMP
creates a template using the first sample in the experiment. To review the samples that are in the experiment, use slot @samples
.
## review gnerated template using slot "tmp" head(template@tmp) ## review samples in the experiment using slot "samples" template@samples
To align peaks across all samples in the study, apply method alignPEAKS
. alignPEAKS
updates the massFlowTemplate
object:
Parameter ncores
allows a quicker implementation using the parallel backend that is available on the user's machine (i.e. multicore on Unix/Mac and snow on Windows). Select the desired number of parallel workers.
## align peaks across all remaining samples template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)
Aligned sample's peak tables are stored within the @data
slot, which lists tables for each sample separately.
## review alignment results for an individual sample, e.g. the second, using slot "data" head(template@data[[2]])
## review alignment results for an individual sample, e.g. the second, using slot "data" dt <- head(template@data[[2]]) kableExtra::kable_styling( knitr::kable(x = dt, format = "html", row.names = F), full_width = TRUE, bootstrap_options = c("striped"), position = "left")
Once peaks are aligned across all samples, the obtained peak-groups are validated. Intensity values for each peak in a group are correlated across all samples. Correlation estimates are then used to build networks of peaks, that behave similarly across all samples. Peaks exhibiting a different pattern in their intensities are put into a new peak-group.
Each peak-group is a pseudo chemical spectra (PCS), which comprised peaks exhibiting consistent behaviour across samples.
To enable alignment validation, a metadata table and final template file that both were written by the alignPEAKS
function in the selected directory, must be used. A massFlowTemplate
class object is first created.
## get the absolute paths to the updated metadata file and the final template m_file <- file.path(out_directory, "aligned.csv") tmp_file <- file.path(out_directory, "template.csv") ## initiate validation by first loading aligned samples into a massFlowTemplate object template <- loadALIGNED(file = m_file, template = tmp_file, rt_err = 10)
Peak-group validation is enabled by applying the method validPEAKS
on the massFlowTemplate
class object. validation can be implemented in parallel using ncores
parameter.
validPEAKS
will return a massFlowTemplate
class object with validated pseudo chemical spectra, as well as write peak tables for the obtained PCS:
## Start validation using a massFlowTemplate object template <- validPEAKS(template, out_dir = out_directory, ncores = 2, cor_thr = 0.5)
Final step in the pipeline is to re-integrate intensity for peaks that were not detected by the centWave using raw LC-MS files. In contrast to xcms
package, m/z and rt values for intensity integration are estimated for each sample separetely. m/z and rt values are intrapolated using a cubic smoothing spline.
## Fill peaks using validated massFlowTemplate object template <- fillPEAKS(template, out_dir = out_directory, ncores = 2)
If in-house chemical reference database is available, PCS are annotated. For more details how to build a database file, see annotation using database).
unlink(list.files(path = out_directory, pattern = ".csv", full.names = T)) unlink(list.files(path = out_directory, pattern = ".RDS", full.names = T))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.