Package: massFlowR
Authors: Elzbieta Lauzikaite
Date: r date()
BiocStyle::markdown() knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) Biocpkg <- function (pkg) { sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg) } ## supressing progress bar options(kpb.suppress_noninteractive = TRUE)
## load libraries quietly to avoid printing messages in the vignette suppressWarnings(library(massFlowR)) out_directory <- getwd() url_m <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/vignettes/massFlowR.html" url_a <- "https://htmlpreview.github.io/?https://github.com/lauzikaite/massFlowR/blob/master/vignettes/annotation.html"
This documents provides an overview of the LC-MS data pre-processing with massFlowR
using dataset faahKO.
LC-MS data in mzML/NetCDF format import is supported. Data import is implemented via r Biocpkg("mzR")
package.
In this document, functionality of the package will be demonstrated using data from r Biocpkg("faahKO")
package. Raw LC-MS data files (in NetCDF format) are provided for spinal cords samples taken from six knock-out (KO) and six wild-type (WT) mice. Each datafile contains centroided data acquired in positive ionization mode, with data recorded at 200-600 m/z and 2500-4500 seconds.
Load the package and locate the raw CDF files within the faahKO
package:
library(massFlowR) ## Get the full path to the CDF files faahKO_files <- dir(system.file("cdf/WT", package = "faahKO"), full.names = TRUE)
kableExtra::kable_styling( knitr::kable(x = faahKO_files, format = "html", row.names = FALSE, col.names = ""), full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
The first stage in the pipeline is chromatographic peak detection and grouping via function groupPEAKS
.
Peaks are detected using the centWave algorithm from r Biocpkg("xcms")
package (see [@Tautenhahn2008]). Appropriate parameters for the LC-MS experiment must be selected. For advise on this, please see the official xcms
manual. Selected parameters must be built into a CentWaveParam class object:
## xcms parameters for peak-picking cwt_param <- xcms::CentWaveParam(ppm = 25, snthresh = 10, noise = 1000, prefilter = c(3, 100), peakwidth = c(30, 80), mzdiff = 0)
Detected peaks are put into groups, which comprise peaks originating from the same chemical compound: adducts and isotopes. For each peak in a sample, function groupPEAKS
:
igraph
package algorithm, see [@Raghavan2007]).Peaks that group into a community form a pseudo chemical spectra. Only communities with more than one peak are retained for further processing.
Function groupPEAKS
processes every LC-MS datafile independently and thus can be implemented in parallel, or during sample acquisition on the machine linked to the LC-MS. A list of paths to LC-MS datafiles (in mzML/NetCDF format) is feeded to groupPEAKS
, together with the CentWaveParam class object, path to output directory and parameters for parallelisation.
groupPEAKS
writes a csv with detected and grouped peaks in the selected directory for each LC-MS sample separately. The filenames of the generated csv files will be needed for the next stage in the pipeline. The filename starts with the original raw LC-MS filename and ends with "peakgrs.csv".
massFlowR
pipeline requires a metadata table with the following columns for each sample:
filename
specifies the basename of the raw LC-MS file.run_order
specifies the acquisition order for the corresponding LC-MS sample.raw_filepath
specifies the absolute path to the raw LC-MS file (netCDF/mzML).## define where processed datafiles should be written # out_directory <- "absolute_path_to_output_directory" ## create metadata table with required columns 'filename', 'run_order' and 'raw_filepath' metadata <- data.frame( filename = gsub(".CDF", "", basename(faahKO_files)), run_order = seq(length(faahKO_files)), raw_filepath = faahKO_files, stringsAsFactors = FALSE ) write.csv(metadata, file = file.path(out_directory, "metadata.csv"), row.names = FALSE)
kableExtra::kable_styling( knitr::kable(x = metadata, format = "html", row.names = FALSE), full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
## run peak detection and grouping for the listed faahKO files with two workers groupPEAKS(file = file.path(out_directory, "metadata.csv"), out_dir = out_directory, cwt = cwt_param, ncores = 2)
To align structurally-related peaks as a group across samples in LC-MS experiment, an alignment algorithm, which preserves the structural spectral information, is implemented.
Peaks are aligned by taking samples in the order of raw sample acquisition and matching them against a template. Template is list of all previously aligned peaks, which is updated with each sample by:
Therefore, template stores the moving averages of m/z and rt values.
For each peak in a sample, alignment algorithm:
Spectral similarity is measured by obtaining the cosine of the angle between two 2D vectors, representing each PCSs m/z and intensity values.
To enable peak alignment, previous metadata table has to contain additional column:
proc_filepath
specifies the absolute path to the csv files generated by the groupPEAKS
function.## update previous metadata table and add paths to generated csv files processed_files <- list.files(out_directory, "peakgrs.csv", full.names = TRUE) metadata$proc_filepath <- processed_files write.csv(metadata, file.path(out_directory, "metadata_grouped.csv"), row.names = FALSE)
kableExtra::kable_styling( knitr::kable(x = metadata, format = "html", row.names = FALSE), full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
To initiate peak alignment, use function buildTMP
, which constructs a massFlowTemplate
class object. massFlowTemplate
object stores sample alignment and annotation data and is updated with every sample. Define the desired error window for m/z and rt (seconds) values, which will be used for the whole experiment. mz_err = 0.01
and rt_err = 2
are recommended for high-resolution UPLC-MS data. mz_err = 0.01
and rt_err = 10
are suitable for the faahKO
package data.
Desired spectral similarity threshold can be selected with cutoff parameter (0-1, with 1 being perfectly identical). cutoff = 0.5
is recommended for high-resolution UPLC-MS data
## initiate template template <- buildTMP(file = file.path(out_directory, "metadata_grouped.csv"), out_dir = out_directory, mz_err = 0.01, rt_err = 10, cutoff = 0.5)
To review the samples that are in the experiment, use slot @samples
.
## review samples in the experiment using slot "samples" template@samples
## review alignment results for an individual sample, e.g. the second, using slot "data" dt <- template@samples kableExtra::kable_styling( knitr::kable(x = dt, format = "html", row.names = F), full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
massFlowTemplate
object stores the most up-to-date template in the @tmp
slot. Function buildTMP
creates a template using the first sample in the experiment.
## review gnerated template using slot "tmp" head(template@tmp, 20)
## review alignment results for an individual sample, e.g. the second, using slot "data" dt <- head(template@tmp, 20) kableExtra::kable_styling( knitr::kable(x = dt, format = "html", row.names = F), full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
To align peaks across all samples in the study, apply method alignPEAKS
. alignPEAKS
updates the massFlowTemplate
object:
Parameter ncores
allows a quicker implementation using the parallel backend that is available on the user's machine (i.e. multicore on Unix/Mac and snow on Windows). Select the desired number of parallel workers.
## align peaks across all remaining samples template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)
Aligned sample's peak tables are stored within the @data
slot, which lists tables for each sample separately.
## review alignment results for an individual sample, e.g. the second, using slot "data" head(template@data[[2]])
## review alignment results for an individual sample, e.g. the second, using slot "data" dt <- head(template@data[[2]]) kableExtra::kable_styling( knitr::kable(x = dt, format = "html", row.names = F), full_width = TRUE, bootstrap_options = c("striped", "hover", "responsive"), position = "left")
If alignment was stopped before all samples in the experiment could be aligned (for example, because of a computer issue), alignmment can be re-initiated by loading the samples that had already been processed. Function loadALIGNED
requires the absolute paths to the updated metadata file and the final template writen by alignPEAKS
. Note that same parameters must be provided as for the buildTMP
function.
## get the absolute paths to the updated metadata file and the final template writen by "alignPEAKS" m_file <- file.path(out_directory, "aligned.csv") tmp_file <- file.path(out_directory, "template.csv") ## load already aligned samples into a massFlowTemplate object template <- loadALIGNED(file = m_file, template = tmp_file, mz_err = 0.01, rt_err = 10, cutoff = 0.5)
## if some of the samples in the experiment have not been aligned yet, re-start alignment template@samples$aligned
if (any(template@samples$aligned == FALSE)) { template <- alignPEAKS(template, out_dir = out_directory, ncores = 2) }
Sample alignment can be performed in real-time during data acquisition by setting the buildTMP
function parameter to realtime = TRUE
. The function will wait for the intermediate peak-groups csv files to be written instead of stopping when a file that does not exist is encountered. Make sure, however, that correct filepaths are provided in the metadata csv file's column proc_filepath
.
template <- buildTMP(file = file.path(out_directory, "metadata_grouped.csv"), out_dir = out_directory, mz_err = 0.01, rt_err = 10, cutoff = 0.5, realtime = TRUE) template <- alignPEAKS(template, out_dir = out_directory, ncores = 2)
Once peaks are aligned across all samples, the obtained peak-groups are validated. Intensity values for each peak in a group are correlated across all samples. Correlation estimates are then used to build networks of peaks, that behave similarly across all samples. Peaks exhibiting a different pattern in their intensities are put into a new peak-group.
Each peak-group is a pseudo chemical spectra (PCS), which comprised peaks exhibiting consistent behaviour across samples.
To enable alignment validation, a metadata table and final template file that both were written by the alignPEAKS
function in the selected directory, must be used. A massFlowTemplate
class object, returned by the alignPEAKS
function will be needed. If alignment validation will be performed in a new R session, re-initiate the massFlowTemplate
class objec using loadALIGNED
function as before.
## get the absolute paths to the updated metadata file and the final template writen by "alignPEAKS" m_file <- file.path(out_directory, "aligned.csv") tmp_file <- file.path(out_directory, "template.csv") ## initiate validation by first loading aligned samples into a massFlowTemplate object template <- loadALIGNED(file = m_file, template = tmp_file)
Peak-group validation is enabled by applying the method validPEAKS
on the massFlowTemplate
class object. Validation can be implemented in parallel using ncores
parameter.
validPEAKS
will return a massFlowTemplate
class object with validated pseudo chemical spectra, as well as write peak tables for the obtained PCS:
## Start validation using a massFlowTemplate object template <- validPEAKS(template, out_dir = out_directory, ncores = 2, cor_thr = 0.5)
Final step in the pipeline is to re-integrate intensity values for peaks that were not detected by the centWave using raw LC-MS files. In contrast to xcms
package, m/z and rt values for intensity integration are estimated for each sample separetely. m/z and rt values are modelled using a cubic smoothing spline.
## Fill peaks using validated massFlowTemplate object template <- fillPEAKS(template, out_dir = out_directory, ncores = 2)
fillPEAKS
writes file 'filled_intensity_data.csv', which contains features metadata (including pseudo chemical spectra number) and intensity values for each sample in the experiment.
final_dt <- read.csv(file.path(out_directory, "filled_intensity_data.csv"), header = TRUE, stringsAsFactors = FALSE)
ggplot2::ggplot(final_dt, ggplot2::aes(x = rt, y = mz, color = as.factor(pcs))) + ggplot2::geom_point(size = 3, alpha = 0.8) + ggplot2::xlab("Retention time") + ggplot2::ylab("m/z") + ggplot2::scale_color_viridis_d(name = "Pseudo chemical spectra") + ggplot2::theme_bw(base_size = 14) + ggplot2::theme( legend.position = "bottom", panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank(), axis.line = ggplot2::element_line(size = 0.1) )
If in-house chemical reference database is available, PCS are annotated. For more details how to build a database file, see annotation using database).
unlink(list.files(path = out_directory, pattern = ".csv", full.names = T)) unlink(list.files(path = out_directory, pattern = ".RDS", full.names = T))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.