Parallel Annotation

BiocStyle::markdown()
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

Package: r Biocpkg("peakPantheR")
Authors: Arnaud Wolfer, Goncalo Correia

## Silently loading all packages
library(BiocStyle)
library(peakPantheR)
library(faahKO)
library(pander)
library(doParallel)
library(foreach)

Introduction

The peakPantheR package is designed for the detection, integration and reporting of pre-defined features in MS files (e.g. compounds, fragments, adducts, ...).

The Parallel Annotation is set to detect and integrate multiple compounds in multiple files in parallel and store results in a single object. It can be employed to integrate a large number of expected features across a dataset.

Using the r Biocpkg("faahKO") raw MS dataset as an example, this vignette will:

Abbreviations

Parallel Annotation Concept

Parallel compound integration is set to process multiple compounds in multiple files in parallel, and store results in a single object.

knitr::include_graphics("../man/figures/parallelAnnotation.png")

To achieve this, peakPantheR will:

  1. load a list of expected RT / m/z ROI and a list of files to process
  2. initialise an output object with expected ROI and file paths
  3. first pass (without peak filling) on a subset of representative samples (e.g QC samples):
    • for each file, detect features in each ROI and keep highest intensity
    • determine peak statistics for each feature
    • store results + EIC for each ROI
  4. visual inspection of first pass results, update ROI:
    • diagnostic plots: all EICs, peak apex RT / m/z & peak width evolution
    • correct ROI (remove interfering feature, correct RT shift)
    • define fallback integration regions (FIR) if no feature is detected (median RT / m/z start and end of found features)
  5. initialise a new output object, with updated regions of interest (uROI) and fallback integration regions (FIR), with all samples
  6. second pass (with peak filling) on all samples:
    • for each file, detect features in each uROI and keep highest intensity
    • determine peak statistics for each feature
    • integrate FIR when no peaks are found
    • store results + EIC for each uROI
  7. summary statistics:
    • plot EICs, apex and peakwidth evolution
    • compare first and second pass
  8. return the resulting object and/or table (row: file, col: compound)
knitr::include_graphics("../man/figures/parallelAnnotation_procedure.png")

Diagram of the workflow and functions used for parallel annotation.

Parallel Annotation Example

We can target 2 pre-defined features in 6 raw MS spectra file from the r Biocpkg("faahKO") package using peakPantheR_parallelAnnotation(). For more details on the installation and input data employed, please consult the Getting Started with peakPantheR vignette.

Input Data

First the paths to 3 MS file from the r Biocpkg("faahKO") are located and used as input spectras. In this example these 3 samples are considered as representative of the whole run (e.g. Quality Control samples):

library(faahKO)
## file paths
input_spectraPaths  <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"),
                        system.file('cdf/KO/ko16.CDF', package = "faahKO"),
                        system.file('cdf/KO/ko18.CDF', package = "faahKO"))
input_spectraPaths

Two targeted features (e.g. compounds, fragments, adducts, ...) are defined and stored in a table with as columns:

# targetFeatTable
input_targetFeatTable <- data.frame(matrix(vector(), 2, 8, dimnames=list(c(), 
                        c("cpdID", "cpdName", "rtMin", "rt", "rtMax", "mzMin", 
                            "mz", "mzMax"))), stringsAsFactors=FALSE)
input_targetFeatTable[1,] <- c("ID-1", "Cpd 1", 3310., 3344.888, 3390., 
                                522.194778, 522.2, 522.205222)
input_targetFeatTable[2,] <- c("ID-2", "Cpd 2", 3280., 3385.577, 3440., 
                                496.195038, 496.2, 496.204962)
input_targetFeatTable[,c(3:8)] <- sapply(input_targetFeatTable[,c(3:8)], 
                                        as.numeric)
# use pandoc for improved readability
input_targetFeatTable <- data.frame(matrix(vector(), 2, 8, dimnames=list(c(), 
                        c("cpdID", "cpdName", "rtMin", "rt", "rtMax", "mzMin", 
                        "mz", "mzMax"))), stringsAsFactors=FALSE)
input_targetFeatTable[1,] <- c("ID-1", "Cpd 1", 3310., 3344.888, 3390., 
                                522.194778, 522.2, 522.205222)
input_targetFeatTable[2,] <- c("ID-2", "Cpd 2", 3280., 3385.577, 3440., 
                                496.195038, 496.2, 496.204962)
input_targetFeatTable[,c(3:8)] <- sapply(input_targetFeatTable[,c(3:8)], 
                                        as.numeric)
rownames(input_targetFeatTable) <- NULL
pander::pandoc.table(input_targetFeatTable, digits = 9)

Additional compound and spectra metadata can be provided but isn't employed during the fitting procedure:

# spectra Metadata
input_spectraMetadata  <- data.frame(matrix(c("sample type 1", "sample type 2", 
                            "sample type 1"), 3, 1, 
                            dimnames=list(c(),c("sampleType"))),
                            stringsAsFactors=FALSE)
# use pandoc for improved readability
input_spectraMetadata  <- data.frame(matrix(c("sample type 1", "sample type 2", 
                                            "sample type 1"), 3, 1, 
                                            dimnames=list(c(),c("sampleType"))),
                                    stringsAsFactors=FALSE)
pander::pandoc.table(input_spectraMetadata)

Initialise and Run Parallel Annotation

A peakPantheRAnnotation object is first initialised with the path to the files to process (spectraPaths), features to integrate (targetFeatTable) and additional information and parameters such as spectraMetadata, uROI, FIR and if they should be used (useUROI=TRUE, useFIR=TRUE):

library(peakPantheR)
init_annotation <- peakPantheRAnnotation(spectraPaths = input_spectraPaths,
                        targetFeatTable = input_targetFeatTable,
                        spectraMetadata = input_spectraMetadata)

The resulting peakPantheRAnnotation object is not annotated, does not contain and use uROI and FIR

init_annotation

peakPantheR_parallelAnnotation() will run the annotation across files in parallel (if ncores >0) and return the successful annotations (result$annotation) and failures (result$failures):

# annotate files serially
annotation_result <- peakPantheR_parallelAnnotation(init_annotation, ncores=0,
                                                    curveModel='skewedGaussian',
                                                    verbose=TRUE)

# successful fit
nbSamples(annotation_result$annotation)
data_annotation   <- annotation_result$annotation
data_annotation

# list failed fit
annotation_result$failures

Process Parallel Annotation Results

Based on the fit results, updated ROI (uROI) and fallback integration region (FIR) can be automatically determined using annotationParamsDiagnostic():

updated_annotation  <- annotationParamsDiagnostic(data_annotation, verbose=TRUE)

# uROI now exist
updated_annotation

outputAnnotationDiagnostic() will save to disk annotationParameters_summary.csv containing the original ROI and newly determined uROI and FIR for manual validation. Additionnaly a diagnostic plot for each compound is saved for reference and can be generated in parallel with the argument ncores:

# create a colourScale based on the sampleType
uniq_sType <- sort(unique(spectraMetadata(updated_annotation)$sampleType),
                    na.last=TRUE)
col_sType  <- unname( setNames(c('blue', 'red'),
                c(uniq_sType))[spectraMetadata(updated_annotation)$sampleType] )

# create a temporary location to save the diagnotic (otherwise provide the path
# to the selected location)
output_folder <- tempdir()

# output fit diagnostic to disk
outputAnnotationDiagnostic(updated_annotation, saveFolder=output_folder, 
                            savePlots=TRUE, sampleColour=col_sType, 
                            verbose=TRUE, ncores=2)

The data saved in annotationParameters_summary.csv is as follow:

# use pandoc for improved readability, display the diagnostic results
tmp_csv <- data.frame(matrix(nrow=2,ncol=21,dimnames=list(c(), c('cpdID', 
        'cpdName', 'X', 'ROI_rt', 'ROI_mz','ROI_rtMin', 'ROI_rtMax', 
        'ROI_mzMin', 'ROI_mzMax', 'X', 'uROI_rtMin', 'uROI_rtMax', 'uROI_mzMin',
        'uROI_mzMax', 'uROI_rt', 'uROI_mz', 'X', 'FIR_rtMin', 'FIR_rtMax', 
        'FIR_mzMin', 'FIR_mzMax'))), stringsAsFactors=FALSE)
tmp_csv[1,] <- c('ID-1','Cpd 1', '|', 3344.888, 522.2, 3310., 3390., 522.194778,
                522.205222,'|', 3305.75893, 3411.436284, 522.194778, 522.205222,
                3344.888, 522.2, '|', 3326.10635, 3407.272648, 522.194778, 
                522.205222)
tmp_csv[2,] <- c('ID-2','Cpd 2', '|', 3385.577, 496.2, 3280., 3440., 496.195038,
                496.204962,'|',3337.376665, 3462.449033, 496.195038, 496.204962,
                3385.577, 496.2, '|', 3365.023857, 3453.404957, 496.195038, 
                496.204962)
tmp_csv[,-c(1,2,3,10,17)]  <- sapply(tmp_csv[,-c(1,2,3,10,17)], as.numeric)
colnames(tmp_csv) <- c('cpdID', 'cpdName', 'X', 'ROI_rt', 'ROI_mz','ROI_rtMin', 
                    'ROI_rtMax', 'ROI_mzMin', 'ROI_mzMax', 'X', 'uROI_rtMin', 
                    'uROI_rtMax', 'uROI_mzMin', 'uROI_mzMax', 'uROI_rt', 
                    'uROI_mz', 'X', 'FIR_rtMin', 'FIR_rtMax', 'FIR_mzMin', 
                    'FIR_mzMax')
pander::pandoc.table(tmp_csv, digits=9)
knitr::include_graphics(
    "../man/figures/parallel_annotation_diagnostic_cpd1.png")

Diagnostic plot for compound 1: The top panel is an overlay of the extracted EIC across all samples with the fitted curve as dotted line. The panel under the EIC represent each found peak RT peakwidth (rtMin, rtMax and apex marked as dot), ordered with the first sample at the top. The bottom 3 panels represent found RT (peakwidth), m/z (peakwidth) and peak area by run order, with the corresponding histograms to the right

ROI exported to .csv can be updated based on the diagnostic plots; uROI (updated ROI potentially used for all samples) and FIR (fallback integration regions for when no peak is found) can also be tweaked to better fit the peaks.

Retention time correction

The optional retentionTimeCorrection() method provides an interface to adjust the expected ROI rt values and account for chromatographic batch effects. By comparing expected and found rt values for a set of reference compounds, a model of the chromatographic shift for the present batch can be established. This model can be in turned used to correct the expected retention time of all targeted compounds. In order to apply this method, the peakPantheRAnnotation must be previously annotated (isAnnotated=TRUE). The retention time correction algorithm to use can be selected using the method argument (currently polynomial and constant methods are available). retentionTimeCorrection() fits a correction function to model the dependency of the mean rt_dev_sec per reference feature with the expected databased retention time. If useUROI=TRUE, the expected retention time value is taken from the UROI_rt field, otherwise ROI_rt is used. If robust=TRUE, the RANSAC algorithm is used to automatically detect outliers and exclude them from the fit (this should only be used with a large number of reference features). retentionTimeCorrection() returns a list with 2 elements: a modified peakPantheRAnnotation object a ggplot2 diagnostic plot (optional, depending on whether TRUE or FALSE is passed to the diagnostic argument). The returned peakPantheRAnnotation object contains the same uROI and FIR mz values as the original annotation, but the retention time related parameters (rt, rtMin and rtMax) are replaced by the adjusted values. The rtMax and rtMin are set as the corrected rt value plus or minus half the value passed to the rtWindowWidth argument, respectively. useUROI is also set to TRUE. To continue with the workflow, simply set a new annotation object with the fit parameters established by retentionTimeCorrection() and call peakPantheR_parallelAnnotation() for the final annotation.

# Example with constant correction.
rtCorrectionOutput <- retentionTimeCorrection(updated_annotation,
                            rtCorrectionReferences=c('ID-1'),
                            method='constant',
                            robust=FALSE,
                            rtWindowWidth=15,
                            diagnostic=TRUE)
updated_annotation <- rtCorrectionOutput$annotation
# The ggplot2 plot object
rtCorrectionOutput$plot

# Example with second degree polynomial, without using RANSAC
# # to obtain a robust fit
rtCorrectionOutput <- retentionTimeCorrection(updated_annotation,
                            rtCorrectionReferences=NULL,
                            method='polynomial',
                            params=list(polynomialOrder=2),
                            robust=FALSE, rtWindowWidth=15,
                            diagnostic=TRUE)

New Initialisation with Updated Parameters to be Applied to All Study Samples

Following this manual validation of the fit on reference samples, the modified parameters in the .csv file can be reloaded and applied to all study samples.

Load new fit parameters

peakPantheR_loadAnnotationParamsCSV() will load the new .csv parameters (as generated by outputAnnotationDiagnostic()) and initialise a peakPantheRAnnotation object without spectraPaths, spectraMetadata or cpdMetadata which will need to be added before annotation. useUROI and useFIR are set to FALSE by default and will need to be modified according to the analysis to run. uROIExist is established depending on the .csv uROI column, and will only be set to TRUE if no NA are present. It is possible to reset the FIR values with the uROI windows using resetFIR().

update_csv_path <- '/path_to_new_csv/'

# load csv
new_annotation <- peakPantheR_loadAnnotationParamsCSV(update_csv_path)
#> uROIExist set to TRUE
#> New peakPantheRAnnotation object initialised for 2 compounds

new_annotation
#> An object of class peakPantheRAnnotation
#>  2 compounds in 0 samples. 
#>   updated ROI exist (uROI)
#>   does not use updated ROI (uROI)
#>   does not use fallback integration regions (FIR)
#>   is not annotated

new_annotation <- resetFIR(new_annotation)
#> FIR will be reset with uROI values

Add new samples to process

Now that the fit parameters were set on 3 representative samples (e.g. QC), the same processing can be applied to all study samples. resetAnnotation() will reinitialise all the results and modify the samples or compounds targeted as required:

## new files
new_spectraPaths   <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"),
                        system.file('cdf/WT/wt15.CDF', package = "faahKO"),
                        system.file('cdf/KO/ko16.CDF', package = "faahKO"),
                        system.file('cdf/WT/wt16.CDF', package = "faahKO"),
                        system.file('cdf/KO/ko18.CDF', package = "faahKO"),
                        system.file('cdf/WT/wt18.CDF', package = "faahKO"))

new_spectraPaths

Below we define the metadata of these new samples:

## new spectra metadata
new_spectraMetadata  <- data.frame(matrix(c("KO", "WT", "KO", "WT", "KO", "WT"),
                                        6, 1, dimnames=list(c(), c("Group"))), 
                                    stringsAsFactors=FALSE)
# use pandoc for improved readability
new_spectraMetadata  <- data.frame(matrix(c("KO", "WT", "KO", "WT", "KO", "WT"),
                                        6, 1, dimnames=list(c(), c("Group"))), 
                                    stringsAsFactors=FALSE)
pander::pandoc.table(new_spectraMetadata)
new_annotation <- resetAnnotation(updated_annotation, 
                                spectraPaths=new_spectraPaths, 
                                spectraMetadata=new_spectraMetadata, 
                                useUROI=TRUE, useFIR=TRUE, verbose=FALSE) 
## add new samples to the annotation loaded from csv, useUROI, useFIR

new_annotation <- resetAnnotation(new_annotation, spectraPaths=new_spectraPaths,
                                spectraMetadata=new_spectraMetadata, 
                                useUROI=TRUE, useFIR=TRUE)
#> peakPantheRAnnotation object being reset:
#>   Previous "ROI", "cpdID" and "cpdName" value kept
#>   Previous "uROI" value kept
#>   Previous "FIR" value kept
#>   Previous "cpdMetadata" value kept
#>   New "spectraPaths" value set
#>   New "spectraMetadata" value set
#>   Previous "uROIExist" value kept
#>   New "useUROI" value set
#>   New "useFIR" value set
new_annotation

Run Final Parallel Annotation

We can now run the final annotation on all samples with the optimised targeted features:

# annotate files serially
new_annotation_result <- peakPantheR_parallelAnnotation(new_annotation, 
                                                        ncores=0, verbose=FALSE)

# successful fit
nbSamples(new_annotation_result$annotation)

final_annotation      <- new_annotation_result$annotation
final_annotation

# list failed fit
new_annotation_result$failures

Output final results

The final fits can be saved to disk with outputAnnotationDiagnostic():

# create a colourScale based on the sampleType
uniq_group <- sort(unique(spectraMetadata(final_annotation)$Group),na.last=TRUE)
col_group  <- unname( setNames(c('blue', 'red'),
                    c(uniq_sType))[spectraMetadata(final_annotation)$Group] )

# create a temporary location to save the diagnotic (otherwise provide the path
# to the selected location)
final_output_folder <- tempdir()

# output fit diagnostic to disk
outputAnnotationDiagnostic(final_annotation, saveFolder=final_output_folder,
                        savePlots=TRUE, sampleColour=col_group, verbose=TRUE)

For each processed sample, a peakTables contains all the fit information for all compounds targeted. annotationTable( , column) will group the values across all samples and compounds for any peakTables column:

# peakTables for the first sample
peakTables(final_annotation)[[1]]
# use pandoc for improved readability
pander::pandoc.table(peakTables(final_annotation)[[1]])
# Extract the found peak area for all compounds and all samples
annotationTable(final_annotation, column='peakArea')
# use pandoc for improved readability
pander::pandoc.table(annotationTable(final_annotation, column='peakArea'))

Finally all annotation results can be saved to disk as .csv with outputAnnotationResult(). These .csv will contain the compound metadata, spectra metadata and a file for each column of peakTables (with samples as rows and compounds as columns):

# create a temporary location to save the diagnotic (otherwise provide the path
# to the selected location)
final_output_folder <- tempdir()

# save
outputAnnotationResult(final_annotation, saveFolder=final_output_folder, 
                        annotationName='ProjectName', verbose=TRUE)
#> Compound metadata saved at /final_output_folder/ProjectName_cpdMetadata.csv
#> Spectra metadata saved at 
#>     /final_output_folder/ProjectName_spectraMetadata.csv
#> Peak measurement "found" saved at /final_output_folder/ProjectName_found.csv
#> Peak measurement "rtMin" saved at /final_output_folder/ProjectName_rtMin.csv
#> Peak measurement "rt" saved at /final_output_folder/ProjectName_rt.csv
#> Peak measurement "rtMax" saved at /final_output_folder/ProjectName_rtMax.csv
#> Peak measurement "mzMin" saved at /final_output_folder/ProjectName_mzMin.csv
#> Peak measurement "mz" saved at /final_output_folder/ProjectName_mz.csv
#> Peak measurement "mzMax" saved at /final_output_folder/ProjectName_mzMax.csv
#> Peak measurement "peakArea" saved at 
#>     /final_output_folder/ProjectName_peakArea.csv
#> Peak measurement "maxIntMeasured" saved at 
#>     /final_output_folder/ProjectName_maxIntMeasured.csv
#> Peak measurement "maxIntPredicted" saved at 
#>     /final_output_folder/ProjectName_maxIntPredicted.csv
#> Peak measurement "is_filled" saved at 
#>     /final_output_folder/ProjectName_is_filled.csv
#> Peak measurement "ppm_error" saved at 
#>     /final_output_folder/ProjectName_ppm_error.csv
#> Peak measurement "rt_dev_sec" saved at 
#>     /final_output_folder/ProjectName_rt_dev_sec.csv
#> Peak measurement "tailingFactor" saved at 
#>     /final_output_folder/ProjectName_tailingFactor.csv
#> Peak measurement "asymmetryFactor" saved at 
#>     /final_output_folder/ProjectName_asymmetryFactor.csv
#> Summary saved at /final_output_folder/ProjectName_summary.csv

See Also



Try the peakPantheR package in your browser

Any scripts or data that you put into this service are public.

peakPantheR documentation built on Nov. 8, 2020, 6:38 p.m.