BiocStyle::markdown()

Package: r Biocpkg("genedataRutils")
Authors: r packageDescription("genedataRutils")[["Author"]]
Last modified: r file.info("genedataRutils.Rmd")$mtime
Compiled: r date()

library(BiocStyle)

Introduction

Genedata Expressionist for MS allows to export results in different file formats. One of these formats is .gda. This format is read by the Analyst module in Genedata Expressionist for MS. Beside the actual peak data it contains metadata like annotations of samples and peaks. The package genedataRutils was developed to enable seamless reading of these files into R and allow further processing. This vignette describes the basic functions for reading and processing of MS1 data in form of .gda files and MS2 data in form of .mgf files.

Loading the library

The genedataRutils package is available from GitHub and can be directly installed from there using the devtools::install_github() function. Once installed the package can be loaded via library(). This vignette uses additionally functionalities from the RforMassSpectrometry packages for handling of MS2 data.

devtools::install_github("michaelwitting/genedataRutils")

After installation the library can be loaded.

# load required libraries
library(genedataRutils)
library(Spectra)

Reading, processing and writing MS1 data

.gda files are read via the readGda() function. A single file is read and three data frames are created, which are returned as list. The first one is called ms_data and contains the actual intensties or areas that were exported. The two other data frames are called col_anno and row_anno and contain the annotations of the rows and the columns. All data frames are using the IDs of the peaks, cluster etc as row names or sample names as column names or in the $File variable. The example below reads a short example distributed with the genedataRutils package.

# load example file and read MS1 data
ms1_cluster <- readGda(system.file("extdata/20191108_Pesticides_test_Cluster.gda",
                            package = "genedataRutils"))

ms1_cluster

The individual data portions can be accessed by the respective get-methods. This returns the respective part as data.frame.

# get data stored in ms1_cluster
getMsData(ms1_cluster)
getRowAnno(ms1_cluster)
getColAnno(ms1_cluster)

Similar to get get-methods, set-methods are available. These can be used if the content needs to be changed, e.g. normalizing data or so. The set-methods are checking of the supplied data.frame is checking if row and column names are fitting.

# get data, changed it and return original
ms_data <- getMsData(ms1_cluster)
ms_data <- ms_data / max(ms_data) * 100
setMsData(ms1_cluster, ms_data)
#setMsData(ms1_cluster, ms_data[-1,])
#setMsData(ms1_cluster, ms_data[,-1])

In the frame of processing it might be required to remove specific samples or peaks. This can be achieved by the functions removeSamples and removeCluster.

removeSamples(ms1_cluster, "MSMSpos_0402_RP_2-A,1_01_15100")
removeCluster(ms1_cluster, "Cluster_0311")

If changes made to the data in R shall be written, the writeGda function generates a new .gda file from the data.

writeGda(ms1_cluster, file = "export.gda")

Linking MS1 and MS2 data

In order to enable correct identification of metabolites MS2 data is required. Genedata Expressionist for MS can export MS2 data as .mgf files. These can be loaded e.g. using the MsBackendMgf. genedataRutils requires spectra as Spectra object.

library(Spectra)
library(MsBackendMgf)

# load MS2 data
mgf_files <- dir(system.file("extdata",
                               package = "genedataRutils"),
                   pattern = "mgf$",
                   full.names = TRUE)

ms2_spectra <- Spectra(mgf_files,
                   source = MsBackendMgf(),
                   backend = MsBackendDataFrame())

The function ms2AddId() adds the ID of the corresponding MS1 peak as new variable to the Spectra object. It uses the peak boundaries in the row annotations and checks if a precursorMz and rtime are within this boundaries. The new variable is called CLUSTER_ID.

# add MS1 id
spectra <- ms2AddId(ms1_cluster, ms2_spectra)
head(spectra$CLUSTER_ID)

Sometimes the peak boundaries are rather large and therefore include MS2 spectra which are actually not correct. ms2AddId2 offers an alternative by defining the maximum allowed distances in form of tolerance and rtimeTolerance from the reported m/z and retention time.

spectra_2 <- ms2AddId2(ms1_cluster, ms2_spectra, tolerance = 0.005, ppm = 0, rtimeTolerance = 0.1)
head(spectra_2$CLUSTER_ID)

This new variable can be used to subset the Spectra object.

spectra[which(spectra$CLUSTER_ID == "Cluster_0615")]

Reconstruction of isotope pattern

Genedata Expressionist for MS has no functionality to display the isotope pattern that have been grouped together. However, peaks and isotope clusters can be exported as separate .gda files. These files can be used to reconstruct the isotope pattern from the grouping. The prequisite is that the files contain the respective cluster and peak IDs. This vignette shows based on example files how the isotope pattern can be reconstructed.

# load example file and read MS1 data
ms1_peak <- readGda(system.file("extdata/20191108_Pesticides_test_Peaks.gda",
                            package = "genedataRutils"))

ms1_peak

The MS1 spectra are reconstructed for each file and cluster. This requires the row annotation data frame for the peaks and cluster as well as the intensities of the individual peaks. The function reconstructIsoPattern combines this information and returns a Spectra object which contains the MS1 spectrum for each cluster in each sample. The example used here contains four clusters and four samples. Therefore, 16 spectra are returned. The code can be parallelized by registering a parallel cluster. This is strongly advisable, since for a large number of cluster and samples long run times are expected.

# reconstruct MS1 spectra for Fluopicolide
isoPattern <- reconstructIsoPattern(ms1_peak, ms1_cluster)
head(isoPattern)

Isotope pattern can be plotted using the plotSpectra function from the Spectra package.

plotSpectra(isoPattern, main = isoPattern$CLUSTER_ID)

Since often only a single representative isotope pattern is required, the individual spectra can be combined using the combineSpectra function from the Spectra package.

isoPatternComb <- combineSpectra(isoPattern,
                                 f = isoPattern$CLUSTER_ID,
                                 intensityFun = base::sum,
                                 mzd = 0.005)

plotSpectra(isoPatternComb, main = isoPatternComb$CLUSTER_ID)

Finally, MS1 and MS2 spectra can be written .mgf files for future use. The export function from the MsBackendMgf package supports also the export of custom variables such as $CLUSTER_ID.

export(ms2_spectra, backend = MsBackendMgf(), file = "MS2_spectra.mgf")
export(isoPatternComb, backend = MsBackendMgf(), file = "MS1_spectra.mgf")

Creating of Sirius .ms files

The Sirius software from the Boecker group offers the possibility of formula calculation and MS2 data analysis. The software can be downloaded from here. To be used optimal the isotope pattern as well as fragmentation spectra need to be delivered to the software. This is done best in the .ms format defined by the Boecker group. For creation of a .ms file a MS1 spectrum with the isotope pattern and minimum one MS2 spectrum are required. Both have to be stored in a Spectra object and contain the metadata columns CLUSTER_ID and RAWFILE. Furthermore, the values in CLUSTER_ID in the MS1 and MS2 data have to match. Lastly, the row annotation of the clusters are required to isolate the averaged precursor m/z for writing of the file. The example code below fetches all the cluster ids from the MS1 spectra data and iterates over it. A .ms file is only written if minimum one MS2 spectrum is available. Data can be either written to one single file or an individual file per peak.

# write single file
writeSiriusFile(isoPatternComb, spectra, singleFile = TRUE)

# write multiple files
writeSiriusFile(isoPatternComb, spectra)

Metabolite identification with genedataRutils

genedataRutils enables the annotation on the MS1 and MS2 level. MS1 compound libraries have to be supplied as data.frame. The minimal information in this data.frame are the following colummns: name, adduct and mz.

ms1library_file <- system.file("extdata/pesticideLibMs1.tsv", package = "genedataRutils")

ms1library <- read.table(ms1library_file, sep = "\t", header = TRUE, comment.char = "")

adducts <- c("[M+H]+")

annotateMz(ms1_cluster,
           ms1library,
           tolerance = 0.005,
           adducts = adducts)

Beyond MS1 matching, library searching for MS2 is possible. MS2 libraries have to be present as Spectra. Comparison is based on compareSpectra and returns the forward and reverse dotproduct. The Spectra object needs to contain the following columns: accession, name, exactmass, adduct, precursorMz. A data.frame with all the results is returned. In the example below before matching peaks below 1% of the based peak removed to improve results.

library(MsBackendMassbank)

# load MS2 data
massbank_files <- dir(system.file("extdata",
                                  package = "genedataRutils"),
                      pattern = ".txt$",
                      full.names = TRUE)

ms2library <- Spectra(massbank_files,
                      source = MsBackendMassbank(),
                      backend = MsBackendDataFrame())

# clean up MS2 spectra a bit
my_filter <- function(x) {
    x > max(x, na.rm = TRUE) / 100
}

# filter peaks below 1% of base peak
spectra_filtered <- filterIntensity(spectra, intensity = my_filter)

lib_results <- compareSpectraLibrary(spectra_filtered,
                                     ms2library,
                                     tolerance = 0.005)

head(lib_results)

Adding Metabolite identifications

If the identification of metabolites was successful, the results can be mapped back to the row annotations. A data.frame with the cluster or peak ids as row names is required. Additionally, the columns \$DBID, \$Formula, \$SMILES, \$InChI, \$Name and \$MSI are required. If any information of them is not available they should be filled with NA. The column \$MSI refers to the level of identification defined by the Metabolomics Standard Initiative.

metids <- read.table(system.file("extdata/metIds.tsv",
                                 package = "genedataRutils"),
                     header = TRUE,
                     row.names = 1,
                     sep = "\t")

head(metids)

Using the function addMetid the metabolite identifications are added to the row annotations.

ms1_cluster_new <- addMetid(ms1_cluster, metids)
head(getRowAnno(ms1_cluster_new))

Writing of results to differnt formats

After processing the data can be written to different formats. One is the original .gda format for Genedata Expressionst. Furthermore, export into a prototype MAF file for the Metabolights data repository. However this is currently only a prototype and will be expanded further.

writeGda(ms1_cluster_new)

createMtblsMetaboliteFile(ms1_cluster, "ID0001")
createMtblsMetaboliteFile(ms1_cluster_new, "ID0002")

Additionally to the MAF file the sample can be created.

createMtblsSampleFile(ms1_cluster_new, "ID0002")

Session info

sessionInfo()


michaelwitting/genedataRutils documentation built on April 30, 2021, 5:11 p.m.