BiocStyle::markdown()
Package: r Biocpkg("genedataRutils")
Authors: r packageDescription("genedataRutils")[["Author"]]
Last modified: r file.info("genedataRutils.Rmd")$mtime
Compiled: r date()
library(BiocStyle)
Genedata Expressionist for MS allows to export results in different file formats. One of these formats is .gda
. This format is read by the Analyst module in Genedata Expressionist for MS. Beside the actual peak data it contains metadata like annotations of samples and peaks. The package genedataRutils
was developed to enable seamless reading of these files into R and allow further processing. This vignette describes the basic functions for reading and processing of MS1 data in form of .gda
files and MS2 data in form of .mgf
files.
The genedataRutils
package is available from GitHub and can be directly installed from there using the devtools::install_github()
function. Once installed the package can be loaded via library()
. This vignette uses additionally functionalities from the RforMassSpectrometry
packages for handling of MS2 data.
devtools::install_github("michaelwitting/genedataRutils")
After installation the library can be loaded.
# load required libraries library(genedataRutils) library(Spectra)
.gda
files are read via the readGda()
function. A single file is read and three data frames are created, which are returned as list. The first one is called ms_data
and contains the actual intensties or areas that were exported. The two other data frames are called col_anno
and row_anno
and contain the annotations of the rows and the columns. All data frames are using the IDs of the peaks, cluster etc as row names or sample names as column names or in the $File
variable.
The example below reads a short example distributed with the genedataRutils
package.
# load example file and read MS1 data ms1_cluster <- readGda(system.file("extdata/20191108_Pesticides_test_Cluster.gda", package = "genedataRutils")) ms1_cluster
The individual data portions can be accessed by the respective get
-methods. This returns the respective part as data.frame
.
# get data stored in ms1_cluster getMsData(ms1_cluster) getRowAnno(ms1_cluster) getColAnno(ms1_cluster)
Similar to get get
-methods, set
-methods are available. These can be used if the content needs to be changed, e.g. normalizing data or so. The set
-methods are checking of the supplied data.frame
is checking if row and column names are fitting.
# get data, changed it and return original ms_data <- getMsData(ms1_cluster) ms_data <- ms_data / max(ms_data) * 100 setMsData(ms1_cluster, ms_data)
#setMsData(ms1_cluster, ms_data[-1,]) #setMsData(ms1_cluster, ms_data[,-1])
In the frame of processing it might be required to remove specific samples or peaks. This can be achieved by the functions removeSamples
and removeCluster
.
removeSamples(ms1_cluster, "MSMSpos_0402_RP_2-A,1_01_15100") removeCluster(ms1_cluster, "Cluster_0311")
If changes made to the data in R shall be written, the writeGda
function generates a new .gda file from the data.
writeGda(ms1_cluster, file = "export.gda")
In order to enable correct identification of metabolites MS2 data is required. Genedata Expressionist for MS can export MS2 data as .mgf files. These can be loaded e.g. using the MsBackendMgf
. genedataRutils
requires spectra as Spectra
object.
library(Spectra) library(MsBackendMgf) # load MS2 data mgf_files <- dir(system.file("extdata", package = "genedataRutils"), pattern = "mgf$", full.names = TRUE) ms2_spectra <- Spectra(mgf_files, source = MsBackendMgf(), backend = MsBackendDataFrame())
The function ms2AddId()
adds the ID of the corresponding MS1 peak as new variable to the Spectra
object. It uses the peak boundaries in the row annotations and checks if a precursorMz
and rtime
are within this boundaries. The new variable is called CLUSTER_ID
.
# add MS1 id spectra <- ms2AddId(ms1_cluster, ms2_spectra) head(spectra$CLUSTER_ID)
Sometimes the peak boundaries are rather large and therefore include MS2 spectra which are actually not correct. ms2AddId2
offers an alternative by defining the maximum allowed distances in form of tolerance
and rtimeTolerance
from the reported m/z and retention time.
spectra_2 <- ms2AddId2(ms1_cluster, ms2_spectra, tolerance = 0.005, ppm = 0, rtimeTolerance = 0.1) head(spectra_2$CLUSTER_ID)
This new variable can be used to subset the Spectra
object.
spectra[which(spectra$CLUSTER_ID == "Cluster_0615")]
Genedata Expressionist for MS has no functionality to display the isotope pattern that have been grouped together. However, peaks and isotope clusters can be exported as separate .gda
files. These files can be used to reconstruct the isotope pattern from the grouping. The prequisite is that the files contain the respective cluster and peak IDs. This vignette shows based on example files how the isotope pattern can be reconstructed.
# load example file and read MS1 data ms1_peak <- readGda(system.file("extdata/20191108_Pesticides_test_Peaks.gda", package = "genedataRutils")) ms1_peak
The MS1 spectra are reconstructed for each file and cluster. This requires the row annotation data frame for the peaks and cluster as well as the intensities of the individual peaks. The function reconstructIsoPattern
combines this information and returns a Spectra
object which contains the MS1 spectrum for each cluster in each sample. The example used here contains four clusters and four samples. Therefore, 16 spectra are returned. The code can be parallelized by registering a parallel cluster. This is strongly advisable, since for a large number of cluster and samples long run times are expected.
# reconstruct MS1 spectra for Fluopicolide isoPattern <- reconstructIsoPattern(ms1_peak, ms1_cluster) head(isoPattern)
Isotope pattern can be plotted using the plotSpectra
function from the Spectra
package.
plotSpectra(isoPattern, main = isoPattern$CLUSTER_ID)
Since often only a single representative isotope pattern is required, the individual spectra can be combined using the combineSpectra
function from the Spectra
package.
isoPatternComb <- combineSpectra(isoPattern, f = isoPattern$CLUSTER_ID, intensityFun = base::sum, mzd = 0.005) plotSpectra(isoPatternComb, main = isoPatternComb$CLUSTER_ID)
Finally, MS1 and MS2 spectra can be written .mgf files for future use. The export
function from the MsBackendMgf
package supports also the export of custom variables such as $CLUSTER_ID
.
export(ms2_spectra, backend = MsBackendMgf(), file = "MS2_spectra.mgf") export(isoPatternComb, backend = MsBackendMgf(), file = "MS1_spectra.mgf")
The Sirius software from the Boecker group offers the possibility of formula calculation and MS2 data analysis. The software can be downloaded from here. To be used optimal the isotope pattern as well as fragmentation spectra need to be delivered to the software. This is done best in the .ms format defined by the Boecker group.
For creation of a .ms file a MS1 spectrum with the isotope pattern and minimum one MS2 spectrum are required. Both have to be stored in a Spectra
object and contain the metadata columns CLUSTER_ID
and RAWFILE
. Furthermore, the values in CLUSTER_ID
in the MS1 and MS2 data have to match. Lastly, the row annotation of the clusters are required to isolate the averaged precursor m/z for writing of the file. The example code below fetches all the cluster ids from the MS1 spectra data and iterates over it. A .ms
file is only written if minimum one MS2 spectrum is available. Data can be either written to one single file or an individual file per peak.
# write single file writeSiriusFile(isoPatternComb, spectra, singleFile = TRUE) # write multiple files writeSiriusFile(isoPatternComb, spectra)
genedataRutils
genedataRutils
enables the annotation on the MS1 and MS2 level. MS1 compound libraries have to be supplied as data.frame
. The minimal information in this data.frame
are the following colummns: name
, adduct
and mz
.
ms1library_file <- system.file("extdata/pesticideLibMs1.tsv", package = "genedataRutils") ms1library <- read.table(ms1library_file, sep = "\t", header = TRUE, comment.char = "") adducts <- c("[M+H]+") annotateMz(ms1_cluster, ms1library, tolerance = 0.005, adducts = adducts)
Beyond MS1 matching, library searching for MS2 is possible. MS2 libraries have to be present as Spectra
. Comparison is based on compareSpectra
and returns the forward and reverse dotproduct. The Spectra
object needs to contain the following columns: accession
, name
, exactmass
, adduct
, precursorMz
. A data.frame
with all the results is returned. In the example below before matching peaks below 1% of the based peak removed to improve results.
library(MsBackendMassbank) # load MS2 data massbank_files <- dir(system.file("extdata", package = "genedataRutils"), pattern = ".txt$", full.names = TRUE) ms2library <- Spectra(massbank_files, source = MsBackendMassbank(), backend = MsBackendDataFrame()) # clean up MS2 spectra a bit my_filter <- function(x) { x > max(x, na.rm = TRUE) / 100 } # filter peaks below 1% of base peak spectra_filtered <- filterIntensity(spectra, intensity = my_filter) lib_results <- compareSpectraLibrary(spectra_filtered, ms2library, tolerance = 0.005) head(lib_results)
If the identification of metabolites was successful, the results can be mapped back to the row annotations. A data.frame
with the cluster or peak ids as row names is required. Additionally, the columns \$DBID
, \$Formula
, \$SMILES
, \$InChI
, \$Name
and \$MSI
are required. If any information of them is not available they should be filled with NA
. The column \$MSI
refers to the level of identification defined by the Metabolomics Standard Initiative.
metids <- read.table(system.file("extdata/metIds.tsv", package = "genedataRutils"), header = TRUE, row.names = 1, sep = "\t") head(metids)
Using the function addMetid
the metabolite identifications are added to the row annotations.
ms1_cluster_new <- addMetid(ms1_cluster, metids) head(getRowAnno(ms1_cluster_new))
After processing the data can be written to different formats. One is the original .gda format for Genedata Expressionst. Furthermore, export into a prototype MAF file for the Metabolights data repository. However this is currently only a prototype and will be expanded further.
writeGda(ms1_cluster_new) createMtblsMetaboliteFile(ms1_cluster, "ID0001") createMtblsMetaboliteFile(ms1_cluster_new, "ID0002")
Additionally to the MAF file the sample can be created.
createMtblsSampleFile(ms1_cluster_new, "ID0002")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.