MS.DataCreation: an initial data from GC-MS analyses by collecting and...
In MSeasy: Preprocessing of Gas Chromatography-Mass Spectrometry (GC-MS) data


This function constructs a global matrix called initial_DATA.txt by collecting and assembling the information from the chromatograms and mass spectra of several GC-MS analyses. It performs basic peak detection if the input files are in ASCII format. For other input files, peak retention times (or retention indices) are retrieved from the chromatograms (peaklist.txt or rteres.txt files) and associated with their respective mass spectra (AIA/ANDI NetCDF, mzXML, mzData and mzML files). Each row of the output matrix represents one peak in one analysis and reports the sample name in the first column, the peak retention time (or retention index) in the second column and the mass spectrum of the peak in the following columns. If the input files are in Agilent format, two quantification measures of peak size can be extracted directly from rteres.txt: the corrected area is inserted in column 3 and the percent of the total corrected area in column 4 of initial_DATA.txt. If the input files are CDF, one or two quantification measures of peak size can be extracted from columns 6 (quantification1) and 7 (quantification2) of peaklist.txt; the values are reported in columns 3 and 4 of initial_DATA.txt, respectively. Except for ASCII input, the xcms package is required. To install xcms, copy and paste the following code: source("http://bioconductor.org/biocLite.R");biocLite("xcms")
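On recent R installations the biocLite script has been retired by Bioconductor in favour of the BiocManager package. The sketch below shows one way to install xcms only when it is actually needed; needs_xcms and install_xcms_if_needed are hypothetical helper names introduced for illustration and are not part of MSeasy.

```r
# Hypothetical helper: per the description above, every DataType except
# ASCII relies on the xcms package.
needs_xcms <- function(data_type) {
  data_type <- match.arg(data_type, c("CDF", "Agilent", "ASCII"))
  data_type != "ASCII"
}

# Hypothetical helper: install xcms through BiocManager when it is needed
# and not yet available. Nothing is installed for ASCII input.
install_xcms_if_needed <- function(data_type) {
  needed <- needs_xcms(data_type)
  if (needed && !requireNamespace("xcms", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install("xcms")
  }
  invisible(needed)
}

needs_xcms("ASCII")    # xcms is not required for ASCII input
needs_xcms("Agilent")  # xcms is required for Agilent input
```

Calling install_xcms_if_needed("CDF") before MS.DataCreation would then trigger an installation only on machines where xcms is missing.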

MS.DataCreation(DataType="CDF", path="", pathCDF="", mz, N_filt=3, apex=FALSE, quant=FALSE)

`DataType`	Indicates the type of input files: CDF (default) when each sample folder contains a mass spectrum in AIA/ANDI NetCDF, mzXML, mzData or mzML format and a peak list stored in a file named peaklist.txt; Agilent when sample folders are obtained with Agilent Technologies machines (extension .D) and contain a peak list stored in a rteres.txt file (all .D folders should be grouped in one folder), with the mass spectra in AIA/ANDI format grouped in a separate folder; ASCII for sample folders as returned by trans.ASCII.
`path`	If `DataType="Agilent"`, name of the folder containing all the .D folders generated by Agilent Technologies. Each .D folder should contain a rteres.txt file, which is the peak list generated by Agilent Technologies for each GC-MS analysis (default parameters should be used in the GC-MS Chemstation software). For each analysis, the name of the .D folder should be identical to the name of the AIA/ANDI file, which is usually the sample name. All .D folders should have different names. If `DataType="ASCII"`, name of the folder output_date_time returned by trans.ASCII and containing the converted files for each GC-MS analysis initially in ASCII format.
`pathCDF`	If `DataType="Agilent"`, name of the folder containing the mass spectra of all the GC-MS analyses in AIA/ANDI NetCDF format. If `DataType="CDF"`, name of the folder grouping all the GC-MS analysis folders. For each GC-MS analysis, the folder contains the mass spectrum in AIA/ANDI NetCDF, mzXML, mzData or mzML format, and a peak list stored in a file named peaklist.txt (see details below for the structure of the peak list file). All AIA/ANDI files should have different names.
`mz`	Range of mass fragments delimiting the mass spectrum, e.g. 30:250. If `mz="all"` or empty, the range is automatically detected and used to delimit the mass spectrum.
`N_filt`	Only if `DataType="ASCII"`: N_filt must be specified for chromatogram smoothing before peak detection. For more details about smoothing, please refer to the documentation of the function filter with method=convolution. If N_filt is lower than 3, the profile is not smoothed. A high N_filt lowers the noise in the chromatogram but can result in the loss of low-concentration peaks.
`apex`	`TRUE` indicates that the mass spectrum is taken at the apex of the peak. `FALSE` (default) indicates that a mean mass spectrum is obtained by averaging 5 percent of the mass spectra surrounding the apex (apex included) for AIA/ANDI NetCDF files, and by averaging the mass spectrum before the apex, the mass spectrum after the apex and the mass spectrum at the apex for ASCII files.
`quant`	If `DataType="Agilent"` or `DataType="CDF"`, the option quant indicates whether quantification measures of peak size should be extracted from the peak list files and added to the initial_DATA matrix. `TRUE`, if `DataType="Agilent"`, indicates that the two quantification columns CorrArea (corrected peak area) and PercTot (percent of the total corrected area) are extracted from rteres.txt and added in columns 3 and 4 of the output matrix. Corrected area is used for absolute quantification when associated with the use of external and/or internal standards. Percent of the total corrected area is used for relative quantification (no external or internal standard needed). `TRUE`, if `DataType="CDF"`, indicates that one or two columns with quantification measures of peak size (height, width or area) are in columns 6 and 7 of peaklist.txt. The information is extracted and added in columns 3 and 4 of the output matrix. This option makes it possible to generate one or two profiling matrices with quantification for each putative molecule after MS.clust. `FALSE` indicates that quantitative measures are absent or should not be added to the output matrix. In that case, a fingerprinting matrix (absence or presence of each putative molecule) is obtained after MS.clust.

After a GC-MS analysis, different types of files are produced by the chromatograph and the mass spectrometer. Each instrument vendor provides specific proprietary data formats that should be converted to a common raw data format such as ANDI NetCDF or mzXML. The most commonly used file formats for mass spectral data, i.e. NetCDF, mzXML and ASCII, are accepted by MS.DataCreation. The proprietary format from Agilent Technologies can also be used directly. The detailed structure of the three types of input formats is given below:

(i) DataType=CDF. Each GC-MS analysis has its own folder, which contains a mass spectrum in AIA/ANDI NetCDF, mzXML, mzData or mzML format, and a peak list stored in a file named peaklist.txt. Peaklist.txt should have column headings similar to peak/RT/firstscan/maxscan/lastscan/quantification1/quantification2. The first column contains the peak number, the second column the retention time in minutes or seconds, the third column the first scan of the peak, the fourth column the scan at the apex (maxscan) and the fifth column the last scan of the peak. Optionally, a quantitative measure of peak size (quantification1) is in column 6 and another quantitative measure of peak size (quantification2) is in column 7 (only maxscan is used if apex=TRUE in MS.clust). The sample name reported in the output matrix is extracted from the name of the AIA/ANDI files. Thus, all AIA/ANDI files should have different names. All analysis folders should be grouped in one folder. The function first checks for the presence of the AIA/ANDI and peaklist.txt files, checks that the range of mz is consistent and checks the structure of the peaklist.txt files. Second, the function collects the retention time of each peak from peaklist.txt and looks for the corresponding mass spectra in the CDF files. Depending on the apex option, the mean mass spectrum of each peak is calculated or the mass spectrum at the apex is extracted. The intensity, in counts, of each mass fragment is transformed to a relative percentage of the highest mass fragment per spectrum. If quant = TRUE, one or two quantification columns, quantification1 and quantification2, are extracted for each peak from peaklist.txt and placed in columns 3 and 4 of the output initial_DATA matrix, respectively.
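The peak list layout and the conversion to relative intensities can be illustrated with a small base R sketch. Everything here is an assumption made for illustration: the tab delimiter, the simulated scan matrix that stands in for a CDF file, and the helper peak_spectrum with its half_window argument. In particular, the averaging window is a simple stand-in and does not reproduce how the package selects 5 percent of the surrounding spectra.

```r
# Small example peaklist.txt (all values are made up for illustration).
peaks <- data.frame(
  peak            = 1:2,
  RT              = c(5.12, 7.48),    # retention time, here in minutes
  firstscan       = c(10, 30),
  maxscan         = c(12, 33),        # scan at the apex
  lastscan        = c(15, 37),
  quantification1 = c(15000, 42000),  # optional, e.g. peak area
  quantification2 = c(26.3, 73.7)     # optional, e.g. percent of total
)
peaklist_file <- file.path(tempdir(), "peaklist.txt")
# The tab separator is an assumption; the delimiter is not documented here.
write.table(peaks, peaklist_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back and check that the five mandatory columns are present.
pl <- read.table(peaklist_file, header = TRUE, sep = "\t")
required <- c("peak", "RT", "firstscan", "maxscan", "lastscan")
stopifnot(all(required %in% names(pl)))

# Simulated scans x m/z intensity matrix (counts), standing in for a CDF file.
set.seed(1)
mz <- 30:40
scans <- matrix(rpois(40 * length(mz), lambda = 200), nrow = 40,
                dimnames = list(NULL, as.character(mz)))

# Hypothetical helper: spectrum of one peak, expressed as a percentage of
# its most intense mass fragment.
peak_spectrum <- function(scans, maxscan, apex = FALSE, half_window = 1) {
  if (apex) {
    spec <- scans[maxscan, ]            # spectrum at the apex only
  } else {
    first <- max(1, maxscan - half_window)
    last  <- min(nrow(scans), maxscan + half_window)
    spec  <- colMeans(scans[first:last, , drop = FALSE])  # mean around apex
  }
  100 * spec / max(spec)
}

rel_apex <- peak_spectrum(scans, pl$maxscan[1], apex = TRUE)
rel_mean <- peak_spectrum(scans, pl$maxscan[1], apex = FALSE)
```

In both cases the most intense fragment of the returned spectrum is scaled to 100, so spectra from peaks of very different sizes become directly comparable.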

(ii) DataType=Agilent. For Agilent Technologies instruments (using the default parameters), each GC-MS analysis returns a .D folder that contains a file rteres.txt with summary information on the chromatogram (analogous to a peak list). All the analysis folders should have different names and should be grouped in one folder. The mass spectra should be exported in ANDI NetCDF format. These files can be generated at once for several selected GC-MS analyses with the Chemstation data analysis software (Menu/File/Export to AIA/ANDI). By default, all CDF files are exported to one folder, which may be used as pathCDF. The sample name reported in the output matrix is extracted from the name of the .D folder. Thus, all .D folders should have different names. Each AIA/ANDI file should have the same name as the corresponding .D folder. The function first checks that all sample folders (.D) within the folder path contain a file rteres.txt and that pathCDF contains all the CDF files needed. If one file is missing, the analysis stops and indicates the name of the problematic sample. The analysis should be restarted after correction or removal. Second, the function collects the retention time of each peak from rteres.txt and looks for the corresponding mass spectra in the CDF files. Depending on the apex option, the mean mass spectrum of each peak is calculated or the mass spectrum at the apex is extracted. The intensity, in counts, of each mass fragment is transformed to a relative percentage of the highest mass fragment per spectrum. If quant = TRUE, the two quantification columns CorrArea (corrected peak area) and PercTot (percent of the total corrected area) are extracted for each peak from rteres.txt and placed in columns 3 and 4 of the output initial_DATA matrix, respectively.
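Because the function stops at the first missing file, it can be convenient to check the folder layout beforehand. The following sketch builds a throw-away layout in a temporary directory and reports the samples that lack a rteres.txt or a matching CDF file. The helper check_agilent_layout, the sample names and the .CDF file extension are assumptions for illustration; this is not the check implemented inside MS.DataCreation.

```r
# Throw-away layout: three .D folders but only two rteres.txt and two CDF files.
root     <- file.path(tempdir(), "agilent_demo")
path     <- file.path(root, "Agilent_rteres")
path_cdf <- file.path(root, "Agilent_CDF")
unlink(root, recursive = TRUE)
dir.create(path, recursive = TRUE)
dir.create(path_cdf)
for (s in c("sampleA", "sampleB", "sampleC")) {
  dir.create(file.path(path, paste0(s, ".D")))
}
file.create(file.path(path, "sampleA.D", "rteres.txt"))
file.create(file.path(path, "sampleB.D", "rteres.txt"))  # sampleC.D has none
file.create(file.path(path_cdf, c("sampleA.CDF", "sampleB.CDF")))

# Hypothetical pre-flight check: each .D folder needs a rteres.txt file and
# a CDF file of the same name in path_cdf.
check_agilent_layout <- function(path, path_cdf) {
  d_dirs     <- list.files(path, pattern = "\\.D$", ignore.case = TRUE)
  samples    <- sub("\\.D$", "", d_dirs, ignore.case = TRUE)
  has_rteres <- file.exists(file.path(path, d_dirs, "rteres.txt"))
  cdf_names  <- tools::file_path_sans_ext(list.files(path_cdf))
  list(
    missing_rteres = samples[!has_rteres],
    missing_cdf    = setdiff(samples, cdf_names)
  )
}

problems <- check_agilent_layout(path, path_cdf)
problems  # lists the samples to fix or remove before running the function
```

With this layout, sampleC is reported in both elements of the list, which mirrors the situation in which the analysis would stop and name the problematic sample.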

(iii) DataType=ASCII. If your GC-MS raw data have been converted into the international ASCII format, all files (one per GC-MS analysis) should be grouped in one folder and first passed through the trans.ASCII function. The trans.ASCII function generates a folder output_date_time with translated files compatible with MS.DataCreation. This output_date_time folder may be used as path. First, the chromatogram is smoothed according to the option N_filt (see the documentation of the function filter, method=convolution). Next, peaks are detected as a succession of three points of increasing intensity directly followed by three points of decreasing intensity (all points should have an intensity higher than 10 kilocounts). The first and last peaks of the chromatogram are removed if incomplete. Finally, depending on the apex option, the function calculates the mean mass spectrum of each peak or extracts the mass spectrum at the apex, and the intensity (in counts) of each mass fragment is transformed to a relative percentage of the highest mass fragment per spectrum.
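The smoothing and peak detection steps can be sketched on a simulated chromatogram. This is one possible reading of the rule stated above, not the package's own code: the apex is taken as the last of three points of increasing intensity and must be followed by three points of decreasing intensity, with all six points above 10000 counts. The function names smooth_tic and detect_apexes, the equal-weight moving average and the simulated signal are assumptions.

```r
# Simulated total ion chromatogram (counts): two Gaussian peaks on a flat
# baseline that stays below the 10 kilocount threshold.
scan <- 1:200
tic  <- 5000 +
  80000 * exp(-(scan - 60)^2 / 30) +
  40000 * exp(-(scan - 140)^2 / 50)

# Moving-average smoothing with stats::filter (method = "convolution").
# As described for N_filt, no smoothing is applied when N_filt < 3.
smooth_tic <- function(x, N_filt = 3) {
  if (N_filt < 3) return(x)
  as.numeric(stats::filter(x, rep(1 / N_filt, N_filt),
                           method = "convolution", sides = 2))
}

# Hypothetical detector: index i is an apex when x[i-2] < x[i-1] < x[i]
# and x[i] > x[i+1] > x[i+2] > x[i+3], with all six points above threshold.
detect_apexes <- function(x, threshold = 10000) {
  n <- length(x)
  apex <- integer(0)
  for (i in 3:(n - 3)) {
    win <- x[(i - 2):(i + 3)]
    # Skip windows touching the NA ends left by smoothing or the baseline.
    if (anyNA(win) || any(win <= threshold)) next
    rising  <- all(diff(x[(i - 2):i]) > 0)
    falling <- all(diff(x[i:(i + 3)]) < 0)
    if (rising && falling) apex <- c(apex, i)
  }
  apex
}

apexes <- detect_apexes(smooth_tic(tic, N_filt = 3))
apexes  # scan indices of the detected peak apexes
```

Raising N_filt widens the averaging window, which suppresses noise but also flattens small peaks, so a peak that barely exceeds the threshold may no longer be detected.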

The output file called initial_DATA.txt is saved in a folder called Output_MSDataCreation_resultdate_time. It contains the relative mass spectrum of each peak of all samples. The first column contains the sample name (the name of the folder containing the GC-MS analysis), the second column is the peak retention time (or retention index) and the following columns correspond to the relative mass spectrum of the peak (within the range of the mass spectrum). If quant = TRUE, the third column contains quantification 1 (corrected area for Agilent) and the fourth column contains quantification 2 (percent of the total corrected area for Agilent); the relative mass spectrum of the peak then starts in the fifth column.
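The two column layouts can be pictured with a small mock-up built in base R. The helper build_initial_data, the column names (sample, RT, mz30, ...) and all values are hypothetical; the actual header written to initial_DATA.txt may differ.

```r
mz <- 30:34

# Two made-up relative spectra (percent of the most intense fragment).
spectra <- rbind(c(100, 12, 0, 45, 3),
                 c(8, 100, 60, 0, 0))
colnames(spectra) <- paste0("mz", mz)

# Hypothetical helper that assembles rows in the documented column order:
# sample name, retention time, optional quantification columns, spectrum.
build_initial_data <- function(sample, rt, spectra,
                               quant1 = NULL, quant2 = NULL) {
  out <- data.frame(sample = sample, RT = rt, stringsAsFactors = FALSE)
  if (!is.null(quant1)) out$quantification1 <- quant1
  if (!is.null(quant2)) out$quantification2 <- quant2
  cbind(out, as.data.frame(spectra))
}

# quant = FALSE: the spectrum starts in column 3.
without_quant <- build_initial_data(c("sampleA", "sampleA"), c(5.12, 7.48),
                                    spectra)

# quant = TRUE: the spectrum starts in column 5.
with_quant <- build_initial_data(c("sampleA", "sampleA"), c(5.12, 7.48),
                                 spectra,
                                 quant1 = c(15000, 42000),
                                 quant2 = c(26.3, 73.7))

names(without_quant)
names(with_quant)
```

Code that reads initial_DATA.txt therefore needs to know whether quant was TRUE, since the position of the first mass fragment column shifts by two.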

MS.DataCreation returns a data matrix called initial_DATA.txt, saved in a folder called Output_MSDataCreation_resultdate_time. It contains one row per peak and per individual with the sample name, the retention time (or retention index) and the relative mass spectrum. If quant = TRUE, two supplementary columns, quantification1 and quantification2, are added after the retention time column. During the analysis, a temporary file called save_list_temp.rda is automatically generated in the folder Output_MSDataCreation_resultdate_time. It allows intermediate results to be recovered if the function stops before completion.

Elodie Courtois, Yann Guitton, Florence Nicole

## not run
## DataType="Agilent"
## requires the xcms package
## For Agilent Technologies GC-MS files,
## two folders are required: one folder with all .D analysis folders,
## each containing a rteres.txt file;
## the second folder contains all CDF or mzXML files.
## The CDF files have to be downloaded from the MSeasy web site
##  http://sites.google.com/site/rpackagemseasy/downloads/Agilent_example.zip
## Not run:  
url1<-"http://sites.google.com/site/rpackagemseasy/downloads/Agilent_example.zip"
download.file(url=url1, destfile="AgilentCDF.zip")
unzip(zipfile="AgilentCDF.zip", exdir=".") 
unlink("AgilentCDF.zip")  ## delete the zip file
## Two folders are created in your current working directory: Agilent_CDF and Agilent_rteres

#with pathCDF
library(xcms)
MS.DataCreation(path=file.path(getwd(),"Agilent_rteres"), pathCDF=file.path(getwd(),
"Agilent_CDF"), DataType="Agilent", mz=30:250,apex=FALSE, quant=FALSE) 

# without pathCDF
library(xcms)
MS.DataCreation(path=file.path(getwd(),"Agilent_rteres"), DataType="Agilent", 
mz=30:250,apex=FALSE, quant=FALSE) 

## You are asked to browse for the path to the Agilent_CDF folder
## downloaded and unzipped from the MSeasy website
unlink(c("Agilent_rteres", "Agilent_CDF"), recursive=TRUE)   # remove the example folders

## DataType="CDF"
## requires the xcms package
## Each GC-MS analysis has one folder containing
## one CDF file and one peak list file named peaklist.txt
## All analysis folders are grouped in one folder
## The CDF files and peaklist.txt have to be downloaded from the MSeasy web site
##  http://sites.google.com/site/rpackagemseasy/downloads/CDF_peaklist_example.zip


url1<-"http://sites.google.com/site/rpackagemseasy/downloads/CDF_peaklist_example.zip"
download.file(url=url1, destfile="ExampleCDF.zip")
unzip(zipfile="ExampleCDF.zip", exdir=".") 
## One folder is created in your current working directory: CDF_peaklist
unlink("ExampleCDF.zip")  ## delete the zip file

#with pathCDF
library(xcms)
MS.DataCreation(pathCDF=file.path(getwd(),"CDF_peaklist"), 
DataType="CDF", mz="all",apex=FALSE, quant=FALSE) 

# without pathCDF
library(xcms)
MS.DataCreation(DataType="CDF", mz="all",apex=FALSE, quant=FALSE) 


## You are asked for the CDF_peaklist folder
## downloaded and unzipped from the MSeasy website
unlink("CDF_peaklist", recursive=TRUE)  # remove the example folder

## End(Not run)

## For ASCII GC-MS files
pathASCII<-system.file("doc/ASCII_MSDataCreation",
package="MSeasy")
MS.DataCreation(path=pathASCII,mz=30:250,DataType="ASCII",apex=TRUE, N_filt=3)