Introduction

This package provides some functions and utilities extending the functionality of the xcms Bioconductor package. These functions aim to provide a better integration into the Bioconductor framework (using as much as possible methods defined in ProtGenerics and methods and classes from MSnbase; and also following Bioconductor's coding- and function-naming conventions). Some of the concepts and definitions used in this package are from [@Smith:2014di].

New objects introduced by xcmsExtensions

To enable a more streamlined and user-friendly access to MS data, the xcmsExtensions package defines the following objects. Some of these are redundant in functionality and data content with objects in xcms that do however, in most cases, not provide simple access to all of the data or have other limitations. The xcms package defines for example the xcmsEIC class and the getEIC method to extract ion chromatograms from an xcmsSet object. The EIC, by default, is however extracted from the profile matrix that represents intensity values aggregated in equal-spaced bins in m/z space. In addition there exists the rawEIC method that extracts m/z, scan index and corresponding intensity values from the raw data, but does not return these as an xcmsEIC object (and does also not report the actual retention time). The xcmsExtensions package tries to solve some of these possible confusions by the introduction of the below listed classes and their methods.

Extracting MS data

In this section we show some examples how (raw or binned) data of one or several MS runs can be extracted from xcmsRaw, xcmsSet, MSsliceList, MSslice or MSdata objects.

First we load the example data set from the xcms package.

library(xcmsExtensions)
library(faahKO)
xset <- faahko

As described in the previous section, an MSdata object simply stores intensities for rt-m/z tuples and corresponds thus to some degree to a xcmsRaw object from the xcms package. Below we are extracting all of the data (retention time, m/z and intensity) values from one of the raw files of the faahko data set.

## Extract the full data from an xcmsRaw object.
xraw <- getXcmsRaw(xset, 1)
msFull <- msData(xraw)
msFull

Alternatively, we could extract only a portion of the data by specifying an mz value range and an retention time range with the mzrange and rtrange arguments.

## Extract only a subset of the data (from a slice in M/Z-rt space)
msSub <- msData(xraw, mzrange=c(300, 350), rtrange=c(2700, 2900))
msSub

We can extract the retention time, mz values and intensity values using the rtime, mz and intensity methods.

## Get the retention time values
head(rtime(msSub))
## Get the mz values
head(mz(msSub))
## Get the intensity values
head(intensity(msSub))

We could also convert the MSdata into a numeric matrix.

head(as.matrix(msSub))

A chromatogram (i.e. intensities along retention time) can be extracted from the MSdata object using the getChromatogram method, or plotted using the plotChromatogram method. In its default setting, the getChromatogram method extracts all intensity and retention time pairs in the MSdata, aggregating intensities with the exact same retention time. The argument FUN of the method allows to specify the function to aggregate the intensities. In the default setting (FUN being max) only the maximal intensity for a retention time is selected. Below we extract the chromatogram for the data sub-set.

## Extract the (full) chromatogram; the maximal intensity will be selected
## if two or more values have the same retention time. Otherwise, the data
## will be returnes /as.is/
chr <- getChromatogram(msSub)
head(chr)
nrow(chr)

Sometimes it might also be useful to bin the data in retention time dimension and aggregate the values within these bins as well. The arguments nbin or binSize allow to specify either the number of bins along the retention time axis, or the size of these bins (with the bins argument it would be even possible to provide the actual bins, but in most instances the nbin or binSize arguments should suffice). This binning allows to reduce the amount of data and to group intensities in discrete, equal sized, bins. All intensities (along the full M/Z range present in the MSdata object) with a retention times that fall within a bin are aggregated with the FUN function. Below we bin the data of the sub-set into 30 bins along the retention time range.

## Bin the values in 30 bins along the retention time range of the MSdata
chrB <- getChromatogram(msSub, nbin=30)
head(chrB)
nrow(chrB)

In the example above we the matrix representing the extracted chromatogram has 30 rows, each row representing the aggregated intensities for each bin. Note that the matrix could also have less than 30 rows, e.g. if no values would be available for a bin.

We next plot the chromatogram, first without binning and add the binned chromatogram to that plot. Especially for larger data sets (i.e. with more data points) binning provides a considerable speed-up to the plotting process.

## Plot the chromatogram.
plotChromatogram(msSub, type="l")
## Add the binned chromatogram to the plot.
plotChromatogram(msSub, type="l", col="blue", add=TRUE, nbin=30)

With this methods we can also extract the base peak chromatogram [@Smith:2014di] (BPC, the maximal intensity along the full M/Z range for the same retention time) or the total ion chromatogram (TIC, the sum of all intensities along the M/Z dimension for the same retention time). We plot these two for the MSdata representing the full data of the first MS run in the experiment. For the former we can just use the default setting for the FUN method, for the latter we change FUN to sum.

par(mfrow=c(1, 2))
## Plot the BPC.
plotChromatogram(msFull, type="l", main="BPC")
## And the TIC.
plotChromatogram(msFull, type="l", FUN=sum, main="TIC")

Analogous to the getChromatogram, the getSpectrum method allows to extract a spectrum (i.e. intensities, eventually aggregated for identical retention times, along the M/Z dimension) from an MSdata object, enabling also an optional binning of values in M/Z dimension. In the example below we refine the MS data slice for the peak from the example above and plot the chromatogram as well as the spectrum for it. The MS data could alternatively also extracted as a two-dimensional matrix with the mapMatrix method (see further below in Data manipulations section for an example).

rtr <- c(2550, 2700)
mzr <- c(300, 330)
## Extrac the MS data slice
msd <- msData(xraw, mzrange=mzr, rtrange=rtr)
par(mfrow=c(1, 2))
## Plot the chromatogram
plotChromatogram(msd, type="l")
## And the spectrum
plotSpectrum(msd, type="l")

The MSdata is thus the basic object to store MS data of a single MS run/sample. Data from different MS runs/samples is represented by, and can be stored into, a MSslice object. This object is supposed to store the data from the same slice from the 2-dimensional (m/z over rt) MS data across several samples. In the example below we define a slice representing one of the peaks in the faahko data set shown in Figure 4 in the xcms vignette.

Alternatively we could extract an MSslice object representing the full MS data of all samples of an experiment by simply not providing an rtrange or mzrange.

## Extract the chromatogram of a m/z-rt slice across all samples.
## Defining the mz and rt ranges for the peak in the xcms
## vignette (Figure 4).
mzr <- c(300, 330)
rtr <- c(2550, 2700)
mss <- msSlice(xset, rtrange=rtr, mzrange=mzr)
mss

We can further subset data in an MSslice object using the subset method providing mzrange and/or rtrange, or we can subset the MSslice object to specific samples using the [ method. Finally we can also extract single MSdata objects using the [[ method.

## Further subsetting the MSslice object
mssSub <- subset(mss, rtrange=c(2570, 2590))

## Or we can use the [ and [[ methods to subset to samples.
## Extract samples 2, 8, 9
mss[c(2, 8, 9)]

## Extract the MSdata representing sample 8
mss[[8]]

Most of the methods defined for MSdata objects are also available for MSslice objects. In the example below we thus extract the chromatogram and the spectrum from the MSdata object and get for each of the two as a result a matrix, rows being either unique retention times or M/Z values and columns samples.

## Extracting the chromatogram from the MSslice object. To reduce the data points
## and enable a better matching of values across samples we use binning on the
## retention time and bin the data in 2 second intervals.
chrM <- getChromatogram(mss, binSize=2)
head(chrM)

## The same with the spectra. Binning into 0.5 M/Z
spcM <- getSpectrum(mss, binSize=0.5)
head(spcM)

We can also plot the chromatogram and the spectrum with the plotChromatogram and plotSpectrum methods. As above, these methods would also allow to plot the BPC and the TIC as well as the TIS for all samples within an experiment into the same plot.

par(mfrow=c(1, 2))
plotChromatogram(mss, binSize=2, type="l", col="#00000050")
plotSpectrum(mss, binSize=0.5, type="h", col="#00000050")

In addition, we could use the image or the levelplot method to plot the chromatogram or spectrum matrix directly.

levelplot(chrM, las=2, xlab="Retention time", ylab="Sample")
levelplot(spcM, las=2, xlab="M/Z", ylab="Sample")

If multiple different ranges (in M/Z and/or retention time space) are defined, the msSlice method would extract MSslice object for each of these and return the results as a MSsliceList.

In the example below we extract a slice of MS data for each of the four peaks from the xcms vignette, Figure 3 from all samples.

## Defining the mz and rt ranges.
mzr <- rbind(c(300.0, 300.3),
         c(301.0, 301.3),
         c(298.0, 298.3),
         c(491.0, 491.4))
rtr <- rbind(c(3300, 3450),
         c(3300, 3450),
         c(3100, 3250),
         c(3300, 3500))
## Extracting the MS data slices; we are extracting the raw,
## i.e. unaligned retention times.
msl <- msSlice(xset, rtrange=rtr, mzrange=mzr, rt="raw")
msl

Plotting the (base peak) chromatogram for each of the slices.

par(mfrow=c(2, 2))
lapply(slices(msl), plotChromatogram, type="l", col="#00000080")

Data manipulations

One of the simplest data manipulations is to aggregate intensities in bins along the M/Z or the retention time axis. This might be done to reduce the amount of data points e.g. to plot the data. Binning in retention time dimension is also especially useful to match individual intensities across samples, if the retention times slightly differ. In the example below we aggregate intensities in bins of size 1 along the M/Z and along the retention time axis (especially in M/Z dimension this binning might be too coarse in a real world situation). By default, the maximal signal is selected for each bin, but this can be changed with the argument FUN.

## Using the the same of the MS data from the previous section.
rtr <- c(2550, 2700)
mzr <- c(300, 330)
msd <- msData(xraw, mzrange=mzr, rtrange=rtr)

## Bin along the M/Z range
mzBinned <- binMz(msd, binSize=1)
## And along the retention time
rtBinned <- binRtime(msd, binSize=5)

## Plotting
par(mfrow=c(2, 2))
plotChromatogram(msd, type="l", col="grey")
plotChromatogram(mzBinned, type="l", add=TRUE, col="blue", lty=2)
plotSpectrum(msd, type="l", col="grey")
plotSpectrum(mzBinned, type="l", add=TRUE, col="blue", lty=2)
##
plotChromatogram(msd, type="l", col="grey")
plotChromatogram(rtBinned, type="l", add=TRUE, col="blue", lty=2)
plotSpectrum(msd, type="l", col="grey")
plotSpectrum(rtBinned, type="l", add=TRUE, col="blue", lty=2)

To bin the data matrix in both dimensions, the binMzRtime method can be used.

The MS data can also be extracted as a two-dimensional matrix, with rows being the M/Z and columns the retention time values for each measurement. This might e.g. be useful to plot the data using the image or levelplot functions. Below we first bin the MSdata in both, M/Z and retention time, extract the resulting MS data matrix and plot it. Below we plot the matrix using the image function, to use the levelplot we first have to cast the dgCMatrix into a normal matrix using the as.matrix function.

## We bin the data along the retention time in 5 second intervals and into 20
## bins along the M/Z dimension.
M <- mapMatrix(binMzRtime(msd, rtBinSize=5, mzNbin=20))
image(M, xlab="Retention time", ylab="M/Z")

The same binning can also be applied to MSslice and MSsliceList objects. For both methods, the mzrange or rtrange across all samples is first cut into intervals and the binning is then performed, for each sample separately, but using the same bins.

Simple compound database

The xcmsExtensions package provides also a very simple metabolic compounds database (a SimpleCompoundDb) that contains presently some of the data from the Human metabolome database (HMDB [@Wishart:2013is]). This database, which is by default bound to the variable scDb, can be easily queried and also used for a simple compound identification (see next section). Records can be retrieved from the database using the compounds method, which allows also filter the results in a similar fashion than e.g. EnsDb databases from Bioconductor's ensembldb package. The individual filter objects are listed below:

Below we first inspect some database properties and subsequently fetch all the compounds from the database.

## First list some information from the database
scDb

## List all columns that are available in the database.
columns(scDb)

## Fetch all compounds from the database.
cmps <- compounds(scDb)
nrow(cmps)

Next we specify a MassrangeFilter and fetch all compounds with a mass larger than 300 and smaller than or equal to 310 from the database.

## Define the filter for masses > 300 and <= 310
mrf <- MassrangeFilter(value=c(300, 310), condition="(]")

## Fetch the ID, the monoisotopic molecular weight and the chemical formula
## for these compounds
cmps <- compounds(scDb, filter=mrf, columns=c("monoisotopic_molecular_weight",
                          "accession", "chem_formula"))
nrow(cmps)
head(cmps)

Alternatively (or in addition) we can also specify a CompoundidFilter to fetch specific compounds from the database.

## Create a filter for some IDs
idf <- CompoundidFilter(value=c("HMDB00523", "HMDB00010"))

## Filters can be compbined, in which case retrieved values have to match
## all conditions.
cmps <- compounds(scDb, filter=list(mrf, idf), columns=c("monoisotopic_molecular_weight",
                             "chem_formula", "accession"))
cmps

Note that entries are not in the same order than e.g. the IDs in the provided filter. The ordering of the result can be specified with the order.by parameter. By default, results are ordered by ID.

## Fetch the result ordered by monoisotopic_molecular_weight
compounds(scDb, filter=idf, columns=c("accession", "chem_formula",
                      "monoisotopic_molecular_weight"),
      order.by="monoisotopic_molecular_weight")

ESI MS adducts mass conversion

Conversion between M/Z values from an MS measurement and the mass of the compound can be performed with the adductmz2mass and mass2adductmz methods. These base on the data from the Fiehn lab for the most commonly observed ion adducts in ESI experiments [@Huang:1999gb] to perform the calculation, i.e. assuming the M/Z value corresponds derives from a certain ion adduct of the real compound, it calculates its mass. By default, both methods assume that the measured compound is an ion and thus do just return the value as is. An ion adduct name, or several ion adduct names, can be provided to the method with the ionAdduct argument. All supported ion adduct names can be listed with the supportedIonAdducts() function.

Below we first list all supported ion adducts and convert then an (artificial M/Z) to the mass of the compound, assuming that the M+H ion adduct was measured.

## List all supported ion adducts.
supportedIonAdducts()

## Convert the given M/Z to the mass of the compound, assuming that the
## M+H was measured.
adductmz2mass(300.1, ionAdduct="M+H")

The method returns a list, with the list name specifying the ion adduct. Below we convert the M/Z to the mass of all supported negative ion adducts.

## Convert the M/Z to the mass of all possible negative ion adducts.
adductmz2mass(300.1, ionAdduct=supportedIonAdducts(charge="neg"))

We can also calculate the M/Z values of all possible ion adducts of a compound that we fetch from the database.

## Fetch the mass from a compound from the database.
cmp <- compounds(scDb, filter=CompoundidFilter("HMDB00010"),
         columns=c("accession", "name", "monoisotopic_molecular_weight"))

## Calculate the M/Z for the ion adduct M+H for this compound.
mass2adductmz(cmp$monoisotopic_molecular_weight, ionAdduct="M+H")

## And finally of all possible (supported) adducts
mass2adductmz(cmp$monoisotopic_molecular_weight, ionAdduct=supportedIonAdducts())

Simple peak/compound identification

One of the major problems in the metabolomics data analysis workflow is the identification of the compounds. To facilitate this step, the xcmsExtensions package provides an internal, simple, database that enables a fast and easy identification based on the compound mass.

In the code chunk below we show some basic information on this database, that is bound to the variable name scDb (for simple compound database).

## Show some information on the database.
scDb

Some of the methods defined by the AnnotationDbi Bioconductor package are implemented, thus we can ask for the available columns in the database with the columns method or get the SQLite database connection with the dbconn method.

## List the available columns.
columns(scDb)

To retrieve compounds in this database we can use the compounds method, that by default will return all of the available compounds. The xcmsExtensions package implements however also the filtering framework introduced by Bioconductor's ensembldb package. Thus, we can use an CompoundidFilter to select only specific compounds from the database.

## Get all of the compounds from the database.
allCmps <- compounds(scDb)
nrow(allCmps)

The compounds method returns all database columns from the respective database table from the database. The method's columns argument allows to select only specific columns (selected from the available ones returned by the columns method shown above) that should be returned.

As stated above, we can use this database also for a simple, mass based, peak/compound identification. For the simplest case, assuming that the measured peaks is an ion, the mzmatch method compares the provided M/Z values with the masses of all of the compounds in the database and returns for each M/Z all matching compounds given a user specified difference threshold (defined with the mzdev and ppm arguments). By default, the method returns all matches allowing a difference of 10 ppm.

## Defining compound masses.
comps <- c(300.1898, 298.1508, 491.2000, 169.13481, 169.1348)

## Searching for matches in the database.
Res <- mzmatch(comps, scDb)

Res

The results are returned as a list, elements being a matrix with the match(es) for each specified mass. The column idx of the matrix contains the compound IDs and the column deltaMz the difference between the specified mass (or calculated mass, assuming the measured M/Z corresponds to that of an ion adduct; see further below for more details) and the compound's mass. The column adduct indicates the ion adduct that matches the metabolite in the database (in the example above we assumed that the measured compound is already and ion, thus an M is listed).

In a real world scenario, the measured M/Z value will correspond to that of an ion adduct of the real metabolite. The xcmsExtensions package integrates the data from the ESI MS adducts calculator from the Fiehn lab [@Huang:1999gb] and uses this to convert between M/Z values and masses for all or some of the most commonly found ion adducts. The methods adductmz2mass and mass2adductmz allow to convert between M/Z and mass values of supported ion adducts (which can be listed using the supportedIonAdducts() method).

To calculate the mass of the measured compounds, assuming that the M+H ion of the metabolite was measured we, can use the adductmz2mass method as shown below.

## Calculate the mass assuming the M+H ion was measured.
adductmz2mass(comps, ionAdduct="M+H")

## Calculate the mass assuming that a negative ion adduct was measured.
adductmz2mass(comps, ionAdduct=supportedIonAdducts(charge="neg"))

If ion adduct name(s) are provided to the mzmatch method with the ionAdduct argument, the method internally compares all masses for all ion adducts with the database and returns matching results.

## Match the M/Z, assuming it to correspond to a positive ion adduct, with all
## masses in the database
Res <- mzmatch(comps, scDb, ionAdduct=supportedIonAdducts(charge="pos"))
## Remove all non-matching results ensuring that at leas one result is returned
## for an input M/Z
Res <- lapply(Res, function(z){
    if(all(is.na(z$idx)))
    return(z[1, ])
    return(z[!is.na(z$idx), ])
})

Res[1:2]

In addition (as shown in the next section) it is possible to perform the peak identification using the measured peak range in M/Z direction (i.e. the mzmin and mzmax value of the peak). Also, it would be to some extend possible to perform the compound identification using a MassrangeFilter and the compounds method, but in that case ppm and the mass of the assumed ion adduct would have to be calculated beforehand.

In the code chunk below we're defining an CompoundidFilter for the identified compounds of the 4th mass and return its name and inchi key from the database.

## Define CompoundidFilter for the matches of the selected mass
cf <- CompoundidFilter(Res[[1]][1, "idx"])

## Getting the compounds' names and inchi keys from the database
cmps <- compounds(scDb, filter=cf, columns=c("name", "inchikey"))
cmps

Using xcmsExtensions for an updated xcms data analysis workflow

In this section we perform the analysis of the faahKO data set described in the xcms package vignette LC/MS Preprocessing and Analysis with xcms.

## load the libraries and data
library(xcms)
library(xcmsExtensions)
library(faahKO)
## list the files provided by the package
cdfpath <- system.file("cdf", package="faahKO")
dir(cdfpath, recursive=TRUE)

In contrast to the default workflow, we define a data.frame specifying the input files, associated samples and phenotypes (or genotypes as in the present setup). Usually, such a table will be saved as a tabulator delimited text file which is then read with the read.table or read.AnnotatedDataFrame method.

## Define a phenotype data.frame
cdffiles <- dir(cdfpath, recursive=TRUE, full.names=TRUE)
## Extract the genotype from the file name
genot <- rep("KO", length(cdffiles))
genot[grep(cdffiles, pattern="WT")] <- "WT"
## And the sample name.
tmp <- strsplit(cdffiles, split=.Platform$file.sep)
sampn <- unlist(lapply(tmp, function(z){
    return(gsub(z[length(z)], pattern=".CDF", replacement="", fixed=TRUE))
}))

## And the phenodata table
pheno <- data.frame(file=cdffiles, genotype=genot, name=sampn)

head(pheno)

Next we create a new xcmsSet object with the xcmsSet function. This call will identify peaks in each of the samples. In most instances we might however change the default peak identification approach with the centwave method [@Tautenhahn:2008fx]. In any case, we have to set the sample class assignment according to the experimental setup so that the peak grouping across samples is performed as expected. Similar to other data objects defined in Bioconductor (e.g. the ExpressionSet), we can access data in the provided phenodata table using the $ operator.

## The default call from the vignette.
xset <- xcmsSet(files=cdffiles, phenoData=pheno)

## Setting the sample class; that's important for the peak grouping
## algorithm
sampclass(xset) <- xset$genotype

## At last we define also a color for each of the two genotypes.
genoColor <- c("#E41A1C80", "#377EB880")
names(genoColor) <- c("KO", "WT")

Next we plot the base peak chromatogram (BPC) and the total ion chromatogram (TIC) for all of the files to get a global overview of the data. To this end we're extracting an MSslice object without specifying an M/Z and retention time range, thus returning the full MS data for each sample.

## Get the full MS data from each sample.
fullData <- msSlice(xset)

## Eventually setting the names of the data
names(fullData) <- sampnames(xset)

The BPC and the BPS (base peak spectrum, i.e. the maximum signal across the retention time range for one M/Z) are shown below.

par(mfrow=c(1, 2))
plotChromatogram(fullData, binSize=1, col=genoColor[xset$genotype], main="BPC", type="l")
plotSpectrum(fullData, binSize=0.1, col=genoColor[xset$genotype], main="BPS", type="l")

According to the spectrum, there seems to be some difference between the sample group around an M/Z of about 320. We will extract a slice from the whole MS data from an M/Z value of 320 to 330, taking the full retention time range.

## Extract the slice
mzs <- msSlice(xset, mzrange=c(320, 330))

par(mfrow=c(1, 2))
plotChromatogram(mzs, binSize=1, col=genoColor[xset$genotype], main="BPC", type="l")
plotSpectrum(mzs, binSize=0.1, col=genoColor[xset$genotype], main="BPS", type="l")

Indeed, there is a clear difference between the groups in that slice.

Next we plot the total ion chromatogram (TIC) and total ion spectrum (TIS).

par(mfrow=c(1, 2))
plotChromatogram(fullData, binSize=1, FUN=sum, col=genoColor[xset$genotype], main="TIC", type="l")
plotSpectrum(fullData, binSize=0.1, FUN=sum, col=genoColor[xset$genotype], main="TIS", type="l")

Plotting also boxplots of the signal distribution of all peaks per sample; width of the boxes relative to the number of peaks.

boxplot(log2(into)~sample, xlab="", data=data.frame(peaks(xset)),
    main="Peak intensity distribution", col=genoColor[xset$genotype],
    varwidth=TRUE, xaxt="n", ylab=expression(log[2]~intensity))
axis(side=1, at=1:length(xset$name), as.character(xset$name), las=2)

Some of the samples show higher overall peak intensities but over and above, their distribution across samples is comparable. The number of identified peaks per sample (represented by the box width) varies also between samples.

Next we perform the grouping of the peaks across sample and the retention time correction. As stated above, setting the class assignment of the samples (using the sampclass method) is crucial, since the default peak grouping algorithm (group.density) requires by default that at a peak has to be present in at least 30% of the samples in a class in order to be considered a peak (argument minfrac). After retention time correction the peaks are grouped again and finally missing peak data is filled in.

## Perform the peak grouping across samples.
xset <- group(xset)

## Perform the retention time correction.
xset <- retcor(xset, family="symmetric")

## Group the peaks after retention time correction.
xset <- group(xset)

## Filling missing peak data.
xset <- fillPeaks(xset)

The groupval method can be used to access the actual peak intensity values of the grouped peaks. By default that method returns the index of the corresponding peak in the peaks slot, to return the integrated original (raw) signal of the peak we can set the value argument of the method to into. That returns the peak intensity matrix, each column representing a sample, each row a peak.

## Accessing the intensity value of the grouped peaks.
head(groupval(xset, value="into"))

On that matrix we can perform the remaining analyses. We're first evaluating the signal distribution of the peaks per sample. Note that we add a +1 to each of the signal intensities to avoid taking the logarithm of values between 0 and 1, or even of 0. Also, we replace peak intensities of 0.

peakMat <- groupval(xset, value="into")
colnames(peakMat) <- as.character(xset$name)

## Eventually replacing 0 values with NA.
peakMat[peakMat == 0] <- NA
peakMat <- log2(peakMat + 1)

par(mfrow=c(1, 2))
## Density plot
denses <- apply(peakMat, MARGIN=2, density, na.rm=TRUE)
Xses <- do.call(cbind, lapply(denses, function(z) z$x))
Yses <- do.call(cbind, lapply(denses, function(z) z$y))
matplot(Xses, Yses, ylab="Density", xlab=expression(log[2]~peak~intensity),
    col=genoColor[xset$genotype], type="l", lty=1)
## And the boxplot
boxplot(peakMat, col=genoColor[xset$genotype],
    xaxt="n", ylab=expression(log[2]~peak~intensity))
axis(side=1, at=1:length(xset$name), xset$name, las=2)

The distribution of peak intensities (integrated intensities of the peak area) between the samples are comparable, but there might be some need for normalization.

Next we could use more sophisticated tools like limma to identify peaks with significantly different signal (i.e. differently abundant compounds) between the genotypes.

Below we perform a differential abundance analysis using the methodology provided by Bioconductor's limma package [@Smyth:2004to]. The methods in this package were developed for microarray data analysis, but, it might be as well applicable to metabolomics data, given that the peak intensity distribution is similar to the signal distribution usually seen in microarray experiments.

## Loading the limma package
library(limma)
## Limma setup: define the design matrix KO vs WT
genot <- xset$genotype
dsgn <- model.matrix(~ 0 + genot)

dsgn

The very simple model above could be further extended to include and account for e.g. batches in the data etc. Below we fit now the model to the data (the linear model is fit to each row/peak in the peak matrix). Subsequently we define the contrasts of interest (i.e. the comparison of the sample groups) and estimate the p-values for significance of the difference. Finally, we adjust the p-values for multiple hypothesis testing using the method from Benjamini and Hochberg.

## Fitting the linear model to the data
fit <- lmFit(peakMat, design=dsgn)

## Defining the contrasts of interest, i.e. which sample group we want
## to compare against which.
contrMat <- makeContrasts(KOvsWT=genotKO - genotWT, levels=dsgn)
fit <- contrasts.fit(fit, contrMat)
fit <- eBayes(fit)

## Adjusting the p-values
padj <- p.adjust(fit$p.value[, "KOvsWT"], method="BH")

We next plot the distribution of raw and adjusted p-values.

par(mfrow=c(1, 2))
hist(fit$p.value[, "KOvsWT"], breaks=128, main="Raw p-values", col="grey", xlab="p-value")
hist(padj, breaks=128, xlim=c(0, 1), main="Adjusted p-values", col="grey", xlab="p-value")

An enrichment of small p-values is present in the data, indicating peaks with significant difference in abundances. Except from these, there is an about uniform distribution of raw p-values, with a slight bias towards bigger p-values (suggesting eventual problems in the data such as batch effects not accounted for etc).

Next we create a volcano plot indicating the difference in abundances (in analogy to microarray analysis called M, i.e. a log2 fold change) on the x-axis against its significance on the y-axis.

X <- fit$coefficients[, "KOvsWT"]
Y <- -log10(padj)
plot(3, 3, pch=NA, xlab="M", ylab=expression(-log[10](p[BH])),
     xlim=range(X), ylim=range(Y))
## Plotting the area we would consider to contain significantly different peaks
rect(xleft=min(X)-1, xright=-1, ybottom=-log10(0.05), ytop=max(Y)+1, col="#377EB810", border=NA)
rect(xleft=1, xright=max(X)+1, ybottom=-log10(0.05), ytop=max(Y)+1, col="#377EB810", border=NA)
abline(h=-log10(0.05), col="#377EB880", lty=2)
abline(v=c(-1, 1), col="#377EB880", lty=2)
## Plotting the points.
points(X, Y, col="#00000030", pch=16)
## Plotting the FDR axis
Yticks <- pretty(Y)
Yticks <- Yticks[-length(Yticks)]
Ylabs <- 100*(10^-Yticks)
axis(side=4, at=Yticks, label=format(Ylabs, digits=2, scientific=TRUE), cex=par("cex.axis"), las=2)
mtext(side=4, line=4.2, text="% FDR", cex=par("cex.lab"))

We next select the peaks with a significant difference in abundance (more than two-fold different at a 5 percent FDR) and perform a simple compound identification for these, based on their M/Z value. In the default setting the mzmatch method searches for masses being equal or highly similar to the actual M/Z value, thus assuming the compound being an ion. Alternatively, we can also specify a specific ion adduct, if we assume that the signal was generated by a certain ion adduct from the metabolite (e.g. M+H for the metabolite + an H).

## Select significant peaks
idx <- which(padj < 0.05 & abs(fit$coefficients[, "KOvsWT"]) > 1)
## Compiling the result table
resTable <- cbind(idx=idx, groups(xset)[idx, ],
          rawp=fit$p.value[idx, "KOvsWT"],
          padj=padj[idx],
          M=fit$coefficients[idx, "KOvsWT"])

## Performing the compound identification using the mzmatch method and the
## internal SimpleCompoundDb based on the median M/Z value and accepting a
## ppm of 20
comps <- mzmatch(resTable[, "mzmed"], mz=scDb, ppm=20, mzdev=0.01)
## Getting the names for the identified compounds.
compNames <- lapply(comps, function(z){
    if(any(is.na(z[, 1])))
    return(NA)
    tmp <- compounds(scDb, columns="name", filter=CompoundidFilter(z[, 1]))
    return(tmp[, "name"])
})
## Combine multiple compounds into a single one.
compNames <- lapply(compNames, function(z){
    return(paste(z, collapse="; "))
})
comps <- lapply(comps, function(z){
    return(c(id=paste(z[, 1], collapse="; "),
         deltaMz=paste(z[, 2], collapse="; ")))
})
## Combine to the final result table
resTable <- data.frame(do.call(rbind, comps),
               name=unlist(compNames, use.names=FALSE),
               resTable, stringsAsFactors=FALSE)
## Order the table by significance
resTable <- resTable[order(resTable$rawp), ]

In the next code block we perform the peak identification assuming that the signal was generated by M+H ion adducts by providing the ionAdduct parameter. For a list of supported (most frequent) ion adducts use the supportedIonAdducts method. Note also that more than one ion adduct can be specified. In addition it is possible to search for compounds based on the range of the peak in M/Z direction, i.e. the database is searched for masses within the mass range corresponding to the M/Z range/peak width (+- the specified ppm). In the example below we submit the mzmin and mzmax of the peaks, which are then converted to a min and max mass, assuming the signal derives from an M+H ion adduct. Compounds with a (monoisotopic mass) within the range spanned by min mass (-ppm of that mass) and max mass (+ppm of that mass) are then returned.

## Performing the search assuming the signal derives from M+H ion adducts.
compsH <- mzmatch(as.matrix(resTable[, c("mzmin", "mzmax")]), mz=scDb,
          ppm=10, ionAdduct="M+H")

This resulted in far more matches then the search based on the median mz of the peak.

The top 6 most significant peaks are shown below; none of them was identified by their M/Z value.

head(resTable[, c("id", "mzmed", "rtmed", "padj", "M")])

At last we plot the chromatogram and the spectrum for the first peak. To this end we first extract the MS data from the xcmsSet object for the slice defined by the M/Z and retention time of the peak, extending these ranges a little such that we can see also the neighborhood of the peak.

## Extracting an MSslice from the xcmsSet object
rtr <- as.numeric(resTable[1, c("rtmin", "rtmax")])
mzr <- as.numeric(resTable[1, c("mzmin", "mzmax")])

## Extract the slice extending the ranges a little.
mss <- msSlice(xset, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2))
par(mfrow=c(1, 2))
## chromatogram
plotChromatogram(mss, type="l", col=genoColor[as.character(xset$genotype)])
## highlighting the peak area.
abline(v=rtr, col="#00000020", lty=2)
## spectrum
plotSpectrum(mss, type="h", col=genoColor[as.character(xset$genotype)])
abline(v=mzr, col="#00000020", lty=2)

And also for the second peak.

rtr <- as.numeric(resTable[2, c("rtmin", "rtmax")])
mzr <- as.numeric(resTable[2, c("mzmin", "mzmax")])

## Extract the slice extending the ranges a little.
mss <- msSlice(xset, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2))

par(mfrow=c(1, 2))
plotChromatogram(mss, type="l", col=genoColor[as.character(xset$genotype)])
## highlighting the peak area.
abline(v=rtr, col="#00000020", lty=2)
## spectrum
plotSpectrum(mss, type="h", col=genoColor[as.character(xset$genotype)])
abline(v=mzr, col="#00000020", lty=2)

Surprisingly, the peaks don't look very much aligned (i.e. retention time corrected).

Plotting the MS data for the most significant peak before and after retention time correction.

rtr <- as.numeric(resTable[1, c("rtmin", "rtmax")])
mzr <- as.numeric(resTable[1, c("mzmin", "mzmax")])

## Extract the slice from the data before retention time correction
mssBefore <- subset(fullData, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2))

mssAfter <- msSlice(xset, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2))

par(mfrow=c(1, 2))
plotChromatogram(mssBefore, type="l", col=genoColor[as.character(xset$genotype)])
abline(v=rtr, col="#00000020", lty=2)
## After retention time correction.
plotChromatogram(mssAfter, type="l", col=genoColor[as.character(xset$genotype)])
abline(v=rtr, col="#00000020", lty=2)

Retention time correction was not able to align that peak properly. Checking also for peaks listed in the original vignette from the xcms package.

Alternative way to access data in xcmsRaw objects

The getData method is an alternative method that can be used to extract paired retention time, m/z and intensity values from an xcmsRaw object. The advantage over the msData method described in the previous method is, that it allows also a sub-setting by intensities.

Loading the libraries and the xcmsRaw object.

library(xcmsExtensions)
library(faahKO)
xset <- faahko
## Getting the raw data for the first data file.
xraw <- getXcmsRaw(xset)

The raw data of an LC/GC-MS run is stored in an xcmsRaw object, more specifically, in the slots @scantime, @env$mz, @env$intensity. Extracting data from such a xcmsRaw object can however be somewhat cumbersome, especially when we want to extract only (eventually multiple) sub-sets of data. Also, for memory reasons, the length of the @scantime slot does not match the length of the @env$mz slot as it stores only the distinct scan/measurement time points. We can however use the scantimes method to extract a numeric vector that matches the length of the @env$mz and @env$intensity slots, thus specifying the scan time for each of these data points.

## What's the length of data points we have?
length(xraw@env$mz)
## And the length of scan times?
length(xraw@scantime)

## Extract the scan times matching the individual data points.
head(scantimes(xraw))
length(scantimes(xraw))

The data in an R-object should however not be accessed directly (i.e. accessing the slots), but ideally through an getter methods. Thus, to extract the raw data from an xcmsRaw object we can use the getData method. We could also use the rawMat method defined in the xcms package, but the getData should be preferably used, as it is also faster and extracts always correct sub-sets if sub-setting is done on retention time ranges.

## Get the full data from the object
dim(getData(xraw))
head(getData(xraw))

## Extract only a subset of data specified by an retention time range.
datmat <- getData(xraw, rtrange=c(2509, 2530))
dim(datmat)

## We can also specify both, a retention time and a mz range.
datmat <- getData(xraw, rtrange=c(2509, 2530), mzrange=c(301, 302.003))

Besides sub-setting by retention time and m/z ranges it is also possible to extract data sub-sets defined by an intensity range.

## Use and intensity range: all with a signal higher than 300

And finally, we can also specify multiple retention time and or m/z (or intensity) ranges to extract multiple sub-sets.


Note that, if we load an xcmsRaw object for a xcmsSet object after having applied retention time correction, the retention times (scan time) we extract from that object will correspond to the corrected ones.

## Grouping (alignment) of peaks/features
xset <- group(xset)
## Retention time correction
xset <- retcor(xset)
## Extract "raw" data; corrected retention times will be applied to the raw data.
xraw2 <- getXcmsRaw(xset)

Extracting data by directly accessing the slots of an R-object is however no

References



jotsetung/xcmsExtensions documentation built on May 19, 2019, 9:42 p.m.