This package provides some functions and utilities extending the functionality
of the xcms
Bioconductor package. These functions aim to provide a better
integration into the Bioconductor framework (using as much as possible methods
defined in ProtGenerics
and methods and classes from MSnbase
; and also
following Bioconductor's coding- and function-naming conventions). Some of the
concepts and definitions used in this package are from [@Smith:2014di].
xcmsExtensions
To enable a more streamlined and user-friendly access to MS data, the
xcmsExtensions
package defines the following objects. Some of these are
redundant in functionality and data content with objects in xcms
that do
however, in most cases, not provide simple access to all of the data or have
other limitations. The xcms
package defines for example the xcmsEIC
class
and the getEIC
method to extract ion chromatograms from an xcmsSet
object. The EIC, by default, is however extracted from the profile matrix that
represents intensity values aggregated in equal-spaced bins in m/z space. In
addition there exists the rawEIC
method that extracts m/z, scan index and
corresponding intensity values from the raw data, but does not return these as
an xcmsEIC
object (and does also not report the actual retention time). The
xcmsExtensions
package tries to solve some of these possible confusions by the
introduction of the below listed classes and their methods.
MSdata
: this is the basic object holding retention time (rt), m/z and
measured intensities for the rt-mz tuples. To reduce the memory footprint, the
retention time is internally stored as an Rle
object, the mz values as
numeric
and the intensities as integer
. By design this object is thought
to hold only MS data for a subset of a whole MS run, but in principle, it
could also store all the detected intensities for all m/z and retention/scan
times. In this latter case the object is somewhat redundant with the xcmsRaw
object, but provides an easier extraction of paired and matched rt, mz and
intensity values (using the msData
method).
MSslice
: represents a wrapper object for multiple MSdata
objects, all
defining the same area (slice) in the m/z-rt space, but across several
samples. By design, a MSslice
object is thought to hold data for the same
slice in each of the samples of an xcmsSet
object. Note also that, depending
on the data, each MSdata
object might have a slightly different range of m/z
and rt values.
MSsliceList
: a list of several MSslice
object. If e.g. different slices
defined by different mz/rt ranges from a single sample (e.g. from a single
xcmsRaw
object), these will be stored and returned in a MSsliceList
. Such
an MSsliceList
object could for example be used to store the MS data of all
peaks identified across all samples.
In this section we show some examples how (raw or binned) data of one or several
MS runs can be extracted from xcmsRaw
, xcmsSet
, MSsliceList
, MSslice
or
MSdata
objects.
First we load the example data set from the xcms
package.
library(xcmsExtensions) library(faahKO) xset <- faahko
As described in the previous section, an MSdata
object simply stores
intensities for rt-m/z tuples and corresponds thus to some degree to a xcmsRaw
object from the xcms
package. Below we are extracting all of the data
(retention time, m/z and intensity) values from one of the raw files of the
faahko
data set.
## Extract the full data from an xcmsRaw object. xraw <- getXcmsRaw(xset, 1) msFull <- msData(xraw) msFull
Alternatively, we could extract only a portion of the data by specifying an mz
value range and an retention time range with the mzrange
and rtrange
arguments.
## Extract only a subset of the data (from a slice in M/Z-rt space) msSub <- msData(xraw, mzrange=c(300, 350), rtrange=c(2700, 2900)) msSub
We can extract the retention time, mz values and intensity values using the
rtime
, mz
and intensity
methods.
## Get the retention time values head(rtime(msSub)) ## Get the mz values head(mz(msSub)) ## Get the intensity values head(intensity(msSub))
We could also convert the MSdata
into a numeric matrix.
head(as.matrix(msSub))
A chromatogram (i.e. intensities along retention time) can be extracted from the
MSdata
object using the getChromatogram
method, or plotted using the
plotChromatogram
method. In its default setting, the getChromatogram
method
extracts all intensity and retention time pairs in the MSdata
, aggregating
intensities with the exact same retention time. The argument FUN
of the method
allows to specify the function to aggregate the intensities. In the default
setting (FUN
being max
) only the maximal intensity for a retention time is
selected. Below we extract the chromatogram for the data sub-set.
## Extract the (full) chromatogram; the maximal intensity will be selected ## if two or more values have the same retention time. Otherwise, the data ## will be returnes /as.is/ chr <- getChromatogram(msSub) head(chr) nrow(chr)
Sometimes it might also be useful to bin the data in retention time dimension
and aggregate the values within these bins as well. The arguments nbin
or
binSize
allow to specify either the number of bins along the retention time
axis, or the size of these bins (with the bins
argument it would be even
possible to provide the actual bins, but in most instances the nbin
or
binSize
arguments should suffice). This binning allows to reduce the amount of
data and to group intensities in discrete, equal sized, bins. All intensities
(along the full M/Z range present in the MSdata
object) with a retention times
that fall within a bin are aggregated with the FUN
function. Below we bin the
data of the sub-set into 30 bins along the retention time range.
## Bin the values in 30 bins along the retention time range of the MSdata chrB <- getChromatogram(msSub, nbin=30) head(chrB) nrow(chrB)
In the example above we the matrix representing the extracted chromatogram has 30 rows, each row representing the aggregated intensities for each bin. Note that the matrix could also have less than 30 rows, e.g. if no values would be available for a bin.
We next plot the chromatogram, first without binning and add the binned chromatogram to that plot. Especially for larger data sets (i.e. with more data points) binning provides a considerable speed-up to the plotting process.
## Plot the chromatogram. plotChromatogram(msSub, type="l") ## Add the binned chromatogram to the plot. plotChromatogram(msSub, type="l", col="blue", add=TRUE, nbin=30)
With this methods we can also extract the base peak chromatogram
[@Smith:2014di] (BPC, the maximal intensity along the full M/Z range for the
same retention time) or the total ion chromatogram (TIC, the sum of all
intensities along the M/Z dimension for the same retention time). We plot these
two for the MSdata
representing the full data of the first MS run in the
experiment. For the former we can just use the default setting for the FUN
method, for the latter we change FUN
to sum
.
par(mfrow=c(1, 2)) ## Plot the BPC. plotChromatogram(msFull, type="l", main="BPC") ## And the TIC. plotChromatogram(msFull, type="l", FUN=sum, main="TIC")
Analogous to the getChromatogram
, the getSpectrum
method allows to extract a
spectrum (i.e. intensities, eventually aggregated for identical retention times,
along the M/Z dimension) from an MSdata
object, enabling also an optional
binning of values in M/Z dimension. In the example below we refine the MS data
slice for the peak from the example above and plot the chromatogram as well as
the spectrum for it. The MS data could alternatively also extracted as a
two-dimensional matrix with the mapMatrix
method (see further below in Data
manipulations section for an example).
rtr <- c(2550, 2700) mzr <- c(300, 330) ## Extrac the MS data slice msd <- msData(xraw, mzrange=mzr, rtrange=rtr) par(mfrow=c(1, 2)) ## Plot the chromatogram plotChromatogram(msd, type="l") ## And the spectrum plotSpectrum(msd, type="l")
The MSdata
is thus the basic object to store MS data of a single MS
run/sample. Data from different MS runs/samples is represented by, and can be
stored into, a MSslice
object. This object is supposed to store the data from
the same slice from the 2-dimensional (m/z over rt) MS data across several
samples. In the example below we define a slice representing one of the peaks in
the faahko
data set shown in Figure 4 in the xcms
vignette.
Alternatively we could extract an MSslice
object representing the full MS data
of all samples of an experiment by simply not providing an rtrange
or
mzrange
.
## Extract the chromatogram of a m/z-rt slice across all samples. ## Defining the mz and rt ranges for the peak in the xcms ## vignette (Figure 4). mzr <- c(300, 330) rtr <- c(2550, 2700) mss <- msSlice(xset, rtrange=rtr, mzrange=mzr) mss
We can further subset data in an MSslice
object using the subset
method
providing mzrange
and/or rtrange
, or we can subset the MSslice
object to
specific samples using the [
method. Finally we can also extract single
MSdata
objects using the [[
method.
## Further subsetting the MSslice object mssSub <- subset(mss, rtrange=c(2570, 2590)) ## Or we can use the [ and [[ methods to subset to samples. ## Extract samples 2, 8, 9 mss[c(2, 8, 9)] ## Extract the MSdata representing sample 8 mss[[8]]
Most of the methods defined for MSdata
objects are also available for
MSslice
objects. In the example below we thus extract the chromatogram and the
spectrum from the MSdata
object and get for each of the two as a result a
matrix, rows being either unique retention times or M/Z values and columns
samples.
## Extracting the chromatogram from the MSslice object. To reduce the data points ## and enable a better matching of values across samples we use binning on the ## retention time and bin the data in 2 second intervals. chrM <- getChromatogram(mss, binSize=2) head(chrM) ## The same with the spectra. Binning into 0.5 M/Z spcM <- getSpectrum(mss, binSize=0.5) head(spcM)
We can also plot the chromatogram and the spectrum with the plotChromatogram
and plotSpectrum
methods. As above, these methods would also allow to plot the
BPC and the TIC as well as the TIS for all samples within an experiment into the
same plot.
par(mfrow=c(1, 2)) plotChromatogram(mss, binSize=2, type="l", col="#00000050") plotSpectrum(mss, binSize=0.5, type="h", col="#00000050")
In addition, we could use the image
or the levelplot
method to plot the
chromatogram or spectrum matrix directly.
levelplot(chrM, las=2, xlab="Retention time", ylab="Sample")
levelplot(spcM, las=2, xlab="M/Z", ylab="Sample")
If multiple different ranges (in M/Z and/or retention time space) are defined,
the msSlice
method would extract MSslice
object for each of these and return
the results as a MSsliceList
.
In the example below we extract a slice of MS data for each of the four peaks
from the xcms
vignette, Figure 3 from all samples.
## Defining the mz and rt ranges. mzr <- rbind(c(300.0, 300.3), c(301.0, 301.3), c(298.0, 298.3), c(491.0, 491.4)) rtr <- rbind(c(3300, 3450), c(3300, 3450), c(3100, 3250), c(3300, 3500)) ## Extracting the MS data slices; we are extracting the raw, ## i.e. unaligned retention times. msl <- msSlice(xset, rtrange=rtr, mzrange=mzr, rt="raw") msl
Plotting the (base peak) chromatogram for each of the slices.
par(mfrow=c(2, 2)) lapply(slices(msl), plotChromatogram, type="l", col="#00000080")
One of the simplest data manipulations is to aggregate intensities in bins along
the M/Z or the retention time axis. This might be done to reduce the amount of
data points e.g. to plot the data. Binning in retention time dimension is also
especially useful to match individual intensities across samples, if the
retention times slightly differ. In the example below we aggregate intensities
in bins of size 1 along the M/Z and along the retention time axis (especially in
M/Z dimension this binning might be too coarse in a real world situation). By
default, the maximal signal is selected for each bin, but this can be changed
with the argument FUN
.
## Using the the same of the MS data from the previous section. rtr <- c(2550, 2700) mzr <- c(300, 330) msd <- msData(xraw, mzrange=mzr, rtrange=rtr) ## Bin along the M/Z range mzBinned <- binMz(msd, binSize=1) ## And along the retention time rtBinned <- binRtime(msd, binSize=5) ## Plotting par(mfrow=c(2, 2)) plotChromatogram(msd, type="l", col="grey") plotChromatogram(mzBinned, type="l", add=TRUE, col="blue", lty=2) plotSpectrum(msd, type="l", col="grey") plotSpectrum(mzBinned, type="l", add=TRUE, col="blue", lty=2) ## plotChromatogram(msd, type="l", col="grey") plotChromatogram(rtBinned, type="l", add=TRUE, col="blue", lty=2) plotSpectrum(msd, type="l", col="grey") plotSpectrum(rtBinned, type="l", add=TRUE, col="blue", lty=2)
To bin the data matrix in both dimensions, the binMzRtime
method can be used.
The MS data can also be extracted as a two-dimensional matrix, with rows being
the M/Z and columns the retention time values for each measurement. This might
e.g. be useful to plot the data using the image
or levelplot
functions. Below we first bin the MSdata
in both, M/Z and retention time,
extract the resulting MS data matrix and plot it. Below we plot the matrix using
the image
function, to use the levelplot
we first have to cast the
dgCMatrix
into a normal matrix using the as.matrix
function.
## We bin the data along the retention time in 5 second intervals and into 20 ## bins along the M/Z dimension. M <- mapMatrix(binMzRtime(msd, rtBinSize=5, mzNbin=20)) image(M, xlab="Retention time", ylab="M/Z")
The same binning can also be applied to MSslice
and MSsliceList
objects. For both methods, the mzrange
or rtrange
across all samples is
first cut into intervals and the binning is then performed, for each sample
separately, but using the same bins.
The xcmsExtensions
package provides also a very simple metabolic compounds
database (a SimpleCompoundDb
) that contains presently some of the data from
the Human metabolome database (HMDB [@Wishart:2013is]). This database, which
is by default bound to the variable scDb
, can be easily queried and also used
for a simple compound identification (see next section). Records can be
retrieved from the database using the compounds
method, which allows also
filter the results in a similar fashion than e.g. EnsDb
databases from
Bioconductor's ensembldb
package. The individual filter objects are listed
below:
CompoundidFilter
: allows to fetch specific compounds from the database
providing their ID. This filter allows single or multiple IDs and queries the
database column accession
.
MassrangeFilter
: allows to fetch specific compounds from the database with a
mass (either in column monoisotopic_molecular_weight
or
avg_molecular_weight
) within the mass range specified in the filter. The
object requires the mass range to be specified as a numeric vector of
length 2.
Below we first inspect some database properties and subsequently fetch all the compounds from the database.
## First list some information from the database scDb ## List all columns that are available in the database. columns(scDb) ## Fetch all compounds from the database. cmps <- compounds(scDb) nrow(cmps)
Next we specify a MassrangeFilter
and fetch all compounds with a mass larger
than 300 and smaller than or equal to 310 from the database.
## Define the filter for masses > 300 and <= 310 mrf <- MassrangeFilter(value=c(300, 310), condition="(]") ## Fetch the ID, the monoisotopic molecular weight and the chemical formula ## for these compounds cmps <- compounds(scDb, filter=mrf, columns=c("monoisotopic_molecular_weight", "accession", "chem_formula")) nrow(cmps) head(cmps)
Alternatively (or in addition) we can also specify a CompoundidFilter
to fetch
specific compounds from the database.
## Create a filter for some IDs idf <- CompoundidFilter(value=c("HMDB00523", "HMDB00010")) ## Filters can be compbined, in which case retrieved values have to match ## all conditions. cmps <- compounds(scDb, filter=list(mrf, idf), columns=c("monoisotopic_molecular_weight", "chem_formula", "accession")) cmps
Note that entries are not in the same order than e.g. the IDs in the provided
filter. The ordering of the result can be specified with the order.by
parameter. By default, results are ordered by ID.
## Fetch the result ordered by monoisotopic_molecular_weight compounds(scDb, filter=idf, columns=c("accession", "chem_formula", "monoisotopic_molecular_weight"), order.by="monoisotopic_molecular_weight")
Conversion between M/Z values from an MS measurement and the mass of the compound can
be performed with the adductmz2mass
and mass2adductmz
methods. These base on the
data from the Fiehn lab for the most commonly observed ion adducts in ESI
experiments [@Huang:1999gb] to perform the calculation, i.e. assuming the
M/Z value corresponds derives from a certain ion adduct of the real compound, it
calculates its mass. By default, both methods assume that the measured compound
is an ion and thus do just return the value as is. An ion adduct name, or several
ion adduct names, can be provided to the method with the ionAdduct
argument. All supported ion adduct names can be listed with the
supportedIonAdducts()
function.
Below we first list all supported ion adducts and convert then an (artificial
M/Z) to the mass of the compound, assuming that the M+H
ion adduct was
measured.
## List all supported ion adducts. supportedIonAdducts() ## Convert the given M/Z to the mass of the compound, assuming that the ## M+H was measured. adductmz2mass(300.1, ionAdduct="M+H")
The method returns a list, with the list name specifying the ion adduct. Below we convert the M/Z to the mass of all supported negative ion adducts.
## Convert the M/Z to the mass of all possible negative ion adducts. adductmz2mass(300.1, ionAdduct=supportedIonAdducts(charge="neg"))
We can also calculate the M/Z values of all possible ion adducts of a compound that we fetch from the database.
## Fetch the mass from a compound from the database. cmp <- compounds(scDb, filter=CompoundidFilter("HMDB00010"), columns=c("accession", "name", "monoisotopic_molecular_weight")) ## Calculate the M/Z for the ion adduct M+H for this compound. mass2adductmz(cmp$monoisotopic_molecular_weight, ionAdduct="M+H") ## And finally of all possible (supported) adducts mass2adductmz(cmp$monoisotopic_molecular_weight, ionAdduct=supportedIonAdducts())
One of the major problems in the metabolomics data analysis workflow is the
identification of the compounds. To facilitate this step, the xcmsExtensions
package provides an internal, simple, database that enables a fast and easy
identification based on the compound mass.
In the code chunk below we show some basic information on this database, that is
bound to the variable name scDb
(for simple compound database).
## Show some information on the database.
scDb
Some of the methods defined by the AnnotationDbi
Bioconductor package are
implemented, thus we can ask for the available columns in the database with the
columns
method or get the SQLite database connection with the dbconn
method.
## List the available columns. columns(scDb)
To retrieve compounds in this database we can use the compounds
method, that
by default will return all of the available compounds. The xcmsExtensions
package implements however also the filtering framework introduced by
Bioconductor's ensembldb
package. Thus, we can use an CompoundidFilter
to
select only specific compounds from the database.
## Get all of the compounds from the database. allCmps <- compounds(scDb) nrow(allCmps)
The compounds
method returns all database columns from the respective database
table from the database. The method's columns
argument allows to select only
specific columns (selected from the available ones returned by the columns
method shown above) that should be returned.
As stated above, we can use this database also for a simple, mass based,
peak/compound identification. For the simplest case, assuming that the measured
peaks is an ion, the mzmatch
method compares the provided M/Z values
with the masses of all of the compounds in the database and returns for each
M/Z all matching compounds given a user specified
difference threshold (defined with the mzdev
and ppm
arguments).
By default, the method returns all matches allowing a difference of 10 ppm.
## Defining compound masses. comps <- c(300.1898, 298.1508, 491.2000, 169.13481, 169.1348) ## Searching for matches in the database. Res <- mzmatch(comps, scDb) Res
The results are returned as a list
, elements being a matrix with the match(es)
for each specified mass. The column idx
of the matrix contains the compound
IDs and the column deltaMz
the difference between the specified mass (or
calculated mass, assuming the measured M/Z corresponds to that of an ion adduct;
see further below for more details) and the compound's mass. The column adduct
indicates the ion adduct that matches the metabolite in the database (in the
example above we assumed that the measured compound is already and ion, thus an
M
is listed).
In a real world scenario, the measured M/Z value will correspond to that of an
ion adduct of the real metabolite. The xcmsExtensions
package integrates the data
from the ESI MS adducts calculator from the Fiehn lab [@Huang:1999gb]
and uses this to convert between M/Z values and masses for all
or some of the most commonly found ion adducts. The methods adductmz2mass
and
mass2adductmz
allow to convert between M/Z and mass values of supported ion
adducts (which can be listed using the supportedIonAdducts()
method).
To calculate the mass of the measured compounds, assuming that the M+H
ion
of the metabolite was measured we, can use the adductmz2mass
method as shown
below.
## Calculate the mass assuming the M+H ion was measured. adductmz2mass(comps, ionAdduct="M+H") ## Calculate the mass assuming that a negative ion adduct was measured. adductmz2mass(comps, ionAdduct=supportedIonAdducts(charge="neg"))
If ion adduct name(s) are provided to the mzmatch
method with the ionAdduct
argument, the method internally compares all masses for all ion adducts with the
database and returns matching results.
## Match the M/Z, assuming it to correspond to a positive ion adduct, with all ## masses in the database Res <- mzmatch(comps, scDb, ionAdduct=supportedIonAdducts(charge="pos")) ## Remove all non-matching results ensuring that at leas one result is returned ## for an input M/Z Res <- lapply(Res, function(z){ if(all(is.na(z$idx))) return(z[1, ]) return(z[!is.na(z$idx), ]) }) Res[1:2]
In addition (as shown in the next section) it is possible to perform the peak
identification using the measured peak range in M/Z direction (i.e. the mzmin
and mzmax
value of the peak). Also, it would be to some extend possible to
perform the compound identification using a MassrangeFilter
and the
compounds
method, but in that case ppm and the mass of the assumed ion adduct
would have to be calculated beforehand.
In the code chunk below we're defining an CompoundidFilter
for the identified
compounds of the 4th mass and return its name and inchi key from the database.
## Define CompoundidFilter for the matches of the selected mass cf <- CompoundidFilter(Res[[1]][1, "idx"]) ## Getting the compounds' names and inchi keys from the database cmps <- compounds(scDb, filter=cf, columns=c("name", "inchikey")) cmps
xcmsExtensions
for an updated xcms
data analysis workflowIn this section we perform the analysis of the faahKO
data set described in
the xcms
package vignette LC/MS Preprocessing and Analysis with xcms.
## load the libraries and data library(xcms) library(xcmsExtensions) library(faahKO) ## list the files provided by the package cdfpath <- system.file("cdf", package="faahKO") dir(cdfpath, recursive=TRUE)
In contrast to the default workflow, we define a data.frame
specifying the
input files, associated samples and phenotypes (or genotypes as in the present
setup). Usually, such a table will be saved as a tabulator delimited text file
which is then read with the read.table
or read.AnnotatedDataFrame
method.
## Define a phenotype data.frame cdffiles <- dir(cdfpath, recursive=TRUE, full.names=TRUE) ## Extract the genotype from the file name genot <- rep("KO", length(cdffiles)) genot[grep(cdffiles, pattern="WT")] <- "WT" ## And the sample name. tmp <- strsplit(cdffiles, split=.Platform$file.sep) sampn <- unlist(lapply(tmp, function(z){ return(gsub(z[length(z)], pattern=".CDF", replacement="", fixed=TRUE)) })) ## And the phenodata table pheno <- data.frame(file=cdffiles, genotype=genot, name=sampn) head(pheno)
Next we create a new xcmsSet
object with the xcmsSet
function. This call
will identify peaks in each of the samples. In most instances we might however
change the default peak identification approach with the centwave
method
[@Tautenhahn:2008fx]. In any case, we have to set the sample class
assignment according to the experimental setup so that the peak grouping across
samples is performed as expected. Similar to other data objects defined in
Bioconductor (e.g. the ExpressionSet
), we can access data in the provided
phenodata table using the $
operator.
## The default call from the vignette. xset <- xcmsSet(files=cdffiles, phenoData=pheno) ## Setting the sample class; that's important for the peak grouping ## algorithm sampclass(xset) <- xset$genotype ## At last we define also a color for each of the two genotypes. genoColor <- c("#E41A1C80", "#377EB880") names(genoColor) <- c("KO", "WT")
Next we plot the base peak chromatogram (BPC) and the total ion chromatogram
(TIC) for all of the files to get a global overview of the data. To this end
we're extracting an MSslice
object without specifying an M/Z and retention
time range, thus returning the full MS data for each sample.
## Get the full MS data from each sample. fullData <- msSlice(xset) ## Eventually setting the names of the data names(fullData) <- sampnames(xset)
The BPC and the BPS (base peak spectrum, i.e. the maximum signal across the retention time range for one M/Z) are shown below.
par(mfrow=c(1, 2)) plotChromatogram(fullData, binSize=1, col=genoColor[xset$genotype], main="BPC", type="l") plotSpectrum(fullData, binSize=0.1, col=genoColor[xset$genotype], main="BPS", type="l")
According to the spectrum, there seems to be some difference between the sample group around an M/Z of about 320. We will extract a slice from the whole MS data from an M/Z value of 320 to 330, taking the full retention time range.
## Extract the slice mzs <- msSlice(xset, mzrange=c(320, 330)) par(mfrow=c(1, 2)) plotChromatogram(mzs, binSize=1, col=genoColor[xset$genotype], main="BPC", type="l") plotSpectrum(mzs, binSize=0.1, col=genoColor[xset$genotype], main="BPS", type="l")
Indeed, there is a clear difference between the groups in that slice.
Next we plot the total ion chromatogram (TIC) and total ion spectrum (TIS).
par(mfrow=c(1, 2)) plotChromatogram(fullData, binSize=1, FUN=sum, col=genoColor[xset$genotype], main="TIC", type="l") plotSpectrum(fullData, binSize=0.1, FUN=sum, col=genoColor[xset$genotype], main="TIS", type="l")
Plotting also boxplots of the signal distribution of all peaks per sample; width of the boxes relative to the number of peaks.
boxplot(log2(into)~sample, xlab="", data=data.frame(peaks(xset)), main="Peak intensity distribution", col=genoColor[xset$genotype], varwidth=TRUE, xaxt="n", ylab=expression(log[2]~intensity)) axis(side=1, at=1:length(xset$name), as.character(xset$name), las=2)
Some of the samples show higher overall peak intensities but over and above, their distribution across samples is comparable. The number of identified peaks per sample (represented by the box width) varies also between samples.
Next we perform the grouping of the peaks across sample and the retention time
correction. As stated above, setting the class assignment of the samples (using
the sampclass
method) is crucial, since the default peak grouping algorithm
(group.density
) requires by default that at a peak has to be present in at
least 30% of the samples in a class in order to be considered a peak (argument
minfrac
). After retention time correction the peaks are grouped again and
finally missing peak data is filled in.
## Perform the peak grouping across samples. xset <- group(xset) ## Perform the retention time correction. xset <- retcor(xset, family="symmetric") ## Group the peaks after retention time correction. xset <- group(xset) ## Filling missing peak data. xset <- fillPeaks(xset)
The groupval
method can be used to access the actual peak intensity values of
the grouped peaks. By default that method returns the index of the corresponding
peak in the peaks
slot, to return the integrated original (raw) signal of the
peak we can set the value
argument of the method to into
. That returns the
peak intensity matrix, each column representing a sample, each row a peak.
## Accessing the intensity value of the grouped peaks. head(groupval(xset, value="into"))
On that matrix we can perform the remaining analyses. We're first evaluating the
signal distribution of the peaks per sample. Note that we add a +1
to each of
the signal intensities to avoid taking the logarithm of values between 0 and 1,
or even of 0. Also, we replace peak intensities of 0
.
peakMat <- groupval(xset, value="into") colnames(peakMat) <- as.character(xset$name) ## Eventually replacing 0 values with NA. peakMat[peakMat == 0] <- NA peakMat <- log2(peakMat + 1) par(mfrow=c(1, 2)) ## Density plot denses <- apply(peakMat, MARGIN=2, density, na.rm=TRUE) Xses <- do.call(cbind, lapply(denses, function(z) z$x)) Yses <- do.call(cbind, lapply(denses, function(z) z$y)) matplot(Xses, Yses, ylab="Density", xlab=expression(log[2]~peak~intensity), col=genoColor[xset$genotype], type="l", lty=1) ## And the boxplot boxplot(peakMat, col=genoColor[xset$genotype], xaxt="n", ylab=expression(log[2]~peak~intensity)) axis(side=1, at=1:length(xset$name), xset$name, las=2)
The distribution of peak intensities (integrated intensities of the peak area) between the samples are comparable, but there might be some need for normalization.
Next we could use more sophisticated tools like limma
to identify peaks with
significantly different signal (i.e. differently abundant compounds) between the
genotypes.
Below we perform a differential abundance analysis using the methodology
provided by Bioconductor's limma
package [@Smyth:2004to]. The methods in
this package were developed for microarray data analysis, but, it might be as
well applicable to metabolomics data, given that the peak intensity distribution
is similar to the signal distribution usually seen in microarray experiments.
## Loading the limma package library(limma) ## Limma setup: define the design matrix KO vs WT genot <- xset$genotype dsgn <- model.matrix(~ 0 + genot) dsgn
The very simple model above could be further extended to include and account for e.g. batches in the data etc. Below we fit now the model to the data (the linear model is fit to each row/peak in the peak matrix). Subsequently we define the contrasts of interest (i.e. the comparison of the sample groups) and estimate the p-values for significance of the difference. Finally, we adjust the p-values for multiple hypothesis testing using the method from Benjamini and Hochberg.
## Fitting the linear model to the data fit <- lmFit(peakMat, design=dsgn) ## Defining the contrasts of interest, i.e. which sample group we want ## to compare against which. contrMat <- makeContrasts(KOvsWT=genotKO - genotWT, levels=dsgn) fit <- contrasts.fit(fit, contrMat) fit <- eBayes(fit) ## Adjusting the p-values padj <- p.adjust(fit$p.value[, "KOvsWT"], method="BH")
We next plot the distribution of raw and adjusted p-values.
par(mfrow=c(1, 2)) hist(fit$p.value[, "KOvsWT"], breaks=128, main="Raw p-values", col="grey", xlab="p-value") hist(padj, breaks=128, xlim=c(0, 1), main="Adjusted p-values", col="grey", xlab="p-value")
An enrichment of small p-values is present in the data, indicating peaks with significant difference in abundances. Except from these, there is an about uniform distribution of raw p-values, with a slight bias towards bigger p-values (suggesting eventual problems in the data such as batch effects not accounted for etc).
Next we create a volcano plot indicating the difference in abundances (in analogy to microarray analysis called M, i.e. a log2 fold change) on the x-axis against its significance on the y-axis.
X <- fit$coefficients[, "KOvsWT"] Y <- -log10(padj) plot(3, 3, pch=NA, xlab="M", ylab=expression(-log[10](p[BH])), xlim=range(X), ylim=range(Y)) ## Plotting the area we would consider to contain significantly different peaks rect(xleft=min(X)-1, xright=-1, ybottom=-log10(0.05), ytop=max(Y)+1, col="#377EB810", border=NA) rect(xleft=1, xright=max(X)+1, ybottom=-log10(0.05), ytop=max(Y)+1, col="#377EB810", border=NA) abline(h=-log10(0.05), col="#377EB880", lty=2) abline(v=c(-1, 1), col="#377EB880", lty=2) ## Plotting the points. points(X, Y, col="#00000030", pch=16) ## Plotting the FDR axis Yticks <- pretty(Y) Yticks <- Yticks[-length(Yticks)] Ylabs <- 100*(10^-Yticks) axis(side=4, at=Yticks, label=format(Ylabs, digits=2, scientific=TRUE), cex=par("cex.axis"), las=2) mtext(side=4, line=4.2, text="% FDR", cex=par("cex.lab"))
We next select the peaks with a significant difference in abundance (more than
two-fold different at a 5 percent FDR) and perform a simple compound
identification for these, based on their M/Z value. In the default setting the
mzmatch
method searches for masses being equal or highly similar to the actual
M/Z value, thus assuming the compound being an ion. Alternatively, we can also
specify a specific ion adduct, if we assume that the signal was generated by a
certain ion adduct from the metabolite (e.g. M+H for the metabolite + an H).
## Select significant peaks idx <- which(padj < 0.05 & abs(fit$coefficients[, "KOvsWT"]) > 1) ## Compiling the result table resTable <- cbind(idx=idx, groups(xset)[idx, ], rawp=fit$p.value[idx, "KOvsWT"], padj=padj[idx], M=fit$coefficients[idx, "KOvsWT"]) ## Performing the compound identification using the mzmatch method and the ## internal SimpleCompoundDb based on the median M/Z value and accepting a ## ppm of 20 comps <- mzmatch(resTable[, "mzmed"], mz=scDb, ppm=20, mzdev=0.01) ## Getting the names for the identified compounds. compNames <- lapply(comps, function(z){ if(any(is.na(z[, 1]))) return(NA) tmp <- compounds(scDb, columns="name", filter=CompoundidFilter(z[, 1])) return(tmp[, "name"]) }) ## Combine multiple compounds into a single one. compNames <- lapply(compNames, function(z){ return(paste(z, collapse="; ")) }) comps <- lapply(comps, function(z){ return(c(id=paste(z[, 1], collapse="; "), deltaMz=paste(z[, 2], collapse="; "))) }) ## Combine to the final result table resTable <- data.frame(do.call(rbind, comps), name=unlist(compNames, use.names=FALSE), resTable, stringsAsFactors=FALSE) ## Order the table by significance resTable <- resTable[order(resTable$rawp), ]
In the next code block we perform the peak identification assuming that the
signal was generated by M+H ion adducts by providing the ionAdduct
parameter. For a list of supported (most frequent) ion adducts use the
supportedIonAdducts
method. Note also that more than one ion adduct can be
specified. In addition it is possible to search for compounds based on the range
of the peak in M/Z direction, i.e. the database is searched for masses
within the mass range corresponding to the M/Z range/peak width (+- the
specified ppm). In the example below we submit the mzmin
and mzmax
of the
peaks, which are then converted to a min and max mass, assuming the signal
derives from an M+H ion adduct. Compounds with a (monoisotopic mass) within the
range spanned by min mass (-ppm of that mass) and max mass (+ppm of that mass)
are then returned.
## Performing the search assuming the signal derives from M+H ion adducts. compsH <- mzmatch(as.matrix(resTable[, c("mzmin", "mzmax")]), mz=scDb, ppm=10, ionAdduct="M+H")
This resulted in far more matches then the search based on the median mz of the peak.
The top 6 most significant peaks are shown below; none of them was identified by their M/Z value.
head(resTable[, c("id", "mzmed", "rtmed", "padj", "M")])
At last we plot the chromatogram and the spectrum for the first peak. To this
end we first extract the MS data from the xcmsSet
object for the slice defined
by the M/Z and retention time of the peak, extending these ranges a little such
that we can see also the neighborhood of the peak.
## Extracting an MSslice from the xcmsSet object rtr <- as.numeric(resTable[1, c("rtmin", "rtmax")]) mzr <- as.numeric(resTable[1, c("mzmin", "mzmax")]) ## Extract the slice extending the ranges a little. mss <- msSlice(xset, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2))
par(mfrow=c(1, 2)) ## chromatogram plotChromatogram(mss, type="l", col=genoColor[as.character(xset$genotype)]) ## highlighting the peak area. abline(v=rtr, col="#00000020", lty=2) ## spectrum plotSpectrum(mss, type="h", col=genoColor[as.character(xset$genotype)]) abline(v=mzr, col="#00000020", lty=2)
And also for the second peak.
rtr <- as.numeric(resTable[2, c("rtmin", "rtmax")]) mzr <- as.numeric(resTable[2, c("mzmin", "mzmax")]) ## Extract the slice extending the ranges a little. mss <- msSlice(xset, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2)) par(mfrow=c(1, 2)) plotChromatogram(mss, type="l", col=genoColor[as.character(xset$genotype)]) ## highlighting the peak area. abline(v=rtr, col="#00000020", lty=2) ## spectrum plotSpectrum(mss, type="h", col=genoColor[as.character(xset$genotype)]) abline(v=mzr, col="#00000020", lty=2)
Surprisingly, the peaks don't look very much aligned (i.e. retention time corrected).
Plotting the MS data for the most significant peak before and after retention time correction.
rtr <- as.numeric(resTable[1, c("rtmin", "rtmax")]) mzr <- as.numeric(resTable[1, c("mzmin", "mzmax")]) ## Extract the slice from the data before retention time correction mssBefore <- subset(fullData, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2)) mssAfter <- msSlice(xset, rtrange=grow(rtr, 10), mzrange=grow(mzr, 2)) par(mfrow=c(1, 2)) plotChromatogram(mssBefore, type="l", col=genoColor[as.character(xset$genotype)]) abline(v=rtr, col="#00000020", lty=2) ## After retention time correction. plotChromatogram(mssAfter, type="l", col=genoColor[as.character(xset$genotype)]) abline(v=rtr, col="#00000020", lty=2)
Retention time correction was not able to align that peak properly. Checking
also for peaks listed in the original vignette from the xcms
package.
xcmsRaw
objectsThe getData
method is an alternative method that can be used to extract paired
retention time, m/z and intensity values from an xcmsRaw
object. The advantage
over the msData
method described in the previous method is, that it allows
also a sub-setting by intensities.
Loading the libraries and the xcmsRaw
object.
library(xcmsExtensions) library(faahKO) xset <- faahko ## Getting the raw data for the first data file. xraw <- getXcmsRaw(xset)
The raw data of an LC/GC-MS run is stored in an xcmsRaw
object, more
specifically, in the slots @scantime
, @env$mz
, @env$intensity
. Extracting
data from such a xcmsRaw
object can however be somewhat cumbersome, especially
when we want to extract only (eventually multiple) sub-sets of data. Also, for
memory reasons, the length of the @scantime
slot does not match the length of
the @env$mz
slot as it stores only the distinct scan/measurement time points.
We can however use the scantimes
method to extract a numeric vector that
matches the length of the @env$mz
and @env$intensity
slots, thus specifying
the scan time for each of these data points.
## What's the length of data points we have? length(xraw@env$mz) ## And the length of scan times? length(xraw@scantime) ## Extract the scan times matching the individual data points. head(scantimes(xraw)) length(scantimes(xraw))
The data in an R-object should however not be accessed directly (i.e. accessing
the slots), but ideally through an getter methods. Thus, to extract the raw
data from an xcmsRaw
object we can use the getData
method. We could also use
the rawMat
method defined in the xcms
package, but the getData
should be
preferably used, as it is also faster and extracts always correct sub-sets if
sub-setting is done on retention time ranges.
## Get the full data from the object dim(getData(xraw)) head(getData(xraw)) ## Extract only a subset of data specified by an retention time range. datmat <- getData(xraw, rtrange=c(2509, 2530)) dim(datmat) ## We can also specify both, a retention time and a mz range. datmat <- getData(xraw, rtrange=c(2509, 2530), mzrange=c(301, 302.003))
Besides sub-setting by retention time and m/z ranges it is also possible to extract data sub-sets defined by an intensity range.
## Use and intensity range: all with a signal higher than 300
And finally, we can also specify multiple retention time and or m/z (or intensity) ranges to extract multiple sub-sets.
Note that, if we load an xcmsRaw
object for a xcmsSet
object after having
applied retention time correction, the retention times (scan time) we extract
from that object will correspond to the corrected ones.
## Grouping (alignment) of peaks/features xset <- group(xset) ## Retention time correction xset <- retcor(xset) ## Extract "raw" data; corrected retention times will be applied to the raw data. xraw2 <- getXcmsRaw(xset)
Extracting data by directly accessing the slots of an R-object is however no
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.