In yclement/xcms: LC-MS and GC-MS Data Analysis

BiocStyle::markdown()

Package: r Biocpkg("xcms")
Authors: Johannes Rainer
Modified: r file.info("xcms.Rmd")$mtime
Compiled: r date()

## Silently loading all packages
library(BiocStyle)
library(xcms)
library(faahKO)
library(pander)
## Use socket based parallel processing on Windows systems
if (.Platform$OS.type == "unix") {
    register(bpstart(MulticoreParam(3)))
} else {
    register(bpstart(SnowParam(3)))
}

Introduction

This documents describes data import, exploration, preprocessing and analysis of LCMS experiments with xcms version >= 3. The examples and basic workflow was adapted from the original LC/MS Preprocessing and Analysis with xcms vignette from Colin A. Smith.

The new user interface and methods use the XCMSnExp object (instead of the old xcmsSet object) as a container for the pre-processing results. To support packages and pipelines relying on the xcmsSet object, it is however possible to convert an XCMSnExp into a xcmsSet object using the as method (i.e.xset <- as(x, "xcmsSet"), with x being an XCMSnExp object.

Data import

xcms supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF, mzML/mzXML and mzData format. For the actual data import Bioconductor's r Biocpkg("mzR") is used. For demonstration purpose we will analyze a subset of the data from [@Saghatelian04] in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw data files (in NetCDF format) are provided with the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds.

Below we load all required packages, locate the raw CDF files within the faahKO package and build a phenodata data frame describing the experimental setup.

library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(magrittr)
library(pheatmap)
library(SummarizedExperiment)

## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                   replacement = "", fixed = TRUE),
                 sample_group = c(rep("KO", 6), rep("WT", 6)),
                 stringsAsFactors = FALSE)

Subsequently we load the raw data as an OnDiskMSnExp object using the readMSData method from the r Biocpkg("MSnbase") package. The MSnbase provides based structures and infrastructure for the processing of mass spectrometry data. Also, MSnbase can be used to centroid profile-mode MS data (see the corresponding vignette in the MSnbase package).

raw_data <- readMSData(files = cdfs, pdata = new("NAnnotatedDataFrame", pd),
                       mode = "onDisk")

raw_data <- filterRt(raw_data, c(2500, 4000))

The resulting OnDiskMSnExp object contains general information about the number of spectra, retention times, the measured total ion current etc, but does not contain the full raw data (i.e. the m/z and intensity values from each measured spectrum). Its memory footprint is thus rather small making it an ideal object to represent large metabolomics experiments while allowing to perform simple quality controls, data inspection and exploration as well as data sub-setting operations. The m/z and intensity values are imported from the raw data files on demand, hence the location of the raw data files should not be changed after initial data import.

Initial data inspection

The OnDiskMSnExp organizes the MS data by spectrum and provides the methods intensity, mz and rtime to access the raw data from the files (the measured intensity values, the corresponding m/z and retention time values). In addition, the spectra method could be used to return all data encapsulated in Spectrum objects. Below we extract the retention time values from the object.

head(rtime(raw_data))

All data is returned as one-dimensional vectors (a numeric vector for rtime and a list of numeric vectors for mz and intensity, each containing the values from one spectrum), even if the experiment consists of multiple files/samples. The fromFile function returns an integer vector providing the mapping of the values to the originating file. Below we use the fromFile indices to organize the mz values by file.

mzs <- mz(raw_data)

## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))

length(mzs_by_file)

As a first evaluation of the data we plot below the base peak chromatogram (BPC) for each file in our experiment. We use the chromatogram method and set the aggregationFun to "max" to return for each spectrum the maximal intensity and hence create the BPC from the raw data. To create a total ion chromatogram we could set aggregationFun to sum.

## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")

## Plot all chromatograms.
plot(bpis, col = group_colors[raw_data$sample_group])

The chromatogram method returned a Chromatograms object that organizes individual Chromatogram objects (which in fact contain the chromatographic data) in a two-dimensional array: columns represent samples and rows (optionally) m/z and/or retention time ranges. Below we extract the chromatogram of the first sample and access its retention time and intensity values.

bpi_1 <- bpis[1, 1]
head(rtime(bpi_1))
head(intensity(bpi_1))

The chromatogram method supports also extraction of chromatographic data from a m/z-rt slice of the MS data. In the next section we will use this method to create an extracted ion chromatogram (EIC) for a selected peak.

Note that chromatogram reads the raw data from each file to calculate the chromatogram. The bpi and tic methods on the other hand do not read any data from the raw files but use the respective information that was provided in the header definition of the input files (which might be different from the actual data).

Below we create boxplots representing the distribution of total ion currents per file. Such plots can be very useful to spot problematic or failing MS runs.

## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = group_colors[raw_data$sample_group],
        ylab = "intensity", main = "Total ion current")

Also, we can cluster the samples based on similarity of their base peak chromatogram. This can also be helpful to spot potentially problematic samples in an experiment or generally get an initial overview of the sample grouping in the experiment. Since the retention times between samples are not exactly identical, we use the bin function to group intensities in fixed time ranges (bins) along the retention time axis. In the present example we use a bin size of 1 second, the default is 0.5 seconds. The clustering is performed using complete linkage hierarchical clustering on the pairwise correlations of the binned base peak chromatograms.

## Bin the BPC
bpis_bin <- bin(bpis, binSize = 2)

## Calculate correlation on the log2 transformed base peak intensities
cormat <- cor(log2(do.call(cbind, lapply(bpis_bin, intensity))))
colnames(cormat) <- rownames(cormat) <- raw_data$sample_name

## Define which phenodata columns should be highlighted in the plot
ann <- data.frame(group = raw_data$sample_group)
rownames(ann) <- raw_data$sample_name

## Perform the cluster analysis
pheatmap(cormat, annotation = ann,
         annotation_color = list(group = group_colors))

The samples cluster in a pairwise manner, the KO and WT samples for the sample index having the most similar BPC. In addition, 3 larger clusters are present grouping KO and WT samples 21 and 22 together as well as 18 and 19. KO and WT samples 15 and 16 separate from these two clusters.

Chromatographic peak detection

Next we perform the chromatographic peak detection using the centWave algorithm [@Tautenhahn:2008fx]. Before running the peak detection it is however strongly suggested to visually inspect e.g. the extracted ion chromatogram of internal standards or known compounds to evaluate and adapt the peak detection settings since the default settings will not be appropriate for most LCMS experiments. The two most critical parameters for centWave are the peakwidth (expected range of chromatographic peak widths) and ppm (maximum expected deviation of m/z values of centroids corresponding to one chromatographic peak; this is usually much larger than the ppm specified by the manufacturer) parameters. To evaluate the typical chromatographic peak width we plot the EIC for one peak.

## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])

Note that Chromatogram objects extracted by the chromatogram method contain an NA value if in a certain scan (i.e. for a specific retention time) no signal was measured in the respective mz range. This is reflected by the lines not being drawn as continuous lines in the plot above.

The peak above has a width of about 50 seconds. The peakwidth parameter should be set to accommodate the expected widths of peak in the data set. We set it to 20,80 for the present example data set.

For the ppm parameter we extract the full MS data (intensity, retention time and m/z values) corresponding to the above peak. To this end we first filter the raw object by retention time, then by m/z and finally plot the object with type = "XIC" to produce the plot below. We use the pipe (%>%) command better illustrate the corresponding workflow. Note also that in this type of plot identified chromatographic peaks would be indicated by default if present.

raw_data %>%
    filterRt(rt = rtr) %>%
    filterMz(mz = mzr) %>%
    plot(type = "XIC")

In the present data there is actually no variation in the m/z values. Usually one would see the m/z values (lower panel) scatter around the real m/z value of the compound. The first step of the centWave algorithm defines so called regions of interest (ROI) based on the difference of m/z values from consecutive scans. In detail, m/z values from consecutive scans are included into a ROI if the difference between the m/z and the mean m/z of the ROI is smaller than the user defined ppm parameter. A reasonable choice for the ppm could thus be the maximal m/z difference of data points from neighboring scans/spectra that are part of the chromatographic peak. It is suggested to inspect the ranges of m/z values for many compounds (either internal standards or compounds known to be present in the sample) and define the ppm parameter for centWave according to these.

Note that we can also perform the peak detection on the extracted ion chromatogram. This can help to evaluate different peak detection settings. Only be aware that peak detection on an extracted ion chromatogram will not consider the ppm parameter and that the estimation of the background signal is different to the peak detection on the full data set; values for the snthresh will hence have different consequences. Below we perform the peak detection with the findChromPeaks function on the extracted ion chromatogram. The submitted parameter object defines which algorithm will be used and allows to define the settings for this algorithm. We use the centWave algorithm with default settings, except for snthresh.

xchr <- findChromPeaks(chr_raw, param = CentWaveParam(snthresh = 2))

We can access the identified chromatographic peaks with the chromPeaks function.

head(chromPeaks(xchr))

Parallel to the chromPeaks matrix there is also a data frame chromPeakData that allows to add arbitrary annotations to each chromatographic peak. Below we extract this data frame that by default contains only the MS level in which the peak was identified.

chromPeakData(xchr)

Next we plot the identified chromatographic peaks in the extracted ion chromatogram. We use the col parameter to color the individual chromatogram lines. Colors can also be specified for the identified peaks, peakCol for the foreground/border color, peakBg for the background/fill color. One color has to be provided for each chromatographic peak listed by chromPeaks. Below we define a color to indicate the sample group from which the sample is and use the sample information in the peaks' "sample" column to assign the correct color to each chromatographic peak. More peak highlighting options are described further below.

sample_colors <- group_colors[xchr$sample_group]
plot(xchr, col = sample_colors,
     peakBg = sample_colors[chromPeaks(xchr)[, "column"]])

Finally we perform the chromatographic peak detection on the full data set. Note that we set the argument prefilter to c(3, 5000) and noise to 5000 to reduce the run time of this vignette. With this setting we consider only signals with a value larger than 5000 in the peak detection step.

cwp <- CentWaveParam(peakwidth = c(20, 80), noise = 5000,
                     prefilter = c(3, 5000))
xdata <- findChromPeaks(raw_data, param = cwp)

The results are returned as an XCMSnExp object which extends the OnDiskMSnExp object by storing also LC/GC-MS preprocessing results. This means also that all methods to sub-set and filter the data or to access the (raw) data are inherited from the OnDiskMSnExp object and can thus be re-used. Note also that it is possible to perform additional rounds of peak detection (e.g. on MS level > 1 data) on the xdata object by calling findChromPeaks with the parameter add = TRUE.

The results from the chromatographic peak detection can be accessed with the chromPeaks method.

head(chromPeaks(xdata))

The returned matrix provides the m/z and retention time range for each identified chromatographic peak as well as the integrated signal intensity ("into") and the maximal peak intensitity ("maxo"). Columns "sample" contains the index of the sample in the object/experiment in which the peak was identified.

Annotations for each individual peak can be extracted with the chromPeakData function. This data frame could also be used to add/store arbitrary annotations for each detected peak.

chromPeakData(xdata)

Peak detection will not always work perfectly leading to peak detection artifacts, such as overlapping peaks or artificially split peaks. The refineChromPeaks function allows to refine peak detection results by either or removing identified peaks exceeding a maximal allower rt width or by merging artificially split chromatographic peaks. For the former a CleanPeaksParam parameter object can be submitted to the function while the latter can be configured with the MergeNeighboringPeaksParam object. Below we post-process the peak detection results merging peaks overlapping in a 4 second window per file if the signal between in between them is lower than 75% of the smaller peak's maximal intensity. See the MergeNeighboringPeaksParam help page for a detailed description of the settings and the approach.

mpp <- MergeNeighboringPeaksParam(expandRt = 4)
xdata_pp <- refineChromPeaks(xdata, mpp)

An example for a merged peak is given below.

mzr_1 <- 305.1 + c(-0.01, 0.01)
chr_1 <- chromatogram(filterFile(xdata, 1), mz = mzr_1)
chr_2 <- chromatogram(filterFile(xdata_pp, 1), mz = mzr_1)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)

For the first trace in the chromatogram above centWave detected 3 peaks (1 for the full area and two smaller ones, see left panel in the plot above). The peak refinement with MergeNeighboringPeaksParam reduced them to a single peak (right panel in the figure above). Note that this refinement does not merge neighboring peaks for which the signal in between them is lower than a certain proportion (see figure below).

mzr_1 <- 496.2 + c(-0.01, 0.01)
chr_1 <- chromatogram(filterFile(xdata, 1), mz = mzr_1)
chr_2 <- chromatogram(filterFile(xdata_pp, 1), mz = mzr_1)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)

Note also that it is possible to perform the peak refinement on extracted ion chromatograms. This could e.g. be used to fine-tune the settings for the parameter. To illustrate this we perform below a peak refinement on the extracted ion chromatogram chr_1 reducing the minProp parameter to force joining the two peaks.

res <- refineChromPeaks(chr_1, MergeNeighboringPeaksParam(minProp = 0.05))
chromPeaks(res)
plot(res)

Before proceeding we replace the xdata object with the results from the peak refinement.

xdata <- xdata_pp

Below we use the data from the chromPeaks matrix to calculate some per-file summaries.

summary_fun <- function(z)
    c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))

T <- lapply(split.data.frame(
    chromPeaks(xdata), f = chromPeaks(xdata)[, "sample"]),
    FUN = summary_fun)
T <- do.call(rbind, T)
rownames(T) <- basename(fileNames(xdata))
pandoc.table(
    T,
    caption = paste0("Summary statistics on identified chromatographic",
                     " peaks. Shown are number of identified peaks per",
                     " sample and widths/duration of chromatographic ",
                     "peaks."))

We can also plot the location of the identified chromatographic peaks in the m/z - retention time space for one file using the plotChromPeaks function. Below we plot the chromatographic peaks for the 3rd sample.

plotChromPeaks(xdata, file = 3)

To get a global overview of the peak detection we can plot the frequency of identified peaks per file along the retention time axis. This allows to identify time periods along the MS run in which a higher number of peaks was identified and evaluate whether this is consistent across files.

plotChromPeakImage(xdata)

Next we highlight the identified chromatographic peaks for the example peak from before. Evaluating such plots on a list of peaks corresponding to known peaks or internal standards helps to ensure that peak detection settings were appropriate and correctly identified the expected peaks. We extract the ion chromatogram from the peak detection result object, which contains then also the identified chromatographic peaks for that ion that we can extract with the chromPeaks function.

chr_ex <- chromatogram(xdata, mz = mzr, rt = rtr)
chromPeaks(chr_ex)

We can also plot the extracted ion chromatogram. Identified chromatographic peaks will be automatically highlighted in the plot. Below we highlight chromatographic peaks with a rectangle from the peak's minimal to maximal rt and from an intensity of 0 to the maximal signal of the peak.

sample_colors <- group_colors[chr_ex$sample_group]
plot(chr_ex, col = sample_colors, peakType = "rectangle",
     peakCol = sample_colors[chromPeaks(chr_ex)[, "sample"]],
     peakBg = NA)

Alternatively to the rectangle visualization above, it is possible to represent the apex position of each peak with a single point (passing argument type = "point" to the function), or draw the actually identified peak by specifying type = "polygon". To completely omit highlighting the identified peaks (e.g. to plot base peak chromatograms or similar) type = "none" can be used. Below we use type = "polygon" to fill the peak area for each identified chromatographic peak in each sample. Whether individual peaks can be still identified in such a plot depends however on the number of samples from which peaks are drawn.

plot(chr_ex, col = group_colors[chr_raw$sample_group], lwd = 2,
     peakBg = sample_colors[chromPeaks(chr_ex)[, "sample"]])

Note that we can also specifically extract identified chromatographic peaks for a selected region by providing the respective m/z and retention time ranges with the mz and rt arguments in the chromPeaks method.

pander(chromPeaks(xdata, mz = mzr, rt = rtr),
       caption = paste("Identified chromatographic peaks in a selected ",
                       "m/z and retention time range."))

Finally we plot also the distribution of peak intensity per sample. This allows to investigate whether systematic differences in peak signals between samples are present.

## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(xdata)[, "into"]),
              f = chromPeaks(xdata)[, "sample"])
boxplot(ints, varwidth = TRUE, col = group_colors[xdata$sample_group],
        ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)

Alignment

The time at which analytes elute in the chromatography can vary between samples (and even compounds). Such a difference was already observable in the extracted ion chromatogram plot shown as an example in the previous section. The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.

A plethora of alignment algorithms exist (see [@Smith:2013gr]), with some of them being implemented also in xcms. The method to perform the alignment/retention time correction in xcms is adjustRtime which uses different alignment algorithms depending on the provided parameter class.

In the example below we use the obiwarp method [@Prince:2006jj] to align the samples. We use a binSize = 0.6 which creates warping functions in mz bins of 0.6. Also here it is advisable to modify the settings for each experiment and evaluate if retention time correction did align internal controls or known compounds properly.

xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))

adjustRtime, besides calculating adjusted retention times for each spectrum, does also adjust the reported retention times of the identified chromatographic peaks.

To extract the adjusted retention times we can use the adjustedRtime method, or simply the rtime method that, if present, returns by default adjusted retention times from an XCMSnExp object.

## Extract adjusted retention times
head(adjustedRtime(xdata))

## Or simply use the rtime method
head(rtime(xdata))

Raw retention times can be extracted from an XCMSnExp containing aligned data with rtime(xdata, adjusted = FALSE).

To evaluate the impact of the alignment we plot the BPC on the adjusted data. In addition we plot the differences of the adjusted- to the raw retention times per sample using the plotAdjustedRtime function. Note that we don't want to highlight any chromatographic peaks (because there would simply be too many) and set thus peakType = "none" in the plot call.

## Get the base peak chromatograms.
bpis_adj <- chromatogram(xdata, aggregationFun = "max")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis_adj, col = group_colors[bpis_adj$sample_group], peakType = "none")
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])

Too large differences between adjusted and raw retention times could indicate poorly performing samples or alignment.

Alternatively to obiwarp we could use the peak groups alignment method that adjusts the retention time by aligning previously identified hook peaks (chromatographic peaks present in most/all samples). Ideally, these hook peaks should span most part of the retention time range. Below we first restore the raw retention times (also of the identified peaks) using the dropAdjustedRtime methods. Note that a drop* method is available for each preprocessing step allowing to remove the respective results from the XCMSnExp object.

## Does the object have adjusted retention times?
hasAdjustedRtime(xdata)

## Drop the alignment results.
xdata <- dropAdjustedRtime(xdata)

## Does the object have adjusted retention times?
hasAdjustedRtime(xdata)

Note: XCMSnExp objects hold the raw along with the adjusted retention times and subsetting will in most cases drop the adjusted retention times. Sometimes it might thus be useful to replace the raw retention times with the adjusted retention times. This can be done with the applyAdjustedRtime.

As noted above the peak groups method requires peak groups (features) present in most samples to perform the alignment. We thus have to perform a first correspondence run to identify such peaks (details about the algorithm used are presented in the next section). We use here again default settings, but it is strongly advised to adapt the parameters for each data set. The definition of the sample groups (i.e. assignment of individual samples to the sample groups in the experiment) is mandatory for the PeakDensityParam. If there are no sample groups in the experiment sampleGroups should be set to a single value for each file (e.g. rep(1, length(fileNames(xdata))).

## Correspondence: group peaks across samples.
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.8)
xdata <- groupChromPeaks(xdata, param = pdp)

## Now the retention time correction.
pgp <- PeakGroupsParam(minFraction = 0.85)

## Get the peak groups that would be used for alignment.
xdata <- adjustRtime(xdata, param = pgp)

Note also that we could use the adjustRtimePeakGroups method on the object before alignment to evaluate on which features (peak groups) the alignment would be performed. This can be useful to test different settings for the peak groups algorithm. Also, it is possible to manually select or define certain peak groups (i.e. their retention times per sample) and provide this matrix to the PeakGroupsParam class with the peakGroupsMatrix argument.

Below plot the difference between raw and adjusted retention times using the plotAdjustedRtime function, which, if the peak groups method is used for alignment, also highlights the peak groups used in the adjustment.

## Plot the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group],
                  peakGroupsCol = "grey", peakGroupsPch = 1)

At last we evaluate the impact of the alignment on the test peak.

par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = group_colors[chr_raw$sample_group])

## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(xdata, rt = rtr, mz = mzr)
plot(chr_adj, col = group_colors[chr_raw$sample_group], peakType = "none")

Subset-based alignment

In some experiments it might be helpful to perform the alignment based on only a subset of the samples, e.g. if QC samples were injected at regular intervals or if the experiment contains blanks. Alignment method in xcms allow to estimate retention time drifts on a subset of samples (either all samples excluding blanks or QC samples injected at regular intervals during a measurement run) and use these to adjust the full data set.

Parameters subset (of the PeakGroupsParam or ObiwarpParam object) can be used to define the subset of samples on which the alignment of the full data set will be based (e.g. subset being the index of QC samples), and parameter subsetAdjust allows to specify the method by which the left-out samples will be adjusted. There are currently two options available:

subsetAdjust = "previous": adjust the retention times of a non-subset sample based on the alignment results of the previous subset sample (e.g. a QC sample). If samples are e.g. in the order A1, B1, B2, A2, B3, B4 with A representing QC samples and B study samples, using subset = c(1, 4) and subsetAdjust = "previous" would result in all A samples to be aligned with each other and non-subset samples B1 and B2 being adjusted based on the alignment result of subset samples A1 and B3 and B4 on those of A2.
subsetAdjust = "average": adjust retention times of non-subset samples based on an interpolation of the alignment results of the previous and subsequent subset sample. In the example above, B1 would be adjusted based on the average of adjusted retention times between subset (QC) samples A1 and A2. Since there is no subset sample after non-subset samples B3 and B4 these will be adjusted based on the alignment results of A2 alone. Note that a weighted average is used to calculate the adjusted retention time averages, which uses the inverse of the difference of the index of the non-subset sample to the subset samples as weights. Thus, if we have a setup like A1, B1, B2, A2 the adjusted retention times of A1 would get a larger weight than those of A2 in the adjustment of non-subset sample B1 causing it's adjusted retention times to be closer to those of A1 than to A2. See below for examples.

Both cases require a meaningful/correct ordering of the samples within the object (e.g. ordering by injection index).

The examples below aim to illustrate the effect of these alignment options. First we assume samples 1, 6 and 11 in the faahKO data set to be QC samples (sample pools). We thus want to perform the alignment based on these samples and subsequently adjust the retention times of the left-out samples (2, 3, 4, 5, 7, 8, 9, 10 and 12) based on interpolation of the results from the neighboring subset samples. After initial peak grouping we perform below the alignment with the peak groups method passing the indices of the samples on which we want the alignment to be based with the subset argument and specify subsetAdjust = "average" to adjust the study samples based on interpolation of the alignment results from neighboring subset/QC samples.

Note that for any subset-alignment all parameters such as minFraction are relative to the subset, not the full experiment!

xdata_subs <- dropAdjustedRtime(xdata)
## Define the experimental layout
xdata_subs$sample_type <- "study"
xdata_subs$sample_type[c(1, 6, 11)] <- "QC"

## Initial peak grouping. Use sample_type as grouping variable
pdp_subs <- PeakDensityParam(sampleGroups = xdata_subs$sample_type,
                             minFraction = 0.9)
xdata_subs <- groupChromPeaks(xdata_subs, param = pdp_subs)

## Define subset-alignment options and perform the alignment
pgp_subs <- PeakGroupsParam(minFraction = 0.85,
                            subset = which(xdata_subs$sample_type == "QC"),
                            subsetAdjust = "average", span = 0.4)
xdata_subs <- adjustRtime(xdata_subs, param = pgp_subs)

To illustrate the alignment results we plot the results only for samples 1 to 6 (1 and 6 being QC samples and 2 to 5 study samples that were adjusted based on the QC samples). This nicely shows how the interpolation of the subsetAdjust = "average" works: retention times of sample 2 are adjusted based on those from subset sample 1 and 6, giving however more weight to the closer subset sample 1 which results in the adjusted retention times of 2 being more similar to those of sample 1. Sample 5 on the other hand gets adjusted giving more weight to the second subset sample (6).

clrs <- rep(NA, 12)
clrs[1:6] <- paste0(brewer.pal(6, "Dark2"), 80)

plotAdjustedRtime(xdata_subs, col = clrs, lwd = 2, peakGroupsCol = NA)
legend("topleft", legend = 1:6, col = brewer.pal(6, "Dark2"), lwd = 2)

We also plot the full alignment results, drawing the subset samples as green and study samples as grey lines.

clrs <- rep("#00000040", 12)
clrs[xdata_subs$sample_type == "QC"] <- c("#00ce0080")
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(xdata_subs, aggregationFun = "sum"),
     col = clrs, peakType = "none")
plotAdjustedRtime(xdata_subs, col = clrs, peakGroupsPch = 1,
                  peakGroupsCol = "#00ce0040")

Option subsetAdjust = "previous" adjusts the retention times of a non-subset sample based on a single subset sample (the previous), which results in most cases in the adjusted retention times of the non-subset sample being highly similar to those of the subset sample which was used for adjustment.

Also obiwarp alignment supports the same types of subset-alignment as shown below.

xdata_subs <- dropAdjustedRtime(xdata_subs)
prm <- ObiwarpParam(subset = which(xdata_subs$sample_type == "QC"),
                    subsetAdjust = "average")
xdata_subs <- adjustRtime(xdata_subs, param = prm)

We again plot the results.

clrs <- rep("#00000040", 12)
clrs[which(xdata_subs$sample_type == "QC")] <- "#00ce0080"
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(xdata_subs, aggregationFun = "sum"),
     col = clrs, peakType = "none")
plotAdjustedRtime(xdata_subs, col = clrs, peakGroupsPch = 1)

Correspondence

The final step in the metabolomics preprocessing is the correspondence that matches detected chromatographic peaks between samples (and depending on the settings, also within samples if they are adjacent). The method to perform the correspondence in xcms is groupChromPeaks. We will use the peak density method [@Smith:2006ic] to group chromatographic peaks. The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis within small slices along the mz dimension. To illustrate this we plot below the chromatogram for an mz slice with multiple chromatographic peaks within each sample. We use below a value of 0.4 for the minFraction parameter hence only chromatographic peaks present in at least 40% of the samples per sample group are grouped into a feature. The sample group assignment is specified with the sampleGroups argument.

## Define the mz slice.
mzr <- c(305.05, 305.15)

## Extract and plot the chromatograms
chr_mzr <- chromatogram(xdata, mz = mzr, rt = c(2500, 4000))
par(mfrow = c(3, 1), mar = c(1, 4, 1, 0.5))
cols <- group_colors[chr_mzr$sample_group]
plot(chr_mzr, col = cols, xaxt = "n", xlab = "",
     peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]])
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 30)
par(mar = c(4, 4, 1, 0.5))
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
                     pch = 16, xlim = c(2500, 4000))
## Use a different bw
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 20)
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
                     pch = 16, xlim = c(2500, 4000))

The upper panel in the plot above shows the extracted ion chromatogram for each sample with the detected peaks highlighted. The middle and lower plot shows the retention time for each detected peak within the different samples. The black solid line represents the density distribution of detected peaks along the retention times. Peaks combined into features (peak groups) are indicated with grey rectangles. Different values for the bw parameter of the PeakDensityParam were used:bw = 30 in the middle and bw = 20 in the lower panel. With the default value for the parameter bw the two neighboring chromatographic peaks would be grouped into the same feature, while with a bw of 20 they would be grouped into separate features. This grouping depends on the parameters for the density function and other parameters passed to the algorithm with the PeakDensityParam.

## Perform the correspondence
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 20)
xdata <- groupChromPeaks(xdata, param = pdp)

Results from the xcms-based preprocessing can be summarized into a SummarizedExperiment object from the r Biocpkg("SummarizedExperiment") package with the quantify method. This object will contain the feature abundances as the assay matrix, the feature definition (their m/z, retention time and other metadata) as rowData (i.e. row annotations) and the sample/phenotype information as colData (i.e. column annotations). All the processing history will be put into the object's metadata. This object can then be used for any further (xcms-independent) processing and analysis.

Below we use quantify to generate the result object for the present analysis. The parameters value and any other additional parameters are passed along to the featureValues method that is used internally to create the feature abundance matrix.

res <- quantify(xdata, value = "into")

Sample annotations can be accessed with the colData method.

colData(res)

Feature annotations with rowData:

rowData(res)

The feature abundances can be accessed with the assay method. Note also that a SummarizedExperiment supports multiple such assay matrices.

head(assay(res))

In addition it is possible to extract the results from the correspondence analysis individually using the featureDefinitions and featureValues methods, that former returning a DataFrame with the definition of the features (i.e. the mz and rt ranges and, in column peakidx, the index of the chromatographic peaks in the chromPeaks matrix for each feature), the latter the feature abundances.

## Extract the feature definitions
featureDefinitions(xdata)

The featureValues method returns a matrix with rows being features and columns samples. The content of this matrix can be defined using the value argument. The default isvalue = "index" which simply returns the index of the peak (in the chromPeaks matrix) assigned to a feature in a sample. Setting value = "into" returns a matrix with the integrated signal of the peaks corresponding to a feature in a sample. Any column name of the chromPeaks matrix can be passed to the argument value. Below we extract the integrated peak intensity per feature/sample.

## Extract the into column for each feature.
head(featureValues(xdata, value = "into"))

This feature matrix contains NA for samples in which no chromatographic peak was detected in the feature's m/z-rt region. While in many cases there might indeed be no peak signal in the respective region, it might also be that there is signal, but the peak detection algorithm failed to detect a chromatographic peak. xcms provides the fillChromPeaks method to fill in intensity data for such missing values from the original files. The filled in peaks are added to the chromPeaks matrix and get a TRUE value in the "is_filled" column of the chromPeakData data frame. Below we perform such a filling-in of missing peaks.

## Filling missing peaks using default settings. Alternatively we could
## pass a FillChromPeaksParam object to the method.
xdata <- fillChromPeaks(xdata)

head(featureValues(xdata))

For features without detected peaks in a sample, the method extracts all intensities in the mz-rt region of the feature, integrates the signal and adds a filled-in peak to the chromPeaks matrix. No peak is added if no signal is measured/available for the mz-rt region of the feature. For these, even after filling in missing peak data, a NA is reported in the featureValues matrix.

It should be mentioned that fillChromPeaks uses columns "rtmin", "rtmax", "mzmin" and "mzmax" from the featureDefinitions table to define the region from which the signal should be integrated to generate the filled-in peak signal. These values correspond however to the positions of the peak apex not the peak boundaries of all chromatographic peaks assigned to a feature. It might be advisable to increase this area in retention time dimension by a constant value appropriate to the average peak width in the experiment. Such a value can be specified with fixedRt of the FillChromPeaksParam. If the average peak width in the experiment is 20 seconds, specifyingfixedRt = 10 ensures that the area from which peaks are integrated is at least 20 seconds wide.

Below we compare the number of missing values before and after filling in missing values. We can use the parameter filled of the featureValues method to define whether or not filled-in peak values should be returned too.

## Missing values before filling in peaks
apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
      FUN = function(z) sum(is.na(z)))

## Missing values after filling in peaks
apply(featureValues(xdata), MARGIN = 2,
      FUN = function(z) sum(is.na(z)))

Next we use the featureSummary function to get a general per-feature summary that includes the number of samples in which a peak was found or the number of samples in which more than one peak was assigned to the feature. Specifying also sample groups breaks down these summary statistics for each individual sample group.

head(featureSummary(xdata, group = xdata$sample_group))

The performance of peak detection, alignment and correspondence should always be evaluated by inspecting extracted ion chromatograms e.g. of known compounds, internal standards or identified features in general. The featureChromatograms function allows to extract chromatograms for each feature present in featureDefinitions. The returned Chromatograms object contains an ion chromatogram for each feature (each row containing the data for one feature) and sample (each column representing containing data for one sample). Below we extract the chromatograms for the first 4 features.

feature_chroms <- featureChromatograms(xdata, features = 1:4)

feature_chroms

And plot the extracted ion chromatograms. We again use the group color for each identified peak to fill the area.

plot(feature_chroms, col = sample_colors,
     peakBg = sample_colors[chromPeaks(feature_chroms)[, "sample"]])

third <- feature_chroms[3, ]
plot(third)
chromPeaks(third)


plot(feature_chroms[3, ], col = group_colors[feature_chroms$sample_group])
plot(feature_chroms[4, ], col = group_colors[feature_chroms$sample_group])

At last we perform a principal component analysis to evaluate the grouping of the samples in this experiment. Note that we did not perform any data normalization hence the grouping might (and will) also be influenced by technical biases.

## Extract the features and log2 transform them
ft_ints <- log2(featureValues(xdata, value = "into"))

## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)

## Plot the PCA
cols <- group_colors[xdata$sample_group]
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
     xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
                                   digits = 3), " % variance"),
     ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
                                   digits = 3), " % variance"),
     col = "darkgrey", bg = cols, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = xdata$sample_name, col = "darkgrey",
     pos = 3, cex = 2)

We can see the expected separation between the KO and WT samples on PC2. On PC1 samples separate based on their ID, samples with an ID <= 18 from samples with an ID > 18. This separation might be caused by a technical bias (e.g. measurements performed on different days/weeks) or due to biological properties of the mice analyzed (sex, age, litter mates etc).

Further data processing and analysis

Normalizing features' signal intensities is required, but at present not (yet) supported in xcms (some methods might be added in near future). It is advised to use the SummarizedExperiment returned by the quantify method for any further data processing, as this type of object stores feature definitions, sample annotations as well as feature abundances in the same object. For the identification of e.g. features with significant different intensities/abundances it is suggested to use functionality provided in other R packages, such as Bioconductor's excellent limma package. To enable support also for other packages that rely on the old xcmsSet result object, it is possible to coerce the new XCMSnExp object to an xcmsSet object using xset <- as(x, "xcmsSet"), with x being an XCMSnExp object.

Additional details and notes

For a detailed description of the new data objects and changes/improvements compared to the original user interface see the new_functionality vignette.

Evaluating the process history

XCMSnExp objects allow to capture all performed pre-processing steps along with the used parameter class within the @processHistory slot. Storing also the parameter class ensures the highest possible degree of analysis documentation and in future might enable to replay analyses or parts of it. The list of all performed preprocessings can be extracted using the processHistory method.

processHistory(xdata)

It is also possible to extract specific processing steps by specifying its type. Available types can be listed with the processHistoryTypes function. Below we extract the parameter class for the alignment/retention time adjustment step.

ph <- processHistory(xdata, type = "Retention time correction")

ph

And we can also extract the parameter class used in this preprocessing step.

## Access the parameter
processParam(ph[[1]])

Subsetting and filtering

XCMSnEx objects can be subsetted/filtered using the [ method, or one of the many filter* methods. All these methods aim to ensure that the data in the returned object is consistent. This means for example that if the object is subsetted by selecting specific spectra (by using the [ method) all identified chromatographic peaks are removed. Correspondence results (i.e. identified features) are removed if the object is subsetted to contain only data from selected files (using the filterFile method). This is because the correspondence results depend on the files on which the analysis was performed - running a correspondence on a subset of the files would lead to different results.

As an exception, it is possible to force keeping adjusted retention times in the subsetted object setting the keepAdjustedRtime argument to TRUE in any of the subsetting methods.

Below we subset our results object the data for the files 2 and 4.

subs <- filterFile(xdata, file = c(2, 4))

## Do we have identified chromatographic peaks?
hasChromPeaks(subs)

Peak detection is performed separately on each file, thus the subsetted object contains all identified chromatographic peaks from the two files. However, we used a retention time adjustment (alignment) that was based on available features. All features have however been removed and also the adjusted retention times (since the alignment based on features that were identified on chromatographic peaks on all files).

## Do we still have features?
hasFeatures(subs)

## Do we still have adjusted retention times?
hasAdjustedRtime(subs)

We can however use the keepAdjustedRtime argument to force keeping the adjusted retention times.

subs <- filterFile(xdata, keepAdjustedRtime = TRUE)

hasAdjustedRtime(subs)

The filterRt method can be used to subset the object to spectra within a certain retention time range.

subs <- filterRt(xdata, rt = c(3000, 3500))

range(rtime(subs))

Filtering by retention time does not change/affect adjusted retention times (also, if adjusted retention times are present, the filtering is performed on the adjusted retention times).

hasAdjustedRtime(subs)

Also, we have all identified chromatographic peaks within the specified retention time range:

hasChromPeaks(subs)

range(chromPeaks(subs)[, "rt"])

The most natural way to subset any object in R is with [. Using [ on an XCMSnExp object subsets it keeping only the selected spectra. The index i used in [ has thus to be an integer between 1 and the total number of spectra (across all files). Below we subset xdata using both [ and filterFile to keep all spectra from one file.

## Extract all data from the 3rd file.
one_file <- filterFile(xdata, file = 3)

one_file_2 <- xdata[fromFile(xdata) == 3]

## Is the content the same?
all.equal(spectra(one_file), spectra(one_file_2))

While the spectra-content is the same in both objects, one_file contains also the identified chromatographic peaks while one_file_2 does not. Thus, in most situations subsetting using one of the filter functions is preferred over the use of [.

## Subsetting with filterFile preserves chromatographic peaks
head(chromPeaks(one_file))

## Subsetting with [ not
head(chromPeaks(one_file_2))

Note however that also [ does support the keepAdjustedRtime argument. Below we subset the object to spectra 20:30.

subs <- xdata[20:30, keepAdjustedRtime = TRUE]

hasAdjustedRtime(subs)

## Access adjusted retention times:
rtime(subs)

## Access raw retention times:
rtime(subs, adjusted = FALSE)

As with MSnExp and OnDiskMSnExp objects, [[ can be used to extract a single spectrum object from an XCMSnExp object. The retention time of the spectrum corresponds to the adjusted retention time if present.

## Extract a single spectrum
xdata[[14]]

At last we can also use the split method that allows to split an XCMSnExp based on a provided factor f. Below we split xdata per file. Using keepAdjustedRtime = TRUE ensures that adjusted retention times are not removed.

x_list <- split(xdata, f = fromFile(xdata), keepAdjustedRtime = TRUE)

lengths(x_list)

lapply(x_list, hasAdjustedRtime)

Note however that there is also a dedicated splitByFile method instead for that operation, that internally uses filterFile and hence does e.g. not remove identified chromatographic peaks. The method does not yet support the keepAdjustedRtime parameter and thus removes by default adjusted retention times.

xdata_by_file <- splitByFile(xdata, f = factor(1:length(fileNames(xdata))))

lapply(xdata_by_file, hasChromPeaks)

Parallel processing

Most methods in xcms support parallel processing. Parallel processing is handled and configured by the BiocParallel Bioconductor package and can be globally defined for an R session.

Unix-based systems (Linux, macOS) support multicore-based parallel processing. To configure it globally we register the parameter class. Note also that bpstart is used below to initialize the parallel processes.

register(bpstart(MulticoreParam(2)))

Windows supports only socket-based parallel processing:

register(bpstart(SnowParam(2)))

Note that multicore-based parallel processing might be buggy or failing on macOS. If so, the DoparParam could be used instead (requiring the foreach package).

For other options and details see the vignettes from the BiocParallel package.

References

yclement/xcms documentation built on April 10, 2020, 12:08 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

yclement/xcms
LC-MS and GC-MS Data Analysis

In yclement/xcms: LC-MS and GC-MS Data Analysis

Introduction

Data import

Initial data inspection

Chromatographic peak detection

Alignment

Subset-based alignment

Correspondence

Further data processing and analysis

Additional details and notes

Evaluating the process history

Subsetting and filtering

Parallel processing

References

R Package Documentation

Browse R Packages

We want your feedback!

yclement/xcms LC-MS and GC-MS Data Analysis

In yclement/xcms: LC-MS and GC-MS Data Analysis

Introduction

Data import

Initial data inspection

Chromatographic peak detection

Alignment

Subset-based alignment

Correspondence

Further data processing and analysis

Additional details and notes

Evaluating the process history

Subsetting and filtering

Parallel processing

References

R Package Documentation

Browse R Packages

We want your feedback!

yclement/xcms
LC-MS and GC-MS Data Analysis