This vignette demonstrates the use of the functions included in the dipps package for reading in peaklists as generated by Bruker software for MSI data, and performing simple analyses on peakpicked MSI data such as peak-grouping, investigating the quality of peak-grouping using heuristics with the help of the plyr package, calculations of DIPPS, and making relevant plots using the ggplot2 package.

library(dipps)
library(plyr)
library(ggplot2)

It seemed appropriate to use a full dataset as a worked example in order to write a vignette, but including a full MSI dataset in the dipps package seems inappropriate, as it would be large and would cause installing the package to be slow. As such this vignette is also available with a full dataset as the worked example from my github. Here I use a small subset of that dataset that is included with the dipps package for testing purposes. Hence several parts of this vignette will be slightly different, such as the code for reading in the data, for selecting a subset for DIPPS, and for selecting tuning parameters for peak-grouping.

This vignette is seperated into four main sections:

Reading Peaklists (from the output of Bruker software)

combine_peaklists is the relevent function here, reading in all the peaklists in <folname>/peaklists/ and writing two csv files in the current working directory, a Peak-list and a Spectra-list.

i.path = system.file("extdata", "test1", package = "dipps")
n.empt = combine_peaklists(i.path)
print(paste("Number of empty spectra:", n.empt))

i.path is specified here using system.file to acquire the absolute path to these test data included with package dipps, but in general this would usually be specified using file.path. Additional arguments can be passed to combine_peaklists to modify locations of inputs and outputs, see the documentation, ?combine_peaklists. The outputs can be read in using the helper functions load_*:

o.name = basename(i.path)
pl = load_peaklist(o.name)
sl = load_speclist(o.name)

o.name can be supplied to combine_peaklists in which case it would be used to name the output files thereof, and should be supplied to the load_* functions here. When o.name is not supplied to combine_peaklists, it defaults to basename(i.path), which is why that is how I specify o.name here.

Peakgrouping

Usually one of the first things you will want to do with such data is to produce peak-groups of some kind -- grouping peaks with similar mass to identify them with each other. This package supplies two functions for this purpose:

Extract Masses

Most simply, extract_masses can be used to simply extract peaks near known masses of interest, in this dataset for example, m/z = 1570.677 is a mass of interest, and so we could extract it as:

df.cal = extract_masses(pl, 1570.677, margin = 0.1)

This returns df.cal as a simple subset of pl, with an additional column called group filled with just the value 1570.677. So the above call to extract_masses is equivalent to df.cal = transform(subset(pl, abs(m.z - 1570.677) < 0.1), group = 1570.677). extract masses is provided as a convenience function as the second arguemnt provided, masses, can be a vector of masses, and the margin specified can also be provided in ppm -- see the documentation, ?extract_masses.

DBSCAN

As mentioned in the documentation, dbscan provides a univariate optimisation of the DBSCAN* algorithm of section 3 of Campello et al. (2013). This is a density based clustering algorithm that, breifly, can be thought of as two steps performed in sequence, first points with insufficient density are removed, and second the remaining points are grouped as per the equivelance classes induced by an equivalence relation. In this context density is measured as the number of points within a certain eps, and the equivalence relation used is "are within eps of each other". This second step, without the first, is often called tolerance clustering and is often sufficient -- this can be acheived by specifing a minimum density of 1:

pl$tol_clus = dbscan(pl$m.z, eps = 0.1, mnpts = 1)

But more generally the minimum density, mnpts, will need to be adjusted based on the dataset, usually proportionally to the size of the dataset but also depending on other factors. Similarly eps will depend on the dataset, and is directly related to mnpts -- the same value of mnpts is a more stringent density cutoff for a smaller value of eps. This test dataset is quite small, so a small value of mnpts is appropriate, such as for example:

pl$dbs_clus = dbscan(pl$m.z, eps = 0.1, mnpts = 10)

Heuristics for diagnosis of peak-grouping.

Here I'll show how to construct a plot I often use for heuristically diagnosing the quality of the performed peakpicking, and to adjust the tuning parameter selection to get a peak-grouping i am happy with. First, a couple of helper functions for calculating peak-group summaries, and generating the plot to which I alluded respectively:

summarise_peakgroups = function(pl, var) {
  return(ddply(pl, var, summarise,
               AWM = weighted.mean(m.z, log1p(intensity)),
               Centre = (max(m.z) + min(m.z)) / 2,
               Range = max(m.z) - min(m.z),
               nPeaks = length(m.z),
               nDupPeaks = (length(m.z) - length(unique(Acq)))))
}
diag_plot = function(pgs, t) {
  return(ggplot(pgs, aes(x=Range, y=nDupPeaks))
         + ggtitle(t)
         + geom_jitter(width = 0, height = 0.4) 
         + ylab("# Duplicate Peaks"))
}

Given these functions, a diagnostic plot for the two clusterings performed in the above section can then be generated like so:

pgs_tol <- summarise_peakgroups(pl, "tol_clus")
pgs_dbs <- summarise_peakgroups(pl, "dbs_clus")

print(diag_plot(subset(pgs_tol, tol_clus != 0), "# peaks unassigned: 0"))
print(diag_plot(subset(pgs_dbs, dbs_clus != 0), 
                paste("# peaks unassigned:", 
                      subset(pgs_dbs, dbs_clus == 0)$nPeaks)))

Duplicate peaks here refers to multiple peaks from the same spectrum occuring in the same peak-group. Usually, we would expect this not to happen if the peak-picking algorithm is working well (although sometimes the peak-picking can produce weird artefacts). Either way, number of duplicate peaks should be zero or if that is not possible, then it should be as small as possible (and noted, as these duplicates will likely have to be dealt with somehow later in the processing pipeline, either by averaging their intensities or by some other method, depending on the outcome being pursued).

I jitter the points in these plots vertically by +/- 0.2 in order to allow a visual que for the density of these otherwise overlapping points.

Note that these examples look pretty good -- with zero duplicate peaks and relatively low range, with the maximum less than 0.3 in both cases and the mode density of range aroudn 0.05 or so, which for a MALDI TOF experiement on peptides (which these data are) is entirely acceptable. This is mostly just because a small subset of the dataset is being used, and for realistic examples acheiving such good peakgrouping can be more difficult. I repeat this vignette but for a complete version of these data, and that is available from my github.

DIPPS

DIPPS can be calculated for a subset of spectra using the dipps function. In this case we select the subset of spectra by setting an arbitrary Y-coordinate cutoff, and we will use the equivalence class clustering produced above to identify the variables:

y_cutoff = 170
sl$subset = sl$Y >= y_cutoff
pl = merge(pl, sl[, c("Acq", "subset")])

dipps.df = dipps(pl$Acq, pl$tol_clus, pl$subset)
print(head(dipps.df), digits = 2, row.names = FALSE)

From the output of dipps we can see that cluster (or in this case equivalence class) number r dipps.df[1, "var"] occurs in r round(dipps.df[1, "p.d"], 2) of spectra with Y-coordinate less than r y_cutoff, and in r round(dipps.df[1, "p.u"], 2) of remaining spectra, giving it the highest DIPPS of r round(dipps.df[1, "d"], 2). The peak-group summary we made in the previous section is a conviniant way to look at the details of this peakgroup:

subset(pgs_tol, tol_clus == dipps.df[1, "var"])

The main observation being that the m/z window for this cluster (or in this case, equivalence class) is r round(subset(pgs_tol, tol_clus == dipps.df[1, "var"])$Centre, 3) +/- r round(subset(pgs_tol, tol_clus == dipps.df[1, "var"])$Range / 2, 3) with most of the intense peaks grouping nearer to the AWM of r round(subset(pgs_tol, tol_clus == dipps.df[1, "var"])$AWM, 3). If we wanted a better idea of the m/z distribution within this cluster, we could make a histogram:

hist(subset(pl, tol_clus == dipps.df[1, "var"])$m.z, xlab = "m/z", 
     main = paste("Cluster", dipps.df[1, "var"]))

Spatial Plotting

We could plot these occurrences, or their intensities, spatially using the spatial_plot function:

pl.sub = subset(pl, tol_clus == dipps.df[1, "var"])
pl.sub$occurrence = 1
pl.sub$log.intensity = log1p(pl.sub$intensity)
print(spatial_plot(pl.sub, sl, plot.var = "occurrence", 
                   return.plot = TRUE, print.plot = FALSE))
print(spatial_plot(pl.sub, sl, plot.var = "log.intensity", 
                   return.plot = TRUE, print.plot = FALSE))
unlink("*_peaklist.txt")
unlink("*_speclist.txt")


Armadilloa16/dipps documentation built on May 5, 2019, 7:06 a.m.