This vignette demonstrates the use of the functions included in the dipps
package for reading in peaklists as generated by Bruker software for MSI data,
and performing simple analyses on peakpicked MSI data such as peak-grouping,
investigating the quality of peak-grouping using heuristics with the help of the
plyr
package, calculations of DIPPS, and making relevant plots using the
ggplot2
package.
library(dipps) library(plyr) library(ggplot2)
It seemed appropriate to use a full dataset as a worked example in order to
write a vignette, but including a full MSI dataset in the dipps
package seems
inappropriate, as it would be large and would cause installing the package to be
slow. As such this vignette is also available with a full dataset as the worked
example from my github. Here I
use a small subset of that dataset that is included with the dipps
package for
testing purposes. Hence several parts of this vignette will be slightly
different, such as the code for reading in the data, for selecting a subset for
DIPPS, and for selecting tuning parameters for peak-grouping.
This vignette is seperated into four main sections:
combine_peaklists
function to read in
peaklists from Bruker peaklist format. If your peaks are in a different format
they could be read in in another way, massaged into the appropriate format,
and everything beyond this section would still apply.dipps
function, and how
to interpret the output thereof.combine_peaklists
is the relevent function here, reading in all the peaklists
in <folname>/peaklists/
and writing two csv files in the current working
directory, a Peak-list and a Spectra-list.
i.path = system.file("extdata", "test1", package = "dipps") n.empt = combine_peaklists(i.path) print(paste("Number of empty spectra:", n.empt))
i.path
is specified here using system.file
to acquire the absolute path to
these test data included with package dipps
, but in general this would usually
be specified using file.path
. Additional arguments can be passed to
combine_peaklists
to modify locations of inputs and outputs, see the
documentation, ?combine_peaklists
. The outputs can be read in using the helper
functions load_*
:
o.name = basename(i.path) pl = load_peaklist(o.name) sl = load_speclist(o.name)
o.name
can be supplied to combine_peaklists
in which case it would be used
to name the output files thereof, and should be supplied to the load_*
functions here. When o.name
is not supplied to combine_peaklists
, it
defaults to basename(i.path)
, which is why that is how I specify o.name
here.
Usually one of the first things you will want to do with such data is to produce peak-groups of some kind -- grouping peaks with similar mass to identify them with each other. This package supplies two functions for this purpose:
extract_masses
, anddbscan
.Most simply, extract_masses
can be used to simply extract peaks near known
masses of interest, in this dataset for example, m/z = 1570.677 is a mass of
interest, and so we could extract it as:
df.cal = extract_masses(pl, 1570.677, margin = 0.1)
This returns df.cal
as a simple subset of pl
, with an additional column
called group
filled with just the value 1570.677
. So the above call to
extract_masses
is equivalent to
df.cal = transform(subset(pl, abs(m.z - 1570.677) < 0.1), group = 1570.677)
.
extract masses
is provided as a convenience function as the second arguemnt
provided, masses
, can be a vector of masses, and the margin
specified can
also be provided in ppm -- see the documentation, ?extract_masses
.
As mentioned in the documentation, dbscan
provides a univariate optimisation
of the DBSCAN* algorithm of section 3 of Campello et al. (2013). This is a
density based clustering algorithm that, breifly, can be thought of as two
steps performed in sequence, first points with insufficient density are removed,
and second the remaining points are grouped as per the equivelance classes
induced by an equivalence relation. In this context density is measured as the
number of points within a certain eps
, and the equivalence relation used is
"are within eps
of each other". This second step, without the first, is often
called tolerance clustering and is often sufficient -- this can be acheived by
specifing a minimum density of 1:
pl$tol_clus = dbscan(pl$m.z, eps = 0.1, mnpts = 1)
But more generally the minimum density, mnpts
, will need to be adjusted based
on the dataset, usually proportionally to the size of the dataset but also
depending on other factors. Similarly eps
will depend on the dataset, and is
directly related to mnpts
-- the same value of mnpts
is a more stringent
density cutoff for a smaller value of eps
. This test dataset is quite small,
so a small value of mnpts
is appropriate, such as for example:
pl$dbs_clus = dbscan(pl$m.z, eps = 0.1, mnpts = 10)
Here I'll show how to construct a plot I often use for heuristically diagnosing the quality of the performed peakpicking, and to adjust the tuning parameter selection to get a peak-grouping i am happy with. First, a couple of helper functions for calculating peak-group summaries, and generating the plot to which I alluded respectively:
summarise_peakgroups = function(pl, var) { return(ddply(pl, var, summarise, AWM = weighted.mean(m.z, log1p(intensity)), Centre = (max(m.z) + min(m.z)) / 2, Range = max(m.z) - min(m.z), nPeaks = length(m.z), nDupPeaks = (length(m.z) - length(unique(Acq))))) } diag_plot = function(pgs, t) { return(ggplot(pgs, aes(x=Range, y=nDupPeaks)) + ggtitle(t) + geom_jitter(width = 0, height = 0.4) + ylab("# Duplicate Peaks")) }
Given these functions, a diagnostic plot for the two clusterings performed in the above section can then be generated like so:
pgs_tol <- summarise_peakgroups(pl, "tol_clus") pgs_dbs <- summarise_peakgroups(pl, "dbs_clus") print(diag_plot(subset(pgs_tol, tol_clus != 0), "# peaks unassigned: 0")) print(diag_plot(subset(pgs_dbs, dbs_clus != 0), paste("# peaks unassigned:", subset(pgs_dbs, dbs_clus == 0)$nPeaks)))
Duplicate peaks here refers to multiple peaks from the same spectrum occuring in the same peak-group. Usually, we would expect this not to happen if the peak-picking algorithm is working well (although sometimes the peak-picking can produce weird artefacts). Either way, number of duplicate peaks should be zero or if that is not possible, then it should be as small as possible (and noted, as these duplicates will likely have to be dealt with somehow later in the processing pipeline, either by averaging their intensities or by some other method, depending on the outcome being pursued).
I jitter the points in these plots vertically by +/- 0.2 in order to allow a visual que for the density of these otherwise overlapping points.
Note that these examples look pretty good -- with zero duplicate peaks and relatively low range, with the maximum less than 0.3 in both cases and the mode density of range aroudn 0.05 or so, which for a MALDI TOF experiement on peptides (which these data are) is entirely acceptable. This is mostly just because a small subset of the dataset is being used, and for realistic examples acheiving such good peakgrouping can be more difficult. I repeat this vignette but for a complete version of these data, and that is available from my github.
DIPPS can be calculated for a subset of spectra using the dipps
function. In
this case we select the subset of spectra by setting an arbitrary Y-coordinate
cutoff, and we will use the equivalence class clustering produced above to
identify the variables:
y_cutoff = 170 sl$subset = sl$Y >= y_cutoff pl = merge(pl, sl[, c("Acq", "subset")]) dipps.df = dipps(pl$Acq, pl$tol_clus, pl$subset) print(head(dipps.df), digits = 2, row.names = FALSE)
From the output of dipps
we can see that cluster (or in this case equivalence
class) number r dipps.df[1, "var"]
occurs in r round(dipps.df[1, "p.d"], 2)
of spectra with Y-coordinate less than r y_cutoff
, and in
r round(dipps.df[1, "p.u"], 2)
of remaining spectra, giving it the highest
DIPPS of r round(dipps.df[1, "d"], 2)
. The peak-group summary we made in the
previous section is a conviniant way to look at the details of this peakgroup:
subset(pgs_tol, tol_clus == dipps.df[1, "var"])
The main observation being that the m/z window for this cluster (or in this
case, equivalence class) is
r round(subset(pgs_tol, tol_clus == dipps.df[1, "var"])$Centre, 3)
+/-
r round(subset(pgs_tol, tol_clus == dipps.df[1, "var"])$Range / 2, 3)
with
most of the intense peaks grouping nearer to the AWM of
r round(subset(pgs_tol, tol_clus == dipps.df[1, "var"])$AWM, 3)
.
If we wanted a better idea of the m/z distribution within this cluster, we
could make a histogram:
hist(subset(pl, tol_clus == dipps.df[1, "var"])$m.z, xlab = "m/z", main = paste("Cluster", dipps.df[1, "var"]))
We could plot these occurrences, or their intensities, spatially using the
spatial_plot
function:
pl.sub = subset(pl, tol_clus == dipps.df[1, "var"]) pl.sub$occurrence = 1 pl.sub$log.intensity = log1p(pl.sub$intensity) print(spatial_plot(pl.sub, sl, plot.var = "occurrence", return.plot = TRUE, print.plot = FALSE)) print(spatial_plot(pl.sub, sl, plot.var = "log.intensity", return.plot = TRUE, print.plot = FALSE))
unlink("*_peaklist.txt") unlink("*_speclist.txt")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.