# CluMSID --- Clustering of MS^2^ Spectra for Metabolite Identification In CluMSID: Clustering of MS2 Spectra for Metabolite Identification

## Create a dendrogram

With the dendrogram, too, it is advisable to export is to pdf in a large format, e.g. as follows:

pdf(file = "CluMSID_dendro.pdf", width = 20, height = 20)
HCplot(distmat)
dev.off()


The plot from our example data looks like this:

knitr::include_graphics(system.file("extdata",
"CluMSID_dendro2.png",
package = "CluMSIDdata"))


The clusters are colour-coded and if exported to pdf, the tip labels containing feature ID and annotation are searchable.The height of the dendrogram's branching points serves as another piece of information when interpreting the clustered data as it signifies similarity of features.

For a detailed example of how to interpret, please refer to Depke et al. 2017, where CluMSID helped to identify new members of several classes of secondary metabolites in Pseudomonas aeruginosa.

Like with density-based clustering, it is also possible to generate a list of features with respective cluster assignments using HCtbl. As mentioned above for CluMSID_OPTISplot and OPTICStbl, it is crucial to run HCplot and HCtbl using the same parameters.

HCtbl <- HCtbl(distmat)

head(HCtbl)


# Generate a correlation network

As a new functionality, CluMSID offers the possibility to analyse the similarity data using weighted correlation networks. These networks offer some advantages with respect to standard clustering methods, most notably that they do not strictly assign every feature to a distinct cluster but also represent similarities between features that would fall into different clusters in hierarchical or density-based clustering. Thus, correlation networks potentially contain more useful information for data interpretation. On the downside, the interpretation is also complicated by this lack of concrete cluster assignments. E.g., we cannot simply look up which features belong to the same cluster in order to examine their spectra closely but we have to go back to the correlation network visualisation and search for connected features manually.

networkplot requires some arguments:

• distmat: matrix; a distance matrix like for all other functions described above
• interactive: logical; Similar to MDSplotplot, correlation network can be generate as interactive plots that are zoomable and display feature IDs on mouse-over. If that is desired, set interactive to TRUE (default is FALSE).
• show_labels: logical; whether to display feature IDs in the (non-interactive) plot (default is FALSE, ignored if interacive = TRUE)
• label_size: numeric; font size of feature ID labels (default is 1.5, which is way smaller than the default in GGally::ggnet2, 4.5)
• highlight_annotated: logical; whether to plot dots for features with annotation in a different colour (same as in MDSplotplot, default is FALSE)
• min_similarity: numeric; the minimum similarity (1 -- distance) threshold (similarities below this threshold will be ignored, default is 0.1)
• exclude_singletons: logical; whether to exclude features from the plot that do not have connections to other features, particularly useful with data sets containing very dissimilar spectra, e.g. neutral loss patterns or MS^1^ pseudospectra (default is FALSE)

A standard non-interactive correlation network for the MS^2^ example data can be plotted like this:

networkplot(distmat, highlight_annotated = TRUE,
show_labels = TRUE, interactive = FALSE)


As you can guess from this plot, it makes sense to use the interactive visualisation. Just like with MDSplotplot, you can view the interactive plot within RStudio or save it as html and view it in web browser.

my_net <- networkplot(distmat, interactive = TRUE,
highlight_annotated = TRUE)

htmlwidgets::saveWidget(my_net, "net.html")


This is how it looks like if you open the html file in Firefox, zoom in on a cluster and mouse over a feature:

knitr::include_graphics(system.file("extdata",
"interactive_net.png",
package = "CluMSIDdata"))


Please be aware that the spatial arrangement of the data points in the plot has a random component, i.e. while the relative position of the points (the distance to each other) is always the same, the absolute position varies and will not be the same even if the same command is executed twice.

The pairwise similarity of spectra or neutral loss patterns of features expressed by the cosine score is signified by the width of the line connecting the two features. All pairwise similarities greater than min_similarity result in a connecting line in the plot. The spatial proximity in which the features are mapped onto the plot is determined by the multivariate method underlying the network generation.

As we have already noticed after inspection of the heatmaps on p.13--14, the neutral loss patterns show much less similarity to each other than the MS^2^ spectra data. Thus, we expect quite a few neutral loss patterns that do not show any similarity to another neutral loss pattern. This expectation justifies the exclusion of these 'singletons' from the correlation network analysis. To do so, just set exclude_singletons to TRUE:

networkplot(nlmat, highlight_annotated = TRUE,
show_labels = TRUE, exclude_singletons = TRUE)


# Additional functionalities

Multidimensional scaling, density-based clustering, hierarchical clustering and correlation network analysis are the main CluMSID tools to analyse MS^2^ spectra or neutral loss pattern similarity data, however, the package contains some additional functionalities that may facilitate data analysis in some cases and can also be used in other contexts with or without the above-mentioned unsupervised methods.

## Access individual spectra from a list of spectra by various slot entries

Accessing S4 objects within lists is not trivial. Therefore, CluMSID offers a function to access individual or several MS2spectrum objects by their slot entries. getSpectrum() requires the following arguments:

• featlist: a list that contains only objects of class MS2spectrum
• slot: the slot to be searched (invalid slot arguments will produce errors):
• id
• annotation
• precursor (m/z of precursor ion)
• rt (retention time of precursor)
• what: the search term or number, must be character for id and annotation and numeric for precursor and rt
• mz.tol: the tolerance used for precursor ion m/z searches, defaults to 1E-05 (10ppm)
• rt.tol: the tolerance used for precursor ion retention time searches, defaults to 30s; high values can be used to specify retention time ranges (see example)

Some examples will demonstrate the use of getSpectrum():

1. Accessing a spectrum by its ID. For this, the exact feature ID must be known:

getSpectrum(annotatedSpeclist, "id", "M244.17T796.4")


2. Accessing a spectrum by its annotation. For this, the exact annotation has to be known as well, other annotations will produce a message:

getSpectrum(annotatedSpeclist, "annotation", "HHQ")

getSpectrum(annotatedSpeclist, "annotation", "C7-HQ")


3. Accessing spectra by their precursor ion m/z. If the list contains more than one spectrum with a precursor ion m/z within the tolerance, the output is again a list of MS2spectrum objects that meet the specified criterion:

getSpectrum(annotatedSpeclist, "precursor", 286.18, mz.tol = 1E-03)


4. Accessing spectra by their precursor retention time. Here, too, we can extract several MS2spectrum objects by setting a larger retention time tolerance. If we want to extract the spectra of all compounds that elute from 6min (360s) to 8min (480s), we proceed as follows:

six_eight <- getSpectrum(annotatedSpeclist, "rt", 420, rt.tol = 60)
length(six_eight)


## Find spectra that contain a specific fragment or neutral loss

Another pair of accessory functions is findFragment() and findNL() which are used to find spectra that contain a specific fragment ion or neutral loss. Analogous to getSpectrum(), they need as arguments a list of MS2spectrum objects, the m/z of the fragment or neutral loss of interest and the respective m/z tolerance in ppm (default is 10ppm).

The two functions can be useful in many situation, e.g. when working with lipid data where head groups and fatty acids often give characteristic fragments or neutral losses. In the world of P. aeruginosa secondary metabolites, alkylquinolones (AQs) play an important role and most of the AQ MS^2^ spectra contain a signature fragment with an m/z of 159.068. Based on this fragment m/z, we can create a list of putative AQs:

putativeAQs <- findFragment(annotatedSpeclist, 159.068)


An example for common neutral losses are nucleoside monophospates that all loose ribose-5'-monophosphate, resulting in a neutral loss of 212.009 in ESI-(+). Using findNL() we find CMP, UMP, AMP and GMP.

findNL(annotatedSpeclist, 212.009)


## Match one spectrum against a set of spectra

If you are mainly interested in one or a few number of spectra or neutral loss patterns, it may be sufficient to match one feature at a time against a larger set of spectra. This set of spectra can be all spectra contained in one mzXML file like in all the examples in this tutorial or they could be a spectral library, as long as its format in R is a list of MS2spectrum objects.

The getSimilarities() function requires several arguments:

• spec: The spectrum to be compared to other spectra. Can be either an object of class MS2spectrum or a two-column numerical matrix that contains fragment mass-to-charge ratios in the first and intensities in the second column.
• speclist: The set of spectra to which spec is to be compared. Must be a list where every entry is an object of class MS2spectrum. Can be generated from an mzXML file as shown above or constructed using new("MS2spectrum", ...) for every list entry (see example).
• type: Specifies whether MS^2^ spectra or neutral loss patterns are to be compared. Must be either 'spectrum' (default) or 'neutral_losses'.
• hits_only: Logical that indicates whether the result should contain only similarities greater than zero (see example).

In the first example, we want to find all MS^2^ spectra in our example data set that are similar to the spectrum of pyocyanin, an important secondary metabolite from Pseudomonas aeruginosa and therefore match the pyocyanin spectrum against our annotatedSpeclist. Because we have already identified pyocyanin in the data set, we can use getSpectrum to extract the MS2spectrum object from annotatedSpeclist. We do not want to search all r length(annotatedSpeclist) elements of the result vector, so we set hits_only to TRUE to exclude spectra that have 0 similarity to the pyocyanin spectrum.

pyo <- getSpectrum(annotatedSpeclist, "annotation", "pyocyanin")

sim_pyo <- getSimilarities(pyo, annotatedSpeclist, hits_only = TRUE)
sim_pyo


We get r length(getSimilarities(pyo, annotatedSpeclist, hits_only = TRUE)) spectra that have a non-zero similarity to the pyocyanin spectrum, including pyocyanin itself with a similarity of 1. Of course, we can further filter the data by subsetting the result vector in order to exclude spectra that have only minimal similarity, e.g. M679.43T1051.39 with a cosine similarity of only 0.0008 (the last element in the vector).

In the second example, we generate a new speclist, e.g. from a spectral library. We look at the unknown feature that has most similarity to pyocyanin. As pyocyanin is contained in annotatedSpeclist itself, we have to look at the second highest similarity. Again, we use getSpectrum() to extract the object from annotatedSpeclist:

highest_sim <- sort(sim_pyo, decreasing = TRUE)[2]

sim_spec <- getSpectrum(annotatedSpeclist, "id", names(highest_sim))
sim_spec


We see that the feature is not annotated. We are interested whether this feature also shows similarity to other members of the phenazine family of P. aeruginosa secondary metabolites. Some phenazines are contained in annotatedSpeclist but some are not, so we make a new speclist called phenazines and add the missing spectra manually from an in-house library:

phenazines <- list()
phenazines[[1]] <- getSpectrum(annotatedSpeclist,
"annotation", "pyocyanin")
phenazines[[2]] <- getSpectrum(annotatedSpeclist,
"annotation", "phenazine-1-carboxamide")
phenazines[[3]] <- getSpectrum(annotatedSpeclist,
"annotation", "phenazine-1-carboxylic acid")
phenazines[[4]] <- getSpectrum(annotatedSpeclist,
"annotation", "phenazine-1,6-dicarboxylic acid")
phenazines[[5]] <- new("MS2spectrum", id = "lib_entry_1",
annotation = "1-hydroxyphenazine",
spectrum = matrix(c(168.0632, 14,
169.0711, 288,
170.0743, 33,
179.0551, 62,
197.0653, 999),
byrow = TRUE,
ncol = 2))
phenazines[[6]] <- new("MS2spectrum", id = "lib_entry_2",
annotation = "2-hydroxy-phenazine-1-carboxylic acid",
spectrum = matrix(c(167.0621, 43,
179.0619, 93,
180.0650, 12,
195.0564, 40,
223.0509, 999,
224.0541, 142,
241.0611, 60),
byrow = TRUE,
ncol = 2))
phenazines[[7]] <- new("MS2spectrum", id = "lib_entry_3",
annotation = "pyocyanin (library spectrum)",
spectrum = matrix(c(168.0690, 58,
183.0927, 152,
184.0958, 19,
196.0640, 118,
197.0674, 15,
211.0873, 999,
212.0905, 145),
byrow = TRUE,
ncol = 2))

getSimilarities(sim_spec, phenazines, hits_only = FALSE)


As a result, we get the interesting information that the MS^2^ spectra similarity of our unknown feature seems to be specific to pyocyanin (both the experimental and the library spectrum).

## Convert MSnbase objects to class MS2spectrum

The MSnbase package---which is commonly used for proteomics applications and is also associated with XCMS3---has two classes for (MS^2^) spectra, Spectrum and Spectrum2 which contain spectra along with metainformation. These metainformation differ from those contained in MS2spectrum objects and are not very well suited for metabolomics applications. Still, it is possible to use CluMSID functions with objects of those two classes by converting them to MS2spectrum objects using as.MS2spectrum():

CluMSID_object <- as.MS2spectrum(MSnbase_object)
# or alternatively
CluMSID_object <- as(MSnbase_object, "MS2spectrum")


## Split polarities from polarity-switching runs

As polarity-switching and similar methords are gaining importance in LC-MS/MS metabolomics, CluMSID offers the possibility to process LC-MS/MS data containing spectra of different polarities. As spectra from positive and negative ionisation show different fragmentation mechanisms and patterns, it does not appear to be useful to compare spectra of different polarity to each other. Therefore, CluMSID provides a function to separate positive and negative spectra from each other. This has to be done in the very beginning of the analysis to not interfere with spectral merging. Positive and negative spectra can than be processed independently from each other as shown above.

A schematic workflow would like like this:

raw_list_mixedpolarities <- extractMS2spectra("raw_file_mixedpolarities.mzXML")

raw_list_positive <- splitPolarities(raw_list_mixedpolarities, "positive")
raw_list_negative <- splitPolarities(raw_list_mixedpolarities, "negative")

speclist_positive <- mergeMS2spectra(raw_list_positive)
speclist_negative <- mergeMS2spectra(raw_list_negative)


... and so on as described in this tutorial.

# Use MS^1^ pseudospectra instead of or in addition to MS^2^ data

MS^1^ pseudospectra are groups of peaks/ions that derive or are assumed to derive from the same compound. They consist of peaks for in-source fragment, adducts etc. Pseudospectra can contain structural information about analytes, e.g. about moieties that easily fragment even in MS^1^ mode without CID. Thus, it might sometimes be useful to study similarities between pseudospectra analogously to those between MS^2^ spectra. CluMSID makes use of the CAMERA package to assign peaks to pseudospectra. A custom S4 class named pseudospectrum is used which is very similar to the MS2spectrum class. For obvious reasons, it does not contain a precursor ion m/z slot and thus no neutral loss pattern, either. The pcgroup defined by CAMERA is used as ID, an annotation can be added if desired.

## Extract pseudospectra

To extract pseudospectra, you first have to process your data using the CAMERA package, either in R or via XCMSonline, where this is done automatically. There are two possibilities to use the extractPseudospectra() function in CluMSID: either with an xsAnnotate object which you generate with CAMERA in R or with a data.frame that contains data on m/z, retention time, intensity and pcgroup, e.g. the results table from XCMSonline.

The latter is demonstrated with the XCMSonline results table already used to generate a peak table. If the column names are not changed, the data.frame can be supplied as-is and intensity_columns does not have to be specified. We want to exclude pseudospectra that have only one peak, so we set min_peaks = 2.

pstable <-
readr::read_delim(file = system.file("extdata",
"TD035_XCMS.annotated.diffreport.tsv",
package = "CluMSIDdata"),
delim = "\t")

pseudospeclist <- extractPseudospectra(pstable, min_peaks = 2)


As a result, we get a list with r length(pseudospeclist) pseudospectra that we can now process further.

## Create distance matrix for pseudospectra

The creation of a distance matrix is analogous to the procedure for MS^2^ spectra:

pseudodistmat <- distanceMatrix(pseudospeclist)

load(file = system.file("extdata",
"pseudodistmat.RData",
package = "CluMSIDdata"))


## Generate a correlation network for pseudospectra

The distance matrix can now be used for MDS, clustering and correlation networks just like described above. For demonstration, we generate a correlation network:

networkplot(pseudodistmat, show_labels = TRUE, exclude_singletons = TRUE)


With the exclusion of singletons, we get a much less busy plot than for MS^2^ data but we still find quite a few connections that may prove informative.

# Session Info

sessionInfo()


## Try the CluMSID package in your browser

Any scripts or data that you put into this service are public.

CluMSID documentation built on Nov. 8, 2020, 7:46 p.m.