In paoloinglese/SPUTNIK: Spatially Automatic Denoising for Imaging Mass Spectrometry Toolkit

knitr::opts_chunk$set(echo = TRUE)

NOTE: this tutorial is designed for version 1.4. Please update SPUTNIK before from https://github.com/paoloinglese/SPUTNIK.

Introduction

Welcome to SPUTNIK Github Page!
SPUTNIK [@10.1093/bioinformatics/bty622] is a filter tool for Mass Spectrometry Imaging (MSI) data. It is designed to identify and remove the mass spectral peaks that are spatially unrelated with the sample of interest. Often MSI data contain signal generated by the analytical technique used for the ion desorption. For instance, DESI-MSI data can contain solvent-related signals, or, in the case of MALDI-MSI, matrix-related signal can be part of the data. These signals are non-informative about the molecular content of the analyzed sample, and can interfere with the statistical analysis of the data. Furthermore, the presence of these signals makes the dimensionality of the data uselessly big. SPUTNIK uses three classes of filters to identify and remove these types of signals.

Filters

Global reference filter

The first family of filters, called general similarity filters, uses the expected location of the sample (ROI) to determine if a signal is informative or not. Simple similarity measures between the ROI, or a continuous reference signal representing the area occupied by the sample, and the images generated by the peak signals are used to determine whether a peak is informative. The continuous reference and ROI images can be calculated from the MSI data or imported. There are four methods to calculate the continuous reference:

"sum": sum of all the peak intensities
"mean": non-zero mean of all the peak intensities
"median": non-zero median of all the peak intensities
"pca": first principal component scores are assigned to the pixels intensities

There are also four methods to calculate the ROI:

"otsu": the continuous reference (calculated using one of the previous methods) is binarized using Otsu's thresholding
"kmeans": k-means with 2 clusters is applied to the entire MSI dataset to generate a binary ROI
"kmeans2": k-means with N (N > 2) clusters is applied to the entire MSI dataset to generate a segmentation of the image. Then, the clusters are merged into sample-related and sample-unrelated.
"supervised": the user annotates two areas of the image as sample-unrelated and sample-related. A linear SVM, trained on those spectra, is used to assign the labels to all the remaining pixels.

For instance, continuous reference and ROI from Rompp et al. MALDI-MSI dataset are calculated as follows.

library(SPUTNIK)

SPUTNIK requires as input:

Intensity matrix (rows = pixels, columns = peaks)
Image size (number of rows, number of columns)
Vector of peaks m/z values.

These can be generated by several tools, such as MALDIquant (see tutorial).

NOTE The intensity matrix must contain the raw intensities (non-normalized or transformed) and must contain all finite values (NA needs to be converted to 0).

In this case, we will use an example MALDI-MSI dataset. Let's load the data

bladderMALDI <- bladderMALDIRompp2010()

The loaded dataset consists of 34,840 pixels and 1,175 peaks (features):

dim(bladderMALDI)

The spatial and mass information is stored as attributes of the matrix:

names(attributes(bladderMALDI))

Then, we create the dataset object using the msiDataset command:

msiX <- msiDataset(values = bladderMALDI,
                   mz = attr(bladderMALDI, 'mass'),
                   rsize = attr(bladderMALDI, 'size')[1],
                   csize = attr(bladderMALDI, 'size')[2])

The first thing to do is to plot the image of the number of detected peaks and total ion count (TIC). These image are useful to capture the spatial location of the sample, as we expect its spectra to contain more ions or more intense signals:

plot(msiX@totalioncount)

plot(msiX@numdetected)

Unfortunately, in this case, it seems that more ions with higher TIC are detected outside of the tissue, we will need to remember this in the following steps.

Now we can normalize and apply the variable stabilizing transformation:

msiX <- normIntensity(msiX, 'PQN')
msiX <- varTransform(msiX, 'log2')

Generating reference image and sample region-of-interest (ROI)

We can now generate the continuous and binary reference images.

NOTE: we are aligning the PCA image with the number of detected peaks, but , since the number of detected ions is greater outside of the tissue, we need to set the argument invertAlign to TRUE:

ref.cont <- refImageContinuous(msiData = msiX, method = "pca",
                               alignTo = 'detected',
                               invertAlign = TRUE)
plot(ref.cont)

The first principal component show clearly the area occupied by the sample.

Now we can generate the binary reference using k-means

NOTE: again we invert the binary ROI after aligning it to the number of detected peaks.

ref.bin <- refImageBinaryKmeans(dataset = msiX,
                                alignTo = "detected",
                                invertAligned = TRUE)
plot(ref.bin)

Often the binary reference contains several scattered pixels, as result of the presence of noise in the original data. The isolated pixels can be removed through the removeSmallObjects command. Now we remove all the connected objects smaller than 50 pixels.

ref.bin <- removeSmallObjects(ref.bin, threshold = 50)
plot(ref.bin)

The global reference filter uses the similarity between the peak intensity image and the reference image In this case, we calculate the normalized mutual information (NMI) with the binary reference.

The argument cores > 1 allows multicore processing.

gpf <- globalPeaksFilter(msiData = msiX, referenceImage = ref.bin,
                         method = 'nmi', threshold = 0, cores = 1)

Then, we can apply the filter to remove the uninformative peaks.

msiX <- applyPeaksFilter(msiX, gpf)
print(dim(getIntensityMat(msiX)))

We can check the effect of filtering by plotting the RGB image corresponding to the scores of the first 3 principal components. This gives a qualitative idea of the spatial structures present in the dataset

pca.img <- PCAImage(msiX)
plot(pca.img)

tic.img <- totalIonCountMSI(msiX)
plot(tic.img)

Pixel count filter

Pixel count filter overcomes the problems related with the standard approach of removing peaks that are found in less than a certain number of pixels (thresholding). This methodology does not take into account of the spatial structure of the peak signals and therefore can remove important, although localized in small regions, signals. Pixel count filter allows the user to define the minimum number of connected pixels for a peak signal to consider it informative. Also, through three levels of aggressiveness, it compares the size of the connected signal pixels inside and outside the sample ROI. Now we define a new msiDataset object using the original MALDI-MSI data

msiX <- msiDataset(values = bladderMALDI,
                   mz = attr(bladderMALDI, 'mass'),
                   rsize = attr(bladderMALDI, 'size')[1],
                   csize = attr(bladderMALDI, 'size')[2])

We normalize and transform as before:

msiX <- normIntensity(msiX, 'PQN')
msiX <- varTransform(msiX, 'log2')

We use the binary reference generated in the previous section.

Now we define the filter, requesting the smallest number of connected signal pixels to be larger than 9. Also, we set the aggressiveness to 1, through the parameter agressive = 1. In this way, we request that the largest connected object outside the ROI is smaller than the largest one within the ROI

cpf <- countPixelsFilter(msiData = msiX, roiImage = ref.bin, minNumPixels = 9,
                         aggressive = 1)

After, we apply the filter and check the RGB image of the first three principal components scores

msiX <- applyPeaksFilter(msiX, cpf)
print(dim(getIntensityMat(msiX)))

We can then plot the RGB image representing the first 3 principal components. The contrast between the sample and off-sample regions is now much stronger:

pca.img <- PCAImage(msiX)
plot(pca.img)

Complete spatial randomness

The third class of filters use the complete spatial randomness tests to determine whether the binarized peak signal images represent a random point pattern. There are two available tests: Clark-Evans and Kolmogorov-Smirnov. We show here how to apply the Clark-Evans test on the MALDI-MSI data.

Again,

msiX <- msiDataset(values = bladderMALDI,
                   mz = attr(bladderMALDI, 'mass'),
                   rsize = attr(bladderMALDI, 'size')[1],
                   csize = attr(bladderMALDI, 'size')[2])

Let's normalize and transform the intensities:

msiX <- normIntensity(msiX, 'PQN')
msiX <- varTransform(msiX, 'log2')

Now we run the filter

csr <- CSRPeaksFilter(msiData = msiX, method = 'ClarkEvans', verbose = TRUE,
                      cores = 1)  # running in serial mode (parallel if cores > 1)

By default, CSRPeaksFilter calculates the p-values for the selected test, and applies the Bonferroni multiple testing correction. The adjusted p-values are saved in the element q.value. We can use the 0.05 threshold to define the peak filter.

NOTE: The Clark-Evans test does not distinguish between signals localized outside or within the sample, so it is better to run it after the global reference filter.

We can check the distributions of p-values and q-values:

hist(csr$p.value)
hist(csr$q.value)

We define the new filter setting a threshold of 0.05 for the q.value:

csrFilter <- createPeaksFilter(which(csr$q.value < 0.05))

and we can check the number of kept peaks:

# Number of kept peaks (non-random)
length(csrFilter$sel.peaks)

We are ready to apply the filter on the dataset:

msiX <- applyPeaksFilter(msiX, csrFilter)
print(dim(getIntensityMat(msiX)))

Let's have a look at the RGB image of the first 3 principal components:

pca.img <- PCAImage(msiX)
plot(pca.img)

As you can see, the CSR filter does not remove off-sample signals because they are not recognized as spatially random.

Conclusion

In this short tutorial, we have seen how to apply spatial filters using SPUTNIK. In this case, the filters were applied separately, but they can also be combined to get improved results.
We suggest to start with the global filter, which removes most of the signal spatially detected outside of the sample region, followed by one or both the other filters.