In rformassspectrometry/SpectriPy: Enhancing Cross-Language Mass Spectrometry Data Analysis with R and Python

BiocStyle::markdown()

Compiled: r date()

Introduction

Powerful software libraries for mass spectrometry (MS) data are available in both Python and R, covering specific and sometimes distinct aspects in the analysis of proteomics and metabolomics data. R's reticulate package converts basic data types between Python and R and enables a seamless interoperability of both programming languages. The SpectriPy package extends reticulate allowing to translate MS data data structures between Python and R. In addition, functionality from Python's matchms library is directly wrapped into R functions allowing a seamless integration into R based workflows. SpectriPy thus enables powerful proteomics or metabolomics analysis workflows combining the strengths of Python and R MS libraries.

Installation

If the BiocManager package is not already available, please install it with install.packages("BiocManager"). To install the development version of SpectriPy from GitHub, please install in addition the remotes package with `install.packages("remotes"). As a system dependency, the package requires a Python environment to be available.

LLLL Update installation instructions. and can be installed with the BiocManager R package using the command BiocManager::install("RforMassSpectrometry/SpectriPy"). This will install the latest version of the package from GitHub. All required python libraries are installed automatically on demand. Note that first installation or first invocation of specific functions might take long because of this installation process.

Translating data structures between Python and R

The reticulate package enables a seamless integration of Python with R by translating the core data structures between the two programming languages and sharing a Python (respectively R) runtime environment that can be accessed from the other programming languages including shared variables. The SpectriPy package builds on that providing functionality to translate data structures for mass spectrometry (MS) data between the two languages. Specifically, the package translates between R's Spectra objects (from the r BiocStyle::Biocpkg("Spectra") package) and matchms.Spectrum objects from the matchms Python library.

R to Python

In the first examples we translated MS data from R to Python. We first load data from a mzML data file provided through the r BiocStyle::Biocpkg("msdata") package.

#' Loading the data from a mzML file as a `Spectra` object.
library(Spectra)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
s_r <- Spectra(fl)

We next restrict the data to MS2 spectra and remove spectra with less than 3 peaks (fragments).

#' Restrict to MS level 2 spectra
s_r <- filterMsLevel(s_r, 2)
s_r <- s_r[lengths(s_r) < 3]
s_r

library(AnnotationHub)
ah <- AnnotationHub(localHub = TRUE)
query(ah, "MassBank")
mb <- ah[["AH107049"]]
mb <- Spectra(mb)

register(SerialParam())
library(MetaboAnnotation)
m <- matchSpectra(s_r, mb, CompareSpectraParam(ppm = 40))
set.seed(123)
idx <- c(whichTarget(m), 5970, 5971, 5972, 5973, 10279, 15505, 15506,
         sample(seq_along(mb), 51))

mb <- mb[sort(idx)]
tmp <- selectSpectraVariables(
    mb, c("msLevel", "rtime", "acquisitionNum", "precursorMz",
          "precursorIntensity", "precursorCharge", "inchi", "smiles",
          "name"))

library(MsBackendMgf)
sm <- c(spectraVariableMapping(MsBackendMgf()), inchi = "INCHI",
        smiles = "SMILES", name = "NAME")
export(tmp, MsBackendMgf(), file = "/home/jo/tmp/test.mgf", mapping = sm)


library(AnnotationHub)
tmp <- system.file()

Python to R

We next import MS data from a file in MGF format using the matchms Python library.

Spectra similarity calculations using `matchms`

The SpectriPy package provides the compareSpectriPy() function that allows to perform spectra similarity calculations using the scoring functions from the matchms Python package. Below all currently supported scoring functions are listed along with the parameter class that allows selecting and configuring the algorithm in the compareSpectriPy() function. Additional functions will be added in future.

CosineGreedy: CosineGreedyParam.
CosineHungarian: CosineHungarianParam.
ModifiedCosineParam: ModifiedCosineParam.

We next create some simple example spectra and subsequently use the compareSpectriPy() function to calculate pairwise similarities between these.

library(Spectra)
library(SpectriPy)

## Create a Spectra object with two MS2 spectra for Caffeine.
caf <- DataFrame(
    msLevel = c(2L, 2L),
    name = "Caffeine",
    precursorMz = c(195.0877, 195.0877)
)
caf$intensity <- list(
    c(340.0, 416, 2580, 412),
    c(388.0, 3270, 85, 54, 10111))
caf$mz <- list(
    c(135.0432, 138.0632, 163.0375, 195.0880),
    c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864))
caf <- Spectra(caf)

## Create a Spectra object with two MS2 spectra for 1-Methylhistidine
mhd <- DataFrame(
    msLevel = c(2L, 2L),
    precursorMz = c(170.0924, 170.0924),
    id = c("HMDB0000001", "HMDB0000001"),
    name = c("1-Methylhistidine", "1-Methylhistidine"))
mhd$mz <- list(
    c(109.2, 124.2, 124.5, 170.16, 170.52),
    c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16))
mhd$intensity <- list(
    c(3.407, 47.494, 3.094, 100.0, 13.240),
    c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643))
mhd <- Spectra(mhd)

We first calculate pairwise similarities between all spectra defined above and those of caffeine using Spectra's built-in compareSpectra() function.

all <- c(caf, mhd)
res_r <- compareSpectra(all, caf)
res_r

Thus, compareSpectra() returned the pairwise similarity scores (by default calculated using the normalized dot-product function) between all spectra in all (rows) and all spectra in caf (columns). compareSpectriPy() works similar, with the difference that we need to specify and configure the similarity function (from matchms) using a dedicated parameter object. Below we calculate the similarity using the CosineGreedy function changing the tolerance to a value of 0.05 (instead of the default 0.1).

Executing the following step will automatically download and install conda mini forge to manage Python package dependencies.

res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05))
res

As a result compareSpectriPy() returns also a numeric matrix of similarities. Note also that the first compareSpectriPy() call takes usually a little longer because the Python setup has to be initialized.

Next we use the ModifiedCosine algorithm that considers also differences between the spectra's precursor m/z in the calculation.

res <- compareSpectriPy(all, caf, param = ModifiedCosineParam())
res

A third example is the CosineHungarian algorithm, also a similarity function from matchms. This algorithm calculates the cosine similarity score as with CosineGreedyParam, but using the Hungarian algorithm to find the best matching peaks between the compared spectra. The algorithm can be configured with parameters tolerance, mzPower and intensityPower (see parameter description for more details).

res <- compareSpectriPy(all, caf, param = CosineHungarianParam())
res

A fourth example is the NeutralLossesCosine algorithm, also a similarity function from matchms. The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance once a mass-shift is applied. The mass shift is the difference in precursor-m/z between the two spectra.

res <- compareSpectriPy(all, caf, param = NeutralLossesCosineParam())
res

Note that for the abovementioned similarity calculations all spectra precursor m/z values need to be available, otherwise an error will be thrown. Thus, we should always ensure to remove spectra without precursor m/z values prior to similarity scoring with this similarity method. Below we remove the precursor m/z from one of our input spectra and then show how the Spectra object could be subsetted to valid spectra for this method.

## Remove precursor m/z from the 3rd spectrum
all$precursorMz[3] <- NA

## Filter the input spectra removing those with missing precursor.
all <- all[!is.na(precursorMz(all))]

compareSpectriPy(all, caf, param = ModifiedCosineParam())

Advanced use and internals

For advanced users or developers, the rspec_to_pyspec() and pyspec_to_rspec() functions are available that enable conversion between MS data representations in R and Python (i.e. between the Spectra::Spectra object and the Python matchms.Spectrum object). To use these functions the reticulate R package needs to be installed along with a Python environment and the matchms Python package.

To illustrate their use we initialize below the Python environment that is bundled using the r BiocStyle::Biocpkg("basilisk") package within SpectriPy.

library(SpectriPy)
library(basilisk)
cl <- basiliskStart(SpectriPy:::matchms_env)

We next create a simple Spectra object representing fragment spectra for some small compounds.

library(Spectra)

# create example spectra
spd <- DataFrame(
  msLevel = c(2L, 2L, 2L),
  id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
  name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))

## Assign m/z and intensity values.
spd$mz <- list(
  c(109.2, 124.2, 124.5, 170.16, 170.52),
  c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
  c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
    111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
  c(3.407, 47.494, 3.094, 100.0, 13.240),
  c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
  c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))

sps <- Spectra(spd)

We next convert the Spectra to a matchms.Spectrum object.

pysps <- rspec_to_pyspec(sps)
pysps

As a result we got now a Python list with 3 Spectrum objects containing the peak data of sps as well as a reduced set of the available spectra variables in sps. Which spectra variables get copied to Python can be defined with the mapping parameter of rspec_to_pyspec. The default is to convert all variables defined by spectraVariableMapping(), but additional variables along with their respective names in the Python Spectrum object can be defined too. Below we list the pre-selected spectra variables that are converted by default.

spectraVariableMapping()

With our Spectra data converted to Python, we could directly call all routines from the matchms package using the reticulate R package. Below we normalize the intensities of the 3 spectra using the normalize_intensities() function from the matchms.filtering library. We thus need to first import the functionality from that package and can then call the function directly on the (Python) objects.

library(reticulate)
filters <- import("matchms.filtering")

res <- vector("list", length(pysps))
for (i in (seq_along(pysps) - 1))
    res[[i + 1]] <- filters$normalize_intensities(pysps[i])
res <- r_to_py(res)
res

We can now convert the list of Python matchms.Spectrum objects back to R with pyspec_to_rspec():

sps_r <- pyspec_to_rspec(res)
#' The normalized intensities
intensity(sps_r)
#' The original intensities
intensity(sps)

Intensity values have now been normalized to values between 0 and 1.

Note however that, it would be much better (and likely more efficient) to directly use a Python function on the list of Spectrum objects instead of the mixed R and Python code used in the example above. Below we define a simple python script that iterates over the spectra in python and performs the normalization.

py_script <- paste0("from matchms.filtering import normalize_intensities\n",
                    "for i in range(len(pysps)):\n",
                    "    pysps[i] = normalize_intensities(pysps[i])\n")
py$pysps <- pysps
py_run_string(py_script)

tmp <- pyspec_to_rspec(pysps)
intensity(tmp)