# distance: Spectra Distance/Similarity Measurements In MsCoreUtils: Core Utils for Mass Spectrometry Data

## Description

These functions provide different normalized similarity/distance measurements.

## Usage

    ndotproduct(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)

    dotproduct(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)

    neuclidean(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)

    navdist(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)

    nspectraangle(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)

## Arguments

| Argument | Description |
|---|---|
| x | matrix, two columns, e.g. m/z, intensity |
| y | matrix, two columns, e.g. m/z, intensity |
| m | numeric, weighting for the first column of x and y (e.g. "mz"), default: 0 means don't weight by the first column. For more details see the ndotproduct details section. |
| n | numeric, weighting for the second column of x and y (e.g. "intensity"), default: 0.5 means effectively using sqrt(x[,2]) and sqrt(y[,2]). For more details see the ndotproduct details section. |
| na.rm | logical(1), should NA be removed prior to calculation (default TRUE). |
| ... | ignored. |
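To make the roles of m and n concrete, the following base R sketch applies the weighting described for these two arguments (first column raised to the power m, second column raised to the power n). The helper name `weight_peaks` is hypothetical and not part of MsCoreUtils; it only illustrates that the defaults m = 0 and n = 0.5 reduce to the square root of the intensities.

```r
# Hypothetical helper (not part of MsCoreUtils): weight each peak as
# mz^m * intensity^n, following the description of the m and n arguments.
weight_peaks <- function(peaks, m = 0L, n = 0.5) {
    peaks[, 1L]^m * peaks[, 2L]^n
}

x <- matrix(c(1:5, 1:5), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

w_default <- weight_peaks(x)              # m = 0: the m/z column is ignored
w_mz <- weight_peaks(x, m = 2, n = 0.5)   # m = 2: also weight by m/z squared

# With the defaults the weights equal sqrt(intensity).
all.equal(w_default, sqrt(x[, "intensity"]))
```

Because any number raised to the power 0 is 1 in R, m = 0 drops the m/z column from the weight entirely.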

## Details

All functions that calculate normalized similarity/distance measurements are prefixed with an n.

ndotproduct: the normalized dot product is described in Stein and Scott 1994 as: $NDP = \frac{\left(\sum W_1 W_2\right)^2}{\sum W_1^2 \sum W_2^2}$; where $W_i = x^m \cdot y^n$, where x and y are the m/z and intensity values, respectively. Stein and Scott 1994 empirically determined the optimal exponents as m = 3 and n = 0.6 by analyzing ca. 12000 EI-MS data of 8000 organic compounds in the NIST Mass Spectral Library. MassBank (Horai et al. 2010) uses m = 2 and n = 0.5 for small compounds. In general, with increasing values for m, high m/z values are taken more into account in the similarity calculation. Especially when working with small molecules, a value m > 0 can be set to put weight on the m/z values, reflecting that shared fragments with higher m/z are less likely to occur by chance and therefore suggest that the molecules are more similar. Increasing n gives the intensity values a higher importance. Most commonly m = 0 and n = 0.5 are used.
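The normalized dot product above can be sketched in a few lines of base R. This is an illustrative re-implementation under stated assumptions, not the package's code: the function name `ndp_sketch` is hypothetical, and the two matrices are assumed to be already aligned so that row i of x and row i of y refer to the same m/z.

```r
# Hypothetical illustration of the normalized dot product formula;
# use MsCoreUtils::ndotproduct() in real code.
ndp_sketch <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- x[, 1L]^m * x[, 2L]^n   # W_1: weighted peaks of x
    wy <- y[, 1L]^m * y[, 2L]^n   # W_2: weighted peaks of y
    sum(wx * wy, na.rm = na.rm)^2 /
        (sum(wx^2, na.rm = na.rm) * sum(wy^2, na.rm = na.rm))
}

x <- matrix(c(1:5, 1:5), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
y <- matrix(c(1:5, 5:1), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

ndp_same <- ndp_sketch(x, x)                  # identical spectra
ndp_default <- ndp_sketch(x, y)               # m = 0, n = 0.5
ndp_stein <- ndp_sketch(x, y, m = 3, n = 0.6) # Stein and Scott exponents
```

Comparing a spectrum with itself makes the numerator and denominator equal, so the sketch returns 1 in that case; any other pair of non-negative spectra yields a value between 0 and 1.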

neuclidean: the normalized Euclidean distance is described in Stein and Scott 1994 as: $NED = \left(1 + \frac{\sum (W_1 - W_2)^2}{\sum W_2^2}\right)^{-1}$; where $W_i = x^m \cdot y^n$, where x and y are the m/z and intensity values, respectively. See the details section about ndotproduct for an explanation of how to set m and n.
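As a rough illustration of the normalized Euclidean distance formula, the base R sketch below evaluates it directly. The name `ned_sketch` is hypothetical, the code is not taken from the package, and the rows of x and y are assumed to be aligned by m/z.

```r
# Hypothetical illustration of the normalized Euclidean distance formula;
# use MsCoreUtils::neuclidean() in real code.
ned_sketch <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- x[, 1L]^m * x[, 2L]^n   # W_1
    wy <- y[, 1L]^m * y[, 2L]^n   # W_2
    1 / (1 + sum((wx - wy)^2, na.rm = na.rm) / sum(wy^2, na.rm = na.rm))
}

x <- matrix(c(1:5, 1:5), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
y <- matrix(c(1:5, 5:1), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

ned_same <- ned_sketch(x, x)   # no difference between the spectra
ned_xy <- ned_sketch(x, y)
```

The denominator only involves the second spectrum, so in general swapping x and y can change the result.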

navdist: the normalized absolute values distance is described in Stein and Scott 1994 as: $NAVD = \left(1 + \frac{\sum |W_1 - W_2|}{\sum W_2}\right)^{-1}$; where $W_i = x^m \cdot y^n$, where x and y are the m/z and intensity values, respectively. See the details section about ndotproduct for an explanation of how to set m and n.
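The absolute values distance differs from the Euclidean variant only in using absolute instead of squared differences. A minimal base R sketch, with the hypothetical name `navd_sketch` and assuming rows of x and y are aligned by m/z, could look like this; it is not the package's implementation.

```r
# Hypothetical illustration of the normalized absolute values distance;
# use MsCoreUtils::navdist() in real code.
navd_sketch <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- x[, 1L]^m * x[, 2L]^n   # W_1
    wy <- y[, 1L]^m * y[, 2L]^n   # W_2
    1 / (1 + sum(abs(wx - wy), na.rm = na.rm) / sum(wy, na.rm = na.rm))
}

x <- matrix(c(1:5, 1:5), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
y <- matrix(c(1:5, 5:1), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

navd_same <- navd_sketch(x, x)   # no difference between the spectra
navd_xy <- navd_sketch(x, y)
```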

nspectraangle: the normalized spectra angle is described in Toprak et al. 2014 as: $NSA = 1 - \frac{2 \cos^{-1}(W_1 \cdot W_2)}{\pi}$; where $W_i = x^m \cdot y^n$, where x and y are the m/z and intensity values, respectively. The weighting was not originally proposed by Toprak et al. 2014. See the details section about ndotproduct for an explanation of how to set m and n.
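The spectra angle formula can be sketched in base R as follows. This is only one reading of the formula as written: the sketch assumes that both weighted vectors are first scaled to unit Euclidean length, so that their dot product is a cosine, and that rows of x and y are aligned by m/z. The name `nsa_sketch` is hypothetical, and the result is not guaranteed to match the numeric output of the package function.

```r
# Hypothetical illustration of the normalized spectra angle formula;
# use MsCoreUtils::nspectraangle() in real code.
nsa_sketch <- function(x, y, m = 0L, n = 0.5) {
    wx <- x[, 1L]^m * x[, 2L]^n
    wy <- y[, 1L]^m * y[, 2L]^n
    # Assumption: scale both weighted vectors to unit Euclidean length.
    wx <- wx / sqrt(sum(wx^2))
    wy <- wy / sqrt(sum(wy^2))
    # Clamp to 1 so rounding error cannot push acos() outside its domain.
    cosine <- min(1, sum(wx * wy))
    1 - 2 * acos(cosine) / pi
}

x <- matrix(c(1:5, 1:5), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
y <- matrix(c(1:5, 5:1), ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

nsa_same <- nsa_sketch(x, x)   # angle of 0 between identical spectra
nsa_xy <- nsa_sketch(x, y)
```

For non-negative intensities the angle lies between 0 and π/2, so the sketch maps it linearly onto the range 1 to 0.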

## Value

double(1) value between 0 and 1, where 0 means completely different and 1 means identical.

## Note

These methods are implemented as described in Stein and Scott 1994 (navdist, ndotproduct, neuclidean) and Toprak et al. 2014 (nspectraangle), but because there is no reference implementation available we are unable to guarantee that the results are identical. Please see also the corresponding discussion at the GitHub pull request linked below. If you find any problems or know of a reference implementation, please open an issue at https://github.com/rformassspectrometry/MsCoreUtils/issues.

## Author(s)

navdist, neuclidean, nspectraangle: Sebastian Gibb

ndotproduct: Sebastian Gibb and Thomas Naake, thomasnaake@googlemail.com

## References

Stein, S. E., and Scott, D. R. (1994). Optimization and testing of mass spectral library search algorithms for compound identification. Journal of the American Society for Mass Spectrometry, 5(9), 859–866. doi: 10.1016/1044-0305(94)87009-8.

Horai et al. (2010). MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry, 45(7), 703–714. doi: 10.1002/jms.1777.

Toprak et al. (2014). Conserved peptide fragmentation as a benchmarking tool for mass spectrometers and a discriminating feature for targeted proteomics. Molecular & Cellular Proteomics, 13(8), 2056–2071. doi: 10.1074/mcp.O113.036475.

Pull Request for these distance/similarity measurements: https://github.com/rformassspectrometry/MsCoreUtils/pull/33

## Examples

    x <- matrix(c(1:5, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
    y <- matrix(c(1:5, 5:1), ncol = 2, dimnames = list(c(), c("mz", "intensity")))

    ndotproduct(x, y)
    ndotproduct(x, y, m = 2, n = 0.5)
    ndotproduct(x, y, m = 3, n = 0.6)

    neuclidean(x, y)

    navdist(x, y)

    nspectraangle(x, y)

MsCoreUtils documentation built on Nov. 8, 2020, 10:59 p.m.