dipps: Difference in ProPortions Statistics (DIPPS)
In Armadilloa16/dipps: Tools for Bruker Peaklist Format MSI Data



Calculates the DIPPS for the given subset. The argument descriptions are generic as DIPPS can be applied to any binary (“occurrence”) data in which each variable has two values (“occurrence” and “absence”). In the MSI context, an occurrence is generally taken to be a peak, an observation is generally taken to be a spectrum and a variable is generally taken to be a mass range or peakgroup, possibly grouped via some clustering method such as that offered by dbscan.

dipps(obs, var, subset)

`obs`	A vector identifying the observation from which an occurrence originated.
`var`	A vector identifying the variable of which an occurrence is a realisation.
`subset`	A vector identifying occurrences belonging to the subset of observations of interest.

obs, var, and subset must be of equal length, and can be taken from the output of combine_peaklists with relative ease (see the example below). Equal entries in obs are also assumed to have equal entries in subset. This assumption is not currently checked (TODO: add a check for it).
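Until such a check exists in the package, the assumption can be verified by hand before calling dipps. The following is a minimal sketch in base R; the helper name subset_is_consistent and the toy vectors are hypothetical and not part of dipps.

```r
# Hypothetical helper (not part of dipps): returns TRUE if every
# observation is associated with exactly one value of subset.
subset_is_consistent = function(obs, subset) {
  stopifnot(length(obs) == length(subset))
  # Number of distinct subset values seen for each observation
  n.values = tapply(subset, obs, function(s) length(unique(s)))
  all(n.values == 1)
}

# Toy data: three observations, five occurrences
obs    = c("a", "a", "b", "b", "c")
subset = c(TRUE, TRUE, FALSE, FALSE, TRUE)
subset_is_consistent(obs, subset)

# Observation "b" is flagged both FALSE and TRUE here,
# which violates the assumption
bad.subset = c(TRUE, TRUE, FALSE, TRUE, TRUE)
subset_is_consistent(obs, bad.subset)
```

The first call should return TRUE and the second FALSE.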

Note that if occurrence in each variable (separately) is treated as a binary classifier for membership in the subset, the DIPPS can be thought of as the informedness of these classifiers, i.e. DIPPS = sensitivity + specificity - 1.
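The equivalence with informedness can be checked numerically on a small example. The sketch below uses base R only and does not call dipps itself; the toy data and all variable names are assumptions chosen for illustration, and the proportions are computed directly from their definitions rather than by the package's own code.

```r
# Toy occurrence data: six observations, two variables ("v" and "w").
# Observations 1-3 form the subset of interest.
obs    = c(1, 2, 4, 1, 3, 5, 6)
var    = c("v", "v", "v", "w", "w", "w", "w")
subset = obs <= 3

obs.ids    = unique(obs)
obs.subset = subset[match(obs.ids, obs)]  # one flag per observation
vars       = unique(var)

# occ[i, j] is TRUE if variable j occurs in observation i
occ = sapply(vars, function(v) obs.ids %in% obs[var == v])

# Difference in proportions of occurrence
p.u = colMeans(occ[obs.subset, , drop = FALSE])
p.d = colMeans(occ[!obs.subset, , drop = FALSE])
d   = p.u - p.d

# Confusion counts, treating occurrence as a classifier for the subset
tp = colSums(occ & obs.subset)
fn = colSums(!occ & obs.subset)
tn = colSums(!occ & !obs.subset)
fp = colSums(occ & !obs.subset)
informedness = tp / (tp + fn) + tn / (tn + fp) - 1

all.equal(d, informedness)
```

Here sensitivity is p.u and specificity is 1 - p.d, so the two quantities agree for every variable.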

Successful completion will return a data.frame in which rows represent variables (as identified by var), ordered in decreasing order of DIPPS, and with seven columns:

var: the identifier of the variable the row represents.
p.u: proportion of observations with subset == TRUE in which the variable occurs.
p.d: proportion of observations with subset == FALSE in which the variable occurs.
d: p.u - p.d (DIPPS).
c.u: the cosine distance centroid of the subset == TRUE subset of observations.
cos: the cosine distance between c.u and the ‘template’ vector t which contains ones in each peakgroup with a DIPPS equal to or greater than the DIPPS of the peakgroup the corresponding row represents.
t: the ‘template’ vector for the heuristically chosen ‘optimal’ DIPPS cutoff, i.e. selecting a number of the highest DIPPS variables such that the cosine distance as described above is minimised, under the constraint that the DIPPS cutoff should be positive.
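The cutoff heuristic described for cos and t can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the values of d and c.u are made up, cosine distance is taken to be one minus cosine similarity, and the names cosine_distance, cos.d, k.opt and t.opt are hypothetical.

```r
# Assumed definition: cosine distance = 1 - cosine similarity
cosine_distance = function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up values for four variables, ordered by decreasing DIPPS
d   = c(0.6, 0.5, 0.1, -0.2)
c.u = c(0.7, 0.6, 0.2, 0.1)

# For each row k, the template has ones for every variable whose
# DIPPS is at least d[k]; cos.d[k] is its distance from c.u
cos.d = sapply(seq_along(d), function(k) {
  cosine_distance(c.u, as.numeric(d >= d[k]))
})

# Minimise the distance over rows with a positive DIPPS only
candidates = which(d > 0)
k.opt = candidates[which.min(cos.d[candidates])]
t.opt = as.numeric(d >= d[k.opt])
```

With these toy values the minimum falls on the second row, so the template selects the two highest-DIPPS variables.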

combine_peaklists, dbscan

Winderbaum, L. J. et al. Feature extraction for proteomics imaging mass spectrometry data. The Annals of Applied Statistics. 2015;9(4):1973-1996. doi: 10.1214/15-AOAS870.

i.path = system.file("extdata", "test1", package = "dipps")
n.empty = combine_peaklists(i.path)
o.name = basename(i.path)
df.spec = load_speclist(o.name)
df.peak = load_peaklist(o.name)

# Construct peakgroups
df.peak$group = dbscan(df.peak$m.z, eps = 0.1, mnpts = 1)

# Select a subset of spectra expected to be overexpressed. In this case
# spectra with Y-coordinate greater than or equal to 170.
df.spec$subset = df.spec$Y >= 170
df.peak = merge(df.peak, df.spec[, c("Acq", "subset")])

# Calculate DIPPS
df.dipps = dipps(df.peak$Acq, df.peak$group, df.peak$subset)