ntp: nearest template prediction

ntp R Documentation

nearest template prediction

Description

Nearest Template Prediction (NTP) based on predefined class templates.

Usage

ntp(
  emat,
  templates,
  nPerm = 1000,
  distance = "cosine",
  nCores = 1,
  seed = NULL,
  verbose = getOption("verbose"),
  doPlot = FALSE
)

Arguments

emat

a numeric matrix with features in rows and samples in columns. rownames(emat) are matched against templates$probe.

templates

a data frame with at least two columns: class (coerced to factor) and probe (coerced to character).

nPerm

an integer, number of permutations for p-value estimation.

distance

a character string, one of "cosine", "pearson", "spearman" or "kendall".

nCores

an integer specifying number of threads for parallelization.

seed

an integer, for p-value reproducibility. Setting seed enforces serial processing.

verbose

logical, whether console messages are to be displayed.

doPlot

logical, whether to produce prediction subHeatmap.

Details

ntp implements the Nearest Template Prediction (NTP) algorithm, largely as proposed by Yujin Hoshida (2010) (see References). For each sample, the distance to each class template is calculated and the sample is assigned to the class with the smallest distance. Distances are derived from the sample-template correlations as follows:

d.class = \sqrt{1/2 \cdot (1 - cor(sample, templates))}
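The transformation can be checked on a few simple vectors. The following is a minimal sketch in base R; the helper names cosine_cor and cor_to_dist are hypothetical and are not part of the package, and the pmax guard against tiny negative values from floating-point rounding is an addition for illustration.

```r
# Cosine similarity between two numeric vectors (an uncentered correlation).
cosine_cor <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Map a correlation-like similarity r in [-1, 1] to a distance in [0, 1].
# pmax() only guards against rounding pushing 1 - r slightly below zero.
cor_to_dist <- function(r) sqrt(0.5 * pmax(0, 1 - r))

x <- c(1, 2, 3, 4)
d_same       <- cor_to_dist(cosine_cor(x, x))              # same direction
d_opposite   <- cor_to_dist(cosine_cor(x, -x))             # opposite direction
d_orthogonal <- cor_to_dist(cosine_cor(c(1, 0), c(0, 1)))  # uncorrelated

print(c(same = d_same, opposite = d_opposite, orthogonal = d_orthogonal))
```

A perfectly correlated sample and template give distance 0, a perfectly anti-correlated pair gives distance 1, and an uncorrelated pair gives \sqrt{1/2}.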

Template values are 1 for class features and 0 for non-class features (-1 if there are only two classes). Prediction confidence is estimated by comparing the observed distance with a null distribution of distances obtained from permutation tests. Consequently, the lowest possible p-value estimate is 1/nPerm.
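The procedure described above can be sketched in base R on toy data. This is an illustration under stated assumptions, not the package's implementation: the data, the helper names (cosine_cor, template_dist, perm_p) and in particular the permutation scheme (shuffling the values of a sample across features and taking the minimum distance over all classes) are assumptions made for this example.

```r
set.seed(1)

# Hypothetical toy data: 30 features, 3 samples; each sample carries a
# spiked-in signal for one class (s1 -> A, s2 -> B, s3 -> C).
emat <- matrix(rnorm(90), nrow = 30,
               dimnames = list(paste0("g", 1:30), paste0("s", 1:3)))
templates <- data.frame(class = factor(rep(c("A", "B", "C"), each = 10)),
                        probe = paste0("g", 1:30),
                        stringsAsFactors = FALSE)
for (j in 1:3) {
  idx <- ((j - 1) * 10 + 1):(j * 10)
  emat[idx, j] <- emat[idx, j] + 3
}

# Template matrix: 1 for class features, 0 otherwise (-1 with two classes).
classes <- levels(templates$class)
tmat <- sapply(classes, function(k) as.numeric(templates$class == k))
if (length(classes) == 2) tmat[tmat == 0] <- -1
rownames(tmat) <- templates$probe
tmat <- tmat[rownames(emat), , drop = FALSE]  # align features with emat

cosine_cor <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Distances from one sample (vector x) to every class template.
template_dist <- function(x, tmat) {
  r <- apply(tmat, 2, cosine_cor, y = x)
  sqrt(0.5 * pmax(0, 1 - r))
}

# Class prediction: nearest template per sample.
d <- apply(emat, 2, template_dist, tmat = tmat)  # classes x samples
pred <- rownames(d)[apply(d, 2, which.min)]

# Permutation p-value (assumed scheme): compare the observed minimum
# distance with minimum distances obtained after shuffling the sample.
nPerm <- 200
perm_p <- function(x, tmat, nPerm) {
  obs <- min(template_dist(x, tmat))
  null <- replicate(nPerm, min(template_dist(sample(x), tmat)))
  max(1, sum(null <= obs)) / nPerm  # cannot fall below 1/nPerm
}
p <- apply(emat, 2, perm_p, tmat = tmat, nPerm = nPerm)

print(data.frame(prediction = pred, p.value = p, row.names = colnames(emat)))
```

Because the p-value is a count of permutations divided by nPerm, increasing nPerm is the only way to resolve smaller p-values.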

  • emat should be a row-wise centered and scaled matrix. For large, balanced datasets, this may be achieved by applying the ematAdjust function.

  • templates is a data.frame defining class templates. A class template is a set of marker genes with higher expected expression in samples belonging to the class compared to non-class samples. templates must contain at least two columns named probe and class.

  • compared to Hoshida (2010), the resulting p-value estimates are more conservative (by a factor equal to the number of classes) and the distances are a monotonic transformation of 1-cor (see the formula above).

  • Hoshida (2010) does not explicitly state whether input should be log2-transformed or not, and the examples include both. Based on experience, this choice affects results only at the margins, but for high-quality datasets, normalized, untransformed inputs may yield a small increase in accuracy.
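Row-wise centering and scaling, as required for emat, can be sketched with base R alone. This is a generic stand-in, not the ematAdjust function, which may apply additional steps; the toy matrix below is hypothetical.

```r
set.seed(42)

# Hypothetical expression matrix: 8 features (rows) x 5 samples (columns).
emat <- matrix(rexp(40, rate = 0.1), nrow = 8,
               dimnames = list(paste0("g", 1:8), paste0("s", 1:5)))

# scale() works column-wise, so transpose, scale, and transpose back to
# give every feature (row) mean 0 and standard deviation 1.
emat_scaled <- t(scale(t(emat), center = TRUE, scale = TRUE))

print(round(rowMeans(emat_scaled), 10))       # all (numerically) zero
print(round(apply(emat_scaled, 1, sd), 10))   # all one
```

Scaling is done per feature across samples, so the result depends on which samples are included in the matrix.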

For further details on the NTP algorithm, please refer to the package vignette and Hoshida (2010).

Parallel processing is implemented through mclapply (package parallel) on *nix systems and parLapply (package snow) on Windows systems.
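A platform-dependent dispatch of this kind can be sketched as follows. The wrapper run_parallel is hypothetical and is not the package's code; it uses parLapply from the parallel package rather than from snow, which is a substitution made so that the example depends only on a package shipped with R.

```r
library(parallel)

# Hypothetical wrapper: forked workers on *nix, a socket cluster on Windows.
run_parallel <- function(X, FUN, nCores = 1) {
  if (nCores <= 1) return(lapply(X, FUN))       # serial fallback
  if (.Platform$OS.type == "windows") {
    cl <- makeCluster(nCores)                   # socket cluster
    on.exit(stopCluster(cl), add = TRUE)        # always release the workers
    parLapply(cl, X, FUN)
  } else {
    mclapply(X, FUN, mc.cores = nCores)         # fork-based, not on Windows
  }
}

res <- run_parallel(1:4, function(i) i^2, nCores = 2)
print(unlist(res))
```

Random number streams differ between workers, which is one reason a fixed seed and parallel execution are hard to combine reproducibly.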

Value

a data frame with class predictions, template distances, p-values and false discovery rate adjusted p-values (p.adjust(method = "fdr")). Row names equal colnames(emat).
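The false discovery rate adjustment can be illustrated on a hypothetical vector of p-values. The sample names and column names below are made up for the example and do not necessarily match the columns returned by ntp.

```r
# Hypothetical per-sample p-values.
p <- c(s1 = 0.001, s2 = 0.01, s3 = 0.03, s4 = 0.04, s5 = 0.5)

# method = "fdr" is an alias for the Benjamini-Hochberg procedure ("BH").
fdr <- p.adjust(p, method = "fdr")

print(data.frame(p.value = p, FDR = fdr))
```

Adjusted values are never smaller than the raw p-values, and they depend on the number of samples tested together.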

Note

  • features with missing values are discarded.

  • setting seed disables parallel processing to ensure p-value reproducibility.

  • for two random uncorrelated vectors x, y \sim N(0,1), E[d.xy] \approx 0.71 when distance is cosine.

  • internally, correlations instead of distances are calculated.

  • features may be reused across templates (a marker need not be specific to a single class).
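The expected distance between uncorrelated vectors stated in the notes above can be checked by simulation. This is a minimal sketch; the number of features and the number of replicates are arbitrary choices for the example.

```r
set.seed(123)

n_features <- 1000   # arbitrary vector length
n_rep <- 500         # arbitrary number of simulated pairs

d <- replicate(n_rep, {
  x <- rnorm(n_features)
  y <- rnorm(n_features)
  r <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))  # cosine similarity
  sqrt(0.5 * (1 - r))                          # distance as defined in Details
})

print(mean(d))       # close to sqrt(1/2)
print(sqrt(0.5))
```

Distances well below this value therefore indicate that a sample is closer to a template than expected for unrelated data.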

References

Hoshida, Y. (2010). Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS ONE 5, e15543.

Eide, P.W., Bruun, J., Lothe, R.A., Sveen, A. (2017). CMScaller: an R package for consensus molecular subtyping of colorectal cancer pre-clinical models. doi: 10.1038/s41598-017-16747-x.

See Also

corCosine, cor


MolecularPathologyLab/MmCMS documentation built on Oct. 18, 2023, 10:42 p.m.