LPS: Linear Predictor Score fitting
In LPS: Linear Predictor Score, for Binary Inference from Multiple Continuous Variables

Description Usage Arguments Value Normalization Time efficiency Author(s) References See Also Examples


This function trains a Linear Predictor Score model, given pre-computed coefficients. It uses data with known classes to fit the model.

It can be called in several ways, and not all arguments are mandatory. See the 'Examples' section.

LPS(data, coeff, response, k, threshold, formula, method = "fdr", ...)

`data`	Continuous data used to retrieve classes, as a `data.frame` or `matrix`, with samples in rows and features (genes) in columns. Rows and columns should be named. Some precautions must be taken concerning data normalization, see the corresponding section below.
`coeff`	Pre-computed coefficients for the model, as returned by `LPS.coeff` (see there for format details).
`response`	Already known classes for the samples provided in `data`, preferably as a two-level `factor`. Can be missing if a `formula` with a response element is provided, but this argument takes precedence when both are given.
`k`	Single `integer` value, number of features to include in the model, in decreasing order of coefficient. Can be missing if `threshold` or `formula` is provided, but this argument takes precedence over both of them.
`threshold`	Single `numeric` value, p-value threshold to apply for feature selection. Can be missing if `k` or `formula` is provided; `k` takes precedence over it, and it takes precedence over `formula`.
`formula`	A `formula` object, describing the model to fit (several templates are handled, see 'Examples'). The formula response element (before the "~" sign) can replace the `response` argument if it is not provided. The variables (after the "~" sign) can be a single integer (standing for the `k` argument), a single numeric (standing for the `threshold` argument) or a sum of feature names to use directly. "." is also handled in the usual way (all `data` columns), and "1" is a more efficient way to refer to all numeric columns of `data`.
`method`	Single character value, to be passed to `p.adjust` when `threshold` is provided.
`...`	Further arguments are passed to `model.frame` if `response` is missing (thus defined via `formula`). `subset` and `na.action` may be particularly useful for cross-validation schemes, see `model.frame.default` for details. `subset` is always handled but masked in "..." for compatibility reasons.
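
To make the `threshold` and `method` arguments more concrete, here is a minimal base R sketch of p-value adjustment followed by thresholding. The p-values, feature names and the `<=` comparison are assumptions made for illustration only; the sketch does not call the LPS package and may not match its internal selection rule. The last two lines show why a formula such as `group~10` is read as a `threshold` while `group~10L` is read as `k`.

```r
# Hypothetical raw p-values for four features (made-up numbers)
p <- c(geneA = 0.001, geneB = 0.01, geneC = 0.02, geneD = 0.5)

# "fdr" is an alias of "BH" (Benjamini-Hochberg) in stats::p.adjust
adj <- p.adjust(p, method = "fdr")

# Keep features whose adjusted p-value passes a hypothetical threshold
# (assumption: inclusive comparison, the package may differ)
threshold <- 0.025
selected <- names(adj)[adj <= threshold]
print(selected)

# A bare number is a double in R; only the L suffix or as.integer() gives an integer
print(is.integer(10))
print(is.integer(10L))
```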

An object of (S3) class "LPS":

`coeff`	Named numeric vector, the coefficients used in the model.
`classes`	Character vector, the labels of the two groups to be predicted.
`scores`	List of two numeric vectors, training dataset scores sorted by group.
`means`	Numeric vector, score means of each group in the training dataset.
`sds`	Numeric vector, score `sd` of each group in the training dataset.
`ovl`	Numeric value, overlapping coefficient as returned by `OVL`.
`k`	Integer value, number of features selected in the model (if relevant).
`p.threshold`	Numeric value, threshold used for feature selection (if relevant).
`p.method`	Character value, p-value correction used for feature selection (if relevant).
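
As an illustration of how the `scores`, `means` and `sds` elements relate to the coefficients, the following base R sketch computes a linear predictor score as a coefficient-weighted sum of feature values, in the spirit of Wright et al. The toy matrix, the coefficients and all object names are hypothetical, and this is not the package's own code.

```r
# Toy data: 6 samples (rows) x 3 features (columns), made-up values
set.seed(1)
expr <- matrix(rnorm(18), nrow = 6,
               dimnames = list(paste0("s", 1:6), c("g1", "g2", "g3")))
group <- factor(rep(c("A", "B"), each = 3))

# Hypothetical coefficients, one per feature
coeff <- c(g1 = 2.0, g2 = -1.5, g3 = 0.5)

# Score of each sample: sum over features of coefficient * value
score <- drop(expr[, names(coeff)] %*% coeff)

# Per-group summaries, analogous to the 'scores', 'means' and 'sds' elements
scores <- split(score, group)
means <- sapply(scores, mean)
sds <- sapply(scores, sd)
```

Selecting the columns by `names(coeff)` keeps features and coefficients aligned even if the matrix holds more columns than the model uses.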

As expression values are directly used in the score, gene centering and scaling are strongly recommended. For Affymetrix raw expression values (strictly positive, linear and absolute), Wright et al. suggest a multiplicative centering on a median of 1000 followed by a log2 transformation. For log-ratios, gene centering and scaling should not be necessary, as such values are naturally 0-centered.
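
The preprocessing described above could be sketched in base R as follows. The values are made up, and applying the multiplicative centering per sample (row) is an assumption, since the text does not state along which dimension the median is taken.

```r
# Made-up strictly positive "raw" values: 4 samples (rows) x 5 genes (columns)
set.seed(42)
raw <- matrix(runif(20, min = 50, max = 5000), nrow = 4,
              dimnames = list(paste0("s", 1:4), paste0("g", 1:5)))

# Assumption: rescale each sample so that its median becomes 1000
med <- apply(raw, 1, median)
centered <- sweep(raw, 1, med, "/") * 1000

# log2 transformation
logged <- log2(centered)

# Gene centering and scaling: scale() works column-wise, i.e. per gene here
scaled <- scale(logged, center = TRUE, scale = TRUE)
```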

Using a numeric `matrix` as `data` and a `factor` as `response` is the fastest way to compute coefficients, if computation time matters (as in cross-validation schemes). `formula` is there only for consistency with R modeling functions, and to provide `response`, `k` or `threshold` through a single argument.
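
A minimal sketch of preparing such inputs once, before any repeated fitting, is given below. The data frame, class labels and object names are hypothetical.

```r
# Made-up data.frame with numeric feature columns, and a character class vector
df <- data.frame(g1 = c(0.2, -1.1, 0.7, 1.3),
                 g2 = c(1.5, 0.3, -0.4, -0.9),
                 row.names = paste0("s", 1:4))
classes <- c("ABC", "GCB", "ABC", "GCB")

# Convert once, outside any cross-validation loop
mat <- as.matrix(df)     # numeric matrix, samples in rows, features in columns
resp <- factor(classes)  # two-level factor

# Sanity checks on the prepared inputs
stopifnot(is.matrix(mat), is.numeric(mat), nlevels(resp) == 2)
```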

Sylvain Mareschal

Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol. 2002;9(3):505-11.

Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.

Bohers E, Mareschal S, Bouzelfen A, Marchand V, Ruminy P, Maingonnat C, Menard AL, Etancelin P, Bertrand P, Dubois S, Alcantara M, Bastard C, Tilly H, Jardin F. Targetable activating mutations are very frequent in GCB and ABC diffuse large B-cell lymphoma. Genes Chromosomes Cancer. 2014 Feb;53(2):144-53.

LPS.coeff

  # Data with features in columns
  data(rosenwald)
  group <- rosenwald.cli$group
  expr <- t(rosenwald.expr)
  
  # NA imputation (feature's mean to minimize impact)
  f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x }
  expr <- apply(expr, 2, f)
  
  # Coefficients
  coeff <- LPS.coeff(data=expr, response=group)
  
  
  # 10 best features (straightforward)
  m <- LPS(data=expr, coeff=coeff, response=group, k=10)
  
  # 10 best features (formula)
  ### 'k' MUST be an integer, or will be understood as a 'threshold'
  ### Numbers are "numeric", enforce integer with "L" or "as.integer"
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~10L)
  k <- as.integer(10)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~k)
  
  # FDR threshold
  thr <- 0.01
  m <- LPS(data=expr, coeff=coeff, response=group, threshold=thr)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~0.01)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~thr)
  
  # Custom model
  m <- LPS(data=expr, coeff=coeff[ c("27481","17013") ,], response=group, k=2)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~`27481`+`17013`)
  ### Notice backticks in formula for syntactically invalid names
  
  # Complete model
  m <- LPS(data=expr, coeff=coeff, response=group, k=ncol(expr))
  m <- LPS(data=expr, coeff=coeff, response=group, threshold=1)
  ### m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~.)
  ### The last is correct but (really) slow on large datasets