scores: Feature Selection in NMF Models


Description

The function featureScore implements different methods to compute basis-specificity scores for each feature in the data.

Usage

featureScore(object, ...)

## S4 method for signature 'matrix'
featureScore(object, method = c("kim", "max"))

## S4 method for signature 'NMF'
featureScore(object, ...)

extractFeatures(object, ...)

## S4 method for signature 'matrix'
extractFeatures(
  object,
  method = c("kim", "max"),
  format = c("list", "combine", "subset"),
  nodups = TRUE
)

## S4 method for signature 'NMF'
extractFeatures(object, ...)

Arguments

object

an object from which scores/features are computed/extracted

...

extra arguments to allow extension

method

scoring or selection method. It specifies the name of one of the methods described in sections Feature scores and Feature selection.

Additionally for extractFeatures, it may be an integer vector that indicates the number of top most contributing features to extract from each column of object, when ordered in decreasing order, or a numeric value between 0 and 1 that indicates the minimum relative basis contribution above which a feature is selected (i.e. basis contribution threshold). In the case of a single numeric value (integer or percentage), it is used for all columns.

Note that extractFeatures(x, 1) means a relative contribution threshold of 100%. To select the top contributing feature in each component, pass an integer value as in extractFeatures(x, 1L). However, if all elements in method are > 1, they are automatically treated as if they were integers: extractFeatures(x, 2) means the top-2 most contributing features in each component.
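
The rules above can be summarised in a small base R sketch. The helper interpret_method is hypothetical and is not part of the NMF package; it only mirrors the documented behaviour for numeric values and makes no claim about how the package implements it.

```r
# Hypothetical helper (not part of the NMF package): classify a numeric
# `method` value following the rules described above.
interpret_method <- function(method) {
  stopifnot(is.numeric(method), length(method) >= 1)
  if (is.integer(method) || all(method > 1)) {
    # integer input, or doubles that are all > 1: number of top features
    list(type = "top", value = as.integer(method))
  } else if (all(method >= 0 & method <= 1)) {
    # values in [0, 1]: minimum relative basis contribution
    list(type = "threshold", value = method)
  } else {
    stop("ambiguous 'method': mix of values <= 1 and > 1")
  }
}

interpret_method(1)$type    # "threshold" (100%)
interpret_method(1L)$type   # "top" (single top feature)
interpret_method(2)$type    # "top" (top 2 features)
interpret_method(0.5)$type  # "threshold"
```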

format

output format. The following values are accepted:

‘list’

(default) returns a list with one element per column in object, each containing the indexes of the selected features, as an integer vector. If object has row names, these are used to name each index vector. Components for which no features were selected are assigned an NA value.

‘combine’

returns all indexes in a single vector. Duplicated indexes are made unique if nodups=TRUE (default).

‘subset’

returns an object of the same class as object, but subset with the selected indexes, so that it contains data only from basis-specific features.

nodups

logical that indicates if duplicated indexes, i.e. features selected on multiple basis components (which should in theory not happen), should appear only once in the result. Only used when format='combine'.
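
As a rough illustration of how the three output formats relate, the following base R sketch starts from a hand-written selection and derives the other two shapes. The matrix W and the list sel are made-up toy data, and keeping the order of first appearance when dropping duplicates is an assumption, not documented behaviour.

```r
# Toy basis matrix: 6 features (rows) x 2 basis components (columns)
W <- matrix(c(5, 4, 0, 0, 1, 3,
              0, 1, 6, 2, 1, 3), ncol = 2,
            dimnames = list(paste0("f", 1:6), NULL))

# Hypothetical selection in 'list' style: one named index vector per column
sel <- list(c(f1 = 1L, f2 = 2L, f6 = 6L), c(f3 = 3L, f6 = 6L))

# 'combine' style: a single vector of indexes
combined <- unlist(sel)

# Effect comparable to nodups = TRUE: each index appears only once
combined_unique <- combined[!duplicated(combined)]

# 'subset' style: keep only the rows of the selected features
W_sub <- W[combined_unique, , drop = FALSE]
```

Here feature f6 was selected on both components, so it appears twice in combined and once in combined_unique.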

Details

One of the properties of Nonnegative Matrix Factorization is that it tends to produce sparse representations of the observed data, leading to a natural application to bi-clustering, which characterises groups of samples by a small number of features.

In NMF models, samples are grouped according to the basis component that contributes the most to each sample, i.e. the basis component that has the greatest coefficient in the corresponding column of the coefficient matrix (see predict,NMF-method). Each group of samples is then characterised by a set of features selected based on basis-specificity scores that are computed on the basis matrix.
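
The grouping rule can be sketched in base R on a toy coefficient matrix. The matrix H below is made up for illustration; this is a minimal sketch of the rule as described, not the code behind predict.

```r
# Toy coefficient matrix H: 3 basis components (rows) x 5 samples (columns).
# matrix() fills column by column, so each line below is one sample.
H <- matrix(c(0.9, 0.1, 0.0,
              0.2, 0.7, 0.1,
              0.1, 0.2, 0.7,
              0.6, 0.3, 0.1,
              0.0, 0.4, 0.6), nrow = 3)

# Assign each sample to the basis component with the largest coefficient
groups <- apply(H, 2, which.max)
groups
```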

Value

featureScore returns a numeric vector whose length is the number of rows in object (i.e. one score per feature).

extractFeatures returns the selected features as a list of indexes, a single integer vector or an object of the same class as object that only contains the selected features.

Feature scores

The function featureScore can compute basis-specificity scores using the following methods:

‘kim’

Method defined by Kim and Park (2007).

The score for feature i is defined as:

S_i = 1 + (1/log2(k)) sum_q [ p(i,q) log2( p(i,q) ) ],

where k is the number of basis components, W is the basis matrix, and p(i,q) is the probability that the i-th feature contributes to basis q:

p(i,q) = W(i,q) / (sum_r W(i,r))

The feature scores are real values within the range [0,1]. The higher the feature score the more basis-specific the corresponding feature.
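
The formula above can be checked by hand with a few lines of base R. The function kim_score is a hypothetical name for this sketch and is not the package's implementation. It assumes a non-negative matrix with at least two columns and no all-zero rows, and uses the usual convention 0 * log2(0) = 0.

```r
# Sketch of the score defined above (hypothetical helper, base R only)
kim_score <- function(W) {
  stopifnot(is.matrix(W), all(W >= 0), ncol(W) > 1)
  k <- ncol(W)
  p <- W / rowSums(W)                     # p(i, q): row-normalised contributions
  plogp <- ifelse(p > 0, p * log2(p), 0)  # convention: 0 * log2(0) = 0
  1 + rowSums(plogp) / log2(k)
}

W <- rbind(specific = c(1, 0, 0),  # contributes to a single basis
           uniform  = c(1, 1, 1),  # contributes equally to all bases
           mixed    = c(2, 1, 1))
s <- kim_score(W)
s
```

A feature that contributes to a single basis reaches the upper bound 1, while a feature that contributes equally to all bases gets a score of 0.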

‘max’

Method defined by Carmona-Saez et al. (2006).

The feature scores are defined as the row maximums.

Feature selection

The function extractFeatures can select features using the following methods:

‘kim’

uses the scoring scheme and feature selection method of Kim and Park (2007).

The features are first scored using the function featureScore with method ‘kim’. Then only the features that fulfil both of the following criteria are retained:

  • score greater than \hat{μ} + 3 \hat{σ}, where \hat{μ} and \hat{σ} are the median and the median absolute deviation (MAD) of the scores respectively;

  • the maximum contribution to a basis component is greater than the median of all contributions (i.e. of all elements of W).
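
The two criteria above can be sketched in base R as follows. The helper kim_select and the toy matrix are made up for illustration. The sketch uses R's mad() with its default scaling constant and strict inequalities; whether the package makes the same choices is an assumption.

```r
# Sketch of the two selection criteria (hypothetical helper, base R only)
kim_select <- function(W) {
  k <- ncol(W)
  p <- W / rowSums(W)
  # basis-specificity score, with the convention 0 * log2(0) = 0
  s <- 1 + rowSums(ifelse(p > 0, p * log2(p), 0)) / log2(k)
  threshold <- median(s) + 3 * mad(s)      # assumption: default-scaled MAD
  specific <- s > threshold                # criterion 1: high score
  strong <- apply(W, 1, max) > median(W)   # criterion 2: large contribution
  which(specific & strong)
}

# Toy basis matrix: 10 features x 2 basis components.
# Only the last feature is strongly tied to a single basis.
W <- cbind(c(5, 6, 4, 5.5, 4.5, 6, 5, 7, 5, 20),
           c(5, 4, 6, 4.5, 5.5, 5, 6, 5, 7, 0.1))
selected <- kim_select(W)
selected
```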

‘max’

uses the selection method used in the bioNMF software package and described in Carmona-Saez et al. (2006).

For each basis component, the features are first sorted by decreasing contribution. Then, only the leading consecutive features whose highest contribution in the basis matrix is actually found on the basis component under consideration are selected.
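
A minimal base R sketch of this procedure is given below. The helper max_select and the toy matrix are hypothetical and not taken from the package or from bioNMF. Unlike the 'list' format described earlier, this sketch returns an empty vector rather than NA when nothing is selected for a component.

```r
# Sketch of the 'max' selection rule (hypothetical helper, base R only)
max_select <- function(W) {
  top_basis <- apply(W, 1, which.max)  # basis on which each feature peaks
  lapply(seq_len(ncol(W)), function(q) {
    ord <- order(W[, q], decreasing = TRUE)  # features by decreasing contribution
    on_q <- top_basis[ord] == q
    # keep the leading run of features whose maximum is on basis q
    n_keep <- if (all(on_q)) length(ord) else which(!on_q)[1] - 1L
    ord[seq_len(n_keep)]
  })
}

# Toy basis matrix: 5 features x 2 basis components
W <- matrix(c(9, 7, 1, 2, 4,
              1, 2, 8, 6, 5), ncol = 2)
res <- max_select(W)
res
```

For the first component the sorted features are 1, 2, 5, 4, 3; feature 5 peaks on the second component, so the run stops after features 1 and 2.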

Methods (by generic)

extractFeatures:

featureScore:

References

Kim H, Park H (2007). “Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis.” _Bioinformatics (Oxford, England)_, *23*(12), 1495-502. ISSN 1460-2059, doi: 10.1093/bioinformatics/btm134 (URL: https://doi.org/10.1093/bioinformatics/btm134).

Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A (2006). “Biclustering of gene expression data by Non-smooth Non-negative Matrix Factorization.” _BMC bioinformatics_, *7*, 78. ISSN 1471-2105, doi: 10.1186/1471-2105-7-78 (URL: https://doi.org/10.1186/1471-2105-7-78).


Examples

library(NMF)

# random NMF model: rank 3, 50 features, 20 samples
x <- rnmf(3, 50, 20)

# probably no feature is selected
extractFeatures(x)
# extract top 5 for each basis
extractFeatures(x, 5L)
# extract features that have a relative basis contribution above a threshold
extractFeatures(x, 0.5)
# ambiguity?
extractFeatures(x, 1) # means relative contribution above 100%
extractFeatures(x, 1L) # means top contributing feature in each component

renozao/NMF documentation built on June 14, 2020, 9:35 p.m.