DaMiR.FSelect: Feature selection for classification
In DaMiRseq: Data Mining for RNA-seq data: normalization, feature selection and classification


This function identifies the principal components (PCs) that correlate with class and then uses them in a backward variable elimination procedure to remove non-informative features.

DaMiR.FSelect(data, df, th.corr = 0.6, type = c("spearman", "pearson"),
  th.VIP = 3, nPlsIter = 1)

`data`	A transposed data frame or matrix of normalized expression data. Rows must be observations and columns must be features
`df`	A data frame with known variables; it must include at least one column labeled 'class'
`th.corr`	Minimum threshold of correlation between class and PCs; default is 0.6. Note: if df$class has more than two levels, this option is disabled and the number of PCs is set to 3.
`type`	Type of correlation metric; default is "spearman"
`th.VIP`	Threshold for the `bve_pls` function, used to remove non-important variables; default is 3
`nPlsIter`	Number of times that bve_pls is run. Each iteration produces a set of selected features; the sets are usually similar to each other but not exactly the same. When nPlsIter is > 1, the intersection of all sets is taken, so that only the most robust features are selected. Default is 1
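
The intersection step described for nPlsIter can be pictured with a minimal base R sketch. The feature names and the three sets below are made up for illustration; they are not real bve_pls output, and the package's internal implementation may differ.

```r
# Hypothetical feature sets, as if returned by three separate bve_pls runs
runs <- list(
  c("geneA", "geneB", "geneC", "geneD"),
  c("geneA", "geneC", "geneD", "geneE"),
  c("geneA", "geneC", "geneF")
)

# Keep only the features that were selected in every run
robust_features <- Reduce(intersect, runs)
print(robust_features)
# [1] "geneA" "geneC"
```

Because a feature must survive every run, raising nPlsIter can only keep the final set the same size or shrink it.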

The function reduces the number of features in order to retain the variables that are most informative for classification. First, the PCs obtained by principal component analysis (PCA) are correlated with "class". The correlation threshold is set by the user through the th.corr argument: the higher the threshold, the fewer PCs are returned. Importantly, if df$class has more than two levels, the number of PCs is automatically set to 3. In a binary experimental setting, users should set th.corr carefully, because the total number of selected features ultimately depends on the number of PCs. The bve_pls function of the plsVarSel package is then applied. It couples a backward variable elimination procedure with a partial least squares approach to remove the variables that are less informative with respect to class. The returned vector of variables can be further reduced by the subsequent DaMiR.FReduct function to obtain a subset of uncorrelated putative predictors.
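
The first step, selecting PCs by their correlation with class, can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used inside DaMiR.FSelect: the toy data, the use of the absolute correlation, the restriction to the first 10 PCs, and all variable names (`x`, `cls`, `th_corr`, `selected_pcs`) are hypothetical.

```r
set.seed(1)

# Toy data (assumption): 20 observations x 50 features, two classes
n_obs <- 20
n_feat <- 50
cls <- factor(rep(c("A", "B"), each = n_obs / 2))
x <- matrix(rnorm(n_obs * n_feat), nrow = n_obs,
            dimnames = list(NULL, paste0("feat", seq_len(n_feat))))

# Shift the first five features in class B so that they carry class signal
x[cls == "B", 1:5] <- x[cls == "B", 1:5] + 4

# PCA with observations in rows and features in columns
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Keep the first 10 PCs for brevity (assumption)
scores <- pca$x[, 1:10]

# Absolute Spearman correlation between each PC and the numeric class label
th_corr <- 0.6
pc_cor <- abs(cor(scores, as.numeric(cls), method = "spearman"))[, 1]

# PCs whose correlation with class reaches the threshold
selected_pcs <- names(pc_cor)[pc_cor >= th_corr]
print(selected_pcs)
```

Raising `th_corr` in this sketch leaves fewer PCs in `selected_pcs`, which mirrors the behaviour described above for th.corr.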

A list containing:

An expression matrix with only informative features.
A data frame with class and optional variables information.

Mattia Chiesa, Luca Piacentini

Tahir Mehmood, Kristian Hovde Liland, Lars Snipen and Solve Saebo (2012). A review of variable selection methods in Partial Least Squares Regression. Chemometrics and Intelligent Laboratory Systems 118, pp. 62-69.

bve_pls
DaMiR.FReduct

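# load the package and the accessor used below
library(DaMiRseq)
library(SummarizedExperiment)  # provides assay()
# use example data:
data(data_norm)
data(df)
# extract expression data from the SummarizedExperiment object
# and transpose the matrix so that rows are observations:
t_data <- t(assay(data_norm))
# keep the first 100 features to speed up the example
t_data <- t_data[, seq_len(100)]
# select class-related features
data_reduced <- DaMiR.FSelect(t_data, df,
                              th.corr = 0.7, type = "spearman", th.VIP = 1)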