hybrid: Feature selection with the hybrid algorithm
In silkeszy/Pomona: Identification of relevant variables in omics data sets using Random Forests

hybrid

R Documentation

Feature selection with the hybrid algorithm

Description

hybrid is an all-relevant random forest feature selection wrapper algorithm that uses the corrected impurity importance (Nembrini et al. 2018) as its variable importance measure (VIM). Analogously to Janitza et al. 2018, variables with negative impurity importance are interpreted as unimportant and are used to generate the null distribution and to calculate the p-values. The implementation is based on the original implementation of the Vita variable selection algorithm available in ranger and on that of Boruta. The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilise that test.
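The way negative importances can serve as a null distribution may be easier to see on a toy vector. The following is a minimal base R sketch in the spirit of Janitza et al., not the code used by hybrid or ranger: the importance values are invented, and the names `vim`, `null_imp` and `p_value` are illustrative only.

```r
# Invented importance scores; in practice they would come from a random forest.
vim <- c(a = 0.90, b = 0.40, c = 0.02, d = -0.03, e = -0.01, f = 0, g = -0.02)

# Null distribution: the negative importances, their mirror images, and zeros.
neg <- vim[vim < 0]
zero <- vim[vim == 0]
null_imp <- unname(c(neg, -neg, zero))

# Empirical p-value: share of null values strictly greater than each score.
p_value <- setNames(1 - ecdf(null_imp)(vim), names(vim))

# Variables whose p-value falls below the significance threshold.
alpha <- 0.05
selected <- names(p_value)[p_value < alpha]
print(round(p_value, 3))
print(selected)
```

With so few negative scores the null distribution is very coarse; in real data sets it is built from many more variables.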

Usage

hybrid(x, ...)

## Default S3 method:
hybrid(
  x,
  y,
  pValue = 0.01,
  mcAdj = TRUE,
  maxRuns = 100,
  doTrace = 0,
  holdHistory = TRUE,
  getImp,
  alpha = 0.05,
  seed,
  ...
)

## S3 method for class 'formula'
hybrid(formula, data = .GlobalEnv, ...)

Arguments

`x`	data frame of predictors.
`...`	additional parameters passed to `getImp`.
`y`	response vector; factor for classification, numeric vector for regression, `Surv` object for survival (support depends on the capabilities of the importance adapter).
`pValue`	confidence level. Default value should be used.
`mcAdj`	if set to `TRUE`, a multiple comparisons adjustment using the Bonferroni method will be applied. Default value should be used; older (1.x and 2.x) versions of hybrid were effectively using `FALSE`.
`maxRuns`	maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
`doTrace`	verbosity level. 0 means no tracing, 1 means reporting decision about each attribute as soon as it is justified, 2 means the same as 1, plus reporting each importance source run, 3 means the same as 2, plus reporting of hits assigned to yet undecided attributes.
`holdHistory`	if set to `TRUE`, the full history of importance is stored and returned as the `ImpHistory` element of the result. Setting it to `FALSE` reduces the memory footprint of hybrid when this side data is not used, especially when the number of attributes is huge; however, it disables plotting of the resulting `hybrid` objects and the use of the `TentativeRoughFix` function.
`getImp`	function used to obtain attribute importance. The default is get_imp_ranger, which runs random forest from the `ranger` package and gathers Z-scores of corrected impurities. It should return a numeric vector of a size identical to the number of columns of its first argument, containing importance measure of respective attributes. Any order-preserving transformation of this measure will yield the same result. It is assumed that more important attributes get higher importance. +-Inf are accepted, NaNs and NAs are treated as 0s, with a warning.
`alpha`	significance threshold used by Vita
`seed`	seed for the random number generator.
`formula`	alternatively, formula describing model to be analysed.
`data`	in which to interpret formula.
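The contract described for `getImp` (one numeric score per column of the first argument, higher meaning more important) can be illustrated with a toy adapter. This is only a sketch: `get_imp_abs_cor` is a hypothetical function, not part of the package, and it assumes the response is passed as the second argument and is numeric. Whether such a measure is suitable for the negative-importance test is a separate question.

```r
# Hypothetical importance adapter: absolute Spearman correlation with y.
# Assumes getImp is called as getImp(x, y, ...) with a numeric response.
get_imp_abs_cor <- function(x, y, ...) {
  imp <- vapply(
    x,
    function(col) abs(cor(as.numeric(col), y, method = "spearman")),
    numeric(1)
  )
  imp[is.na(imp)] <- 0  # mirror the documented handling of NA/NaN as 0
  imp
}
# The comment attribute is what the impSource element of the result reports.
comment(get_imp_abs_cor) <- "Absolute Spearman correlation (toy example)"

# Small check on simulated data: one informative and one irrelevant column.
set.seed(1)
x <- data.frame(signal = rnorm(50), noise = rnorm(50))
y <- 2 * x$signal + rnorm(50, sd = 0.1)
imp <- get_imp_abs_cor(x, y)
print(imp)
```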

Details

hybrid iteratively compares the importances of attributes with the importances of shadow attributes, created by shuffling the original ones. Attributes that have significantly worse importance than the shadow ones are consecutively dropped. On the other hand, attributes that are significantly better than the shadows are marked as Confirmed. Shadows are re-created in each iteration. The algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. If the second scenario occurs, some attributes may be left without a decision; they are marked as Tentative. You may try to increase maxRuns or lower pValue to resolve them, but in some cases their importances fluctuate too much for hybrid to converge. Instead, you can use the TentativeRoughFix function, which performs another, weaker test to make a final decision, or simply treat them as undecided in further analysis.
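The shadow comparison described above can be sketched in base R. The sketch makes several simplifying assumptions that are not taken from the package: importance is replaced by absolute correlation (`toy_imp`), rejected attributes are not dropped between runs, and decisions come from a Bonferroni-adjusted binomial test on the number of hits, as is commonly done in Boruta-style procedures.

```r
set.seed(42)
n <- 200
x <- data.frame(rel1 = rnorm(n), rel2 = rnorm(n),
                noise1 = rnorm(n), noise2 = rnorm(n))
y <- x$rel1 - x$rel2 + rnorm(n, sd = 0.5)

# Stand-in importance (assumption): absolute Pearson correlation with y.
toy_imp <- function(x, y) vapply(x, function(col) abs(cor(col, y)), numeric(1))

runs <- 30
hits <- setNames(integer(ncol(x)), names(x))
for (run in seq_len(runs)) {
  # Shadows are re-created in each run by permuting every original column.
  shadows <- as.data.frame(lapply(x, sample))
  names(shadows) <- paste0("shadow_", names(x))
  imp <- toy_imp(cbind(x, shadows), y)
  # An attribute scores a hit when it beats the best shadow in this run.
  hits <- hits + (imp[names(x)] > max(imp[names(shadows)]))
}

# Binomial test on the hit counts, Bonferroni-adjusted over the attributes.
p_threshold <- 0.01
p_confirm <- p.adjust(pbinom(hits - 1, runs, 0.5, lower.tail = FALSE),
                      method = "bonferroni")
p_reject <- p.adjust(pbinom(hits, runs, 0.5, lower.tail = TRUE),
                     method = "bonferroni")
decision <- setNames(
  ifelse(p_confirm < p_threshold, "Confirmed",
         ifelse(p_reject < p_threshold, "Rejected", "Tentative")),
  names(x)
)
print(hits)
print(decision)
```

Attributes that neither beat nor lose to the shadows often enough within the available runs stay Tentative, which is the situation that increasing maxRuns is meant to address.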

Value

An object of class hybrid, which is a list with the following components:

`finalDecision`	a factor of three values: `Confirmed`, `Rejected` or `Tentative`, containing the final result of feature selection.
`ImpHistory`	a data frame of importances of attributes gathered in each importance source run. Beside predictors' importances, it contains maximal, mean and minimal importance of shadow attributes in each run. Rejected attributes get `-Inf` importance. Set to `NULL` if `holdHistory` was given `FALSE`.
`timeTaken`	time taken by the computation.
`impSource`	string describing the source of importance, equal to a comment attribute of the `getImp` argument.
`call`	the original call of the `hybrid` function.
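As an illustration of how the `finalDecision` component can be used, the sketch below builds a hand-made stand-in for a result list and extracts the confirmed attributes from it. The object and its values are invented for the example; a real result is produced by calling `hybrid`.

```r
# Hand-built stand-in for a result list; attribute names and decisions
# are invented for illustration.
res <- list(
  finalDecision = factor(
    c(V1 = "Confirmed", V2 = "Rejected", V3 = "Tentative", V4 = "Confirmed"),
    levels = c("Tentative", "Confirmed", "Rejected")
  ),
  ImpHistory = NULL  # what holdHistory = FALSE would leave here
)

# Names of the attributes that were confirmed.
confirmed <- names(res$finalDecision)[res$finalDecision == "Confirmed"]
print(confirmed)

# Count of attributes per decision.
print(table(res$finalDecision))
```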

References

Nembrini, S., Koenig, I. R. & Wright, M. N. (2018). The revival of the Gini Importance? Bioinformatics. https://doi.org/10.1093/bioinformatics/bty373

Janitza, S., Celik, E. & Boulesteix, A.-L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Adv Data Anal Classif. https://doi.org/10.1007/s11634-016-0276-4

Kursa, M. B. & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1-13. http://www.jstatsoft.org/v36/i11/

Examples

set.seed(777)

## Not run: 
#hybrid on the "small redundant XOR" problem; read ?srx for details
data(srx)
hybrid(Y~.,data=srx)->hybrid.srx

#Results summary
print(hybrid.srx)

#Result plot
plot(hybrid.srx)

#Attribute statistics
attStats(hybrid.srx)

#Using alternative importance source, rFerns
hybrid(Y~.,data=srx,getImp=getImpFerns)->hybrid.srx.ferns
print(hybrid.srx.ferns)

#Verbose
hybrid(Y~.,data=srx,doTrace=2)->hybrid.srx

## End(Not run)
## Not run: 
#hybrid on the iris problem extended with artificial irrelevant features
#Generate said features
iris.extended<-data.frame(iris,apply(iris[,-5],2,sample))
names(iris.extended)[6:9]<-paste("Nonsense",1:4,sep="")
#Run hybrid on this data
hybrid(Species~.,data=iris.extended,doTrace=2)->hybrid.iris.extended
#Nonsense attributes should be rejected
print(hybrid.iris.extended)

## End(Not run)

## Not run: 
#hybrid on the HouseVotes84 data from mlbench
library(mlbench); data(HouseVotes84)
na.omit(HouseVotes84)->hvo
#Takes some time, so be patient
hybrid(Class~.,data=hvo,doTrace=2)->Bor.hvo
print(Bor.hvo)
plot(Bor.hvo)
plotImpHistory(Bor.hvo)

## End(Not run)
## Not run: 
#hybrid on the Ozone data from mlbench
library(mlbench); data(Ozone)
library(randomForest)
na.omit(Ozone)->ozo
hybrid(V4~.,data=ozo,doTrace=2)->Bor.ozo
cat('Random forest run on all attributes:\n')
print(randomForest(V4~.,data=ozo))
cat('Random forest run only on confirmed attributes:\n')
print(randomForest(ozo[,getSelectedAttributes(Bor.ozo)],ozo$V4))

## End(Not run)
## Not run: 
#hybrid on the Sonar data from mlbench
library(mlbench); data(Sonar)
#Takes some time, so be patient
hybrid(Class~.,data=Sonar,doTrace=2)->Bor.son
print(Bor.son)
#Shows important bands
plot(Bor.son,sort=FALSE)

## End(Not run)

silkeszy/Pomona documentation built on March 31, 2022, 11:13 p.m.
