pRF

Description

The workhorse function - estimates statistical significance of feature importance by permuting the response variable

Usage

1
2
3
pRF(response, predictors, n.perms, alpha = 0.05, 
mtry = NULL, type = c("classification", "regression"),
ntree = 500,seed=12345, ...)

Arguments

response

a character vector or a factor for classification containing the group memberships for classification, a numeric vector for regression

predictors

A matrix consisting of features (measurements) corresponding to samples. The orientation per se does not matter - the function orients them correctly for Random Forest learning.

n.perms

Number of permutations to estimate significance. If the number of all possible permutations is less than this the latter will be used for estimation.

alpha

The significance level threshold of p.values for estimating false discovery rate using the two-step BH method for correlated test statistics, as implemented in the multtest package's mt.rawp2adjp function.

mtry

see ?randomForest for details - defines how many features are randomly sampled for building trees

type

string, set to "classification" or "regression"

ntree

number of trees in the random forest, see documentation from the randomForest package for details.

seed

set seed to ensure reproducibility from run to run and to standardise runs on actual and permuted data

...

Arguments to pass on to the randomForest function

Value

A standardised list containing

Res.table

A data.frame containing significance, FDR, and the feature name. b= number of permutations yielding a higher importance than observed + 1, m= number of permutations + 1

obs

named numeric vector, contains observed importances

perms

data.frame, contains importance values from permutations

Model

the randomForest model that was fit to the original data

Author(s)

Ankur Chakravarthy

References

The main function is based on the idea presented in

Altmann A, Tolosi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010 May 15;26(10):1340-7. doi: 10.1093/bioinformatics/btq134. Epub 2010 Apr 12. PubMed PMID: 20385727.

The permutation p.values in the package are exact, calculated according to

Phipson B, Smyth GK. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat Appl Genet Mol Biol. 2010;9:Article39. doi: 10.2202/1544-6115.1585. Epub 2010 Oct 31. PubMed PMID: 21044043.

False discovery rates account for correlations using the Two-Step BH procedure, initially reported in

Yoav Benjamini, Abba M. Krieger, and Daniel Yekutieli, 'Adaptive Linear Step-up Procedures That Control the False Discovery Rate', Biometrika, 93 (2006), 491-507.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
#Load the iris dataset
data(iris)

#Set up the predictors object

predictors=iris[,c(1:4)]
colnames(predictors)<-colnames(iris[1:4])

#Execute the main pRF function
p.test<-pRF(response=factor(iris$Species),
predictors=iris[,c(1:4)],n.perms=20,mtry=3,
type="classification",alpha=0.05)

#Put together a dataframe that consists of the 
#significance stats and observed importance metrics

df<-cbind(p.test$Res.table,p.test$obs)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.