NTA: Novel testing approach



Description

Calculates a p-value for each permutation variable importance measure, based on the empirical null distribution estimated from the non-positive importance values, as described in Janitza et al. (2015).

Usage

## Default S3 method:
NTA(PerVarImp)
## S3 method for class 'NTA'
print(x, ...)

Arguments

PerVarImp

permutation variable importance measures in a vector.

x

for the print method, an NTA object

...

optional parameters for print

Details

The observed non-positive permutation variable importance values are used to approximate the distribution of variable importance for non-relevant variables. The null distribution Fn0 is computed by mirroring the non-positive variable importance values on the y-axis, that is, by reflecting them around zero. Given the approximated null importance distribution, the p-value is the probability of observing the original PerVarImp or a larger value. This testing approach is suitable for data with a large number of variables that have no effect.
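The mirroring step described above can be illustrated with a small base-R sketch. It is not the package's source code: the helper name nta_sketch is hypothetical, zeros are assumed to be kept once rather than duplicated, and ties are assumed to count towards "or a larger value". The package may handle these details differently.

```r
# Hypothetical helper for illustration only; not the vita implementation.
nta_sketch <- function(vim) {
  vim  <- as.numeric(vim)
  neg  <- vim[vim < 0]    # negative importance values
  zero <- vim[vim == 0]   # importance values that are exactly zero
  # Assumption: mirror the negative values around zero and keep zeros once.
  null_vals <- c(neg, -neg, zero)
  # Assumption: the p-value is the share of null values that are
  # at least as large as the observed importance.
  pvalue <- vapply(vim, function(v) mean(null_vals >= v), numeric(1))
  list(M = null_vals, pvalue = pvalue)
}

# Toy importance vector: one clearly relevant variable, the rest near zero.
vim <- c(0.50, 0.02, -0.03, -0.01, 0, 0.01)
res <- nta_sketch(vim)
res$M
res$pvalue
```

If the input contains no non-positive values, the approximated null distribution in this sketch is empty and the p-values are undefined, which reflects why the approach needs many variables without any effect.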

PerVarImp should contain hold-out permutation variable importance measures. If standard permutation variable importance measures are used instead, the results may be biased.

This function has not yet been tested for regression tasks. It is therefore meant for expert users only and should be considered experimental.

Value

PerVarImp

the original permutation variable importance measures.

M

the non-positive variable importance values together with their mirrored values on the y-axis.

pvalue

the p-value, that is, the probability of observing the original PerVarImp or a larger value, given the approximated null importance distribution.

References

Janitza S, Celik E, Boulesteix A-L (2015). A computationally fast variable importance test for random forest for high dimensional data. Technical Report 185, University of Munich. <http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-25587-4>

See Also

CVPVI, importance, randomForest

Examples

##############################
#      Classification        #
##############################
## Load the package that provides CVPVI() and NTA()
library("vita")
## Simulating data
X = replicate(100, rnorm(200))
X = data.frame(X) # "X" can also be a matrix
z = with(X, 2*X1 + 3*X2 + 2*X3 + 1*X4 -
            2*X5 - 2*X6 - 2*X7 + 1*X8)
pr = 1/(1 + exp(-z))         # pass through an inverse-logit function
y = as.factor(rbinom(200, 1, pr))
##################################################################
# cross-validated permutation variable importance

cv_vi = CVPVI(X,y,k = 2,mtry = 3,ntree = 500,ncores = 2)
##################################################################
#compare them with the original permutation variable importance
library("randomForest")
cl.rf = randomForest(X,y,mtry = 3,ntree = 500, importance = TRUE)
##################################################################
# Novel Test approach
cv_p = NTA(cv_vi$cv_varim)
summary(cv_p,pless = 0.1)
pvi_p = NTA(importance(cl.rf, type=1, scale=FALSE))
summary(pvi_p)


###############################
#      Regression             #
###############################
##################################################################
## Simulating data:
X = replicate(100, rnorm(200))
X = data.frame(X) # "X" can also be a matrix
y = with(X, 2*X1 + 2*X2 + 2*X3 + 1*X4 - 2*X5 - 2*X6 - 1*X7 + 2*X8)

##################################################################
# cross-validated permutation variable importance
cv_vi = CVPVI(X,y,k = 2,mtry = 3,ntree = 500,ncores = 2)
##################################################################
#compare them with the original permutation variable importance
reg.rf = randomForest(X,y,mtry = 3,ntree = 500, importance = TRUE)
##################################################################
# Novel Test approach (not tested for regression so far!)
cv_p = NTA(cv_vi$cv_varim)
summary(cv_p,pless = 0.1)
pvi_p = NTA(importance(reg.rf, type=1, scale=FALSE))
summary(pvi_p)

Example output

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Call:
NTA.default(PerVarImp = cv_vi$cv_varim)


  p-values less than  0.1 :
 ---------------------------
    CV-PerVarImp p-value    
X1        0.0018 < 2e-16 ***
X2        0.0099 < 2e-16 ***
X3        0.0035 < 2e-16 ***
X4        0.0016 < 2e-16 ***
X5        0.0015 < 2e-16 ***
X6        0.0037 < 2e-16 ***
X8        0.0006 0.06522 .  
X25       0.0006 0.06522 .  
X26       0.0007 0.06522 .  
X29       0.0007 0.06522 .  
X30       0.0008 0.04348 *  
X41       0.0007 0.06522 .  
X61       0.0005 0.08696 .  
X64       0.0005 0.08696 .  
X66       0.0007 0.06522 .  
X71       0.0006 0.06522 .  
X75       0.0005 0.07609 .  
X80       0.0008 0.03261 *  
X88       0.0005 0.08696 .  
X97       0.0005 0.08696 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
NTA.default(PerVarImp = importance(cl.rf, type = 1, scale = FALSE))


  p-values less than  0.05 :
 ---------------------------
     CV-PerVarImp   p-value    
X1         0.0032 < 2.2e-16 ***
X2         0.0145 < 2.2e-16 ***
X3         0.0061 < 2.2e-16 ***
X4         0.0029 < 2.2e-16 ***
X5         0.0043 < 2.2e-16 ***
X6         0.0039 < 2.2e-16 ***
X12        0.0013  0.007246 ** 
X24        0.0010  0.028986 *  
X30        0.0009  0.028986 *  
X57        0.0012  0.014493 *  
X61        0.0011  0.021739 *  
X64        0.0010  0.028986 *  
X66        0.0017 < 2.2e-16 ***
X100       0.0012  0.014493 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
NTA.default(PerVarImp = cv_vi$cv_varim)


  p-values less than  0.1 :
 ---------------------------
    CV-PerVarImp p-value    
X1        0.5003 < 2e-16 ***
X2        1.1493 < 2e-16 ***
X3        0.8199 < 2e-16 ***
X4        0.0961 0.03571 *  
X5        0.4449 < 2e-16 ***
X6        0.8207 < 2e-16 ***
X7        0.3009 < 2e-16 ***
X8        0.7954 < 2e-16 ***
X9        0.0868 0.04762 *  
X11       0.0765 0.05952 .  
X18       0.0888 0.04762 *  
X28       0.1226 < 2e-16 ***
X34       0.0780 0.05952 .  
X36       0.0558 0.09524 .  
X43       0.0978 0.03571 *  
X52       0.0867 0.04762 *  
X53       0.0801 0.05952 .  
X71       0.0686 0.05952 .  
X80       0.0845 0.04762 *  
X83       0.1024 0.03571 *  
X89       0.0842 0.04762 *  
X91       0.0534 0.09524 .  
X94       0.0592 0.07143 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
NTA.default(PerVarImp = importance(reg.rf, type = 1, scale = FALSE))


  p-values less than  0.05 :
 ---------------------------
    CV-PerVarImp p-value    
X1        0.5806 < 2e-16 ***
X2        1.6275 < 2e-16 ***
X3        0.8396 < 2e-16 ***
X5        0.3304 < 2e-16 ***
X6        1.1062 < 2e-16 ***
X7        0.4124 < 2e-16 ***
X8        1.2230 < 2e-16 ***
X14       0.1115 0.03947 *  
X17       0.1629 < 2e-16 ***
X42       0.1715 < 2e-16 ***
X65       0.1633 < 2e-16 ***
X67       0.1474 0.01316 *  
X92       0.1292 0.02632 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

vita documentation built on May 2, 2019, 9:12 a.m.