featscores: Univariate feature scores and significance tests


Description

featscore computes a univariate score for each feature, and featscore.test computes the corresponding p-values to assess the statistical significance of these scores.

Usage

featscore(x, y, type = "pearson", exclude = NULL, ...)

featscore.test(
  x,
  y,
  type = "pearson",
  exclude = NULL,
  test.max = FALSE,
  perms = 1000
)

Arguments

x

The original feature matrix, columns denoting the features and rows the instances.

y

A vector of observed target values to predict using x. Can be a factor for classification problems.

type

Score type. One of 'pearson', 'kendall', 'spearman', or 'runs'. The first three denote the type of correlation computed (passed to cor), whereas 'runs' denotes the runs test, which can potentially detect nonlinear relationships.

exclude

Columns (variables) in x to ignore. The score will be zero for these.

...

Currently ignored.

test.max

If TRUE, compute the p-value for the maximum of the univariate scores. If FALSE (default), compute p-values separately for each feature.

perms

Number of random permutations used to estimate the p-values for the univariate scores.
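To illustrate how the perms and test.max arguments interact, the sketch below reimplements a permutation test using plain absolute Pearson correlation as the score. This is an assumed mechanism for illustration only, not the package internals, and the helper name perm_pvalues is hypothetical:

```r
# Hypothetical sketch of a permutation test for univariate scores
# (not the package's implementation). Permute y, recompute the scores,
# and estimate each p-value as the fraction of permuted scores that
# reach the observed one.
perm_pvalues <- function(x, y, perms = 1000, test.max = FALSE) {
  obs <- abs(cor(x, y))[, 1]              # observed score per feature
  null <- replicate(perms, {
    r <- abs(cor(x, sample(y)))[, 1]      # scores for permuted target
    if (test.max) max(r) else r
  })
  if (test.max) {
    # compare each observed score to the null distribution of the maximum
    sapply(obs, function(s) mean(null >= s))
  } else {
    # per-feature comparison: null is a (features x perms) matrix
    rowMeans(null >= obs)
  }
}
```

With test.max = TRUE, each observed score is compared against the permutation distribution of the maximum score over all features, which accounts for having selected the best-looking feature among many.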

Value

A vector giving the univariate scores (featscore) or p-values (featscore.test) for each feature.

Details

Univariate scores are a useful technique for assessing variable relevance and can be used for screening. The reference below discusses univariate scores and gives practical tips on when they are appropriate and how to use them.
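For intuition on the 'runs' score type, a runs-style statistic for a single feature might look like the following sketch. This is a hypothetical illustration (the function runs_score and its exact statistic are assumptions, not the package's implementation): sort the instances by the feature, take the signs of y around its median, and compare the observed number of sign runs to its expectation under independence.

```r
# Hypothetical runs-based relevance score for one feature (illustration
# only, not the package's implementation). A dependence between the
# feature and y tends to produce fewer sign runs than expected by chance.
runs_score <- function(xj, yv) {
  s <- sign(yv[order(xj)] - median(yv))  # signs of y around its median,
  s <- s[s != 0]                         # ordered by the feature; drop ties
  n_runs <- 1 + sum(diff(s) != 0)        # observed number of sign runs
  n1 <- sum(s > 0)
  n2 <- sum(s < 0)
  mu <- 1 + 2 * n1 * n2 / (n1 + n2)      # expected runs under independence
  mu - n_runs                            # larger value = stronger dependence
}
```

Because the statistic only looks at the ordering of the instances along the feature, it can pick up non-monotonic effects (e.g. y depending on xj^2) that correlation-based scores miss.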

References

Neal, R. and Zhang, J. (2006). High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees. In Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. A., editors, Feature Extraction, Foundations and Applications, pages 265-296. Springer.

Examples

###

# load the features x and target values y for the prostate cancer data
data("prostate", package = "dimreduce")
x <- prostate$x
y <- prostate$y

# absolute correlation between the target and each of the features
r <- featscore(x, y)
plot(r)

# compute the p-values for the univariate relevances of each original feature
pval <- featscore.test(x, y)
hist(pval, 30) # should have uniform distribution if no relevant variables
sum(pval < 0.001) # number of variables with p-value below some threshold
0.001 * ncol(x) # expected number of p-values below the threshold by chance alone


# create some synthetic data
set.seed(213039)
func <- function(x) {
  # linear in x1, nonlinear in x2 and x3 (other inputs are irrelevant)
  x[, 1] + x[, 2]^2 + 3 * cos(pi * x[, 3])
}
sigma <- 0.5
n <- 200
p <- 10 # total number of features
x <- matrix(rnorm(n * p), n, p)
y <- func(x) + sigma * rnorm(n) # y = f(x) + e, e ~ N(0,sigma^2)

# significance test for marginal rank correlations;
# this is unlikely to detect any non-monotonic effects
pval <- featscore.test(x, y, type = "spearman")
which(pval < 0.05)
plot(pval)

# runs test; this is a weaker test than correlation tests, but it
# can potentially detect non-monotonic effects
pval <- featscore.test(x, y, type = "runs")
which(pval < 0.05)
plot(pval)

jpiironen/dimreduce documentation built on March 18, 2021, 11:52 p.m.