sisal (R Documentation)

Description:

Identifies relevant inputs using a backward selection type algorithm with optional branching. Choices are made by assessing linear models estimated with ordinary least squares or ridge regression in a cross-validation setting.
Usage:

sisal(X, y, Mtimes = 100, kfold = 10, hbranches = 1,
      max.width = hbranches^2, q = 0.165, standardize = TRUE,
      pruning.criterion = c("round robin", "random nodes",
                            "random edges", "greedy"),
      pruning.keep.best = TRUE, pruning.reverse = FALSE,
      verbose = 1, use.ridge = FALSE,
      max.warn = getOption("nwarnings"), sp = -1, ...)
Arguments:

X: a numeric matrix containing the input variables, one column per variable and one row per observation.

y: a numeric vector, the response (output) variable.

Mtimes: the number of times the cross-validation is repeated, i.e. the number of predictions made for each data point. An integral value.

kfold: the number of approximately equally sized parts used for partitioning the data on each cross-validation round. An integral value.

hbranches: the number of branches to take when removing a variable from the model. In Tikka and Hollmén (2008), the algorithm always removes the "weakest" variable (hbranches = 1, the default). Values larger than one enable a branching search.

max.width: the maximum number of nodes with a given number of variables allowed in the search graph. The same limit is used for all search levels. An integral value.

q: a numeric value: the tail quantile used when measuring the width of the sampling distribution of a coefficient; the width is taken between quantiles q and 1 - q (see 'Details').

standardize: a logical flag: should the data be standardized (zero mean, unit variance) before running the algorithm?

pruning.criterion: a character string, one of "round robin", "random nodes", "random edges" or "greedy": the criterion used for choosing which nodes to discard when the number of nodes on a search level would exceed max.width.

pruning.keep.best: a logical flag: should the best node on each search level always be kept when pruning?

pruning.reverse: a logical flag: if TRUE, the preference order used when pruning is reversed.

verbose: a numeric verbosity level: larger values produce more diagnostic output, 0 disables it.

use.ridge: a logical flag: if TRUE, the linear models are estimated with ridge regression instead of ordinary least squares.

max.warn: the maximum number of warnings to store in the returned object. An integral value.

sp: a numeric value: the initial regularization parameter passed to magic when use.ridge is TRUE; a negative value requests automatic initialization.

...: additional arguments passed to magic when use.ridge is TRUE.
Details:

When choosing which variable to drop from the model, the importance of a variable is measured with two quantities derived from the sampling distribution of its coefficient in the linear models of the repeated cross-validation runs: the absolute value of the median, and the width of the distribution (see q). The importance of an input variable is the ratio of the median to the width. The hbranches variables with the smallest ratios are dropped, one variable in each branch (see max.width and pruning.criterion).
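The importance measure can be sketched in R. This is an illustrative reconstruction, not the package's internal code; it assumes the width is the distance between the q and 1 - q sample quantiles of the coefficient's distribution:

```r
# Importance of a variable: |median| of its coefficient divided by the
# width of the coefficient's sampling distribution (central 1 - 2q mass).
importance <- function(coefs, q = 0.165) {
  width <- diff(quantile(coefs, c(q, 1 - q), names = FALSE))
  abs(median(coefs)) / width
}
set.seed(1)
stable   <- rnorm(100, mean = 2,   sd = 0.1)  # consistently nonzero coefficient
unstable <- rnorm(100, mean = 0.1, sd = 1)    # coefficient that varies around 0
importance(stable) > importance(unstable)     # the unstable variable drops first
```

A variable whose coefficient is consistently far from zero across the cross-validation runs gets a large ratio and survives; a variable whose coefficient wanders around zero gets a small ratio and is dropped early.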
The main results of the function are described here; more details are available under 'Value'. The function returns two sets of input variables:

- L.f: the set corresponding to the smallest validation error.
- L.v: the smallest set where the validation error is close to the smallest error. The margin is the standard deviation of the training error measured in the node of the smallest validation error.
The means of the mean squared errors (MSE) in the training and validation sets are also returned (E.tr, E.v). For the training set, the standard deviation of the MSEs (s.tr) is also returned. The length of these vectors is the number of variables in X. The i:th element in each of the vectors corresponds to the best model with i input variables, where goodness is measured by the mean MSE in the validation set.
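The roles of E.v and s.tr in choosing the two variable sets can be sketched with made-up numbers (an illustration of the rule described above, not code from the package):

```r
# Toy values: element i corresponds to the best model with i inputs.
E.v  <- c(1.9, 1.2, 0.80, 0.78, 0.82)  # mean validation MSE per model size
s.tr <- c(0.4, 0.3, 0.10, 0.08, 0.09)  # sd of training MSE per model size
best <- which.min(E.v)                 # model size giving L.f: here 4
# Smallest model whose validation error stays within s.tr[best] of the minimum:
L.v.size <- min(which(E.v <= E.v[best] + s.tr[best]))  # here 3
```

With these numbers the L.f set has 4 variables, while the 3-variable model is already within the margin and would give the L.v set.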
Linear models fitted to the whole data set are also returned. Both ordinary least squares regression (lm.L.f, lm.L.v, lm.full) and ridge regression models (magic.L.f, magic.L.v, magic.full) are computed, irrespective of the use.ridge setting. Both fitting methods are used for the L.f set of variables, the L.v set and the full set (all variables).
Value:

A list with class "sisal". The items are:
L.f: the set of input variables in the model with the smallest mean validation error (see 'Details').

L.v: the smallest set of input variables whose mean validation error is within the margin described in 'Details'.

E.tr: a numeric vector: mean of the MSEs in the training set, one element per model size.

s.tr: a numeric vector: standard deviation of the MSEs in the training set, one element per model size.

E.v: a numeric vector: mean of the MSEs in the validation set, one element per model size.

L.f.nobranch: like L.f, but computed along the plain non-branching search path (only interesting when hbranches > 1).

L.v.nobranch: like L.v, but for the non-branching search path.

E.tr.nobranch: like E.tr, but for the non-branching search path.

s.tr.nobranch: like s.tr, but for the non-branching search path.

E.v.nobranch: like E.v, but for the non-branching search path.
n.evaluated: a numeric vector: the number of nodes (sets of variables) evaluated on each level of the search graph.

edges: the edges of the search graph.

vertices: the vertices (sets of variables) of the search graph.

vertices.logical: the vertices of the search graph represented as logical vectors.

vertex.data: a data.frame of information about the vertices of the search graph.

var.names: names of the variables (column names of X).
n: number of observations in the (X, y) data.

d: number of variables (columns) in X.

n.missing: number of samples where either y or some variable in X is missing.

n.clean: number of complete samples in the data set.

lm.L.f: an ordinary least squares linear model (lm) fitted to the L.f set of variables.

lm.L.v: an ordinary least squares linear model (lm) fitted to the L.v set of variables.

lm.full: an ordinary least squares linear model (lm) fitted to the full set of variables.

magic.L.f: a ridge regression model (see magic) fitted to the L.f set of variables.

magic.L.v: a ridge regression model (see magic) fitted to the L.v set of variables.

magic.full: a ridge regression model (see magic) fitted to the full set of variables.
mean.y: mean of y.

sd.y: standard deviation of y.

zeroRange.y: a logical flag: is the range of y zero, i.e. is y constant?

mean.X: column means of X.

sd.X: standard deviations of the columns of X.

zeroRange.X: a logical vector: which columns of X are constant?

constant.X: a logical flag: does X contain a constant column?

params: a named list of the parameters used in the run.
pairwise.points: a matrix of scores from pairwise comparisons between the variables in the search graph.

pairwise.wins: a matrix counting pairwise "wins" between the variables.

pairwise.preferences: a matrix of pairwise preferences between the variables.

pairwise.rank: an integer vector: rank of each variable according to the pairwise comparisons.

path.length: a numeric vector: lengths of the search paths.

nested.path: a vector giving the order in which variables are dropped along the nested (non-branching) search path.

nested.rank: an integer vector: rank of each variable according to nested.path.

branching.useful: if branching is enabled (hbranches > 1), a logical flag indicating whether branching found better results than the plain non-branching search; otherwise NULL.
warnings: warnings stored during the run. A list of at most max.warn items.

n.warn: number of warnings produced. May be higher than the number of warnings stored.
Author:

Mikko Korpela

References:

Tikka, J. and Hollmén, J. (2008) Sequential input selection algorithm for long-term prediction of time series. Neurocomputing, 71(13–15):2604–2615.
See Also:

See magic for information about the algorithm used for estimating the regularization parameter and the corresponding linear model when use.ridge is TRUE.

See summary.sisal for how to extract information from the returned object.
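As a rough illustration of what magic does (a sketch assuming the recommended mgcv package is installed; the exact call made inside sisal may differ), a ridge penalty can be estimated by GCV like this:

```r
library(mgcv)  # provides magic()

set.seed(1)
X <- cbind(1, matrix(rnorm(200), 50, 4))   # intercept plus 4 inputs
y <- drop(X %*% c(1, 2, 0, 0, -1)) + rnorm(50)
S <- list(diag(ncol(X)))                   # ridge penalty on all coefficients
# sp supplies the initial smoothing parameter; a negative value asks magic()
# to initialize it automatically before optimizing the GCV score.
fit <- magic(y, X, sp = -1, S = S, off = 1)
fit$sp   # estimated regularization parameter
fit$b    # penalized coefficient estimates
```

Here `off = 1` says the single penalty matrix in S applies to the coefficients starting at position 1, i.e. to all of them.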
Examples:

library(stats)
set.seed(123)
X <- cbind(sine=sin((1:100)/5),
           linear=seq(from=-1, to=1, length.out=100),
           matrix(rnorm(800), 100, 8,
                  dimnames=list(NULL, paste("random", 1:8, sep="."))))
y <- drop(X %*% c(3, 10, 1, rep(0, 7)) + rnorm(100))
foo <- sisal(X, y, Mtimes=10, kfold=5)
print(foo)           # shows the selected input sets, including "L.v"
summary(foo$lm.full) # significant coefficients of full model