# sisal: Sequential Input Selection Algorithm (SISAL)

## Description

Identifies relevant inputs using a backward selection type algorithm with optional branching. Choices are made by assessing linear models estimated with ordinary least squares or ridge regression in a cross-validation setting.

## Usage

```r
sisal(X, y, Mtimes = 100, kfold = 10, hbranches = 1,
      max.width = hbranches^2, q = 0.165, standardize = TRUE,
      pruning.criterion = c("round robin", "random nodes",
                            "random edges", "greedy"),
      pruning.keep.best = TRUE, pruning.reverse = FALSE, verbose = 1,
      use.ridge = FALSE, max.warn = getOption("nwarnings"), sp = -1, ...)
```

## Arguments

- `X`: a `numeric` `matrix` where each column is a predictor (independent variable) and each row is an observation (data point).
- `y`: a `numeric` vector containing a sample of the response (dependent) variable, in the same order as the rows of `X`.
- `Mtimes`: the number of times the cross-validation is repeated, i.e. the number of predictions made for each data point. An integral value (`numeric` or `integer`).
- `kfold`: the number of approximately equally sized parts used for partitioning the data on each cross-validation round. An integral value (`numeric` or `integer`).
- `hbranches`: the number of branches to take when removing a variable from the model. In Tikka and Hollmén (2008), the algorithm always removes the "weakest" variable (`hbranches` equals `1`, also the default here). With a value larger than `1`, the algorithm creates branches in the search graph by removing each of the `hbranches` "weakest" variables, one at a time. The number of branches created is naturally limited by the number of variables remaining in the model at that point. See also `max.width`.
- `max.width`: the maximum number of nodes with a given number of variables allowed in the search graph. The same limit is used for all search levels. An integral value (`numeric` or `integer`). See `pruning.criterion` and `pruning.keep.best`.
- `q`: a `numeric` value between `0` and `0.5` (endpoints excluded) defining the quantiles `1 - q` and `q`. The difference of these sample quantiles is used as the width of the sampling distribution (a measure of uncertainty) of each coefficient in a linear model. The default value `0.165` is the same as used by Tikka and Hollmén (2008). For a normally distributed parameter, the width is approximately twice the standard deviation (one standard deviation on each side of the mean).
- `standardize`: a `logical` flag. If `TRUE`, the data are standardized to zero mean and unit variance; if `FALSE`, the original data are used. This affects the scale of the results. If `use.ridge` is `TRUE`, this should be set to `TRUE`, or the search graph and the sets of selected variables could be affected.
- `pruning.criterion`: a `character` string; one of `"round robin"`, `"random nodes"`, `"random edges"` or `"greedy"` (abbreviations are allowed). This controls how the search tree is pruned when the number of nodes to explore is about to exceed `max.width`; one of the following methods is used to select `max.width` nodes for the next level of the search. With `"round robin"`, the nodes of the current level (`i` variables) take turns selecting nodes for the next level (`i - 1` variables). The turns are taken in order of increasing validation error, and each parent node chooses children according to the order described in 'Details'; if a duplicate choice would be made, the turn is skipped. With `"random nodes"`, random nodes are selected with uniform probability. With `"random edges"`, random nodes are selected, with the probability of a node directly proportional to the number of edges leading to it. With `"greedy"`, a method similar to `"round robin"` is used, but with the (virtual) looping order of parents and children swapped: whereas the outer loop in `"round robin"` operates over children and the inner loop over parents, the outer loop in `"greedy"` operates over parents and the inner loop over children. That is, a `"greedy"` parent node selects all its children before passing the turn to the next parent.
- `pruning.keep.best`: a `logical` flag. If `TRUE`, the nodes that would also be present in the `hbranches = 1` case are immune to pruning. If `FALSE`, the result may underperform the original Tikka and Hollmén (2008) solution in terms of (the lowest) validation error as a function of the number of inputs.
- `pruning.reverse`: a `logical` flag. If `TRUE`, all the methods described in `pruning.criterion` except `"random nodes"` use reverse orders or inverse probabilities. The default is `FALSE`.
- `verbose`: a `numeric` or `integer` verbosity level from `0` (no output) to `5` (all possible diagnostics).
- `use.ridge`: a `logical` flag. If `TRUE`, the function uses ridge regression with automatic selection of the regularization (smoothing) parameter.
- `max.warn`: a `numeric` value giving the maximum number of warnings to store in the returned object. If more warnings are produced, their total number is still recorded in the object.
- `sp`: a `numeric` value passed to `magic` if `use.ridge` is `TRUE`: the initial value of the regularization parameter. If negative (the default), initialization is automatic.
- `...`: additional arguments passed to `magic` if `use.ridge` is `TRUE`. It is an error to supply arguments named `"S"` or `"off"`.
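As a quick illustration of the `q` argument (a standalone sketch, not package code): for a normally distributed coefficient, the quantile width with the default `q = 0.165` is close to two standard deviations, as noted above.

```r
# Width of the interval between the q and 1 - q quantiles of a
# standard normal distribution, with the default q = 0.165.
q <- 0.165
width <- qnorm(1 - q) - qnorm(q)
width  # about 1.95, i.e. roughly 2 standard deviations when sd = 1
```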

## Details

When choosing which variable to drop from the model, the importance of a variable is measured using two statistics derived from the sampling distribution of its coefficient in the linear models of the repeated cross-validation runs:

1. absolute value of the median and

2. width of the distribution (see `q`).

The importance of an input variable is the ratio of the absolute value of the median to the width: the `hbranches` variables with the smallest ratios are dropped, one variable in each branch. See `max.width` and `pruning.criterion`.
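The importance measure can be sketched in a few lines of R. This is an illustration of the idea, not the package's internal code; the function name `importance` and the simulated coefficient samples are made up here.

```r
# Importance of an input variable: ratio of the absolute median of its
# coefficient's sampling distribution to the distribution's width,
# where width is the distance between the q and 1 - q sample quantiles.
importance <- function(coef.samples, q = 0.165) {
  qs <- quantile(coef.samples, probs = c(q, 1 - q), names = FALSE)
  abs(median(coef.samples)) / (qs[2] - qs[1])
}

set.seed(1)
importance(rnorm(1000, mean = 2, sd = 0.5))  # strong, stable input: large ratio
importance(rnorm(1000, mean = 0, sd = 0.5))  # weak input: ratio near zero
```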

The main results of the function are described here. More details are available in ‘Value’.

The function returns two sets of input variables:

- `L.v`: the set corresponding to the smallest validation error.
- `L.f`: the smallest set whose validation error is close to the smallest error. The margin is the standard deviation of the training error measured in the node of the smallest validation error.
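The `L.f` selection rule can be illustrated with a small self-contained sketch. The error vectors below are made up for the example (they are not output of `sisal` itself), and element i here stands for a model with i input variables.

```r
# Made-up per-model-size errors for illustration only.
E.v  <- c(1.90, 1.20, 0.95, 0.93, 0.94)  # mean validation MSE per size
s.tr <- c(0.30, 0.20, 0.15, 0.10, 0.10)  # sd of training MSE per size

k.v <- which.min(E.v)              # size with the smallest validation error
threshold <- E.v[k.v] + s.tr[k.v]  # margin: sd of training error at that node
k.f <- min(which(E.v <= threshold))  # smallest size within the margin

k.v  # 4: the L.v-style choice
k.f  # 3: the L.f-style choice, a smaller model with nearly the same error
```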

The means of the mean squared errors (MSEs) in the training and validation sets are also returned (`E.tr`, `E.v`). For the training set, the standard deviation of the MSEs (`s.tr`) is also returned. The length of these vectors is `d + 1`, where `d` is the number of variables in `X`; the i:th element in each vector corresponds to the best model with `i - 1` input variables, where goodness is measured by the mean MSE in the validation set.

Linear models fitted to the whole data set are also returned. Both ordinary least squares regression models (`lm.L.f`, `lm.L.v`, `lm.full`) and ridge regression models (`magic.L.f`, `magic.L.v`, `magic.full`) are computed, irrespective of the `use.ridge` setting. Both fitting methods are applied to the `L.f` set of variables, the `L.v` set, and the full set (all variables).

## Value

A `list` with `class` `"sisal"`. The items are:

- `L.f`: a `numeric` vector containing indices to columns of `X`. See 'Details'.
- `L.v`: a `numeric` index vector like `L.f`. See 'Details'.
- `E.tr`: a `numeric` vector of length `d + 1`. See 'Details'.
- `s.tr`: a `numeric` vector of length `d + 1`. See 'Details'.
- `E.v`: a `numeric` vector of length `d + 1`. See 'Details'.
- `L.f.nobranch`: a `numeric` vector or `NULL`. Like `L.f` but for the "no branching" solution. `NULL` if branching is not used or if some elements of `branching.useful` are missing.
- `L.v.nobranch`: like `L.f.nobranch` but related to `L.v`.
- `E.tr.nobranch`: a `numeric` vector or `NULL`. Like `E.tr` but for the "no branching" solution. `NULL` when `branching.useful` is `NULL`. An element is missing when the corresponding element of `branching.useful` is missing.
- `s.tr.nobranch`: like `E.tr.nobranch` but related to `s.tr`.
- `E.v.nobranch`: like `E.tr.nobranch` but related to `E.v`.
- `n.evaluated`: a `numeric` vector of length `d + 1`. The number of nodes evaluated for each model size, indexed by the number of variables used plus one.
- `edges`: a `list` of directed edges between nodes in the search graph. There is an edge from node A to node B if and only if B was a candidate for a new node to be evaluated, resulting from removing one variable in A. The i:th element of the list contains edges directed away from the node represented by the i:th element of `vertices`. Each element is a list with one element, `"edges"`, which is a `numeric` vector of indices to `vertices`, pointing to the nodes towards which the edges are directed. There are no edges directed away from pruned nodes or nodes representing a single variable.
- `vertices`: a `character` vector the same size as `edges`, containing the names of the nodes in the search graph. Each name contains the indices of the variables included in the set in question, separated by dots.
- `vertices.logical`: a `logical` `matrix` containing an alternative representation of `vertices`. The number of rows is the length of `vertices` and the number of columns is `d`. The i:th column indicates whether the i:th input variable is present in a given node. The row index and the index to `vertices` are equivalent.
- `vertex.data`: a `data.frame` with information about each node in the search graph (missing information means a pruned node). The rows correspond to items in `vertices`. The columns are:
    - `E.tr`: mean of MSEs, training.
    - `s.tr`: standard deviation (denominator `n - 1`) of MSEs, training.
    - `E.v`: mean of MSEs, validation.
    - `E.v.level.rank`: rank of the node among all the evaluated (non-pruned) nodes with the same number of variables, in terms of validation error. The smallest error has rank 1.
    - `n.rank.deficient`: number of rank-deficient linear models. This problem arises when the number of input variables is large compared to the number of observations and `use.ridge` is `FALSE`.
    - `n.NA.models`: number of models that could not be estimated due to a lack of any samples.
    - `n.inputs`: number of input variables used in the model represented by the node.
    - `min.branches`: the smallest branching factor large enough to produce the node. This is a number `k` between `1` and `hbranches`. The value for the root node (all input variables) is `1`. The value for other nodes is the minimum of the set of values suggested by its parents. The value suggested by an individual parent is the `min.branches` value of the parent itself or the ranking of the child in terms of increasing importance of the removed variable (see 'Details'), whichever is larger. For example, when `pruning.keep.best` is `TRUE`, the `hbranches = 1` search path can be followed by looking for nodes where `min.branches` is `1`.
- `var.names`: names of the variables (column names of `X`).
- `n`: number of observations in the (`X`, `y`) data.
- `d`: number of variables (columns) in `X`.
- `n.missing`: number of samples where either `y` or all variables of `X` are missing.
- `n.clean`: number of complete samples in the data set `X`, `y`.
- `lm.L.f`: `lm` model fitted to the `L.f` variables.
- `lm.L.v`: `lm` model fitted to the `L.v` variables.
- `lm.full`: `lm` model fitted to all variables.
- `magic.L.f`: `magic` model fitted to the `L.f` variables.
- `magic.L.v`: `magic` model fitted to the `L.v` variables.
- `magic.full`: `magic` model fitted to all variables.
- `mean.y`: mean of `y`.
- `sd.y`: standard deviation (denominator `n - 1`) of `y`.
- `zeroRange.y`: a `logical` value indicating whether all non-missing elements of `y` are equal, with some numeric tolerance.
- `mean.X`: column means of `X`.
- `sd.X`: standard deviation (denominator `n - 1`) of each column in `X`.
- `zeroRange.X`: a `logical` vector. Like `zeroRange.y` but for each column of `X`.
- `constant.X`: a `logical` vector where the i:th value indicates whether the i:th column of `X` has a (nearly) constant, non-zero value (`NA` values allowed).
- `params`: a named `list` containing the values used for most of the parameter-like formal arguments of the function, and also anything in `...`. The names are the names of the parameters.
- `pairwise.points`: a `numeric` square `matrix` with `d` rows and columns. The count in row i, column j indicates the number of times that variable i was better than variable j. See 'Details' in `plotSelected.sisal`.
- `pairwise.wins`: a `logical` square `matrix` with `d` rows and columns. A `TRUE` value in row i, column j indicates that variable i is more important than variable j. Derived from `pairwise.points`.
- `pairwise.preferences`: a `numeric` vector with `d` elements. The number of wins minus the number of losses (when another variable wins) per variable. Derived from `pairwise.wins`.
- `pairwise.rank`: an `integer` vector of ranks according to Copeland's pairwise aggregation method. Element number i is the rank of variable (column) number i in `X`. Derived from `pairwise.preferences`. See 'Details' in `plotSelected.sisal`.
- `path.length`: a `numeric` vector of path lengths. Consider a path starting from the full model and continuing through incrementally smaller models, each with the smallest validation error among the nodes with that number of variables. However, the path is broken at each point where the model with one less variable cannot be constructed by removing one variable from the bigger model (is not nested). The vector contains the lengths of the pieces; its length is the number of breaks plus one.
- `nested.path`: a `numeric` vector containing the indices (column numbers) of the input variables in their removal order on the "nested path". The first element is the index of the variable that was removed first; the remaining variable is the last element. If the path does not exist, this is `NULL`. See 'Details' in `plotSelected.sisal`.
- `nested.rank`: an `integer` vector of ranks determined by `nested.path`. Element number i is the rank of variable (column) number i in `X`. `NULL` if `nested.path` is `NULL`. See 'Details' in `plotSelected.sisal`.
- `branching.useful`: if branching is enabled (`hbranches > 1`), a `logical` vector of length `d`. If the i:th element is `TRUE`, branching improved the best model with i variables in terms of validation error. The result is `NA` if a comparison is not possible (which may happen if `pruning.keep.best` is `FALSE`). If branching is not used, this is `NULL`.
- `warnings`: the stored warnings, a `list` of objects that evaluate to a `character` string.
- `n.warn`: the number of warnings produced. May be higher than the number of warnings stored.
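The derivation of `pairwise.preferences` and `pairwise.rank` from `pairwise.wins` can be illustrated with a small self-contained sketch; the `wins` matrix below is made up for the example and is not computed by `sisal` itself.

```r
# Hypothetical pairwise.wins matrix for 3 variables:
# TRUE in row i, column j means variable i beats variable j.
wins <- matrix(c(FALSE, TRUE,  TRUE,
                 FALSE, FALSE, TRUE,
                 FALSE, FALSE, FALSE),
               nrow = 3, byrow = TRUE)

# Copeland-style score: wins minus losses per variable.
prefs <- rowSums(wins) - colSums(wins)
prefs  # 2 0 -2

# Rank 1 goes to the variable with the highest score.
rank(-prefs, ties.method = "min")  # 1 2 3
```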

## Author(s)

Mikko Korpela

## References

Tikka, J. and Hollmén, J. (2008) Sequential input selection algorithm for long-term prediction of time series. Neurocomputing, 71(13–15):2604–2615.

## See Also

See `magic` for information about the algorithm used for estimating the regularization parameter and the corresponding linear model when `use.ridge` is `TRUE`.

See `summary.sisal` for how to extract information from the returned object.
## Examples

```r
library(stats)
set.seed(123)
X <- cbind(sine = sin((1:100)/5),
           linear = seq(from = -1, to = 1, length.out = 100),
           matrix(rnorm(800), 100, 8,
                  dimnames = list(NULL, paste("random", 1:8, sep = "."))))
y <- drop(X %*% c(3, 10, 1, rep(0, 7)) + rnorm(100))
foo <- sisal(X, y, Mtimes = 10, kfold = 5)
print(foo)            # selected inputs "L.v" are same as
summary(foo$lm.full)  # significant coefficients of full model
```