# rlars: Robust least angle regression

In robustHD: Robust Methods for High-Dimensional Data

## Description

Robustly sequence candidate predictors according to their predictive content and find the optimal model along the sequence.

## Usage

```r
rlars(x, ...)

## S3 method for class 'formula'
rlars(formula, data, ...)

## Default S3 method:
rlars(x, y, sMax = NA, centerFun = median, scaleFun = mad,
      winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE,
      s = c(0, sMax), regFun = lmrob, regArgs = list(),
      crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe,
      costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1,
      ncores = 1, cl = NULL, seed = NULL, model = TRUE,
      tol = .Machine$double.eps^0.5, ...)
```

## Arguments

| Argument | Description |
| --- | --- |
| `x` | a matrix or data frame containing the candidate predictors. |
| `formula` | a formula describing the full model. |
| `data` | an optional data frame, list or environment (or object coercible to a data frame by `as.data.frame`) containing the variables in the model. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `rlars` is called. |
| `y` | a numeric vector containing the response. |
| `sMax` | an integer giving the number of predictors to be sequenced. If it is `NA` (the default), predictors are sequenced as long as there are twice as many observations as predictors. |
| `centerFun` | a function to compute a robust estimate for the center (defaults to `median`). |
| `scaleFun` | a function to compute a robust estimate for the scale (defaults to `mad`). |
| `winsorize` | a logical indicating whether to clean the full data set by multivariate winsorization, i.e., to perform data cleaning RLARS instead of plug-in RLARS (defaults to `FALSE`). |
| `const` | numeric; tuning constant to be used in the initial correlation estimates based on adjusted univariate winsorization (defaults to 2). |
| `prob` | numeric; probability for the quantile of the chi-squared distribution to be used in bivariate or multivariate winsorization (defaults to 0.95). |
| `fit` | a logical indicating whether to fit submodels along the sequence (`TRUE`, the default) or to simply return the sequence (`FALSE`). |
| `s` | an integer vector of length two giving the first and last step along the sequence for which to compute submodels. The default is to start with a model containing only an intercept (step 0) and iteratively add all variables along the sequence (step `sMax`). If the second element is `NA`, predictors are added to the model as long as there are twice as many observations as predictors. If only one value is supplied, it is recycled. |
| `regFun` | a function to compute robust linear regressions along the sequence (defaults to `lmrob`). |
| `regArgs` | a list of arguments to be passed to `regFun`. |
| `crit` | a character string specifying the optimality criterion to be used for selecting the final model. Possible values are `"BIC"` for the Bayes information criterion and `"PE"` for resampling-based prediction error estimation. |
| `splits` | an object giving data splits to be used for prediction error estimation (see `perry`). |
| `cost` | a cost function measuring prediction loss (see `perry` for some requirements). The default is to use the root trimmed mean squared prediction error (see `cost`). |
| `costArgs` | a list of additional arguments to be passed to the prediction loss function `cost`. |
| `selectBest`, `seFactor` | arguments specifying a criterion for selecting the best model (see `perrySelect`). The default is to use a one-standard-error rule. |
| `ncores` | a positive integer giving the number of processor cores to be used for parallel computing (the default is 1 for no parallelization). If this is set to `NA`, all available processor cores are used. For fitting models along the sequence and for prediction error estimation, parallel computing is implemented on the R level using package parallel. Otherwise, parallel computing for some of the more computationally intensive steps in the sequencing is implemented on the C++ level via OpenMP (http://openmp.org/). |
| `cl` | a parallel cluster for parallel computing as generated by `makeCluster`. This is preferred over `ncores` for tasks that are parallelized on the R level, in which case `ncores` is only used for tasks that are parallelized on the C++ level. |
| `seed` | optional initial seed for the random number generator (see `.Random.seed`). This is useful because many robust regression functions (including `lmrob`) involve randomness, and it is also useful for prediction error estimation. On parallel R worker processes, random number streams are used and the seed is set via `clusterSetRNGStream`. |
| `model` | a logical indicating whether the model data should be included in the returned object. |
| `tol` | a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since the points form almost a straight line, and the initial estimate is returned. |
| `...` | additional arguments to be passed down. For the default method, additional arguments to be passed down to `robStandardize`. |
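To make the roles of `centerFun`, `scaleFun` and `const` more concrete, the following base R sketch standardizes a variable with the median and MAD and then applies a plain univariate winsorization at `const`. The helper names `robust_standardize` and `winsorize_sketch` are hypothetical, and the winsorization shown is a simplified stand-in for the adjusted univariate winsorization mentioned above, not the package's internal code.

```r
# Hypothetical helpers for illustration; they are not part of robustHD.

# Robust standardization: subtract a robust center, divide by a robust scale.
robust_standardize <- function(v, centerFun = median, scaleFun = mad) {
  (v - centerFun(v)) / scaleFun(v)
}

# Simplified univariate winsorization (an assumption, not the adjusted
# variant): standardized values outside [-const, const] are shrunk to the
# nearest bound.
winsorize_sketch <- function(v, const = 2) {
  z <- robust_standardize(v)
  pmin(pmax(z, -const), const)
}

set.seed(1)
v <- c(rnorm(20), 50)   # 20 regular values plus one gross outlier
w <- winsorize_sketch(v, const = 2)

range(w)                # all winsorized values lie within [-2, 2]
w[21]                   # the outlier has been shrunk to the upper bound
```

Because the median and MAD are barely affected by the single outlier, the regular observations keep sensible standardized values while the outlier is pulled back to the bound.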

## Value

If `fit` is `FALSE`, an integer vector containing the indices of the sequenced predictors.

Else if `crit` is `"PE"`, an object of class `"perrySeqModel"` (inheriting from class `"perrySelect"`, see `perrySelect`). It contains information on the prediction error criterion, and includes the final model as component `finalModel`.
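The prediction error criterion uses the root trimmed mean squared prediction error by default. As a rough illustration of what such a cost function measures, the base R sketch below computes it by hand. The function name `rtmspe_sketch` and the trimming proportion of 0.25 are assumptions made for this example; the actual `rtmspe` function is provided by the perry package and may differ in details.

```r
# Hypothetical helper for illustration; perry provides its own rtmspe().
rtmspe_sketch <- function(y, yHat, trim = 0.25) {
  sq <- sort((y - yHat)^2)                 # squared prediction errors, ascending
  h <- length(sq) - floor(length(sq) * trim)  # number of smallest errors kept
  sqrt(mean(sq[seq_len(h)]))               # root of the trimmed mean
}

y    <- c(1, 2, 3, 4, 100)        # the last observation is an outlier
yHat <- c(1.5, 2.5, 2.5, 3.5, 4)  # predictions that fit the regular points

rtmspe_sketch(y, yHat)            # trimmed loss ignores the largest error
sqrt(mean((y - yHat)^2))          # untrimmed loss is dominated by the outlier
```

Trimming the largest squared errors keeps a few poorly predicted outliers from dominating the comparison between submodels.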

Otherwise an object of class `"rlars"` (inheriting from class `"seqModel"`) with the following components:

| Component | Description |
| --- | --- |
| `active` | an integer vector containing the indices of the sequenced predictors. |
| `s` | an integer vector containing the steps for which submodels along the sequence have been computed. |
| `coefficients` | a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence. |
| `fitted.values` | a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence. |
| `residuals` | a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence. |
| `df` | an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients). |
| `robust` | a logical indicating whether a robust fit was computed (`TRUE` for `"rlars"` models). |
| `scale` | a numeric vector giving the robust residual scale estimates for the submodels along the sequence. |
| `crit` | an object of class `"bicSelect"` containing the BIC values and indicating the final model (only returned if argument `crit` is `"BIC"` and argument `s` indicates more than one step along the sequence). |
| `muX` | a numeric vector containing the center estimates of the predictors. |
| `sigmaX` | a numeric vector containing the scale estimates of the predictors. |
| `muY` | numeric; the center estimate of the response. |
| `sigmaY` | numeric; the scale estimate of the response. |
| `x` | the matrix of candidate predictors (if `model` is `TRUE`). |
| `y` | the response (if `model` is `TRUE`). |
| `w` | a numeric vector giving the data cleaning weights (if `winsorize` is `TRUE`). |
| `call` | the matched function call. |
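As a small sketch of how these components might be inspected, the example below fits a model on simulated data and looks at a few of them. The simulated data are made up for illustration, and the comment on `coef` reflects an assumption about its default behavior (returning the coefficients of the selected submodel) that should be checked against its help page.

```r
library("robustHD")

set.seed(1234)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)                   # candidate predictors
y <- c(x[, 1:3] %*% rep(1, 3) + 0.5 * rnorm(n))   # first three are relevant
y[1:5] <- y[1:5] + 10                             # a few vertical outliers

fit <- rlars(x, y, sMax = 5)

fit$active             # indices of the sequenced predictors
fit$s                  # steps for which submodels were computed
dim(fit$coefficients)  # one column per submodel along the sequence
fit$df                 # number of estimated coefficients per submodel
coef(fit)              # assumed default: coefficients of the selected submodel
```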

## Author(s)

Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar

## References

Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299.

## See Also

`coef`, `fitted`, `plot`, `predict`, `residuals`, `lmrob`

## Examples

```r
## generate data
# example is not high-dimensional to keep computation time low
library("robustHD")
library("mvtnorm")
set.seed(1234)  # for reproducibility
n <- 100  # number of observations
p <- 25   # number of variables
beta <- rep.int(c(1, 0), c(5, p-5))  # coefficients
sigma <- 0.5      # controls signal-to-noise ratio
epsilon <- 0.1    # contamination level
# correlation matrix with entries 0.5^|i-j|
Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p))
x <- rmvnorm(n, sigma=Sigma)    # predictor matrix
e <- rnorm(n)                   # error terms
i <- 1:ceiling(epsilon*n)       # observations to be contaminated
e[i] <- e[i] + 5                # vertical outliers
y <- c(x %*% beta + sigma * e)  # response
x[i,] <- x[i,] + 5              # bad leverage points

## fit robust LARS model
rlars(x, y, sMax = 10)
```