# redun: Redundancy Analysis In Hmisc: Harrell Miscellaneous

## Description

Uses flexible parametric additive models (see `areg` and its use of regression splines) to determine how well each variable can be predicted from the remaining variables. Variables are dropped in a stepwise fashion, removing the most predictable variable at each step. The variables that remain are then used as the predictors in the next step. The process continues until no variable still in the list of predictors can be predicted with an R^2 or adjusted R^2 of at least `r2`, or until dropping the variable with the highest R^2 (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at least at the `r2` level from the now smaller list of predictors.
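The stepwise procedure described above can be sketched in base R. This is only a simplified illustration, not the code `redun` runs: ordinary linear models (`lm`) stand in for the flexible additive models fitted through `areg`, only numeric variables are handled, and the helper names `r2_from` and `stepwise_redundancy` are hypothetical.

```r
# Simplified illustration only: plain linear models replace the flexible
# additive models that redun fits via areg. Helper names are hypothetical.

# Ordinary R^2 for predicting column `target` from the columns in `preds`
r2_from <- function(d, target, preds) {
  summary(lm(reformulate(preds, response = target), data = d))$r.squared
}

stepwise_redundancy <- function(d, r2 = 0.9) {
  kept    <- names(d)
  dropped <- character(0)
  while (length(kept) > 1) {
    # How well is each kept variable predicted from the other kept ones?
    rsq  <- sapply(kept, function(v) r2_from(d, v, setdiff(kept, v)))
    best <- names(which.max(rsq))
    if (rsq[[best]] < r2) break          # nothing left is redundant
    candidate <- setdiff(kept, best)
    # Variables dropped earlier must remain predictable from the smaller set
    still_ok <- all(vapply(dropped,
                           function(v) r2_from(d, v, candidate) >= r2,
                           logical(1)))
    if (!still_ok) break
    kept    <- candidate
    dropped <- c(dropped, best)
  }
  list(kept = kept, dropped = dropped)
}

# Simulated data in which x3 and x4 are nearly linear in x1 and x2
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
res <- stepwise_redundancy(data.frame(x1, x2, x3, x4), r2 = 0.8)
print(res)
```

Because `x3` and `x4` are close to exact linear functions of `x1` and `x2`, they should end up in `dropped`, leaving `x1` and `x2` in `kept`. The sketch omits categorical variables, spline expansions, adjusted R^2, and the `allcat` and `minfreq` logic.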

## Usage

```
redun(formula, data=NULL, subset=NULL, r2 = 0.9,
      type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
      allcat=FALSE, minfreq=0, iterms=FALSE, pc=FALSE, pr = FALSE, ...)

## S3 method for class 'redun'
print(x, digits=3, long=TRUE, ...)
```

## Arguments

| Argument | Description |
|---|---|
| `formula` | a formula. Enclose a variable in `I()` to force linearity. |
| `data` | a data frame |
| `subset` | usual subsetting expression |
| `r2` | ordinary or adjusted R^2 cutoff for redundancy |
| `type` | specify `"adjusted"` to use adjusted R^2 |
| `nk` | number of knots to use for continuous variables. Use `nk=0` to force linearity for all variables. |
| `tlinear` | set to `FALSE` to allow a variable to be automatically nonlinearly transformed (see `areg`) while being predicted. By default, only continuous variables on the right hand side (i.e., while they are being predictors) are automatically transformed, using regression splines. Estimating transformations for target (dependent) variables causes more overfitting than doing so for predictors. |
| `allcat` | set to `TRUE` to ensure that all categories of categorical variables having more than two categories are redundant (see details below) |
| `minfreq` | For a binary or categorical variable, there must be at least two categories with at least `minfreq` observations or the variable will be dropped and not checked for redundancy against other variables. `minfreq` also specifies the minimum frequency of a category or its complement before that category is considered when `allcat=TRUE`. |
| `iterms` | set to `TRUE` to consider derived terms (dummy variables and nonlinear spline components) as separate variables. This will perform a redundancy analysis on pieces of the variables. |
| `pc` | if `iterms=TRUE` you can set `pc` to `TRUE` to replace the submatrix of terms corresponding to each variable with the orthogonal principal components before doing the redundancy analysis. The components are based on the correlation matrix. |
| `pr` | set to `TRUE` to monitor progress of the stepwise algorithm |
| `...` | arguments to pass to `dataframeReduce` to remove "difficult" variables from `data` if `formula` is `~.` to use all variables in `data` (`data` must be specified when these arguments are used). Ignored for `print`. |
| `x` | an object created by `redun` |
| `digits` | number of digits to which to round R^2 values when printing |
| `long` | set to `FALSE` to prevent the `print` method from printing the R^2 history and the original R^2 with which each variable can be predicted from ALL other variables. |

## Details

A categorical variable is deemed redundant if a linear combination of dummy variables representing it can be predicted from a linear combination of other variables. For example, if there were 4 cities in the data and each city's rainfall was also present as a variable, with virtually the same rainfall reported for all observations for a city, city would be redundant given rainfall (or vice-versa; the one declared redundant would be the first one in the formula). If two cities had the same rainfall, `city` might be declared redundant even though tied cities might be deemed non-redundant in another setting. To ensure that all categories may be predicted well from other variables, use the `allcat` option. To ignore categories that are too infrequent or too frequent, set `minfreq` to a nonzero integer. When the number of observations in the category is below this number or the number of observations not in the category is below this number, no attempt is made to predict observations being in that category individually for the purpose of redundancy detection.
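The city and rainfall example can be mimicked with simulated data in base R. The sketch below is an assumption-laden illustration rather than what `redun` computes internally: the city names, rainfall means, and noise level are made up, `cancor` is used to find the best linear combination of the city dummy variables, and a straight-line fit in rainfall stands in for the spline-based prediction of a single category that `allcat=TRUE` would involve.

```r
# Hypothetical data: 4 cities, rainfall nearly constant within each city
set.seed(2)
n    <- 200
city <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
city_mean <- c(A = 20, B = 35, C = 50, D = 65)      # made-up values
rain <- unname(city_mean[as.character(city)]) + rnorm(n, sd = 0.5)

# Dummy variables representing city (reference level dropped)
dummies <- model.matrix(~ city)[, -1]

# Squared first canonical correlation: the best R^2 achievable between
# a linear combination of the city dummies and rainfall
r2_city_rain <- cancor(dummies, matrix(rain))$cor[1]^2

# R^2 for predicting membership in one category (city A) from rainfall,
# using a straight-line fit only
is_A <- as.numeric(city == "A")
r2_A <- summary(lm(is_A ~ rain))$r.squared

print(c(whole_variable = r2_city_rain, category_A = r2_A))
```

The whole-variable R^2 should be close to 1, while the single-category R^2 under this crude linear fit is expected to be noticeably lower. That gap is the reason a variable can look redundant as a whole yet fail the stricter per-category check that `allcat=TRUE` requests.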

## Value

an object of class `"redun"`

## Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

## See Also

`areg`, `dataframeReduce`, `transcan`, `varclus`, `subselect::genetic`

## Examples

```
library(Hmisc)

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
# x5 is no longer redundant but x6 is
```

### Example output

```
Loading required package: lattice

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

format.pval, units

Redundancy Analysis

redun(formula = ~x1 + x2 + x3 + x4 + x5 + x6, r2 = 0.8)

n: 100 	p: 6 	nk: 3

Number of NAs:	 0

Transformation of target variables forced to be linear

R-squared cutoff: 0.8 	Type: ordinary

R^2 with which each variable can be predicted from all other variables:

   x1    x2    x3    x4    x5    x6
0.994 0.995 0.998 0.999 1.000 1.000

Rendundant variables:

x5 x4 x3

Predicted from variables:

x1 x2 x6

  Variable Deleted   R^2 R^2 after later deletions
1               x5 1.000                       1 1
2               x4 0.999                     0.997
3               x3 0.995

Redundancy Analysis

redun(formula = ~x1 + x2 + x3 + x4 + x5 + x6, r2 = 0.8, minfreq = 40)

n: 100 	p: 4 	nk: 3

Number of NAs:	 0

Transformation of target variables forced to be linear

Minimum category frequency required for retention of a binary or
categorical variable: 40

Binary or categorical variables removed because of inadequate frequencies:

x5 x6

R-squared cutoff: 0.8 	Type: ordinary

R^2 with which each variable can be predicted from all other variables:

   x1    x2    x3    x4
0.994 0.994 0.998 0.999

Rendundant variables:

x4 x3

Predicted from variables:

x1 x2

  Variable Deleted   R^2 R^2 after later deletions
1               x4 0.999                     0.997
2               x3 0.995

Redundancy Analysis

redun(formula = ~x1 + x2 + x3 + x4 + x5 + x6, r2 = 0.8, allcat = TRUE)

n: 100 	p: 6 	nk: 3

Number of NAs:	 0

Transformation of target variables forced to be linear

All levels of a categorical variable had to be redundant before the
variable was declared redundant

R-squared cutoff: 0.8 	Type: ordinary

R^2 with which each variable can be predicted from all other variables:

   x1    x2    x3    x4    x5    x6
0.994 0.995 0.998 0.999 0.313 1.000

(For categorical variables the minimum R^2 for any sufficiently
frequent dummy variable is displayed)

Rendundant variables:

x6 x4 x3

Predicted from variables:

x1 x2 x5

  Variable Deleted   R^2 R^2 after later deletions
1               x6 1.000                       1 1
2               x4 0.999                     0.997
3               x3 0.995
```

Hmisc documentation built on Oct. 7, 2021, 9:16 a.m.