areg: Additive Regression with Optimal Transformations on Both... In harrelfe/Hmisc: Harrell Miscellaneous

Description

Expands continuous variables into restricted cubic spline bases and categorical variables into dummy variables and fits a multivariate equation using canonical variates. This finds optimum transformations that maximize R^2. Optionally, the bootstrap is used to estimate the covariance matrix of both left- and right-hand-side transformation parameters, and to estimate the bias in the R^2 due to overfitting and compute the bootstrap optimism-corrected R^2. Cross-validation can also be used to get an unbiased estimate of R^2 but this is not as precise as the bootstrap estimate. The bootstrap and cross-validation may also used to get estimates of mean and median absolute error in predicted values on the original `y` scale. These two estimates are perhaps the best ones for gauging the accuracy of a flexible model, because it is difficult to compare R^2 under different y-transformations, and because R^2 allows for an out-of-sample recalibration (i.e., it only measures relative errors).

Note that uncertainty about the proper transformation of `y` causes an enormous amount of model uncertainty. When the transformation for `y` is estimated from the data a high variance in predicted values on the original `y` scale may result, especially if the true transformation is linear. Comparing bootstrap or cross-validated mean absolute errors with and without restricted the `y` transform to be linear (`ytype='l'`) may help the analyst choose the proper model complexity.

Usage

 ``` 1 2 3 4 5 6 7 8 9 10 11 12``` ```areg(x, y, xtype = NULL, ytype = NULL, nk = 4, B = 0, na.rm = TRUE, tolerance = NULL, crossval = NULL) ## S3 method for class 'areg' print(x, digits=4, ...) ## S3 method for class 'areg' plot(x, whichx = 1:ncol(x\$x), ...) ## S3 method for class 'areg' predict(object, x, type=c('lp','fitted','x'), what=c('all','sample'), ...) ```

Arguments

 `x` A single predictor or a matrix of predictors. Categorical predictors are required to be coded as integers (as `factor` does internally). For `predict`, `x` is a data matrix with the same integer codes that were originally used for categorical variables. `y` a `factor`, categorical, character, or numeric response variable `xtype` a vector of one-letter character codes specifying how each predictor is to be modeled, in order of columns of `x`. The codes are `"s"` for smooth function (using restricted cubic splines), `"l"` for no transformation (linear), or `"c"` for categorical (to cause expansion into dummy variables). Default is `"s"` if `nk > 0` and `"l"` if `nk=0`. `ytype` same coding as for `xtype`. Default is `"s"` for a numeric variable with more than two unique values, `"l"` for a binary numeric variable, and `"c"` for a factor, categorical, or character variable. `nk` number of knots, 0 for linear, or 3 or more. Default is 4 which will fit 3 parameters to continuous variables (one linear term and two nonlinear terms) `B` number of bootstrap resamples used to estimate covariance matrices of transformation parameters. Default is no bootstrapping. `na.rm` set to `FALSE` if you are sure that observations with `NA`s have already been removed `tolerance` singularity tolerance. List source code for `lm.fit.qr.bare` for details. `crossval` set to a positive integer k to compute k-fold cross-validated R-squared (square of first canonical correlation) and mean and median absolute error of predictions on the original scale `digits` number of digits to use in formatting for printing `object` an object created by `areg` `whichx` integer or character vector specifying which predictors are to have their transformations plotted (default is all). The `y` transformation is always plotted. `type` tells `predict` whether to obtain predicted untransformed `y` (`type='lp'`, the default) or predicted `y` on the original scale (`type='fitted'`), or the design matrix for the right-hand side (`type='x'`). `what` When the `y`-transform is non-monotonic you may specify `what='sample'` to `predict` to obtain a random sample of `y` values on the original scale instead of a matrix of all `y`-inverses. See `inverseFunction`. `...` arguments passed to the plot function.

Details

`areg` is a competitor of `ace` in the `acepack` package. Transformations from `ace` are seldom smooth enough and are often overfitted. With `areg` the complexity can be controlled with the `nk` parameter, and predicted values are easy to obtain because parametric functions are fitted.

If one side of the equation has a categorical variable with more than two categories and the other side has a continuous variable not assumed to act linearly, larger sample sizes are needed to reliably estimate transformations, as it is difficult to optimally score categorical variables to maximize R^2 against a simultaneously optimally transformed continuous variable.

Value

a list of class `"areg"` containing many objects

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

References

Breiman and Friedman, Journal of the American Statistical Association (September, 1985).

`cancor`,`ace`, `transcan`
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100``` ```set.seed(1) ns <- c(30,300,3000) for(n in ns) { y <- sample(1:5, n, TRUE) x <- abs(y-3) + runif(n) par(mfrow=c(3,4)) for(k in c(0,3:5)) { z <- areg(x, y, ytype='c', nk=k) plot(x, z\$tx) title(paste('R2=',format(z\$rsquared))) tapply(z\$ty, y, range) a <- tapply(x,y,mean) b <- tapply(z\$ty,y,mean) plot(a,b) abline(lsfit(a,b)) # Should get same result to within linear transformation if reverse x and y w <- areg(y, x, xtype='c', nk=k) plot(z\$ty, w\$tx) title(paste('R2=',format(w\$rsquared))) abline(lsfit(z\$ty, w\$tx)) } } par(mfrow=c(2,2)) # Example where one category in y differs from others but only in variance of x n <- 50 y <- sample(1:5,n,TRUE) x <- rnorm(n) x[y==1] <- rnorm(sum(y==1), 0, 5) z <- areg(x,y,xtype='l',ytype='c') z plot(z) z <- areg(x,y,ytype='c') z plot(z) ## Not run: # Examine overfitting when true transformations are linear par(mfrow=c(4,3)) for(n in c(200,2000)) { x <- rnorm(n); y <- rnorm(n) + x for(nk in c(0,3,5)) { z <- areg(x, y, nk=nk, crossval=10, B=100) print(z) plot(z) title(paste('n=',n)) } } par(mfrow=c(1,1)) # Underfitting when true transformation is quadratic but overfitting # when y is allowed to be transformed set.seed(49) n <- 200 x <- rnorm(n); y <- rnorm(n) + .5*x^2 #areg(x, y, nk=0, crossval=10, B=100) #areg(x, y, nk=4, ytype='l', crossval=10, B=100) z <- areg(x, y, nk=4) #, crossval=10, B=100) z # Plot x vs. predicted value on original scale. Since y-transform is # not monotonic, there are multiple y-inverses xx <- seq(-3.5,3.5,length=1000) yhat <- predict(z, xx, type='fitted') plot(x, y, xlim=c(-3.5,3.5)) for(j in 1:ncol(yhat)) lines(xx, yhat[,j], col=j) # Plot a random sample of possible y inverses yhats <- predict(z, xx, type='fitted', what='sample') points(xx, yhats, pch=2) ## End(Not run) # True transformation of x1 is quadratic, y is linear n <- 200 x1 <- rnorm(n); x2 <- rnorm(n); y <- rnorm(n) + x1^2 z <- areg(cbind(x1,x2),y,xtype=c('s','l'),nk=3) par(mfrow=c(2,2)) plot(z) # y transformation is inverse quadratic but areg gets the same answer by # making x1 quadratic n <- 5000 x1 <- rnorm(n); x2 <- rnorm(n); y <- (x1 + rnorm(n))^2 z <- areg(cbind(x1,x2),y,nk=5) par(mfrow=c(2,2)) plot(z) # Overfit 20 predictors when no true relationships exist n <- 1000 x <- matrix(runif(n*20),n,20) y <- rnorm(n) z <- areg(x, y, nk=5) # add crossval=4 to expose the problem # Test predict function n <- 50 x <- rnorm(n) y <- rnorm(n) + x g <- sample(1:3, n, TRUE) z <- areg(cbind(x,g),y,xtype=c('s','c')) range(predict(z, cbind(x,g)) - z\$linear.predictors) ```