standardize: Standardize a formula and data frame for regression.
In standardize: Tools for Standardizing Variables for Regression in R

Create a standardized object which places all variables in data on the same scale based on formula, making regression output easier to interpret. For mixed effects regressions, this also offers computational benefits, and for Bayesian regressions, it also makes determining reasonable priors easier.

standardize(formula, data, family = gaussian, scale = 1, offset, ...)

`formula`	A regression `formula`.
`data`	A data.frame containing the variables in `formula`.
`family`	A regression `family` (default gaussian).
`scale`	The desired scale for the regression frame. Must be a single positive number. See 'Details'.
`offset`	An optional `offset` vector. Offsets can also be included in the `formula` (e.g. `y ~ x + offset(o)`), but if this is done, then the column `o` (in this example) must be in any data frame passed as the `newdata` argument to `predict`.
`...`	Currently unused. If `na.action` is specified in `...` and is anything other than `na.pass`, a warning is issued and the argument is ignored.

First model.frame is called. Then, if family = gaussian, the response is checked to ensure that it is numeric and has more than two unique values. If scale_by is used on the response in formula, then the scale argument to scale_by is ignored and forced to 1. If scale_by is not called, then scale is used with default arguments. The result is that gaussian responses are on unit scale (i.e. have mean 0 and standard deviation 1), or, if scale_by is used on the left hand side of formula, unit scale within each level of the specified conditioning factor. Offsets in gaussian models are divided by the standard deviation of the response prior to scaling (within-factor-level if scale_by is used on the response). In this way, if the transformed offset is added to the transformed response, and the sum is then placed back on the response's original scale, the result is the same as if the un-transformed offset had been added to the un-transformed response. For all other values of family, the response and offsets are not checked. If offsets are used within the formula, then they will be in the formula and data elements of the standardized object. If the offset argument to the standardize function is used, then the offset provided in the argument will be in the offset element of the standardized object (scaled if family = gaussian).
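The offset arithmetic described above can be checked with a minimal base-R sketch. It does not call standardize; the vectors `y` and `o` and the helper names are hypothetical, and the response is scaled without a conditioning factor.

```r
# Hypothetical data: a response y and an offset o on y's original scale
set.seed(1)
y <- rnorm(100, mean = -2, sd = 5)
o <- runif(100)

y_mean <- mean(y)
y_sd <- sd(y)

# Put the response on unit scale and divide the offset by the response's SD
y_std <- (y - y_mean) / y_sd
o_std <- o / y_sd

# Add on the standardized scale, then map the sum back to the original scale
back <- (y_std + o_std) * y_sd + y_mean

# TRUE if the round trip matches adding the raw offset to the raw response
offset_round_trip <- isTRUE(all.equal(back, y + o))
offset_round_trip
```

Because the offset is only divided by the standard deviation and not centered, the back-transformation recovers `y + o` exactly.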

For the other predictors in the formula, first any random effects grouping factors in the formula are coerced to factor and unused levels are dropped. The levels of the resulting factor are then recorded in the groups element. Then for the remaining predictors, regardless of their original class, if they have only two unique non-NA values, they are coerced to unordered factors. Then, named_contr_sum and scaled_contr_poly are called for unordered and ordered factors, respectively, using the scale argument provided in the call to standardize as the scale argument to the contrast functions. For numeric variables, if the variable contains a call to scale_by, then, regardless of whether the call to scale_by specifies scale, the value of scale in the call to standardize is used. If the numeric variable does not contain a call to scale_by, then scale is called, ensuring that the result has standard deviation scale.
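As a rough illustration of two of these rules, the base-R sketch below scales a numeric predictor to a chosen standard deviation and coerces a two-valued predictor to an unordered factor. It only mimics the described behavior and is not the package's implementation; `x`, `b`, and `target_scale` are hypothetical names.

```r
# Hypothetical numeric predictor and desired scale
set.seed(2)
x <- rpois(200, 5)
target_scale <- 0.5

# base::scale() centers and divides by the SD; multiplying by target_scale
# gives a variable with mean 0 and standard deviation target_scale
x_std <- as.numeric(scale(x)) * target_scale
sd(x_std)

# A predictor with only two unique non-NA values is treated as an
# unordered factor, whatever its original class
b <- c(0, 1, 1, 0, NA, 1)
n_unique <- length(unique(b[!is.na(b)]))
b_fac <- if (n_unique == 2) factor(b, ordered = FALSE) else b
levels(b_fac)
```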

With the default value of scale = 1, the result is a standardized object which contains a formula and data frame (and offset vector if the offset argument to the standardize function was used) which can be used to fit regressions where the predictors are all on a similar scale. Its data frame has numeric variables on unit scale, unordered factors with named sum contrasts, and ordered factors with orthogonal polynomial contrasts on unit scale. For gaussian regressions, the response is also placed on unit scale. If scale = 0.5 (for example), then gaussian responses would still be placed on unit scale, but unordered factors' named sum contrasts would take on values -0.5, 0, 0.5 rather than -1, 0, 1, the standard deviation of each column in the contrast matrices for ordered factors would be 0.5 rather than 1, and the standard deviation of numeric variables would be 0.5 rather than 1 (within-factor-level in the case of scale_by calls).
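The contrast values for scale = 0.5 can be approximated with base R's contr.sum and contr.poly, as in the sketch below. This assumes a three-level factor and is only an analogue of what named_contr_sum and scaled_contr_poly produce; the package functions may differ in column naming and other details.

```r
# Sum contrasts for a three-level unordered factor, multiplied by the scale;
# every entry is one of -0.5, 0, 0.5
sum_contr <- contr.sum(3) * 0.5
rownames(sum_contr) <- c("a", "b", "c")
sum_contr

# Orthogonal polynomial contrasts for a three-level ordered factor, with each
# column rescaled so that its standard deviation is 0.5
poly_contr <- contr.poly(3)
poly_contr <- sweep(poly_contr, 2, apply(poly_contr, 2, sd), "/") * 0.5
apply(poly_contr, 2, sd)
```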

A standardized object. The formula, data, and offset elements of the object can be used in calls to regression functions.

The scale_by function is supported so long as it is not nested within other function calls. The poly function is supported so long as it is either not nested within other function calls, or is nested as the transformation of the numeric variable in a scale_by call. If poly is used, then the lsmeans function will yield misleading results (as would normally be the case).

In previous versions of standardize (v0.2.0 and earlier), na.action could be specified. Starting with v0.2.1, specifying something other than na.pass is ignored with a warning. Use of na.omit and na.exclude should be done when calling regression fitting functions using the elements returned in the standardized object.
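A minimal sketch of handling missing values at fit time follows. It uses lm from base R and a hypothetical data frame `d` standing in for the data element of a standardized object; the same na.action argument is accepted by many other fitting functions.

```r
# Hypothetical stand-in for the data element of a standardized object
d <- data.frame(y = c(0.3, -1.2, NA, 0.8, 0.1, -0.4),
                x = c(-0.9, 0.2, 0.5, NA, 1.1, -0.6))

# Missing values are dealt with when the model is fit
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# na.omit drops incomplete rows from the residuals;
# na.exclude keeps their positions and fills them with NA
length(residuals(fit_omit))
length(residuals(fit_excl))
```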

Christopher D. Eager <eager.stats@gmail.com>

For scaling and contrasts, see scale, scale_by, named_contr_sum, and scaled_contr_poly. For putting new data into the same space as the standardized data, see predict. For the elements in the returned object, see standardized.

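library(standardize)

# build a balanced design: 3 unordered levels x 3 ordered levels, 60 replicates
dat <- expand.grid(ufac = letters[1:3], ofac = 1:3)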
dat <- as.data.frame(lapply(dat, function(n) rep(n, 60)))
dat$ofac <- factor(dat$ofac, ordered = TRUE)
dat$x <- rpois(nrow(dat), 5)
dat$z <- rnorm(nrow(dat), rep(rnorm(30), each = 18), rep(runif(30), each = 18))
dat$subj <- rep(1:30, each = 18)
dat$y <- rnorm(nrow(dat), -2, 5)

sobj <- standardize(y ~ log(x + 1) + scale_by(z ~ subj) + ufac + ofac +
  (1 | subj), dat)

sobj
sobj$formula
head(dat)
head(sobj$data)
sobj$contrasts
sobj$groups
mean(sobj$data$y)
sd(sobj$data$y)
mean(sobj$data$log_x.p.1)
sd(sobj$data$log_x.p.1)
with(sobj$data, tapply(z_scaled_by_subj, subj, mean))
with(sobj$data, tapply(z_scaled_by_subj, subj, sd))

sobj <- standardize(y ~ log(x + 1) + scale_by(z ~ subj) + ufac + ofac +
  (1 | subj), dat, scale = 0.5)

sobj
sobj$formula
head(dat)
head(sobj$data)
sobj$contrasts
sobj$groups
mean(sobj$data$y)
sd(sobj$data$y)
mean(sobj$data$log_x.p.1)
sd(sobj$data$log_x.p.1)
with(sobj$data, tapply(z_scaled_by_subj, subj, mean))
with(sobj$data, tapply(z_scaled_by_subj, subj, sd))

## Not run: 
library(lme4)  # provides lmer
mod <- lmer(sobj$formula, sobj$data)
# this next line causes warnings about contrasts being dropped, but
# these warnings can be ignored (i.e. the statement still evaluates to TRUE)
all.equal(predict(mod, newdata = predict(sobj, dat)), fitted(mod))

## End(Not run)