Regression in rigr

In the rigr package, we have set out to make regression and analysis easier by

  1. allowing you to specify different types of regression from one function;
  2. automatically computing confidence intervals and p-values using robust standard errors;
  3. displaying output in a more intuitive fashion than base R; and
  4. allowing you to specify multiple-partial F-tests.

This capability is implemented in the function regress(). The basic arguments to this function are

fnctl: the functional

We use the concept of a functional to handle our first goal: allowing you to specify different types of regression models using a single function. A functional takes a function as its argument and returns a number. The most common example of a functional in regression is the mean. The allowed functionals to regress() are

|Functional | Type of Regression | base R command | |---|:---|:---| |"mean" | Linear Regression | lm() | |"geometric mean" | Linear Regression on logarithmically transformed Y | lm(), with Y log-transformed | |"odds" | Logistic Regression | glm(family = binomial) | |"rate" | Poisson Regression | glm(family = poisson) | | "hazard" | Proportional Hazards Regression | coxph() |

formula and data

The formula to regress() is the same as a formula given to lm() or glm(), but with additional optional functionality to enable sophisticated analyses (multiple partial F-tests) with fewer headaches.

The data argument is exactly the same as that in lm() or any of the other regression commands.

Linear Regression

As a first example, we run a linear regression of atrophy (a measure of global brain activity) on age, sex and race, from the mri data. This dataset is included in the rigr package; see its documentation by running ?mri.

## Preparing our R session
library(rigr)
data(mri)
regress("mean", atrophy ~ age + sex + race, data = mri)

Notice that by default robust standard error estimates are returned in addition to the naive estimates. The robust estimates are also used to perform inference. Thus, the confidence intervals, statistics, and p-values use these estimates of the standard error.

F-statistics are also displayed by default, including the multiple partial F-tests for the levels of a multi-level category (such as race) as well as the overall F-test for the variable.

Generalized Linear Regression

Logistic Regression

We can also run generalized linear regression using regress(). For example, to model the odds of having diabetes for males compared to females, we could run a logistic regression as follows:

regress("odds", diabetes ~ sex, data = mri)

In all of the generalized linear regression output we see two tables. The Raw Model table displays estimated coefficients (and their standard errors) on the log-odds scale. The Transformed Model table exponentiates the estimated coefficients and their confidence intervals so that the estimated parameters can be interpreted on the odds scale.

Note that the only possible link function in regress with fnctl = odds" is the logit link. Similarly, the only possible link function in regress with fnctl = "rate" is the log link.

Poisson Regression

The next functional that regress supports is "rate", for use in Poisson regression. To regress yrsquit on age, we would run:

regress("rate", yrsquit ~ age, data = mri)

Note that again we have two tables of output, denoted by Raw Model and Transformed Model, with Transformed Model displaying exponentiated estimated coefficients.

Proportional Hazards Regression

The final functional that regress supports is "hazard", for use in proportional hazards regression. To regress age on the death status (note that we need to create a Surv object first), we would run:

library(survival)
regress("hazard", Surv(obstime, death)~age, data=mri)

Similar to the Poisson regression case, we have two tables of output, denoted by Raw Model and Transformed Model, with Transformed Model displaying exponentiated estimated coefficients (i.e., on the hazard scale).

Regression on the Geometric Mean

Most often in linear regression we are interested in modeling the mean of the response variable. However, we are sometimes interested in modeling the mean of the log-transformed response variable, which allows us to make statements about the geometric mean of the response. In regress(), we can use the "geometric mean" functional to fit this model. Regression on the geometric mean of the packyrs variable in the mri dataset can be performed as follows:

regress("geometric mean", packyrs ~ age, data = mri)

It should be noted that many of the packyrs observations are zero, but the geometric mean of data including an observation of zero is zero... regardless of how many non-zeros were also observed. Therefore, by default, zeroes in the outcome variable are replaced by a value equal to one-half the lowest nonzero value in the outcome variable. This is based on the idea that the lowest observed value could be a proxy for the lower limit of detection. If you wish to specify a different value with which to replace zeroes, you may do say using the replaceZeroes argument.

regress("geometric mean", packyrs ~ age, data = mri, replaceZeroes = 1)

In the output from regress using the geometric mean functional, we see a table for the Raw Model and the Transformed Model. The e(Est), e(95%L), and e(95%H) columns in the Transformed Model table correspond to exponentiated values from the Raw Model - you'll notice that $e^{5.119} \approx 167.2.$

Re-parameterizations of a Variable

There are two special functions in rigr which allow us to re-parameterize variables:

Both of these functions may be used in a regress() call, and will additionally give a multiple partial F-test of the entire variable automatically.

Specifying the reference group with dummy

The dummy function is useful for specifying the reference group that you wish to use with categorical variables. Below we show an example of using the reference group "Female" vs. the reference group "Male" in a regression on sex.

regress("mean", atrophy ~ dummy(sex, reference = "Male"), data = mri)
regress("mean", atrophy ~ dummy(sex, reference = "Female"), data = mri)

Notice that below the coefficients table in the output, the reference category is reported.

Polynomial regression

You can fit higher-order polynomials using polynomial:

regress("mean", atrophy ~ polynomial(age, degree = 2), data = mri)

Note that all polynomials less than or equal to the degree specified are included in the model, and that the variables in the polynomial specification are mean-centered by default. You can change the centering using the center parameter in the polynomial function, an example of which is as follows.

regress("mean", atrophy ~ polynomial(age, degree = 2, center = 65), data = mri)

User-specified multiple partial F-tests

You can also perform multiple partial F-tests using formulas and the U function. This is useful when want to test a subset of variables all at once in your regression. For example, to test whether both the variables packyrs and yrsquit are associated with atrophy in a model with age as a predictor, we can run

regress("mean", atrophy ~ age + sex + U("Smoking variables" = ~packyrs + yrsquit), data = mri)

"Smoking variables" is what we name the group of variables packyrs and yrsquit. The overall F statistic and p-value associated with the inclusion of these two smoking variables variables in the model are 4.37 and 0.0130, respectively.

Testing contrasts: hypotheses about linear combinations of regression coefficients

You may be interested in testing a null hypothesis about a linear combination of coefficients in our regression model. For example, to investigate "Is the mean atrophy for a female subject equal to the mean atrophy for a male subject who is 10 years younger?", our hypothesis test involves both the coefficients on age and sex. We can test this hypothesis using the lincom function. First, we need to fit a linear model of age and sex on atrophy:

mod_rigr <- regress("mean", atrophy ~ age + sex, data = mri)
mod_rigr

We then create a vector giving the linear combination of the coefficient that we hypothesized to be zero, and perform the test using lincom. The elements in mod_combo correspond to Intercept, age, and sexMale, because this was their order in the coefficient table shown above.

mod_combo <- c(0, -10, 1)
lincom(mod_rigr, mod_combo)

Note that the standard errors returned by default are robust, as are the associated confidence intervals and p-values.

We could also test the null hypothesis that the mean difference in atrophy between these two groups (females, and males 10 years younger) is equal to -1 as follows:

lincom(mod_rigr, mod_combo, null.hypoth = -1)


Try the rigr package in your browser

Any scripts or data that you put into this service are public.

rigr documentation built on Sept. 7, 2022, 1:05 a.m.