
What I Want

There are several options for calculating expected values in R, but none are completely satisfactory. Here is a list of things I want in such a package:

- support for a wide range of model types
- a simple and concise user interface for calculating quantities of interest
- an easy way to calculate comparisons and other transformations of expected values
- straightforward construction of plots, including interaction plots
- results returned in a familiar structure (such as a data.frame) that is easy to manipulate

All of these things exist already, but not in any one place. For example, the Zelig package works with a large number of model types, but does not make it easy to construct interaction plots; the effects package has a simple and concise user interface for calculating quantities of interest, but supports a limited number of models and does not provide an easy way to calculate comparisons.

Package comparisons

The comparisons below are designed to provide an overview of currently available packages, to identify good ideas that should be copied, and possibly to find areas where new approaches would be beneficial.

Calculating expected values

The standard way to calculate expected values in R is to use the predict function along with a newdata argument.

data(Prestige, package = "car")
Prestige <- na.omit(Prestige)

m <- lm(prestige ~ income + type, data = Prestige)

xvals <- expand.grid(income = quantile(Prestige$income, c(.25, .50, .75)),
                     type = "bc")

data.frame(xvals, predict(m, newdata = xvals, se.fit = TRUE))

This is not a terrible interface and it has the advantage of working with most models. There are, however, a few annoying niggles:

- you have to construct the newdata data.frame yourself, supplying a value for every predictor in the model
- with se.fit = TRUE, predict returns a list rather than a data.frame, so the predictions have to be combined with the predictor values by hand
- standard errors are returned, but confidence intervals are not (for most model types they must be computed by hand, as sketched below)
- there is no built-in way to average over the observed values of the predictors you are not interested in
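
For instance, confidence intervals are not part of the output above. Here is a minimal sketch of constructing approximate 95% intervals by hand from the returned standard errors (predict.lm can also return intervals directly via its interval argument, but most predict methods do not offer this):

pred <- predict(m, newdata = xvals, se.fit = TRUE)  # a list with fit, se.fit, df, residual.scale
ev <- data.frame(xvals, fit = pred$fit, se = pred$se.fit)
ev$lower <- ev$fit - 1.96 * ev$se  # normal-approximation 95% interval
ev$upper <- ev$fit + 1.96 * ev$se
ev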

Although the predict function is useful, these combined issues are enough to make us look for a more convenient alternative.

Easier predict-like methods

The effects, emmeans, and prediction packages address many of the issues with calculating expected values using the predict function:

library(dplyr)
library(effects)
library(prediction)
library(emmeans)
as.data.frame(Effect(c("income"), m))

emmeans(m, ~ income)

prediction(m)

By default Effect gives expected values over a range of values of the focal predictor, while emmeans gives expected values at its mean. prediction does not allow for specifying a focal predictor without also specifying values for it. All three methods allow for specifying arbitrary predictor variable values:

as.data.frame(
    Effect(c("income"), m,
           xlevels = list(income = quantile(Prestige$income, c(.25, .5, .75)))))

emmeans(m, ~ income,
        at = list(income = quantile(Prestige$income, c(.25, .5, .75))))

summarise(group_by(prediction(m, at = list(income = quantile(Prestige$income, c(.25, .5, .75)))),
                   income),
          mean(fitted),
          mean(se.fitted))

The observant reader will have noticed that prediction and Effect returned the same expected values, while emmeans gave different answers. The difference is that Effect and prediction give us the expected values at various income levels averaging over the observed values of type, while emmeans gives the expected values at various income levels averaging over the expected values for each level of type. Each method can be made to produce the results of any other, but in all cases this requires the user to do part of the calculation.
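
For example, here is a sketch of reproducing the emmeans-style marginal means from prediction (using the packages already loaded above): compute expected values at each combination of the income quantiles and the levels of type, then average over the type levels with equal weight rather than over their observed distribution.

pr <- prediction(m, at = list(income = quantile(Prestige$income, c(.25, .5, .75)),
                              type = c("bc", "prof", "wc")))
cell.means <- summarise(group_by(pr, income, type), fitted = mean(fitted))
summarise(group_by(cell.means, income), fitted = mean(fitted))  # equal-weight average over type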

It is also worth noting that the prediction method is currently less user-friendly, in that it requires the user to average expected values, and it does not produce confidence intervals by default.
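
As a rough illustration of the extra steps, the averaging and an approximate interval can be done by hand; note that simply averaging the standard errors (as in the summary above) ignores the covariances among the individual predictions, so the resulting interval should be treated as a rough approximation only:

ev <- summarise(group_by(prediction(m, at = list(income = quantile(Prestige$income, c(.25, .5, .75)))),
                         income),
                fitted = mean(fitted),
                se = mean(se.fitted))
mutate(ev, lower = fitted - 1.96 * se, upper = fitted + 1.96 * se)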

Self-contained systems

Whereas the effects, emmeans and prediction packages provide methods for calculating expected values for a range of model types, some packages attempt to provide a more self-contained model-fitting ecosystem that includes functionality for computing expected values. Notable examples include the rms and Zelig packages.

The rms package takes a somewhat strange approach to specifying the values at which expected values are calculated. Defaults are set by generating a datadist object, and then setting an option to use this object as the default values when generating predictions. It looks like this:

library(rms)
dd <- datadist(Prestige)
options(datadist = "dd")

m.rms <- ols(prestige ~ income + type, data = Prestige)
Predict(m.rms, income = quantile(Prestige$income, c(.25, .5, .75)))

I dislike almost everything about this, but specifying predictor variable values as named arguments is quite nice.

The Zelig package similarly requires using its own wrapper function to fit the model. Then x-values can be set using the setx function. Finally, expected values can be calculated using the sim function.

library(Zelig)
m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE)

summary(sim(setx(m.z, income = quantile(Prestige$income, c(.25, .5, .75)))))

Like the rms package's Predict function, setx takes variable names as arguments.

The best parts

Each of the packages reviewed above has advantages and disadvantages relative to the others. If we were to combine the best parts of each approach into a single package it would have the following features:

- support for a wide range of model types
- a simple and concise interface, with predictor values specified as named arguments
- sensible handling of predictors whose values are not explicitly specified
- confidence intervals reported by default
- results returned in a simple data.frame that is easy to manipulate
- flexible tools for computing comparisons among expected values

Taking these considerations into account, the smargins function calculates expected values averaging over the observed values of non-specified predictors. Thus by default the expected values are the same as those obtained by Effect and prediction:

library(smargins)

summary(smargins(m,
                 income = quantile(income, c(.25, .5, .75))))

Producing the expected values at specific values of the predictors as Zelig and rms do is also easy, as one simply needs to specify values for all the predictors:

summary(smargins(m,
                 income = quantile(income, c(.25, .5, .75)),
                 type = "bc"))

Obtaining marginal means as computed by emmeans is a simple matter of computing all the expected values and then averaging as desired. For example the emmeans result above can be produced with smargins and summary as follows:

summary(smargins(m,
                 income = quantile(income, c(.25, .5, .75)),
                 type = unique(type)),
        vars = "income")

The nice thing about this approach is that it is transparent and flexible. Users can specify any combination of variables to average over. In fact, users can easily do it themselves if desired:

(all.ev <- summary(smargins(m,
                           income = quantile(income, c(.25, .5, .75)),
                           type = unique(type))))

(em.inc.ev <- summarise_all(group_by(select(all.ev, -type), income), funs(mean)))

The summary method is just a convenience that makes it easier to do this. The simplicity and transparency of this operation are a direct consequence of the decision to return expected values in a simple data.frame structure that is familiar to R users and easy for them to manipulate.

Comparisons and other transformations

Some of the packages reviewed above provide tools for calculating differences among expected values. For example, the emmeans package provides pre-defined contrasts and allows for user-written contrast functions as well:

m.em <- emmeans(m, ~ type, at = list(income = mean(Prestige$income)))

emmeans::contrast(m.em, method = "pairwise")

last_vs_others <- function(...) {
    # contrast coefficients correspond to the levels of type: "bc", "prof", "wc"
    data.frame("wc vs other" = c(-.5, -.5, 1))
}

emmeans::contrast(m.em, method = last_vs_others)

The Zelig package also provides limited support for this by allowing the user to compute differences between quantities of interest at two different sets of predictor variable values. That is:

library(Zelig)

m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE)

summary(sim(m.z,
            setx(m.z, type = "bc"),
            setx(m.z, type = "wc")))

summary(sim(m.z,
            setx(m.z, type = "bc"),
            setx(m.z, type = "prof")))

Unlike emmeans, there is no simple way to perform arbitrary comparisons (e.g., to compare the average of type = "bc" and type = "prof" to type = "wc").

All in all, current implementations don't do a great job of making it easy to compare expected values in a simple and flexible way. Figuring out a nice interface for this in the smargins package is an ongoing project, but some progress has already been made.

The scompare function gives pairwise differences:

m.sm <- smargins(m, type = unique(type), income = mean(income))
summary(scompare(m.sm, "type"))

For custom comparisons the structure produced by smargins is simple enough that the user can perform their own arbitrary transformations. For example, we can compare the average of blue collar and professional with white collar:

wc <- subset(m.sm, type == "wc")$.smargin_qi
bc_prof <- (subset(m.sm, type == "bc")$.smargin_qi +
            subset(m.sm, type == "prof")$.smargin_qi)/2
sumstats(wc - bc_prof)

While the simple structure allows users to compute arbitrary transformations without too much difficulty, it would be nice to have some convenience functions for this. The scompare function needs more work, starting with conceptual work to hammer out a description of the desired interface. The emmeans::contrast approach is one source of inspiration; other places to look include car::lht / multcomp::glht and https://github.com/goshevs/effcomp .
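
As one point of reference, here is a rough sketch (not part of smargins) of expressing the same custom comparison with multcomp::glht applied to the underlying lm fit; for an additive linear model like m, contrasts among the type coefficients correspond to differences in expected values at any fixed income:

library(multcomp)

summary(glht(m, linfct = mcp(type = "Tukey")))  # all pairwise differences among levels of type

K <- rbind("wc vs other" = c(-0.5, -0.5, 1))  # coefficients in the order of the levels: bc, prof, wc
summary(glht(m, linfct = mcp(type = K)))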

Plots

TODO

Interactions

TODO

Custom displays

TODO

