knitr::opts_knit$set( stop_on_error = 2L )
There are several options for calculating expected values in R, but none are completely satisfactory. Here is a list of things I want in such a package:
All of these things exist already, but not in any one place. For example, the Zelig package works with a large number of model types, but does not make it easy to construct interaction plots; the effects package has a simple and concise user interface for calculating quantities of interest, but supports a limited number of models and does not provide an easy way to calculate comparisons.
The comparisons below are designed to provide an overview of currently available packages, to identify good ideas that should be copied, and possibly to find areas where new approaches would be beneficial.
The standard way to calculate expected values in R is to use the predict
function along with a newdata
argument.
data(Prestige, package = "car") Prestige <- na.omit(Prestige) m <- lm(prestige ~ income + type, data = Prestige) xvals <- expand.grid(income = quantile(Prestige$income, c(.25, .50, .75)), women = mean(Prestige$women), type = "bc") data.frame(xvals, predict(m, newdata = xvals, se.fit = TRUE))
This is not a terrible interface and it has the advantage of working with most models. There are however a few annoying niggles:
Although the predict
function is useful these combined issues are enough to make us look for a more convenient alternative.
The effects
, emmeans
, and prediction
packages address many of the issues with calculating expected values using the predict
function:
library(dplyr) library(effects) library(prediction) library(emmeans)
as.data.frame(Effect(c("income"), m)) emmeans(m, ~ income) prediction(m)
By default Effect
gives expected values over a range of predictors, while emmeans
gives expected values at the mean by default. prediction
does not allow for specifying a focal predictor without specifying values for it. All three methods allow for specifying arbitrary predictor variable values:
as.data.frame( Effect(c("income"), m, xlevels = list(income = quantile(Prestige$income, c(.25, .5, .75))))) emmeans(m, ~ income, at = list(income = quantile(Prestige$income, c(.25, .5, .75)))) summarise(group_by(prediction(m, at = list(income = quantile(Prestige$income, c(.25, .5, .75)))), income), mean(fitted), mean(se.fitted))
The observant reader will have noticed that prediction
, and Effect
returned the same expected value, while emmeans
all gave different answers. The difference is thatEffect
and prediction
gives us the expected values at various income levels averaging over the observed values of type, while emmeans
gives the expected values at various income levels averaging over the expected values for each level of type. Each method can be made to produce the results from any other, but in all cases this requires the user to do part of the calculation.
It is also worth noting that the prediction
method is currently less user-friendly, in that it requires the user to average expected values, and it does not produce confidence intervals by default.
Whereas the effects
, emmeans
and prediction
packages provide methods for calculating expected values for a range of model types, some packages attempt to provide a more self-contained model-fitting ecosystem that includes functionality for computing expected values. Notable examples include the rms
and Zelig
packages.
The rms
package takes a somewhat strange approach to specifying the values at which expected values are calculated. Defaults are set by generating a datadist
object, and then setting an option to use this object as the default values when generating predictions. It looks like this:
library(rms)
dd <- datadist(Prestige) options(datadist = "dd") m.rms <- ols(prestige ~ income + type, data = Prestige) Predict(m.rms, income = quantile(Prestige$income, c(.25, .5, .75)))
I dislike almost everything about this, but specifying predictor variable values as named arguments is quite nice.
The Zelig
package similarly requires using it's own wrapper function to fit the model. Then x-values can be set using the setx
function. Finally, expected values can be calculated using the sim
function.
library(Zelig)
m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE) summary(sim(setx(m.z, income = quantile(Prestige$income, c(.25, .5, .75)))))
Like the rms
package the setx
function uses variable names as arguments.
Each of the packages reviewed above has advantages and disadvantages relative to the others. If we were to combine the best parts of each approach into a single package it would have the following features:
data.frame
return value, like prediction
summary
method like emmeans
Effect
and prediction
) marginal (as produced by emmeans
) expected valuesrmd::Predict
and Zelig::setx
)Taking these considerations into account, thesmargins
function calculates expected values averaging over the observed values of non-specified predictors. Thus by default the expected values are the same as those obtained by Effect
and prediction
:
library(smargins) summary(smargins(m, income = quantile(income, c(.25, .5, .75))))
Producing the expected values at specific values of the predictors as Zelig
and rms
do is also easy, as one simply needs to specify values for all the predictors:
summary(smargins(m, income = quantile(income, c(.25, .5, .75)), type = "bc"))
Obtaining marginal means as computed by emmeans
is a simple matter of computing all the expected values and then averaging as desired. For example the emmeans
result above can be produced with smargins
and summary
as follows:
summary(smargins(m, income = quantile(income, c(.25, .5, .75))), vars = "income")
The nice thing about this approach is that it is transparent and flexible. Users can specify any combination of variables to average over. In fact, users can easily do it themselves if desired:
(all.ev <- summary(smargins(m, income = quantile(income, c(.25, .5, .75)), type = unique(type)))) (em.inc.ev <- summarise_all(group_by(select(all.ev, -type), income), funs(mean)))
The summary
method is just a convenience that makes it easier to do this. The simplicity and transparency of this operation is a direct consequence of the decision to return expected values in a simple data.frame
structure that is familiar to R users and is easy for them to manipulate.
Some of the packages reviewed above provide some tools for calculating differences among expected values. For example the emmeans
package provides some pre-defined contrasts and allows for user written contrast functions as well:
m.em <- emmeans(m, ~ type, at = list(income = mean(Prestige$income))) emmeans::contrast(m.em, method = "pairwise") last_vs_others <- function(...) { data.frame("wc vs other" = c(-.5, -.5, 1)) } emmeans::contrast(m.em, method = last_vs_others)
The Zelig
package also provides some limited support for this by allowing for computing differences between quantities of interest at two different sets of predictor variable values. That is:
library(Zelig) m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE) summary(sim(m.z, setx(m.z, type = "bc"), setx(m.z, type = "wc"))) summary(sim(m.z, setx(m.z, type = "bc"), setx(m.z, type = "prof")))
Unlike emmeans
there is no simple way to perform arbitrary comparisons (e.g., to compare the average of type = "bc"
and type = "prof"
to type = "wc"
).
All in all, current implementations don't do a great job of making it easy to compare expected values in a simple and flexibly ways. Figuring out a nice interface for this in the smargins
package is an on-going project, but already some progress has been made.
The scompare
function that gives pairwise differences:
m.sm <- smargins(m, type = unique(type), income = mean(income)) summary(scompare(m.sm, "type"))
For custom comparisons the structure produced by smargins
is simple enough that the user can perform their own arbitrary transformations. For example, we can compare the average of blue collar and professional with white collar:
wc <- subset(m.sm, type == "wc")$.smargin_qi bc_prof <- (subset(m.sm, type == "bc")$.smargin_qi + subset(m.sm, type == "prof")$.smargin_qi)/2 sumstats(wc - bc_prof)
While the simple structure allows users to compute arbitrary transformations without too much difficulty, it would be nice to have some convenience functions for this. The scompare
function needs more work, starting with conceptual work to hammer out a description of the desired interface. The emmeans::contrast
approach is one source of inspiration; other places to look include car::lht
/ multcomp::glht
and https://github.com/goshevs/effcomp .
TODO
TODO
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.