knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette describes some of the contrast coding helpers this package provides.
It should be noted that this package does not seek to provide an interface for
creating hypothesis matrices and their respective contrast matrices. For that
functionality you might look into the multcomp
or hypr
packages. The contrast
functionality of this package is focused on the following:
This vignette will not go into detail how different contrast schemes are derived or interpreted, but will describe some of the high level differences. I recommend reading the following:
I also have a few blog posts about contrast coding; if you need further explanation of contrasts and more working examples beyond this vignette, read the first one below:
library(contrastable) library(dplyr) library(MASS)
Statistical models such as linear regression need numbers to operate with, but often we have discrete categories that we want to model such as ethnicity, country, treatment group, etc. Typically, we have some comparisons in mind before we do our statistics: compare the group receiving treatment to a reference level who did not, or something like compare group A to group B. Contrast coding is the process of assigning numbers that encode the comparisons we want to make prior to fitting a statistical model. What this allows us to do is strategically set up our model to be interpretable in a way that's meaningful to us even though the fit of the statistical model is unchanged. So, contrast coding is important for inferential statistics and reporting coefficients correctly, but is not that important for group-level predictions.
Throughout this vignette and package, I refer to different contrast schemes, which I define as an encoding of a principled set of comparisons for an arbitrary number of factor levels. The important bit here is the arbitrary number part.
To start with an example, consider an experiment where people are tasked with keeping track of the location of different shapes under different levels of cognitive load, such as needing to remember different strings of numbers. Basically, a change detection task with a digit span cognitive load manipulation. We'll call the levels of this cognitive load factor none, low, and high. A priori, the speed with which people detect a shape that changes location should vary depending on the cognitive load. Specifically, reaction time (RT) should be slower when cognitive load is higher.
What kinds of comparisons could we make, and how do we encode these in our statistical model? One comparison might be to compare low to none, then high to none. Another could be to compare low to none, then high to low. Yet another could be to compare low to none, then high to low and none together. Or, we could compare each level to the average across the conditions.[^dfnote] Specifying these different comparisons, which will really just amount to differences in group means, is our goal here.
In our example, what if we went found that our high cognitive load condition wasn't high enough? What if we needed to add an extra-high level? We could add a comparison of extra-high to none, or compare extra-high to high. Or, we could compare extra-high to none plus low plus high together. Ideally, if we're going for pairwise comparisons from a reference level, successive differences between levels, or some kind of nested set of comparisons, we would want to be able to specify this consistently without needing to revisit every instance of specifying our comparisons just to accommodate the new level.
[^dfnote]: Technically, due to degrees of freedom, we'd compare only 2 of the 3 levels to the average.
Throughout this vignette and package, I refer to different contrast schemes, which I define as an encoding of a principled set of comparisons for an arbitrary number of factor levels. The important bit here is the arbitrary number part such that if the goal is to take pairwise comparisons from a reference level, then adding a new level should only amount to encoding an additional pairwise comparison to the same reference level. I'll contrast this approach to a more manual approach to contrast coding, where you would explicitly write out, in code, every single comparison for every single factor you want to make. Consider if we added a new three-level predictor crossed with our four level example from before (we'll call the new predictor group with levels A, B, C). If we also just wanted to do pairwise comparisons, a scheme approach would encode pairwise comparisons for cognitive load and pairwise comparisons for the new factor. The manual approach would say to encode low-none, high-none, and extra high-none; then, encode B-A, and C-A. If we introduce a new group D later, we have to remember to go in and specify D-A too.
So, while the manual approach is maximally explicit about exactly what we want
to compare, it's also more things to write out and more chances that we make a
mistake while implementing it.
Given that contrasts are specified using matrices in R, the chances of a mistake
are high.
Accordingly, R provides a few functions that implement common contrast schemes
such as contr.treatment
and contr.sum
, which return matrices given a number
of levels.
However, there's some trickiness with them.
Below is a quick list of some issues this package attempts to circumvent; if
you're familiar with contrast coding from your own work, some of this may sound
familiar and some of it may surprise you.
dim()
, dimnames()
, or colnames()
.
In my experience, I rarely see people in my field do this, and so I believe it
would be naive to expect people to suddenly (learn how to and) start doing so.contr.treatment
, places the
reference level as the first level alphabetically while contr.sum
places the
reference level as the LAST level alphabetically. To verify for yourself, try
running contr.treatment(5)
and contr.sum(5)
. respectively, the reference
levels are identifiable by the rows containing either all 0s or all -1s.-contr.sum(2)/2
. However, this does not generalize when the number of
levels is greater than 2, i.e,. -contr.sum(3)/2
will not give you what you expect.contr.helmert
returns unscaled matrices, giving effect estimates that need
to be manually scaled. Tests of statistical significance are still meaningful,
but interpreting the magnitude of effects using matrices from this function
are likely to be incorrect if not scaled manually.I'm going to use the mtcars dataset just to have something to work with. We'll convert a bunch of the columns to factors up front.
mdl_data <- mtcars |> as_tibble() |> mutate(cyl = factor(cyl), twolevel = factor(rep(c("a", "b"), times = nrow(mtcars) / 2)), gear = factor(gear), carb = factor(carb))
We can inspect the contrasts for the factors using contrasts()
, which will use
the default contrasts for unordered factors
options("contrasts") # Show defaults for unordered and ordered contrasts(mdl_data$twolevel) contrasts(mdl_data$carb) # Note the reference level
We can see that the levels for carb
are reflected in the coefficient labels:
summary(lm(mpg ~ carb, data = mdl_data))
But if we changed the contrasts ourselves those helpful labels go away and are replaced with numeric indices. Note that the level names were originally also numbers, so it would be easy to get mixed up at this point.
mdl_data2 <- mdl_data contrasts(mdl_data2$carb) <- contr.sum(6) contrasts(mdl_data2$carb) # Note the reference level summary(lm(mpg ~ carb, data = mdl_data2))
Now I'll introduce the main workhorse of this package: the set_contrasts
function.
This function features a special syntax to quickly set contrast schemes.
I'll introduce features one at a time.
First, we'll set carb
to use sum coding as we did before.
We do this using two-sided formulas, where the left hand side of the formula
is a column in our dataset and the right hand side includes information about
the contrasts we want to set.
For example, this takes the form varname ~ code_by
.
Below, we specify that we want carb
to use contrasts generated by the sum_code
function.
Two things to note: set_contrasts
will automatically coerce specified
columns into factors, so you don't need to worry about calling factor()
on
columns first.
Note that the function will also give messages about other factors that you
haven't set explicitly.
mdl_data3 <- set_contrasts(mdl_data, carb ~ sum_code) contrasts(mdl_data2$carb) # matrix from before contrasts(mdl_data3$carb) # new matrix, note the column names
Note that, unlike last time, the reference level is now the first level
alphabetically and the labels are retained (compare the contrast matrices)
For schemes that have a singular reference level, setting contrasts with
set_contrasts
will always set the reference level to the first level
alphabetically.
Currently, the reference level for carb
is 1
.
If we want to change the reference level, we use the +
operator in our formula.
The change to the formula looks like varname ~ code_by + reference
.
Let's change it to 3
:
mdl_data4 <- set_contrasts(mdl_data, carb ~ sum_code + "3") contrasts(mdl_data4$carb)
Note that when using sum coding, the intercept will be the mean of the group means:
# mean of group means: group_means <- summarize(mdl_data4, grp_mean = mean(mpg), .by = "carb") group_means mean(group_means$grp_mean) # model coefficients coef(lm(mpg ~ carb, data = mdl_data4))
Let's say, hypothetically, we wanted to get differences of each comparison
level to the grand mean, but we also wanted the intercept to be not the
grand mean but the mean of our reference level?
We can set this using the *
operator:
varname ~ code_by + reference * intercept
.
mdl_data5 <- set_contrasts(mdl_data, carb ~ sum_code + 3 * 3) contrasts(mdl_data5$carb) coef(lm(mpg ~ carb, data = mdl_data5))
Finally, let's say we wanted labels that were a bit more clear about what the
comparisons are here.
We can set the label names ourselves using the |
operator:
varname ~ code_by + reference * intercept | labels
mdl_data6 <- set_contrasts(mdl_data, carb ~ sum_code + 3 * 3 | c("1-3", "2-3", "4-3", "6-3", "8-3")) contrasts(mdl_data5$carb) coef(lm(mpg ~ carb, data = mdl_data6))
Here's what it would look like if you tried to manually do this yourself. It's highly unlikely you would correctly specify every single value and every single label correctly on the first go (I actually messed up myself while writing out the matrix).
mdl_data7 <- mdl_data my_contrasts <- matrix(c(2, 1, 0, 1, 1, 1, 1, 2, 0, 1, 1, 1, 1, 1, 0, 2, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 1, 1, 2), nrow = 6) dimnames(my_contrasts) <- list( c("1", "2", "3", "4", "6", "8"), c("1-3", "2-3", "4-3", "6-3", "8-3") ) contrasts(mdl_data7$carb) <- my_contrasts contrasts(mdl_data7$carb)
To see how this can get even more error prone, consider a contrast matrix like the one below. Personally I would not trust myself to get the matrix correct, and if I ever had to add a level to the design it would be obscene to try and edit it.
mdl_data8 <- set_contrasts(mdl_data, carb ~ helmert_code * 4) MASS::fractions(contrasts(mdl_data8$carb))
The work would compound when we have multiple factors, but with the
set_contrasts
function we can specify what our desired scheme is easily:
mdl_data9 <- set_contrasts(mdl_data, carb ~ helmert_code * 4, cyl ~ scaled_sum_code + 4, twolevel ~ scaled_sum_code + "a", gear ~ helmert_code) MASS::fractions(contrasts(mdl_data9$carb)) MASS::fractions(contrasts(mdl_data9$cyl)) MASS::fractions(contrasts(mdl_data9$twolevel)) MASS::fractions(contrasts(mdl_data9$gear))
Practically speaking, the most common things you'll be doing is setting
a contrast scheme and maybe changing the reference level, so the +
operator
is very important.
A related function is the enlist_contrasts
function, which follows the same
exact rules as set_contrasts
but will return a list of contrasts instead of
setting the contrasts to the data itself.
Henceforth, when I'm not going to use the dataframe in an example, I'll just use
this to show the contrast matrices.
enlist_contrasts(mdl_data, carb ~ helmert_code * 4, cyl ~ scaled_sum_code + 4)
I showed a few functions that generate contrasts like sum_code
and
scaled_sum_code
and helmert_code
.
Below is a list of functions provided by this package:
treatment_code
: Pairwise comparisons to a reference level, intercept equals
the reference level (equivalent to contr.treatment
)sum_code
: Comparisons to the grand mean, intercept is the grand mean
(equivalent to contr.sum
, but reference level is the first level)scaled_sum_code
: Pairwise comparisons to a reference level, intercept
equals the grand meanhelmert_code
: Nested comparisons starting from the first level. Note
this is NOT equivalent to contr.helmert
, as the latter gives an unscaled
matrix.reverse_helmert_code
: Nested comparisons starting from the last level.backward_difference_code
: Successive differences. For levels A, B, C,
gives the differences B-A and C-Bforward_difference_code
: Successive differences. For levels A, B, C,
gives the differences A-B and B-C.cumulative_split_code
: Dichotomous differences. For levels A, B, C, D,
gives the differences A-(B+C+D), (A+B)-(C+D), (A+B+C)-D.orth_polynomial_code
: Orthogonal polynomial coding, equivalent to
contr.poly
.raw_polynomial_code
: Raw polynomial coding, I do not recommend using this
unless you know what you're getting into.polynomial_code
: An alias for orthogonal_polynomial_code
You can also use contrast functions from any other package so long as the first argument is the number of levels used to generate the matrix. However, again, if there's a singular level used for the reference level, this will always be set to the first level regardless of what the original matrix was.
# all equivalent to carb ~ sum_code foo <- sum_code enlist_contrasts(mdl_data, carb ~ contr.sum) enlist_contrasts(mdl_data, carb ~ sum_code) enlist_contrasts(mdl_data, carb ~ foo)
If you want to suppress this behavior, you can use the wrapper I
from base R.
Note this will also result in not setting the labels for you: it will use
the generated matrix as-is.
This is useful if you explicitly create a matrix and
want to make sure the reference level isn't shifted around.
enlist_contrasts(mdl_data, carb ~ I(contr.sum))
One last quick note is that you can keep the parentheses with the function, so if your tab-autocomplete puts parentheses you don't have to worry about it.
enlist_contrasts(mdl_data, carb ~ contr.sum())
However, the function does need to exist; the following won't run:
foo <- contr.sum(6) # foo is a matrix enlist_contrasts(mdl_data, carb ~ foo()) # foo is not a function
If you know you're going to use the same scheme for multiple variables, but don't need to set the reference level or labels for any one in particular, you can use tidyselect functionality to set multiple contrasts. I'll show a few examples of how to do this.
First, you can specify multiple variables on the left hand side using the +
operator.
# equivalent to: enlist_contrasts(mdl_data, cyl ~ sum_code, gear ~ sum_code) enlist_contrasts(mdl_data, cyl + gear ~ sum_code)
You can also use selecting functions like the below:
enlist_contrasts(mdl_data, where(is.factor) ~ sum_code) # see also enlist_contrasts(mdl_data, where(is.unordered) ~ sum_code) # see also enlist_contrasts(mdl_data, where(is.numeric) ~ sum_code)
This also lets you programmatically set contrasts like so:
these_vars <- c("cyl", "gear") enlist_contrasts(mdl_data, all_of(these_vars) ~ sum_code)
However, you have to ensure that no variables are duplicated. The following are not allowed regardless of whether the right right hand sides evaluate to the same contrast schemes.
enlist_contrasts(mdl_data, cyl ~ sum_code, where(is.factor) ~ sum_code) # cyl is a factor for mdl_data enlist_contrasts(mdl_data, cyl ~ sum_code, cyl + gear ~ sum_code) # cyl can't be specified twice
You can check out the tidyselect package for more information.
is.unordered
is provided in this package as an analogue to is.ordered
.
You can also set contrasts using things other than a function.
Specifically, you can pass the following classes to code_by
in the two-sided
formulas:
array
matrix
hypr
function
(must return a contrast matrix, cannot be an anonymous function)symbol
(must evaluate to a contrast matrix though)If you're using a custom matrix for a tricky analysis, then you can specify the matrix yourself and then pass it to set_contrasts:
my_matrix <- contr.sum(6) / 2 enlist_contrasts(mdl_data, carb ~ my_matrix)
You can also specify a list of formulas and pass that to different functions.
my_contrasts <- list(carb ~ contr.sum, gear ~ scaled_sum_code) enlist_contrasts(mdl_data, my_contrasts) mdl_data12 <- set_contrasts(mdl_data, my_contrasts)
This functionality is very important for the next function I'll discuss.
It can be very useful to get a summary of the different factors and their
contrasts in your dataset.
The glimpse_contrasts
function does this for you by passing it your dataset
and your contrasts.
my_contrasts <- list(carb ~ contr.sum, gear ~ treatment_code + 4, twolevel ~ scaled_sum_code * "b", cyl ~ helmert_code) mdl_data$twolevel enlist_contrasts(mdl_data, cyl ~ scaled_sum_code + 6 * 6) mdl_data13 <- set_contrasts(mdl_data, my_contrasts) glimpse_contrasts(mdl_data13, my_contrasts)
Note that if you haven't actually set your contrasts to the dataset, you'll receive a series of warnings:
glimpse_contrasts(mdl_data, my_contrasts)
You can also specify the formulas directly, like with set_contrasts
and
enlist_contrasts
--- but usually you'll want to have a list of formulas and
pass this to the different functions.
Note how we retyped everything we typed above just to get the summary table.
glimpse_contrasts(mdl_data13, carb ~ contr.sum, gear ~ treatment_code + 4, twolevel ~ scaled_sum_code * "b", cyl ~ helmert_code)
If you don't provide any contrast information, glimpse_contrasts
will assume
that factors should be using their default contrasts.
If the contrasts are not the same as the defaults, you'll receive warnings:
glimpse_contrasts(mdl_data) # no warnings glimpse_contrasts(mdl_data13) # warnings
If there are mismatches between the data and the contrasts specified by the formulas, you'll also get warnings:
glimpse_contrasts(mdl_data, carb ~ contr.sum, gear ~ treatment_code * 4, cyl ~ contr.treatment | c("diff1", "diff2"))
Remember that set_contrasts
will automatically coerce dataframe columns into
factors when specifying a contrast, but glimpse_contrasts
does not because
it does not modify the provided dataframe.
There is an additional operator for the formulas that only applies to polynomial
contrasts.
For background, polynomial contrasts encode polynomial "trends" across the
levels of a factor.
So, testing for a linear-like trend, a quadratic-like trend, etc.
One thing you can do, though I don't necessarily recommend it is removing
higher-order trends.
For instance, if you have 8 levels but don't want to bother with polynomials
over degree 3 (even though you'll get up to degree 7 for "free").
In this situation you can use the -
operator:
varname ~ code_by + reference * intercept - low:high | labels
enlist_contrasts(mdl_data, carb ~ contr.poly) enlist_contrasts(mdl_data, carb ~ contr.poly - 3:5)
Note that you can ONLY use this with enlist_contrasts
and glimpse_contrasts
.
This is because if you remove columns from a contrast matrix, R will fill them
back with something else.
mdl_data14 <- set_contrasts(mdl_data, carb ~ contr.sum) contrasts(mdl_data14$carb) contrasts(mdl_data14$carb) <- contrasts(mdl_data14$carb)[, 1:2] contrasts(mdl_data14$carb)
What do these values correspond to? If we check out the hypothesis matrix, we can see that it seems like it's trying to encode 0 but with some floating point error:
MASS::fractions( contrastable:::.convert_matrix( contrasts(mdl_data14$carb) ) )
This is why I don't recommend actually using the -
operator.
If you use something like lm
to fit functions, you can use the output
from enlist_contrasts
as the argument to contrasts
.
Although, I'm pretty sure the model matrix will just add in these values anyways.
If you try to use the -
operator with set_contrasts
you'll get a warning,
and the -
operator will be ignored:
mdl_data15 <- set_contrasts(mdl_data, carb ~ polynomial_code - 3:5)
This is somewhat related to the -
operator described above, but has more uses.
When considering a factor with 3 levels, you'll end up with two coefficients in
your model.
In other words, a single factor is "decomposed" behind the scenes into multiple
numeric predictors for the model matrix.
Sometimes it can be helpful to work directly with these decomposed predictors.
For example, interactions in GAM models with mgcv
can often be a pain to work
with, and sometimes it's easier to just group the interaction of two predictors
into a series of individual predictors.
However, doing the matrix multiplication one at a time (or trying to
conceptualize how the matrices fit together) can be difficult, especially when
R's model matrix functionality can do it for us.
decompose_contrasts
provides an interface to do such decomposition.
mdl_data16 <- mdl_data |> set_contrasts(cyl ~ helmert_code, gear ~ helmert_code) |> decompose_contrasts(~ cyl * gear) # Look at the decomposed contrast columns mdl_data16 |> dplyr::select(matches("^cyl|gear")) |> head()
The column names take the name from the labels, so these may not always be
easy to work with when there are special characters.
You can wrangle these or use something like the {janitor}
package as needed.
Here's a comparison of using the cyl
predictor normally (which gives us
two coefficients), and specifying the comparisons separately after decomposing
the contrasts:
coef(lm(mpg ~ cyl, data = mdl_data16)) coef(lm(mpg ~ `cyl<6` + `cyl<8`, data = mdl_data16)) coef(lm(mpg ~ cyl * gear, data = mdl_data16)) coef(lm(mpg ~ `cyl<6` + `cyl<8` + `gear<4` + `gear<5` + `cyl<6:gear<4` + `cyl<8:gear<4` + `cyl<6:gear<5` + `cyl<8:gear<5`, data = mdl_data16))
If we really wanted to, we could use just one of the comparisons, but note that this may give misleading results:
coef(lm(mpg ~ `cyl<6` + `cyl<8`, data = mdl_data16)) coef(lm(mpg ~ `cyl<6`, data = mdl_data16)) # not the same estimate as the above
This function can be especially valuable for pedagogical purposes, as it shows how the qualitative labels in the factor data is converted into numeric values for the model to work with.
There are two general patterns you'll take in analyses with this package:
Here's a general usage pattern when you don't want to use glimpse_contrasts
.
raw_data <- mtcars # load raw data from a csv or something here # wrangle data for your final model final_data <- mtcars |> # mutate(# ... some data wrangling transformations ..) |> set_contrasts(carb ~ sum_code) # set contrasts at the very end mdl <- lm(mpg ~ carb, data = final_data) # run model with contrasts set
Here's a usage pattern if you wanted to, say, publish the contrast matrices or report the glimpse table in a publication or preregistration:
raw_data <- mtcars # load raw data from a csv or something here # specify contrasts up front my_contrasts <- list(carb ~ sum_code, cyl ~ scaled_sum_code, gear ~ sum_code) # wrangle data for your final model final_data <- mtcars |> # mutate(# ... some data wrangling transformations ..) |> set_contrasts(my_contrasts) # set contrasts at the very end # Show the matrices we're using enlist_contrasts(final_data, my_contrasts) # Show a summary glimpse_contrasts(final_data, my_contrasts) # Fit the model with contrasts set mdl <- lm(mpg ~ carb, data = final_data)
If you want to show fractions instead of decimals, use MASS::fractions
with
each element of the list of contrasts:
enlist_contrasts(final_data, my_contrasts) |> lapply(MASS::fractions)
If you use the targets
package, you can set up the list of contrasts as its
own target, then pass that to other targets so that your analyses will be rerun
whenever you change the contrasts (though this shouldn't happen frequently).
I discussed the main functions in this package and the special syntax that is implemented to set contrasts. If you want additional examples, see my blog posts below:
If you need additional information regarding the various warnings and messages
this package throws, you can check out the warnings
vignette in this package.
When citing the package in a paper, ideally three things are achieved:
In a paper with multiple analyses, these don't necessarily need to all be mentioned on each analysis, especially when there are commonalities between them.
Bad on first mention: "Condition and group are sum coded." (this can be fine if "sum code" is defined previously)
Good: "Condition is coded using the sum_code()
function from the contrastable package (Sostarics 2024), using A as the reference level. Group is similarly sum coded,
with X as the reference level."
Also good: "For all of our analyses, we use the contrastable package (Sostarics, 2024) to set the contrasts for our categorical variables. Condition and group are sum coded (reference level -1, comparisons +1) with A and X as the reference levels, respectively." The point here is to disambiguate what is meant by "sum code", which has inconsistent usage in the literature.
Here's a paragraph example describing two models:
A bit repetitive: "In the model for Experiment 1, Condition is treatment coded
using the treatment_code()
function from the contrastable package (Sostarics, 2024),
with A as the reference, and Group is scaled sum coded using the scaled_sum_code()
function from the contrastable package, with X as the reference.
In the model for Experiment 2, Condition is treatment coded with the treatment_code()
function (reference=A) and Group is scaled sum coded with scaled_sum_code()
(reference=X), while the additional Context predictor is scaled sum coded using the scaled_sum_code()
function (reference=NoContext)."
Rewritten: "We use the treatment_code() and scaled_sum_code() functions from the contrastable package (Sostarics, 2024) when setting the contrasts for our categorical variables. In the model for Experiment 1, Condition is treatment coded (reference=A) and Group is scaled sum coded (reference=X). For Experiment 2, the additional Context predictor is scaled sum coded (reference=NoContext); as in Exp. 1, Condition is treatment coded (reference=A) and Group is sum coded (reference=X)."
Other examples:
scaled_sum_code()
function from the contrastable package (Sostarics 2024).helmert_code()
and
scaled_sum_code()
functions from the contrastable R package (Sostarics 2024)set_contrasts()
and sum_code()
functions from the contrastable package
(Sostarics 2024) to sum code (+1/-1) our variablesI also recommend writing out, potentially in a footnote, what the comparisons are.
Good: "We use the contrastable package's sum_code()
function (+1/-1, Sostarics 2024)
for all categorical variables."
Better: "We use the contrastable package's sum_code()
function (+1/-1, Sostarics 2024)
for all categorical variables.
This contrast scheme encodes differences between each comparison level and the grand mean."
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.