adventr: more categorical predictors
In adventr: Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

library(forcats)
library(learnr)
library(tidyverse)

library(BayesFactor)
library(lm.beta)
library(robust)
library(sjstats)
library(WRS2)



knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
tutorial_options(exercise.cap = "Exercise")

hint_text <- function(text, text_color = "#E69F00"){
  hint <- paste("<font color='", text_color, "'>", text, "</font>", sep = "")
  return(hint)
}

#Read dat files needed for the tutorial
memory_tib <- adventr::memory_dat

#setup objects for code blocks

An Adventure in R: Categorial predictors (several categories)

Overview

This tutorial is one of a series that accompanies An Adventure in Statistics [@RN10163] by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, ^[Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.]

Who is the tutorial aimed at?
- Anyone teaching from or reading An Adventure in Statistics may find them useful.
What is covered?
- This tutorial looks at how to fit a linear model with a predictor variable that represents more than two independent categories. We look at robust variants of the model, and computing a Bayes factor. It would be a useful tutorial to run alongside teaching based on Chapter 16 of An Adventure in Statistics.
- This tutorial does not teach the background theory: it is assumed you have either attended my lecture or read the relevant chapter in the aforementioned books (or someone else's)
- The aim of this tutorial is to augment the theory that you already know by guiding you through fitting linear models using R and RStudio and asking you questions to test your knowledge along the way.

Story précis

Why a précis?

Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don't have the book I begin each tutorial with a précis of the story. If you're not interested then fair enough - click past this section.

General context for the story

It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.

Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.

But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.

Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions

The Chippers who have had WiFi-enabled chips implanted into their brains, enabling them to record and broadcast what they see and think in real time; upload memories for future generations into a newly-created memoryBank and live-stream music and films directly into their brains.
The Clocktarians, followers of the old pre-Prism ways who use steam punk style technologies, who have elected not to have chips in their brains, regarded by the Chippers as backward-looking stuck in a ‘clockwork, Victorian society’.

Everyone has a star, a limitless space on which to store their digital world.

Zach and Alice are Clocktarians. Their technology consists mainly of:

A Proteus, a device made from programmable matter that can transform shape and function simply by the owners’ wishes. Zach calls his a diePad, in the shape of a tombstone in an ironic reference to an over-reliance on technology at the expense of memory.
A Reality Checker, a clockwork mechanism that, at the point of critical velocity, projects an opaque human head that is linked to everything and can tell you anything. Every head has a personality and Zach’s is a handsome, laid back ‘dude’ who is like an electronic friend, who answers questions if he feels like it and often winds Zach up by giving him false information. And he often flirts with Alice.

Main Protagonists

Zach
- Rock musician in band called The Reality Enigma.
- Spellbinding performer, has huge fan-base.
- Only people living in Elpis get to see The Reality Enigma in the flesh. Otherwise all performances are done via an oculus riff, a multisensory headset for experiencing virtual gigs.
- Zach’s music has influenced and changed thousands of lives.
- Wishes he had lived pre-Revolutionary times, the turn of the 21st Century, a golden age for music when bands performed in reality at festivals.
- Kind, gentle and self-doubting.
- Believes science and maths are dull and uninspiring. Creates a problem between him and Alice as she thinks that because he isn’t interested in science, he isn’t interested in her. Leads to lots of misunderstandings between them.
Alice
- Shy, lonely, academically-gifted – estranged from the social world until she met Zach in the college library.
- Serious scientist, works at the Beimeni Centre of Genetics.
- At 21, won the World Science Federation’s Einstein Medal for her genetics research
- Desperately wants Zach to get over his fear of science so he can open his mind to the beauty of it.

How Zach's adventure begins

Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.

On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.

Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.

Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.

Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.

Packages and data

Packages

This tutorial uses the following packages:

BayesFactor [@RN9444] to compute Bayes factors
boot [@RN11409] to compute bootstrap confidence intervals
Hmisc [@RN11417] to get confidence intervals
lm.beta [@RN11412] to compute standardized b-values
robust [@RN11411] to compute robust estimates
sjstats [@RN11413] to produce robust parameter estimates and confidence intervals
tidyverse [@RN11407] for general data processing
WRS2 [@RN10205] to do various robust tests

These packages are automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name"), where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name), where package_name is the name of the package.

Data

This tutorial has the data files pre-loaded so you shouldn't need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:

ais_16_memory_erasing.csv.

You can load the file in several ways:

Assuming that you follow the workflow recommended in the tutorial adventr_02 (see also this online tutorial), you can load the data into an object called memory_tib by executing:
- memory_tib <- readr::read_csv("../data/ais_16_memory_erasing.csv")
- If you don't follow my suggested workflow, you will adjust the file location in the above command.
Alternatively, if you have an internet connection (and my server has not exploded!) load the file direct from the URL by executing:
- memory_tib <- readr::read_csv("http://www.discoveringstatistics.com/repository/ais_data/ais_16_memory_erasing.csv")

Categorical variables

When working within the tutorial the data are already prepared for you. However, if you are trying to replicate the tutorial within R or R Studio then you need to explicitly convert categorical predictors to factors. The data for this tutorial contains a variable group, which will be read in from the csv file as a character variable. To convert this variable to a factor we can use the as_factor() function from the forcats package. There are several ways to do this, but the most tidyverse way is to use mutate() from dplyr (part of tidyverse), which is a way of adding variables to a tibble or overwriting existing variables. We could execute:

library(tidyverse)
memory_tib <- memory_tib %>% 
  dplyr::mutate(
    group = forcats::as_factor(group)
  )

This code recreates the memory_tib tibble from itself then uses mutate to recreate the variable group from itself by placing it within the as_factor() function which we met in an earlier tutorial.

Exploring the data

The model

During Zach's visit to the JIG:SAW complex he visits several research buildings. In the second, he discovers experiments being conducted related to memory. In one experiment thirty-six participants with ID chips were tested. All participants met a stranger who had a 5-minute scripted conversation with them containing 10 critical pieces of information. A week later, the participants were asked to recall the encounter for 5 minutes, and then after a 10-minute break wrote everything that they could remember about the original encounter. During the 5-minute initial recall, one of three things happened: the control group (N = 12) received no intervention; the erase group (N = 12) had a pulse of electricity sent to their brain via their ID chip; and the replace group (N = 12) had conflicting verbal descriptions sent to their ID chip. The outcome was how many of the 10 critical pieces of information the participants wrote down after the recall phase. The prediction was that recall will be worse (i.e. lower scores) for participants who had an electrical pulse, or conflicting information sent to their ID chips during recall than those, in the control group, who experienced no interference. The data are in the tibble memory_tib, which has seven variables:

id: Participant ID
group: To which of the no intervention, memory erase or memory replace conditions the participant was assigned
recall: How many of the 10 critical pieces of information the participant identified
erase_dummy: Dummy variable that compares the erase condition to no intervention
replace_dummy: Dummy variable that compares the replace condition to no intervention
contrast_1: Variable that uses contrast codes to compare the memory conditions (combined) to no intervention
contrast_2: Variable that uses contrast codes to compare the erase condition to the replace condition

The most important variables are the first three - the others will be used to illustrate various things. Use the code box to inspect the tibble.

memory_tib

The model we're fitting is described by the following equation:

$$ \begin{aligned} Y_i & = b_0 + b_1X_i+ ε_i\ \text{recall}_i & = b_0 + b_1\text{dummy 1}_i + b_2\text{dummy 2}_i + ε_i \ & = b_0 + b_1\text{erase vs. control}_i + b_2\text{replace vs. control}_i + ε_i \end{aligned} $$

The dummy variables are made up of 0s and 1s that code the three groups across the two variables (see the book chapter for details)

Descriptive statistics

Using what you've learnt in previous tutorials, create a tibble called mem_summary containing the group means and their confidence intervals.

r hint_text("Tip: if you're doing this outside of the tutorial remember to load the packages tidyverse and Hmisc")

mem_summary <-  memory_tib %>%
  dplyr::group_by(group) %>%
  dplyr::summarize(
    mean = mean(recall),
    ci_low = ggplot2::mean_cl_normal(recall)$ymin,
    ci_upp = ggplot2::mean_cl_normal(recall)$ymax
)
mem_summary

Plotting the data

Let's plot the data. Use the code box below to create an error bar chart with group on the x-axis and recall on the y-axis. If you feel like it, try to superimpose the raw data. (Hint, you can adapt the code that you used in the adventr_15 tutorial.)

mem_plot <- ggplot2::ggplot(memory_tib, aes(group, recall))
mem_plot +
  geom_point(colour = "#E69F00", position = position_jitter(width = 0.1, height = 0)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", colour = "grey35") +
  labs(x = "Experimental group", y = "Recall (out of 10") +
  coord_cartesian(ylim = c(0, 10)) +
  scale_y_continuous(breaks = 0:10) +
  theme_bw()

Predictors with several categories

Fitting the model

It looks like recognition scores are, indeed lower in the replace condition compared to the control, but perhaps not the erase group (compared to the control). To fit the model we use the lm() function, because we are fitting a linear model with a categorical predictor. We've used this function before, just to recap it takes the following general form (I've retained only the key options):

lm(outcome ~ predictor(s), data, subset, na.action = na.fail)

Using what we've learnt in previous tutorials we could do the following:

Create a model called mem_mod using lm()
Add standardized betas using lm.beta() (Remember to execute library(lm.beta) first if trying this outside of the tutorial.)
Summarize the model using summary()
Use confint() to get confidence intervals for the model parameters (the bs)
Plot diagnostics with plot()

Using these hints, see whether you can fit the model in the code box below. (Another hint is you're going to use the same code as in adventr_15 but just changing the variable names and the name of the tibble.)

mem_mod <- lm(recall ~ group, data = memory_tib)
mem_mod <- lm.beta(mem_mod)
summary(mem_mod)
confint(mem_mod)
plot(mem_mod)

The output provides estimates of the model parameters (the b-values) and the significance of these values. R automatically turns the variable group into dummy variables. It will order the categories alphabetically (so the order for our variable will be Control, Erase, Replace). It then creates dummy variables using the first category as the reference category (see Table 1). For our data dummy variable 1 will, therefore, represent the control group vs the erase group (which R has labelled as groupErase, which tells us that this is the variable group comparing the Erase group to the first category). Dummy variable 1 represents the control group vs the replace group (which R has labelled as groupReplace, which tells us that this is the variable group comparing the Erase group to the first category). For these data, these default comparisons make sense (we compare everything to the control group) but there will be times when the default creation of dummy variables and the alphabetical order of your groups do not combine to make the comparisons that you want. We'll look at how to change this in due course.

Group <-  c("Control", "Erase", "Replace")
groupErase <- c(0, 1, 0)
groupReplace <- c(0, 0, 1)

knitr::kable(tibble(Group, groupErase, groupReplace), caption = "Table 1: Dummy codes used by R (see text for details)")

For now, the Y intercept ($b_0$) is 5, which is the value of recognition when group is the reference category. As just described, R will have coded the variable group alphabetically (unless you tell it something different) so the reference category is the control group. So, ($b_0$) is 5 because the mean of the control group is 5. As we saw above, the two predictors are dummy variables, and their respective bs reflect the differences between means that the dummy variables code. groupErase will be the difference between the mean of the erase group (4.75) and the mean of the control group (5.00), which is $4.75 - 5.00 = -0.25$, and groupReplace will be the difference between the mean of the replace group (2.67) and the mean of the control group (5.00), which is $2.67 - 5.00 = -2.33$.

quiz(
  question("How would you interpret the information about the *F*-statistic?",
    answer("Overall recall scores can not be significantly predicted from the group means. This implies that the group means are not significantly different.", correct = T, message = "Because the *p*-value associated with *F* is 0.0967, which is more than the typical criterion of 0.05, we would conclude that using the group means does not help us to predict recall scores. In other words the group means are not significantly different."),
    answer("Overall recall scores can be significantly predicted from the group means. This implies that the group means are significantly different.", message = "Because the *p*-value associated with *F* is 0.0967, which is more than the typical criterion of 0.05, we would conclude that using the group means does NOT help us to predict recall scores. In other words the group means are not significantly different."),
    answer("Tell me a cat joke", message = "A cat walks into a bar and orders a pint of fish. The bar person says \'Sorry, but we don't serve fish\'. The cat replies \'That's OK, I'm a cat\'."),
    correct = "Correct - well done!",
    incorrect = "Good try.",
    random_answer_order = TRUE,
    allow_retry = T
    ),
  question("How would you interpret the table labelled **Coefficients**?",
    answer("Recall scores were significantly different between the replace group and the control group but not between the erase group and the control group.", correct = T, message = "With the caveat that the overall fit of the model wasn't significant, cecause the value of *Sig.* is 0.049 for the second dummy variable, which is less than the criterion of 0.05, we know that the difference between the means of the replace group and the control group is significantly different. However, becasue the value of *Sig.* is 0.828 for the first dummy variable, which is greater than the criterion of 0.05, we know that the difference between the means of the erase group and the control group was not significantly different."),
    answer("Recall scores were not significantly different between the replace group and the control group and also not between the erase group and the control group.", message = "The interpretation for the replace group is incorrect."),
    answer("Recall scores were significantly different between both the replace group and the control group and the erase group and the control group.", message = "The interpretation for the erase group is incorrect."),
    answer("Recall scores were not significantly different between both the replace group and the control group but were between the erase group and the control group.", message = "The interpretation for both dummy variables is incorrect."),
    answer("Recall scores were not significantly different between the replace group and the erase group.", message = "There's no information in the table that relates to this group comparison."),
    answer("Give me a clue", message = "Look at the value in the column labelled *Pr(>|t|)* for the rows *groupErase* and *groupReplace*"),
    correct = "Correct - well done!",
    incorrect = "Good try.",
    random_answer_order = TRUE,
    allow_retry = T
    )
)

The plots look a bit odd when you have a categorical predictor variable. The Q-Q and residual plots suggest some unusual cases but not particularly influential ones.

Just to illustrate what's going on

The tibble memory_tib contains two variables called erase_dummy and replace_dummy that contain the dummy codes that R has used. These variables are not used in the model, but are included to illustrate what R does when fitting a categorical predictor. It effectively creates variables like erase_dummy and replace_dummy that compare each group to the reference category (by default, the category with the first letter that is closest to the beginning of the alphabet). Try creating a model, called dum_mod, that predicts recall from erase_dummy and replace_dummy, and display a summary of the model. You should find that it is identical to the model that we just fitted using group. This example illustrates, then, how R takes a categorical variable and redefines it as a series of dummy coding variables.

dum_mod <- lm(recall ~ erase_dummy + replace_dummy, data = memory_tib)
summary(dum_mod)

Contrasts

Assigning contrasts

We can override the default dummy coding (or, as we shall see, use contrast coding) by using the contrasts() function. The way I like to use this function is to create variables containing the codes that I want to use, and then assign these variables to the variable that I want to add contrasts to by using these variables. This method is slightly long-winded but has the advantage that the variable names for the contrasts you create are used in the output. So, it enables you to set names for your contrasts, which make the output more interpretable.

Let's imagine that instead of comparing each group to the control, you wanted to compare each group to the replace group. The codes we would use to achieve this are in Table 2.

Group <-  c("Control", "Erase", "Replace")
replace_vs_control <- c(1, 0, 0)
replace_vs_erase <- c(0, 1, 0)

knitr::kable(tibble(Group, replace_vs_control, replace_vs_erase), caption = "Table 2: Dummy codes comparing each group to the replace group")

First, we need to create the two dummy variables that relate to columns 2 and 3 of the table. We'll call the first rep_vs_control and the second rep_vs_erase because these names remind us of which groups are being compared by the dummy variable. We can create these variables by executing:

rep_vs_control <- c(1, 0, 0)
rep_vs_erase <- c(0, 1, 0)

We want to use these variables to set contrasts for the variable group. We can do this by executing:

contrasts(memory_tib$group) <- cbind(rep_vs_control, rep_vs_erase)

Let's break this down:

contrasts(memory_tib$group) instructs R to set contrasts for the variable group in the tibble called memory_tib.
cbind(rep_vs_control, rep_vs_erase) sets the contrast to be the two dummy variables we created bound together in columns (that's what cbind() does). To get a flavour of what cbind is doing try executing only the right hand side of the command in the code box. You should see two columns that look like columns 2 and 3 of Table 2.

r hint_text("Tip: the contrast function only works on variables that are factors. The variable group in this tutorial has already been converted into a factor, but when you import data files remember to set variables that define groups as factors. There are various different ways to do this including the as_factor() function form the forcats package.")

I don't personally do this for the reason already explained, but if you don't care about setting variable names for the dummy variables, you can set the contrasts in a single command that puts the codes directly into the cbind() function:

contrasts(memory_tib$group) <- cbind(c(1, 0, 0), c(0, 1, 0))

In the code box below set a contrast for the group variable that compares each group to the replace group as just described. Then re-fit the model predicting recall from group. Call the model mem_mod_2 and don't worry about requestioning standardized betas, confidence intervals or plots.

rep_vs_control <- c(1, 0, 0)
rep_vs_erase <- c(0, 1, 0)
contrasts(memory_tib$group) <- cbind(rep_vs_control, rep_vs_erase)
mem_mod_2 <- lm(recall ~ group, data = memory_tib)
summary(mem_mod_2)

Compare the output with that of the previous model. Note that the overall fit of the model hasn't changed F(2, 33) = 2.51, p = 0.097, but that the b-values have. ($b_0$) is now 2.67, which is the mean of the replace group (because this is now the reference category). The first b for group is now the difference between the mean of the control group (5.00) and the mean of the replace group (2.67), which is $5.00 - 2.67 = 2.33$, and the second b for group is now the difference between the mean of the erase group (4.75) and the mean of the replace group (2.67), which is $4.75 - 2.67 = 2.08$. Finally, note that by naming our coding variables when we specified them, the output helpfully labels the bs for group with rep_vs_control and rep_vs_erase.

Contrast coding

In the chapter, lecture we was that we can use different types of coding to compare different groups. One alternative to dummy coding is contrast coding. We can achieve contrast coding in the same way as just described: using the contrasts() function. In the book, Zach discovered that the researchers at JIG:SAW had predicted that recall will be worse (i.e. lower scores) for participants who had an electrical pulse, or conflicting information sent to their ID chips during recall than those, in the control group, who experienced no interference. In the chapter/lecture we discovered that we could operationalize these hypotheses as two dummy variables using contrast codes.

Group <-  c("Control", "Erase", "Replace")
Interference_vs_none <- c(-2, 1, 1)
Erase_vs_replace <- c(0, -1, 1)

knitr::kable(tibble(Group, Interference_vs_none, Erase_vs_replace), caption = "Table 3: Contrast codes for the memory data")

The first contrast (Interference_vs_none in Table 3) compares the control group to any group that had interference sent to their ID chips (i.e. the erase and replace groups combined), whereas the second (Erase_vs_replace in Table 3) ignores the control group (by coding it with 0) and compares the erase group to the replace group. These contrasts test these questions:

Does having interference (but ignoring if the interference attempted to erase or replace the memory) sent to your ID chip result in different memory recall than having no interference?
Do attempts to replace memories lead to different recall than attempts to erase them?

We can implement these contrasts in the same way as just described. All that we'll be changing is the names of the contrasts and the codes that we use. See whether you can adapt your code from the previous code box, to set these two contrasts for the variable group and then create a model called con_mod (with standardized betas, confidence intervals and plots) that you then summarize. If you're stuck click .

interference_vs_none <- c(-2, 1, 1)
erase_vs_replace <- c(0, -1, 1)
contrasts(memory_tib$group) <- cbind(interference_vs_none, erase_vs_replace)
con_mod <- lm(recall ~ group, data = memory_tib)
summary(con_mod)

You should find that the b for the first contrast is proportional to the difference between the average of the erase and replace group (because the group sizes are equal we can take a straight average, $\frac{4.75 + 2.67}{2} = 3.71$) and the average of the control (5.00), which is $5.00-3.71 = -1.29$. The b is this value divided by the number of groups, 3 ($\frac{-1.29}{3} = -0.43$). The b for the first contrast is proportional to the difference between the means of the erase and replace groups ($2.67-4.75 = -2.08$. The b is this value divided by the number of groups, 2 ($\frac{-2.08}{2} = -1.04$).

Trends (polynomial contrasts)

As well as defining our own contrasts, we can use pre-defined ones. For example, if we predicted a trend in our group means (i.e., our groups have a meaningful order such as increasing does of a drug) we might want to do a trend analysis, also known as polynomial contrasts. These contrasts test for linear, quadratic, cubic (and higher) trends. When the predictor variable is categorical, these contrasts test for a patterns in the means across groups that follow a linear, quadratic (etc.) pattern. We can set this trend using contr.poly() and specifying the number of groups (in this case 3). For example, we would set a polynomial contrast for the variable group by executing:

contrasts(memory_tib$group) <- contr.poly(3)

r hint_text("Tip: Our factor levels are ordered as: control, erase, replace. This order is sensible because we expect recall to get progressively worse across those groups. However, if the levels of your grouping variable do not represent the order that you want, you can use the fct_relevel() function in the forcats package to re-order the levels more sensibly.")

Set the variable group to have polynomial contrast (as above) then create and summarize a model called mem_mod_poly that fits a linear model predicting recall from group.

contrasts(memory_tib$group) <- contr.poly(3)
mem_mod_poly <- lm(recall ~ group, data = memory_tib)
summary(mem_mod_poly)

The output contains two contrasts (because there were 3 groups). The b labelled group.L tests the linear contrast (that's what the 'L' is for) and the b labelled group.Q tests the quadratic contrast (hence the 'Q'). With three groups we can't test for a higher order polynomial, but if you had 4 categories then you'd get a cubic contrast, and so on. The output shows that the linear trend is significant, t = -2.04, p = 0.049, whereas the quadratic trend is not, t = -0.93, p = 0.361.

Robust models

Robust standard errors

As in previous tutorials, we can use one of the methods (HC0 to HC5) for producing standard errors (and hence confidence intervals) that are robust to heteroscedasticity.

r hint_text("Tip: if you're doing this outside of the tutorial you need the sjstats package installed and you need to execute library(sjstats).")

Remember that we place the model we created with lm() (in this case mem_mod) into the function robust() from the sjstats package. For example, we could execute:

robust(mem_mod, conf.int = T) to use the default "HC3" method and request 95% confidence intervals
robust(mem_mod, vcov.type = "HC4", conf.int = T) to use "HC4" and request 95% confidence intervals

The HC3 and HC4 refer to different methods to correct the standard errors (refer back to the tutorial adventr_14). Try these commands in the code box and compare the results.

mem_mod <- lm(recall ~ group, data = memory_tib)

sjstats::robust(mem_mod, conf.int = T)
sjstats::robust(mem_mod, vcov.type = "HC4", conf.int = T)

Robust estimates

In adventr_14 we used the lmRob() function from the robust package (again, if you're working outside of the tutorial remember to execute library(robust)) to get robust parameter estimates. Recall that the function is used in the same way as lm() so you can copy your earlier code and replace lm with lmRob. Try this in the code box to create a model called mem_rob, and use summary() to view it.

mem_rob <- robust::lmRob(recall ~ group, data = memory_tib)
summary(mem_rob)

Note that the parameter estimates for the effect of group have changed because they no longer represents the difference between arithmetic means but instead represents a robust estimate of those differences.

Bootstrap confidence intervals

We can also bootstrap confidence intervals as we did in adventr_14 and adventr_15. refer back to those tutorials to see whether you can understand the code below, then execute it.

r hint_text("Tip: remember that you need the package boot, so execute library(boot) before executing the code below.")

library(boot) #This loads the package boot

#write a function to do the bootstrapping. This is identical to what was used in the tutorial adventr_14
boot_lm <- function(formula, data, index){
  d <- data[index,]
  fit <- lm(formula, data = d)
  coef(fit)
}

#The following runs the bootstrap and extracts the CIs
boot_result <- boot::boot(boot_lm, formula = recall ~ group, data = memory_tib, R = 2000)
boot::boot.ci(boot_result, type = "bca", index = 1) #for the intercept
boot::boot.ci(boot_result, type = "bca", index = 2) #for the first dummy variable for group (this code will run the default unless you also include a line to add contrasts to the variable, group.)
boot::boot.ci(boot_result, type = "bca", index = 3) #for the second dummy variable for group (this code will run the default unless you also include a line to add contrasts to the variable, group.)

The WRS2 package

The WRS2 package [@RN10205] wraps up a few of the many functions described by Wilcox to perform robust variants of tests [@RN5244]. We'll look at these functions that compare several means:

t1way(formula, data = tibble, tr = 0.2, nboot = 100): is a test for trimmed means.
lincon(formula, data = tibble, tr = 0.2): post hoc tests for trimmed means
t1waybt(formula, data = tibble, tr = 0.2, nboot = 599): a variant of the test above that also uses a bootstrap
mcppb20(formula, data = tibble, tr = 0.2, nboot = 599): post hoc tests for the above test

These functions have similar arguments:

formula: the functions use a formula like that for lm(), so in this case we could use recall ~ group.
data = tibble: replace tibble with the name of the tibble or data frame in which the data are stored (in this case data = memory_tib)
tr = 0.2: determines the level of trim. The default is a 20% trim, which has been shown to perform well, but you could change this to 0.1 (10% trim) or any value up to 0.5 (although if you want to compare medians you should a function called med1way()). You can omit this argument if you're happy with a 20% trim.
nboot = 599: specifies the number of bootstrap samples. There is a case for increasing the defaults to 1000.

See if you can run the three commands in the code box below based on my description (if not, press )

WRS2::t1way(recall ~ group, data = memory_tib)
WRS2::lincon(recall ~ group, data = memory_tib)
WRS2::t1waybt(recall ~ group, data = memory_tib, nboot = 1000)
WRS2::mcppb20(recall ~ group, data = memory_tib, nboot = 1000)

Based on t1way we'd conclude that the group means were significantly different, $F_t(2, 13.9) = 4.13, p = 0.039$, with recall scores similar in the control and erase conditions, $\hat{\psi} = 1 [-1.92, 3.92], p = 0.372$, and in the erase and replace groups, $\hat{\psi} = 2.5 [-0.69, 5.669], p = 0.053$, but were significantly lower in the replace group than both the control group, $\hat{\psi} = 3.5 [0.24, 6.76], p = 0.012$.

Based on t1waybt we'd conclude that the group means were significantly different, $F_t(2, 13.9) = 4.13, p = 0.029$, with recall scores similar in the control and erase conditions, $\hat{\psi} = 1 [-2.00, 3.25], p = 0.429$, and in the erase and replace groups, $\hat{\psi} = 2.5 [-1.25, 5.125], p = 0.079$, but were significantly lower in the replace group than both the control group, $\hat{\psi} = 3.5 [-0.63, 5.75], p = 0.045$ although the 95% confidence interval crossed zero.

Bayesian models

Bayes factors

Like in previous tutorials (adventr_11, adventr_14, adventr_15) we can use the BayesFactor package. In this scenario we use the anovaBF() function. This function has basically the same format as most of the other functions in this tutorial:

object = anovaBF(formula = outcome ~ group_variable, data = tibble, whichModels = "withmain", rscaleFixed = "medium")

The function uses default priors that can be specified as a number or as "medium" (the default), "wide", and "ultrawide". These labels correspond to r scale values of 1/2, $^\sqrt{2}/_2$, and 1. These priors are explained in adventr_11 and also read @RN9309.

So we would use the same formula that we have used throughout, and specify our tibble as memory_tib. It's a good idea to save this model into a new object (lets call it mem_bf) because you can do useful things with it (we didn't do this for the t1way() functions, but you could).

r hint_text("Tip: if you are using a tibble (rather than a data frame) then place the tibble's name into the function data.frame(), to convert it to a data frame. This step is necessary because (at the time of writing) the BayesFactor package doesn't accept tibbles.")

For our data (using the default prior) we could execute:

mem_bf <- memory_tib %>% 
  BayesFactor::anovaBF(formula = recall ~ group, data = .)
mem_bf

The first line creates the object mem_bf as described above and the second line prints it for us to see. Try this in the code box:

mem_bf <- memory_tib %>% 
  BayesFactor::anovaBF(formula = recall ~ group, data = .)
mem_bf

The value of 0.976 means that the probability of the data given the alternative hypothesis is 0.976 that given the null hypothesis. Put another way, you should shift your belief in the alternative hypothesis relative to the null by a factor of 0.976. There is basically equal evidence for the null and alternative hypothesis.

Other resources

Statistics

The tutorials typically follow examples described in detail in @RN10163, so for most of them there's a thorough account in there. You might also find @RN4832 useful for the R stuff.
There are free lectures and screen casts on my YouTube channel
There are free statistical resources on my website www.discoveringstatistics.com

R

Information on using ggplot2
R for data science is the open-access version of the book by tidyverse creator Hadley Wickham [@RN11404]. It covers the tidyverse and data management.
ModernDive is an open-access textbook on R and RStudio
RStudio cheat sheets
RStudio list of online resources
SwirlStats is a package for R that launches a bunch of interactive tutorials.

References

Any scripts or data that you put into this service are public.

adventr documentation built on July 1, 2020, 11:50 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

adventr
Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

adventr: more categorical predictors
In adventr: Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

An Adventure in R: Categorial predictors (several categories)

Overview

Story précis

Why a précis?

General context for the story

Main Protagonists

How Zach's adventure begins

Packages and data

Packages

Data

Categorical variables

Exploring the data

The model

Descriptive statistics

Plotting the data

Predictors with several categories

Fitting the model

Just to illustrate what's going on

Contrasts

Assigning contrasts

Contrast coding

Trends (polynomial contrasts)

Robust models

Robust standard errors

Robust estimates

Bootstrap confidence intervals

The WRS2 package

Bayesian models

Bayes factors

Other resources

Statistics

R

References

Try the adventr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

adventr Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

adventr: more categorical predictors In adventr: Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

An Adventure in R: Categorial predictors (several categories)

Overview

Story précis

Why a précis?

General context for the story

Main Protagonists

How Zach's adventure begins

Packages and data

Packages

Data

Categorical variables

Exploring the data

The model

Descriptive statistics

Plotting the data

Predictors with several categories

Fitting the model

Just to illustrate what's going on

Contrasts

Assigning contrasts

Contrast coding

Trends (polynomial contrasts)

Robust models

Robust standard errors

Robust estimates

Bootstrap confidence intervals

The WRS2 package

Bayesian models

Bayes factors

Other resources

Statistics

R

References

Try the adventr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

adventr
Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

adventr: more categorical predictors
In adventr: Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"