# Ensure that libraries are loaded. library(tidyverse) library(learnr) library(gradethis) library(knitr) library(kableExtra) library(haven) #For importing SPSS data files. library(car) #For ANOVA. library(papaja) #For APA formatted results tables. library(texreg) #For pretty regression results. library(effects) #For two-way interaction plots. library(broom) #For cleaning up statistical results. tutorial_options(exercise.timelimit = 20, exercise.checker = gradethis::grade_learnr) knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
# Ensure that the data is loaded for the remainder of this tutorial. glbwarm <- UsingRTutorials::glbwarm glbwarm_spss <- UsingRTutorials::glbwarm_spss # The estimated regression model with rstanarm. model_1aBayes <- UsingRTutorials::model_1aBayes
Course Content
Data Project
Let us practice with some of the most common statistical analyses in R.
Consult Sections 3 and 4 in Help, My Collaborator Uses R! An Introduction to Reproducible Statistical Analyses in R and R help on the functions that we use.
Example data: glbwarm
(accessible within this tutorial).
Source: Erik Nisbet; http://afhayes.com/
Inspect the variables in the Environment.
Main data types:
1. Number: govact
, posemot
, negemot
, age
.
2. Character: ideology
, sex
, partyid
.
summary(glbwarm)
t.test()
You already know how to execute an independent-samples t test (Session 5).
There are different versions of the same function for different t tests.
Usage (in ?t.test
):
t.test(x, ...) ### Default S3 method: t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...) ### S3 method for class 'formula' t.test(formula, data, subset, na.action, ...)
x
is for a one sample t test: specify the hypothesized population mean with argument mu =
. x, y
is for paired samples t tests. y
must be a variable with two categories. # Use the `t.test()` version that matches the kind of t test you need: on one # mean, paired samples, or independent samples.
# Note that the 'data = ' argument only works if we use the formula form 'y ~ x'. # Independent samples t test: t.test(negemot ~ sex, data = glbwarm) # For the other versions, the tibble name must be used and the dollar sign to # fuly define the variable. # t test on one mean (complete it yourself): t.test(glbwarm$negemot, ... )
# The code checker expects the three tests in the exact order as specified in # the question.
t.test(negemot ~ sex, data = glbwarm) t.test(glbwarm$negemot, mu = 3.0) t.test(glbwarm$negemot, glbwarm$posemot, paired = TRUE)
gradethis::grade_code( correct = "", incorrect = "" )
var.test()
In contrast to SPSS, R only gives you what you ask for.
Version of the independent samples t test that we must use, depends on whether the population variances are equal for the two groups.
vartest <- ____
vartest <- var.test(govact ~ sex, data = glbwarm)
gradethis::grade_code( correct = "", incorrect = "" )
vartest$p.value
gradethis::grade_code( correct = "`e-08` (scientific notation) means `* 10^-8`, that is, divided by 10 to the power 8 (100,000,000). Note that the results are stored as class htest, just like the results from `t.test()`.", incorrect = "Perhaps you used double square brackets instead of the dollar sign to pull out the p value. That's OK." )
In R, we can use a function within an argument of another function.
Example for an independent samples t test:
var-equal
argument is FALSE
by default.TRUE
if the p value of var.test()
is larger then .05.t.test(govact ~ sex, data = glbwarm, var.equal = _____ )
# You already executed the t test in this tutorial. Add the var.equal argument. t.test(govact ~ sex, data = glbwarm)
# In the preceding exercise, you pulled the p value from the stored test result. vartest$p.value # Add it to the var.equal argument in such a way that a p value over .05 yields TRUE.
# Replace the stored test result by the test function itself. t.test(govact ~ sex, data = glbwarm, var.equal = vartest$p.value > 0.05)
t.test(govact ~ sex, data = glbwarm, var.equal = var.test(govact ~ sex, data = glbwarm)$p.value > 0.05)
gradethis::grade_code( correct = "", incorrect = "" )
lm()
Usage (in ?lm
):
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)
See book p. 358-371 {Section 23.4} for using regression formulas:
model_1 <- lm(govact ~ age + negemot + partyid, data = glbwarm)
gradethis::grade_code( correct = "", incorrect = "Perhaps, you used the independent variables in a different order within the formula. That is fine." )
For quick inspection, data objects for results of statistical analyses always have:
summary()
function; print()
function . Not for presentation of results!
lm()
takes care of:
(This is easier than in SPSS.)
model_1a <- lm(govact ~ age*negemot + partyid, data = glbwarm) print(model_1a)
gradethis::grade_code( correct = "", incorrect = "Perhaps, you used the independent variables in a different order within the formula. That is fine." )
lm(govact ~ age + negemot*partyid, data = glbwarm)
gradethis::grade_code( correct = "", incorrect = "" )
lm()
and car::Anova()
In R, analysis of variance consists of two steps.
Step 1: ANOVA is linear regression with special contrasts (contr.sum
).
contr.sum
gives deviations from the mean. contrasts =
argument requires:contrasts = list()
;contrasts = list(sex = contr.sum, partyid = contr.sum)
model_2 <- lm(govact ~ sex * partyid, data = glbwarm, contrasts= ____ )
model_2 <- lm(govact ~ sex * partyid, data = glbwarm, contrasts=list(sex=contr.sum, partyid=contr.sum))
gradethis::grade_code( correct = "Have a look at the results: send model_2 to the screen.", incorrect = "" )
Step 2: Calculate the sums of squares partition.
Functions:
stats::anova()
for balanced designs.car::Anova()
for (balanced and) unbalanced designs (Type !! or III sums of squares). Anova(model_2)
gradethis::grade_code( correct = "", incorrect = "Perhaps you used the package name in the command, which is fine." )
The anova functions return a data frame, which you can use as any data frame.
For example, knit it to a pretty (HTML or PDF) table with knitr::kable()
.
We will do that later on in this tutorial.
How a stat::
package functions deal with missing values depends on the na.action =
argument:
na.omit
(default and preferred) or na.exclude
: listwise deletion;na.fail
: stops with an error.# Get the current option for na.action. getOption("na.action") # Set the option (if necessary). options(na.action = "na.omit")
There are several packages that help you to tabulate statistical results. The table below lists some of them with their characteristic features.
# Create a data frame for the contents of the table. dt <- data.frame( Package = c("base, stats", "papaja", "stargazer", "texreg"), Models = c("all", "t test, regression, anova", "regression", "regression"), Format = c("plain text", "PDF, Word (HTML)", "PDF, HTML, plain", "PDF, HTML, plain"), Style = c("-", "APA", "div., not APA", "generic"), Comparison = c("-", "stacked", "side-by-side", "side-by-side"), Peculiarities = c("summary() and print(), only for quick inspection", "2 steps: apa_print() and apa_table()", "", "texreg(), hmtlreg(), screenreg()"), stringsAsFactors = FALSE ) names(dt)[5] <- paste0(names(dt)[5], footnote_marker_symbol(1)) dt %>% knitr::kable(align = "llllll", escape = F) %>% #show with kable() from the knitr package kable_styling(full_width = T) %>% row_spec(0, font_size = 18) %>% footnote(symbol = "Results of two or more models in one table.")
One of our favorite packages for Open Science projects!
Package papaja
is not in the CRAN repository (or any of its mirrors).
# Execute this line of code in the RStudio console. remotes::install_github("crsh/papaja")
Statistical results tables in APA format require two papaja
commands:
apa_print()
: formats the results from various statistical methods in accordance with APA guidelines.apa_table()
: displays results as an APA format table.The below code produces the table of regression results.
# Attach the papaja package. library(papaja) # Estimate the regression model (as before). model_1a <- lm(govact ~ age*negemot + partyid, data = glbwarm) # Format the results of the regression model. model_1a_formatted <- apa_print(model_1a) # Display the results as an APA formatted table. apa_table(model_1a_formatted$table)
We will see more of papaja
in Session 7.
texreg
library(texreg) model_1 <- lm(govact ~ age + negemot + partyid, data = glbwarm) model_1a <- lm(govact ~ age*negemot + partyid, data = glbwarm) # Table for HTML output. texreg::htmlreg(list(model_1,model_1a), #the regression model(s) shown single.row = T, #coefficient and standard error on the same row star.symbol = "\\*", doctype = F, #better for Markdown document html.tag = F, #better for Markdown document head.tag = F, #better for Markdown document body.tag = F, #better for Markdown document caption = "", #no caption to save space on the slide custom.coef.names = c(NA, "Age", "Negative emotions", "Independent", "Republican", "Age*Neg. emotions"), vertical.align.px = 6) # For PDF output, use the texreg() function, with slightly different arguments (options). # Use Help to see more arguments.
model_1 <- lm(govact ~ age + negemot + partyid, data = glbwarm) model_1a <- lm(govact ~ age*negemot + partyid, data = glbwarm)
# Attach the texreg package. library(texreg) # Table for HTML output. texreg::htmlreg(list(model_1,model_1a), #the regression model(s) shown single.row = T, #coefficient and standard error on the same row star.symbol = "\\*", doctype = F, #better for Markdown document html.tag = F, #better for Markdown document head.tag = F, #better for Markdown document body.tag = F, #better for Markdown document caption = "", #no caption to save space on the slide custom.coef.names = c(NA, "Age", "Negative emotions", "Independent", "Republican", "Age*Neg. emotions"), vertical.align.px = 6) # For PDF output, use the texreg() function, with slightly different arguments (options). # Use Help to see more arguments.
htmlreg()
produces HTML code:
The results='asis'
code chunk option is needed to knit the html output of the code chunk as formatted text.
The full code chunk in the RMarkdown document (note the results='asis'
argument):
knitr::include_graphics("images/asis.png")
And this is what the knitted text looks like:
# Create regression data objects. model_1 <- lm(govact ~ age + negemot + partyid, data = glbwarm) model_1a <- lm(govact ~ age*negemot + partyid, data = glbwarm) # Attach the texreg package. library(texreg) # Table for HTML output. texreg::htmlreg(list(model_1,model_1a), #the regression model(s) shown single.row = T, #coefficient and standard error on the same row star.symbol = "\\*", doctype = F, #better for Markdown document html.tag = F, #better for Markdown document head.tag = F, #better for Markdown document body.tag = F, #better for Markdown document caption = "", #no caption to save space on the slide custom.coef.names = c(NA, "Age", "Negative emotions", "Independent", "Republican", "Age*Neg. emotions"), vertical.align.px = 6)
htmlreg()
is one example of a function that operates on lm() objects.
Other useful functions:
confint()
, coef()
, resid()
.broom
and knitr
For full control of your table, create it with packages broom
and knitr
.
(broom
is part of the tidyverse package)
You need 3 steps:
tidy()
in the broom
package to extract relevant statistics from a statistical results object into a tibble.knitr::kable()
and kableExtra
with all formatting options you need.tidy()
in the broom
package to extract relevant statisticsmodel_1a %>% tidy(conf.int = TRUE, conf.level = 0.95)
gradethis::grade_code( correct = "", incorrect = "" )
broom
produces a tibble (data frame), so you can wrangle it like any other.
model_1a %>% tidy(conf.int = TRUE, conf.level = 0.95) %>% mutate( estimate = format(round(estimate, digits = 2), nsmall = 2), p.value = format(round(p.value, digits = 3), nsmall = 3), CI = paste0( "[", format( round(conf.low, digits = 2), nsmall = 2 ), ", ", format( round(conf.high, digits = 2), nsmall = 2 ), "]" ) ) %>% select(term, estimate, p.value, CI)
If you want to use stars to mark the significance level of regression coefficients, you can add a new character variable showing the number of stars.
model_1a %>% tidy(conf.int = TRUE, conf.level = 0.95) %>% mutate( estimate = format(round(estimate, digits = 2), nsmall = 2), p.value = format(round(p.value, digits = 3), nsmall = 3), sig = case_when( p.value < .001 ~ "***", p.value < .01 ~ "**", p.value < .05 ~ "*", TRUE ~ "" ), CI = paste0( "[", format( round(conf.low, digits = 2), nsmall = 2 ), ", ", format( round(conf.high, digits = 2), nsmall = 2 ), "]" ) ) %>% select(term, estimate, p.value, CI)
knitr::kable()
and kableExtra
With knitr
and kableExtra
, we can create a table including footnotes.
model_1a %>% tidy(conf.int = TRUE, conf.level = 0.95) %>% mutate( estimate = format(round(estimate, digits = 2), nsmall = 2), p.value = format(round(p.value, digits = 3), nsmall = 3), sig = case_when( p.value < .001 ~ "***", p.value < .01 ~ "**", p.value < .05 ~ "*", TRUE ~ "" ), CI = paste0( "[", format( round(conf.low, digits = 2), nsmall = 2 ), ", ", format( round(conf.high, digits = 2), nsmall = 2 ), "]" ) ) %>% select(term, estimate, sig, CI) %>% #p.value dropped kable(digits = c(0, 2, 0, 0), col.names = c("Parameter", "B", "", "95% CI"), align = "lrlc", caption = "Table 1. Predicting opinions about global warming.", booktabs = TRUE, #nicer layout in PDF escape = FALSE #pay attention to special characters ) %>% kable_styling(full_width = FALSE) %>% row_spec(0, font_size = 16) %>% column_spec(1, width = "5cm") %>% column_spec(2, width = "3cm") %>% column_spec(3, width = "0.5cm") %>% column_spec(4, width = "5cm") %>% footnote( general_title = "", general = " * p < .05. ** p < .01. *** p < .001." )
Some final points about tabulating results:
Special characters such as stars (*
) and percentage signes (%
) can be troublesome in tables. You may have to escape them with one or more backslashes (\\
).
PDF output has more formatting options than HTML (or Word).
kable()
does not knit nicely to Word. Knit to HTML and import HTML in Word.
plot()
functionA data object with statistical results usually has a plot()
function:
model_1 <- lm(govact ~ age + negemot + partyid, data = glbwarm) model_1a <- lm(govact ~ age*negemot + partyid, data = glbwarm)
There are many packages offering ready-to-use plots, for example:
papaja
: plots for analysis of variance.coefplot
: plots regression coefficients for one or more models (ggplot2
plots)visreg
: plots regression lines (ggplot2
plots).effects
: plots regression lines (not ggplot2
plots).Note that ggplot2
plots created by such packages can be further customized: Save the plot (e.g., p
) and then add layers, themes, ... (e.g., p + theme_bw()
).
effects
PackageYou can graph interaction effects with the effects
package in two steps.
# Load effects package. library(effects) # Step 1: Create a data object containing all effects. eff.model2 <- effects::allEffects(model_1a) # Step 2: Plot interaction effects. plot(eff.model2, 'age:negemot', x.var = 'age')
Note the rug
on the horizontal axis, showing the age score of all cases within a negemot
group.
ggplot()
It is not so difficult to create this plot with ggplot()
.
Advantage: Full control. E.g., why does the plot from the effects
package skip negative emotions around three?
# Create the binned negative emotions variable. glbwarm %>% mutate(negemot_bin = case_when( negemot < 1.5 ~ 1, negemot < 2.5 ~ 2, negemot < 3.5 ~ 3, negemot < 4.5 ~ 4, negemot < 5.5 ~ 5, negemot >= 5.5 ~ 6 ) )
# Pipe the tibble into ggplot() and use geom_smooth(). glbwarm %>% mutate(negemot_bin = case_when( negemot < 1.5 ~ 1, negemot < 2.5 ~ 2, negemot < 3.5 ~ 3, negemot < 4.5 ~ 4, negemot < 5.5 ~ 5, negemot >= 5.5 ~ 6 ) ) %>% ggplot( ) + geom_smooth( )
# Use geom_rug() to represent all observations on the horizontal axis.
# Use facet_wrap() on the binned negative emotions variable.
glbwarm %>% mutate(negemot_bin = case_when( negemot < 1.5 ~ 1, negemot < 2.5 ~ 2, negemot < 3.5 ~ 3, negemot < 4.5 ~ 4, negemot < 5.5 ~ 5, negemot >= 5.5 ~ 6)) %>% ggplot(aes(x = age)) + geom_smooth(aes(y = govact), method = lm) + geom_rug() + facet_wrap(vars(negemot_bin))
Do you notice differences between your plot and the plot created with the effects
package?
Which plot do you trust more?
It is not that difficult to create a means plot showing the results of analysis of variance.
glbwarm %>% group_by(partyid, sex) %>% summarise(avg_govact = mean(govact)) %>% ggplot(aes(x = partyid, y = avg_govact, color = sex)) + geom_line(aes(group = sex)) + geom_point() + labs(x = "Party identification", y = "Gov.intervention") + scale_y_continuous( limits = c(min(glbwarm$govact), max(glbwarm$govact)), breaks = 1:7 ) + theme_bw() + theme(legend.position = c(0.8, 0.8), legend.background = element_blank())
glbwarm %>% group_by(partyid, sex) %>% summarise(avg_govact = mean(govact)) %>% ggplot( ____ )
# First calculate the group means that must be shown. glbwarm %>% group_by(partyid, sex) %>% summarise(avg_govact = mean(govact)) # Important: You plot summaries now, not the original observations.
# Use geom_point() to show the dots.
# Use geom_line() to show the lines with the group argument.
# Use theme_bw() for the general appearance of the plot. # More on this in Session 7.
# Use legend.position and legend.background within theme() # for the fine details of the legend.
SPSS data files have a complicated setup with variable labels and value labels.
R data frames or tibbles do not have such labels.
In case you later have SPSS data that you want to analyze in R, here are two options for importing SPSS data.
read_csv()
(as you learned before) to import the CSV file. # The CSV is available in the data directory of this tutorial. glbwarm <- read_csv("data/glbwarm.csv") # Inspect the variables. str(glbwarm)
haven
.The tidyverse package haven
contains function read_sav()
(or read_spss()
) for importing SPSS (and other software packages) data files.
# The SPSS system file is available in the data directory of this tutorial. glbwarm_spss <- haven::read_sav("data/glbwarm.sav") # Inspect the variables. str(glbwarm_spss)
Imported categorical variables such as ideology:
# Add sex as a factor to the tibble. glbwarm_spss <- glbwarm_spss %>% mutate(sex_fct = haven::as_factor(sex)) # Inspect the original and new variable. glbwarm_spss %>% count(sex, sex_fct)
Passing the entire tibble glbwarm_spss
to haven::as_factor()
will change all labelled variables into factors.
I am not sure this is what you want here. Perhaps you would like to use the ideology
variable as a seven-point scale.
If you can execute regression models in R, you can also execute these using Bayesian statistics instead of traditional (frequentist) statistics.
The popularity of Bayesian statistics as an alternative to null hypothesis significance testing is growing. If you want to be among the first in your field going Bayesian, check out the short introduction provided in Help, My Collaborator Goes Bayesian! Why And How To Apply Bayesian Data Analysis. Section 3.2 of the document offers a short introduction to using the rstanarm
package for Bayesian data analysis.
The below code estimates a regression model predicting support for governmental action (govact
) from age, negative emotions and party identification, with an interaction effect of age with negative emotions.
# Load the rstanarm package. library(rstanarm) # Estimate the regression model with rstanarm. model_1aBayes <- rstanarm::stan_glm(govact ~ age * negemot + partyid, data = glbwarm) # Standard output to screen. print(model_1aBayes)
# Shows only the output of print. print(model_1aBayes)
Bayesian estimation yields a probability distribution for every parameter.
The printed summary gives the median of the probability distribution of a regression coefficient as its point estimate. In addition, it shows the Mean Average Deviation of the probability distribution: a simple type of standard deviation.
It is easy to get the posterior distributions as a data frame or tibble, so you can find any probability you like for a parameter.
# Extract the posteriors from the fitted model to a tibble. posteriors <- as_tibble(model_1aBayes) # Overview of variables in posteriors: each independent variable and the error term (sigma). str(posteriors) # Calculate the probability that the effect of negemot is larger than 0 in the population. prob <- posteriors %>% summarize(b_negemot = mean(negemot > 0)) # Plot the probability distribution and display this information. posteriors %>% mutate(positive = negemot > 0) %>% ggplot(mapping = aes(x = negemot)) + geom_histogram( aes(fill = positive), boundary = 0, bins = 30, show.legend = FALSE ) + geom_text(aes(label = prob$b_negemot, x = 0.15, y = 100))
The function launch_shinystan()
in the shinystan::
package (automatically loaded by package rstanarm::
) offers an interactive overview of estimation and model checks and results.
shinystan::launch_shinystan(model_1aBayes) # Note: This function does not work from within a tutorial.
Statistical analyses are not necessary for the Data Project. The Data Project focuses on visualizations.
You can, however, use statistical analysis to detect patterns in your data that you then try to visualize. If you do that, do not use off-the-shelf plots. Show that you can create a plot that hopefully is more attractive and more informative than off-the-shelf statistical plots.
Last 15 minutes of the session.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.