# knitr settings
knitr::opts_chunk$set(
  # Code output:
  warning = FALSE,
  message = FALSE,
  echo = TRUE,
  # Figure:
  out.width = "100%",
  fig.width = 16 / 2.5,
  fig.height = 9 / 2.5,
  fig.align = "center",
  fig.show = "hold",
  # Etc:
  collapse = TRUE,
  comment = "##"
  # tidy = FALSE
)

# Needed packages in vignette
library(moderndive)
library(ggplot2)
library(dplyr)
library(knitr)
library(broom)

# Needed packages internally
library(patchwork)

# Random number generator seed value
set.seed(76)

# Set ggplot defaults for rticles output:
if (!knitr::is_html_output()) {
  # Grey theme:
  theme_set(theme_light())

  scale_colour_discrete <- ggplot2::scale_colour_viridis_d
}

# Set output width for rticles:
options(width = 70)

We present the `moderndive` R package of datasets and functions for tidyverse-friendly introductory linear regression [@tidyverse2019]. These tools leverage the well-developed `tidyverse` and `broom` packages to facilitate 1) working with regression tables that include confidence intervals, 2) accessing regression outputs at the observation level (e.g. fitted/predicted values and residuals), 3) inspecting scalar summaries of regression fit (e.g. $R^2$, $R^2_{adj}$, and mean squared error), and 4) visualizing parallel slopes regression models using `ggplot2`-like syntax [@R-ggplot2; @R-broom]. This R package is designed to supplement the book "Statistical Inference via Data Science: A ModernDive into R and the Tidyverse" [@ismay2019moderndive]. The book is also available online at https://moderndive.com and is referred to as "ModernDive" for short.

Linear regression has long been a staple of introductory statistics courses. While the curricula of introductory statistics courses have evolved considerably of late, the overall importance of regression remains the same [@ASAGuidelines]. Furthermore, while the use of the R statistical programming language for statistical analysis is not new, recent developments such as the `tidyverse` suite of packages have made statistical computation with R accessible to a broader audience [@tidyverse2019]. We go one step further by leveraging the `tidyverse` and `broom` packages to make linear regression accessible to students taking an introductory statistics course [@R-broom]. Such students are likely to be new to statistical computation with R; we designed `moderndive` with these students in mind.

Let's load all the R packages we are going to need.
library(moderndive)
library(ggplot2)
library(dplyr)
library(knitr)
library(broom)

Let's consider data gathered from end-of-semester student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin [@diez2015openintro]. These data are included in the `evals` data frame from the `moderndive` package.

evals_sample <- evals %>%
  select(ID, prof_ID, score, age, bty_avg, gender, ethnicity, language, rank) %>%
  sample_n(5)

In the following table, we present a subset of `r ncol(evals_sample)` of the `r ncol(evals)` variables included for a random sample of `r nrow(evals_sample)` courses^[For details on the remaining `r ncol(evals) - ncol(evals_sample)` variables, see the help file by running `?evals`.]:

1. `ID` uniquely identifies the course, whereas `prof_ID` identifies the professor who taught it. This distinction is important since many professors taught more than one course.
2. `score` is the outcome variable of interest: the average professor evaluation score out of 5, as given by the students in this course.
3. `bty_avg` is the average "beauty" score for that professor, as given by a panel of 6 students.^[Note that `gender` was collected as a binary variable at the time of the study (2005).]

evals_sample %>%
  kable()

Let's fit a simple linear regression model of teaching `score` as a function of instructor `age` using the `lm()` function.

score_model <- lm(score ~ age, data = evals)

Let's now study the output of the fitted model `score_model` "the good old-fashioned way": using `summary()`, which calls `summary.lm()` behind the scenes (we'll refer to them interchangeably throughout this paper).

summary(score_model)
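
The printed output of `summary()` is meant to be read rather than computed with. As a minimal sketch of what extracting the same numbers in base R might look like (assuming the `moderndive` package is installed so that `evals` is available; the object names `model_summary` and `coef_matrix` are ours, chosen for illustration):

```r
library(moderndive)  # provides the evals data frame

# Refit the model from the text so this block runs on its own
score_model <- lm(score ~ age, data = evals)
model_summary <- summary(score_model)

# The coefficient table is a numeric matrix with non-syntactic
# column names, not a data frame
coef_matrix <- coef(model_summary)
colnames(coef_matrix)

# Confidence intervals are not part of the summary and need a separate call
confint(score_model, level = 0.95)

# Scalar summaries are stored as individual list elements
model_summary$r.squared
model_summary$adj.r.squared
model_summary$sigma
```

Each piece lives in a different place and has a different type, which is the inconvenience the wrapper functions discussed in this paper aim to remove.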
moderndive
As an improvement to base R's regression functions, we've included three functions in the `moderndive` package that take a fitted model object as input and return the same information as `summary.lm()`, but in a tidyverse-friendly format [@tidyverse2019]. As we'll see later, while these three functions are thin wrappers around existing functions in the `broom` package for converting statistical objects into tidy tibbles, we modified them with the introductory statistics student in mind [@R-broom].

get_regression_table(score_model)
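
For comparison, here is a hedged sketch that places `get_regression_table()` next to `broom::tidy()`. It assumes the `conf.level` and `digits` arguments of `get_regression_table()` behave as documented in the package help file; the object name `regression_table` is ours.

```r
library(moderndive)
library(broom)

# Refit the model from the text so this block runs on its own
score_model <- lm(score ~ age, data = evals)

# broom: confidence intervals are opt-in
tidy(score_model, conf.int = TRUE, conf.level = 0.95)

# moderndive: confidence intervals are included by default and
# numeric values are rounded to `digits` decimal places
regression_table <- get_regression_table(score_model, conf.level = 0.95, digits = 3)
regression_table
```

The result is a tibble with one row per model term, so it can be passed directly to `dplyr` verbs or to `kable()`.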
get_regression_points(score_model)
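
As an illustrative sketch, `get_regression_points()` can also be pointed at new observations through its `newdata` argument. The data frame `new_instructors` below is hypothetical and made up for this example; we assume that when `newdata` lacks the outcome variable, only fitted values are returned.

```r
library(moderndive)

# Refit the model from the text so this block runs on its own
score_model <- lm(score ~ age, data = evals)

# Default: one row per observation used to fit the model, with the
# fitted value and the residual alongside the original variables
regression_points <- get_regression_points(score_model)
head(regression_points)

# Hypothetical new instructors (ages invented for illustration)
new_instructors <- data.frame(age = c(30, 45, 60))

# Predicted teaching scores for the new observations
new_points <- get_regression_points(score_model, newdata = new_instructors)
new_points
```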
get_regression_summaries(score_model)
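
To connect these scalar summaries to quantities students can compute themselves, the following sketch compares the output with `broom::glance()` and with a mean squared error computed by hand from the residuals. The names `regression_summaries` and `mse_by_hand` are ours; small differences from the reported value are expected because the package rounds its output.

```r
library(moderndive)
library(broom)

# Refit the model from the text so this block runs on its own
score_model <- lm(score ~ age, data = evals)

# broom: one-row tibble of model-level statistics
glance(score_model)

# moderndive: one-row tibble with student-friendly column names
regression_summaries <- get_regression_summaries(score_model)
regression_summaries

# Mean squared error computed directly from the residuals
mse_by_hand <- mean(residuals(score_model)^2)
mse_by_hand
```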

Furthermore, say you would like to create a visualization of the relationship between two numerical variables and a third categorical variable with $k$ levels. Let's create this using a colored scatterplot via the `ggplot2` package for data visualization [@R-ggplot2]. Using `geom_smooth(method = "lm", se = FALSE)` yields a visualization of an interaction model where each of the $k$ regression lines has its own intercept and slope. For example, in \autoref{fig:interaction-model}, we extend our previous regression model by now mapping the categorical variable `ethnicity` to the `color` aesthetic.

# Code to visualize interaction model:
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")
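
The lines drawn in the interaction plot are fit separately for each group, which gives the same intercepts and slopes as a single model with an interaction term. A minimal sketch of fitting that model and tabulating its estimates (the names `interaction_model` and `interaction_table` are ours):

```r
library(moderndive)

# The * operator expands to age + ethnicity + age:ethnicity, so each
# ethnicity level gets its own intercept and its own slope for age
interaction_model <- lm(score ~ age * ethnicity, data = evals)

# One row per estimated term, with confidence intervals
interaction_table <- get_regression_table(interaction_model)
interaction_table
```

With $k$ levels, the table has $2k$ rows: a baseline intercept and slope, plus $k - 1$ intercept offsets and $k - 1$ slope offsets.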

However, many introductory statistics courses start with the easier-to-teach "common slope, different intercepts" regression model, also known as the parallel slopes model, and `geom_smooth()` has no argument for plotting such models.

Evgeni Chasnovski thus wrote a custom `geom_` extension to `ggplot2` called `geom_parallel_slopes()`; this extension is included in the `moderndive` package. Much like `geom_smooth()` from the `ggplot2` package, you add `geom_parallel_slopes()` as a layer to the code, resulting in \autoref{fig:parallel-slopes-model}.

# Code to visualize parallel slopes model:
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")
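
To see the estimates behind the parallel lines, one could fit the additive model directly. This sketch assumes the plotted lines correspond to a model with a shared slope for `age` and a separate intercept per `ethnicity` level; the names `parallel_slopes_model` and `parallel_slopes_table` are ours.

```r
library(moderndive)

# The + operator omits the interaction term, so all groups share
# one slope for age and differ only in their intercepts
parallel_slopes_model <- lm(score ~ age + ethnicity, data = evals)

# One row per estimated term, with confidence intervals
parallel_slopes_table <- get_regression_table(parallel_slopes_model)
parallel_slopes_table
```

With $k$ levels, this table has $k + 1$ rows: a baseline intercept, one common slope, and $k - 1$ intercept offsets.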

In the GitHub repository README, we present an in-depth discussion of six features of the `moderndive` package:
ggplot2
Furthermore, we discuss the inner workings of the `moderndive` package:

1. Use of the `broom` package in its wrappers
2. A custom `ggplot2` geometry for the `geom_parallel_slopes()` function that allows for quick visualization of parallel slopes models in regression.

Albert Y. Kim and Chester Ismay contributed equally to the development of the `moderndive` package. Albert Y. Kim wrote a majority of the initial version of this manuscript with Chester Ismay writing the rest. Max Kuhn provided guidance and feedback at various stages of the package development and manuscript writing.

Many thanks to Jenny Smetzer \@smetzer180, Luke W. Johnston \@lwjohnst86, and Lisa Rosenthal \@lisamr for their helpful feedback on this paper and to Evgeni Chasnovski \@echasnovski for contributing the `geom_parallel_slopes()` function via GitHub pull request. The authors do not have any financial support to disclose.