```r
## ---- Load Packages ----
pkgs <- c("tidyverse", "knitr", "IntroAnalysis")
for (pkg in pkgs) library(pkg, character.only = TRUE)

## ---- Change Options ----
# Suppress status bar in dplyr.
# Change handling of ordered factors.
options(dplyr.show_progress = FALSE,
        contrasts = rep("contr.treatment", 2))

# Change theme for plots.
theme_set(theme_bw(12))
theme_update(legend.position = "bottom")
```
This document provides a list of the code most commonly used in the course. There may be times, especially when working on an independent capstone project, when you want to move beyond what is discussed in class. There are many resources available online, or feel free to ask for guidance from the instructor. However, the majority of code needed for the course is documented here.
Throughout this document, we will be making use of the `mtcars` dataset available with all R installations. More information about the variables in this dataset can be found by typing `?mtcars` into your R console or typing `mtcars` into the help search box in RStudio. Here are the first few records of the dataset:
```r
mtcars <- datasets::mtcars

mtcars |>
  head() |>
  knitr::kable(digits = 3)
```
As you progress through the following examples, remember that R is case sensitive.
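For instance (a minimal illustration), `mpg` and `MPG` are treated as completely different names:

```r
# R is case sensitive: mtcars contains a column named mpg, not MPG.
mean(mtcars$mpg)
# mtcars$MPG would return NULL, since the column is named mpg.
```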
Before proceeding to the specific aspects of the course, let's first cover a few basics of programming in R.
Throughout the course we will be using a standard suite of packages (collections of functions which provide additional analytical capabilities). In order to ensure that you have all required functionality for the course, be sure that each RMarkdown file you write has the following block at the top, which should not be altered:
```r
IntroAnalysis:::ma223_setup()
```
An RMarkdown file is composed of two components: basic text and "code chunks" which perform the analysis. When the document is compiled ("knit"), the code is evaluated and the results are added seamlessly into your document. This encourages a reproducible workflow (someone can recreate all your results should the data change).
Code blocks can be added using the "Insert" button at the top of the script window. Code blocks have the following form:
cat("```r","\n", "Code goes here...","\n", "```", "\n", sep="")
Notice that the code blocks always begin and end with the triple backticks (found on your keyboard in the upper-left-hand corner with the tilde).
Most code chunks look much the same. Below is a very simple code chunk. Results are "assigned" to an object using an equal sign (or the arrow `<-`). The code essentially says that the thing on the left will contain anything generated on the right.
```r
# Comments (not evaluated) begin with a hashtag.
ObjectName = code to generate results
```
Just typing the name of an object prints that object out; that output, in turn, will appear in your final document.
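For instance, the following minimal chunk assigns a value and then prints it:

```r
# Assign the result of a computation to a name...
Result = 3 + 4
# ...then print it by typing the name alone.
Result
```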
If we want to do something to that object, we will make use of the "pipe operator" (`|>`), which essentially says take the thing before and use it in the next function. For example, we might summarise an object as follows:
```r
ObjectTwo = ObjectOne |> summarise()
```
This code says take the thing called "ObjectOne" and "pipe it into" (use within) the summarise function; whatever the result of this operation is will be called "ObjectTwo."
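As a concrete sketch (the object names here are only for illustration), we could compute the average fuel efficiency in `mtcars` this way:

```r
# Take mtcars, pipe it into summarise(), and store the result.
avg.mpg = mtcars |> summarise(Average_MPG = mean(mpg))
avg.mpg
```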
Data is rarely perfect when we first obtain it. Oftentimes, we will need to make changes. There are a few tasks which are performed regularly; these can generally be accomplished with the following functions:
- `mutate()`: manipulate variables or make transformations.
- `filter()`: subset the data based on some criteria.
- `case_match()`: recode the levels of a variable.

As an example, while the US measures fuel efficiency in miles per gallon, other countries often report the gallons required for a single mile. This new variable `Gallons_per_Mile` can be created using `mutate()`:
```r
mtcars = mtcars |>
  mutate(Gallons_per_Mile = 1/mpg)
```
Notice that we use the pipe operator (`|>`) here; this essentially says take the dataset `mtcars` and feed it into the `mutate()` function. The `mutate()` function itself uses the notation `New_Variable_Name = transformation of stuff`. In this case, we create a new variable within the dataset `mtcars`. The result is stored within a dataset called `mtcars` (since that is on the left of the equal sign), essentially overwriting the original dataset.
The first few records of this dataset are shown below:
```r
mtcars |>
  head() |>
  knitr::kable(digits = 3)
```
If we only want to look at a subset of the data, we use the `filter()` function. For example, suppose we only want to consider vehicles which have automatic transmissions and would like to store those records in a new dataset called `automatic.only`. We use the following code:
```r
automatic.only = mtcars |>
  filter(am == 0)
```
When making comparisons, `==` asks "are you equal?" We keep the records for which we would answer "yes" to this question. The original `mtcars` dataset remains unchanged. Instead, we have taken it, filtered it down, and then stored the results in `automatic.only` (the first few records of which are shown below).
```r
automatic.only |>
  head() |>
  knitr::kable(digits = 3)
```
When we work with categorical variables, we often want the values (or group labels) to be something readable. In order to do this, we may need to recode the data. Consider the variable `am` (transmission type). It is currently a 0/1 variable, and it is not obvious that 1 represents a manual transmission and 0 an automatic transmission. We can make this clearer by mutating the dataset and matching each level of the variable with a new value, thereby recoding that variable.
```r
mtcars = mtcars |>
  mutate(am = case_match(am,
                         0 ~ "automatic",
                         1 ~ "manual"))
```
The first few records of the updated dataset are shown below:
```r
mtcars |>
  head() |>
  knitr::kable(digits = 3)
```
If we now wanted to go further and change "automatic" to "Automatic Transmission", we could recode again; this time, since the values are already phrases rather than numbers, we do not need backticks.
mtcars = mtcars |> mutate(am = case_match(am, "automatic" ~ "Automatic Transmission", "manual" ~ "Manual Transmission"))
If your dataset has groups coded as numbers (like above when 0 = automatic and 1 = manual), it is best to recode in order for graphics and models to work appropriately.
Numerical summaries (statistics) tell us about the distribution of our sample. These also estimate the corresponding parameters of interest in the population. The functions we introduce below make use of "formula" notation in R. We take a second to introduce that here, and it will make more sense in the later modeling sections. Let's compare two basic formulas:
```r
y ~ 1
y ~ group
```
Both formulas have two essential components: a left-hand side and a right-hand side separated by the `~`. Think of the `~` as an "equal sign" in a mathematical formula or as a way of saying "related to." The first formula says that the variable `y` is not related to anything; you are interested in summarizing it across all records. The second formula says you want to summarize the variable `y` within groups which are defined by the variable `group`.
Of course, the "variable y
" is not very helpful unless you tell the computer where to look for that variable. So, formulas are often accompanied by a data =
argument to allow you to specify the data set in which the variable is located.
The `summarize_variable()` function allows us to compute individual summaries of variables. The following code computes the average mpg of vehicles in the sample as well as the standard deviation.
```r
summarize_variable(mpg ~ 1, data = mtcars, mean, sd)
```
The average fuel efficiency might differ for automatic and manual transmissions. In order to compute the above summaries for each type of car, we include the grouping variable.
```r
summarize_variable(mpg ~ am, data = mtcars, mean, sd)
```
After the `data =` option, we specify key words for computing summaries. The most commonly used key words are given in a table later in this section. To compute a proportion, we need to specify the value for which we want the proportion:
summarize_variable((am=="Manual Transmission") ~ 1, data = mtcars, percent)
The following table contains a list of functions used to construct common summaries.
| Numerical Summary   | Keyword/Function |
|---------------------|------------------|
| Mean/Average        | `mean`           |
| Standard Deviation  | `sd`             |
| Median              | `median`         |
| Minimum             | `min`            |
| Maximum             | `max`            |
| 25-th Percentile    | `Q1`             |
| 75-th Percentile    | `Q3`             |
| Interquartile Range | `IQR`            |
| Percent             | `percent`        |
| Proportion          | `mean`           |
| 1-st Percentile     | `perc01`         |
| 5-th Percentile     | `perc05`         |
| 10-th Percentile    | `perc10`         |
| 20-th Percentile    | `perc20`         |
| 80-th Percentile    | `perc80`         |
| 85-th Percentile    | `perc85`         |
| 90-th Percentile    | `perc90`         |
| 95-th Percentile    | `perc95`         |
| 99-th Percentile    | `perc99`         |

Table: List of functions for numerical summaries
If missing values are present in the data, these summaries cannot be computed. In order to perform the computations excluding the missing values, pass the argument `na.rm = TRUE` to the function:
```r
summarize_variable(mpg ~ 1, data = mtcars, mean, .args = list(na.rm = TRUE))
```
When numerically summarizing the relationship between two variables, we are typically interested in computing the correlation coefficient relating two numeric variables. This is done with `summarize_relationship()`. The following code computes the correlation relating the fuel efficiency with the horsepower of the engine.
```r
summarize_relationship(mpg ~ hp, data = mtcars)
```
Again, the formula interface is used to specify which two variables we are relating.
A picture is worth 1000 words. A good graphical summary of the data should not only be a first step in an analysis, but it should also accompany any good analysis to help convey the message of the data. All graphics in this class will be produced starting with the `ggplot()` function. While this codebook provides a set of recipes commonly encountered in class, if you are interested in moving beyond what is described here, I would encourage you to look at the Ggplot Cheat Sheet.
There is a bit of a learning curve to using `ggplot()`. If you would like, you can also click on `Add Ins > 'ggplot2' builder` (under the `ESQUISSE` heading after installing the `esquisse` package); this will allow you to create a graphic interactively and then export the code.
The `ggplot()` function takes a single component:

- `data =` specifies the data set in which the variables exist.

Additional components of a graphic are "added" on in layers; the most important of these maps the aesthetics --- that is, it communicates what variables are mapped to each axis in the plot. This is done with the `aes()` function, which (optionally) can have the following components:

- `x =` specifies the variable for the x-axis.
- `y =` if needed, specifies the variable for the y-axis.
- `color =` specifies which variable in the data determines the exterior color of the objects.
- `fill =` specifies which variable in the data determines the fill-color of the objects plotted.
- `size =` specifies which variable in the data determines the size of the objects plotted.
- `shape =` specifies which variable in the data determines the shape of the objects plotted.

We can then add labels to our graphic using the `labs()` function, which (optionally) takes the following components:

- `title = "title text"` title for the graphic.
- `subtitle = "subtitle text"` subtitle for the graphic.
- `caption = "caption text"` caption for the graphic.
- `x = "x text"` label for the x-axis.
- `y = "y text"` label for the y-axis.
- `color = "color text"` label for the color legend.
- `fill = "fill text"` label for the fill legend.
- `size = "size text"` label for the size legend.
- `shape = "shape text"` label for the shape legend.

If you would like a blank label, simply enter `x = ""`, for example.
The last thing you need is to actually add the various summaries for the data; these are done through "geometric representations" (or "geoms"). Choosing the geom changes how the data is plotted:
| Graphic Type | Geom             |
|--------------|------------------|
| Bar Chart    | `geom_bar`       |
| Histogram    | `geom_histogram` |
| Density Plot | `geom_density`   |
| Boxplot      | `geom_boxplot`   |
| Violin Plot  | `geom_violin`    |
| Jitter Plot  | `geom_jitter`    |
| Scatterplot  | `geom_point`     |
| Smoother     | `geom_smooth`    |
| Line Chart   | `geom_line`      |

Table: List of common geoms for creating graphics.
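Putting these pieces together, every graphic follows the same general pattern. Here is a minimal sketch using variables from `mtcars` (the particular aesthetics and geom are placeholders):

```r
# General pattern: data, then aesthetics, then labels, then a geom.
ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon") +
  geom_point()
```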
We now illustrate various ways of combining these geoms to create common graphics.
A bar chart is the simplest (and often best) method for summarizing the distribution of a categorical variable. The key steps to making a bar chart include:

- Add `geom_bar()`.

For example, we can summarize the number of vehicles with each type of transmission using the following code.
ggplot(data = mtcars) + aes(x = am) + labs(x = "", y = "Frequency") + geom_bar()
We remove the label on the x-axis because the value of the variable makes it clear what is being summarized.
If you have multiple categorical variables, consider using additional aesthetics to group variables. We will consider a bar chart summarizing the type of engine and the type of transmission for each vehicle. First, we clean up the way that the variable `vs` is coded:
```r
mtcars = mtcars |>
  mutate(vs = recode(vs,
                     `0` = "V Engine",
                     `1` = "Straight Engine"))
```
Now, we construct the graphic using the following steps:

- Use `fill =` to set the secondary variable.
- Add `geom_bar()`.

Again, we remove the labels of both the `fill` and the `x` variables because the values are self-evident.
```r
ggplot(data = mtcars) +
  aes(x = am, fill = vs) +
  labs(x = "", fill = "", y = "Frequency") +
  geom_bar()
```
When looking at the distribution of a single variable, a histogram is a reasonable choice, especially for medium-sized datasets.

- Add `geom_histogram()`.
- Set the `binwidth =` option inside of `geom_histogram()` to a reasonable size (how large you want each bin to be). This changes for each problem and is a matter of opinion on what gives a reasonable-looking graphic. Leaving this option off will result in the computer choosing a default.

```r
ggplot(data = mtcars) +
  aes(x = mpg) +
  labs(y = "Frequency", x = "Miles per Gallon") +
  geom_histogram(binwidth = 1)
```
In this graphic, we set the bin width to be 1. We could also have considered a larger bin width of 3, for example:
ggplot(data = mtcars) + aes(x = mpg) + labs(y = "Frequency", x = "Miles per Gallon") + geom_histogram(binwidth = 3)
If you exclude the `binwidth =` option, it will choose a value for you:

```r
ggplot(data = mtcars) +
  aes(x = mpg) +
  labs(y = "Frequency", x = "Miles per Gallon") +
  geom_histogram()
```
A density plot is like a smooth histogram; this works best with larger datasets.
- Add `geom_density()`.

```r
ggplot(data = mtcars) +
  aes(x = mpg) +
  labs(y = "Density", x = "Miles per Gallon") +
  geom_density()
```
The y-axis on this graphic is not directly interpretable; essentially, the values have been chosen such that the area under the curve is equal to 1. Our primary interest here is that larger values indicate the corresponding values on the x-axis are more likely.
A boxplot gives a quick summary of location and spread for a numeric variable.
- Add `geom_boxplot()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = "") +
  labs(y = "Miles per Gallon", x = "") +
  geom_boxplot()
```
If we are examining the distribution of a quantitative variable across various groups formed by a categorical variable, we can use side-by-side graphics, such as boxplots.
- Add `geom_boxplot()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_boxplot()
```
Boxplots summarize the center and spread of the data, but the shape is lost. Violin plots generalize boxplots by incorporating the shape.
- Add `geom_violin()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_violin()
```
When we have a small sample, boxplots are inappropriate because they hide the raw data and present an overly confident summary. Instead, we prefer to place the raw values on the plot.
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_point()
```
If the plot looks too overwhelming because you have just enough points that they overlap, you can "jitter" the points:

- Add `geom_jitter()`.
- Set `height = 0` within `geom_jitter()` to prevent the data from being jittered vertically, and set `width = 0.25` to narrow the amount of jitter added horizontally.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_jitter(height = 0, width = 0.25)
```
Boxplots, Violin Plots, and Individual Value Plots can all be extended to include additional groups by adding aesthetics. For example, we can construct a side-by-side boxplot which also fills according to the engine type.

- Use an additional aesthetic (`fill =`, `color =`, etc.).
- Add `geom_boxplot()`, for example.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am, fill = vs) +
  labs(y = "Miles per Gallon", x = "", fill = "") +
  geom_boxplot()
```
When examining the relationship between two quantitative variables, we can use a scatterplot.
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)") +
  geom_point()
```
Just as with other plots, we can add additional features to a scatterplot. In general, two guiding principles help when thinking about these additional aesthetics: use color or shape to distinguish the levels of a categorical variable, and use color or size to represent the values of a quantitative variable.
The following graphic examines the relationship between the fuel efficiency (MPG) of a vehicle and its weight, while also considering the transmission type of the vehicle. Since "transmission type" is categorical, we use the color to separate these groups.
- Use `color =`.
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, color = am) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", color = "") +
  geom_point()
```
We could also have used the shape of the points instead.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, shape = am) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", shape = "") +
  geom_point()
```
If we want to look at the relationship between the fuel efficiency (MPG) of a vehicle and its weight, while also considering its horsepower, we do things a bit differently. Since "horsepower" is quantitative, we can use color or size to distinguish between various values. Both versions are shown below.
- Use `color =` (or `size =`).
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, color = hp) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", color = "Horsepower") +
  geom_point()
```

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, size = hp) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", size = "Horsepower") +
  geom_point()
```
Parameters in a population govern the data-generating process for a particular variable of interest. Throughout this course, we are focused on making inference on the mean response. For example, suppose we are interested in estimating the mean fuel efficiency; then, our model for the data generating process will have the form
$$(\text{MPG})_i = \mu + \epsilon_i$$
where $\mu$ is the parameter of interest (the average fuel efficiency). In order to communicate this data generating process to the computer, we use the function `specify_mean_model()`, which makes use of formula notation similar to that used when summarizing variables.
```r
mpg.model = specify_mean_model(mpg ~ 1, data = mtcars)
```
Here, the left-hand side of the formula corresponds to the variable on the left-hand side of the mathematical formula constructed above. The right-hand side specifies the terms which contain a parameter. Since the $\mu$ in our mathematical model is alone, it is multiplied by a 1 (hence the 1 in the formula for the function). Again, we use the `data =` option in order to specify the data set in which `mpg` can be found.
Note: the error term is understood in the model and need not be communicated to the computer.
Finally, this mean model is something that we will be working with; so, it should be saved. Here, we save it under the name `mpg.model`.
We begin with the same mean model described in the previous section: $$(\text{MPG})_i = \mu + \epsilon_i$$
In order to construct a confidence interval, we use the `estimate_parameters()` function:

```r
estimate_parameters(mpg.model, confidence.level = 0.95)
```
This function has a few requirements: the mean model specified above and the desired confidence level.
Note that the computer will always assume the observations are independent of one another, that the errors are identically distributed, and that the mean model is correctly specified.
Classical theory will be employed if we additionally assume the normality condition holds. Otherwise, bootstrapping will be used to compute the confidence intervals. For example, to compute a 95% confidence interval using classical theory ("t-interval"), use the following code:
```r
estimate_parameters(mpg.model, confidence.level = 0.95, assume.normality = TRUE)
```
Remember that each time you use a bootstrap procedure, the model for the sampling distribution will change slightly due to randomness. Therefore, the endpoints of the interval may vary slightly. In order to prevent this, you can "set a seed" for random number generation. Think of this as specifying how the random number generator should begin; running the same block of code multiple times will then produce the same output.
```r
set.seed(123)
estimate_parameters(mpg.model, confidence.level = 0.95)
```
The seed value can be any integer.
If you want to examine the entire model for the sampling distribution graphically (instead of just having the CI), you must first save the output of the `estimate_parameters()` function and then use it in the `plot_sampling_distribution()` function. Below is the plot of the model for the sampling distribution of the sample mean fuel efficiency when we do not assume normality. Here, the result of `estimate_parameters()` is stored in the name `sampling.distn`, which can be used later on.
```r
sampling.distn = estimate_parameters(mpg.model, confidence.level = 0.95)

plot_sampling_distribution(sampling.distn, show.confidence.interval = TRUE)
```
We can optionally choose to illustrate the confidence interval on the plot. The interval itself can be printed by printing out the contents of `sampling.distn`, which just holds the result of the call to the `estimate_parameters()` function:

```r
sampling.distn
```
We begin with the same mean model

$$(\text{MPG})_i = \mu + \epsilon_i$$

Suppose we are interested in testing the hypothesis

$$H_0: \mu = 18 \qquad \text{vs.} \qquad H_1: \mu \neq 18$$
Hypothesis testing at its core is about comparing two models: a full (unrestricted) model (alternative hypothesis) and a restricted model (null hypothesis). Specifically, notice that under the alternative hypothesis, we have $$\text{Unrestricted Model:} \quad (\text{MPG})_i = \mu + \epsilon_i$$
This is easily communicated to the computer:
```r
mpg.model.h1 = specify_mean_model(mpg ~ 1, data = mtcars)
```
Under the null hypothesis, we have $$\text{Restricted Model:} \quad (\text{MPG})_i = 18 + \epsilon_i$$
In this restricted model, there are no parameters to estimate. Instead, the parameter has a specific constant value. When specifying this model, we need to use the `constant()` notation to signify this.
```r
mpg.model.h0 = specify_mean_model(mpg ~ constant(18), data = mtcars)
```
Now, the two models can be compared:
```r
compare_models(mpg.model.h1, mpg.model.h0, alternative = "not equal")
```
In particular, pay attention to specifying the direction of the alternative hypothesis. The default is to perform a two-sided test ($H_1: \mu \neq \mu_0$); if you want to perform a one-sided test, specify the direction of the alternative by saying `alternative = "greater than"` or `alternative = "less than"` as appropriate.
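For example, to test against the one-sided alternative $H_1: \mu > 18$ instead, we could write:

```r
# One-sided test of H0: mu = 18 vs H1: mu > 18.
compare_models(mpg.model.h1, mpg.model.h0, alternative = "greater than")
```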
If you want to examine the entire model for the null distribution graphically (instead of just having the p-value), you must first save the output of the `compare_models()` function and then use it in the `plot_null_distribution()` function. Below is the plot of the model for the null distribution of the statistic summarizing the evidence against the null hypothesis described above. Here, the result of `compare_models()` is stored in the name `null.distn`, which can be used later on.
```r
null.distn = compare_models(mpg.model.h1, mpg.model.h0, alternative = "not equal")

plot_null_distribution(null.distn, show.pvalue = TRUE)
```
We can optionally choose to illustrate the region which makes up the p-value. The result itself can be printed by printing out the contents of `null.distn`, which just holds the result of the call to the `compare_models()` function:

```r
null.distn
```
Again, the classical theory can be obtained by additionally imposing the normality condition with `assume.normality = TRUE`.
Regression models extend our mean model by allowing the response to depend linearly on a covariate. Consider the following simple model allowing the fuel efficiency to depend upon the weight of the vehicle.
$$(\text{MPG})_i = \beta_0 + \beta_1(\text{Weight})_i + \epsilon_i$$
This is specified to the computer using `specify_mean_model()` as follows:
```r
reg.model = specify_mean_model(mpg ~ 1 + wt, data = mtcars)
```
Again, the left-hand side communicates the response variable. The right-hand side has a 1 (indicating the intercept term) plus the variable `wt`. Each term separated by a plus sign will have a parameter associated with it; these parameters are understood by the computer.
Parameters do not need to be specified; one is attached to each term separated by a `+` sign.
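For instance, a hypothetical model in which fuel efficiency depends on both weight and horsepower would attach one parameter to each term on the right-hand side; this sketch is only for illustration and is not used elsewhere in this document:

```r
# Hypothetical: mpg depends on both weight and horsepower.
# Each term after the ~ receives its own parameter (beta_1 for wt, beta_2 for hp).
multi.model = specify_mean_model(mpg ~ 1 + wt + hp, data = mtcars)
```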
The "best fit" line typically encountered is formed by computing the least squares estimates of the parameters in the regression model. The estimate_parameters()
function does this.
```r
estimate_parameters(reg.model)
```
The output does not contain confidence intervals because we did not specify a confidence level or the conditions we are willing to assume.
In order to obtain confidence intervals, we need to specify the conditions we are willing to assume.
```r
estimate_parameters(reg.model, confidence.level = 0.99,
                    assume.constant.variance = TRUE,
                    assume.normality = TRUE)
```
If both normality and constant variance are assumed, classical methods are used to compute the confidence intervals. Otherwise, bootstrapping is used.
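For instance, if we are unwilling to assume normality, dropping that argument yields bootstrap intervals; setting a seed (as described earlier) keeps the results reproducible:

```r
# Normality is not assumed here, so the CIs are computed via bootstrapping.
set.seed(123)
estimate_parameters(reg.model, confidence.level = 0.99,
                    assume.constant.variance = TRUE)
```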
Hypothesis testing at its core is about comparing two models: a full (unrestricted) model (alternative hypothesis) and a restricted model (null hypothesis). Referring to our model for the fuel efficiency described above, suppose we wanted to test the hypothesis that
$$H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_1: \beta_1 \neq 0$$
Under the alternative hypothesis (no restriction on $\beta_1$), we have the same model we began with: $$\text{Unrestricted Model:} \quad (\text{MPG})_i = \beta_0 + \beta_1(\text{Weight})_i + \epsilon_i$$
Under the null hypothesis, we have a restricted model: $$\text{Restricted Model:} \quad (\text{MPG})_i = \beta_0 + \epsilon_i$$
We can communicate each of these models to the computer using `specify_mean_model()`. Both must be specified so that they can be compared in the program.
```r
reg.model.h1 = specify_mean_model(mpg ~ 1 + wt, data = mtcars)
reg.model.h0 = specify_mean_model(mpg ~ 1, data = mtcars)
```
These two models are then compared (to generate a p-value) using the `compare_models()` function.
```r
compare_models(reg.model.h1, reg.model.h0,
               alternative = "not equal",
               assume.constant.variance = TRUE,
               assume.normality = TRUE)
```
What results is an ANOVA table summarizing the comparison between the two models. The line "Additional Terms" refers to the additional variability in the response captured by the lack of restrictions in the alternative hypothesis. The p-value is the last component of this output.
Similar to estimating parameters, bootstrapping is performed if either the normality or constant variance conditions are not assumed.
R-squared is one measure for quantifying the predictive ability of the model. It can be obtained using the `summarize_model_fit()` function.
```r
summarize_model_fit(reg.model)
```
While `summarize_model_fit()` returns a lot of information about the model, we will focus on the `r.squared` value.
Diagnostic graphics used to assess conditions on the stochastic portion of the model depend on residuals and fitted values. In order to obtain the residuals, we augment the original data based on the model constructed, using the `obtain_diagnostics()` function. These results must be stored in a dataset. This function cannot be used until you first specify a model so that residuals can be computed.
```r
mtcars.augment = obtain_diagnostics(reg.model)
```
The result is actually a new dataset which includes various new variables (beginning with a period in the name) used to assess the model fit. The variable `.resid` contains the residuals, and the variable `.fitted` contains the fitted (predicted) values. The first few records of this new dataset are shown below:
```r
mtcars.augment |>
  head() |>
  knitr::kable(digits = 3)
```
If the data is recorded in the order in which it was collected, we can construct a time-series plot of the residuals.
- Set `y = .resid`.
- Set `x = seq_along(.resid)` to construct the order of the observations.
- Add `geom_point()` and `geom_line()` to connect the dots for a better view.

```r
ggplot(data = mtcars.augment) +
  aes(y = .resid, x = seq_along(.resid)) +
  labs(y = "Residuals", x = "Order of Observations") +
  geom_point() +
  geom_line()
```
The nice thing about this graphic is that the only thing that changes is the dataset from which to pull the residuals.
A plot of the residuals against the fitted values can be helpful for assessing various conditions.
- Set `y = .resid`.
- Set `x = .fitted`.
- Add `geom_point()`.

```r
ggplot(data = mtcars.augment) +
  aes(y = .resid, x = .fitted) +
  labs(y = "Residuals", x = "Fitted Values") +
  geom_point()
```
It can sometimes be helpful to include a smoother on this graphic. This can be done easily by adding `geom_smooth()` to the graphic.

```r
ggplot(data = mtcars.augment) +
  aes(y = .resid, x = .fitted) +
  labs(y = "Residuals", x = "Fitted Values") +
  geom_point() +
  geom_smooth()
```
A probability plot, or quantile-quantile plot (QQ-plot for short), can also be made of the residuals.

- Set `sample = .resid`.
- Add `geom_qq()`.

```r
ggplot(data = mtcars.augment) +
  aes(sample = .resid) +
  labs(y = "Sample Quantiles", x = "Theoretical Quantiles") +
  geom_qq()
```
Consider our model for allowing the mean fuel efficiency to depend upon the weight of the vehicle: $$\text{Unrestricted Model:} \quad (\text{MPG})_i = \beta_0 + \beta_1(\text{Weight})_i + \epsilon_i$$
Suppose we are interested in estimating the average fuel efficiency for a vehicle which weighs 2000 pounds and one which weighs 3000 pounds (keeping in mind that `wt` is specified in 1000's of pounds). Instead of estimating the parameters of the model, we are interested in estimating the mean response, which is done with `estimate_mean_response()`:
```r
estimate_mean_response(reg.model, confidence.level = 0.95,
                       assume.constant.variance = TRUE,
                       assume.normality = TRUE,
                       wt = c(2, 3))
```
The majority of the arguments for this function are similar to `estimate_parameters()`. However, you must also specify the values of the covariates for which to estimate your response. Multiple values are placed in a vector using the `c()` construct.
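A single value works the same way; for example, to estimate the mean fuel efficiency for a vehicle weighing 2500 pounds:

```r
# One covariate value; wt is in 1000's of pounds, so 2500 lbs is 2.5.
estimate_mean_response(reg.model, confidence.level = 0.95,
                       assume.constant.variance = TRUE,
                       assume.normality = TRUE,
                       wt = 2.5)
```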
ANOVA models are extensions of the regression model which compare a quantitative response across the levels of a categorical variable. For our purposes, consider comparing the efficiency of vehicles with an automatic transmission to those with a manual transmission. Our model has the following form:
$$(\text{MPG})_i = \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i + \epsilon_i$$
where
$$\begin{aligned} (\text{Automatic})_i &= \begin{cases} 1 & \text{if i-th vehicle has an automatic transmission} \\ 0 & \text{otherwise} \end{cases} \\ (\text{Manual})_i &= \begin{cases} 1 & \text{if i-th vehicle has a manual transmission} \\ 0 & \text{otherwise} \end{cases} \end{aligned}$$
As with any model, our first step is to specify that model in the computer. Notice that the above model has a parameter for each level of the variable `am` and does not have an intercept. This can be specified as follows:
```r
anova.model = specify_mean_model(mpg ~ am, data = mtcars)
```
As long as the variable on the right-hand-side is a categorical variable, the computer will automatically construct the appropriate indicator functions.
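As with any mean model, the parameters (here, the two group means) can be estimated as before; the following sketch assumes `estimate_parameters()` accepts ANOVA models just as it does regression models:

```r
# Estimate the mean mpg for each transmission type with 95% confidence.
estimate_parameters(anova.model, confidence.level = 0.95,
                    assume.constant.variance = TRUE,
                    assume.normality = TRUE)
```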
Comparing two ANOVA models is no different from comparing two regression models. Consider the hypothesis of interest:
$$H_0: \mu_1 = \mu_2 \qquad \text{vs.} \qquad H_1: \mu_1 \neq \mu_2$$
Under the alternative hypothesis, we have an unrestricted model:
$$ (\text{MPG})_i = \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i + \epsilon_i $$
If the null hypothesis is true, the model for our data generating process would be restricted:
$$\text{Restricted Model:} \quad (\text{MPG})_i = \mu + \epsilon_i$$
where $\mu$ is the overall mean response --- the common value of $\mu_1$ and $\mu_2$. In order to construct the p-value we must communicate both models to the computer and then compare them.
```r
anova.model.h1 = specify_mean_model(mpg ~ am, data = mtcars)
anova.model.h0 = specify_mean_model(mpg ~ 1, data = mtcars)

compare_models(anova.model.h1, anova.model.h0,
               assume.constant.variance = TRUE,
               assume.normality = TRUE)
```
Notice that we did not specify the alternative; the only option for ANOVA models is for the alternative hypothesis to be "at least 1 differs."
The output contains a summary of the comparison between the two models in the form of a classical ANOVA table. Changing the conditions will determine if the p-value is computed via bootstrapping or classical probability theory.
Diagnostic graphics are similar to those in regression. The one change is that it is often helpful to view the residuals across the groups. This can be done by making a side-by-side boxplot of the residuals from augmented data.
```r
mtcars.augment = obtain_diagnostics(anova.model.h1)

ggplot(data = mtcars.augment) +
  aes(y = .resid, x = am) +
  labs(y = "Residuals", x = "") +
  geom_boxplot()
```
ANOVA models allowed us to compare a quantitative response across levels of a categorical variable. One of the key conditions required for inference is that the error in the response for one observation is independent of the error in the response for all other observations. When designing a study, however, we may opt for a design which intentionally makes use of dependent observations in order to increase the power of our study. For example, we may choose to incorporate "blocks" in our study design. The repeated measures ANOVA model extends our previous models allowing us to compare a quantitative response across the levels of a categorical predictor while accounting for the additional variability resulting from blocks.
For this example, we add additional information to the `mtcars` dataset; in particular, we add the manufacturer of each vehicle and the country in which the manufacturer has its headquarters.
```r
#| label: update-makers
mtcars <- mtcars |>
  mutate(Country = c('Japan', 'Japan', 'Japan', 'USA', 'USA', 'USA', 'USA',
                     'Germany', 'Germany', 'Germany', 'Germany', 'Germany',
                     'Germany', 'Germany', 'USA', 'USA', 'USA', 'Italy',
                     'Japan', 'Japan', 'Japan', 'USA', 'USA', 'USA', 'USA',
                     'Italy', 'Germany', 'England', 'USA', 'Italy', 'Italy',
                     'Sweden'),
         Brand = c('Mazda', 'Mazda', 'Datsun', 'AMC', 'AMC', 'Plymouth',
                   'Plymouth', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes',
                   'Mercedes', 'Mercedes', 'Mercedes', 'Cadillac', 'Lincoln',
                   'Chrysler', 'Fiat', 'Honda', 'Toyota', 'Toyota', 'Dodge',
                   'AMC', 'Chevrolet', 'Pontiac', 'Fiat', 'Porsche', 'Lotus',
                   'Ford', 'Ferrari', 'Maserati', 'Volvo'))
```
For our purposes, we are interested in comparing the average fuel efficiency for automatic and manual transmissions, while accounting for additional variability due to the country in which the vehicle was manufactured. That is, we know different countries have different standards in their fuel efficiency; so, we will treat the country as a blocking term.
Note: this is not the same as having a second predictor in the model as we are not interested in comparing one country to another; we are simply accounting for the fact that cars produced in the same country are likely to be similar to one another with regard to their fuel efficiency. That is, the variability in fuel efficiency across cars is more similar within a country than between countries.
Notice that we have six different countries (six levels for our blocking term). Our model is then:
$$\begin{aligned} (\text{MPG})_i &= \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i \\ &\quad + \beta_2 (\text{German})_i + \beta_3 (\text{Italian})_i \\ &\quad + \beta_4 (\text{Japanese})_i + \beta_5 (\text{Swedish})_i \\ &\quad + \beta_6 (\text{American})_i + \epsilon_i \end{aligned}$$
As with any model, our first step is to specify that model in the computer. Notice that the above model has a parameter for each level of the variable `am` and has a parameter for all but one blocking term. This can be specified as follows:
```r
block.model = specify_mean_model(mpg ~ am + Country, data = mtcars)
```
As long as the variables on the right-hand-side are categorical variables, the computer will automatically construct the appropriate indicator variables.
Comparing two repeated measures ANOVA models is not that different from comparing two ANOVA models. Consider the hypothesis of interest:
$$H_0: \mu_1 = \mu_2 \qquad \text{vs.} \qquad H_1: \mu_1 \neq \mu_2$$
Under the alternative hypothesis, we have an unrestricted model:
$$\begin{aligned} (\text{MPG})_i &= \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i \\ &\quad + \beta_2 (\text{German})_i + \beta_3 (\text{Italian})_i \\ &\quad + \beta_4 (\text{Japanese})_i + \beta_5 (\text{Swedish})_i \\ &\quad + \beta_6 (\text{American})_i + \epsilon_i \end{aligned}$$
If the null hypothesis is true, the model for our data generating process would be restricted:
$$\begin{aligned} (\text{MPG})_i &= \mu + \beta_2 (\text{German})_i + \beta_3 (\text{Italian})_i \\ &\quad + \beta_4 (\text{Japanese})_i + \beta_5 (\text{Swedish})_i \\ &\quad + \beta_6 (\text{American})_i + \epsilon_i \end{aligned}$$
where $\mu$ is the common mean response for cars manufactured in England --- the common value of $\mu_1$ and $\mu_2$. Note that the blocks remain in the model to continue to account for the additional source of variability. In order to construct the p-value we must communicate both models to the computer and then compare them.
```r
block.model.h1 = specify_mean_model(mpg ~ am + Country, data = mtcars)
block.model.h0 = specify_mean_model(mpg ~ 1 + Country, data = mtcars)

compare_models(block.model.h1, block.model.h0,
               assume.constant.variance = TRUE,
               assume.normality = TRUE)
```
The output contains a summary of the comparison between the two models in the form of an ANOVA table. Changing the conditions will determine if the p-value is computed via bootstrapping or classical probability theory.
Diagnostic graphics are similar to those in regression.
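For example, a side-by-side boxplot of the residuals across transmission types, mirroring the ANOVA diagnostics above:

```r
# Residuals from the block model, plotted by transmission type.
mtcars.augment = obtain_diagnostics(block.model.h1)

ggplot(data = mtcars.augment) +
  aes(y = .resid, x = am) +
  labs(y = "Residuals", x = "") +
  geom_boxplot()
```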
There are some issues students run into frequently. Here are some common situations.
The easiest way is to go to the "Environment" tab in RStudio and double-click on the name of the dataset. It will open a spreadsheet-like interface for looking at your dataset.
Variable names which have spaces in them need to be wrapped in backticks. Technically, all variable names can be wrapped this way, but it is a requirement when the variable name has spaces. For example, the following is allowed:
```r
mtcars = mtcars |>
  mutate(`Miles Per Gallon` = mpg)
```
A code block should be completely separate from surrounding text. So, be sure there is a blank line both before the code block and after it. Usually, this fixes the problem.
Generally, this happens when your "grouping" variable is a number instead of a phrase. For example, in the `mtcars` dataset, the variable `cyl` takes values 4, 6, or 8. However, R does not know that these values represent groups instead of actual values. In order to get the program to recognize it as a group, rewrite the variable using `factor()`:
```r
mtcars = mtcars |>
  mutate(cyl = factor(cyl))
```
Now, `cyl` will be treated as a grouping variable. This needs to be done only once and will work for all subsequent graphics and models constructed.
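For example, `cyl` can now serve as the grouping variable in a side-by-side boxplot:

```r
# cyl is now a factor, so it defines the groups along the x-axis.
ggplot(data = mtcars) +
  aes(y = mpg, x = cyl) +
  labs(y = "Miles per Gallon", x = "Number of Cylinders") +
  geom_boxplot()
```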
Note: you could also recode the variable manually if you wanted to:
```r
mtcars = mtcars |>
  mutate(cyl = case_match(cyl,
                          4 ~ "4 Cylinder",
                          6 ~ "6 Cylinder",
                          8 ~ "8 Cylinder"))
```
The choice is yours.