```r
## ---- Load Packages ----
pkgs <- c("tidyverse", "knitr", "IntroAnalysis")
for (pkg in pkgs) library(pkg, character.only = TRUE)

## ---- Change Options ----
# Suppress status bar in dplyr.
# Change handling of ordered factors.
options(dplyr.show_progress = FALSE,
        contrasts = rep("contr.treatment", 2))

# Change theme for plots.
theme_set(theme_bw(12))
theme_update(legend.position = "bottom")
```
This document provides a list of the code most commonly used in the course. There may be times, especially when working on an independent capstone project, when you want to move beyond what is discussed in class. There are many resources available online, or feel free to ask for guidance from the instructor. However, the majority of code needed for the course is documented here.
Throughout this document, we will be making use of the `mtcars` dataset available with all R installations. More information about the variables in this dataset can be found by typing `?mtcars` into your R console or typing `mtcars` into the help search box in RStudio. Here are the first few records of the dataset:
```r
mtcars <- datasets::mtcars

mtcars |>
  head() |>
  knitr::kable(digits = 3)
```
As you progress through the following examples, remember that R is case sensitive.
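For instance (a minimal illustration), `mpg` and `MPG` are treated as completely different names:

```r
# R is case sensitive: mtcars contains a column named mpg, not MPG.
mean(mtcars$mpg)
# mtcars$MPG would return NULL, since the column is named mpg.
```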
Before proceeding to the specific aspects of the course, let's first cover a few basics of programming in R.
Throughout the course we will be using a standard suite of packages (collections of functions which provide additional analytical capabilities). In order to ensure that you have all required functionality for the course, be sure that each RMarkdown file you write has the following block at the top, which should not be altered:
```r
IntroAnalysis:::ma223_setup()
```
An RMarkdown file is composed of two components: basic text and "code chunks" which perform the analysis. When the document is compiled ("knit"), the code is evaluated and the results are added seamlessly into your document. This encourages a reproducible workflow (someone can recreate all your results should the data change).
Code blocks can be added using the "Insert" button at the top of the script window. Code blocks have the following form:
cat("```r","\n", "Code goes here...","\n", "```", "\n", sep="")
Notice that the code blocks always begin and end with the triple backticks (found on your keyboard in the upper-left-hand corner with the tilde).
Most code chunks look much the same. Below is a very simple code chunk. Results are "assigned" to an object using an equal sign (or the arrow `<-`). The code essentially says that the thing on the left will contain anything generated on the right.
```r
# Comments (not evaluated) begin with a hashtag.
ObjectName = code to generate results
```
Just typing the name of an object prints that object out; that output, in turn, will appear in your final document.
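For instance, the following minimal chunk assigns a value and then prints it:

```r
# Assign the result of a computation to a name...
Result = 3 + 4
# ...then print it by typing the name alone.
Result
```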
If we want to do something to that object, we will make use of the "pipe operator" (`|>`), which essentially says take the thing before and use it in the next function. For example, we might summarise an object as follows:
```r
ObjectTwo = ObjectOne |> summarise()
```
This code says take the thing called "ObjectOne" and "pipe it into" (use within) the summarise function; whatever the result of this operation is will be called "ObjectTwo."
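As a concrete sketch (the object names here are only for illustration), we could compute the average fuel efficiency in `mtcars` this way:

```r
# Take mtcars, pipe it into summarise(), and store the result.
avg.mpg = mtcars |> summarise(Average_MPG = mean(mpg))
avg.mpg
```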
Data is rarely perfect when we first obtain it. Oftentimes, we will need to make changes. There are a few tasks which are performed regularly; these can generally be accomplished with the following functions:
- `mutate()`: manipulate variables or make transformations.
- `filter()`: subset the data based on some criteria.
- `case_match()`: recode the levels of a variable.

As an example, while the US measures fuel efficiency in miles per gallon, other countries often report the gallons required for a single mile. This new variable `Gallons_per_Mile` can be created using `mutate()`:
```r
mtcars = mtcars |>
  mutate(Gallons_per_Mile = 1/mpg)
```
Notice that we use the pipe operator (`|>`) here; this essentially says take the dataset `mtcars` and feed it into the `mutate()` function. The `mutate()` function itself uses the notation `New_Variable_Name = transformation of stuff`. In this case, we create a new variable within the dataset `mtcars`. The result is stored within a dataset called `mtcars` (since that is on the left of the equal sign), essentially overwriting the original dataset.
The first few records of this dataset are shown below:
```r
mtcars |>
  head() |>
  knitr::kable(digits = 3)
```
If we only want to look at a subset of the data, we use the `filter()` function. For example, suppose we only want to consider vehicles which have automatic transmissions and would like to store those records in a new dataset called `automatic.only`. We use the following code:
```r
automatic.only = mtcars |>
  filter(am == 0)
```
When making comparisons, `==` asks "are you equal?" We keep the records for which we would answer "yes" to this question. The original `mtcars` dataset remains unchanged. Instead, we have taken it, filtered it down, and then stored the results in `automatic.only` (the first few records of which are shown below).
```r
automatic.only |>
  head() |>
  knitr::kable(digits = 3)
```
When we work with categorical variables, we often want the values (or group labels) to be something readable. In order to do this, we may need to recode the data. Consider the variable `am` (transmission type). It is currently a 0/1 variable, and it is not obvious that 1 represents a manual transmission and 0 an automatic transmission. We can make this clearer by mutating the dataset and matching each level of the variable with a new value, thereby recoding that variable.
```r
mtcars = mtcars |>
  mutate(am = case_match(am,
                         0 ~ "automatic",
                         1 ~ "manual"))
```
The first few records of the updated dataset are shown below:
```r
mtcars |>
  head() |>
  knitr::kable(digits = 3)
```
If we now wanted to go further and change "automatic" to "Automatic Transmission", we could recode again; this time, since the values are already phrases rather than numbers, we do not need backticks.
mtcars = mtcars |> mutate(am = case_match(am, "automatic" ~ "Automatic Transmission", "manual" ~ "Manual Transmission"))
If your dataset has groups coded as numbers (like above when 0 = automatic and 1 = manual), it is best to recode in order for graphics and models to work appropriately.
Numerical summaries (statistics) tell us about the distribution of our sample. These also estimate the corresponding parameters of interest in the population. The functions we introduce below make use of "formula" notation in R. We take a second to introduce that here, and it will make more sense in the later modeling sections. Let's compare two basic formulas:
```r
y ~ 1
y ~ group
```
Both formulas have two essential components: a left-hand side and a right-hand side separated by the `~`. Think of the `~` as an "equal sign" in a mathematical formula or as a way of saying "related to." The first formula says that the variable `y` is not related to anything; you are interested in summarizing it across all records. The second formula says you want to summarize the variable `y` within groups which are defined by the variable `group`.
Of course, the "variable y
" is not very helpful unless you tell the computer where to look for that variable. So, formulas are often accompanied by a data =
argument to allow you to specify the data set in which the variable is located.
The `summarize_variable()` function allows us to compute individual summaries of variables. The following code computes the average mpg of vehicles in the sample as well as the standard deviation.
```r
summarize_variable(mpg ~ 1, data = mtcars, mean, sd)
```
The average fuel efficiency might differ for automatic and manual transmissions. In order to compute the above summaries for each type of car, we include the grouping variable.
```r
summarize_variable(mpg ~ am, data = mtcars, mean, sd)
```
After the `data =` option, we specify key words for computing summaries. The most commonly used key words are given in a table later in this section. To compute a proportion, we need to specify the value for which we want the proportion:
summarize_variable((am=="Manual Transmission") ~ 1, data = mtcars, percent)
The following table contains a list of functions used to construct common summaries.
| Numerical Summary   | Keyword/Function |
|---------------------|------------------|
| Mean/Average        | `mean`           |
| Standard Deviation  | `sd`             |
| Median              | `median`         |
| Minimum             | `min`            |
| Maximum             | `max`            |
| 25-th Percentile    | `Q1`             |
| 75-th Percentile    | `Q3`             |
| Interquartile Range | `IQR`            |
| Percent             | `percent`        |
| Proportion          | `mean`           |
| 1-st Percentile     | `perc01`         |
| 5-th Percentile     | `perc05`         |
| 10-th Percentile    | `perc10`         |
| 20-th Percentile    | `perc20`         |
| 80-th Percentile    | `perc80`         |
| 85-th Percentile    | `perc85`         |
| 90-th Percentile    | `perc90`         |
| 95-th Percentile    | `perc95`         |
| 99-th Percentile    | `perc99`         |

Table: List of functions for numerical summaries
If missing values are present in the data, these summaries cannot be computed. In order to perform the computations excluding the missing values, pass the argument `na.rm = TRUE` to the function:
```r
summarize_variable(mpg ~ 1, data = mtcars, mean, .args = list(na.rm = TRUE))
```
When numerically summarizing the relationship between two variables, we are typically interested in computing the correlation coefficient relating two numeric variables. This is done with `summarize_relationship()`. The following code computes the correlation relating the fuel efficiency with the horsepower of the engine.
```r
summarize_relationship(mpg ~ hp, data = mtcars)
```
Again, the formula interface is used to specify which two variables we are relating.
A picture is worth 1000 words. A good graphical summary of the data should not only be a first step in an analysis, but it should also accompany any good analysis to help convey the message of the data. All graphics in this class will be produced starting with the `ggplot()` function. While this codebook provides a set of recipes commonly encountered in class, if you are interested in moving beyond what is described here, I would encourage you to look at the Ggplot Cheat Sheet.
There is a bit of a learning curve to using `ggplot()`. If you would like, you can also click on `Add Ins > 'ggplot2' builder` (under the `ESQUISSE` heading after installing the `esquisse` package); this will allow you to create a graphic interactively and then export the code.
The `ggplot()` function takes a single component:

- `data =` specifies the data set in which the variables exist.

Additional components of a graphic are "added" on in layers; the most important of these maps the aesthetics --- that is, it communicates what variables are mapped to each axis in the plot. This is done with the `aes()` function, which (optionally) can have the following components:

- `x =` specifies the variable for the x-axis.
- `y =` if needed, specifies the variable for the y-axis.
- `color =` specifies which variable in the data determines the exterior color of the objects.
- `fill =` specifies which variable in the data determines the fill-color of the objects plotted.
- `size =` specifies which variable in the data determines the size of the objects plotted.
- `shape =` specifies which variable in the data determines the shape of the objects plotted.

We can then add labels to our graphic using the `labs()` function, which (optionally) takes the following components:

- `title = "title text"` title for the graphic.
- `subtitle = "subtitle text"` subtitle for the graphic.
- `caption = "caption text"` caption for the graphic.
- `x = "x text"` label for the x-axis.
- `y = "y text"` label for the y-axis.
- `color = "color text"` label for the color legend.
- `fill = "fill text"` label for the fill legend.
- `size = "size text"` label for the size legend.
- `shape = "shape text"` label for the shape legend.

If you would like a blank label, simply enter `x = ""`, for example.
The last thing you need is to actually add the various summaries for the data; these are done through "geometric representations" (or "geoms"). Choosing the geom changes how the data is plotted:
| Graphic Type | Geom             |
|--------------|------------------|
| Bar Chart    | `geom_bar`       |
| Histogram    | `geom_histogram` |
| Density Plot | `geom_density`   |
| Boxplot      | `geom_boxplot`   |
| Violin Plot  | `geom_violin`    |
| Jitter Plot  | `geom_jitter`    |
| Scatterplot  | `geom_point`     |
| Smoother     | `geom_smooth`    |
| Line Chart   | `geom_line`      |

Table: List of common geoms for creating graphics.
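Putting these pieces together, every graphic follows the same general pattern. Here is a minimal sketch using variables from `mtcars` (the particular aesthetics and geom are placeholders):

```r
# General pattern: data, then aesthetics, then labels, then a geom.
ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon") +
  geom_point()
```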
We now illustrate various ways of combining these geoms to create common graphics.
A bar chart is the simplest (and often best) method for summarizing the distribution of a categorical variable. The key steps to making a bar chart include:

- Add `geom_bar()`.

For example, we can summarize the number of vehicles with each type of transmission using the following code.
ggplot(data = mtcars) + aes(x = am) + labs(x = "", y = "Frequency") + geom_bar()
We remove the label on the x-axis because the value of the variable makes it clear what is being summarized.
If you have multiple categorical variables, consider using additional aesthetics to group variables. We will consider a bar chart summarizing the type of engine and the type of transmission for each vehicle. First, we clean up the way that the variable `vs` is coded:
```r
mtcars = mtcars |>
  mutate(vs = recode(vs,
                     `0` = "V Engine",
                     `1` = "Straight Engine"))
```
Now, we construct the graphic using the following steps:

- Use `fill =` to set the secondary variable.
- Add `geom_bar()`.

Again, we remove the labels of both the `fill` and the `x` variables because the values are self-evident.
```r
ggplot(data = mtcars) +
  aes(x = am, fill = vs) +
  labs(x = "", fill = "", y = "Frequency") +
  geom_bar()
```
When looking at the distribution of a single variable, a histogram is a reasonable choice, especially for medium-sized datasets.

- Add `geom_histogram()`.
- Set the `binwidth =` option inside of `geom_histogram()` to a reasonable size (how large you want each bin to be). This changes for each problem and is a matter of opinion on what gives a reasonable-looking graphic. Leaving this option off will result in the computer choosing a default.

```r
ggplot(data = mtcars) +
  aes(x = mpg) +
  labs(y = "Frequency", x = "Miles per Gallon") +
  geom_histogram(binwidth = 1)
```
In this graphic, we set the bin width to be 1. We could also have considered a larger bin width of 3, for example:
ggplot(data = mtcars) + aes(x = mpg) + labs(y = "Frequency", x = "Miles per Gallon") + geom_histogram(binwidth = 3)
If you exclude the `binwidth =` option, it will choose a value for you:

```r
ggplot(data = mtcars) +
  aes(x = mpg) +
  labs(y = "Frequency", x = "Miles per Gallon") +
  geom_histogram()
```
A density plot is like a smooth histogram; this works best with larger datasets.
- Add `geom_density()`.

```r
ggplot(data = mtcars) +
  aes(x = mpg) +
  labs(y = "Density", x = "Miles per Gallon") +
  geom_density()
```
The y-axis on this graphic is not directly interpretable; essentially, the values have been chosen such that the area under the curve is equal to 1. Our primary interest here is that larger values indicate the corresponding values on the x-axis are more likely.
A boxplot gives a quick summary of location and spread for a numeric variable.
- Add `geom_boxplot()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = "") +
  labs(y = "Miles per Gallon", x = "") +
  geom_boxplot()
```
If we are examining the distribution of a quantitative variable across various groups formed by a categorical variable, we can use side-by-side graphics, such as boxplots.
- Add `geom_boxplot()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_boxplot()
```
Boxplots summarize the center and spread of the data, but the shape is lost. Violin plots generalize boxplots by incorporating the shape.
- Add `geom_violin()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_violin()
```
When we have a small sample, boxplots are inappropriate because they hide the raw data and present an overly confident summary. Instead, we prefer to place the raw values on the plot.
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_point()
```
If the plot looks too overwhelming because you have just enough points that they overlap, you can "jitter" the points:

- Add `geom_jitter()`.
- Set `height = 0` within `geom_jitter()` to prevent the data from being jittered vertically, and set `width = 0.25` to narrow the amount of jitter added horizontally.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am) +
  labs(y = "Miles per Gallon", x = "") +
  geom_jitter(height = 0, width = 0.25)
```
Boxplots, Violin Plots, and Individual Value Plots can all be extended to include additional groups by adding aesthetics. For example, we can construct a side-by-side boxplot which also fills according to the engine type.

- Use an additional aesthetic (`fill =`, `color =`, etc.).
- Add `geom_boxplot()`, for example.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = am, fill = vs) +
  labs(y = "Miles per Gallon", x = "", fill = "") +
  geom_boxplot()
```
When examining the relationship between two quantitative variables, we can use a scatterplot.
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)") +
  geom_point()
```
Just as with other plots, we can add additional features to a scatterplot. In general, two guiding principles help when thinking about these additional aesthetics: use color or shape to distinguish the levels of a categorical variable, and use color or size to represent the values of a quantitative variable.
The following graphic examines the relationship between the fuel efficiency (MPG) of a vehicle and its weight, while also considering the transmission type of the vehicle. Since "transmission type" is categorical, we use the color to separate these groups.
- Use `color =`.
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, color = am) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", color = "") +
  geom_point()
```
We could also have used the shape of the points instead.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, shape = am) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", shape = "") +
  geom_point()
```
If we want to look at the relationship between the fuel efficiency (MPG) of a vehicle and its weight, while also considering its horsepower, we do things a bit differently. Since "horsepower" is quantitative, we can use color or size to distinguish between various values. Both versions are shown below.
- Use `color =` (or `size =`).
- Add `geom_point()`.

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, color = hp) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", color = "Horsepower") +
  geom_point()
```

```r
ggplot(data = mtcars) +
  aes(y = mpg, x = wt, size = hp) +
  labs(y = "Miles per Gallon", x = "Weight (1000 lbs)", size = "Horsepower") +
  geom_point()
```
Parameters in a population govern the data-generating process for a particular variable of interest. Throughout this course, we are focused on making inference on the mean response. For example, suppose we are interested in estimating the mean fuel efficiency; then, our model for the data generating process will have the form
$$(\text{MPG})_i = \mu + \epsilon_i$$
where $\mu$ is the parameter of interest (the average fuel efficiency). In order to communicate this data generating process to the computer, we use the function `specify_mean_model()`, which makes use of formula notation similar to that used when summarizing variables.
```r
mpg.model = specify_mean_model(mpg ~ 1, data = mtcars)
```
Here, the left-hand side of the formula corresponds to the variable on the left-hand side of the mathematical formula constructed above. The right-hand side specifies the terms which contain a parameter. Since the $\mu$ in our mathematical model is alone, it is multiplied by a 1 (hence the 1 in the formula for the function). Again, we use the `data =` option in order to specify the data set in which `mpg` can be found.
Note: the error term is understood in the model and need not be communicated to the computer.
Finally, this mean model is something that we will be working with; so, it should be saved. Here, we save it under the name `mpg.model`.
We begin with the same mean model described in the previous section: $$(\text{MPG})_i = \mu + \epsilon_i$$
In order to construct a confidence interval, we use the `estimate_parameters()` function:

```r
estimate_parameters(mpg.model, confidence.level = 0.95)
```
This function has a few requirements: the mean model specified above and the desired confidence level.
Note that the computer will always assume the observations are independent of one another, that the errors are identically distributed, and that the mean model is correctly specified.
Classical theory will be employed if we additionally assume the normality condition holds. Otherwise, bootstrapping will be used to compute the confidence intervals. For example, to compute a 95% confidence interval using classical theory ("t-interval"), use the following code:
```r
estimate_parameters(mpg.model, confidence.level = 0.95, assume.normality = TRUE)
```
Remember that each time you use a bootstrap procedure, the model for the sampling distribution will change slightly due to randomness. Therefore, the endpoints of the interval may vary slightly. In order to prevent this, you can "set a seed" for random number generation. Think of this as specifying how the random number generator should begin; running the same block of code multiple times will then produce the same output.
```r
set.seed(123)
estimate_parameters(mpg.model, confidence.level = 0.95)
```
The seed value can be any integer.
If you want to examine the entire model for the sampling distribution graphically (instead of just having the CI), you must first save the output of the `estimate_parameters()` function and then use it in the `plot_sampling_distribution()` function. Below is the plot of the model for the sampling distribution of the sample mean fuel efficiency when we do not assume normality. Here, the result of `estimate_parameters()` is stored in the name `sampling.distn`, which can be used later on.
```r
sampling.distn = estimate_parameters(mpg.model, confidence.level = 0.95)

plot_sampling_distribution(sampling.distn, show.confidence.interval = TRUE)
```
We can optionally choose to illustrate the confidence interval on the plot. The interval itself can be printed by printing out the contents of `sampling.distn`, which just holds the result of the call to the `estimate_parameters()` function:

```r
sampling.distn
```
We begin with the same mean model

$$(\text{MPG})_i = \mu + \epsilon_i$$

Suppose we are interested in testing the hypothesis

$$H_0: \mu = 18 \qquad \text{vs.} \qquad H_1: \mu \neq 18$$
Hypothesis testing at its core is about comparing two models: a full (unrestricted) model (alternative hypothesis) and a restricted model (null hypothesis). Specifically, notice that under the alternative hypothesis, we have $$\text{Unrestricted Model:} \quad (\text{MPG})_i = \mu + \epsilon_i$$
This is easily communicated to the computer:
```r
mpg.model.h1 = specify_mean_model(mpg ~ 1, data = mtcars)
```
Under the null hypothesis, we have $$\text{Restricted Model:} \quad (\text{MPG})_i = 18 + \epsilon_i$$
In this restricted model, there are no parameters to estimate. Instead, the parameter has a specific constant value. When specifying this model, we need to use the `constant()` notation to signify this.
```r
mpg.model.h0 = specify_mean_model(mpg ~ constant(18), data = mtcars)
```
Now, the two models can be compared:
```r
compare_models(mpg.model.h1, mpg.model.h0, alternative = "not equal")
```
In particular, pay attention to specifying the direction of the alternative hypothesis. The default is to perform a two-sided test ($H_1: \mu \neq \mu_0$); if you want to perform a one-sided test, specify the direction of the alternative by saying `alternative = "greater than"` or `alternative = "less than"` as appropriate.
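For example, to test against the one-sided alternative $H_1: \mu > 18$ instead, we could write:

```r
# One-sided test of H0: mu = 18 vs H1: mu > 18.
compare_models(mpg.model.h1, mpg.model.h0, alternative = "greater than")
```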
If you want to examine the entire model for the null distribution graphically (instead of just having the p-value), you must first save the output of the `compare_models()` function and then use it in the `plot_null_distribution()` function. Below is the plot of the model for the null distribution of the statistic summarizing the evidence against the null hypothesis described above. Here, the result of `compare_models()` is stored in the name `null.distn`, which can be used later on.
```r
null.distn = compare_models(mpg.model.h1, mpg.model.h0, alternative = "not equal")

plot_null_distribution(null.distn, show.pvalue = TRUE)
```
We can optionally choose to illustrate the region which makes up the p-value. The result itself can be printed by printing out the contents of `null.distn`, which just holds the result of the call to the `compare_models()` function:

```r
null.distn
```
Again, the classical theory can be obtained by additionally imposing the normality condition with `assume.normality = TRUE`.
Regression models extend our mean model by allowing the response to depend linearly on a covariate. Consider the following simple model allowing the fuel efficiency to depend upon the weight of the vehicle.
$$(\text{MPG})_i = \beta_0 + \beta_1(\text{Weight})_i + \epsilon_i$$
This is specified to the computer using `specify_mean_model()` as follows:
```r
reg.model = specify_mean_model(mpg ~ 1 + wt, data = mtcars)
```
Again, the left-hand side communicates the response variable. The right-hand side has a 1 (indicating the intercept term) plus the variable `wt`. Each term separated by a plus sign will have a parameter associated with it; these parameters are understood by the computer.
Parameters do not need to be specified; one is attached to each term separated by a `+` sign.
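For instance, a hypothetical model in which fuel efficiency depends on both weight and horsepower would attach one parameter to each term on the right-hand side; this sketch is only for illustration and is not used elsewhere in this document:

```r
# Hypothetical: mpg depends on both weight and horsepower.
# Each term after the ~ receives its own parameter (beta_1 for wt, beta_2 for hp).
multi.model = specify_mean_model(mpg ~ 1 + wt + hp, data = mtcars)
```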
The "best fit" line typically encountered is formed by computing the least squares estimates of the parameters in the regression model. The estimate_parameters()
function does this.
```r
estimate_parameters(reg.model)
```
The output does not contain confidence intervals because we did not specify a confidence level or the conditions we are willing to assume.
In order to obtain confidence intervals, we need to specify the conditions we are willing to assume.
```r
estimate_parameters(reg.model, confidence.level = 0.99,
                    assume.constant.variance = TRUE,
                    assume.normality = TRUE)
```
If both normality and constant variance are assumed, classical methods are used to compute the confidence intervals. Otherwise, bootstrapping is used.
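For instance, if we are unwilling to assume normality, dropping that argument yields bootstrap intervals; setting a seed (as described earlier) keeps the results reproducible:

```r
# Normality is not assumed here, so the CIs are computed via bootstrapping.
set.seed(123)
estimate_parameters(reg.model, confidence.level = 0.99,
                    assume.constant.variance = TRUE)
```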
Hypothesis testing at its core is about comparing two models: a full (unrestricted) model (alternative hypothesis) and a restricted model (null hypothesis). Referring to our model for the fuel efficiency described above, suppose we wanted to test the hypothesis that
$$H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_1: \beta_1 \neq 0$$
Under the alternative hypothesis (no restriction on $\beta_1$), we have the same model we began with: $$\text{Unrestricted Model:} \quad (\text{MPG})_i = \beta_0 + \beta_1(\text{Weight})_i + \epsilon_i$$
Under the null hypothesis, we have a restricted model: $$\text{Restricted Model:} \quad (\text{MPG})_i = \beta_0 + \epsilon_i$$
We can communicate each of these models to the computer using `specify_mean_model()`. Both must be specified so that they can be compared in the program.
```r
reg.model.h1 = specify_mean_model(mpg ~ 1 + wt, data = mtcars)
reg.model.h0 = specify_mean_model(mpg ~ 1, data = mtcars)
```
These two models are then compared (to generate a p-value) using the `compare_models()` function.
```r
compare_models(reg.model.h1, reg.model.h0,
               alternative = "not equal",
               assume.constant.variance = TRUE,
               assume.normality = TRUE)
```
What results is an ANOVA table summarizing the comparison between the two models. The line "Additional Terms" refers to the additional variability in the response captured by the lack of restrictions in the alternative hypothesis. The p-value is the last component of this output.
Similar to estimating parameters, bootstrapping is performed if either the normality or constant variance conditions are not assumed.
R-squared is one measure for quantifying the predictive ability of the model. It can be obtained using the `summarize_model_fit()` function.
```r
summarize_model_fit(reg.model)
```
While `summarize_model_fit()` returns a lot of information about the model, we will focus on the `r.squared` value.
Diagnostic graphics used to assess conditions on the stochastic portion of the model depend on residuals and fitted values. In order to obtain the residuals, we augment the original data based on the model constructed, using the `obtain_diagnostics()` function. These results must be stored in a dataset. This function cannot be used until you first specify a model so that residuals can be computed.
```r
mtcars.augment = obtain_diagnostics(reg.model)
```
The result is actually a new dataset which includes various new variables (beginning with a period in the name) used to assess the model fit. The variable `.resid` contains the residuals, and the variable `.fitted` contains the fitted (predicted) values. The first few records of this new dataset are shown below:
```r
mtcars.augment |>
  head() |>
  knitr::kable(digits = 3)
```
If the data is recorded in the order in which it was collected, we can construct a time-series plot of the residuals.
- Set `y = .resid`.
- Set `x = seq_along(.resid)` to construct the order of the observations.
- Add `geom_point()` and `geom_line()` to connect the dots for a better view.

```r
ggplot(data = mtcars.augment) +
  aes(y = .resid, x = seq_along(.resid)) +
  labs(y = "Residuals", x = "Order of Observations") +
  geom_point() +
  geom_line()
```
The nice thing about this graphic is that the only thing that changes is the dataset from which to pull the residuals.
A plot of the residuals against the fitted values can be helpful for assessing various conditions.
- Set `y = .resid`.
- Set `x = .fitted`.
- Add `geom_point()`.

```r
ggplot(data = mtcars.augment) +
  aes(y = .resid, x = .fitted) +
  labs(y = "Residuals", x = "Fitted Values") +
  geom_point()
```
It can sometimes be helpful to include a smoother on this graphic. This can be done easily by adding `geom_smooth()` to the graphic.

```r
ggplot(data = mtcars.augment) +
  aes(y = .resid, x = .fitted) +
  labs(y = "Residuals", x = "Fitted Values") +
  geom_point() +
  geom_smooth()
```
A probability plot, or quantile-quantile plot (QQ-plot for short), can also be made of the residuals.

- Set `sample = .resid`.
- Add `geom_qq()`.

```r
ggplot(data = mtcars.augment) +
  aes(sample = .resid) +
  labs(y = "Sample Quantiles", x = "Theoretical Quantiles") +
  geom_qq()
```
Consider our model for allowing the mean fuel efficiency to depend upon the weight of the vehicle: $$\text{Unrestricted Model:} \quad (\text{MPG})_i = \beta_0 + \beta_1(\text{Weight})_i + \epsilon_i$$
Suppose we are interested in estimating the average fuel efficiency for a vehicle which weighs 2000 pounds and one which weighs 3000 pounds (keeping in mind that `wt` is specified in 1000's of pounds). Instead of estimating the parameters of the model, we are interested in estimating the mean response, which is done with `estimate_mean_response()`:
```r
estimate_mean_response(reg.model, confidence.level = 0.95,
                       assume.constant.variance = TRUE,
                       assume.normality = TRUE,
                       wt = c(2, 3))
```
The majority of the arguments for this function are similar to `estimate_parameters()`. However, you must also specify the values of the covariates for which to estimate your response. Multiple values are placed in a vector using the `c()` construct.
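A single value works the same way; for example, to estimate the mean fuel efficiency for a vehicle weighing 2500 pounds:

```r
# One covariate value; wt is in 1000's of pounds, so 2500 lbs is 2.5.
estimate_mean_response(reg.model, confidence.level = 0.95,
                       assume.constant.variance = TRUE,
                       assume.normality = TRUE,
                       wt = 2.5)
```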
ANOVA models are extensions of the regression model which compare a quantitative response across the levels of a categorical variable. For our purposes, consider comparing the efficiency of vehicles with an automatic transmission to those with a manual transmission. Our model has the following form:
$$(\text{MPG})_i = \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i + \epsilon_i$$
where
$$\begin{aligned} (\text{Automatic})_i &= \begin{cases} 1 & \text{if i-th vehicle has an automatic transmission} \\ 0 & \text{otherwise} \end{cases} \\ (\text{Manual})_i &= \begin{cases} 1 & \text{if i-th vehicle has a manual transmission} \\ 0 & \text{otherwise} \end{cases} \end{aligned}$$
As with any model, our first step is to specify that model in the computer. Notice that the above model has a parameter for each level of the variable `am` and does not have an intercept. This can be specified as follows:
```r
anova.model = specify_mean_model(mpg ~ am, data = mtcars)
```
As long as the variable on the right-hand-side is a categorical variable, the computer will automatically construct the appropriate indicator functions.
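As with any mean model, the parameters (here, the two group means) can be estimated as before; the following sketch assumes `estimate_parameters()` accepts ANOVA models just as it does regression models:

```r
# Estimate the mean mpg for each transmission type with 95% confidence.
estimate_parameters(anova.model, confidence.level = 0.95,
                    assume.constant.variance = TRUE,
                    assume.normality = TRUE)
```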
Comparing two ANOVA models is no different from comparing two regression models. Consider the hypothesis of interest:
$$H_0: \mu_1 = \mu_2 \qquad \text{vs.} \qquad H_1: \mu_1 \neq \mu_2$$
Under the alternative hypothesis, we have an unrestricted model:
$$ (\text{MPG})_i = \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i + \epsilon_i $$
If the null hypothesis is true, the model for our data generating process would be restricted:
$$\text{Restricted Model:} \quad (\text{MPG})_i = \mu + \epsilon_i$$
where $\mu$ is the overall mean response --- the common value of $\mu_1$ and $\mu_2$. In order to construct the p-value we must communicate both models to the computer and then compare them.
```r
anova.model.h1 = specify_mean_model(mpg ~ am, data = mtcars)
anova.model.h0 = specify_mean_model(mpg ~ 1, data = mtcars)

compare_models(anova.model.h1, anova.model.h0,
               assume.constant.variance = TRUE,
               assume.normality = TRUE)
```
Notice that we did not specify the alternative; the only option for ANOVA models is for the alternative hypothesis to be "at least 1 differs."
The output contains a summary of the comparison between the two models in the form of a classical ANOVA table. Changing the conditions will determine if the p-value is computed via bootstrapping or classical probability theory.
Diagnostic graphics are similar to those in regression. The one change is that it is often helpful to view the residuals across the groups. This can be done by making a side-by-side boxplot of the residuals from augmented data.
```r
mtcars.augment = obtain_diagnostics(anova.model.h1)

ggplot(data = mtcars.augment) +
  aes(y = .resid, x = am) +
  labs(y = "Residuals", x = "") +
  geom_boxplot()
```
ANOVA models allowed us to compare a quantitative response across levels of a categorical variable. One of the key conditions required for inference is that the error in the response for one observation is independent of the error in the response for all other observations. When designing a study, however, we may opt for a design which intentionally makes use of dependent observations in order to increase the power of our study. For example, we may choose to incorporate "blocks" in our study design. The repeated measures ANOVA model extends our previous models allowing us to compare a quantitative response across the levels of a categorical predictor while accounting for the additional variability resulting from blocks.
For this example, we add additional information to the `mtcars` dataset; in particular, we add the manufacturer of each vehicle and the country in which the manufacturer has its headquarters.
```r
#| label: update-makers
mtcars <- mtcars |>
  mutate(Country = c('Japan', 'Japan', 'Japan', 'USA', 'USA', 'USA', 'USA',
                     'Germany', 'Germany', 'Germany', 'Germany', 'Germany',
                     'Germany', 'Germany', 'USA', 'USA', 'USA', 'Italy',
                     'Japan', 'Japan', 'Japan', 'USA', 'USA', 'USA', 'USA',
                     'Italy', 'Germany', 'England', 'USA', 'Italy', 'Italy',
                     'Sweden'),
         Brand = c('Mazda', 'Mazda', 'Datsun', 'AMC', 'AMC', 'Plymouth',
                   'Plymouth', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes',
                   'Mercedes', 'Mercedes', 'Mercedes', 'Cadillac', 'Lincoln',
                   'Chrysler', 'Fiat', 'Honda', 'Toyota', 'Toyota', 'Dodge',
                   'AMC', 'Chevrolet', 'Pontiac', 'Fiat', 'Porsche', 'Lotus',
                   'Ford', 'Ferrari', 'Maserati', 'Volvo'))
```
For our purposes, we are interested in comparing the average fuel efficiency for automatic and manual transmissions, while accounting for additional variability due to the country in which the vehicle was manufactured. That is, we know different countries have different standards in their fuel efficiency; so, we will treat the country as a blocking term.
Note: this is not the same as having a second predictor in the model as we are not interested in comparing one country to another; we are simply accounting for the fact that cars produced in the same country are likely to be similar to one another with regard to their fuel efficiency. That is, the variability in fuel efficiency across cars is more similar within a country than between countries.
Notice that we have six different countries (six levels for our blocking term). Our model is then:
$$\begin{aligned} (\text{MPG})_i &= \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i \\ &\quad + \beta_2 (\text{German})_i + \beta_3 (\text{Italian})_i \\ &\quad + \beta_4 (\text{Japanese})_i + \beta_5 (\text{Swedish})_i \\ &\quad + \beta_6 (\text{American})_i + \epsilon_i \end{aligned}$$
As with any model, our first step is to specify that model in the computer. Notice that the above model has a parameter for each level of the variable `am` and has a parameter for all but one blocking term. This can be specified as follows:
```r
block.model = specify_mean_model(mpg ~ am + Country, data = mtcars)
```
As long as the variables on the right-hand-side are categorical variables, the computer will automatically construct the appropriate indicator variables.
Comparing two repeated measures ANOVA models is not that different from comparing two ANOVA models. Consider the hypothesis of interest:
$$H_0: \mu_1 = \mu_2 \qquad \text{vs.} \qquad H_1: \mu_1 \neq \mu_2$$
Under the alternative hypothesis, we have an unrestricted model:
$$\begin{aligned} (\text{MPG})_i &= \mu_1 (\text{Automatic})_i + \mu_2 (\text{Manual})_i \\ &\quad + \beta_2 (\text{German})_i + \beta_3 (\text{Italian})_i \\ &\quad + \beta_4 (\text{Japanese})_i + \beta_5 (\text{Swedish})_i \\ &\quad + \beta_6 (\text{American})_i + \epsilon_i \end{aligned}$$
If the null hypothesis is true, the model for our data generating process would be restricted:
$$\begin{aligned} (\text{MPG})_i &= \mu + \beta_2 (\text{German})_i + \beta_3 (\text{Italian})_i \\ &\quad + \beta_4 (\text{Japanese})_i + \beta_5 (\text{Swedish})_i \\ &\quad + \beta_6 (\text{American})_i + \epsilon_i \end{aligned}$$
where $\mu$ is the common mean response for cars manufactured in England --- the common value of $\mu_1$ and $\mu_2$. Note that the blocks remain in the model to continue to account for the additional source of variability. In order to construct the p-value we must communicate both models to the computer and then compare them.
```r
block.model.h1 = specify_mean_model(mpg ~ am + Country, data = mtcars)
block.model.h0 = specify_mean_model(mpg ~ 1 + Country, data = mtcars)

compare_models(block.model.h1, block.model.h0,
               assume.constant.variance = TRUE,
               assume.normality = TRUE)
```
The output contains a summary of the comparison between the two models in the form of an ANOVA table. Changing the conditions will determine if the p-value is computed via bootstrapping or classical probability theory.
Diagnostic graphics are similar to those in regression.
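For example, a side-by-side boxplot of the residuals across transmission types, mirroring the ANOVA diagnostics above:

```r
# Residuals from the block model, plotted by transmission type.
mtcars.augment = obtain_diagnostics(block.model.h1)

ggplot(data = mtcars.augment) +
  aes(y = .resid, x = am) +
  labs(y = "Residuals", x = "") +
  geom_boxplot()
```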
There are some issues students run into frequently. Here are some common situations.
The easiest way is to go to the "Environment" tab in RStudio and double-click on the name of the dataset. It will open a spreadsheet-like interface for looking at your dataset.
Variable names which have spaces in them need to be wrapped in backticks. Technically, all variable names can be wrapped this way, but it is a requirement when the variable name has spaces. For example, the following is allowed:
```r
mtcars = mtcars |>
  mutate(`Miles Per Gallon` = mpg)
```
A code block should be completely separate from surrounding text. So, be sure there is a blank line both before the code block and after it. Usually, this fixes the problem.
Generally, this happens when your "grouping" variable is a number instead of a phrase. For example, in the `mtcars` dataset, the variable `cyl` takes values 4, 6, or 8. However, R does not know that these values represent groups instead of actual values. In order to get the program to recognize it as a group, rewrite the variable using `factor()`:
```r
mtcars = mtcars |>
  mutate(cyl = factor(cyl))
```
Now, `cyl` will be treated as a grouping variable. This needs to be done only once and will work for all subsequent graphics and models constructed.
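For example, `cyl` can now serve as the grouping variable in a side-by-side boxplot:

```r
# cyl is now a factor, so it defines the groups along the x-axis.
ggplot(data = mtcars) +
  aes(y = mpg, x = cyl) +
  labs(y = "Miles per Gallon", x = "Number of Cylinders") +
  geom_boxplot()
```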
Note: you could also recode the variable manually if you wanted to:
```r
mtcars = mtcars |>
  mutate(cyl = case_match(cyl,
                          4 ~ "4 Cylinder",
                          6 ~ "6 Cylinder",
                          8 ~ "8 Cylinder"))
```
The choice is yours.