The IntroAnalysis package provides wrappers for estimating parameters and comparing nested models (conducting hypothesis tests) when working with linear and generalized linear models. These wrappers provide a single interface for performing statistical inference under the classical conditions or via bootstrapping. The functions have been written specifically with introductory statistics students in mind. The introductory course we have in mind introduces students to statistical inference using the linear model as the primary framework; it covers inference for a single mean, multiple means (ANOVA), and regression, each through the lens of the linear model. In order to facilitate analysis in these scenarios, we designed the package around a small set of pedagogical goals.
As a result of these goals, we avoid the use of functions common in introductory tutorials with R, such as t.test() and aov(), in favor of a single workflow for each of the scenarios described above. While not illustrated in this vignette, we also recommend the use of the tidyverse package alongside IntroAnalysis for performing data manipulation.
The remainder of this vignette illustrates the use of the IntroAnalysis package for performing analyses typically encountered in an introductory statistics course. This vignette is written for instructors who may be considering the adoption of this package and therefore assumes familiarity with the following topics:

- the lm() function in R
- the glm() function in R

We draw connections between these common methods and the functions provided by the IntroAnalysis package. We also give some explanation of how this package is utilized to reinforce conceptual understanding of inference.
Such "teaching tips" are separated from the remainder of the text in a special environment like this one.
Throughout the vignette, we will be working with the mtcars data set within R. The data, extracted from the 1974 Motor Trend US magazine, contains the fuel efficiency (miles per gallon) and 10 additional characteristics of the vehicle design for 32 automobiles. More information on the variables can be obtained by typing help("mtcars") into the console.
The IntroAnalysis package can be installed from CRAN and then loaded using library().
install.packages("IntroAnalysis")
library(IntroAnalysis)
mtcars <- datasets::mtcars
set.seed(20180228)
The package considers teaching statistics through the lens of a data generating process. As a result, all analyses begin by specifying a model for this process. Once the model has been specified, the parameters in the model can be estimated (confidence intervals), or two competing processes can be compared (hypothesis testing). This process is reflected in the modeling phrases (functions) implemented in the package:
- specify_mean_model(): specifies the mean model for the data generating process
- estimate_parameters(): estimates the parameters of a mean model
- compare_models(): computes a p-value comparing two nested models
- obtain_diagnostics(): computes residuals and fitted values for assessing modeling conditions
- estimate_mean_response(): estimates the mean response of a model given a specified set of covariates

Consider the goal of constructing interval estimates for a parameter of interest.
Suppose we are interested in estimating the overall fuel efficiency (miles per gallon, mpg) of cars in 1974. The parameter of interest is then $$\mu = \text{mean fuel efficiency (mpg) for 1974 automobiles}.$$
This parameter defines the "mean model," the deterministic portion of the following model for the data generating process: $$(\text{MPG})_i = \mu + \epsilon_i$$
where $(\text{MPG})_i$ represents the fuel efficiency for the $i$-th vehicle and $\epsilon_i$ represents the error in the $i$-th observed fuel efficiency.
We emphasize that we always have two models --- one for the data generating process and one for the sampling distribution of the estimates. The data generating process addresses our goal of explaining the variability in the response variable as a function of our parameters of interest. The model of the sampling distribution allows us to perform inference on the parameters.
This model must be specified in R; this is done through the specify_mean_model() function.
fit_single <- specify_mean_model(mpg ~ 1, data = mtcars)
The specify_mean_model() function is a wrapper for lm(); there are three important differences:

- specify_mean_model() does not include an intercept by default; if an intercept is desired, it must appear explicitly in the formula (the 1 in mpg ~ 1).
- Instead of offset() in the formula interface, we recommend using constant() to allow for scalar offsets.
- The same function is used to specify both linear and generalized linear models (via the family argument, as illustrated for proportions below).

The primary reason we introduce this wrapper is that the function name itself links to the modeling process.
Additional Notes: lm() or glm() can be used in place of specify_mean_model() if you prefer. If you choose to use the lm() function, we recommend explaining it as a way of representing a model for the data generating process to the computer. While it clearly does a lot more than that, we simply tell students that it stores additional information regarding the model which is then accessed by helper functions.
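For example, the single-mean model above could equivalently be stored using lm(); a minimal sketch (the object name fit_single_lm is ours for illustration):

# Equivalent specification of the single-mean model using lm();
# per the note above, the helper functions used in this vignette accept this object.
fit_single_lm <- lm(mpg ~ 1, data = mtcars)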
Classical Inference: The classical approach to constructing an interval estimate for $\mu$ would be to begin with our point estimate $\widehat{\mu} = \bar{x}$ and then model the sampling distribution of $$T = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$ using a t-distribution with $n - 1 = 31$ degrees of freedom. Such a model requires we impose the following conditions on the error term:

- The errors have mean 0.
- The errors are independent of one another.
- The errors are identically distributed.
- The errors follow a Normal distribution.
As the goal here is to compute an interval estimate, we use the function estimate_parameters(). In addition to the model for the data generating process, we must also communicate the above conditions. As the first two conditions are imposed on any linear model (at least in the introductory course), these are always imposed by the function. The remaining assumptions, however, must be specified by the user.
estimate_parameters(fit_single, confidence.level = 0.95, assume.normality = TRUE)
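As a quick sanity check (a hand computation, not part of the package), the same classical interval can be obtained directly from the t-based formula above:

# Classical 95% CI for the mean computed by hand
xbar <- mean(mtcars$mpg)
s <- sd(mtcars$mpg)
n <- nrow(mtcars)
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)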
Of course, technically, we could relax the "identically distributed" condition and instead allow the variability of the errors to differ from one observation to another; however, as this is more common in a regression setting, we save that discussion until then.
The function reinforces that the confidence interval is an estimate of the unknown parameter.
The function outputs the following (stored in a data.frame):

- point.estimate: point (least squares) estimate of the parameter(s)
- standard.error: standard error of the estimate
- k% lower: lower limit of a k% confidence interval
- k% upper: upper limit of a k% confidence interval

We notice the output is similar to that provided by the common t.test() function in the stats package:
t.test(mtcars$mpg, conf.level = 0.95)
The primary difference is that t.test() combines hypothesis testing and interval estimation within the same interface, while we choose to separate these two in our package. Conversely, if we instead prefer to change the underlying conditions, t.test() is no longer applicable, whereas estimate_parameters() is, as we now see.
Inference via Bootstrap: Suppose we are unwilling to assume the data is consistent with the condition that the errors are Normally distributed. In that case, it may be inappropriate to model the sampling distribution using the t-distribution as above. Instead, we build an empirical model of the sampling distribution via bootstrapping. The standard error and associated confidence interval are then built upon this empirical model. Conceptually, we are implementing the same process: constructing an interval estimate based on a model for the sampling distribution. It is the conditions we are willing to impose which now differ. This suggests using the same interface with different parameters.
estimate_parameters(fit_single, confidence.level = 0.95, assume.normality = FALSE)
The output has the same form as before: we have a point estimate, associated standard error and accompanying confidence interval. However, without assuming the response is Normally distributed, the standard error and accompanying confidence interval are computed via a residual bootstrap.
We do not expect students to become familiar with the various bootstrapping algorithms. Instead, we focus on teaching bootstrapping as resampling the data and do not feel the need to discuss that we are implementing a residual bootstrap within the functions. Again, the focus is on conceptual understanding as opposed to knowing the details of a particular algorithm. Similarly, while many types of bootstrap intervals can be computed, we only teach the percentile interval.
From the student perspective, they are not focused on memorizing different functions but on which conditions they feel are appropriate given the data.
Additional Notes: As bootstrapping has an element of randomness, results are not inherently repeatable. However, you can use set.seed() in order to ensure results are repeatable. For this document, the seed was set a single time after loading the package and was set to 20180228.
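For example, to make a single bootstrap run repeatable, you might set the seed immediately before the call; a minimal sketch:

# Setting a seed immediately before a bootstrap computation makes it repeatable
set.seed(20180228)
estimate_parameters(fit_single, confidence.level = 0.95, assume.normality = FALSE)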
In 1974, automatic and manual transmissions were both common. Suppose we are interested in using the data to estimate the difference in the fuel efficiency, on average, between vehicles with automatic and manual transmissions in 1974. The parameter of interest is then $$\beta_1 = \mu_m - \mu_a$$
where $\mu_m$ and $\mu_a$ represent the mean fuel efficiency (mpg) for manual and automatic transmissions, respectively. This parameter is incorporated into the following data generating process: $$(\text{MPG})_i = \beta_0 + \beta_1 (\text{manual})_i + \epsilon_i$$
where
$$(\text{manual})_i = \begin{cases} 1 & \text{if } i\text{-th observation has a manual transmission} \\ 0 & \text{otherwise} \end{cases}.$$
We find students initially struggle with the idea of an indicator function. However, we have had success describing them as "light switches" which turn on or off depending on whether the statement is true.
In order to fully represent this model, we first perform a little bit of data cleaning.
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
Again, we recommend the tidyverse package for data manipulation whenever possible instead of the above, especially for introductory students.
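For instance, the same recoding done tidyverse-style (assuming dplyr is installed) might look like the following sketch:

# tidyverse-style equivalent of the base R recoding above
library(dplyr)
mtcars <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))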
The model can be represented using the specify_mean_model() function.
fit_two <- specify_mean_model(mpg ~ 1 + am, data = mtcars)
Notice that unlike lm(), using specify_mean_model(mpg ~ am, data = mtcars) would not have resulted in the same model. This decision was made to force students to think about the intercept as a parameter in the process, instead of accepting it by default.
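To make the indicator variable concrete for students, it can help to inspect the design matrix. Since specify_mean_model() wraps lm(), we assume (our assumption, not documented here) that the fitted object supports model.matrix():

# Peek at the design matrix; the dummy column is 1 for manual transmissions
head(model.matrix(fit_two))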
Classical Inference: The classical approach to constructing an interval estimate for $\beta_1$ would be to begin with our point estimate $\widehat{\beta}_1$, the least squares estimator, and then model the sampling distribution of $$T = \frac{\widehat{\beta}_1 - \beta_1}{SE\left(\widehat{\beta}_1\right)}$$ using a t-distribution with $n - 2 = 30$ degrees of freedom. Such a model requires we impose the following conditions on the error term:

- The errors have mean 0.
- The errors are independent of one another.
- The variability of the errors is the same for both groups.
- The errors follow a Normal distribution.
Again, this can be accomplished within estimate_parameters().
estimate_parameters(fit_two, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = TRUE)
These are the same intervals that would be produced using confint() or t.test() with var.equal = TRUE.
Inference via Bootstrapping: Instead of imposing all the conditions above, consider choosing to relax these and impose only the following:

- The errors have mean 0.
- The errors are independent of one another.
Turning off these flags in the estimate_parameters() function produces different interval estimates.
estimate_parameters(fit_two, confidence.level = 0.95, assume.constant.variance = FALSE, assume.normality = FALSE)
In order to accommodate these relaxed conditions, we use a wild bootstrap to compute the standard error and confidence intervals for each parameter.
Again, students are aware that a different bootstrapping procedure is being implemented but need not worry about the details of the algorithm. We want them to focus on the conditions they are imposing.
Additional Notes: Notice that we avoided the common "two-sample Satterthwaite interval" for the difference in means, in which normality is assumed but constant variance is not. Our goal in the course is not to teach a multitude of tools but instead to reinforce the idea that the model for the sampling distribution depends upon the conditions imposed. If you specify that normality can be assumed (assume.normality = TRUE) but constant variance cannot (assume.constant.variance = FALSE), then the Huber-White method (see ?car::hccm) is used to adjust the variance-covariance matrix prior to constructing the confidence interval. We choose this form of the variance-covariance matrix instead of the Satterthwaite approach because it handles a broader class of problems.
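That mixed set of conditions corresponds to the following call (a sketch using the same flags documented above):

# Normality assumed, constant variance relaxed: the Huber-White adjustment is used
estimate_parameters(fit_two, confidence.level = 0.95,
                    assume.constant.variance = FALSE, assume.normality = TRUE)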
Suppose instead of estimating the difference in the two means, you simply wanted to place a confidence interval on each mean individually. As our focus is on understanding the process, and not manipulating R output, we approach this by re-defining the model for the data generating process:

$$(\text{MPG})_i = \mu_a (\text{automatic})_i + \mu_m (\text{manual})_i + \epsilon_i$$
where
$$\begin{aligned} (\text{automatic})_i &= \begin{cases} 1 & \text{if } i\text{-th observation has an automatic transmission} \\ 0 & \text{otherwise} \end{cases} \\ (\text{manual})_i &= \begin{cases} 1 & \text{if } i\text{-th observation has a manual transmission} \\ 0 & \text{otherwise} \end{cases}. \end{aligned}$$
This model is specified and the parameters estimated (under the classical conditions) below.
fit_two_alt <- specify_mean_model(mpg ~ am, data = mtcars)
estimate_parameters(fit_two_alt, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = TRUE)
By this point, it should seem straightforward how to compute a confidence interval within a regression setting. We believe our students also feel this way. The process is now identical to what we have seen in other settings. The focus is then on describing the data generating process and the conditions under which the model for the sampling distribution should be constructed.
For example, suppose we are concerned (rightly so, given this is observational data) that the difference in the average fuel efficiency we estimated above could be due to confounding. In addition to the transmission type, we would like to account for the horsepower of the vehicle. A model for this process is $$(\text{MPG})_i = \beta_0 + \beta_1 (\text{manual})_i + \beta_2 (\text{HP})_i + \epsilon_i$$
This model is specified and the parameters estimated as follows.
fit_reg <- lm(mpg ~ 1 + am + hp, data = mtcars)
estimate_parameters(fit_reg, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = TRUE)
The benefit of estimate_parameters() over summary() in our mind is that we are not combining p-values and confidence intervals into the same output. Instead, we clearly separate the inferential goals of estimation and hypothesis testing via p-values. We also have a single interface for addressing bootstrapping.
estimate_parameters(fit_reg, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = FALSE)
Here, a residual bootstrap is performed in order to account for this particular mix of conditions imposed.
Additional Notes: We intentionally chose not to implement a default confidence level, and we chose the default assumptions to be FALSE for constant variance and normality. Again, we have introductory students in mind, and we want them to think about an appropriate confidence level, to think about what conditions they will impose on the model, and to recognize these as assumptions.
While we may favor confidence intervals for inference, hypothesis testing is also important. We find this is easiest to motivate in the Analysis of Variance (ANOVA) setting.
Suppose we are interested in comparing the fuel efficiency, on average, across cars with a different number of cylinders. Specifically, consider testing the hypothesis $$H_0: \mu_4 = \mu_6 = \mu_8 \qquad \text{vs.} \qquad H_1: \text{At least one } \mu_j \text{ differs}$$
where $\mu_4$, $\mu_6$, and $\mu_8$ are the average fuel efficiency (mpg) for vehicles with 4, 6, and 8 cylinders, respectively.
Our approach to hypothesis testing is to emphasize that we are comparing two models for the data generating process.
The above hypotheses suggest two different models for the data generating process. Under the null hypothesis, there is a single mean response $\mu$. Therefore, the data generating process has the form $$\text{Model under } H_0: (\text{MPG})_i = \mu + \epsilon_i$$
Under the alternative hypothesis, we place no restrictions on the three parameters. Therefore, we have a model of the form $$ \text{Model under } H_1: (\text{MPG})_i = \mu_4 (\text{Cyl 4})_i + \mu_6 (\text{Cyl 6})_i + \mu_8 (\text{Cyl 8})_i + \epsilon_i $$
where
$$(\text{Cyl K})_i = \begin{cases} 1 & \text{if } i\text{-th vehicle has } K \text{ cylinders} \\ 0 & \text{otherwise} \end{cases}.$$
Testing the above hypotheses corresponds to comparing these two models for the data generating process and asking if the simpler model is sufficient for explaining the variability in the response. Since we are comparing models, we use the compare_models() function. This function requires both the "full" model (under the alternative) and the "reduced" model (under the null). Before specifying these models, we again do a little cleaning.
mtcars$cyl <- factor(mtcars$cyl, levels = c(4, 6, 8))
Now, we fit both models.
fit_anova1 <- specify_mean_model(mpg ~ cyl, data = mtcars)
fit_anova0 <- specify_mean_model(mpg ~ 1, data = mtcars)
Forcing students to fit both models reinforces the key idea of hypothesis testing. Instead of relying on the default test produced by many programs, we ask students to critically consider how the hypothesis changes their explanation for how the data was generated.
Classical Inference: Under the traditional ANOVA framework, we impose the following conditions in order to test the above hypotheses:

- The errors have mean 0.
- The errors are independent of one another.
- The variability of the errors is the same for each group.
- The errors follow a Normal distribution.
We communicate these conditions to the compare_models() function.
compare_models(fit_anova1, fit_anova0, assume.constant.variance = TRUE, assume.normality = TRUE)
The compare_models() function produces a data.frame which takes the form of a classical ANOVA table with the following elements:

- source: an indicator of which component of variability is being summarized
- df: the degrees of freedom associated with each component (may not add to the total if not all terms are tested)
- ss: the sum of squares for each component of variability (may not add to the total if not all terms are tested)
- ms: the mean square associated with each component (used to determine the test statistic)
- standardized.statistic: the F-statistic comparing the two models
- p.value: the p-value associated with the test

We use the "Additional Terms" phrase to indicate the terms placed in the full model that are excluded in the model under the null hypothesis. We notice the output is similar to that provided by the common summary() after running aov() from the stats package:
summary(aov(mpg ~ cyl, data = mtcars))
The anova() function in the stats package provides similar functionality to compare_models() but is more suitable to a course on linear models. When working with introductory students, we find the classical ANOVA table more appealing for discussing the various sources of variability.
anova(fit_anova0, fit_anova1)
A key limitation of the aov() function is that it is specific to ANOVA, while compare_models() is not. Both the aov() approach and the anova() approach are also limited to classical inference; compare_models() allows for inference via bootstrapping.
Inference via Bootstrapping: Many traditional curricula do not allow for relaxing the conditions of the classical ANOVA framework. However, if students are familiar with bootstrapping, this can be extended to comparing multiple means.
The focus is again on understanding that the model for the null distribution is determined by the conditions we are willing to impose. Instead of the F-distribution, we can construct an empirical model via bootstrapping.
Suppose we are only willing to impose the following conditions:

- The errors have mean 0.
- The errors are independent of one another.
- The variability of the errors is the same for each group.
This can be done by changing the conditions within the compare_models() function.
compare_models(fit_anova1, fit_anova0, assume.constant.variance = TRUE, assume.normality = FALSE)
The output is similar to what we had above. The only difference is the p-value itself.
This emphasizes that the test statistic is simply a summary of the data; it is the p-value which is dependent upon the model for the null distribution.
If we wish to relax the assumptions further, we can remove the constraint that the variability of the errors be similar for all groups.
compare_models(fit_anova1, fit_anova0, assume.constant.variance = FALSE, assume.normality = FALSE)
Additional Notes: Similar to when we computed interval estimates in the two-sample setting, we do not implement Welch's F-test (the generalization of the Satterthwaite approximation in the two-sample setting). However, the Huber-White adjustment is available if you are unwilling to assume constant variance but are willing to assume normality.
We note that in the above case, the p-value is identical regardless of whether a residual bootstrap or wild bootstrap is implemented. This is because the p-value is defined as $$p = \frac{\#\left\{\text{test statistics} \geq \text{observed test statistic}\right\}}{b + 1}$$ where the count in the numerator includes the observed test statistic and $b$ is the number of bootstrap samples taken. In this case, none of the bootstrap test statistics were greater than the observed; that is, only the observed test statistic was at least as large as itself, so the numerator is 1. The denominator is 5000 since the default value is $b = 4999$. Therefore, the p-value given by the bootstrap procedure is always a bit conservative. See the discussion in Boos and Stefanski[^BoosStefanski] within the Bootstrap chapter for more details (and the "99" rule).
Finally, we implemented the above models within specify_mean_model() using a parameterization for the mean of each group directly. The results are, of course, equivalent if we fit the full model as
fit_anova1 <- specify_mean_model(mpg ~ 1 + cyl, data = mtcars)
Estimating the parameters differs, but comparing the models is equivalent.
Consider predicting the fuel efficiency as a function of both the horsepower of the vehicle as well as the number of cylinders it has. The model could be written as $$(\text{MPG})_i = \beta_0 + \beta_1 (\text{HP})_i + \beta_2 (\text{Cyl 6})_i + \beta_3 (\text{Cyl 8})_i + \epsilon_i$$
Suppose we are interested in testing whether the fuel efficiency is linearly related to the horsepower after accounting for the number of cylinders the vehicle has. This would equate to the hypotheses: $$H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_1: \beta_1 \neq 0$$
Under the null hypothesis, the following model for the data generating process is assumed: $$(\text{MPG})_i = \gamma_0 + \gamma_2 (\text{Cyl 6})_i + \gamma_3 (\text{Cyl 8})_i + \epsilon_i$$
Classical Inference: Suppose we are willing to impose the standard conditions on the error term, namely:

- The errors have mean 0.
- The errors are independent of one another.
- The variability of the errors is constant.
- The errors follow a Normal distribution.

Then, the test statistic comparing these two models follows an F-distribution with 1 numerator and $n - 4 = 28$ denominator degrees of freedom. This is obtained via compare_models() below. The process is similar to before, and we are asking students to really consider the two models being compared.
fit_reg1 <- specify_mean_model(mpg ~ 1 + hp + cyl, data = mtcars)
fit_reg0 <- specify_mean_model(mpg ~ 1 + cyl, data = mtcars)
compare_models(fit_reg1, fit_reg0, assume.constant.variance = TRUE, assume.normality = TRUE)
This p-value could have been obtained readily using summary(fit_reg1) and the more traditional "t-test" of this hypothesis. However, summary() also presents many different hypotheses tested simultaneously, which often leads students to over-interpret the results instead of first understanding the process of hypothesis testing.
Forcing students to consider each hypothesis forces them to think about the questions when developing the models instead of fitting first and then asking questions after looking at the output.
An advantage of compare_models() is that it provides a seamless framework for more complex hypotheses, which summary() does not readily allow. For example, consider testing instead whether the number of cylinders a car has is related to the fuel efficiency, after accounting for the horsepower. Then, our hypothesis would have been
$$H_0: \beta_2 = \beta_3 = 0 \qquad \text{vs.} \qquad H_1: \text{at least one } \beta_j \text{ not equal to 0}$$
This could be tested as follows:
fit_reg0_alt <- specify_mean_model(mpg ~ 1 + hp, data = mtcars)
compare_models(fit_reg1, fit_reg0_alt, assume.constant.variance = TRUE, assume.normality = TRUE)
The primary advantage of compare_models() is that there is a single interface for classical inference and inference via bootstrap.
Inference via Bootstrap: Consider the last hypothesis tested above (relationship between fuel efficiency and the number of cylinders). Suppose we are only willing to impose the following conditions:

- The errors have mean 0.
- The errors are independent of one another.
Then, under these conditions, the p-value can be computed via a wild bootstrap procedure.
compare_models(fit_reg1, fit_reg0_alt, assume.constant.variance = FALSE, assume.normality = FALSE)
Hypothesis testing is typically introduced in the case of a single mean, where we prefer inference be conducted via confidence intervals; in the case of ANOVA, inference via a p-value is more easily motivated, in our opinion. However, for completeness, we present the code below for testing a hypothesis of the form

$$H_0: \mu \geq 23 \qquad \text{vs.} \qquad H_1: \mu < 23$$

where $\mu$ represents the mean fuel efficiency (mpg) for all 1974 vehicles. From a classical perspective, this is typically done using a one-sample t-test. This can be implemented via compare_models(). Again, we focus on the data generating processes being compared.
The model for the data generating process under the alternative is quite simple: $$(\text{MPG})_i = \mu + \epsilon_i$$ which is translated into R via specify_mean_model() as follows.
fit_single1 <- specify_mean_model(mpg ~ 1, data = mtcars)
Under the null hypothesis, the model is easy to write down: $$(\text{MPG})_i = 23 + \epsilon_i$$
This model can be translated into R using the constant() notation within the specify_mean_model() function. Anything inside constant() is treated as a constant term in the data generating process, and no parameter is specified.
fit_single0 <- specify_mean_model(mpg ~ constant(23), data = mtcars)
Keep in mind that the intercept is not forced into the model with the specify_mean_model() function, and therefore this model corresponds closely with the mathematical equation expressed above. This is one of the benefits of specify_mean_model(). Specifying this same model in lm() is not intuitive, as the offset() function does not allow scalars:
fit_single0 <- lm(mpg ~ -1, offset = rep(23, length(mpg)), data = mtcars)
Now, the two models can be compared.
compare_models(fit_single1, fit_single0, alternative = 'less than', assume.constant.variance = TRUE, assume.normality = TRUE)
This matches the output of t.test(); again, the benefit is that it can be easily extended by relaxing the conditions and implementing a bootstrap procedure. Note that since we are not testing a "not equal to" alternative, we must explicitly specify the alternative in order to obtain the one-sided p-value.
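For reference, the classical version via t.test() is:

# One-sample t-test of H0: mu >= 23 vs H1: mu < 23
t.test(mtcars$mpg, mu = 23, alternative = "less")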
Additional Notes: The use of the constant() notation can be helpful in specifying more complex hypotheses, such as testing whether the slope in a regression model takes on a particular value, as sketched below. Again, the notation is meant to link directly to the model a student would write down using mathematical terms.
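As an illustration, the sketch below tests whether the slope on horsepower equals a particular (hypothetical) value, -0.05. It assumes constant() accepts expressions involving a variable, consistent with the description above that anything inside constant() is treated as a constant term; this is our assumption, not something demonstrated in this vignette:

# Sketch: testing H0: beta_1 = -0.05 by moving the hypothesized slope into constant()
fit_slope1 <- specify_mean_model(mpg ~ 1 + hp, data = mtcars)
fit_slope0 <- specify_mean_model(mpg ~ 1 + constant(-0.05 * hp), data = mtcars)
compare_models(fit_slope1, fit_slope0,
               assume.constant.variance = TRUE, assume.normality = TRUE)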
The above sections illustrate the primary components of the package, but other features are available and discussed in this section.
In order to make a link between the data generating process and making data summaries, the summarize_variable() and summarize_relationship() functions were constructed.
Consider computing the mean fuel efficiency and the standard deviation of the fuel efficiency for vehicles with the same type of transmission. This is done using summarize_variable():
summarize_variable(mpg ~ am, data = mtcars, mean, sd)
The summarize_variable() function takes a series of (standard R) functions that should be applied to the variable. If a function requires additional parameters, they can be specified using the .args argument; you just need to ensure the function returns only a single value.
summarize_variable(mpg ~ am, data = mtcars, quantile, .args = list(probs = c(0.25)))
Computing the correlation between two variables, such as the fuel efficiency and the horsepower, can be done using the summarize_relationship() function.
summarize_relationship(mpg ~ hp, data = mtcars, FUN = cor)
This function has a different syntax (more like apply()) in order to address the increased complexity of the functions that might be used.
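For comparison, base R returns the same summary directly:

# Base R equivalent of the summarize_relationship() call above
with(mtcars, cor(mpg, hp))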
Making the connection between the construction of confidence intervals and the underlying model for the sampling distribution is a critical learning objective for introductory students. The same can be said regarding the connection between the computation of a p-value and the underlying model for the null distribution. While the functions estimate_parameters() and compare_models() discussed above try to emphasize this connection through the specification of their parameters (assume.normality and assume.constant.variance), it can be helpful to plot the sampling distribution.
Consider the problem of estimating the mean fuel efficiency. As we have seen, the confidence interval (under the classical assumptions) is constructed as follows:
fit_single1 <- specify_mean_model(mpg ~ 1, data = mtcars)
samp_distn <- estimate_parameters(fit_single1, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = TRUE)
samp_distn
In order to visualize the corresponding model for the sampling distribution, we can use the plot_sampling_distribution() function in conjunction with the information stored within the output from estimate_parameters().
plot_sampling_distribution(samp_distn)
Built on top of ggplot2, this function returns a density estimate of the underlying sampling distribution. In cases in which a specific probability model is known (such as this one), a random sample is taken from the known distribution and a summary of that random sample is what is shown. In cases in which the confidence interval was constructed via bootstrapping, the density plot of the bootstrap statistics is given. Additional parameters to the function allow you to display the confidence interval or restrict which sampling distributions are displayed (when the linear model has more than one parameter).
Similarly, we can plot the null distribution for a specific test with the function plot_null_distribution().
fit_single0 <- specify_mean_model(mpg ~ constant(23), data = mtcars)
null_distn <- compare_models(fit_single1, fit_single0, alternative = 'not equal', assume.constant.variance = TRUE, assume.normality = TRUE)
plot_null_distribution(null_distn)
We note that since we always use a test statistic of the form $\frac{MSR}{MSE}$, the null distribution will not be symmetric but follows (at least approximately) an F-distribution.
Consider predicting the fuel efficiency as a function of both the horsepower of the vehicle as well as the number of cylinders it has. The model could be written as $$(\text{MPG})_i = \beta_0 + \beta_1 (\text{HP})_i + \beta_2 (\text{Cyl 6})_i + \beta_3 (\text{Cyl 8})_i + \epsilon_i$$
We have already discussed how the parameters of this model are estimated:
fit_reg <- specify_mean_model(mpg ~ 1 + hp + cyl, data = mtcars)
estimate_parameters(fit_reg, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = TRUE)
However, suppose we are primarily interested in estimating the average fuel efficiency for a vehicle with 123 horsepower and a 6-cylinder engine. This can be done using estimate_mean_response().
estimate_mean_response(fit_reg, confidence.level = 0.95, assume.constant.variance = TRUE, assume.normality = TRUE, hp = 123, cyl = factor(c("6"), levels = c("4", "6", "8")))
In order to make residual diagnostic plots, we must obtain the residuals and fitted values, which we do through the function obtain_diagnostics(), a wrapper for the augment() function in the broom package. The primary reason for providing a wrapper is to provide a clear "phrase" for identifying the function.
diagnostics_mtcars <- obtain_diagnostics(fit_reg, data = mtcars)
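From here, a standard residual plot can be built. A minimal sketch, assuming obtain_diagnostics() returns the broom-style columns .fitted and .resid (it wraps broom::augment(), so this is a plausible but unverified assumption):

# Residuals versus fitted values, assuming broom-style column names
library(ggplot2)
ggplot(diagnostics_mtcars, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")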
Many introductory courses cover inference ranging from a single proportion to comparing two proportions. For example, suppose we are interested in determining if there is evidence that more than half of cars produced in 1974 had an automatic transmission. That is, we are interested in testing $$H_0: p \leq 0.5 \qquad \text{vs.} \qquad H_1: p > 0.5$$
where $p$ is the proportion of vehicles with an automatic transmission. Letting $Y_i$ take on the value 1 if the $i$-th vehicle has an automatic transmission (and 0 otherwise), the model for our data generating process is $$Pr\left(Y_i = y\right) = p^y (1 - p)^{1-y}$$
That is, $Y_i \sim Bin(1, p)$ and $E\left(Y_i\right) = p$. Our mean model thus has a single parameter but a different distributional family than when working with the linear model. This distributional family needs to be communicated when specifying the mean model.
fit_p <- specify_mean_model((am=="automatic") ~ 1, data = mtcars, family = binomial)
Notice that the left-hand side of the formula specifies which event we are interested in modeling. We do not ask the user to switch to a different function (glm() for logistic regression instead of lm()) but instead specify the model using the same function, specify_mean_model(), to illustrate that while the distributional family is now being specified, the essential process is the same. We can now estimate the mean response (since that is the parameter of interest).
estimate_mean_response(fit_p, confidence.level = 0.95, Intercept = 1)
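For reference, the point estimate should agree with the sample proportion computed directly:

# Sample proportion of automatic transmissions, for comparison
mean(mtcars$am == "automatic")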
The parameter(s) of the logistic regression model can also be estimated.
estimate_parameters(fit_p, confidence.level = 0.95)
[^BoosStefanski]: Boos, D.D. and Stefanski, L.A. (2013). Essential Statistical Inference: Theory and Methods. Springer-Verlag, New York.