library(learnr)
knitr::opts_chunk$set(echo = FALSE, knitr.duplicate.label = "allow")
library(tidyverse)
library(ISLR)
library(janitor)

sales <- read_csv("sales.csv") %>%
  clean_names() %>%
  mutate(price = (price * 4) %>% round(2),
         coupon = coupon + displaycoupon,
         coupon = factor(ifelse(coupon == 0, "no", "yes"), levels = c("no", "yes")),
         display = display + displaycoupon,
         display = factor(ifelse(display == 0, "no", "yes"), levels = c("no", "yes")),
         sales = ifelse(price < 4, sales + 100,
                        ifelse(price < 4.5 & price > 4, sales + 50, sales)) %>% round(2)) %>%
  dplyr::select(-obs, -displaycoupon)

model <- lm(sales ~ price, data = sales)
coupon_model <- lm(sales ~ coupon, data = sales)
errors <- data.frame(residuals = resid(model), fitted = fitted(model))
This tutorial introduces simple linear regression. "Simple" here means: a statistical model of a continuous outcome or target variable---hence a regression rather than a classification model---with just one predictor variable. Often the predictor is continuous, as in the videos for this module, which covered regression basics. But the predictor can also be categorical, a factor variable, in which case the interpretation of the model output changes slightly.
We will review the concepts from the lecture video, then work through two example applications using simple linear regression: one with a continuous predictor, our main example, and one with a categorical predictor. The goal is to explain how to understand and interpret the model output in both situations. The discussion will touch on the topics we covered in the module on statistical inference, since the statistical test for the slope coefficient in a simple linear regression is, in fact, a t-test. Moreover, simple linear regression with a categorical predictor is, like a t-test, a method for comparing group means.
A market response model relates inputs such as price and advertising to an output like sales, profit, customer preference or brand awareness. The goal is to understand how customers react to marketing activities and to use that information to adjust product prices, optimize marketing strategies, and plan future marketing activities. Consider price promotion. In the simplest case, we can assume that the relationship between price and sales is linear, meaning that it can be usefully described by a line defined by two parameters: an intercept and a slope. Then, using historical data and the least squares algorithm that fits a linear regression model, we can estimate those parameters. The results---in particular, the regression equation---can be used to inform decision-making. Our example will be weekly beer sales.
The following data set represents aggregate beer sales for 124 weeks at a grocery store. Start, as always, by inspecting the data:
# Inspect
glimpse(sales)
summary(sales)
In addition to price per unit and sales (denominated in 100s), this dataset contains information on marketing campaigns during each week: factor variables indicate whether there was a display or a coupon promotion in a given week. A multiple regression model would combine this information with price to model sales. As noted above, we will focus on bivariate relationships using simple linear regression.
We will first consider a simple linear regression model of sales using a continuous predictor, price. Then, in order to compare simple linear regression methodologically with a t-test, we will conclude by fitting a model of sales using a categorical predictor, coupon.
Are changes in price associated with changes in sales?
Let's visualize the relationship. Visualization is always a good idea before fitting the model! By convention we put the response (or target or outcome) variable on the y-axis and the predictor (or explanatory) variable on the x-axis. ggplot very usefully allows us to add a summary least squares line to the data with one line of code using geom_smooth(). The default summary line in geom_smooth() is a local regression called LOESS; to get a least squares line we need to specify method = "lm". ggplot also adds standard error bands by default, which can be turned off with se = F. These bands, and standard errors in general, indicate estimation uncertainty, telling us how much we might expect the regression line to vary under repeated sampling.
# Visualize sales ~ price
ggplot(sales, aes(price, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(title = "Sales ~ price")
There is, clearly, quite a bit of non-linearity in this relationship, with bumps in sales at certain prices and one large outlier (perhaps a holiday). The plot makes it clear that there is likely a lot more to sales than just price. Nevertheless, the plot suggests a slight negative relationship between sales and price, such that increases in price---as we would expect---are associated with decreases in sales. If we remove the outlier for purposes of visualization, the relationship becomes clearer:
# Visualize sales ~ price without outlier
sales %>%
  filter(sales < 1000) %>%
  ggplot(aes(price, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(title = "Sales ~ price")
Can we say that price causes this decrease in sales? Not at all. This is one of the drawbacks of a simple linear model: with just one explanatory variable we cannot take other possible explanations of sales off the table. Consequently, our ability to do causal inference---to argue that lower sales are caused by higher prices---is limited. In this case we can see that there must be other influences on sales in addition to price---so much so that a linear summary of the relationship is not very satisfying.
Fitting a linear response model in R is very straightforward. The workhorse function for linear modeling is lm(), which uses the so-called "formula syntax," y ~ x, that is nearly universal among R model functions and that mimics the conceptual formulation of the regression function. (y is usually used to represent the target or response variable in a regression, while x represents the explanatory or predictor variable.) We use lm() as follows:
# Fit the model and save as an object
(model <- lm(sales ~ price, data = sales))
The output from the lm() function alone is pretty minimal, providing only information about the call and the coefficient estimates. In this case we can see that the intercept is 659.9 and the coefficient for price is -111.4. The model equation is $\widehat{Sales} = 659.9 - 111.4 \times price$. (The hat is used to indicate that this is an estimate.) This equation defines the regression line, which we will also call the model's fitted values.
Let's review what the numbers in the model equation mean. But first a word about notation.
The so-called least squares algorithm used to fit a linear regression finds the regression line that minimizes the residual sum of squares or RSS. Like all lines in Cartesian space, the regression line can be fully described by its intercept and slope parameters, conventionally represented as $\hat\beta_0$ and $\hat\beta_1$. ($\beta$ is the Greek letter pronounced "beta." For the $\hat\beta$ symbol we say "beta hat.") These intercept and slope parameters are estimates, from a sample, of population values. We define RSS as $\sum_{i=1}^{n} (y_i - \hat{y_i})^2$, or, equivalently, $\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$. (Remember that $\hat{y_i} = \hat\beta_0 + \hat\beta_1 x_i$.) The residuals are just the differences between the observed values of $y$ and the regression line, defined as $\hat y = \hat\beta_0 + \hat\beta_1 x$.
The summation symbol, $\sum$, used in the above paragraph may be unfamiliar. The concept is simple. $\sum_{i=1}^{n}(y_i - \hat{y_i})^2$ means to sum $(y_i - \hat{y_i})^2$ over all the rows, indexed by $i$, from 1 to $n$. You can think of it as a loop that cumulatively performs a sum over all the rows. Variables not indexed by $i$, like the intercept and slope coefficients, $\hat\beta_0$ and $\hat\beta_1$, are constants: they do not change by row.
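To make the summation concrete, here is a small sketch that computes the RSS for the fitted model both ways: with an explicit loop over the rows and with the vectorized equivalent (assuming the sales data and model object defined above):

# RSS computed with an explicit loop over the rows
rss_loop <- 0
for (i in 1:nrow(sales)) {
  rss_loop <- rss_loop + (sales$sales[i] - fitted(model)[i])^2
}
rss_loop

# The same quantity, computed in one line
sum(resid(model)^2)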
The intercept in a linear model represents estimated average sales when the input, in this case price, equals 0. Clearly, the intercept in this model is not very meaningful, since price never will equal zero. One way to make the intercept meaningful would be to center price by subtracting average price from each observed price. That would make average price 0, and the intercept would then be interpretable: average sales at average price. In the next module we will cover how and when to center inputs.
The coefficient for price represents estimated average change in the outcome associated with a one unit change in the input. It is a slope. In this case the coefficient is negative, which fits with the summary line we added to the plot above. One of the virtues of a linear model is that it quantifies relationships precisely. Here, the coefficient indicates that a one unit increase in price was associated with an average drop in sales of 111.4 (in 100s). The slope coefficient in a regression can thus be thought of as an effect size, indicating how strong the relationship is between predictor and outcome.

It is important to keep the estimated slope coefficient in perspective. Does it represent a large effect? We must interpret the coefficient in the context of sales generally. We saw above that maximum sales was over 1400 and the mean was 144. So a drop of 111 represents a modest change with respect to typical sales, particularly given that the entire range of price is only about 1.5. And, indeed, we can see from the plot that the line is not very steep. Such determinations about whether the effect is large or small can only be made with respect to a particular analytic context. What is small in one context might be large in another, and vice versa.
Consider the following example regression:
reg_ex <- data.frame(x = c(0, 3, 7, 10), y = c(5, 5, 27, 31))
# lm(y ~ x, data = reg_ex)

ggplot(reg_ex, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(title = "y ~ x")
Each gridline represents an increment of 5 on the y-axis. Here is a table of values:
library(knitr)
reg_ex$residuals <- resid(lm(y ~ x, data = reg_ex))
reg_ex[, 1:2]
quiz( question("What are the approximate residuals for this regression?", answer("3, 6, 4, 1", message =" Residuals below the regression line or negative, while those above are positive. The answer should be: 3, -6, 4, -1."), answer("3, -6, 4, -1", correct = TRUE, message ="Residuals below the regression line or negative, while those above are positive.") ) )
quiz( question("What is the residual sum of squares (RSS) for this regression? Use the residuals from the previous question.", answer("0", message ="Square the residuals (3, -6, 4, -1) and add them together: 9 + 36 + 16 + 1 = 62."), answer("62", correct = TRUE, message ="Square the residuals (3, -6, 4, -1) and add them together: 9 + 36 + 16 + 1 = 62."), answer("14", message ="Square the residuals (3, -6, 4, -1) and add them together: 9 + 36 + 16 + 1 = 62.") ) )
quiz( question("A approximate equation for this regression line would be:", answer("Impossible to determine", message ="The intercept is the value of y when x equals 0. The slope is the change in y associated with a one unit change in x. Hence y = 2 + 3x"), answer("y = 5 + 12x", message ="The intercept is the value of y when x equals 0. The slope is the change in y associated with a one unit change in x. Hence y = 2 + 3x"), answer("y = 2 + 3x", correct = TRUE, message ="The intercept is the value of y when x equals 0. The slope is the change in y associated with a one unit change in x. Hence y = 2 + 3x"), answer("y = 2 + x", message ="The intercept is the value of y when x equals 0. The slope is the change in y associated with a one unit change in x. Hence y = 2 + 3x") ) )
Information can be extracted from the model using the generic summary() function.
# Obtain more information with summary()
model %>% summary
Let's take each of these elements in turn, starting with residuals.
Residuals, denoted $e_i$, represent the difference between the linear regression line, or fitted values from the model, and the actual observations: $e_i = y_i - \hat{y_i}$, or, equivalently, $e_i = y_i - \hat{\beta_0} - \hat{\beta_1} x_i$. (We index each residual, observation and fitted value with $i$ because these are row-specific values, not summaries like the coefficients.) Here is a visualization of the residuals for the market response model (minus the one large outlier):
newsales <- sales
newsales$fitted <- predict(model)

newsales %>%
  filter(sales < 1000) %>%
  ggplot(aes(price, sales)) +
  geom_line(aes(price, fitted), col = 2) +
  # geom_smooth(method = "lm", se = FALSE, col = 2) + # Plot regression slope
  geom_segment(aes(yend = fitted, xend = price), alpha = .2) +
  geom_point() +
  theme_minimal() +
  labs(title = "Residuals for sales ~ price")
The vertical lines represent the residuals.
As you can see from this plot, adding the residual to the fitted value gives you the observed value: $y_i = \hat{y_i} + e_i$ or $y_i = \hat{\beta_0} + \hat{\beta_1}x_i + e_i$.
Also called "model errors," residuals tell us exactly where the model is making mistakes. Ideally, for a model that fits the data well the residuals will be normally distributed, with mean 0 and roughly symmetrical minimum and maximum. In other words, residuals should be distributed $N(0, \sigma^{2})$. This is one of the assumptions of a linear model.
The output for the market response model shows that the max residual is much larger than the minimum residual (1285.79 vs. -121.96), whereas they should be symmetrical about 0 (that's what $N(0, \sigma^{2})$ means). This indicates that the data contains large sales observations that are not well described by the model. The median residual is negative, which tells us that the majority of observations lie below the regression line and that the mean---guaranteed to be 0 by the least squares fitting procedure---is getting pulled away from the median by large positive outliers.
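Those summary statistics for the residuals can also be pulled out directly with resid() (a quick check on the model object fitted above):

# Five-number summary (plus mean) of the residuals
summary(resid(model))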
To analyze the residuals, it helps to visualize them. In a conventional residual plot, the fitted values go on the x-axis and residuals on the y-axis.
# Two ways to visualize residuals:

# The built in way:
plot(model, which = 1)

# Hand-coded way
errors <- data.frame(residuals = resid(model), fitted = fitted(model))

ggplot(errors, aes(fitted, residuals)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Residual plot for sales ~ price")
There are several points to make about the above code:
resid() or residuals() will extract the residuals from a model object.

fitted() or predict() will extract the fitted values from the model object. (predict() is the more general function.) By including the newdata argument to predict() we can use the model---which is actually just the model equation---to compute predicted values of the outcome given new inputs. For example, what is predicted sales for a price of 4?
# Use predict() to estimate sales at price = 4
predict(model, newdata = data.frame(price = 4))
This is just the value of sales when price is 4 as defined by the model equation: 659.9 - 111.4 * 4 = 214.3.
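The same arithmetic can be done from the stored coefficients rather than typing the numbers in by hand (a quick sketch using the model object above):

# Predicted sales at price = 4, computed from the coefficients
coef(model)["(Intercept)"] + coef(model)["price"] * 4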
When the newdata argument is not included, predict() returns the fitted values of the model and is exactly equivalent to fitted().
# Compare predict() to fitted()
predict(model) %>% head
fitted(model) %>% head
In simple linear models it is easy to anticipate what the residuals will look like based on a bivariate plot and summary linear regression line. We've noted the non-linearity in the data. Those regions where the linear model failed to capture bumps in sales---where the observations were far away from the regression line---will produce large positive residuals. In general, this model seems to be doing better with smaller rather than larger sales volumes. The built-in method for plotting model residuals, plot(model, which = 1), actually labels the large residuals with row numbers to make further investigation convenient.
We can also look at a histogram of the residuals, which, again, for a model with good fit, should be centered at 0 and normal in shape.
# Histogram of residuals
ggplot(errors, aes(residuals)) +
  geom_histogram() +
  geom_vline(xintercept = mean(residuals(model)), col = 2) +
  theme_minimal() +
  labs(title = "Histogram of residuals")
This model, as noted, does not fit the data very well. The histogram is right-skewed and not symmetric around 0, with that one problematic outlier.
RMSE is probably the most widely used measure of model fit and is reported in the model summary as the residual standard error. Model fit metrics for regression are generally based on the residuals. In the case of RMSE, the residuals are squared so that negative and positive residuals do not cancel each other out. Hence:
$$RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{df}}$$
where $y_i - \hat{y_i}$ represents the model residuals, $\sum_{i=1}^n (y_i - \hat{y_i})^2$ is the residual sum of squares or RSS, and $df$ is degrees of freedom, the number of observations minus the number of model parameters, $n - p$, which in this case is two: an intercept and a coefficient. To compute RMSE it is most convenient to define a function.
# Define RMSE function
rmse <- function(actual, fitted, n, p){
  squares <- (actual - fitted)^2
  degrees_of_freedom <- n - p
  mean_squares <- sum(squares) / degrees_of_freedom
  rmse <- sqrt(mean_squares)
  rmse
}

# Calculate RMSE
n <- nrow(sales)
p <- 2
rmse(sales$sales, fitted(model), n, p)
Note that it is fine, for practical purposes, to calculate RMSE with mean squared error rather than the RSS divided by degrees of freedom:
$$RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{n}}$$
We can code this simpler version of RMSE as follows:
# Calculate simpler RMSE
(sales$sales - fitted(model))^2 %>% mean %>% sqrt
This result is only slightly different.
RMSE represents average model error, which in this case is about 156. This number is important for contextualizing model performance, for validating how useful the model is for solving a business problem. Average weekly sales is 144 (in 100s). The RMSE for this model will be strongly influenced by that single large outlier among the residuals. Still, it should be clear that a model in which error is larger than the average outcome is not a good model!
The model coefficients are estimates of the intercept and slope for the population regression line. We don't know the equation for the population regression line---the true relationship between price and sales---which is why we need to estimate the equation using sample data.
As discussed in the last module, SE is the standard deviation of a sampling distribution. In the case of a linear model, SEs for coefficient estimates quantify estimation uncertainty---how much, on average, we should expect the $\hat\beta$ coefficients to vary under repeated sampling. (Note: we typically ignore the SE for the intercept.) The SE for slope coefficients will get smaller as $n$ goes up, and larger as the variability in the sample increases.
The SE can be used to compute approximate 95% t-intervals for model coefficients: $\hat\beta_{price} \pm 2 \times SE$. Why 2? That level of precision is probably sufficient for most applications. If we wanted to be more careful we could find the exact critical t-value for the model's degrees of freedom ($n - p$). In R code:
# Define variables
beta_hat <- -111.38
critical_value <- 2
se <- 34.38

# Calculate interval
beta_hat - critical_value * se
beta_hat + critical_value * se
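If we do want the exact critical value, qt() will return it for the model's residual degrees of freedom (a small sketch, assuming the sales data above with two estimated parameters):

# Exact critical t-value for an approximate 95% interval
n <- nrow(sales)   # number of weeks
p <- 2             # parameters estimated: intercept and slope
qt(.975, df = n - p)   # roughly 1.98, close to the rule-of-thumb value of 2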
The confint() function can also be used to compute exact confidence intervals for model parameters:
confint(model)
The results are similar.
The 95% CI basically tells us the expected range of the large majority of coefficient estimates under repeated sampling from the population. Like the SE from which it is derived, it is a measure of uncertainty that we should factor into our interpretation of model results. If the standard error for a coefficient is large, for example, it might make sense to take the point estimate for the coefficient with a grain of salt.
We can use the CI for informal statistical inference. If the 95% CI does not include 0 then we would say that the coefficient is statistically significant. Why 0? A coefficient of 0, remember, indicates no relationship between predictor and outcome, represented by a flat regression line. The situation, using hypothetical data, might look something like this:
set.seed(143)
df <- data.frame(x = seq(1, 100) + runif(100), y = runif(100))

ggplot(df, aes(x, y)) +
  geom_point(alpha = .5) +
  stat_smooth(geom = "line", method = "lm", se = F, col = "blue") +
  theme_minimal() +
  labs(title = "Regression line for y ~ x")
In this plot the flat line indicates that an increase in $x$ is associated with no change in $y$. However, even if there is no relationship between $x$ and $y$ in the population we would expect, with repeated sampling, non-zero coefficients for $x$ in some samples. The situation would look something like this, using the same hypothetical data and simulating multiple regression lines:
set.seed(143)
df <- data.frame(x = seq(1, 100) + runif(100), y = runif(100))

ggplot(df, aes(x, y)) +
  geom_point(alpha = .5) +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(df, 50), aes(x, y), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  theme_minimal() +
  labs(title = "Simulated regression lines for y ~ x")
This is simulated data. By design, there is no relationship between $x$ and $y$, so the regression lines are mostly close to 0, without usually being exactly 0. What about for the market response model? Here is a similar plot for that data with simulated regression lines.
ggplot(sales, aes(price, sales)) +
  geom_point(alpha = .5) +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  stat_smooth(data = sample_n(sales, 50), aes(price, sales), geom = "line", method = "lm", se = F, alpha = .5, col = "blue") +
  theme_minimal() +
  labs(title = "Simulated regression lines for sales ~ price")
The slopes of these regression lines can be extracted from the regression output, expressed numerically, and displayed in a density plot with lines added to indicate the lower and upper bounds of the 95% CI.
slopes <- NULL

for(i in 1:1000){
  d <- sample_frac(sales, 1, replace = TRUE)
  slopes[i] <- coef(lm(sales ~ price, d))[2]
}

cv <- quantile(slopes, probs = c(.025, .975))

ggplot(data.frame(slopes = slopes), aes(slopes)) +
  geom_density() +
  theme_minimal() +
  labs(title = "Density of simulated regression slopes for sales ~ price",
       subtitle = "Lower and upper bounds of 95% CI in red") +
  geom_vline(xintercept = cv[1], col = 2, lty = 2) +
  geom_vline(xintercept = cv[2], col = 2, lty = 2)
The message of the simulation is the same as for the regression output. This CI does not include 0, suggesting that the observed slope is likely not due to chance or random sampling variation. The slope of the regression line, we can conclude, is statistically different from 0. There is strong evidence of a relationship between sales and price in the population, such that increases in price are associated with decreases in sales.
The t-value in the model output is a t-statistic. The formula for a one-sample t-statistic is:
$$t = \frac{\bar{x} - \mu}{SE}$$
In this case, the value denoted by $\mu$ in the numerator is 0, representing the null hypothesis of no relationship between predictor and outcome. The other value in the numerator, $\bar{x}$, is equivalent to $\hat\beta$, the regression coefficient. Thus, the t-value can be obtained simply by dividing $\hat\beta$ by $\hat{SE}(\hat\beta)$. (We don't usually care about the t-value or the p-value for the intercept.)
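We can check this division against the coefficient table stored in the model summary (a quick sketch with the model fitted above):

# Full coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(model))

# The t-value for price is just estimate / SE
coef(summary(model))["price", "Estimate"] / coef(summary(model))["price", "Std. Error"]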
What does the final column in the output, Pr(>|t|), mean?
The model summary function uses a t-test to do formal inference with NHST on the slope coefficient. Above we did informal inference by constructing a 95% CI for the coefficient estimate for price. The formal NHST procedure consists in the following familiar steps:
Define the null hypothesis. We start out assuming what we want to disprove---that the coefficient estimate is 0.
Define the null distribution. This is the range of slopes we would expect to see due to sampling variability if the population coefficient was indeed 0. For regression, a t-distribution is used with $n - p$ degrees of freedom, where $p$ is the number of estimated parameters. Here $p = 2$---the intercept and the slope---so $df = 124 - 2 = 122$.
Compare the observed slope to the null distribution. If the coefficient lies in the rejection region of the null distribution, indicating that it would occur with low probability---by convention $p < .05$---then we would "reject the null" and recognize the coefficient estimate as "statistically significant."
As noted, the null distribution for the slope coefficient is a t-distribution. Why? The t-distribution has fatter tails than the normal distribution and works better (is more conservative) when $n$ is small. (At larger $n$ it is basically identical to the normal distribution.) The basic idea is that we convert the coefficient to a standardized t-value (dividing it by the SE) and then compare the t-value to the null distribution of slopes defined as $t(df = n - p)$. The situation can be visualized as follows for the market response model:
regions <- qt(c(.025, .975), 122)

probs <- data.frame(t = seq(-4, 4, by = .01))
probs$density <- dt(probs$t, df = 122)

plt <- ggplot(probs, aes(t, density)) +
  geom_line() +
  geom_vline(xintercept = -3.24, col = 2, lty = 2) +
  geom_area(aes(ifelse(t < regions[1], t, NA), density), fill = 2, alpha = .5, na.rm = T) +
  geom_area(aes(ifelse(t > regions[2], t, NA), density), fill = 2, alpha = .5, na.rm = T) +
  annotate("text", label = "<--- Observed t-value", x = -2, y = .075) +
  annotate("text", label = "<--------------- Rejection regions -->", x = 0, y = .03) +
  theme_minimal() +
  labs(title = "Null distribution for price coefficient")

suppressWarnings(print(plt))
This is the theoretical distribution of regression slopes under the null hypothesis, represented numerically in terms of the t-distribution. The critical t-values, demarcating the rejection regions colored in red, are at -1.98 and 1.98. The observed t-value is in the rejection regions, so we can reject the null.
This result can be verified by calculating an exact p-value. What is the probability under the null of observing a t-value equal to or more extreme than 3.24 or -3.24 (we include both tails for a two-tailed test)? Specifically, is that probability lower than the $p < .05$ threshold for statistical significance?
# Calculate exact t-value
(tvalue <- -111.38 / 34.38)

# Calculate p-value with df = n - p
pt(q = tvalue, df = 122) * 2
The p-value for this one-sample, two-tailed test is about .0015, matching what is reported in the final column of the model output, titled Pr(>|t|). This can be read as: the proportion of the null t-distribution with $df = n - p$ that is equal to or more extreme than $\pm$ the observed t-value. The result is $p < .05$, confirming our reasoning above that we can reject the null hypothesis. We have good evidence that the population slope, $\beta$, differs from 0.
In general, using a 95% CI to do statistical inference is preferable (in my opinion) to using a p-value simply because the CI provides more information than just a binary decision threshold: significant/non-significant. The CI also provides information about the estimate's proximity to 0 in relation to uncertainty, indicated by the width of the interval. The approximate 95% CI would be: -111.38 $\pm$ 2 * 34.38 = [-180.14, -42.62]. This interval does not include 0, from which we can conclude that there is a relationship between price and sales in the population. Moreover, the interval is fairly wide, indicating quite a bit of uncertainty about the exact relationship in the population, but the upper bound is far from 0. This is a good illustration of the additional information we can derive from a confidence interval. In this case the 95% CI indicates uncertainty about the point estimate for $\beta$ but certainty that it is not 0.
We will cover the other components of model output---R-squared, Adjusted R-Squared and the F-statistic---in the next module.
In the market response model above we have used a continuous predictor---price---to model sales. A complete introduction to simple linear regression requires talking also about using a categorical predictor to model sales. How does this work? We'll use coupon as an example. Remember that coupon is a factor variable coded no or yes depending on whether, in a given week, there was a coupon promotion.
Is a change in coupon associated with changes in sales? (Notice that, for reasons discussed above, I am avoiding causal language in posing this question. Simple linear regression cannot reveal whether A causes or impacts or affects B but only whether there is an association or correlation between A and B.)
As is our practice, let's do some exploratory analysis of this relationship.
# Summary table
sales %>%
  group_by(coupon) %>%
  summarize(avg_sales = mean(sales) %>% round(2),
            n = n()) %>%
  mutate(proportion = (n / sum(n)) %>% round(2))
Coupon promotions, while rather infrequent, were associated with much higher weekly sales. The statistical question, of course, is whether this relationship was likely to have happened by chance. Before considering that, however, let's visualize the relationship using a boxplot:
# Boxplot
ggplot(sales, aes(coupon, sales)) +
  geom_boxplot() +
  labs(title = "sales ~ coupon")
It certainly looks like coupon would be a good predictor of sales. What does this mean exactly? Going from no to yes in coupon was associated with a fairly large difference in sales. Consequently, knowing whether a coupon promotion was being conducted would allow us to make a more accurate guess about sales. But what would the linear relationship between a factor variable, coupon, and a continuous variable, sales, look like? ggplot will not plot a regression line when the x-axis variable is a factor, so the procedure for plotting in this case is to make coupon numeric by turning it into a 0/1 variable.
# Plot with regression line
sales %>%
  mutate(coupon = ifelse(coupon == "yes", 1, 0)) %>%
  ggplot(aes(coupon, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  labs(title = "sales ~ coupon") +
  theme_minimal()
This scatterplot looks strange because the points are overplotted. We can use geom_jitter() instead, which randomly offsets the points and makes the distribution of sales more visible at each level of coupon:
# Plot with jittered points and regression line
sales %>%
  mutate(coupon = ifelse(coupon == "yes", 1, 0)) %>%
  ggplot(aes(coupon, sales)) +
  geom_jitter(width = .05) +
  geom_smooth(method = "lm", se = F) +
  labs(title = "sales ~ coupon") +
  theme_minimal()
The regression line in this case connects average sales when coupon is 0 to average sales when coupon is 1. The slope of that line therefore corresponds to the change in average sales associated with a 1-unit increase in coupon. That change will be represented by the coefficient for coupon in the following regression (the syntax for which, note, is exactly the same as when using a continuous predictor):
# Fit model
(coupon_model <- lm(sales ~ coupon, data = sales)) %>% summary
Here are interpretive details:
The intercept is the average value of the outcome at the reference level of the predictor. In this case, the reference level is no. Hence, we expect average sales of 101.92 when there is no coupon promotion.
The coefficient for a categorical predictor represents the change in the outcome, on average, for every unit increase in the predictor. Here, with a two-level factor as the predictor, a one-unit increase is from the reference level, no, to the next level, yes. Thus, a coupon promotion is associated with an increase, on average, of 276.14 in sales. Note that, as in simple linear regression with a continuous predictor, the coefficient represents the change in the outcome, not its predicted value. To get predicted sales for a coupon promotion we need to add the intercept and the coefficient: 101.92 + 276.14 = 378.06. Why? Think of no as 0 and yes as 1 in the model equation. The equation for a coupon promotion would be: sales = 101.92 + 276.14 x 1. Without a coupon promotion the equation would be: sales = 101.92 + 276.14 x 0.
The predict() function can be used to obtain the same quantities:
# With a coupon promotion. Note: the input to newdata must be a dataframe.
predict(coupon_model, newdata = data.frame(coupon = "yes"))

# Without a coupon promotion
predict(coupon_model, newdata = data.frame(coupon = "no"))
To demystify regression a bit, consider that we can reproduce these numbers exactly using simple dplyr code to calculate conditional means. What is a conditional mean? Rather than calculating the overall mean of sales, we instead calculate its mean conditional on the level of another variable, in this case coupon.
# Calculate conditional means
sales %>%
  group_by(coupon) %>%
  summarize(avg_sales = mean(sales)) %>%
  data.frame
Notice that the first line of this table is identical to the model intercept. The second line is identical to the intercept plus the coefficient.
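A small sketch makes that correspondence explicit (using the coupon_model object fitted above): the intercept is the no-coupon mean, and adding the coefficient gives the coupon mean.

# The stored coefficients
coef(coupon_model)

# Intercept, then intercept + coefficient: the two conditional means
cumsum(coef(coupon_model))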
The standard error for the coefficient represents estimation uncertainty---how much, on average, we should expect the $\hat\beta$ coefficient to vary under repeated sampling. We could construct an approximate 95% CI for the coupon coefficient using the familiar formula: 276.14 $\pm$ 2 x 32.14.
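As with the price model, confint() returns the exact interval (a quick check on the coupon_model object fitted above):

# Exact 95% CI for the coupon coefficient
confint(coupon_model)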
The t value and p-value have exactly the same interpretation as for a continuous predictor. In this case, the t-value is in the rejection region of the null distribution and the p-value is, correspondingly, statistically significant.
Here are the key points to emphasize:
A linear model is, among other things, a sophisticated machine for calculating conditional means. In the case of a simple linear regression with a categorical predictor, the results can be exactly reproduced by calculating conditional means directly, for example with dplyr
.
Statistically, a simple linear regression with a categorical predictor is equivalent to a t-test, since the coefficient represents the average difference between groups. It is, like a t-test, a comparison of means. In this case, the two groups are defined by coupon promotion. The null hypothesis for the regression, as for the t-test, is: no difference between the groups, coupon vs. no coupon, in sales. The statistically significant coefficient in this case indicates that the observed coupon-related difference in sales is unlikely to have happened by chance. R's default t-test (the Welch test, which does not assume equal variances) will not produce exactly the same t-value and p-value as the regression---the math is slightly different---but the two procedures should lead to the same statistical conclusion.
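To see the correspondence in practice, here is a quick sketch comparing t.test() on the same grouping with the regression output (the var.equal = TRUE version matches the regression's pooled-variance assumption; the sign of the t-statistic may flip depending on which group mean is subtracted from which):

# Welch t-test (R's default): similar, but not identical, to the regression
t.test(sales ~ coupon, data = sales)

# Pooled-variance t-test: t-value (up to sign) and p-value match the coupon coefficient's
t.test(sales ~ coupon, data = sales, var.equal = TRUE)

# Compare with the regression's coefficient table
coef(summary(coupon_model))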