In gcfrench/store: Store commonly used functions and lookups

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This example using linear regression to explore the relationship between the body mass and flipper length of palmer penguins, taken from the palmerpenguins package, was created whilst reading

Chapter 3 of An Introduction to Statistical Learning with Applications in R, second edition, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
Statistical Inference via Data Science. A ModernDive into R and the Tidyverse, by Chester Ismay and Albert Kim

suppressPackageStartupMessages({
  suppressWarnings({
    library(palmerpenguins)
    library(dplyr)
    library(ggplot2)
    library(broom)
    library(knitr)
    library(gt)
  })
})

# gt global table style
gt_table <- function(.data) {
  .data %>% 
    gt() %>% 
    tab_options(table.align = "left",
                column_labels.background.color = "grey90",
                table_body.hlines.color = "white")
}

# penguins mass flipper data
penguins_mass_flipper <- penguins %>%
  select(species, flipper_length_mm, body_mass_g)

Linear regression calculates a least squares line with the smallest sum of squared residuals and is run using the lm function in the stats package.

penguins_regression_model <- lm(formula = body_mass_g ~ flipper_length_mm, 
                                data = penguins_mass_flipper)

The regression line along with standard error 95% confidence intervals can be plotted using the geom_smooth function in the ggplot2 package.

# penguins mass flipper plot
penguins_mass_flipper_plot <- ggplot(data = penguins_mass_flipper,
                                     aes(x = flipper_length_mm,
                                         y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "darkred") +
  theme_minimal() +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for 3 species of penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)") +
  theme(plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")
penguins_mass_flipper_plot

Outcome

The outcome of the linear regression model is presented as a more tidy data frame using the tidy and glance functions in the broom package.

estimate gives the intercept (first row) and slope (second row) values for the regression line.

penguins_regression_model %>% 
  tidy() %>% 
  select(term, estimate) %>% 
  gt_table()

Questions

Is there a relationship between the predictor, X and response, Y variables that can be explained by a minimum acceptable probability of chance?
statistic t-statistic giving the number of standard deviations the slope value is from zero. The t-distribution is similar to a standard normal distribution when the number of points is approximately more than 30.
p.value probability that there is a relationship between the predictor, X and response, Y variables where slope does not equal zero, from chance alone. With a small p-value than we reject the null hypothesis that there is no relationship between X and Y, inferring that there is a relationship between the predictor X and response Y variables.

penguins_regression_model %>% 
  tidy() %>% 
  select(term, estimate, statistic, p.value) %>% 
  gt_table()

How well does the model fit this relationship between the predictor, X and response Y variables?
Residual standard error gives a measure of lack of fit of the regression model to the underlying data, estimating the average amount the response values deviates from the underlying population regression line.
R^2^ statistic converts the residual standard error to a proportion between 0 and 1, with 1 indicating good fit with a large proportion of the variability in the response Y variable explained by the regression model and 0 indicating a poor fit with a low proportion of the variability in the response Y variable explained by the regression model.

penguins_regression_model %>% 
  glance() %>% 
  select(r.squared, adj.r.squared) %>% 
  gt_table()

Within 95% confidence where is the population regression line?
std.error indicates how close the sample regression line is to the underlying population regression line, with the 95% confidence interval approximately equal to ± 2 x Standard Error.
conf.low and conf.high gives the more accurate upper and lower values for the 95% confidence interval.

penguins_regression_model %>% 
  tidy(conf.int = TRUE) %>% 
  select(term, estimate, std.error, conf.low, conf.high) %>% 
  gt_table()

Assumptions

The relationship between the variables is approximately linear.
Different observations must be independent of one another.
The residuals should be normally distributed, centred around zero so that there is an equal number of positive and negative errors above and below the slope, with on average the error equalling zero.
The residuals should show equal variance across all values of the predictor variable, so that the value and spread of the residuals don't depend on the value of the predictor variable on the x axis.