knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This example, which uses linear regression to explore the relationship between the body mass and flipper length of Palmer penguins (data taken from the palmerpenguins package), was created whilst reading

suppressPackageStartupMessages({
  suppressWarnings({
    library(palmerpenguins)
    library(dplyr)
    library(ggplot2)
    library(broom)
    library(knitr)
    library(gt)
  })
})

# gt global table style
gt_table <- function(.data) {
  .data %>% 
    gt() %>% 
    tab_options(table.align = "left",
                column_labels.background.color = "grey90",
                table_body.hlines.color = "white")
}

# penguins mass flipper data
penguins_mass_flipper <- penguins %>%
  select(species, flipper_length_mm, body_mass_g)

Linear regression fits a least squares line: the straight line that minimises the sum of squared residuals. It is run using the lm function in the stats package.

penguins_regression_model <- lm(formula = body_mass_g ~ flipper_length_mm, 
                                data = penguins_mass_flipper)
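
To make the least squares idea concrete, the slope and intercept can also be computed from the closed-form formulas for simple linear regression and compared with lm. This is a minimal sketch, not part of the original analysis: the names `d`, `x`, `y`, `slope`, `intercept` and `fit` are introduced here for illustration, and rows with missing values are dropped explicitly on the assumption that lm is using its default na.action.

```r
library(palmerpenguins)

# Keep only complete rows, mirroring what lm() does under the default na.action
d <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g")])
x <- d$flipper_length_mm
y <- d$body_mass_g

# Closed-form least squares estimates for a single predictor
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# The same model fitted with lm() for comparison
fit <- lm(body_mass_g ~ flipper_length_mm, data = d)

# TRUE when the hand-computed values match lm() up to numerical tolerance
all.equal(unname(coef(fit)), c(intercept, slope))
```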

The regression line, along with its 95% confidence band, can be plotted using the geom_smooth function in the ggplot2 package.

# penguins mass flipper plot
penguins_mass_flipper_plot <- ggplot(data = penguins_mass_flipper,
                                     aes(x = flipper_length_mm,
                                         y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "darkred") +
  theme_minimal() +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for 3 species of penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)") +
  theme(plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot")
penguins_mass_flipper_plot
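
The shaded band in the plot can be examined numerically with predict. The sketch below refits the same model so that it is self-contained, and the flipper lengths in `new_flippers` are hypothetical values chosen only for illustration. As far as I understand, asking predict for a confidence interval corresponds to the band geom_smooth draws for method = "lm", but treat that as an assumption rather than a statement about ggplot2 internals.

```r
library(palmerpenguins)

# Same model as in the text, refitted here so the sketch stands alone
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Hypothetical flipper lengths (mm), chosen only for illustration
new_flippers <- data.frame(flipper_length_mm = c(180, 200, 220))

# Fitted body mass with lower and upper 95% confidence limits for the mean response
band <- predict(fit, newdata = new_flippers,
                interval = "confidence", level = 0.95)
band
```

Each row gives the fitted value (fit) with its lower (lwr) and upper (upr) limits at that flipper length.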

Outcome

The outcome of the linear regression model is presented as a tidier data frame using the tidy and glance functions in the broom package.

penguins_regression_model %>% 
  tidy() %>% 
  select(term, estimate) %>% 
  gt_table()

Questions

  1. Is there a relationship between the predictor X and the response Y, beyond what could be expected from chance alone at an acceptable significance level?

  2. statistic is the t-statistic, giving the number of standard errors the slope estimate is from zero. The t-distribution is close to a standard normal distribution when the number of observations is more than approximately 30.

  3. p.value is the probability of observing a t-statistic at least this extreme from chance alone if the slope were really zero, that is, if there were no relationship between X and Y. With a small p-value we reject the null hypothesis of no relationship, inferring that there is a relationship between the predictor X and the response Y.

penguins_regression_model %>% 
  tidy() %>% 
  select(term, estimate, statistic, p.value) %>% 
  gt_table()
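
The statistic and p.value columns can be reproduced from the estimate and std.error columns, which may help in reading the table above. This is a sketch under the assumption of a two-sided test against a slope of zero; `coefs`, `t_manual` and `p_manual` are illustrative names, and the model is refitted so the block is self-contained.

```r
library(palmerpenguins)
library(broom)

fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
coefs <- tidy(fit)

# t-statistic: the estimate divided by its standard error
t_manual <- coefs$estimate / coefs$std.error

# Two-sided p-value from the t-distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

# Both comparisons return TRUE when the manual values match broom's output
all.equal(t_manual, coefs$statistic)
all.equal(p_manual, coefs$p.value)
```
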

  1. How well does the model fit the relationship between the predictor X and the response Y?

  2. Residual standard error (the sigma column returned by glance) measures the lack of fit of the regression model to the underlying data, estimating the average amount by which the response values deviate from the underlying population regression line.

  3. R^2^ statistic expresses the fit as a proportion between 0 and 1: the share of the variability in the response Y that is explained by the regression model. A value near 1 indicates a good fit, with a large proportion of the variability explained, and a value near 0 indicates a poor fit, with a low proportion of the variability explained.

penguins_regression_model %>% 
  glance() %>% 
  select(r.squared, adj.r.squared) %>% 
  gt_table()
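
Both measures of fit can be derived from the residual and total sums of squares. The sketch below is illustrative only: `rss`, `tss`, `rse_manual`, `r2_manual` and `fit_stats` are names introduced here, and the comparison relies on glance reporting the residual standard error in its sigma column.

```r
library(palmerpenguins)
library(broom)

fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Responses actually used in the fit (rows with missing values are dropped)
y <- model.response(model.frame(fit))

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares

rse_manual <- sqrt(rss / df.residual(fit))  # residual standard error
r2_manual <- 1 - rss / tss                  # proportion of variability explained

# Both comparisons return TRUE when the manual values match glance()
fit_stats <- glance(fit)
all.equal(rse_manual, fit_stats$sigma)
all.equal(r2_manual, fit_stats$r.squared)
```
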

  1. Within 95% confidence, where is the population regression line?

  2. std.error indicates how close the sample regression estimates are likely to be to the underlying population values, with the 95% confidence interval approximately equal to estimate ± 2 × std.error.

  3. conf.low and conf.high give the more accurate lower and upper values for the 95% confidence interval.

penguins_regression_model %>% 
  tidy(conf.int = TRUE) %>% 
  select(term, estimate, std.error, conf.low, conf.high) %>% 
  gt_table()
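
The difference between the ± 2 × std.error rule of thumb and the reported interval comes from the exact multiplier, which is a quantile of the t-distribution. A minimal sketch, with `t_crit`, `lower_manual` and `upper_manual` as illustrative names and the model refitted so the block is self-contained:

```r
library(palmerpenguins)
library(broom)

fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
coefs <- tidy(fit, conf.int = TRUE)

# Exact multiplier: 97.5% quantile of the t-distribution (close to 2 for large samples)
t_crit <- qt(0.975, df = df.residual(fit))

lower_manual <- coefs$estimate - t_crit * coefs$std.error
upper_manual <- coefs$estimate + t_crit * coefs$std.error

# Both comparisons return TRUE when the manual limits match broom's output
all.equal(lower_manual, coefs$conf.low)
all.equal(upper_manual, coefs$conf.high)
```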

Assumptions



gcfrench/store documentation built on May 17, 2024, 5:52 p.m.