The tidycat package: expand broom::tidy() output for categorical parameter estimates"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The tidycat package includes the tidy_categorical() function to expand broom::tidy() outputs for categorical parameter estimates.

Installation

You can install the released version of tidycat from CRAN with:

install.packages("tidycat")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("guyabel/tidycat")

Additional columns for categorical parameter estimates

The tidy() function in the broom package takes the messy output of built-in functions in R, such as lm(), and turns them into tidy data frames.

library(dplyr)
library(broom)
m0 <- esoph %>%
   mutate_if(is.factor, ~factor(., ordered = FALSE)) %>%
   glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp, data = ., family = binomial())
# tidy
tidy(m0)

Note: Currently ordered factor not supported in tidycat, hence their removal in mutate_if() above

The tidy_categorical() function adds further columns (variable, level and effect) to the broom::tidy() output to help manage categorical variables

library(tidycat)
m0 %>%
  tidy() %>%
  tidy_categorical(m = m0, include_reference =  FALSE)

Additional rows for reference categories

Include additional rows for reference category terms and a column to indicate their location by setting include_reference = TRUE (default). Setting exponentiate = TRUE ensures the parameter estimates in the reference group are set to one instead of zero (even odds in the logistic regression example below).

m0 %>%
  tidy(exponentiate = TRUE) %>%
  tidy_categorical(m = m0, exponentiate = TRUE, reference_label = "Baseline") %>%
  select(-statistic, -p.value)

Standard coefficient plots

The results from broom::tidy() can be used to quickly plot estimated coefficients and their confidence intervals.

# store parameter estimates and confidence intervals (except for the intercept)
d0 <- m0 %>%
  tidy(conf.int = TRUE) %>%
  slice(-1)
d0

library(ggplot2)
library(tidyr)
ggplot(data = d0,
        mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

Enhanced coefficient plots

The additional columns from tidy_categorical() can be used to group together terms from the same categorical variable by setting colour = variable

d0 <- m0 %>%
  tidy(conf.int = TRUE) %>%
  tidy_categorical(m = m0, include_reference = FALSE) %>%
  slice(-1)

d0 %>%
  select(-(3:5))

ggplot(data = d0,
        mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
                      colour = variable)) +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

The additional rows from tidy_categorical() can be used to include the reference categories in a coefficient plot, allowing the reader to better grasp the meaning of the parameter estimates in each categorical variable. Using ggforce::facet_col() the terms of each variable can be separated to further improve the presentation of the coefficient plot.

d0 <- m0 %>%
  tidy(conf.int = TRUE) %>%
  tidy_categorical(m = m0) %>%
  slice(-1)

d0 %>%
  select(-(3:5))

library(ggforce)
ggplot(data = d0,
        mapping = aes(x = level, y = estimate, colour = reference,
                      ymin = conf.low, ymax = conf.high)) +
   facet_col(facets = vars(variable), scales = "free_y", space = "free") +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

Note the switch of the x aesthetic to the level column rather than term.

Alternatively, horizontal plots can be obtained using ggforce::facet_row() and loosing coord_flip();

ggplot(data = d0,
      mapping = aes(x = level, y = estimate,
                    ymin = conf.low, ymax = conf.high,
                    colour = reference)) +
 facet_row(facets = vars(variable), scales = "free_x", space = "free") +
 geom_hline(yintercept = 0, linetype = "dashed") +
 geom_pointrange() +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interactions

Models with interactions can also be handled in tidy_categorical(). Using the mtcars data we can create three types of interactions (between two numeric variables, between a numeric variable and categorical variable and between two categorical variables)

m1 <- mtcars %>%
  mutate(engine = recode_factor(vs, `0` = "straight", `1` = "V-shaped"),
         transmission = recode_factor(am, `0` = "automatic", `1` = "manual")) %>%
  lm(mpg ~ as.factor(cyl) + wt * hp + wt * transmission + engine * transmission , data = .)

tidy(m1)

Setting n_level = TRUE creates an additional column to monitor the number of observations in each of level of the categorical variables, including interaction terms in the model:

d1 <- m1 %>%
  tidy(conf.int = TRUE) %>%
  tidy_categorical(m = m1, n_level = TRUE) %>%
  slice(-1)

d1 %>%
  select(-(2:7))

We can use similar plotting code as above to plot the interactions:

ggplot(data = d1,
        mapping = aes(x = level, y = estimate, colour = reference,
                      ymin = conf.low, ymax = conf.high)) +
   facet_col(facets = "variable", scales = "free_y", space = "free") +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

The empty levels can be dropped by filtering on the n_level column for categories with more than zero observations and not NA in term column.

d1 %>%
  dplyr::filter(n_level > 0 | !is.na(term)) %>%
  ggplot(mapping = aes(x = level, y = estimate, colour = reference,
                       ymin = conf.low, ymax = conf.high)) +
  facet_col(facets = "variable", scales = "free_y", space = "free") +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange()

Issues

If you have any trouble or suggestions please let me know by creating an issue on the tidycat Github



Try the tidycat package in your browser

Any scripts or data that you put into this service are public.

tidycat documentation built on Aug. 2, 2021, 9:07 a.m.