The tidycat package: expand broom::tidy() output for categorical parameter estimates"
In tidycat: Expand Tidy Output for Categorical Parameter Estimates

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The tidycat package includes the tidy_categorical() function to expand broom::tidy() outputs for categorical parameter estimates.

Installation

You can install the released version of tidycat from CRAN with:

install.packages("tidycat")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("guyabel/tidycat")

Additional columns for categorical parameter estimates

The tidy() function in the broom package takes the messy output of built-in functions in R, such as lm(), and turns them into tidy data frames.

library(dplyr)
library(broom)
m0 <- esoph %>%
   mutate_if(is.factor, ~factor(., ordered = FALSE)) %>%
   glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp, data = ., family = binomial())
# tidy
tidy(m0)

Note: Currently ordered factor not supported in tidycat, hence their removal in mutate_if() above

The tidy_categorical() function adds further columns (variable, level and effect) to the broom::tidy() output to help manage categorical variables

library(tidycat)
m0 %>%
  tidy() %>%
  tidy_categorical(m = m0, include_reference =  FALSE)

Additional rows for reference categories

Include additional rows for reference category terms and a column to indicate their location by setting include_reference = TRUE (default). Setting exponentiate = TRUE ensures the parameter estimates in the reference group are set to one instead of zero (even odds in the logistic regression example below).

m0 %>%
  tidy(exponentiate = TRUE) %>%
  tidy_categorical(m = m0, exponentiate = TRUE, reference_label = "Baseline") %>%
  select(-statistic, -p.value)

Standard coefficient plots

The results from broom::tidy() can be used to quickly plot estimated coefficients and their confidence intervals.

# store parameter estimates and confidence intervals (except for the intercept)
d0 <- m0 %>%
  tidy(conf.int = TRUE) %>%
  slice(-1)
d0

library(ggplot2)
library(tidyr)
ggplot(data = d0,
        mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

Enhanced coefficient plots

The additional columns from tidy_categorical() can be used to group together terms from the same categorical variable by setting colour = variable

d0 <- m0 %>%
  tidy(conf.int = TRUE) %>%
  tidy_categorical(m = m0, include_reference = FALSE) %>%
  slice(-1)

d0 %>%
  select(-(3:5))

ggplot(data = d0,
        mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
                      colour = variable)) +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

The additional rows from tidy_categorical() can be used to include the reference categories in a coefficient plot, allowing the reader to better grasp the meaning of the parameter estimates in each categorical variable. Using ggforce::facet_col() the terms of each variable can be separated to further improve the presentation of the coefficient plot.

d0 <- m0 %>%
  tidy(conf.int = TRUE) %>%
  tidy_categorical(m = m0) %>%
  slice(-1)

d0 %>%
  select(-(3:5))

library(ggforce)
ggplot(data = d0,
        mapping = aes(x = level, y = estimate, colour = reference,
                      ymin = conf.low, ymax = conf.high)) +
   facet_col(facets = vars(variable), scales = "free_y", space = "free") +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

Note the switch of the x aesthetic to the level column rather than term.

Alternatively, horizontal plots can be obtained using ggforce::facet_row() and loosing coord_flip();

ggplot(data = d0,
      mapping = aes(x = level, y = estimate,
                    ymin = conf.low, ymax = conf.high,
                    colour = reference)) +
 facet_row(facets = vars(variable), scales = "free_x", space = "free") +
 geom_hline(yintercept = 0, linetype = "dashed") +
 geom_pointrange() +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interactions

Models with interactions can also be handled in tidy_categorical(). Using the mtcars data we can create three types of interactions (between two numeric variables, between a numeric variable and categorical variable and between two categorical variables)

m1 <- mtcars %>%
  mutate(engine = recode_factor(vs, `0` = "straight", `1` = "V-shaped"),
         transmission = recode_factor(am, `0` = "automatic", `1` = "manual")) %>%
  lm(mpg ~ as.factor(cyl) + wt * hp + wt * transmission + engine * transmission , data = .)

tidy(m1)

Setting n_level = TRUE creates an additional column to monitor the number of observations in each of level of the categorical variables, including interaction terms in the model:

d1 <- m1 %>%
  tidy(conf.int = TRUE) %>%
  tidy_categorical(m = m1, n_level = TRUE) %>%
  slice(-1)

d1 %>%
  select(-(2:7))

We can use similar plotting code as above to plot the interactions:

ggplot(data = d1,
        mapping = aes(x = level, y = estimate, colour = reference,
                      ymin = conf.low, ymax = conf.high)) +
   facet_col(facets = "variable", scales = "free_y", space = "free") +
   coord_flip() +
   geom_hline(yintercept = 0, linetype = "dashed") +
   geom_pointrange()

The empty levels can be dropped by filtering on the n_level column for categories with more than zero observations and not NA in term column.

d1 %>%
  dplyr::filter(n_level > 0 | !is.na(term)) %>%
  ggplot(mapping = aes(x = level, y = estimate, colour = reference,
                       ymin = conf.low, ymax = conf.high)) +
  facet_col(facets = "variable", scales = "free_y", space = "free") +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange()