knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The tidycat
package includes the tidy_categorical()
function to expand broom::tidy()
outputs for categorical parameter estimates.
You can install the released version of tidycat from CRAN with:
install.packages("tidycat")
And the development version from GitHub with:
# install.packages("devtools") devtools::install_github("guyabel/tidycat")
The tidy()
function in the broom package takes the messy output of built-in functions in R, such as lm()
, and turns them into tidy data frames.
library(dplyr) library(broom) m0 <- esoph %>% mutate_if(is.factor, ~factor(., ordered = FALSE)) %>% glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp, data = ., family = binomial()) # tidy tidy(m0)
Note: Currently ordered factor not supported in tidycat
, hence their removal in mutate_if()
above
The tidy_categorical()
function adds further columns (variable
, level
and effect
) to the broom::tidy()
output to help manage categorical variables
library(tidycat) m0 %>% tidy() %>% tidy_categorical(m = m0, include_reference = FALSE)
Include additional rows for reference category terms and a column to indicate their location by setting include_reference = TRUE
(default). Setting exponentiate = TRUE
ensures the parameter estimates in the reference group are set to one instead of zero (even odds in the logistic regression example below).
m0 %>% tidy(exponentiate = TRUE) %>% tidy_categorical(m = m0, exponentiate = TRUE, reference_label = "Baseline") %>% select(-statistic, -p.value)
The results from broom::tidy()
can be used to quickly plot estimated coefficients and their confidence intervals.
# store parameter estimates and confidence intervals (except for the intercept) d0 <- m0 %>% tidy(conf.int = TRUE) %>% slice(-1) d0 library(ggplot2) library(tidyr) ggplot(data = d0, mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange()
The additional columns from tidy_categorical()
can be used to group together terms from the same categorical variable by setting colour = variable
d0 <- m0 %>% tidy(conf.int = TRUE) %>% tidy_categorical(m = m0, include_reference = FALSE) %>% slice(-1) d0 %>% select(-(3:5)) ggplot(data = d0, mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, colour = variable)) + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange()
The additional rows from tidy_categorical()
can be used to include the reference categories in a coefficient plot, allowing the reader to better grasp the meaning of the parameter estimates in each categorical variable. Using ggforce::facet_col()
the terms of each variable can be separated to further improve the presentation of the coefficient plot.
d0 <- m0 %>% tidy(conf.int = TRUE) %>% tidy_categorical(m = m0) %>% slice(-1) d0 %>% select(-(3:5)) library(ggforce) ggplot(data = d0, mapping = aes(x = level, y = estimate, colour = reference, ymin = conf.low, ymax = conf.high)) + facet_col(facets = vars(variable), scales = "free_y", space = "free") + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange()
Note the switch of the x
aesthetic to the level
column rather than term
.
Alternatively, horizontal plots can be obtained using ggforce::facet_row()
and loosing coord_flip()
;
ggplot(data = d0, mapping = aes(x = level, y = estimate, ymin = conf.low, ymax = conf.high, colour = reference)) + facet_row(facets = vars(variable), scales = "free_x", space = "free") + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Models with interactions can also be handled in tidy_categorical()
. Using the mtcars
data we can create three types of interactions (between two numeric variables, between a numeric variable and categorical variable and between two categorical variables)
m1 <- mtcars %>% mutate(engine = recode_factor(vs, `0` = "straight", `1` = "V-shaped"), transmission = recode_factor(am, `0` = "automatic", `1` = "manual")) %>% lm(mpg ~ as.factor(cyl) + wt * hp + wt * transmission + engine * transmission , data = .) tidy(m1)
Setting n_level = TRUE
creates an additional column to monitor the number of observations in each of level of the categorical variables, including interaction terms in the model:
d1 <- m1 %>% tidy(conf.int = TRUE) %>% tidy_categorical(m = m1, n_level = TRUE) %>% slice(-1) d1 %>% select(-(2:7))
We can use similar plotting code as above to plot the interactions:
ggplot(data = d1, mapping = aes(x = level, y = estimate, colour = reference, ymin = conf.low, ymax = conf.high)) + facet_col(facets = "variable", scales = "free_y", space = "free") + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange()
The empty levels can be dropped by filtering on the n_level
column for categories with more than zero observations and not NA
in term column.
d1 %>% dplyr::filter(n_level > 0 | !is.na(term)) %>% ggplot(mapping = aes(x = level, y = estimate, colour = reference, ymin = conf.low, ymax = conf.high)) + facet_col(facets = "variable", scales = "free_y", space = "free") + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange()
If you have any trouble or suggestions please let me know by creating an issue on the tidycat Github
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.