The aim of this vignette is to illustrate the use of pubh
functions for common regression analysis in Public Health. In particular, the vignette shows the use of the following functions from pubh
:
cosm_reg
for reporting tables of coefficients.cross_tbl
for reporting tables of descriptive statistics by exposure of interest.multiple
for adjusting confidence intervals and p-values for post-hoc analysis.get_r2
for accessing $R^2$ or pseudo-$R^2$ from regression models.glm_coef
for some special cases of regression models.The advantages and limitations of glm_coef
are:
gee
, glm
and survreg
.We start by loading relevant packages and setting the theme for the plots (as suggested in the Template of this package):
rm(list = ls()) library(tidyverse) library(rstatix) library(parameters) library(performance) library(jtools) library(pubh) library(sjlabelled) library(sjPlot) theme_set(sjPlot::theme_sjplot2(base_size = 10)) theme_update(legend.position = "top") options('huxtable.knit_print_df' = FALSE) options('huxtable.autoformat_number_format' = list(numeric = "%5.2f")) knitr::opts_chunk$set(comment = NA)
For continuous outcomes there is no need of exponentiating the results unless the outcome was fitted in the log-scale. In our first example we want to estimate the effect of smoking and race on the birth weight of babies.
We can generate factors and assign labels in the same pipe
stream:
data(birthwt, package = "MASS") birthwt <- birthwt %>% mutate( smoke = factor(smoke, labels = c("Non-smoker", "Smoker")), race = factor(race, labels = c("White", "African American", "Other")) ) %>% var_labels( bwt = 'Birth weight (g)', smoke = 'Smoking status', race = 'Race' )
Is good to start with some basic descriptive statistics, so we can compare the birth weight between groups.
Graphical analysis:
birthwt %>% box_plot(bwt ~ smoke, fill = ~ race)
Another way to compare the means between the groups is with gen_bst_df
which estimates means with corresponding bootstrapped CIs.
birthwt %>% gen_bst_df(bwt ~ race|smoke) %>% as_hux() %>% theme_pubh()
We fit a linear model.
model_norm <- lm(bwt ~ smoke + race, data = birthwt)
Note: Model diagnostics are not be discussed in this vignette.
Traditional output from the model:
model_norm %>% Anova()
model_norm %>% parameters()
model_norm %>% performance()
Table of coefficients for publication:
model_norm %>% tbl_regression() %>% cosm_reg() %>% theme_pubh() %>% add_footnote(get_r2(model_norm), font_size = 9)
model_norm %>% glm_coef(labels = model_labels(model_norm)) %>% as_hux() %>% set_align(everywhere, 2:3, "right") %>% theme_pubh() %>% add_footnote(get_r2(model_norm), font_size = 9)
Function glm_coef
allows the use of robust standard errors.
model_norm %>% glm_coef(se_rob = TRUE, labels = model_labels(model_norm)) %>% as_hux() %>% set_align(everywhere, 2:3, "right") %>% theme_pubh() %>% add_footnote(paste( get_r2(model_norm), "\n", "CIs and p-values estimated with robust standard errors."), font_size = 9)
To construct the effect plot, we can use plot_model
from the sjPlot
package. The advantage of plot_model
is that recognises labelled data and uses that information for annotating the plots.
model_norm %>% plot_model("pred", terms = ~race|smoke, dot.size = 1.5, title = "")
When the explanatory variables are categorical, another option is emmip
from the emmeans
package. We can include CIs in emmip
but as estimates are connected, the resulting plots look more messy, so I recommend emmip
to look at the trace.
emmip(model_norm, smoke ~ race) %>% gf_labs(y = get_label(birthwt$bwt), x = "", col = "Smoking status")
For logistic regression we are interested in the odds ratios. We will look at the effect of amount of fibre intake on the development of coronary heart disease.
data(diet, package = "Epi") diet <- diet %>% mutate( chd = factor(chd, labels = c("No CHD", "CHD")) ) %>% var_labels( chd = "Coronary Heart Disease", fibre = "Fibre intake (10 g/day)" )
We start with descriptive statistics:
diet %>% estat(~ fibre|chd) %>% as_hux() %>% theme_pubh()
diet %>% na.omit() %>% copy_labels(diet) %>% box_plot(fibre ~ chd)
We fit a linear logistic model:
model_binom <- glm(chd ~ fibre, data = diet, family = binomial)
Summary:
model_binom %>% parameters(exponentiate = TRUE)
model_binom %>% performance()
Table of coefficients for publication:
model_binom %>% tbl_regression(exponentiate = TRUE) %>% cosm_reg() %>% theme_pubh() %>% add_footnote(get_r2(model_binom), font_size = 9)
Effect plot:
model_binom %>% plot_model("pred", terms = "fibre [all]", title = "")
We will look at a matched case-control study on the effect of oestrogen use and history of gall bladder disease on the development of endometrial cancer.
data(bdendo, package = "Epi") bdendo <- bdendo %>% mutate( cancer = factor(d, labels = c('Control', 'Case')), gall = factor(gall, labels = c("No GBD", "GBD")), est = factor(est, labels = c("No oestrogen", "Oestrogen")) ) %>% var_labels( cancer = 'Endometrial cancer', gall = 'Gall bladder disease', est = 'Oestrogen' )
We start with descriptive statistics:
bdendo %>% mutate( cancer = relevel(cancer, ref = "Case"), gall = relevel(gall, ref = "GBD"), est = relevel(est, ref = "Oestrogen") ) %>% copy_labels(bdendo) %>% select(cancer, gall, est) %>% tbl_strata( strata = est, .tbl_fun = ~ .x %>% tbl_summary(by = gall) ) %>% cosm_sum(bold = TRUE, head_label = " ") %>% theme_pubh(2) %>% set_align(1, everywhere, "center")
bdendo %>% gf_percents(~ cancer|gall, fill = ~ est, position = "dodge", alpha = 0.6) %>% gf_labs( y = "Percent", x = "", fill = "" )
We fit the conditional logistic model:
require(survival, quietly = TRUE) model_clogit <- clogit(cancer == 'Case' ~ est * gall + strata(set), data = bdendo)
model_clogit %>% glm_coef(labels = c("Oestrogen/No oestrogen", "GBD/No GBD", "Oestrogen:GBD Interaction")) %>% as_hux() %>% set_align(everywhere, 2:3, "right") %>% theme_pubh() %>% add_footnote(get_r2(model_clogit), font_size = 9)
We use data about house satisfaction.
#require(MASS, quietly = TRUE) data(housing, package = "MASS") housing <- housing %>% var_labels( Sat = "Satisfaction", Infl = "Perceived influence", Type = "Type of rental", Cont = "Contact" )
We fit the ordinal logistic model:
model_ord <- MASS::polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)
model_ord %>% tbl_regression(exponentiate = TRUE) %>% cosm_reg() %>% theme_pubh() %>% add_footnote(get_r2(model_ord), font_size = 9)
Effect plots:
model_ord %>% plot_model(type = "pred", terms = c("Infl", "Cont"), dot.size = 1, title = "") %>% gf_theme(axis.text.x = element_text(angle = 45, hjust = 1))
model_ord %>% plot_model(type = "pred", terms = c("Infl", "Type"), dot.size = 1, title = "") %>% gf_theme(axis.text.x = element_text(angle = 45, hjust = 1))
emmip(model_ord, Infl ~ Type |Cont) %>% gf_labs(x = "Type of rental", col = "Perceived influence")
For Poisson regression we are interested in incidence rate ratios. We will look at the effect of sex, ethnicity and age group on number of absent days from school in a year.
data(quine, package = "MASS") levels(quine$Eth) <- c("Aboriginal", "White") levels(quine$Sex) <- c("Female", "Male")
quine <- quine %>% var_labels( Days = "Number of absent days", Eth = "Ethnicity", Age = "Age group" )
Descriptive statistics:
quine %>% cross_tbl(by = "Eth") %>% theme_pubh(2) %>% add_footnote("n (%); Median (IQR)", font_size = 9)
quine %>% box_plot(Days ~ Age|Sex, fill = ~ Eth)
We start by fitting a standard Poisson linear regression model:
model_pois <- glm(Days ~ Eth + Sex + Age, family = poisson, data = quine)
model_pois %>% tbl_regression(exponentiate = TRUE) %>% cosm_reg() %>% theme_pubh() %>% add_footnote(get_r2(model_pois), font_size = 9)
model_pois %>% performance()
The assumption is that the mean is equal than the variance. If that is the case, deviance should be close to the degrees of freedom of the residuals. We can check for overdispersion:
model_pois %>% check_overdispersion()
Thus, we have over-dispersion. One option is to use a negative binomial distribution.
model_negbin <- MASS::glm.nb(Days ~ Eth + Sex + Age, data = quine)
model_negbin %>% tbl_regression(exponentiate = TRUE) %>% cosm_reg() %>% theme_pubh() %>% add_footnote(get_r2(model_negbin), font_size = 9)
model_negbin %>% performance()
Effect plot:
model_negbin %>% plot_model(type = "pred", terms = c("Age", "Eth"), dot.size = 1.5, title = "")
emmip(model_negbin, Eth ~ Age|Sex) %>% gf_labs(y = "Number of absent days", x = "Age group", col = "Ethnicity")
We adjust for multiple comparisons:
multiple(model_negbin, ~ Age|Eth)$df
We can see the comparison graphically with:
multiple(model_negbin, ~ Age|Eth)$fig_ci %>% gf_labs(x = "IRR")
We will use an example on the effect of thiotepa versus placebo on the development of bladder cancer.
data(bladder) bladder <- bladder %>% mutate(times = stop, rx = factor(rx, labels=c("Placebo", "Thiotepa")) ) %>% var_labels(times = "Survival time", rx = "Treatment")
model_surv <- survreg(Surv(times, event) ~ rx, data = bladder)
Using robust standard errors:
model_surv %>% glm_coef(labels = c("Treatment: Thiotepa/Placebo", "Scale"), se_rob = TRUE) %>% as_hux() %>% set_align(everywhere, 2:3, "right") %>% theme_pubh(1) %>% add_footnote(get_r2(model_surv), font_size = 9)
In this example the scale parameter is not statistically different from one, meaning hazard is constant and thus, we can use the exponential distribution:
model_exp <- survreg(Surv(times, event) ~ rx, data = bladder, dist = "exponential")
model_exp %>% glm_coef(labels = c("Treatment: Thiotepa/Placebo"), se_rob = TRUE) %>% as_hux() %>% set_align(everywhere, 2:3, "right") %>% theme_pubh() %>% add_footnote(get_r2(model_exp), font_size = 9)
Interpretation: Patients receiving Thiotepa live on average 64% more than those in the Placebo group.
Using naive standard errors:
model_exp %>% glm_coef(labels = c("Treatment: Thiotepa/Placebo")) %>% as_hux() %>% set_align(everywhere, 2:3, "right") %>% theme_pubh() %>% add_footnote(get_r2(model_exp), font_size = 9)
model_exp %>% plot_model(type = "pred", terms = ~ rx, dot.size = 1.5, title = "") %>% gf_labs(y = "Survival time", x = "Treatment", title = "")
model_cox <- coxph(Surv(times, event) ~ rx, data = bladder)
model_cox %>% tbl_regression(exponentiate = TRUE) %>% cosm_reg() %>% theme_pubh() %>% add_footnote(get_r2(model_cox), font_size = 9)
Interpretation: Patients receiving Thiotepa are 64% less likely of dying than those in the Placebo group.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.