There has been a long relationship between Epidemiology / Public Health and Biostatistics. Students frequently find introductory textbooks explaining statistical methods and the math behind them, but how to implement those techniques on a computer is rarely explained.
One of the most popular statistical software's in public health settings is Stata
. Stata
has the advantage of offering a solid interface with functions that can be accessed via the use of commands or by selecting the proper input in the graphical unit interface (GUI). Furthermore, Stata
offers "Grad Plans" to postgraduate students, making it relatively affordable from an economic point of view.
R
is a free alternative to Stata
. Its use keeps growing, and its popularity is also increasing due to the relatively high number of packages and textbooks available.
In epidemiology, some good packages are already available for R
, including: Epi
, epibasix
, epiDisplay
, epiR
and epitools
. The pubh
package does not intend to replace any of them, but to only provide a standard syntax for the most frequent statistical analysis in epidemiology.
Most students and professionals from the disciplines of Epidemiology and Public Health analyse variables in terms of outcome, exposure and confounders. The following table shows the most common names used in the literature to characterise variables in a cause-effect relationships:
Response variable|Explanatory variable(s) ---|--- Outcome|Exposure and confounders Outcome|Predictors Dependent variable|Independent variable(s) y|x
In R
, formulas
declare relationships between variables. Formulas are also common in classic statistical tests and regression methods.
Formulas have the following standard syntax:
y ~ x, data = my_data
Where y
is the outcome or response variable, x
is the exposure or predictor of interest, and my_data
specifies the data frame's name where x
and y
can be found.
The symbol ~
is used in R
for formulas. It can be interpreted as depends on. In the most typical scenario, y ~ x
means y
depends on x
or y
is a function of x
:
y = f(x)
Using epidemiology friendly terms:
Outcome = f(Exposure)
It is worth noting that Stata
requires variables to be given in the same order as in formulas: outcome first, predictors next.
The pubh
package integrates well with other packages of the tidyverse
which use formulas and the pipe operator |>
to add layers over functions. In particular, pubh
uses ggformula
as a graphical interface for plotting and takes advantage of variable labels from sjlabelled
. This versatility allows it to interact also with tables from crosstable
.
One way to control for confounders is the use of stratification. In Stata
, stratification is done by including the by
option as part of the command. In the ggformula
package, one way of doing stratification is with a formula like:
y ~ x|z, data = my_data
Where y
is the outcome or response variable, x
is the exposure or predictor of interest, z
is the stratification variable (a factor) and my_data
specifies the name of the data frame where x
, y
and z
can be found.
tidyverse
The tidyverse
has become now the standard in data manipulation in R
. The use of the pipe function |>
allows for cleaner code. The principle of pipes is to change the paradigm in coding. In standard codding, when many functions are used, one goes from inner parenthesis to outer ones.
For example if we have three functions, a common syntax would look like:
f3(f2(f1(..., data), ...), ...)
With pipes, however, the code reads top to bottom and left to right:
data |> f1(...) |> f2(...) |> f3(...)
Most of the functions from pubh
are compatible with pipes and the tidyverse
.
pubh
packageThe main contributions of the pubh
package to students and professionals from the disciplines of Epidemiology and Public Health are:
glm_coef
that displays coefficients from most common regression analysis in a way that can be easy interpreted and used for publications.There are many options currently available for displaying descriptive statistics. I recommend the function crosstable
from the crosstable
package for constructing tables of descriptive statistics where results need to be stratified.
The estat
function from the pubh
package displays descriptive statistics of continuous outcomes; it can use labels to display nice tables. To aid in understanding the variability, estat
also shows the relative dispersion (coefficient of variation), which is particularly interesting for comparing variances between groups (factors).
Some examples. We start by loading packages.
rm(list = ls()) library(dplyr) library(rstatix) library(crosstable) library(pubh) library(sjlabelled)
We will use a data set about a study of onchocerciasis in Sierra Leone.
data(Oncho) Oncho |> head()
crosstable_options( total = "row", percent_pattern="{n} ({p_col})", percent_digits = 1, funs = c("Mean (std)" = meansd, "Median [IQR]" = mediqr) )
A two-by-two contingency table:
Oncho |> select(mf, area) |> mutate( mf = relevel(mf, ref = "Infected") ) |> copy_labels(Oncho) |> crosstable(by = area) |> ctf()
Table with all descriptive statistics except id
and mfload
:
Oncho |> select(- c(id, mfload)) |> mutate( mf = relevel(mf, ref = "Infected") ) |> copy_labels(Oncho) |> crosstable(by = area) |> ctf()
Next, we use a data set about blood counts of T cells from patients in remission from Hodgkin's disease or in remission from disseminated malignancies. We generate the new variable Ratio
and add labels to the variables.
data(Hodgkin) Hodgkin <- Hodgkin |> mutate(Ratio = CD4/CD8) |> var_labels( Ratio = "CD4+ / CD8+ T-cells ratio" ) Hodgkin |> head()
Descriptive statistics for CD4+ T cells:
Hodgkin |> estat(~ CD4)
In the previous code, the left-hand side of the formula is empty as it's the case when working with only one variable.
For stratification, estat
recognises the following two syntaxes:
outcome ~ exposure
~ outcome | exposure
where, outcome
is continuous and exposure
is a categorical (factor
) variable.
For example:
Hodgkin |> estat(~ Ratio|Group)
As before, we can report a table of descriptive statistics for all variables in the data set:
Hodgkin |> mutate( Group = relevel(Group, ref = "Hodgkin") ) |> copy_labels(Hodgkin) |> crosstable(by = Group) |> ctf()
From the descriptive statistics of Ratio, we know that the relative dispersion in the Hodgkin group is almost as high as the relative dispersion in the non-Hodgkin group.
For new users of R
, it helps to have a standard syntax in most of the commands, as R
could sometimes be challenging and even intimidating. We can test if the difference in variance is statistically significant with the var.test
command, which uses can use a formula syntax:
var.test(Ratio ~ Group, data = Hodgkin)
What about normality? We can look at the QQ-plots against the standard Normal distribution.
Hodgkin |> qq_plot(~ Ratio|Group)
Let's say we choose a non-parametric test to compare the groups:
wilcox.test(Ratio ~ Group, data = Hodgkin)
For relatively small samples (for example, less than 30 observations per group), it is a standard practice to show the actual data in dot plots with error bars. The pubh
package offers two options to show graphically differences in continuous outcomes among groups:
strip_error
bar_error
For our current example:
Hodgkin |> strip_error(Ratio ~ Group)
The error bars represent 95% confidence intervals around mean values.
Adding a line on top is relatively easy to show that the two groups are significantly different. The function gf_star
needs the reference point on how to draw an horizontal line to display statistical differences or to annotate a plot; in summary, gf_star
:
Thus:
$$ y1 < y2 < y3 $$
In our current example:
Hodgkin |> strip_error(Ratio ~ Group) |> gf_star(x1 = 1, y1 = 4, x2 = 2, y2 = 4.05, y3 = 4.1, "**")
For larger samples we could use bar charts with error bars. For example:
data(birthwt, package = "MASS") birthwt <- birthwt |> mutate( smoke = factor(smoke, labels = c("Non-smoker", "Smoker")), Race = factor(race > 1, labels = c("White", "Non-white")), race = factor(race, labels = c("White", "Afican American", "Other")) ) |> var_labels( bwt = 'Birth weight (g)', smoke = 'Smoking status', race = 'Race', )
birthwt |> bar_error(bwt ~ smoke)
Quick normality check:
birthwt |> qq_plot(~ bwt|smoke)
The (unadjusted) $t$-test:
birthwt |> t_test(bwt ~ smoke, detailed = TRUE) |> as.data.frame()
The final plot with annotation to highlight statistical difference (unadjusted):
birthwt |> bar_error(bwt ~ smoke) |> gf_star(x1 = 1, x2 = 2, y1 = 3400, y2 = 3500, y3 = 3550, "**")
Both strip_error
and bar_error
can generate plots stratified by a third variable, for example:
birthwt |> bar_error(bwt ~ smoke, fill = ~ Race)
birthwt |> bar_error(bwt ~ smoke|Race)
birthwt |> strip_error(bwt ~ smoke, pch = ~ Race, col = ~ Race)
The pubh
package includes the function cosm_reg
, which adds some cosmetics to objects generated by tbl_regression
and huxtable
. The function multiple
helps analyse and visualise multiple comparisons.
For simplicity, here we show the analysis of the linear model of smoking on birth weight, adjusting by race (and not by other potential confounders).
model_bwt <- lm(bwt ~ smoke + race, data = birthwt)
model_bwt |> glm_coef(labels = model_labels(model_bwt))
In the last table of coefficients, confidence intervals and p-values for race levels are not adjusted for multiple comparisons. We can use functions from the emmeans
package to make the corrections.
multiple(model_bwt, ~ race)$df
multiple(model_bwt, ~ race)$fig_ci |> gf_labs(x = "Difference in birth weights (g)")
multiple(model_bwt, ~ race)$fig_pval |> gf_labs(y = " ")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.