Regression Modeling
In summata: Publication-Ready Summary Tables and Forest Plots

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  fig.width = 12,
  fig.height = 6,
  dpi = 150
)

# Use ragg for better font rendering if available
if (requireNamespace("ragg", quietly = TRUE)) {
  knitr::opts_chunk$set(dev = "ragg_png")
}

old_opts <- options(width = 180)

Regression analysis quantifies the relationship between a response variable and one or more predictors while accounting for potential confounding. The standard analytical workflow proceeds in two phases: univariable (unadjusted) analysis, in which each predictor is examined independently; and multivariable (adjusted) analysis, in which effects conditional on other predictors are examined collectively. Comparing these estimates found in these analyses reveals the extent of confounding and identifies independent predictors of a given outcome.

The summata package provides three principal functions for regression analysis:

| Function | Purpose | |:---------|:--------| | uniscreen() | Univariable screening of multiple predictors | | fit() | Single univariable/multivariable model | | fullfit() | Combined univariable and multivariable analysis |

All functions support a wide range of model types including linear models, generalized linear models (logistic, Poisson, Gaussian, Gamma, negative binomial), Cox proportional hazards models, and mixed-effects models, with consistent syntax and formatted output.

As with other summata functions, these functions adhere to the standard calling convention:

fullfit(data, outcome, predictors, model_type, ...)

where data is the dataset, outcome is the outcome variable, predictors is a vector of predicting variables, and model_type is the type of model to be implemented. This vignette demonstrates the various capabilities of these functions using the included sample dataset.

Preliminaries

The examples in this vignette use the clintrial dataset included with summata:

library(summata)
library(survival)

data(clintrial)
data(clintrial_labels)

The clintrial dataset simulates a multisite oncology clinical trial with realistic outcomes and site-level clustering. Variables suitable for different regression approaches include:

| Variable | Description | Type | Application | |:---------|:------------|:-----|:------------| | pain_score | Postoperative Pain Score (0-10) | Continuous | Linear regression | | readmission_30d | 30-Day Readmission | Binary | Logistic regression | | any_complication | Any Complication | Binary | Logistic regression | | los_days | Length of Hospital Stay (days) | Continuous (positive) | Linear or Gamma regression | | recovery_days | Days to Functional Recovery | Continuous (positive) | Linear or Gamma regression | | ae_count | Adverse Event Count | Count (overdispersed) | Negative binomial / quasipoisson | | fu_count | Follow-Up Visit Count | Count (equidispersed) | Poisson regression | | pfs_months, pfs_status | Progression-Free Survival Time (months) | Time-to-event | Cox regression | | os_months, os_status | Overall Survival Time (months) | Time-to-event | Cox regression | | site | Study Site | Clustering variable | Mixed-effects models |

Supported Model Types

The following table summarizes all supported model types (model_type) and families/links (family). For most estimates, effect measures are displayed by default; set exponentiate = FALSE to display raw coefficients.

| model_type | family | Outcome | Effect Measure | Package | |:-------------|:------------|:--------|:---------------|:--------| | "lm" | — | Continuous | β coefficient | stats | | "glm" | "binomial" | Binary | Odds ratio | stats | | "glm" | "quasibinomial" | Binary (overdispersed) | Odds ratio | stats | | "glm" | "poisson" | Count | Rate ratio | stats | | "glm" | "quasipoisson" | Count (overdispersed) | Rate ratio | stats | | "glm" | "gaussian" | Continuous | β coefficient | stats | | "glm" | "Gamma" | Positive continuous | Ratio (with log link) | stats | | "glm" | "inverse.gaussian" | Positive continuous | Coefficient | stats | | "negbin" | — | Count (overdispersed) | Rate ratio | MASS | | "coxph" | — | Time-to-event | Hazard ratio | survival | | "clogit" | — | Matched binary | Odds ratio | survival | | "lmer" | — | Continuous (clustered) | β coefficient | lme4 | | "glmer" | "binomial" | Binary (clustered) | Odds ratio | lme4 | | "glmer" | "poisson" | Count (clustered) | Rate ratio | lme4 | | "coxme" | — | Time-to-event (clustered) | Hazard ratio | coxme |

Univariable Screening

The uniscreen() function fits separate regression models for each predictor, combining results into a single table. This approach identifies candidate predictors for multivariable modeling.

Example 1: Binary Outcome Screen (Logistic Regression)

For binary outcomes, use model_type = "glm" (logistic regression is the default family). Here we examine predictors of 30-day hospital readmission:

screening_vars <- c("age", "sex", "race", "bmi", "smoking", 
                    "diabetes", "stage", "ecog", "treatment")

example1 <- uniscreen(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = screening_vars,
  model_type = "glm",
  labels = clintrial_labels
)

example1

Each entry represents a separate univariable model. For categorical predictors, each non-reference level appears as a separate row.

Example 2: Filtering Screen by p-value

The p_threshold parameter retains only predictors meeting a significance criterion:

example2 <- uniscreen(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = screening_vars,
  model_type = "glm",
  p_threshold = 0.01,
  labels = clintrial_labels
)

example2

Example 3: Continuous Outcome Screen (Linear Regression)

For continuous outcomes, specify model_type = "lm":

example4 <- uniscreen(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "sex", "stage", "diabetes", "ecog"),
  model_type = "lm",
  labels = clintrial_labels
)

example4

Example 4: Time-to-Event Outcome Screen (Cox Regression)

For survival outcomes, specify the outcome using Surv() notation and set model_type = "coxph":

example3 <- uniscreen(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage", "ecog"),
  model_type = "coxph",
  labels = clintrial_labels
)

example3

Example 5: Retaining Model Objects

Setting keep_models = TRUE stores the fitted model objects for diagnostics:

example5 <- uniscreen(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  keep_models = TRUE
)

# Access individual models
models <- attr(example5, "models")
names(models)

# Examine a specific model
summary(models[["age"]])

Multivariable Modeling

The fit() function estimates a single regression model with multiple predictors, producing adjusted effect estimates. It is effectively a wrapper for existing model functions in R and other supported packages that enforces summata syntax for more consistent function calling.

Example 6: Logistic Regression

For binary outcomes with multiple predictors:

example6 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "treatment", "stage", "diabetes"),
  model_type = "glm",
  labels = clintrial_labels
)

example6

Example 7: Linear Regression

For continuous outcomes:

example8 <- fit(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "sex", "stage", "ecog"),
  model_type = "lm",
  labels = clintrial_labels
)

example8

Example 8: Cox Regression

For time-to-event outcomes:

example7 <- fit(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage"),
  model_type = "coxph",
  labels = clintrial_labels
)

example7

Example 9: Poisson Regression

For count outcomes where variance approximately equals the mean, use model_type = "glm" with family = "poisson". The fu_count variable represents the number of follow-up clinic visits, which is equidispersed (suitable for standard Poisson):

example9 <- fit(
  data = clintrial,
  outcome = "fu_count",
  predictors = c("age", "stage", "treatment", "surgery"),
  model_type = "glm",
  family = "poisson",
  labels = clintrial_labels
)

example9

Output displays rate ratios (RR). A rate ratio of 1.10 indicates a 10% higher event rate compared to the reference group. For overdispersed counts (variance > mean), consider negative binomial or quasipoisson regression instead (see Example 18).

Example 10: Toggling Reference Rows for Factors

By default, reference rows are shown in output tables (reference_rows = TRUE). Setting this parameter to FALSE removes these reference rows:

example10 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("sex", "stage", "treatment"),
  model_type = "glm",
  reference_rows = FALSE,
  labels = clintrial_labels
)

example10

Example 11: Confidence Level

The conf_level parameter adjusts the confidence interval width:

example11 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  conf_level = 0.90
)

example11

Example 12: Confidence Interval Method

By default, summata computes confidence intervals based on per-model best practices:

| Model Class | Default CI Method | Description | |:------------|:------------------|:------------| | GLM (binomial, poisson) | Profile likelihood | Via MASS::confint.glm(). More accurate near boundary estimates | | Negative binomial | Profile likelihood | Via MASS::confint.glm() | | Quasi-likelihood | Wald | No true likelihood; Wald is the only option | | Linear model (lm) | Exact t-distribution | Via confint.lm(). Accounts for estimated residual variance | | Cox PH (coxph) | Wald | Standard in survival analysis literature | | Mixed-effects (lmer, glmer, coxme) | Wald | Profile too slow for routine use |

The conf_method parameter controls this behavior. For faster modeling, Wald (normal approximation) intervals can be substituted; this is computationally simpler but may be less accurate in edge cases. To set Wald intervals as the default method globally, use options(summata.conf_method = "wald").

# Default CIs (profile likelihood for GLM, slower but more accurate)
example12a <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  conf_method = "profile"
)

example12a

# Wald CIs (faster, suitable for simulation studies or exploratory work)
example12b <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  conf_method = "wald"
)

example12b

Example 13: Raw Coefficients

For logistic and Cox models, set exponentiate = FALSE to display log-scale coefficients (β rather than e^β^):

example13 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  exponentiate = FALSE
)

example13

Combined Analysis

The fullfit() function integrates univariable screening and multivariable modeling into a single workflow, producing a combined table with both unadjusted and adjusted estimates.

This function extends fit() with an additional parameter (method) for variable selection:

fullfit(data, outcome, predictors, model_type, method, ...)

For the method parameter, the following options are available. Setting method to one of the following options controls the predictors entering the multivariable model:

| Method | Description | |:-------|:------------| | "screen" | Only predictors meeting the p-threshold in univariable analysis are used in the multivariable model (default) | | "all" | All predictors are used in both univariable and multivariable analyses | | "custom" | Multivariable predictors are explicitly specified |

Example 14: Screening-Based Selection

The default method = "screen" approach specifies that only predictors with univariable p-value below p_threshold are subsequently used in the multivariable analysis:

example14 <- fullfit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "bmi", "smoking", "diabetes",
                 "stage", "treatment"),
  model_type = "glm",
  method = "screen",
  p_threshold = 0.05,
  labels = clintrial_labels
)

example14

Example 15: All Predictors

The method = "all" approach includes the same predictors in both univariable and multivariable analysis:

example15 <- fullfit(
  data = clintrial,
  outcome = "any_complication",
  predictors = c("age", "sex", "treatment", "stage"),
  model_type = "glm",
  method = "all",
  labels = clintrial_labels
)

example15

Example 16: Custom Selection

The method = "custom" approach allows explicit specification of multivariable predictors via the multi_predictors argument:

example16 <- fullfit(
  data = clintrial,
  outcome = "icu_admission",
  predictors = c("age", "sex", "bmi", "smoking", "stage", "treatment"),
  model_type = "glm",
  method = "custom",
  multi_predictors = c("age", "sex", "stage", "treatment"),
  labels = clintrial_labels
)

example16

Example 17: Controlling Output Columns

The columns parameter controls which results are displayed:

# Univariable only
example17a <- fullfit(
  data = clintrial,
  outcome = "wound_infection",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  columns = "uni"
)

example17a

# Multivariable only
example17b <- fullfit(
  data = clintrial,
  outcome = "wound_infection",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  columns = "multi"
)

example17b

Example 18: Survival Analysis

Output tables can use Cox regression for survival outcomes by specifying Surv() notation and model_type = "coxph":

example18 <- fullfit(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage", "ecog"),
  model_type = "coxph",
  method = "screen",
  p_threshold = 0.10,
  labels = clintrial_labels
)

example18

Extended Model Types

The summata package supports several additional model types for specialized analyses. These require external packages that are suggested dependencies.

Example 19: Negative Binomial

Negative binomial models are appropriate for overdispersed count data where the variance exceeds the mean—a common violation of the Poisson assumption. This model type requires the MASS package. The ae_count variable in clintrial is generated with overdispersion, making it ideal for this demonstration.

example19 <- fit(
  data = clintrial,
  outcome = "ae_count",
  predictors = c("age", "sex", "diabetes", "treatment"),
  model_type = "negbin",
  labels = clintrial_labels
)

example19

Example 20: Gamma Regression

Gamma regression is appropriate for positive, continuous, right-skewed outcomes. In the clintrial dataset, length of hospital stay (los_days) fits this description. Note that R's canonical link for the Gamma family is the inverse link (1/μ), which produces coefficients on a difficult-to-interpret scale. The log link is generally preferred for interpretability—exponentiated coefficients then represent multiplicative effects on the mean. When family = "Gamma" is passed as a string, summata defaults to the log link:

# String shorthand: resolves to Gamma(link = "log")
example20 <- fit(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "ecog", "stage", "treatment"),
  model_type = "glm",
  family = "Gamma",
  labels = clintrial_labels
)

example20

Example 21: Random-Intercept Model

Linear mixed-effects models (LMMs) account for hierarchical or clustered data structures. The clintrial dataset includes a site variable representing the study site, making it suitable for demonstrating random effects. This model type requires the lme4 package.

Include random intercepts for study site using (1|site) notation in the predictors:

example21 <- fit(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "sex", "treatment", "stage", "(1|site)"),
  model_type = "lmer",
  labels = clintrial_labels
)

example21

Example 22: GLMM for Binary Outcome

Generalized linear mixed-effects models (GLMMs) extend mixed-effects models to non-normal outcomes (binary, count). This model type also requires the lme4 package.

Model 30-day readmission with site-level random effects:

example22 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "diabetes", "treatment", "(1|site)"),
  model_type = "glmer",
  family = "binomial",
  labels = clintrial_labels
)

example22

Example 23: GLMM for Count Outcome

For count outcomes with clustering, use fu_count (equidispersed) with site-level random effects. Standard Poisson GLMMs assume equidispersion:

example23 <- fit(
  data = clintrial,
  outcome = "fu_count",
  predictors = c("age", "stage", "treatment", "(1|site)"),
  model_type = "glmer",
  family = "poisson",
  labels = clintrial_labels
)

example23

Example 24: Cox Mixed-Effects Model

Cox mixed-effects models account for within-cluster correlation in survival outcomes. This is useful when patients are nested within sites or physicians and outcomes may be correlated within clusters. This model type requires the coxme package. Model overall survival with site-level random effects:

example24 <- fit(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage", "(1|site)"),
  model_type = "coxme",
  labels = clintrial_labels
)

example24

Example 25: Quasibinomial for Overdispersed Binary Data

When binary or count data exhibit overdispersion (residual deviance >> residual degrees of freedom), quasi-likelihood models provide more appropriate standard errors.

example25 <- fit(
  data = clintrial,
  outcome = "any_complication",
  predictors = c("age", "sex", "diabetes", "stage"),
  model_type = "glm",
  family = "quasibinomial",
  labels = clintrial_labels
)

example25

Example 26: Conditional Logistic Model for Matched Data

Conditional logistic regression is used for matched case-control studies where cases and controls are matched within strata. The stratification variable should be specified using the strata parameter. This model type requires the survival package.

# Matched case-control dataset (100 pairs)
set.seed(456)
n_pairs <- 100
matched_data <- do.call(rbind, lapply(1:n_pairs, function(i) {
  base_bmi <- rnorm(1, 27, 4)
  data.frame(
    match_id = i,
    case = c(1, 0),
    smoking = factor(c(rbinom(1, 1, 0.45), rbinom(1, 1, 0.30)), 
                     levels = c(0, 1), labels = c("No", "Yes")),
    diabetes = factor(c(rbinom(1, 1, 0.30), rbinom(1, 1, 0.20)), 
                      levels = c(0, 1), labels = c("No", "Yes")),
    bmi = c(base_bmi + rnorm(1, 1.5, 2), base_bmi + rnorm(1, -0.5, 2))
  )
}))

example26 <- fit(
  data = matched_data,
  outcome = "case",
  predictors = c("smoking", "diabetes", "bmi"),
  model_type = "clogit",
  strata = "match_id"
)

example26

The output shows odds ratios conditional on the matching strata.

Exporting Results

Regression tables can be exported to various formats:

# Microsoft Word
table2docx(
  table = example13,
  file = file.path(tempdir(), "Table2_Regression.docx"),
  caption = "Table 2. Univariable and Multivariable Analysis"
)

# PDF
table2pdf(
  table = example13,
  file = file.path(tempdir(), "Table2_Regression.pdf"),
  caption = "Table 2. Regression Results"
)

See the Table Export vignette for comprehensive documentation.

Best Practices

Variable Selection Strategy

Exploratory phase: Use uniscreen() with liberal p-threshold (e.g., 0.20)
Model building: Use fullfit() with method = "screen" or method = "custom"
Confirmatory analysis: Use fit() with pre-specified predictors
Sensitivity analysis: Compare specifications with compfit() (see Model Comparison)

Choosing the Right Model

| Outcome Type | Characteristics | Recommended Model | |:-------------|:----------------|:------------------| | Binary | Independent observations | glm with binomial | | Binary | Overdispersed | glm with quasibinomial | | Binary | Clustered/hierarchical | glmer with binomial | | Binary | Matched case-control | clogit | | Count | Mean ≈ variance | glm with poisson | | Count | Variance > mean | negbin or quasipoisson | | Count | Clustered | glmer with poisson | | Continuous | Symmetric, normal errors | lm | | Continuous | Positive, right-skewed | glm with Gamma | | Continuous | Clustered | lmer | | Time-to-event | Independent | coxph | | Time-to-event | Clustered | coxme |

Interpreting Results

The comparison between univariable and multivariable estimates reveals confounding:

Large differences: Substantial confounding present
Similar estimates: Association robust to adjustment
Sign reversal: Simpson's paradox; investigate carefully

Sample Size Considerations

Adequate events per predictor are required for stable estimation:

Logistic/Cox models: ≥10 events per predictor (rule of thumb)
Linear models: ≥10–20 observations per predictor
Mixed-effects models: ≥5–10 clusters recommended
Use method = "screen" to reduce predictors when sample size is limited

Common Issues

Missing Data

Regression functions use complete-case analysis by default:

# Check missingness
sapply(clintrial[, c("age", "sex", "stage")], function(x) sum(is.na(x)))

# Create complete-case dataset explicitly
complete_data <- na.omit(clintrial[, c("readmission_30d", "age", "sex", "stage")])

Factor Reference Levels

Ensure reference categories are set appropriately:

# Set specific reference level
clintrial$stage <- relevel(factor(clintrial$stage), ref = "I")

Convergence Issues

For models that fail to converge:

# Access model for diagnostics
result <- fit(data, outcome, predictors, model_type = "glm")
model <- attr(result, "model")

# Check convergence
model$converged

# Large coefficients may indicate separation
coef(model)

Overdispersion Diagnostics

To check for overdispersion in Poisson or binomial models:

# Fit Poisson model
result <- fit(data, outcome, predictors, model_type = "glm", family = "poisson")
model <- attr(result, "model")

# Dispersion estimate (should be ~1 for no overdispersion)
sum(residuals(model, type = "pearson")^2) / model$df.residual

If the dispersion is substantially greater than 1, consider using quasipoisson, quasibinomial, or negbin instead.

Mixed-Effects Model Convergence

Mixed-effects models can be sensitive to starting values and optimization:

# Increase iterations
result <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "treatment", "(1|site)"),
  model_type = "glmer",
  family = "binomial",
  control = lme4::glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000))
)

Multinomial and Ordinal Outcomes

The summata package currently supports binary outcomes (via logistic regression) but not multinomial or ordinal outcomes with more than two categories. For categorical outcomes with more than two levels, use dedicated packages such as nnet::multinom() for multinomial regression or MASS::polr() for ordinal regression.

options(old_opts)

summata Publication-Ready Summary Tables and Forest Plots

Regression Modeling In summata: Publication-Ready Summary Tables and Forest Plots