knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, fig.width = 12, fig.height = 6, dpi = 150 ) # Use ragg for better font rendering if available if (requireNamespace("ragg", quietly = TRUE)) { knitr::opts_chunk$set(dev = "ragg_png") } old_opts <- options(width = 180)
Regression analysis quantifies the relationship between a response variable and one or more predictors while accounting for potential confounding. The standard analytical workflow proceeds in two phases: univariable (unadjusted) analysis, in which each predictor is examined independently; and multivariable (adjusted) analysis, in which effects conditional on other predictors are examined collectively. Comparing these estimates found in these analyses reveals the extent of confounding and identifies independent predictors of a given outcome.
The summata package provides three principal functions for regression analysis:
| Function | Purpose |
|:---------|:--------|
| uniscreen() | Univariable screening of multiple predictors |
| fit() | Single univariable/multivariable model |
| fullfit() | Combined univariable and multivariable analysis |
All functions support a wide range of model types including linear models, generalized linear models (logistic, Poisson, Gaussian, Gamma, negative binomial), Cox proportional hazards models, and mixed-effects models, with consistent syntax and formatted output.
As with other summata functions, these functions adhere to the standard calling convention:
fullfit(data, outcome, predictors, model_type, ...)
where data is the dataset, outcome is the outcome variable, predictors is a vector of predicting variables, and model_type is the type of model to be implemented. This vignette demonstrates the various capabilities of these functions using the included sample dataset.
The examples in this vignette use the clintrial dataset included with summata:
library(summata) library(survival) data(clintrial) data(clintrial_labels)
The clintrial dataset simulates a multisite oncology clinical trial with realistic outcomes and site-level clustering. Variables suitable for different regression approaches include:
| Variable | Description | Type | Application |
|:---------|:------------|:-----|:------------|
| pain_score | Postoperative Pain Score (0-10) | Continuous | Linear regression |
| readmission_30d | 30-Day Readmission | Binary | Logistic regression |
| any_complication | Any Complication | Binary | Logistic regression |
| los_days | Length of Hospital Stay (days) | Continuous (positive) | Linear or Gamma regression |
| recovery_days | Days to Functional Recovery | Continuous (positive) | Linear or Gamma regression |
| ae_count | Adverse Event Count | Count (overdispersed) | Negative binomial / quasipoisson |
| fu_count | Follow-Up Visit Count | Count (equidispersed) | Poisson regression |
| pfs_months, pfs_status | Progression-Free Survival Time (months) | Time-to-event | Cox regression |
| os_months, os_status | Overall Survival Time (months) | Time-to-event | Cox regression |
| site | Study Site | Clustering variable | Mixed-effects models |
The following table summarizes all supported model types (model_type) and families/links (family). For most estimates, effect measures are displayed by default; set exponentiate = FALSE to display raw coefficients.
| model_type | family | Outcome | Effect Measure | Package |
|:-------------|:------------|:--------|:---------------|:--------|
| "lm" | — | Continuous | β coefficient | stats |
| "glm" | "binomial" | Binary | Odds ratio | stats |
| "glm" | "quasibinomial" | Binary (overdispersed) | Odds ratio | stats |
| "glm" | "poisson" | Count | Rate ratio | stats |
| "glm" | "quasipoisson" | Count (overdispersed) | Rate ratio | stats |
| "glm" | "gaussian" | Continuous | β coefficient | stats |
| "glm" | "Gamma" | Positive continuous | Ratio (with log link) | stats |
| "glm" | "inverse.gaussian" | Positive continuous | Coefficient | stats |
| "negbin" | — | Count (overdispersed) | Rate ratio | MASS |
| "coxph" | — | Time-to-event | Hazard ratio | survival |
| "clogit" | — | Matched binary | Odds ratio | survival |
| "lmer" | — | Continuous (clustered) | β coefficient | lme4 |
| "glmer" | "binomial" | Binary (clustered) | Odds ratio | lme4 |
| "glmer" | "poisson" | Count (clustered) | Rate ratio | lme4 |
| "coxme" | — | Time-to-event (clustered) | Hazard ratio | coxme |
The uniscreen() function fits separate regression models for each predictor, combining results into a single table. This approach identifies candidate predictors for multivariable modeling.
For binary outcomes, use model_type = "glm" (logistic regression is the default family). Here we examine predictors of 30-day hospital readmission:
screening_vars <- c("age", "sex", "race", "bmi", "smoking", "diabetes", "stage", "ecog", "treatment") example1 <- uniscreen( data = clintrial, outcome = "readmission_30d", predictors = screening_vars, model_type = "glm", labels = clintrial_labels ) example1
Each entry represents a separate univariable model. For categorical predictors, each non-reference level appears as a separate row.
The p_threshold parameter retains only predictors meeting a significance criterion:
example2 <- uniscreen( data = clintrial, outcome = "readmission_30d", predictors = screening_vars, model_type = "glm", p_threshold = 0.01, labels = clintrial_labels ) example2
For continuous outcomes, specify model_type = "lm":
example4 <- uniscreen( data = clintrial, outcome = "los_days", predictors = c("age", "sex", "stage", "diabetes", "ecog"), model_type = "lm", labels = clintrial_labels ) example4
For survival outcomes, specify the outcome using Surv() notation and set model_type = "coxph":
example3 <- uniscreen( data = clintrial, outcome = "Surv(os_months, os_status)", predictors = c("age", "sex", "treatment", "stage", "ecog"), model_type = "coxph", labels = clintrial_labels ) example3
Setting keep_models = TRUE stores the fitted model objects for diagnostics:
example5 <- uniscreen( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "stage"), model_type = "glm", keep_models = TRUE ) # Access individual models models <- attr(example5, "models") names(models) # Examine a specific model summary(models[["age"]])
The fit() function estimates a single regression model with multiple predictors, producing adjusted effect estimates. It is effectively a wrapper for existing model functions in R and other supported packages that enforces summata syntax for more consistent function calling.
For binary outcomes with multiple predictors:
example6 <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "treatment", "stage", "diabetes"), model_type = "glm", labels = clintrial_labels ) example6
For continuous outcomes:
example8 <- fit( data = clintrial, outcome = "los_days", predictors = c("age", "sex", "stage", "ecog"), model_type = "lm", labels = clintrial_labels ) example8
For time-to-event outcomes:
example7 <- fit( data = clintrial, outcome = "Surv(os_months, os_status)", predictors = c("age", "sex", "treatment", "stage"), model_type = "coxph", labels = clintrial_labels ) example7
For count outcomes where variance approximately equals the mean, use model_type = "glm" with family = "poisson". The fu_count variable represents the number of follow-up clinic visits, which is equidispersed (suitable for standard Poisson):
example9 <- fit( data = clintrial, outcome = "fu_count", predictors = c("age", "stage", "treatment", "surgery"), model_type = "glm", family = "poisson", labels = clintrial_labels ) example9
Output displays rate ratios (RR). A rate ratio of 1.10 indicates a 10% higher event rate compared to the reference group. For overdispersed counts (variance > mean), consider negative binomial or quasipoisson regression instead (see Example 18).
By default, reference rows are shown in output tables (reference_rows = TRUE). Setting this parameter to FALSE removes these reference rows:
example10 <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("sex", "stage", "treatment"), model_type = "glm", reference_rows = FALSE, labels = clintrial_labels ) example10
The conf_level parameter adjusts the confidence interval width:
example11 <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "stage"), model_type = "glm", conf_level = 0.90 ) example11
By default, summata computes confidence intervals based on per-model best practices:
| Model Class | Default CI Method | Description |
|:------------|:------------------|:------------|
| GLM (binomial, poisson) | Profile likelihood | Via MASS::confint.glm(). More accurate near boundary estimates |
| Negative binomial | Profile likelihood | Via MASS::confint.glm() |
| Quasi-likelihood | Wald | No true likelihood; Wald is the only option |
| Linear model (lm) | Exact t-distribution | Via confint.lm(). Accounts for estimated residual variance |
| Cox PH (coxph) | Wald | Standard in survival analysis literature |
| Mixed-effects (lmer, glmer, coxme) | Wald | Profile too slow for routine use |
The conf_method parameter controls this behavior. For faster modeling, Wald (normal approximation) intervals can be substituted; this is computationally simpler but may be less accurate in edge cases. To set Wald intervals as the default method globally, use options(summata.conf_method = "wald").
# Default CIs (profile likelihood for GLM, slower but more accurate) example12a <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "stage"), model_type = "glm", conf_method = "profile" ) example12a # Wald CIs (faster, suitable for simulation studies or exploratory work) example12b <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "stage"), model_type = "glm", conf_method = "wald" ) example12b
For logistic and Cox models, set exponentiate = FALSE to display log-scale coefficients (β rather than e^β^):
example13 <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "stage"), model_type = "glm", exponentiate = FALSE ) example13
The fullfit() function integrates univariable screening and multivariable modeling into a single workflow, producing a combined table with both unadjusted and adjusted estimates.
This function extends fit() with an additional parameter (method) for variable selection:
fullfit(data, outcome, predictors, model_type, method, ...)
For the method parameter, the following options are available. Setting method to one of the following options controls the predictors entering the multivariable model:
| Method | Description |
|:-------|:------------|
| "screen" | Only predictors meeting the p-threshold in univariable analysis are used in the multivariable model (default) |
| "all" | All predictors are used in both univariable and multivariable analyses |
| "custom" | Multivariable predictors are explicitly specified |
The default method = "screen" approach specifies that only predictors with univariable p-value below p_threshold are subsequently used in the multivariable analysis:
example14 <- fullfit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "bmi", "smoking", "diabetes", "stage", "treatment"), model_type = "glm", method = "screen", p_threshold = 0.05, labels = clintrial_labels ) example14
The method = "all" approach includes the same predictors in both univariable and multivariable analysis:
example15 <- fullfit( data = clintrial, outcome = "any_complication", predictors = c("age", "sex", "treatment", "stage"), model_type = "glm", method = "all", labels = clintrial_labels ) example15
The method = "custom" approach allows explicit specification of multivariable predictors via the multi_predictors argument:
example16 <- fullfit( data = clintrial, outcome = "icu_admission", predictors = c("age", "sex", "bmi", "smoking", "stage", "treatment"), model_type = "glm", method = "custom", multi_predictors = c("age", "sex", "stage", "treatment"), labels = clintrial_labels ) example16
The columns parameter controls which results are displayed:
# Univariable only example17a <- fullfit( data = clintrial, outcome = "wound_infection", predictors = c("age", "sex", "stage"), model_type = "glm", columns = "uni" ) example17a # Multivariable only example17b <- fullfit( data = clintrial, outcome = "wound_infection", predictors = c("age", "sex", "stage"), model_type = "glm", columns = "multi" ) example17b
Output tables can use Cox regression for survival outcomes by specifying Surv() notation and model_type = "coxph":
example18 <- fullfit( data = clintrial, outcome = "Surv(os_months, os_status)", predictors = c("age", "sex", "treatment", "stage", "ecog"), model_type = "coxph", method = "screen", p_threshold = 0.10, labels = clintrial_labels ) example18
The summata package supports several additional model types for specialized analyses. These require external packages that are suggested dependencies.
Negative binomial models are appropriate for overdispersed count data where the variance exceeds the mean—a common violation of the Poisson assumption. This model type requires the MASS package. The ae_count variable in clintrial is generated with overdispersion, making it ideal for this demonstration.
example19 <- fit( data = clintrial, outcome = "ae_count", predictors = c("age", "sex", "diabetes", "treatment"), model_type = "negbin", labels = clintrial_labels ) example19
Gamma regression is appropriate for positive, continuous, right-skewed outcomes. In the clintrial dataset, length of hospital stay (los_days) fits this description. Note that R's canonical link for the Gamma family is the inverse link (1/μ), which produces coefficients on a difficult-to-interpret scale. The log link is generally preferred for interpretability—exponentiated coefficients then represent multiplicative effects on the mean. When family = "Gamma" is passed as a string, summata defaults to the log link:
# String shorthand: resolves to Gamma(link = "log") example20 <- fit( data = clintrial, outcome = "los_days", predictors = c("age", "ecog", "stage", "treatment"), model_type = "glm", family = "Gamma", labels = clintrial_labels ) example20
Linear mixed-effects models (LMMs) account for hierarchical or clustered data structures. The clintrial dataset includes a site variable representing the study site, making it suitable for demonstrating random effects. This model type requires the lme4 package.
Include random intercepts for study site using (1|site) notation in the predictors:
example21 <- fit( data = clintrial, outcome = "los_days", predictors = c("age", "sex", "treatment", "stage", "(1|site)"), model_type = "lmer", labels = clintrial_labels ) example21
Generalized linear mixed-effects models (GLMMs) extend mixed-effects models to non-normal outcomes (binary, count). This model type also requires the lme4 package.
Model 30-day readmission with site-level random effects:
example22 <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "sex", "diabetes", "treatment", "(1|site)"), model_type = "glmer", family = "binomial", labels = clintrial_labels ) example22
For count outcomes with clustering, use fu_count (equidispersed) with site-level random effects. Standard Poisson GLMMs assume equidispersion:
example23 <- fit( data = clintrial, outcome = "fu_count", predictors = c("age", "stage", "treatment", "(1|site)"), model_type = "glmer", family = "poisson", labels = clintrial_labels ) example23
Cox mixed-effects models account for within-cluster correlation in survival outcomes. This is useful when patients are nested within sites or physicians and outcomes may be correlated within clusters. This model type requires the coxme package. Model overall survival with site-level random effects:
example24 <- fit( data = clintrial, outcome = "Surv(os_months, os_status)", predictors = c("age", "sex", "treatment", "stage", "(1|site)"), model_type = "coxme", labels = clintrial_labels ) example24
When binary or count data exhibit overdispersion (residual deviance >> residual degrees of freedom), quasi-likelihood models provide more appropriate standard errors.
example25 <- fit( data = clintrial, outcome = "any_complication", predictors = c("age", "sex", "diabetes", "stage"), model_type = "glm", family = "quasibinomial", labels = clintrial_labels ) example25
Conditional logistic regression is used for matched case-control studies where cases and controls are matched within strata. The stratification variable should be specified using the strata parameter. This model type requires the survival package.
# Matched case-control dataset (100 pairs) set.seed(456) n_pairs <- 100 matched_data <- do.call(rbind, lapply(1:n_pairs, function(i) { base_bmi <- rnorm(1, 27, 4) data.frame( match_id = i, case = c(1, 0), smoking = factor(c(rbinom(1, 1, 0.45), rbinom(1, 1, 0.30)), levels = c(0, 1), labels = c("No", "Yes")), diabetes = factor(c(rbinom(1, 1, 0.30), rbinom(1, 1, 0.20)), levels = c(0, 1), labels = c("No", "Yes")), bmi = c(base_bmi + rnorm(1, 1.5, 2), base_bmi + rnorm(1, -0.5, 2)) ) })) example26 <- fit( data = matched_data, outcome = "case", predictors = c("smoking", "diabetes", "bmi"), model_type = "clogit", strata = "match_id" ) example26
The output shows odds ratios conditional on the matching strata.
Regression tables can be exported to various formats:
# Microsoft Word table2docx( table = example13, file = file.path(tempdir(), "Table2_Regression.docx"), caption = "Table 2. Univariable and Multivariable Analysis" ) # PDF table2pdf( table = example13, file = file.path(tempdir(), "Table2_Regression.pdf"), caption = "Table 2. Regression Results" )
See the Table Export vignette for comprehensive documentation.
uniscreen() with liberal p-threshold (e.g., 0.20)fullfit() with method = "screen" or method = "custom"fit() with pre-specified predictorscompfit() (see Model Comparison)| Outcome Type | Characteristics | Recommended Model |
|:-------------|:----------------|:------------------|
| Binary | Independent observations | glm with binomial |
| Binary | Overdispersed | glm with quasibinomial |
| Binary | Clustered/hierarchical | glmer with binomial |
| Binary | Matched case-control | clogit |
| Count | Mean ≈ variance | glm with poisson |
| Count | Variance > mean | negbin or quasipoisson |
| Count | Clustered | glmer with poisson |
| Continuous | Symmetric, normal errors | lm |
| Continuous | Positive, right-skewed | glm with Gamma |
| Continuous | Clustered | lmer |
| Time-to-event | Independent | coxph |
| Time-to-event | Clustered | coxme |
The comparison between univariable and multivariable estimates reveals confounding:
Adequate events per predictor are required for stable estimation:
method = "screen" to reduce predictors when sample size is limitedRegression functions use complete-case analysis by default:
# Check missingness sapply(clintrial[, c("age", "sex", "stage")], function(x) sum(is.na(x))) # Create complete-case dataset explicitly complete_data <- na.omit(clintrial[, c("readmission_30d", "age", "sex", "stage")])
Ensure reference categories are set appropriately:
# Set specific reference level clintrial$stage <- relevel(factor(clintrial$stage), ref = "I")
For models that fail to converge:
# Access model for diagnostics result <- fit(data, outcome, predictors, model_type = "glm") model <- attr(result, "model") # Check convergence model$converged # Large coefficients may indicate separation coef(model)
To check for overdispersion in Poisson or binomial models:
# Fit Poisson model result <- fit(data, outcome, predictors, model_type = "glm", family = "poisson") model <- attr(result, "model") # Dispersion estimate (should be ~1 for no overdispersion) sum(residuals(model, type = "pearson")^2) / model$df.residual
If the dispersion is substantially greater than 1, consider using quasipoisson, quasibinomial, or negbin instead.
Mixed-effects models can be sensitive to starting values and optimization:
# Increase iterations result <- fit( data = clintrial, outcome = "readmission_30d", predictors = c("age", "treatment", "(1|site)"), model_type = "glmer", family = "binomial", control = lme4::glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)) )
The summata package currently supports binary outcomes (via logistic regression) but not multinomial or ordinal outcomes with more than two categories. For categorical outcomes with more than two levels, use dedicated packages such as nnet::multinom() for multinomial regression or MASS::polr() for ordinal regression.
options(old_opts)
desctable() for baseline characteristicscompfit() for comparing modelsmultifit() for multi-outcome analysisAny scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.