fullfit: Complete Regression Analysis Workflow
In summata: Publication-Ready Summary Tables and Forest Plots

Description

Executes a comprehensive regression analysis pipeline that combines univariable screening, automatic/manual variable selection, and multivariable modeling in a single function call. This function is designed to streamline the complete analytical workflow from initial exploration to final adjusted models, with publication-ready formatted output showing both univariable and multivariable results side-by-side if desired.

Usage

fullfit(
  data,
  outcome,
  predictors,
  method = "screen",
  multi_predictors = NULL,
  p_threshold = 0.05,
  columns = "both",
  model_type = "glm",
  family = "binomial",
  random = NULL,
  conf_level = 0.95,
  reference_rows = TRUE,
  show_n = TRUE,
  show_events = TRUE,
  digits = 2,
  p_digits = 3,
  labels = NULL,
  metrics = "both",
  return_type = "table",
  keep_models = FALSE,
  exponentiate = NULL,
  conf_method = NULL,
  parallel = TRUE,
  n_cores = NULL,
  number_format = NULL,
  verbose = NULL,
  ...
)

Arguments

`data`	Data frame or data.table containing the analysis dataset. The function automatically converts data frames to data.tables for efficient processing.
`outcome`	Character string specifying the outcome variable name. For time-to-event analysis, use `Surv()` syntax for the outcome variable (e.g., `"Surv(os_months, os_status)"`).
`predictors`	Character vector of predictor variable names to analyze. All predictors are tested in univariable models. The subset included in the multivariable model depends on the `method` parameter.
`method`	Character string specifying the variable selection strategy:
- `"screen"` - Automatic selection based on univariable p-value threshold. Only predictors with p ≤ `p_threshold` in univariable analysis are included in the multivariable model [default]
- `"all"` - Include all predictors in both univariable and multivariable analyses (no selection)
- `"custom"` - Manual selection. All predictors in univariable analysis, but only those specified in `multi_predictors` are included in multivariable model
`multi_predictors`	Character vector of predictors to include in the multivariable model when `method = "custom"`. Required when using custom selection. Ignored for other methods. Default is `NULL`.
`p_threshold`	Numeric p-value threshold for automatic variable selection when `method = "screen"`. Predictors with a univariable p-value less than or equal to the threshold are included in the multivariable model. Common values: 0.05 (strict), 0.10 (moderate), 0.20 (liberal screening). Default is 0.05. Ignored for other methods.
`columns`	Character string specifying which result columns to display:
- `"both"` - Show both univariable and multivariable results side-by-side [default]
- `"uni"` - Show only univariable results
- `"multi"` - Show only multivariable results
`model_type`	Character string specifying the regression model type:
- `"glm"` - Generalized linear model (default). Supports multiple distributions via the `family` parameter including logistic, Poisson, Gamma, Gaussian, and quasi-likelihood models.
- `"lm"` - Linear regression for continuous outcomes with normally distributed errors.
- `"coxph"` - Cox proportional hazards model for time-to-event survival analysis. Requires `Surv()` outcome syntax.
- `"clogit"` - Conditional logistic regression for matched case-control studies.
- `"negbin"` - Negative binomial regression for overdispersed count data (requires MASS package). Estimates an additional dispersion parameter compared to Poisson regression.
- `"glmer"` - Generalized linear mixed-effects model for hierarchical or clustered data with non-normal outcomes (requires lme4 package and `random` parameter).
- `"lmer"` - Linear mixed-effects model for hierarchical or clustered data with continuous outcomes (requires lme4 package and `random` parameter).
- `"coxme"` - Cox mixed-effects model for clustered survival data (requires coxme package and `random` parameter).
`family`	For GLM and GLMER models, specifies the error distribution and link function. Can be a character string, a family function, or a family object. Ignored for non-GLM/GLMER models.
Binary/Binomial outcomes:
- `"binomial"` or `binomial()` - Logistic regression for binary outcomes (0/1, TRUE/FALSE). Returns odds ratios (OR). Default.
- `"quasibinomial"` or `quasibinomial()` - Logistic regression with overdispersion. Use when residual deviance >> residual df.
- `binomial(link = "probit")` - Probit regression (normal CDF link).
- `binomial(link = "cloglog")` - Complementary log-log link for asymmetric binary outcomes.
Count outcomes:
- `"poisson"` or `poisson()` - Poisson regression for count data. Returns rate ratios (RR). Assumes mean = variance.
- `"quasipoisson"` or `quasipoisson()` - Poisson regression with overdispersion. Use when variance > mean.
Continuous outcomes:
- `"gaussian"` or `gaussian()` - Normal/Gaussian distribution for continuous outcomes. Equivalent to linear regression.
- `gaussian(link = "log")` - Log-linear model for positive continuous outcomes. Returns multiplicative effects.
Positive continuous outcomes:
- `"Gamma"` or `Gamma()` - Gamma distribution for positive, right-skewed continuous data (e.g., costs, lengths of stay). When passed as a string, resolves to log link for interpretable multiplicative effects.
- `Gamma(link = "inverse")` - Gamma with inverse (canonical) link.
- `Gamma(link = "identity")` - Gamma with identity link for additive effects on positive outcomes.
- `"inverse.gaussian"` or `inverse.gaussian()` - Inverse Gaussian for positive, highly right-skewed data.
For negative binomial regression (overdispersed counts), use `model_type = "negbin"` instead of the `family` parameter. See `family` for additional details and options.
`random`	Character string specifying the random-effects formula for mixed-effects models (`glmer`, `lmer`, `coxme`). Use standard lme4/coxme syntax, e.g., `"(1|site)"` for random intercepts by site, `"(1|site/patient)"` for nested random effects. Required when `model_type` is a mixed-effects model type unless random effects are included in the `predictors` vector. Alternatively, random effects can be included directly in the `predictors` vector using the same syntax (e.g., `predictors = c("age", "sex", "(1|site)")`), though they will not be screened as predictors. Default is `NULL`.
`conf_level`	Numeric confidence level for confidence intervals. Must be between 0 and 1. Default is 0.95 (95% CI).
`reference_rows`	Logical. If `TRUE`, adds rows for reference categories of factor variables with baseline values. Default is `TRUE`.
`show_n`	Logical. If `TRUE`, includes sample size columns. Default is `TRUE`.
`show_events`	Logical. If `TRUE`, includes events columns (survival and logistic models). Default is `TRUE`.
`digits`	Integer specifying decimal places for effect estimates. Default is 2.
`p_digits`	Integer specifying the number of decimal places for p-values. Values smaller than `10^(-p_digits)` are displayed as `"< 0.001"` (for `p_digits = 3`), `"< 0.0001"` (for `p_digits = 4`), etc. Default is 3.
`labels`	Named character vector or list providing custom display labels for variables. Names should match variable names, values are display labels. Default is `NULL`.
`metrics`	Character specification for which statistics to display:
- `"both"` - Show effect estimates with CI and p-values [default]
- `"effect"` - Show only effect estimates with CI
- `"p"` - Show only p-values
Can also be a character vector: `c("effect", "p")` is equivalent to `"both"`.
`return_type`	Character string specifying what to return:
- `"table"` - Return formatted results table only [default]
- `"model"` - Return multivariable model object only
- `"both"` - Return list with both table and model
`keep_models`	Logical. If `TRUE`, stores univariable model objects in the output. Can consume significant memory for many predictors. Default is `FALSE`.
`exponentiate`	Logical. Whether to exponentiate coefficients. Default is `NULL`, which automatically exponentiates for logistic, Poisson, and Cox models, and displays raw coefficients for linear models.
`conf_method`	Character string controlling the confidence interval method. If `NULL` (default), uses `getOption("summata.conf_method", "profile")`.
- `"profile"` - Profile likelihood intervals for GLM and negative binomial models (via `MASS::confint.glm()`), exact t-distribution intervals for linear models. Falls back to Wald on profiling failure. Quasi-likelihood families always use Wald (no true likelihood).
- `"wald"` - Wald intervals (coefficient ± z × SE) for all model types. Faster but less accurate near boundary conditions or with small subgroups.
Cox and mixed-effects models use Wald intervals regardless of this setting. Set globally with `options(summata.conf_method = "wald")` to use Wald throughout a session. Note: when `method = "screen"` and `columns = "multi"`, the internal screening pass always uses Wald since only p-values are needed for variable selection.
`parallel`	Logical. If `TRUE` (default), fits univariable models in parallel using multiple CPU cores for improved performance.
`n_cores`	Integer specifying the number of CPU cores to use for parallel processing. Default is `NULL` (auto-detect: uses `detectCores() - 1`). Ignored when `parallel = FALSE`.
`number_format`	Character string or two-element character vector controlling thousand and decimal separators in formatted output. Named presets:
- `"us"` - Comma thousands, period decimal: `1,234.56` [default]
- `"eu"` - Period thousands, comma decimal: `1.234,56`
- `"space"` - Thin-space thousands, period decimal: `1 234.56` (SI/ISO 31-0)
- `"none"` - No thousands separator: `1234.56`
Or provide a custom two-element vector `c(big.mark, decimal.mark)`, e.g., `c("'", ".")` for Swiss-style: `1'234.56`. When `NULL` (default), uses `getOption("summata.number_format", "us")`. Set the global option once per session to avoid passing this argument repeatedly: `options(summata.number_format = "eu")`
`verbose`	Logical. If `TRUE`, displays model fitting warnings (e.g., singular fit, convergence issues). If `FALSE`, routine fitting messages are suppressed while unexpected warnings are preserved. Default is `NULL`, which uses `getOption("summata.verbose", FALSE)`, so routine messages are suppressed unless that option is set.
`...`	Additional arguments passed to model fitting functions (e.g., `weights`, `subset`, `na.action`).
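
The difference between the two `conf_method` settings can be illustrated outside the package. The following is a minimal sketch in base R on the built-in `mtcars` data, not summata's internal code: it computes Wald limits by hand as coefficient ± z × SE and compares them with profile likelihood limits from `confint()`. The model and variable choices are illustrative assumptions, and on R versions before 4.4 the profile method for GLMs relies on the MASS package being installed.

```r
# Sketch only: base R illustration of Wald vs. profile likelihood intervals.
# The model (am ~ wt on mtcars) is an arbitrary example, not package data.
fit <- glm(am ~ wt, data = mtcars, family = binomial())

conf_level <- 0.95
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- qnorm(1 - (1 - conf_level) / 2)

# Wald interval: coefficient +/- z * SE (symmetric on the log-odds scale)
wald_manual  <- cbind(lower = est - z * se, upper = est + z * se)
wald_builtin <- confint.default(fit, level = conf_level)

# Profile likelihood interval (may be asymmetric around the estimate)
profile_ci <- suppressMessages(confint(fit, level = conf_level))

# Exponentiate to the odds-ratio scale for display
exp(wald_manual)
exp(profile_ci)
```

The hand-computed Wald limits match `confint.default()`, while the profile limits generally differ, most visibly in small samples like this one.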

Details

Analysis Workflow:

The function implements a complete regression analysis pipeline:

1. Univariable screening: Fits separate models for each predictor (outcome ~ predictor). Each predictor is tested independently to understand crude associations.
2. Variable selection: Based on the method parameter:
- "screen": Automatically selects predictors with univariable p ≤ p_threshold
- "all": Includes all predictors (no selection)
- "custom": Uses predictors specified in multi_predictors
3. Multivariable modeling: Fits a single model with selected predictors (outcome ~ predictor1 + predictor2 + ...). Estimates are adjusted for all other variables in the model.
4. Output formatting: Combines results into publication-ready table with appropriate effect measures and formatting.
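
The steps above can be sketched by hand in base R. This is an illustration of the screening logic under simplifying assumptions, not the package's implementation: it uses the built-in `mtcars` data with `am` as a binary outcome, handles numeric predictors only (one coefficient each), and the object names are hypothetical.

```r
# Sketch only: manual univariable screening followed by a multivariable fit.
dat         <- mtcars
outcome     <- "am"
predictors  <- c("wt", "hp", "qsec", "drat")
p_threshold <- 0.05

# Step 1: one univariable logistic model per predictor, keep its p-value
uni_p <- vapply(predictors, function(x) {
  fit <- glm(reformulate(x, response = outcome), data = dat, family = binomial())
  coef(summary(fit))[x, "Pr(>|z|)"]
}, numeric(1))

# Step 2: "screen" selection keeps predictors with p <= p_threshold
selected <- names(uni_p)[uni_p <= p_threshold]

# Step 3: a single multivariable model with the selected predictors
multi_fit <- if (length(selected) > 0) {
  glm(reformulate(selected, response = outcome), data = dat, family = binomial())
} else {
  NULL  # nothing passed screening, so no multivariable model is fitted
}

# Step 4: minimal output, univariable p-values and adjusted odds ratios
round(uni_p, 3)
if (!is.null(multi_fit)) round(exp(coef(multi_fit)), 2)
```

Setting `selected <- predictors` corresponds to the "all" strategy, and assigning a hand-picked subset corresponds to "custom".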

Variable Selection Strategies:

"Screen" Method (method = "screen"):

Uses p-value threshold for automatic selection
Liberal thresholds (e.g., 0.20) cast a wide net to avoid missing important predictors
Stricter thresholds (e.g., 0.05) focus on strongly associated predictors
Helps reduce overfitting and multicollinearity
Common in exploratory analyses and when sample size is limited

"All" Method (method = "all"):

No variable selection - includes all predictors
Appropriate when all variables are theoretically important
Risk of overfitting with many predictors relative to sample size
Useful for confirmatory analyses with pre-specified models

"Custom" Method (method = "custom"):

Manual selection based on subject matter knowledge
Runs univariable analysis for all predictors (for comparison)
Includes only specified predictors in multivariable model
Ideal for theory-driven model building
Allows comparison of unadjusted vs adjusted effects for all variables

Interpreting Results:

When columns = "both" (default), tables show:

Univariable columns: Crude associations, unadjusted for other variables. Labeled as "OR/HR/RR/Coefficient (95% CI)" and "Uni p"
Multivariable columns: Adjusted associations, accounting for all other predictors in the model. Labeled as "aOR/aHR/aRR/Adj. Coefficient (95% CI)" and "Multi p" ("a" = adjusted)
Variables not meeting selection criteria show "-" in multivariable columns

Comparing univariable and multivariable results helps identify:

Confounding: Large changes in effect estimates
Independent effects: Similar univariable and multivariable estimates
Mediation: Attenuated effects in multivariable model
Suppression: Effects that emerge only after adjustment
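
One common way to quantify such a comparison is the percent change in a coefficient after adjustment. The sketch below is a base R illustration on the built-in `mtcars` data with arbitrarily chosen variables. It is not a summata function, and no particular cutoff for a "large" change is implied by the package.

```r
# Sketch only: crude vs. adjusted coefficient for one predictor (hp).
crude    <- lm(mpg ~ hp, data = mtcars)       # univariable model
adjusted <- lm(mpg ~ hp + wt, data = mtcars)  # adjusted for wt

b_crude <- unname(coef(crude)["hp"])
b_adj   <- unname(coef(adjusted)["hp"])

# Percent change in the estimate after adjustment
pct_change <- 100 * (b_adj - b_crude) / b_crude
round(c(crude = b_crude, adjusted = b_adj, pct_change = pct_change), 3)
```

A change close to zero suggests an independent effect, while a large change in either direction points to confounding, mediation, or suppression, which must be told apart using subject matter knowledge rather than the numbers alone.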

Sample Size Considerations:

Rule of thumb for multivariable models:

Logistic regression: ≥ 10 events per predictor variable
Cox regression: ≥ 10 events per predictor variable
Linear regression: ≥ 10-20 observations per predictor

Use screening methods to reduce predictor count when these ratios are not met.
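
A quick check of the events-per-variable ratio before fitting might look like the following. This is a minimal base R sketch on the built-in `mtcars` data, treating `am == 1` as the event. The variable names are illustrative, and the sketch counts variables as the rule above does, although a factor with k levels actually uses k - 1 parameters.

```r
# Sketch only: events per predictor variable for a binary outcome.
outcome_values       <- mtcars$am
candidate_predictors <- c("wt", "hp", "qsec", "drat")

n_events <- sum(outcome_values == 1)
epv      <- n_events / length(candidate_predictors)

# Rule of thumb from the text: at least 10 events per predictor
meets_rule <- epv >= 10
c(events = n_events, epv = epv, meets_rule = meets_rule)
```

When the ratio falls short, as it does here, reducing the candidate set (for example with `method = "screen"` or `method = "custom"`) is one option.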

Value

Depends on return_type parameter:

When return_type = "table" (default): A data.table with S3 class "fullfit_result" containing:

Variable: Character. Predictor name or custom label
Group: Character. Category level for factors, empty for continuous
n/n_group: Integer. Sample sizes (if show_n = TRUE). For variables included in the multivariable model, reflects the complete-case sample size from the fitted model (listwise deletion across all included predictors). For variables not selected into the multivariable model, reflects the per-variable sample size from the univariable analysis. This follows STROBE guideline item 12, which recommends reporting the number of participants included at each stage of analysis.
events/events_group: Integer. Event counts (if show_events = TRUE). Same complete-case convention as n: multivariable rows show events from the fitted model, univariable-only rows show per-variable counts.
OR/HR/RR/Coefficient (95% CI): Character. Unadjusted effect (if columns includes "uni" and metrics includes "effect")
Uni p: Character. Univariable p-value (if columns includes "uni" and metrics includes "p")
aOR/aHR/aRR/Adj. Coefficient (95% CI): Character. Adjusted effect (if columns includes "multi" and metrics includes "effect")
Multi p: Character. Multivariable p-value (if columns includes "multi" and metrics includes "p")

When return_type = "model": The fitted multivariable model object (glm, lm, coxph, etc.).

When return_type = "both": A list with two elements:

table: The formatted results data.table
model: The fitted multivariable model object

The table includes the following attributes:

outcome: Character. The outcome variable name
model_type: Character. The regression model type
method: Character. The variable selection method used
columns: Character. Which columns were displayed
model: The multivariable model object (if fitted)
uni_results: The complete univariable screening results
n_multi: Integer. Number of predictors in multivariable model
screened: Character vector. Names of predictors that passed univariable screening at the specified p-value threshold
significant: Character vector. Names of variables with p < 0.05 in the multivariable model (or univariable if multivariable was not fitted)

Examples

# Load example data
data(clintrial)
data(clintrial_labels)

# Example 1: Basic screening with p <= 0.05 threshold
result1 <- fullfit(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking",
                   "hypertension", "diabetes",
                   "treatment", "stage"),
    method = "screen",
    p_threshold = 0.05,
    labels = clintrial_labels
)
print(result1)
# Shows both univariable and multivariable results
# Only significant univariable predictors in multivariable model



# Example 2: Include all predictors (no selection)
result2 <- fullfit(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "treatment", "stage"),
    method = "all",
    labels = clintrial_labels
)
print(result2)

# Example 3: Custom variable selection
result3 <- fullfit(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "treatment", "stage"),
    method = "custom",
    multi_predictors = c("age", "treatment", "stage"),
    labels = clintrial_labels
)
print(result3)
# Univariable for all, multivariable for selected only

# Example 4: Cox regression with screening
library(survival)
cox_result <- fullfit(
    data = clintrial,
    outcome = "Surv(os_months, os_status)",
    predictors = c("age", "sex", "treatment", "stage"),
    model_type = "coxph",
    method = "screen",
    p_threshold = 0.10,
    labels = clintrial_labels
)
print(cox_result)

# Example 5: Linear regression without screening
linear_result <- fullfit(
    data = clintrial,
    outcome = "bmi",
    predictors = c("age", "sex", "smoking", "creatinine"),
    model_type = "lm",
    method = "all",
    labels = clintrial_labels
)
print(linear_result)

# Example 6: Poisson regression for count outcomes
poisson_result <- fullfit(
    data = clintrial,
    outcome = "fu_count",
    predictors = c("age", "stage", "treatment", "surgery"),
    model_type = "glm",
    family = "poisson",
    method = "all",
    labels = clintrial_labels
)
print(poisson_result)

# Example 7: Show only multivariable results
multi_only <- fullfit(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "treatment", "stage"),
    method = "all",
    columns = "multi",
    labels = clintrial_labels
)
print(multi_only)

# Example 8: Return both table and model object
both <- fullfit(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "treatment", "stage"),
    method = "all",
    return_type = "both"
)
print(both$table)
summary(both$model)

# Example 9: Keep univariable models for diagnostics
with_models <- fullfit(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi", "creatinine"),
    keep_models = TRUE
)
uni_results <- attr(with_models, "uni_results")
uni_models <- attr(uni_results, "models")
summary(uni_models[["age"]])

# Example 10: Linear mixed effects with site clustering
if (requireNamespace("lme4", quietly = TRUE)) {
    lmer_result <- fullfit(
        data = clintrial,
        outcome = "los_days",
        predictors = c("age", "treatment", "surgery", "stage"),
        random = "(1|site)",
        model_type = "lmer",
        method = "all",
        labels = clintrial_labels
    )
    print(lmer_result)
}

summata documentation built on May 7, 2026, 5:07 p.m.