uniscreen: Univariable Screening for Multiple Predictors
In summata: Publication-Ready Summary Tables and Forest Plots

uniscreen

R Documentation

Description

Performs univariable (unadjusted) regression analyses by fitting a separate model for each predictor against a single outcome. This function is designed for initial variable screening, hypothesis generation, and understanding crude associations before multivariable modeling. Returns publication-ready formatted results and flags the predictors whose p-values fall below a chosen threshold; all predictors remain in the output table.

Usage

uniscreen(
  data,
  outcome,
  predictors,
  model_type = "glm",
  family = "binomial",
  random = NULL,
  p_threshold = 0.05,
  conf_level = 0.95,
  reference_rows = TRUE,
  show_n = TRUE,
  show_events = TRUE,
  digits = 2,
  p_digits = 3,
  labels = NULL,
  keep_models = FALSE,
  exponentiate = NULL,
  conf_method = NULL,
  parallel = TRUE,
  n_cores = NULL,
  number_format = NULL,
  verbose = NULL,
  ...
)

Arguments

`data`	Data frame or data.table containing the analysis dataset. The function automatically converts data frames to data.tables for efficient processing.
`outcome`	Character string specifying the outcome variable name. For survival analysis, use `Surv()` syntax from the survival package (e.g., `"Surv(time, status)"` or `"Surv(os_months, os_status)"`).
`predictors`	Character vector of predictor variable names to screen. Each predictor is tested independently in its own univariable model. Can include continuous, categorical (factor), or binary variables.
`model_type`	Character string specifying the type of regression model to fit. Options include: `"glm"` - Generalized linear model (default). Supports multiple distributions via the `family` parameter including logistic, Poisson, Gamma, Gaussian, and quasi-likelihood models. `"lm"` - Linear regression for continuous outcomes with normally distributed errors. Equivalent to `glm` with `family = "gaussian"`. `"coxph"` - Cox proportional hazards model for time-to-event survival analysis. Requires `Surv()` outcome syntax. `"clogit"` - Conditional logistic regression for matched case-control studies or stratified analyses. `"negbin"` - Negative binomial regression for overdispersed count data (requires MASS package). Estimates an additional dispersion parameter compared to Poisson regression. `"glmer"` - Generalized linear mixed-effects model for hierarchical or clustered data with non-normal outcomes (requires lme4 package and `random` parameter). `"lmer"` - Linear mixed-effects model for hierarchical or clustered data with continuous outcomes (requires lme4 package and `random` parameter). `"coxme"` - Cox mixed-effects model for clustered survival data (requires coxme package and `random` parameter).
`family`	For GLM and GLMER models, specifies the error distribution and link function. Can be a character string, a family function, or a family object. Ignored for non-GLM/GLMER models. Binary/Binomial outcomes: `"binomial"` or `binomial()` - Logistic regression for binary outcomes (0/1, TRUE/FALSE). Returns odds ratios (OR). Default. `"quasibinomial"` or `quasibinomial()` - Logistic regression with overdispersion. Use when residual deviance >> residual df. `binomial(link = "probit")` - Probit regression (normal CDF link). `binomial(link = "cloglog")` - Complementary log-log link for asymmetric binary outcomes. Count outcomes: `"poisson"` or `poisson()` - Poisson regression for count data. Returns rate ratios (RR). Assumes mean = variance. `"quasipoisson"` or `quasipoisson()` - Poisson regression with overdispersion. Use when variance > mean. Continuous outcomes: `"gaussian"` or `gaussian()` - Normal/Gaussian distribution for continuous outcomes. Equivalent to linear regression. `gaussian(link = "log")` - Log-linear model for positive continuous outcomes. Returns multiplicative effects. `gaussian(link = "inverse")` - Inverse link for specific applications. Positive continuous outcomes: `"Gamma"` or `Gamma()` - Gamma distribution for positive, right-skewed continuous data (e.g., costs, lengths of stay). When passed as a string, resolves to log link for interpretable multiplicative effects. `Gamma(link = "inverse")` - Gamma with inverse (canonical) link. `Gamma(link = "identity")` - Gamma with identity link for additive effects on positive outcomes. `"inverse.gaussian"` or `inverse.gaussian()` - Inverse Gaussian for positive, highly right-skewed data. For negative binomial regression (overdispersed counts), use `model_type = "negbin"` instead of the `family` parameter. See `family` for additional details and options.
`random`	Character string specifying the random-effects formula for mixed-effects models (`glmer`, `lmer`, `coxme`). Use standard lme4/coxme syntax, e.g., `"(1|site)"` for random intercepts by site or `"(1|site/patient)"` for nested random effects. Required when `model_type` is a mixed-effects model type, unless the random-effects term is included directly in the `predictors` vector using the same syntax (e.g., `predictors = c("age", "sex", "(1|site)")`); such terms are not iterated over as predictors. Default is `NULL`.
`p_threshold`	Numeric value between 0 and 1 specifying the p-value threshold used to count significant predictors in the printed summary. All predictors are always included in the output table. Default is 0.05.
`conf_level`	Numeric confidence level for confidence intervals. Must be between 0 and 1. Default is 0.95 (95% confidence intervals).
`reference_rows`	Logical. If `TRUE`, adds rows for reference categories of factor variables with baseline values (OR/HR/RR = 1, coefficient = 0). Makes tables complete and easier to interpret. Default is `TRUE`.
`show_n`	Logical. If `TRUE`, includes the sample size column in the output table. Default is `TRUE`.
`show_events`	Logical. If `TRUE`, includes the events column in the output table (relevant for survival and logistic regression). Default is `TRUE`.
`digits`	Integer specifying the number of decimal places for effect estimates (OR, HR, RR, coefficients). Default is 2.
`p_digits`	Integer specifying the number of decimal places for p-values. Values smaller than `10^(-p_digits)` are displayed as `"< 0.001"` (for `p_digits = 3`), `"< 0.0001"` (for `p_digits = 4`), etc. Default is 3.
`labels`	Named character vector or list providing custom display labels for variables. Names should match predictor names, values are the display labels. Predictors not in `labels` use their original names. Default is `NULL`.
`keep_models`	Logical. If `TRUE`, stores all fitted model objects in the output as an attribute. This allows access to models for diagnostics, predictions, or further analysis, but can consume significant memory for large datasets or many predictors. Models are accessible via `attr(result, "models")`. Default is `FALSE`.
`exponentiate`	Logical. Whether to exponentiate coefficients (display OR/HR/RR instead of log odds/log hazards). Default is `NULL`, which automatically exponentiates for logistic, Poisson, and Cox models, and displays raw coefficients for linear models and other GLM families. Set to `TRUE` to force exponentiation or `FALSE` to force coefficients.
`conf_method`	Character string controlling the confidence interval method. If `NULL` (default), uses `getOption("summata.conf_method", "profile")`. `"profile"` - Profile likelihood intervals for GLM and negative binomial models (via `MASS::confint.glm()`), exact t-distribution intervals for linear models. Falls back to Wald on profiling failure. Quasi-likelihood families always use Wald (no true likelihood). `"wald"` - Wald intervals (coefficient ± z × SE) for all model types. Faster but less accurate near boundary conditions or with small subgroups. Cox and mixed-effects models use Wald intervals regardless of this setting. Set globally with `options(summata.conf_method = "wald")` to use Wald throughout a session.
`parallel`	Logical. If `TRUE` (default), fits models in parallel using multiple CPU cores for improved performance with many predictors. On Unix/Mac systems, uses fork-based parallelism via `mclapply`; on Windows, uses socket clusters via `parLapply`. Set to `FALSE` for sequential processing.
`n_cores`	Integer specifying the number of CPU cores to use for parallel processing. Default is `NULL`, which automatically detects available cores and uses `detectCores() - 1`. During R CMD check, the number of cores is automatically limited to 2 per CRAN policy. Ignored when `parallel = FALSE`.
`number_format`	Character string or two-element character vector controlling thousand and decimal separators in formatted output. Named presets: `"us"` - Comma thousands, period decimal: `1,234.56` (default); `"eu"` - Period thousands, comma decimal: `1.234,56`; `"space"` - Thin-space thousands, period decimal: `1 234.56` (SI/ISO 31-0); `"none"` - No thousands separator: `1234.56`. Or provide a custom two-element vector `c(big.mark, decimal.mark)`, e.g., `c("'", ".")` for Swiss-style: `1'234.56`. When `NULL` (default), uses `getOption("summata.number_format", "us")`. Set the global option once per session to avoid passing this argument repeatedly: `options(summata.number_format = "eu")`.
`verbose`	Logical. If `TRUE`, displays model fitting warnings (e.g., singular fit, convergence issues). If `FALSE` (default), routine fitting messages are suppressed while unexpected warnings are preserved. When `NULL`, uses `getOption("summata.verbose", FALSE)`.
`...`	Additional arguments passed to the underlying model fitting functions (`glm`, `lm`, `coxph`, etc.). Common options include `weights`, `subset`, `na.action`, and model-specific control parameters.
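
The difference between the two `conf_method` settings described above can be seen with base R alone. The following is a minimal sketch on the built-in `mtcars` data, not summata's internal code: `confint.default()` gives Wald intervals, while `confint()` on a `glm` object gives profile likelihood intervals (provided by stats in recent R versions and through the MASS package in older ones).

```r
# Illustrative logistic model on built-in data (not the package's example data)
fit <- glm(am ~ mpg, data = mtcars, family = binomial())

# Wald interval: estimate +/- z * SE, symmetric on the log-odds scale
wald_ci <- confint.default(fit, level = 0.95)

# Profile likelihood interval: generally asymmetric around the estimate
profile_ci <- suppressMessages(confint(fit, level = 0.95))

print(wald_ci)
print(profile_ci)
```

With small samples such as this one, the two intervals differ noticeably, which is the trade-off the `"wald"` description refers to.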

Details

Analysis Approach:

The function implements a comprehensive univariable screening workflow:

For each predictor in predictors, fits a separate model: outcome ~ predictor
Extracts coefficients, confidence intervals, and p-values from each model
Combines results into a single table for easy comparison
Formats output for publication with appropriate effect measures

Each predictor is tested independently - these are crude (unadjusted) associations that do not account for confounding or interaction effects.
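
The workflow above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `screen_sketch()` is a hypothetical helper, it handles only continuous predictors in a logistic model, it uses Wald intervals, and it runs on the built-in `mtcars` data.

```r
# Hypothetical helper: one logistic model per predictor, results row-bound
screen_sketch <- function(data, outcome, predictors, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  rows <- lapply(predictors, function(p) {
    # Fit outcome ~ predictor for this predictor only
    fit <- glm(reformulate(p, response = outcome), data = data,
               family = binomial())
    est <- coef(summary(fit))[2, ]  # slope row (continuous predictor assumed)
    data.frame(
      predictor = p,
      estimate  = unname(est["Estimate"]),
      lower     = unname(est["Estimate"] - z * est["Std. Error"]),
      upper     = unname(est["Estimate"] + z * est["Std. Error"]),
      p_value   = unname(est["Pr(>|z|)"])
    )
  })
  # Combine the per-predictor rows into one table
  do.call(rbind, rows)
}

crude <- screen_sketch(mtcars, outcome = "am",
                       predictors = c("mpg", "wt", "hp"))
print(crude)
```

Each row comes from its own model, so none of the estimates is adjusted for the other predictors.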

When to Use Univariable Screening:

Initial variable selection: Identify predictors associated with the outcome before building multivariable models
Hypothesis generation: Explore potential associations in exploratory analyses
Understanding crude associations: Report unadjusted effects alongside adjusted estimates
Variable reduction: Use p-value thresholds (e.g., p < 0.20) to identify candidates for multivariable modeling
Checking multicollinearity: Compare univariable and multivariable effects to identify potential collinearity
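
As a small illustration of the last point, the sketch below compares a crude and an adjusted coefficient with base R's `lm()` on the built-in `mtcars` data. The variables are chosen only because `disp` and `wt` are strongly correlated; nothing here is specific to summata.

```r
# Crude (univariable) effect of displacement on fuel economy
crude <- coef(lm(mpg ~ disp, data = mtcars))["disp"]

# Effect of displacement after adjusting for weight
adjusted <- coef(lm(mpg ~ disp + wt, data = mtcars))["disp"]

# Correlation between the two predictors
r <- cor(mtcars$disp, mtcars$wt)

cat("crude:", crude, " adjusted:", adjusted, " cor(disp, wt):", r, "\n")
```

A large shift between the crude and adjusted estimates, together with a high correlation between predictors, is a signal to examine collinearity or confounding before interpreting either coefficient.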

Threshold for p-values:

The p_threshold parameter controls the significance threshold used in the printed summary to count how many predictors are significant. All predictors are always included in the output table regardless of this setting.
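
The distinction between counting and filtering can be shown with a plain vector. The p-values below are hypothetical numbers chosen for illustration, not output from any dataset.

```r
# Hypothetical p-values for three predictors (illustrative only)
p_values <- c(age = 0.003, bmi = 0.12, sex = 0.47)
p_threshold <- 0.20

# Predictors below the threshold are flagged, not removed
flagged <- names(p_values)[p_values < p_threshold]
n_flagged <- length(flagged)

cat(n_flagged, "of", length(p_values), "predictors below", p_threshold, "\n")
```

All three predictors would still appear in a results table; only the count and the flagged names depend on the threshold.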

Effect Measures by Model Type:

Logistic regression (model_type = "glm", family = "binomial"): Odds ratios (OR)
Cox regression (model_type = "coxph"): Hazard ratios (HR)
Poisson regression (model_type = "glm", family = "poisson"): Rate/risk ratios (RR)
Negative binomial (model_type = "negbin"): Rate ratios (RR)
Linear regression (model_type = "lm" or GLM with identity link): Raw coefficient estimates
Gamma regression (model_type = "glm", family = "Gamma"): Multiplicative effects (with default log link)
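
For the ratio measures listed above, the reported value is the exponentiated model coefficient. A minimal base-R sketch for the logistic case, using the built-in `mtcars` data and Wald intervals for simplicity:

```r
fit <- glm(am ~ mpg, data = mtcars, family = binomial())

# Coefficient on the log-odds scale
log_odds <- coef(fit)["mpg"]

# Odds ratio: exponentiate the coefficient
odds_ratio <- exp(log_odds)

# Confidence limits are exponentiated the same way
wald_ci_or <- exp(confint.default(fit)["mpg", ])

print(odds_ratio)
print(wald_ci_or)
```

The same transformation applied to a Cox or Poisson coefficient yields a hazard ratio or rate ratio, respectively.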

Memory Considerations:

When keep_models = FALSE (default), fitted models are discarded after extracting results to conserve memory. Set keep_models = TRUE only when the following are needed:

Model diagnostic plots
Predictions from individual models
Additional model statistics not extracted by default
Further analysis of specific models
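
To see why discarding models saves memory, compare the size of a fitted model object with the size of the coefficient table extracted from it. This sketch uses base R and the built-in `mtcars` data; actual sizes depend on the dataset and model type.

```r
fit <- glm(am ~ mpg, data = mtcars, family = binomial())

# The coefficient table holds only estimates, SEs, test statistics, p-values
coef_table <- coef(summary(fit))

# The model object also stores the data, model frame, residuals, and more
print(object.size(fit), units = "Kb")
print(object.size(coef_table), units = "Kb")
```

The gap grows with the number of rows in the data and is multiplied by the number of predictors screened.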

Value

A data.table with S3 class "uniscreen_result" containing formatted univariable screening results. The table structure includes:

Variable: Character. Predictor name or custom label (from labels)
Group: Character. For factor variables: category level. For continuous variables: typically empty or descriptive statistic label
n: Integer. Sample size used in the model (if show_n = TRUE)
n_group: Integer. Sample size for this specific factor level (factor variables only)
events: Integer. Total number of events in the model for survival or logistic regression (if show_events = TRUE)
events_group: Integer. Number of events for this specific factor level (factor variables only)
OR/HR/RR/Coefficient (95% CI): Character. Formatted effect estimate with confidence interval. Column name depends on model type: "OR (95% CI)" for logistic, "HR (95% CI)" for survival, "RR (95% CI)" for counts, "Coefficient (95% CI)" for linear models
p-value: Character. Formatted p-value from the Wald test
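
The p-value column is character because small values are shown as a bound rather than a number, following the rule described for `p_digits`. A minimal sketch of that rule, where `format_p()` is a hypothetical helper and not a summata function:

```r
# Hypothetical helper mirroring the documented p_digits rule
format_p <- function(p, p_digits = 3) {
  cutoff <- 10^(-p_digits)
  ifelse(
    p < cutoff,
    paste0("< ", format(cutoff, scientific = FALSE)),  # e.g. "< 0.001"
    formatC(p, format = "f", digits = p_digits)        # fixed decimals
  )
}

formatted <- format_p(c(0.04567, 0.00004))
print(formatted)
```

For numeric work, the unformatted values in the `raw_data` attribute are the appropriate source.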

The returned object includes the following attributes accessible via attr():

raw_data: data.table. Unformatted numeric results with separate columns for coefficients, standard errors, confidence interval bounds, etc. Suitable for further statistical analysis or custom formatting
models: List (if keep_models = TRUE). Named list of fitted model objects, with predictor names as list names. Access specific models via attr(result, "models")[["predictor_name"]]
outcome: Character. The outcome variable name used
model_type: Character. The regression model type used
model_scope: Character. Always "Univariable" for screening results
screening_type: Character. Always "univariable" to identify the analysis type
p_threshold: Numeric. The p-value threshold used for significance
significant: Character vector. Names of predictors with p-value below the screening threshold, suitable for passing directly to downstream modeling functions

Examples

# Load example data
data(clintrial)
data(clintrial_labels)

# Example 1: Basic logistic regression screening
screen1 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension"),
    model_type = "glm",
    family = "binomial",
    parallel = FALSE
)
print(screen1)



# Example 2: With custom variable labels
screen2 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "treatment"),
    labels = clintrial_labels,
    parallel = FALSE
)
print(screen2)

# Example 3: Flag predictors at a liberal p-value threshold
# p < 0.20 is a common screening cutoff; all predictors remain in the table
screen3 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension", 
                  "diabetes", "stage"),
    p_threshold = 0.20,
    labels = clintrial_labels,
    parallel = FALSE
)
print(screen3)
# The printed summary counts the predictors with p < 0.20

# Example 4: Cox proportional hazards screening
library(survival)
cox_screen <- uniscreen(
    data = clintrial,
    outcome = "Surv(os_months, os_status)",
    predictors = c("age", "sex", "treatment", "stage", "grade"),
    model_type = "coxph",
    labels = clintrial_labels,
    parallel = FALSE
)
print(cox_screen)
# Returns hazard ratios (HR) instead of odds ratios

# Example 5: Keep models for diagnostics
screen5 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi", "creatinine"),
    keep_models = TRUE,
    parallel = FALSE
)

# Access stored models
models <- attr(screen5, "models")
summary(models[["age"]])
plot(models[["age"]])  # Diagnostic plots

# Example 6: Linear regression screening
linear_screen <- uniscreen(
    data = clintrial,
    outcome = "bmi",
    predictors = c("age", "sex", "smoking", "creatinine", "hemoglobin"),
    model_type = "lm",
    labels = clintrial_labels,
    parallel = FALSE
)
print(linear_screen)

# Example 7: Poisson regression for equidispersed count outcomes
# fu_count has variance ~= mean, appropriate for standard Poisson
poisson_screen <- uniscreen(
    data = clintrial,
    outcome = "fu_count",
    predictors = c("age", "stage", "treatment", "surgery"),
    model_type = "glm",
    family = "poisson",
    labels = clintrial_labels,
    parallel = FALSE
)
print(poisson_screen)
# Returns rate ratios (RR)

# Example 8: Negative binomial for overdispersed counts
# ae_count has variance > mean (overdispersed), use negbin
if (requireNamespace("MASS", quietly = TRUE)) {
    nb_screen <- uniscreen(
        data = clintrial,
        outcome = "ae_count",
        predictors = c("age", "treatment", "diabetes", "surgery"),
        model_type = "negbin",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(nb_screen)
}

# Example 9: Gamma regression for positive continuous outcomes (e.g., length of stay)
gamma_screen <- uniscreen(
    data = clintrial,
    outcome = "los_days",
    predictors = c("age", "sex", "treatment", "surgery"),
    model_type = "glm",
    family = Gamma(link = "log"),
    labels = clintrial_labels,
    parallel = FALSE
)
print(gamma_screen)

# Example 10: Hide reference rows for factor variables
screen10 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("treatment", "stage", "grade"),
    reference_rows = FALSE,
    parallel = FALSE
)
print(screen10)
# Reference categories not shown

# Example 11: Customize decimal places
screen11 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi", "creatinine"),
    digits = 3,      # 3 decimal places for OR
    p_digits = 4     # 4 decimal places for p-values
)
print(screen11)

# Example 12: Hide sample size and event columns
screen12 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi"),
    show_n = FALSE,
    show_events = FALSE,
    parallel = FALSE
)
print(screen12)

# Example 13: Access raw numeric data
screen13 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "treatment"),
    parallel = FALSE
)
raw_data <- attr(screen13, "raw_data")
print(raw_data)
# Contains unformatted coefficients, SEs, CIs, etc.

# Example 14: Force coefficient display instead of OR
screen14 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi"),
    model_type = "glm",
    family = "binomial",
    parallel = FALSE,
    exponentiate = FALSE  # Show log odds instead of OR
)
print(screen14)

# Example 15: Screening with weights
screen15 <- uniscreen(
    data = clintrial,
    outcome = "Surv(os_months, os_status)",
    predictors = c("age", "sex", "bmi"),
    model_type = "coxph",
    weights = runif(nrow(clintrial), min = 0.5, max = 2),  # Random numbers for example
    parallel = FALSE
)

# Example 16: Strict significance threshold (p < 0.05)
sig_only <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension", 
                  "diabetes", "ecog", "treatment", "stage", "grade"),
    p_threshold = 0.05,
    labels = clintrial_labels,
    parallel = FALSE
)

# Check how many predictors fall below the threshold
# (the table itself still contains every predictor)
n_significant <- length(attr(sig_only, "significant"))
cat("Significant predictors:", n_significant, "\n")

# Example 17: Complete workflow - screen then use in multivariable
# Step 1: Screen with liberal threshold
candidates <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension",
                  "diabetes", "treatment", "stage", "grade"),
    p_threshold = 0.20,
    parallel = FALSE
)

# Step 2: Extract significant predictor names
sig_predictors <- attr(candidates, "significant")

# Step 3: Fit multivariable model with selected predictors
multi_model <- fit(
    data = clintrial,
    outcome = "os_status",
    predictors = sig_predictors,
    labels = clintrial_labels
)
print(multi_model)

# Example 18: Mixed-effects logistic regression (glmer)
# Accounts for clustering by site
if (requireNamespace("lme4", quietly = TRUE)) {
    glmer_screen <- uniscreen(
        data = clintrial,
        outcome = "os_status",
        predictors = c("age", "sex", "treatment", "stage"),
        model_type = "glmer",
        random = "(1|site)",
        family = "binomial",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(glmer_screen)
}

# Example 19: Mixed-effects linear regression (lmer)
if (requireNamespace("lme4", quietly = TRUE)) {
    lmer_screen <- uniscreen(
        data = clintrial,
        outcome = "biomarker_x",
        predictors = c("age", "sex", "treatment", "smoking"),
        model_type = "lmer",
        random = "(1|site)",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(lmer_screen)
}

# Example 20: Mixed-effects Cox model (coxme)
if (requireNamespace("coxme", quietly = TRUE)) {
    coxme_screen <- uniscreen(
        data = clintrial,
        outcome = "Surv(os_months, os_status)",
        predictors = c("age", "sex", "treatment", "stage"),
        model_type = "coxme",
        random = "(1|site)",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(coxme_screen)
}

# Example 21: Mixed-effects with nested random effects
# Patients nested within sites
if (requireNamespace("lme4", quietly = TRUE)) {
    nested_screen <- uniscreen(
        data = clintrial,
        outcome = "os_status",
        predictors = c("age", "treatment"),
        model_type = "glmer",
        random = "(1|site/patient_id)",
        family = "binomial",
        parallel = FALSE
    )
}

# Example 22: Quasipoisson for overdispersed count data
# Alternative to negative binomial when MASS not available
quasi_screen <- uniscreen(
    data = clintrial,
    outcome = "ae_count",
    predictors = c("age", "treatment", "diabetes", "surgery", "stage"),
    model_type = "glm",
    family = "quasipoisson",
    labels = clintrial_labels,
    parallel = FALSE
)
print(quasi_screen)
# Adjusts standard errors for overdispersion

# Example 23: Quasibinomial for overdispersed binary data
quasibin_screen <- uniscreen(
    data = clintrial,
    outcome = "any_complication",
    predictors = c("age", "bmi", "diabetes", "surgery", "ecog"),
    model_type = "glm",
    family = "quasibinomial",
    labels = clintrial_labels,
    parallel = FALSE
)
print(quasibin_screen)

# Example 24: Inverse Gaussian for highly skewed positive data
invgauss_screen <- uniscreen(
    data = clintrial,
    outcome = "recovery_days",
    predictors = c("age", "surgery", "pain_score", "los_days"),
    model_type = "glm",
    family = inverse.gaussian(link = "log"),
    labels = clintrial_labels,
    parallel = FALSE
)
print(invgauss_screen)

summata documentation built on May 7, 2026, 5:07 p.m.