uniscreen: Univariable Screening for Multiple Predictors

View source: R/uniscreen.R

uniscreen R Documentation


Description

Performs univariable (unadjusted) regression analyses by fitting a separate model for each predictor against a single outcome. This function is designed for initial variable screening, hypothesis generation, and understanding crude associations before multivariable modeling. Returns publication-ready formatted results and flags predictors whose p-value falls below a chosen threshold.

Usage

uniscreen(
  data,
  outcome,
  predictors,
  model_type = "glm",
  family = "binomial",
  random = NULL,
  p_threshold = 0.05,
  conf_level = 0.95,
  reference_rows = TRUE,
  show_n = TRUE,
  show_events = TRUE,
  digits = 2,
  p_digits = 3,
  labels = NULL,
  keep_models = FALSE,
  exponentiate = NULL,
  conf_method = NULL,
  parallel = TRUE,
  n_cores = NULL,
  number_format = NULL,
  verbose = NULL,
  ...
)

Arguments

data

Data frame or data.table containing the analysis dataset. The function automatically converts data frames to data.tables for efficient processing.

outcome

Character string specifying the outcome variable name. For survival analysis, use Surv() syntax from the survival package (e.g., "Surv(time, status)" or "Surv(os_months, os_status)").

predictors

Character vector of predictor variable names to screen. Each predictor is tested independently in its own univariable model. Can include continuous, categorical (factor), or binary variables.

model_type

Character string specifying the type of regression model to fit. Options include:

  • "glm" - Generalized linear model (default). Supports multiple distributions via the family parameter including logistic, Poisson, Gamma, Gaussian, and quasi-likelihood models.

  • "lm" - Linear regression for continuous outcomes with normally distributed errors. Equivalent to glm with family = "gaussian".

  • "coxph" - Cox proportional hazards model for time-to-event survival analysis. Requires Surv() outcome syntax.

  • "clogit" - Conditional logistic regression for matched case-control studies or stratified analyses.

  • "negbin" - Negative binomial regression for overdispersed count data (requires MASS package). Estimates an additional dispersion parameter compared to Poisson regression.

  • "glmer" - Generalized linear mixed-effects model for hierarchical or clustered data with non-normal outcomes (requires lme4 package and random parameter).

  • "lmer" - Linear mixed-effects model for hierarchical or clustered data with continuous outcomes (requires lme4 package and random parameter).

  • "coxme" - Cox mixed-effects model for clustered survival data (requires coxme package and random parameter).

family

For GLM and GLMER models, specifies the error distribution and link function. Can be a character string, a family function, or a family object. Ignored for non-GLM/GLMER models.

Binary/Binomial outcomes:

  • "binomial" or binomial() - Logistic regression for binary outcomes (0/1, TRUE/FALSE). Returns odds ratios (OR). Default.

  • "quasibinomial" or quasibinomial() - Logistic regression with overdispersion. Use when residual deviance >> residual df.

  • binomial(link = "probit") - Probit regression (normal CDF link).

  • binomial(link = "cloglog") - Complementary log-log link for asymmetric binary outcomes.

Count outcomes:

  • "poisson" or poisson() - Poisson regression for count data. Returns rate ratios (RR). Assumes mean = variance.

  • "quasipoisson" or quasipoisson() - Poisson regression with overdispersion. Use when variance > mean.

Continuous outcomes:

  • "gaussian" or gaussian() - Normal/Gaussian distribution for continuous outcomes. Equivalent to linear regression.

  • gaussian(link = "log") - Log-linear model for positive continuous outcomes. Returns multiplicative effects.

  • gaussian(link = "inverse") - Inverse link for specific applications.

Positive continuous outcomes:

  • "Gamma" or Gamma() - Gamma distribution for positive, right-skewed continuous data (e.g., costs, lengths of stay). When passed as a string, resolves to log link for interpretable multiplicative effects.

  • Gamma(link = "inverse") - Gamma with inverse (canonical) link.

  • Gamma(link = "identity") - Gamma with identity link for additive effects on positive outcomes.

  • "inverse.gaussian" or inverse.gaussian() - Inverse Gaussian for positive, highly right-skewed data.

For negative binomial regression (overdispersed counts), use model_type = "negbin" instead of the family parameter.

See family in the stats package for additional details and options.
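The overdispersion guidance in the quasi-likelihood entries above can be checked by hand before choosing a family. The following is a minimal base-R sketch, not summata code: it uses the warpbreaks dataset that ships with R (an assumption made purely for illustration) and estimates the dispersion as the Pearson chi-square divided by the residual degrees of freedom.

```r
# Base-R sketch (not summata internals): check a Poisson fit for overdispersion.
# warpbreaks is a dataset shipped with R, used here only for illustration.
fit_pois <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Pearson chi-square divided by residual degrees of freedom.
# Values well above 1 suggest overdispersion.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# The quasi-Poisson fit keeps the same coefficients and scales the
# standard errors by sqrt(dispersion).
fit_quasi <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)

print(dispersion)
print(summary(fit_quasi)$dispersion)
```

A dispersion estimate well above 1 points toward family = "quasipoisson" or model_type = "negbin" rather than plain Poisson.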

random

Character string specifying the random-effects formula for mixed-effects models (glmer, lmer, coxme). Use standard lme4/coxme syntax, e.g., "(1|site)" for random intercepts by site or "(1|site/patient)" for nested random effects. Required for mixed-effects model types unless a random-effects term is supplied in the predictors vector instead, using the same syntax (e.g., predictors = c("age", "sex", "(1|site)")); such terms are not iterated over as predictors. Default is NULL.

p_threshold

Numeric value between 0 and 1 specifying the p-value threshold used to count significant predictors in the printed summary. All predictors are always included in the output table. Default is 0.05.

conf_level

Numeric confidence level for confidence intervals. Must be between 0 and 1. Default is 0.95 (95% confidence intervals).

reference_rows

Logical. If TRUE, adds rows for reference categories of factor variables with baseline values (OR/HR/RR = 1, coefficient = 0). Makes tables complete and easier to interpret. Default is TRUE.

show_n

Logical. If TRUE, includes the sample size column in the output table. Default is TRUE.

show_events

Logical. If TRUE, includes the events column in the output table (relevant for survival and logistic regression). Default is TRUE.

digits

Integer specifying the number of decimal places for effect estimates (OR, HR, RR, coefficients). Default is 2.

p_digits

Integer specifying the number of decimal places for p-values. Values smaller than 10^(-p_digits) are displayed as "< 0.001" (for p_digits = 3), "< 0.0001" (for p_digits = 4), etc. Default is 3.
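The thresholding rule described for p_digits can be sketched in a few lines of base R. The helper name format_p is hypothetical and is not part of summata; the package's own formatting may differ in detail.

```r
# Hypothetical helper (not a summata function) illustrating the p_digits rule:
# values below 10^(-p_digits) are shown as "< threshold", others are rounded.
format_p <- function(p, p_digits = 3) {
    threshold <- 10^(-p_digits)
    ifelse(
        p < threshold,
        paste0("< ", format(threshold, scientific = FALSE)),
        formatC(p, digits = p_digits, format = "f")
    )
}

format_p(c(0.0004, 0.0312, 0.5))   # "< 0.001" "0.031" "0.500"
format_p(0.00004, p_digits = 4)    # "< 0.0001"
```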

labels

Named character vector or list providing custom display labels for variables. Names should match predictor names, values are the display labels. Predictors not in labels use their original names. Default is NULL.

keep_models

Logical. If TRUE, stores all fitted model objects in the output as an attribute. This allows access to models for diagnostics, predictions, or further analysis, but can consume significant memory for large datasets or many predictors. Models are accessible via attr(result, "models"). Default is FALSE.

exponentiate

Logical. Whether to exponentiate coefficients (display OR/HR/RR instead of log odds/log hazards). Default is NULL, which automatically exponentiates for logistic, Poisson, and Cox models, and displays raw coefficients for linear models and other GLM families. Set to TRUE to force exponentiation or FALSE to force coefficients.
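As an illustration of what exponentiation does, the base-R sketch below fits one logistic model on the built-in mtcars data (chosen only for illustration) and converts the log-odds coefficient and its Wald interval to the odds-ratio scale. It does not call summata.

```r
# Base-R sketch: the same estimate on the log-odds scale and the OR scale.
fit <- glm(am ~ wt, family = binomial, data = mtcars)

log_odds <- coef(fit)["wt"]                # what exponentiate = FALSE reports
ci_log   <- confint.default(fit)["wt", ]   # Wald interval on the log-odds scale

odds_ratio <- exp(log_odds)                # what exponentiate = TRUE reports
ci_or      <- exp(ci_log)                  # interval bounds are exponentiated too

print(c(log_odds = unname(log_odds), OR = unname(odds_ratio)))
print(ci_or)
```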

conf_method

Character string controlling the confidence interval method. If NULL (default), uses getOption("summata.conf_method", "profile").

  • "profile" - Profile likelihood intervals for GLM and negative binomial models (via MASS::confint.glm()), exact t-distribution intervals for linear models. Falls back to Wald on profiling failure. Quasi-likelihood families always use Wald (no true likelihood).

  • "wald" - Wald intervals (coefficient ± z × SE) for all model types. Faster but less accurate near boundary conditions or with small subgroups.

Cox and mixed-effects models use Wald intervals regardless of this setting. Set globally with options(summata.conf_method = "wald") to use Wald throughout a session.
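The difference between the two interval methods can be seen with base R alone. This sketch is not summata's implementation: it computes a Wald interval by hand for a logistic model on mtcars (an illustrative dataset), compares it with confint.default(), and obtains the profile likelihood interval from confint().

```r
# Base-R sketch comparing Wald and profile likelihood intervals for a GLM.
fit <- glm(am ~ mpg, family = binomial, data = mtcars)

# Wald: estimate +/- z * SE on the coefficient scale
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- qnorm(1 - (1 - 0.95) / 2)
wald_manual  <- cbind(lower = est - z * se, upper = est + z * se)
wald_builtin <- confint.default(fit, level = 0.95)

# Profile likelihood: confint() on a glm profiles the likelihood
# (the method lives in MASS in older R versions and in stats in newer ones)
profile_ci <- suppressMessages(confint(fit, level = 0.95))

print(wald_builtin)
print(profile_ci)
```

The two intervals typically agree closely in large samples and diverge when subgroups are small or estimates are near a boundary.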

parallel

Logical. If TRUE (default), fits models in parallel using multiple CPU cores for improved performance with many predictors. On Unix/Mac systems, uses fork-based parallelism via mclapply; on Windows, uses socket clusters via parLapply. Set to FALSE for sequential processing.

n_cores

Integer specifying the number of CPU cores to use for parallel processing. Default is NULL, which automatically detects available cores and uses detectCores() - 1. During R CMD check, the number of cores is automatically limited to 2 per CRAN policy. Ignored when parallel = FALSE.
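One way to reproduce a "detected cores minus one" default, with a cap of 2 under R CMD check, is sketched below. How summata detects a check environment is not documented here; the use of the _R_CHECK_LIMIT_CORES_ environment variable is an assumption based on a common idiom.

```r
# Sketch of a "detectCores() - 1" default with a floor of one core.
detected <- parallel::detectCores()   # may return NA on some platforms
n_cores  <- if (is.na(detected)) 1L else max(1L, detected - 1L)

# Assumption: R CMD check --as-cran sets _R_CHECK_LIMIT_CORES_, so cap at 2
# when the variable is set to anything other than "false".
limit_flag <- Sys.getenv("_R_CHECK_LIMIT_CORES_", unset = "")
if (nzchar(limit_flag) && tolower(limit_flag) != "false") {
    n_cores <- min(n_cores, 2L)
}

print(n_cores)
```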

number_format

Character string or two-element character vector controlling thousand and decimal separators in formatted output. Named presets:

  • "us" - Comma thousands, period decimal: 1,234.56 [default]

  • "eu" - Period thousands, comma decimal: 1.234,56

  • "space" - Thin-space thousands, period decimal: 1 234.56 (SI/ISO 31-0)

  • "none" - No thousands separator: 1234.56

Or provide a custom two-element vector c(big.mark, decimal.mark), e.g., c("'", ".") for Swiss-style: 1'234.56.

When NULL (default), uses getOption("summata.number_format", "us"). Set the global option once per session to avoid passing this argument repeatedly:

    options(summata.number_format = "eu")
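The separator pairs behind the presets can be tried out with base R's formatC(). This is an illustrative sketch only: the presets list below is a hypothetical stand-in, not summata's internal lookup, and the "space" preset is omitted because its exact thin-space character is not specified here.

```r
# Illustrative mapping from preset names to c(big.mark, decimal.mark).
x <- 1234.56
presets <- list(
    us    = c(",", "."),
    eu    = c(".", ","),
    swiss = c("'", "."),   # the custom two-element form
    none  = c("", ".")
)

formatted <- vapply(presets, function(marks) {
    formatC(x, format = "f", digits = 2,
            big.mark = marks[1], decimal.mark = marks[2])
}, character(1))

print(formatted)
# us: "1,234.56"  eu: "1.234,56"  swiss: "1'234.56"  none: "1234.56"
```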
  
verbose

Logical. If TRUE, displays model fitting warnings (e.g., singular fit, convergence issues). If FALSE, routine fitting messages are suppressed while unexpected warnings are preserved. Default is NULL, which uses getOption("summata.verbose", FALSE).

...

Additional arguments passed to the underlying model fitting functions (glm, lm, coxph, etc.). Common options include weights, subset, na.action, and model-specific control parameters.

Details

Analysis Approach:

The function implements a comprehensive univariable screening workflow:

  1. For each predictor in predictors, fits a separate model: outcome ~ predictor

  2. Extracts coefficients, confidence intervals, and p-values from each model

  3. Combines results into a single table for easy comparison

  4. Formats output for publication with appropriate effect measures

Each predictor is tested independently - these are crude (unadjusted) associations that do not account for confounding or interaction effects.
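The four steps above can be sketched in base R. This is a simplified illustration, not the package's implementation: it uses mtcars with a binary outcome, Wald intervals, and a hypothetical helper screen_one(), and it omits formatting, labels, and reference rows.

```r
# Simplified base-R sketch of univariable screening (not summata code).
outcome    <- "am"
predictors <- c("mpg", "wt", "hp")

# Hypothetical helper: fit outcome ~ predictor and return one tidy row per term.
screen_one <- function(predictor, data, outcome, conf_level = 0.95) {
    fit   <- glm(reformulate(predictor, response = outcome),
                 family = binomial, data = data)
    coefs <- summary(fit)$coefficients
    ci    <- confint.default(fit, level = conf_level)   # Wald interval
    keep  <- rownames(coefs) != "(Intercept)"
    data.frame(
        predictor = predictor,
        term      = rownames(coefs)[keep],
        OR        = exp(coefs[keep, "Estimate"]),
        lower     = exp(ci[keep, 1]),
        upper     = exp(ci[keep, 2]),
        p_value   = coefs[keep, "Pr(>|z|)"],
        row.names = NULL
    )
}

# One model per predictor, then stack the rows into a single table.
results <- do.call(rbind, lapply(predictors, screen_one,
                                 data = mtcars, outcome = outcome))
print(results)
```

Because each model contains a single predictor, every row is a crude estimate; none is adjusted for the other variables in the list.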

When to Use Univariable Screening:

  • Initial variable selection: Identify predictors associated with the outcome before building multivariable models

  • Hypothesis generation: Explore potential associations in exploratory analyses

  • Understanding crude associations: Report unadjusted effects alongside adjusted estimates

  • Variable reduction: Use p-value thresholds (e.g., p < 0.20) to identify candidates for multivariable modeling

  • Checking multicollinearity: Compare univariable and multivariable effects to identify potential collinearity

Threshold for p-values:

The p_threshold parameter controls the significance threshold used in the printed summary to count how many predictors are significant. All predictors are always included in the output table regardless of this setting.

Effect Measures by Model Type:

  • Logistic regression (model_type = "glm", family = "binomial"): Odds ratios (OR)

  • Cox regression (model_type = "coxph"): Hazard ratios (HR)

  • Poisson regression (model_type = "glm", family = "poisson"): Rate/risk ratios (RR)

  • Negative binomial (model_type = "negbin"): Rate ratios (RR)

  • Linear regression (model_type = "lm" or GLM with identity link): Raw coefficient estimates

  • Gamma regression (model_type = "glm", family = "Gamma"): Multiplicative effects (with default log link)

Memory Considerations:

When keep_models = FALSE (default), fitted models are discarded after extracting results to conserve memory. Set keep_models = TRUE only when the following are needed:

  • Model diagnostic plots

  • Predictions from individual models

  • Additional model statistics not extracted by default

  • Further analysis of specific models

Value

A data.table with S3 class "uniscreen_result" containing formatted univariable screening results. The table structure includes:

Variable

Character. Predictor name or custom label (from labels)

Group

Character. For factor variables: category level. For continuous variables: typically empty or descriptive statistic label

n

Integer. Sample size used in the model (if show_n = TRUE)

n_group

Integer. Sample size for this specific factor level (factor variables only)

events

Integer. Total number of events in the model for survival or logistic regression (if show_events = TRUE)

events_group

Integer. Number of events for this specific factor level (factor variables only)

OR/HR/RR/Coefficient (95% CI)

Character. Formatted effect estimate with confidence interval. Column name depends on model type: "OR (95% CI)" for logistic, "HR (95% CI)" for survival, "RR (95% CI)" for counts, "Coefficient (95% CI)" for linear models

p-value

Character. Formatted p-value from the Wald test

The returned object includes the following attributes accessible via attr():

raw_data

data.table. Unformatted numeric results with separate columns for coefficients, standard errors, confidence interval bounds, etc. Suitable for further statistical analysis or custom formatting

models

List (if keep_models = TRUE). Named list of fitted model objects, with predictor names as list names. Access specific models via attr(result, "models")[["predictor_name"]]

outcome

Character. The outcome variable name used

model_type

Character. The regression model type used

model_scope

Character. Always "Univariable" for screening results

screening_type

Character. Always "univariable" to identify the analysis type

p_threshold

Numeric. The p-value threshold used for significance

significant

Character vector. Names of predictors with p-value below the screening threshold, suitable for passing directly to downstream modeling functions

See Also

fit for fitting a single multivariable model, fullfit for complete univariable-to-multivariable workflow, compfit for comparing multiple models, m2dt for converting individual models to tables

Other regression functions: compfit(), fit(), fullfit(), multifit(), print.compfit_result(), print.fit_result(), print.fullfit_result(), print.multifit_result(), print.uniscreen_result()

Examples

# Load example data
data(clintrial)
data(clintrial_labels)

# Example 1: Basic logistic regression screening
screen1 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension"),
    model_type = "glm",
    family = "binomial",
    parallel = FALSE
)
print(screen1)



# Example 2: With custom variable labels
screen2 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "treatment"),
    labels = clintrial_labels,
    parallel = FALSE
)
print(screen2)

# Example 3: Use a liberal p-value threshold
# Predictors with p < 0.20 are counted as significant (common for screening)
screen3 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension", 
                  "diabetes", "stage"),
    p_threshold = 0.20,
    labels = clintrial_labels,
    parallel = FALSE
)
print(screen3)
# All predictors are shown; the printed summary counts those with p < 0.20

# Example 4: Cox proportional hazards screening
library(survival)
cox_screen <- uniscreen(
    data = clintrial,
    outcome = "Surv(os_months, os_status)",
    predictors = c("age", "sex", "treatment", "stage", "grade"),
    model_type = "coxph",
    labels = clintrial_labels,
    parallel = FALSE
)
print(cox_screen)
# Returns hazard ratios (HR) instead of odds ratios

# Example 5: Keep models for diagnostics
screen5 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi", "creatinine"),
    keep_models = TRUE,
    parallel = FALSE
)

# Access stored models
models <- attr(screen5, "models")
summary(models[["age"]])
plot(models[["age"]])  # Diagnostic plots

# Example 6: Linear regression screening
linear_screen <- uniscreen(
    data = clintrial,
    outcome = "bmi",
    predictors = c("age", "sex", "smoking", "creatinine", "hemoglobin"),
    model_type = "lm",
    labels = clintrial_labels,
    parallel = FALSE
)
print(linear_screen)

# Example 7: Poisson regression for equidispersed count outcomes
# fu_count has variance ~= mean, appropriate for standard Poisson
poisson_screen <- uniscreen(
    data = clintrial,
    outcome = "fu_count",
    predictors = c("age", "stage", "treatment", "surgery"),
    model_type = "glm",
    family = "poisson",
    labels = clintrial_labels,
    parallel = FALSE
)
print(poisson_screen)
# Returns rate ratios (RR)

# Example 8: Negative binomial for overdispersed counts
# ae_count has variance > mean (overdispersed), use negbin
if (requireNamespace("MASS", quietly = TRUE)) {
    nb_screen <- uniscreen(
        data = clintrial,
        outcome = "ae_count",
        predictors = c("age", "treatment", "diabetes", "surgery"),
        model_type = "negbin",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(nb_screen)
}

# Example 9: Gamma regression for positive continuous outcomes (e.g., costs)
gamma_screen <- uniscreen(
    data = clintrial,
    outcome = "los_days",
    predictors = c("age", "sex", "treatment", "surgery"),
    model_type = "glm",
    family = Gamma(link = "log"),
    labels = clintrial_labels,
    parallel = FALSE
)
print(gamma_screen)

# Example 10: Hide reference rows for factor variables
screen10 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("treatment", "stage", "grade"),
    reference_rows = FALSE,
    parallel = FALSE
)
print(screen10)
# Reference categories not shown

# Example 11: Customize decimal places
screen11 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi", "creatinine"),
    digits = 3,      # 3 decimal places for OR
    p_digits = 4,    # 4 decimal places for p-values
    parallel = FALSE
)
print(screen11)

# Example 12: Hide sample size and event columns
screen12 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi"),
    show_n = FALSE,
    show_events = FALSE,
    parallel = FALSE
)
print(screen12)

# Example 13: Access raw numeric data
screen13 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "treatment"),
    parallel = FALSE
)
raw_data <- attr(screen13, "raw_data")
print(raw_data)
# Contains unformatted coefficients, SEs, CIs, etc.

# Example 14: Force coefficient display instead of OR
screen14 <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "bmi"),
    model_type = "glm",
    family = "binomial",
    parallel = FALSE,
    exponentiate = FALSE  # Show log odds instead of OR
)
print(screen14)

# Example 15: Screening with weights
screen15 <- uniscreen(
    data = clintrial,
    outcome = "Surv(os_months, os_status)",
    predictors = c("age", "sex", "bmi"),
    model_type = "coxph",
    weights = runif(nrow(clintrial), min = 0.5, max = 2),  # Random numbers for example
    parallel = FALSE
)

# Example 16: Strict significance threshold (p < 0.05)
sig_only <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension", 
                  "diabetes", "ecog", "treatment", "stage", "grade"),
    p_threshold = 0.05,
    labels = clintrial_labels,
    parallel = FALSE
)

# Count the predictors with a p-value below the threshold
n_significant <- length(attr(sig_only, "significant"))
cat("Significant predictors:", n_significant, "\n")

# Example 17: Complete workflow - screen then use in multivariable
# Step 1: Screen with liberal threshold
candidates <- uniscreen(
    data = clintrial,
    outcome = "os_status",
    predictors = c("age", "sex", "bmi", "smoking", "hypertension",
                  "diabetes", "treatment", "stage", "grade"),
    p_threshold = 0.20,
    parallel = FALSE
)

# Step 2: Extract significant predictor names
sig_predictors <- attr(candidates, "significant")

# Step 3: Fit multivariable model with selected predictors
multi_model <- fit(
    data = clintrial,
    outcome = "os_status",
    predictors = sig_predictors,
    labels = clintrial_labels
)
print(multi_model)

# Example 18: Mixed-effects logistic regression (glmer)
# Accounts for clustering by site
if (requireNamespace("lme4", quietly = TRUE)) {
    glmer_screen <- uniscreen(
        data = clintrial,
        outcome = "os_status",
        predictors = c("age", "sex", "treatment", "stage"),
        model_type = "glmer",
        random = "(1|site)",
        family = "binomial",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(glmer_screen)
}

# Example 19: Mixed-effects linear regression (lmer)
if (requireNamespace("lme4", quietly = TRUE)) {
    lmer_screen <- uniscreen(
        data = clintrial,
        outcome = "biomarker_x",
        predictors = c("age", "sex", "treatment", "smoking"),
        model_type = "lmer",
        random = "(1|site)",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(lmer_screen)
}

# Example 20: Mixed-effects Cox model (coxme)
if (requireNamespace("coxme", quietly = TRUE)) {
    coxme_screen <- uniscreen(
        data = clintrial,
        outcome = "Surv(os_months, os_status)",
        predictors = c("age", "sex", "treatment", "stage"),
        model_type = "coxme",
        random = "(1|site)",
        labels = clintrial_labels,
        parallel = FALSE
    )
    print(coxme_screen)
}

# Example 21: Mixed-effects with nested random effects
# Patients nested within sites
if (requireNamespace("lme4", quietly = TRUE)) {
    nested_screen <- uniscreen(
        data = clintrial,
        outcome = "os_status",
        predictors = c("age", "treatment"),
        model_type = "glmer",
        random = "(1|site/patient_id)",
        family = "binomial",
        parallel = FALSE
    )
}

# Example 22: Quasipoisson for overdispersed count data
# Alternative to negative binomial when MASS not available
quasi_screen <- uniscreen(
    data = clintrial,
    outcome = "ae_count",
    predictors = c("age", "treatment", "diabetes", "surgery", "stage"),
    model_type = "glm",
    family = "quasipoisson",
    labels = clintrial_labels,
    parallel = FALSE
)
print(quasi_screen)
# Adjusts standard errors for overdispersion

# Example 23: Quasibinomial for overdispersed binary data
quasibin_screen <- uniscreen(
    data = clintrial,
    outcome = "any_complication",
    predictors = c("age", "bmi", "diabetes", "surgery", "ecog"),
    model_type = "glm",
    family = "quasibinomial",
    labels = clintrial_labels,
    parallel = FALSE
)
print(quasibin_screen)

# Example 24: Inverse Gaussian for highly skewed positive data
invgauss_screen <- uniscreen(
    data = clintrial,
    outcome = "recovery_days",
    predictors = c("age", "surgery", "pain_score", "los_days"),
    model_type = "glm",
    family = inverse.gaussian(link = "log"),
    labels = clintrial_labels,
    parallel = FALSE
)
print(invgauss_screen)




summata documentation built on May 7, 2026, 5:07 p.m.