library(svyCausalGLM) library(survey) library(nnet) library(ggplot2) knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE )
In complex survey data, standard logistic regression requires the use of survey design weights (accounting for stratification, clustering, and unequal probabilities of selection) to obtain design-based unbiased estimates. However, when estimating the association between an exposure and a binary outcome, the relationship may be confounded by measured covariates. To address this, propensity-score weighting or prognostic-score weighting can be applied in addition to survey weights. Propensity-score weights balance the distribution of confounders across exposure groups, while prognostic-score weights adjust for covariates predictive of the outcome. Incorporating these weights allows us to estimate the exposure–outcome association adjusted for confounding, while still respecting the complex survey design. Formally, this approach produces doubly weighted regression models, where the combined weights (survey × propensity/prognostic) yield estimates that are both design-consistent and confounding-adjusted. In this package, we implement methods for binary, multinomial, and continuous exposures following established approaches in the survey and causal inference literature (e.g., [Austin, 2011; Lunceford & Davidian, 2004]). These methods enable researchers to report adjusted odds ratios that reflect the association between exposure and outcome under both design-based and confounder-adjusted considerations ([Ahmed and Dwivedi, 2026]).
This vignette is intended as a methodological and practical guide for applied researchers using complex survey data to estimate causal effects and model discrimination measures under propensity- and prognostic-score weighting.
Several R packages support causal and regression analyses under either complex survey designs or confounding adjustment, but typically not both simultaneously. The survey package enables design-based regression using sampling weights, strata, and clustering, ensuring unbiased population-level inference under complex survey designs. In contrast, packages for propensity score methods, such as TWANG ([Ridgeway et al, 2022] ) and PSweight ([Zhou et al, 2020)]) those described in Propensity Score Weighting in R, primarily focus on confounding adjustment using propensity-based weights, generally assuming independent and identically distributed observations and not explicitly incorporating survey design features. As a result, existing software typically applies survey weights alone or propensity/prognostic weights alone, with limited guidance or implementation for combining these approaches within a unified analytical framework. This gap is particularly relevant for observational studies using nationally representative health surveys, where valid inference requires both adjustment for complex sampling designs and control of confounding due to observed covariates. Recently, Ahmed and Dwivedi (2026) proposed a Stata module that integrates propensity score and prognostic score weighting with survey weights, allowing researchers to simultaneously account for complex survey designs and confounding, while also removing additional covariate effects on the exposure or outcome. Building on this methodological framework, the svyCausalGLM package provides an R-based implementation of survey-weighted generalized linear models that incorporate propensity score weighting, prognostic score weighting, or standard covariate adjustment within a design-based inference paradigm.
This vignette demonstrates the use of svyCausalGLM for conducting survey-weighted regression analyses with additional causal adjustments, enabling estimation of adjusted associations between exposure and outcome that are both design-consistent and confounding-adjusted. The package also provides tools for assessing model discrimination through the estimation of the area under the receiver operating characteristic curve (AUC), along with corresponding confidence intervals, after incorporating survey design features and causal weighting adjustments.
The svyCausalGLM package provides functions to fit survey-weighted logistic regression models using design-based inference. It supports propensity-score weighting and prognostic score adjustment while accounting for complex survey design (PSU, strata, weights).
This vignette demonstrates the workflow for:
final_svyglm() – Standard survey-weighted GLM
final_prop_svyglm() – Propensity-weighted survey GLM (binary, multinomial, or continuous exposures)
final_prog_svyglm() – Prognostic-weighted survey GLM
viz_auc_svyglm()-Estimating and Plotting AUC
A synthetic dataset with 800 observations was generated to resemble a complex survey, including PSU, strata, and sampling weights. The data contain a binary exposure, continuous and categorical covariates (age, sex, BMI), and a binary outcome generated from a logistic model. The dataset is used for demonstration purposes only.
set.seed(123) n <- 800 synthetic_svy <- data.frame( psu = sample(1:80, n, replace = TRUE), strata = sample(1:40, n, replace = TRUE), weight = runif(n, 0.5, 3), exposure = rbinom(n, 1, 0.45), age = round(rnorm(n, 50, 12)), sex = factor(sample(c("Male", "Female"), n, replace = TRUE)), bmi = round(rnorm(n, 27, 4), 1) ) linpred <- -2 + 0.8 * synthetic_svy$exposure + 0.03 * synthetic_svy$age synthetic_svy$outcome <- rbinom(n, 1, plogis(linpred)) head(synthetic_svy)
final_svyglm(): Survey-Weighted Logistic Regressionfinal_svyglm() fits a design-based survey-weighted logistic regression model using svyglm() with a quasi-binomial family, incorporating sampling weights, stratification, and clustering (PSUs) to produce population-representative estimates under complex survey designs (Lumley, 2010). The function estimates adjusted associations between a binary outcome and covariates, returning odds ratios, confidence intervals, and p-values in a publication ready format. It also producing specific twy-way interaction terms which facilitates assessing interaction among covariates.
Unlike final_prop_svyglm() and final_prog_svyglm(), which further adjust for confounding through propensity score or prognostic score weighting, final_svyglm() relies solely on the original survey design weights for design-based inference. The function structure is given by
#final_svyglm(data, dep_var, covariates, id_var, strata_var, weight_var, family = "binomial", level = 0.95, interaction_terms = NULL)
| Argument | Description |
|--------------------|---------------------------------------------------------------|
| data | Data frame containing survey data |
| dep_var | Character. Binary outcome variable (0/1) |
| covariates | Character vector of covariate names |
| id_var | Character. Primary sampling unit (PSU) / cluster variable |
| strata_var | Character. Strata variable |
| weight_var | Character. Survey sampling weights |
| family | Model family (currently "binomial" only) |
| level | Confidence level for intervals (default = 0.95) |
| interaction_terms| Optional interaction terms (e.g., "age:sex") |
|------------------------------------------------------------------------------------|
To ensure valid design-based inference, the following conditions must be met when using final_svyglm():
survey::svyglm() and supports design-based inference. Estimates reflect population-average associations, not prediction or causal effects.final_svyglm() returns a list of class svyCausal containing the following elements:
--model: The fitted survey-weighted logistic regression model (svyglm object), incorporating sampling weights, stratification, and clustering.
--OR_table: A publication-ready table reporting odds ratios, confidence intervals, and p-values for all model covariates.
--outcome: The observed binary outcome vector used in the fitted model after applying complete-case filtering.
--predictions: Predicted outcome probabilities from the survey-weighted model.
--final_weights: The final analysis weights extracted from the survey design object used in model estimation.
--design: The survey design object (svydesign) used to fit the model, allowing further post-estimation analysis if needed.
final_svyglm()fit_main<-final_svyglm(data=synthetic_svy, dep_var="outcome", covariates = c("age", "sex", "bmi"), id_var = "psu", # Changed from psu_var strata_var = "strata", weight_var = "weight", family = "binomial", level = 0.95, interaction_terms = NULL) fit_main$OR_table
--Categorical Exposure
For binary and multinomial exposures, final_prop_svyglm() first estimates propensity scores using logistic or multinomial regression models, respectively. Inverse probability of treatment weights (IPTW) are constructed and stabilized before being combined with the survey sampling weights. The final analysis is conducted using a survey-weighted generalized linear model, ensuring both confounding adjustment and valid design-based inference.
--Continous Exposure
For continuous exposures, final_prop_svyglm() implements a stabilized generalized propensity score (GPS) approach, where the exposure is modeled using linear regression conditional on covariates. The conditional density of the exposure is evaluated under both the covariate-adjusted model and a null model, and the ratio of these densities is used to construct stabilized inverse probability weights. These weights are then combined multiplicatively with the survey sampling weights and used in a survey-weighted final model. This approach follows the methodology proposed by Ahmed and Dwivedi (2026), adapted here for use with complex survey designs via survey::svydesign().
final_prop_svyglm() fits a survey-weighted logistic regression model using complex survey design information (PSU, strata, weights) and applies inverse probability of treatment weighting (IPTW) for binary, multinomial, or continuous exposures.
final_prop_svyglm()#final_prop_svyglm(data, dep_var, covariates, exposure, id_var, strata_var, weight_var, exposure_type = "binary", outcome_covariates = NULL, level = 0.95)
| Argument | Description |
|---------------------|-----------------------------------------------------------------|
| data | Data frame containing survey data |
| dep_var | Character. Binary outcome variable (0/1) |
| covariates | Character vector of covariate names |
| exposure | Character. Treatment or exposure variable |
| id_var | Character. Primary sampling unit (PSU) / cluster variable |
| strata_var | Character. Strata variable |
| weight_var | Character. Survey sampling weights |
| exposure_type | Character. "binary", "multinomial", or "continuous" |
| outcome_covariates| Character vector of additional covariate names |
| level |Numeric. Confidence level for intervals (default = 0.95) |
| ... | Additional arguments passed to svyglm() |
|---------------------------------------------------------------------------------------|
Before using final_prop_svyglm(), ensure the following:
Binary outcome: The dependent variable must be binary (0/1).
Exposure: Can be binary, multinomial (factor), or continuous.
Covariates: Must exist in the dataset; used for estimating propensity scores.
Survey design: PSU, strata, and weights must be available and correctly specified.
No missing values: Missing values in the outcome, exposure, or covariates are removed automatically with a warning.
Sufficient data per category: Each level of exposure should have enough observations to avoid unstable estimates (default threshold <5 triggers a warning).
--model: Survey-weighted GLM fitted to the outcome using the final weights.
--OR_table: Odds ratios with confidence intervals and p-values for each covariate and exposure, reflecting the adjusted association.
--outcome: Observed outcome values used in the model.
--predictions: Predicted probabilities for each observation from the fitted model.
--final_weights: Combined survey and propensity/prognostic weights used in the model fitting.
Note: The S3 print() method displays only the OR_table for concise reporting in publications
A synthetic data set with n=1500 on all required variables including respondent specific identity, strata, survey weights, covariates (age and sex), and a binary exposure. The binary outcome is generated from binomial distribution with moderate probability which depends on age and exposure variables.
set.seed(123) n <- 1500 dat <- data.frame( psu = sample(1:10, n, replace = TRUE), strata = sample(1:5, n, replace = TRUE), weight = runif(n, 0.5, 2), age = rnorm(n, 50, 10), sex = factor(sample(c("Male", "Female"), n, replace = TRUE)), exposure_bin = rbinom(n, 1, 0.5) ) dat$outcome <- rbinom(n, 1, plogis(-2 + 0.03*dat$age + 0.5*dat$exposure_bin)) ## Multinomial exposure dat$exposure_3cat <- cut(dat$age, breaks = quantile(dat$age, probs = c(0, 1/3, 2/3, 1)), labels = c("Low", "Medium", "High"), include.lowest = TRUE) exp_eff <- ifelse(dat$exposure_3cat == "Low", 0, ifelse(dat$exposure_3cat == "Medium", 0.6, 1.2)) dat$outcome_cat <- rbinom(n, 1, plogis(-3 + 0.02 * dat$age + 0.4*(dat$sex=="Male") + exp_eff))
final_prop_svyglm()## Binary exposure fit_bin <- final_prop_svyglm( dat, dep_var = "outcome", covariates = c("age", "sex"), exposure = "exposure_bin", id_var = "psu", strata_var = "strata", weight_var = "weight" ) fit_bin$OR_table ## Continuous exposure fit_cont <- final_prop_svyglm( dat, dep_var = "outcome", covariates = c("sex"), exposure = "age", id_var = "psu", strata_var = "strata", weight_var = "weight", exposure_type = "continuous" ) fit_cont$OR_table ## Multinomial exposure fit_multi <- final_prop_svyglm( dat, dep_var = "outcome_cat", covariates = c("age", "sex"), exposure = "exposure_3cat", id_var = "psu", strata_var = "strata", weight_var = "weight", exposure_type = "multinomial" ) fit_multi$OR_table
As an alternative to propensity score adjustment, svyCausalGLM also supports prognostic score-adjusted survey-weighted analysis. Prognostic scores summarize the association between covariates and the outcome, allowing adjustment for high-dimensional confounding and improved efficiency in outcome modeling.
The prognostic score implementation follows the full-sample stabilized prognostic weighting framework described in Ahmed and Dwivedi (2026) and Hajian-Tilaki (2018), and is integrated with complex survey design features through survey-weighted estimation.
final_prog_svyglm() fits a survey-weighted logistic regression model using complex survey design information (PSU, strata, weights).
It is designed for design-based inference, not prediction. The function returns a publication-ready table including odds ratios, confidence intervals, p-values, and a survey-weighted AUC (Somers’ C).
final_prog_svyglm()##final_prog_svyglm(data, dep_var, covariates, exposure, id_var, strata_var, weight_var, exposure_type = "binary", level = 0.95, ...)
| Argument | Description |
|-----------------|--------------------------------------------------------------------|
| data | Data frame containing survey data |
| dep_var | Character. Binary outcome variable (0/1) |
| covariates | Character vector of covariate names |
| exposure | Character. Treatment or exposure variable |
| id_var | Character. Primary sampling unit (PSU) / cluster variable |
| strata_var | Character. Strata variable |
| weight_var | Character. Survey sampling weights |
| exposure_type | Character. "binary", "continuous", or "multinomial" |
| level | Confidence level for intervals (default = 0.95) |
| ... | Additional arguments passed to svyglm() |
|--------------------------------------------------------------------------------------|
Complex survey design variables: The input data must contain valid identifiers for primary sampling units (id_var), stratification (strata_var), and sampling weights (weight_var) compatible with survey::svydesign().
Binary outcome variable: The outcome specified by dep_var must be binary (coded 0/1) and contain both events and non-events; models cannot be fitted if only one outcome level is present.
Exposure variable: The exposure variable may be binary, categorical, or continuous and must be correctly coded and present for all included observations.
Baseline covariates for prognostic modeling Covariates supplied in covariates should represent pre-exposure variables used to estimate the prognostic score and must not include post-exposure variables.
Optional outcome covariates Additional covariates may be included in the final outcome model via outcome_covariates, provided they are not used to construct the prognostic score.
Complete-case data for modeling variables Observations with missing values in the outcome, exposure, or covariates used in the prognostic or outcome models are removed prior to analysis.
Correct specification of survey weights: The survey weights must be positive and finite; stabilized prognostic weights are multiplied by the base survey weights to preserve design-based inference.
Sufficient sample size and outcome frequency: Very small numbers of events or non-events may lead to unstable estimates; warnings are issued when sparse outcomes are detected.
Independence within sampling units: Observations are assumed independent within PSUs after accounting for the specified survey design.
Design-based inference objective: The function is intended for population-level association estimation under complex survey designs rather than individual-level prediction.
prognostic_model A survey-weighted logistic regression model (svyglm) used to estimate the prognostic score, fitted using baseline covariates and the complex survey design.
--final_model:
A survey-weighted logistic regression model (svyglm) for the association between the exposure and outcome, fitted using prognostic-adjusted survey weights.
--OR_table:
A publication-ready table summarizing exponentiated regression coefficients (odds ratios), confidence intervals, and p-values from the final model.
outcome:
The binary outcome vector used in the final model after applying complete-case filtering.
predictions:
Predicted outcome probabilities from the final prognostic-weighted survey model.
final_weights:
The combined prognostic and survey weights applied in the final analysis.
final_prog_svyglm()fit_prog <- final_prog_svyglm( data = synthetic_svy, dep_var = "outcome", exposure = "exposure", # Added (required by your function) covariates = c("age", "sex", "bmi"), # Changed from indep_vars id_var = "psu", # Changed from psu_var strata_var = "strata", weight_var = "weight" ) fit_prog$OR_table
AUC estimation is crucial to assess the predictive performance of competing models even if the goal of the study is to assess the true association between the variables. Existing AUC estimation procedures although provide the AUC with confidence interval the do not allow survey-weight directly and can results bias performance measures [Ahmed and Dwivedi, 2026) and reference therin].
The weighted AUC (using any of three weighting schemes discussed above) is computed using trapezoidal integration of the weighted ROC curve and corresponds to the probability that a randomly selected weighted case has a higher predicted risk than a control. Variance estimation follows the Hanley–McNeil (1982) approximation adapted for weighted data, enabling valid confidence intervals (see Ahmed and Dwivedi (2026) for same formulation in stata). The viz_auc_svyglm() function extracts predictions, outcomes, and weights from any of the fit object discussed in this package, constructs the weighted ROC curve, computes weighted AUC, and generates a publication-ready plot.
The viz_auc_svyglm() function produces a weighted ROC curve and reports a weighted AUC using survey-based models. It is intended to evaluate the discriminative ability of survey-weighted logistic regression models.
All computations respect the supplied analysis weights and remain valid under unequal probability sampling and post-stratification, assuming the weights correctly reflect the survey design.
Let ( \hat{p}_i ) denote the predicted probability for subject ( i ), ( Y_i \in {0,1} ) the observed outcome, and ( w_i ) the final analysis weight.
After sorting observations in decreasing order of ( \hat{p}_i ), the weighted true positive rate (TPR) and false positive rate (FPR) at threshold ( t ) are defined as:
[ \text{TPR}(t) = \frac{\sum_{i: \hat{p}i \ge t} w_i Y_i} {\sum_i w_i Y_i}, \quad \text{FPR}(t) = \frac{\sum{i: \hat{p}_i \ge t} w_i (1 - Y_i)} {\sum_i w_i (1 - Y_i)}. ]
The weighted area under the curve (AUC) is obtained by trapezoidal integration of TPR with respect to FPR. This estimator corresponds to the probability that a randomly selected weighted case has a higher predicted risk than a randomly selected weighted control.
Variance estimation follows the approximation proposed by Hanley and McNeil (1982), adapted by replacing sample counts with weighted counts of cases and controls.
viz_auc_svyglm()##viz_auc_svyglm(fit_object, title = "Weighted ROC Curve", line_color = "#0072B2")
| Argument | Description |
|---------------|----------------------------------------------------------------------------|
| fit_object | An object of class "svyCausal" returned by final_svyglm(), |
| | final_prop_svyglm() or final_prog_svyglm(), |
| | and contain outcome, predicted probabilities, and final analysis weights. |
| title | Character. Title of the ROC plot. Default is "Weighted ROC Curve". |
| line_color | Character. Color of the ROC curve line. Default is "#0072B2". |
fit_object must contain the following named elements:
outcome: binary outcome vector coded as 0/1
predictions: predicted probabilities from a survey-weighted logistic model
final_weights: final analysis weights (e.g., IPTW × survey weights)
Lengths of outcome, predictions, and final_weights must be equal after removal of missing values.
The ROC and AUC are computed using:
-- Probability-weighted empirical distributions
-- Trapezoidal integration, equivalent to Stata’s senspec
-- A variance approximation based on Hanley & McNeil (1982)
The function is appropriate for:
--Complex survey designs
-- Inverse probability weighted (IPW) causal models
-- plot: A ggplot2 ROC curve with Weighted False Positive Rate (FPR) and Weighted True Positive Rate (TPR)
-- `stats': A table including AUC value, its standard error and 95% confidence intervals.
viz_auc_svyglm()viz_auc_svyglm(fit_main)
viz_auc_svyglm(fit_bin)
viz_auc_svyglm(fit_prog)
final_svyglm() when estimating covariate-adjusted associations.final_prop_svyglm() for causal effect estimation via propensity-score weighting.final_prog_svyglm() when evaluating prognostic model performance.viz_auc_svyglm() to visualize weighted ROC curves and report weighted AUC with uncertainty.All examples in this vignette are fully reproducible and rely only on simulated datasets. Random seeds are set where applicable.
The methods implemented in svyCausalGLM rely on correct specification of
the propensity or prognostic score models and assume no unmeasured
confounding. The weighted AUC variance estimator is based on large-sample
approximations and may be unstable in settings with extremely sparse
outcomes or highly variable weights. Users are encouraged to assess weight
distributions and conduct sensitivity analyses where appropriate.
--Austin PC. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research, 2011;46:399–424.
--Lunceford JK, Davidian M. “Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study.” Statistics in Medicine, 2004;23:2937–2960.
--Lumley T. Complex Surveys: A Guide to Analysis Using R. Wiley, 2010.
-- Ahmed and Dwivedi. (2026):Computing adjusted effect size, area under the curve, and c-statistic for evaulating the association between uric acid and mortality in US adults using unweighted and survey-weighted regression, propensity, and prognostic score analyses.
--Ridgeway, Greg, Daniel F. McCaffrey, Andrew R. Morral, Matthew Cefalu, Lane F. Burgette, Joseph D. Pane, and Beth Ann Griffin, Toolkit for Weighting and Analysis of Nonequivalent Groups: A Tutorial for the R TWANG Package. Santa Monica, CA: RAND Corporation, 2022. https://www.rand.org/pubs/tools/TLA570-5.html.
--Zhou, T., Tong, G., Li, F., & Thomas, L. E. (2020). PSweight: an R package for propensity score weighting analysis. arXiv preprint arXiv:2010.08893.
--Hajian-Tilaki, K. (2018). Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian Journal of Internal Medicine.
--Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.