knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, eval = FALSE )
This article provides a comprehensive reference for the statistical methods implemented in summata. It serves as a technical companion to the package vignettes, detailing the mathematical foundations, assumptions, and interpretation of each supported model class.
For practical guidance on using these methods, see the Regression Modeling vignette.
Throughout this article, we adopt the following notation:
| Symbol | Description | |:-------|:------------| | Y | Response (outcome) variable | | X₁, ..., Xₚ | Predictor (covariate) variables | | β₀ | Intercept | | β₁, ..., βₚ | Regression coefficients | | n | Sample size | | i | Observation index (i = 1, ..., n) | | j | Cluster index (for hierarchical data) | | t | Time (for survival models) | | g(·) | Link function | | exp(·) | Exponential function | | log(·) | Natural logarithm |
Tests H₀: β = 0 using:
$$z = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}$$
Under H₀, z ~ N(0, 1) for GLM and Cox models. For linear models, the analogous statistic follows a t-distribution with n − p − 1 degrees of freedom.
Compares nested models:
$$\Lambda = -2[\log L_{\text{reduced}} - \log L_{\text{full}}]$$
Under H₀, Λ ~ χ²ₖ, where k is the difference in parameters.
Tests H₀: all βⱼ = 0 (except intercept):
For coefficient β with standard error SE(β):
$$\beta \pm z_{1-\alpha/2} \cdot \text{SE}(\beta)$$
where z₁₋α/₂ is the standard normal quantile (1.96 for 95% CI). Wald intervals are computationally inexpensive but may be inaccurate when the likelihood surface is asymmetric, which can occur with small samples, sparse data, or estimates near boundary values.
For linear models, confidence intervals use the t-distribution to account for estimation of the residual variance:
$$\beta \pm t_{n-p-1, \; 1-\alpha/2} \cdot \text{SE}(\beta)$$
where tₙ₋ₚ₋₁ is the t-quantile with n − p − 1 degrees of freedom. This produces wider intervals than the normal approximation, particularly when n is small relative to p.
Profile likelihood intervals are based on inverting the likelihood ratio test. The confidence set for β is defined as:
$${\beta : 2[\log L(\hat{\beta}) - \log L(\beta)] \leq \chi^2_{1,1-\alpha}}$$
Profile intervals are more accurate than Wald intervals for small samples or boundary estimates, because they respect the curvature of the likelihood surface rather than approximating it as quadratic. However, they require iterative computation (refitting the model for each parameter), which increases execution time.
For exponentiated coefficients (OR, HR, RR), the CI is obtained by exponentiating the interval endpoints from the coefficient scale:
$$\left(\exp(\text{CI}{\text{lower}}), \; \exp(\text{CI}{\text{upper}})\right)$$
where CI_lower and CI_upper are the confidence limits for β on the log scale, computed using whichever method (Wald, exact t, or profile likelihood) is appropriate for the model class. Note that exponentiation preserves coverage probability because exp(·) is a monotone transformation.
The CI method is selected automatically based on model class, and can be overridden via the conf_method parameter:
| Model Class | Default (conf_method = "profile") | With conf_method = "wald" |
|:------------|:----------------------------------|:--------------------------|
| GLM (binomial, poisson) | Profile likelihood via MASS::confint.glm() | Wald (normal approximation) |
| Negative binomial | Profile likelihood via MASS::confint.glm() | Wald (normal approximation) |
| Quasi-likelihood (quasibinomial, quasipoisson) | Wald (no true likelihood) | Wald (no true likelihood) |
| Linear model (lm) | Exact t-distribution via confint.lm() | Wald (normal approximation) |
| Cox PH (coxph) | Wald | Wald |
| Mixed-effects (lmer, glmer, coxme) | Wald | Wald |
Cox and mixed-effects models use Wald intervals regardless of the conf_method setting. For Cox models, Wald CIs are the standard convention in the survival analysis literature. For mixed-effects models, profile likelihood CIs are available in principle (via lme4::confint.merMod(method = "profile")), but are prohibitively slow for routine use and are not implemented in summata.
The conf_method can be set globally for a session:
options(summata.conf_method = "profile") # Profile/exact-t (default) options(summata.conf_method = "wald") # Wald CIs throughout
Linear regression models the expected value of a continuous response as a linear combination of predictors:
$$E[Y_i | X_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$
Equivalently, in matrix notation:
$$\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\varepsilon}$$
where $\mathbf{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$.
Coefficients are estimated by ordinary least squares (OLS):
$$\hat{\mathbf{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
Each coefficient βⱼ represents the expected change in Y for a one-unit increase in Xⱼ, holding all other predictors constant.
| Predictor Type | Interpretation of β | |:---------------|:----------------------| | Continuous | Change in Y per unit increase in X | | Binary (0/1) | Difference in Y between groups | | Categorical | Difference from reference category |
Coefficient of determination (R²): $$R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$
Adjusted R²: $$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
Root mean squared error (RMSE): $$\text{RMSE} = \sqrt{\frac{1}{n}\sum_i (Y_i - \hat{Y}_i)^2}$$
Note: this is the descriptive (biased) RMSE. The unbiased estimate of the residual standard deviation σ uses n − p − 1 in the denominator and is reported by summary.lm() as the "residual standard error."
In summata, linear regression is specified with model_type = "lm":
fit(data, outcome = "continuous_var", predictors, model_type = "lm")
Generalized linear models (GLMs) extend linear regression to accommodate non-normal response distributions through three components:
For binary outcomes Y ∈ {0, 1}, logistic regression models the log-odds of the event:
$$\log\left(\frac{P(Y_i = 1 | X_i)}{1 - P(Y_i = 1 | X_i)}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$$
Equivalently: $$P(Y_i = 1 | X_i) = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}$$
The exponentiated coefficient exp(βⱼ) yields the odds ratio (OR):
$$\text{OR} = \exp(\beta_j) = \frac{\text{Odds}(Y = 1 | X_j + 1)}{\text{Odds}(Y = 1 | X_j)}$$
| OR Value | Interpretation | |:---------|:---------------| | OR = 1 | No association | | OR > 1 | Increased odds of outcome | | OR < 1 | Decreased odds of outcome |
For a continuous predictor, the OR represents the multiplicative change in odds per one-unit increase. For a categorical predictor, the OR compares each level to the reference category.
fit(data, outcome = "binary_var", predictors, model_type = "glm", family = "binomial")
For count outcomes Y ∈ {0, 1, 2, ...}, Poisson regression models the log of the expected count:
$$\log(E[Y_i | X_i]) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$$
The exponentiated coefficient exp(βⱼ) yields the rate ratio (RR):
$$\text{RR} = \exp(\beta_j) = \frac{E[Y | X_j + 1]}{E[Y | X_j]}$$
When Var(Y) > E[Y] (overdispersion), standard Poisson SEs are underestimated. Solutions include:
# Standard Poisson fit(data, outcome = "count_var", predictors, model_type = "glm", family = "poisson") # Overdispersed counts fit(data, outcome = "count_var", predictors, model_type = "glm", family = "quasipoisson")
Negative binomial regression extends Poisson regression with an additional dispersion parameter θ:
$$Y_i \sim \text{NegBin}(\mu_i, \theta)$$ $$\log(\mu_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$$
The variance is: $$\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}$$
As θ → ∞, the negative binomial converges to Poisson.
fit(data, outcome = "count_var", predictors, model_type = "negbin")
Requires the MASS package.
For positive, continuous, right-skewed outcomes, Gamma regression models:
$$Y_i \sim \text{Gamma}(\mu_i, \phi)$$
The canonical link for the Gamma family is the inverse (1/μ). In practice, the log link is often preferred because it yields more interpretable coefficients (multiplicative effects). When family = "Gamma" is passed as a string, summata defaults to the log link. The link can also be specified explicitly:
$$\log(E[Y_i | X_i]) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$$
With log link, exp(βⱼ) represents the multiplicative effect (ratio):
$$\frac{E[Y | X_j + 1]}{E[Y | X_j]} = \exp(\beta_j)$$
# String shorthand (resolves to log link) fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = "Gamma") # Explicit link specification fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = Gamma(link = "log")) # Gamma with inverse link (canonical) fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = Gamma(link = "inverse"))
For overdispersed binary data, quasibinomial regression estimates a dispersion parameter:
$$\text{Var}(Y_i) = \phi \cdot \mu_i(1 - \mu_i)$$
where φ > 1 indicates overdispersion. Coefficients are identical to binomial GLM, but standard errors are adjusted.
For overdispersed count data:
$$\text{Var}(Y_i) = \phi \cdot \mu_i$$
# Overdispersed binary fit(data, outcome, predictors, model_type = "glm", family = "quasibinomial") # Overdispersed counts fit(data, outcome, predictors, model_type = "glm", family = "quasipoisson")
| Family | Outcome Type | Link | Effect Measure |
|:-------|:-------------|:-----|:---------------|
| binomial | Binary | logit | Odds ratio |
| quasibinomial | Binary (overdispersed) | logit | Odds ratio |
| poisson | Count | log | Rate ratio |
| quasipoisson | Count (overdispersed) | log | Rate ratio |
| gaussian | Continuous | identity | Coefficient |
| Gamma | Positive continuous | inverse (canonical); log (in summata) | Ratio (with log link) |
| inverse.gaussian | Positive continuous | 1/μ² | — |
The Cox model relates the hazard function to predictors:
$$h(t | X_i) = h_0(t) \exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})$$
where:
The hazard ratio (HR) = exp(βⱼ) represents the relative hazard:
$$\text{HR} = \frac{h(t | X_j + 1)}{h(t | X_j)} = \exp(\beta_j)$$
| HR Value | Interpretation | |:---------|:---------------| | HR = 1 | No association with survival | | HR > 1 | Increased hazard (shorter survival) | | HR < 1 | Decreased hazard (longer survival) |
Concordance (Harrell's C-index): $$C = P(\hat{h}_i > \hat{h}_j | T_i < T_j)$$
Proportion of concordant pairs; C = 0.5 indicates no discrimination, C = 1 indicates perfect discrimination.
The proportional hazards assumption can be tested using:
fit(data, outcome = "Surv(time, status)", predictors, model_type = "coxph")
Requires the survival package.
For matched case-control studies, conditional logistic regression conditions on the stratum:
$$\log\left(\frac{P(Y_i = 1 | X_i, \text{stratum})}{P(Y_i = 0 | X_i, \text{stratum})}\right) = \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$$
Note: No intercept is estimated; it is absorbed by the stratum.
fit(data, outcome = "case_status", predictors, model_type = "clogit", strata = "match_id")
Requires the survival package.
Mixed-effects models (hierarchical or multilevel models) account for correlation within clusters by incorporating random effects.
For observation i within cluster j:
$$g(E[Y_{ij} | X_{ij}, u_j]) = \mathbf{X}{ij}\mathbf{\beta} + \mathbf{Z}{ij}\mathbf{u}_j$$
where:
For continuous outcomes:
$$Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_{0j} + u_{1j}X_{1ij} + \varepsilon_{ij}$$
where:
| Syntax | Meaning |
|:-------|:--------|
| (1\|group) | Random intercepts by group |
| (x\|group) | Random intercepts and slopes for x |
| (1\|group1) + (1\|group2) | Crossed random effects |
| (1\|group1/group2) | Nested random effects |
fit(data, outcome, c(predictors, "(1|cluster)"), model_type = "lmer")
Requires the lme4 package.
GLMMs extend mixed-effects models to non-normal outcomes:
$$g(E[Y_{ij} | X_{ij}, u_j]) = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_{0j}$$
Common families:
Fixed effects are interpreted as conditional (cluster-specific) effects. The population-averaged effect may differ due to the non-linear link function.
# Binary outcome with clustering fit(data, outcome, c(predictors, "(1|cluster)"), model_type = "glmer", family = "binomial") # Count outcome with clustering fit(data, outcome, c(predictors, "(1|cluster)"), model_type = "glmer", family = "poisson")
Requires the lme4 package.
For clustered survival data:
$$h_{ij}(t) = h_0(t) \exp(\beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_j)$$
where uⱼ ~ N(0, σ²ᵤ) is the cluster-specific frailty term.
fit(data, outcome = "Surv(time, status)", c(predictors, "(1|cluster)"), model_type = "coxme")
Requires the coxme package.
Mixed-effects models require adequate sample sizes at both levels:
| Level | Recommendation | |:------|:---------------| | Clusters | ≥ 20–30 for stable variance estimates | | Observations per cluster | ≥ 5 for random intercepts | | Total observations | Standard regression guidelines apply |
An interaction occurs when the effect of one predictor depends on the level of another. For predictors X and Z:
$$g(E[Y]) = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 (X \times Z)$$
where:
For continuous X and Z, the effect of X on Y is:
$$\frac{\partial E[Y]}{\partial X} = \beta_1 + \beta_3 Z$$
The effect of X varies linearly with Z.
For continuous X and binary Z (0/1):
For two categorical variables, each combination of levels has a distinct effect. The interaction coefficient represents the additional effect beyond the sum of main effects.
compfit() to compare AIC/BICfit(data, outcome, predictors, interactions = c("X:Z"), model_type = "glm")
$$\text{AIC} = -2\log L + 2k$$
where L is the maximized likelihood and k is the number of parameters. AIC balances fit and complexity; lower values indicate better models.
$$\text{BIC} = -2\log L + k\log n$$
BIC penalizes complexity more heavily than AIC, especially for large samples.
| Comparison | Guideline | |:-----------|:----------| | ΔAIC < 2 | Models essentially equivalent | | ΔAIC 2–7 | Some support for lower-AIC model | | ΔAIC > 10 | Strong support for lower-AIC model |
For binary outcomes or survival models:
$$C = P(\hat{p}_i > \hat{p}_j | Y_i = 1, Y_j = 0)$$
| C Value | Discrimination | |:--------|:---------------| | 0.5 | No discrimination (random) | | 0.6–0.7 | Poor | | 0.7–0.8 | Acceptable | | 0.8–0.9 | Excellent | | > 0.9 | Outstanding |
For binary outcomes:
$$\text{Brier} = \frac{1}{n}\sum_i (Y_i - \hat{p}_i)^2$$
Lower values indicate better calibration; Brier = 0 is perfect.
$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
Proportion of variance explained.
McFadden pseudo-R²: $$R^2_{\text{McF}} = 1 - \frac{\log L_{\text{full}}}{\log L_{\text{null}}}$$
Nagelkerke pseudo-R²: $$R^2_N = \frac{1 - (L_{\text{null}}/L_{\text{full}})^{2/n}}{1 - L_{\text{null}}^{2/n}}$$
Values are not directly comparable to linear R²; 0.2–0.4 typically indicates good fit for logistic models.
When comparing multiple candidate models, researchers typically examine several quality metrics: information criteria (AIC, BIC), discrimination (concordance), calibration (Brier score), and explained variation (R² or pseudo-R²). Each metric captures a different aspect of model performance, and no single measure is universally optimal for model selection.
The Composite Model Score (CMS) addresses this challenge by combining multiple metrics into a single summary value, facilitating rapid comparison across candidate models. This approach:
The CMS is intended as a screening tool to identify promising models, not as a replacement for substantive evaluation. Final model selection should always consider domain knowledge, parsimony, and the specific research objectives.
Each component metric is transformed to a 0–100 scale where higher values indicate better performance:
Information criteria (AIC, BIC) — Lower is better: $$S_{\text{AIC}} = 100 \times \left(1 - \frac{\text{AIC}i - \text{AIC}{\min}}{\text{AIC}{\max} - \text{AIC}{\min}}\right)$$
Concordance (C-statistic) — Higher is better, with 0.5 as floor: $$S_C = \begin{cases} 0 & C \leq 0.5 \ 200 \times (C - 0.5) & C > 0.5 \end{cases}$$
Pseudo-R² (McFadden) — Scaled recognizing typical range: $$S_{R^2} = \min(100, \; 250 \times R^2_{\text{McF}})$$
This scaling reflects that McFadden's R² rarely exceeds 0.4 even for well-fitting models.
Brier score — Lower is better, with 0.25 as no-skill baseline: $$S_{\text{Brier}} = 100 \times \left(1 - \frac{\text{Brier}}{0.25}\right)$$
Convergence — Categorical assessment: $$S_{\text{conv}} = \begin{cases} 100 & \text{converged normally} \ 70 & \text{suspect (warnings)} \ 30 & \text{failed to converge} \end{cases}$$
Component scores are combined using model-type-specific weights that reflect the relative importance of each metric for that model class:
| Model Type | Convergence | AIC | Concordance | Pseudo-R² | Brier | Global p | |:-----------|:-----------:|:---:|:-----------:|:-----------:|:-----:|:----------:| | GLM (logistic) | 0.15 | 0.25 | 0.40 | 0.15 | 0.05 | — | | Cox PH | 0.15 | 0.30 | 0.40 | — | — | 0.15 | | Linear | 0.15 | 0.25 | — | 0.45 | — | 0.15 |
For mixed-effects models, additional components include marginal R², conditional R², and ICC.
The CMS is computed as a weighted sum:
$$\text{CMS} = \sum_j w_j \times S_j$$
where wⱼ are the weights and Sⱼ are the normalized component scores.
| CMS Range | Interpretation | |:----------|:---------------| | 85–100 | Excellent model quality | | 75–84 | Very good | | 65–74 | Good | | 55–64 | Fair | | < 55 | Poor |
These thresholds provide general guidance. A model with CMS = 80 is not necessarily "better" than one with CMS = 78—small differences should prompt examination of individual metrics and substantive considerations.
# Compare candidate models comparison <- compfit( data = mydata, outcome = "y", predictors = list( "Model 1" = c("x1", "x2"), "Model 2" = c("x1", "x2", "x3"), "Model 3" = c("x1", "x2", "x3", "x4") ), model_type = "logistic" ) # Custom weights emphasizing discrimination comparison_custom <- compfit( data = mydata, outcome = "y", predictors = model_list, model_type = "logistic", scoring_weights = list( convergence = 0.10, aic = 0.20, concordance = 0.50, pseudo_r2 = 0.15, brier = 0.05 ) )
fit(), uniscreen(), and fullfit()compfit()Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.