In subsampling: Optimal Subsampling Methods for Statistical Models

ssp.glm.rF

R Documentation

Balanced Subsampling Methods for Generalized Linear Models with Rare Features

Description

Rare features are binary covariates that equal one in only a small fraction of observations. Uniform or classical optimal subsampling can select too few observations in which a rare feature equals one, or yield unstable pilot estimates, so this function uses rarity-aware sampling probabilities to preserve information for estimating rare-feature coefficients.

The function extends ssp.glm by supporting rarity-aware designs, optional response balancing for binary outcomes, weighted or unweighted pilot objectives, and a combined estimator based on the union of the pilot and second-step subsamples.
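
The difficulty with uniform subsampling can be seen in a small base-R simulation. This is an illustrative sketch only: it does not call the package, and the full-data size `N`, prevalence `p`, and subsample size `n` are assumed values rather than package defaults.

```r
# Illustrative sketch in base R; N, p, and n are assumed values.
set.seed(1)
N <- 10000   # full-data size
p <- 0.01    # prevalence of the rare feature
n <- 200     # uniform subsample size

Z <- rbinom(N, 1, p)

# Number of rows with Z == 1 in each of 500 uniform subsamples
# drawn without replacement.
rare_counts <- replicate(500, sum(Z[sample.int(N, n)]))

# Expected number of Z == 1 rows per subsample.
expected_count <- n * mean(Z)

# Share of subsamples that contain at most one Z == 1 row.
prop_at_most_one <- mean(rare_counts <= 1)

print(expected_count)
print(summary(rare_counts))
print(prop_at_most_one)
```

Under these assumed settings a uniform subsample carries only about `n * p = 2` rows with the feature equal to one on average, which is too few to estimate the corresponding coefficient reliably.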

Usage

ssp.glm.rF(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  family = "binomial",
  criterion = "BL-Uni",
  sampling.method = "poisson",
  objective.weight.plt = "weighted",
  objective.weight = "weighted",
  control = list(...),
  contrasts = NULL,
  balance.X.plt = FALSE,
  balance.Y.plt = FALSE,
  balance.Y.ssp = FALSE,
  balance.Y.all = FALSE,
  record.stage.time = FALSE,
  rareFeature.index = NULL,
  rareThreshold = 0.09,
  na.action = getOption("na.action"),
  ...
)

Arguments

`formula`	A model formula object.
`data`	A data frame containing the variables in the model.
`subset`	An optional vector specifying a subset of observations to be used as the full dataset.
`n.plt`	The expected pilot sample size for two-step methods. For one-step methods (`criterion = "Uni"` or `"BL-Uni"`), the expected sample size is `n.plt + n.ssp`.
`n.ssp`	The expected second-step subsample size. For Poisson subsampling, the actual sample size may vary.
`family`	A character string naming a family, a family function, or the result of a call to a family function. Supported families include `"binomial"`, `"quasibinomial"`, `"poisson"`, `"quasipoisson"`, `"gaussian"`, and `"Gamma"`.
`criterion`	The subsampling criterion. Choices include: `"BL-Uni"` (default): probabilities proportional to the balance score. `"Uni"`: uniform Poisson subsampling. `"Lopt"`: classical L-optimality. `"Aopt"`: classical A-optimality. `"R-Lopt"`: rareness-aware L-optimality. `"BL-Lopt"`: balance score combined with L-optimality.
`sampling.method`	The sampling method. Currently only `"poisson"` is supported.
`objective.weight.plt`	Objective weighting for the pilot fit. Use `"weighted"` for inverse-probability weighting or `"unweighted"` for an unweighted pilot objective. Unweighted pilot fitting is not allowed when the pilot sampling probability depends on the response.
`objective.weight`	Objective weighting for the one-step or second-step fit. Two-step methods currently require `"weighted"`.
`control`	A list of parameters controlling the sampling process. Supported entries include: `alpha`: mixture weight between optimal and uniform probabilities. `b`: pilot-based truncation tuning parameter. `poi.method`: `"exact"` or `"estimated"` for Poisson probability normalization.
`contrasts`	Optional list specifying how categorical variables are encoded in the design matrix.
`balance.X.plt`	Logical. Whether to use balance-score sampling for the pilot sample in two-step methods.
`balance.Y.plt`	Logical. Whether to balance the binary response in the pilot sample. Ignored for non-binary response families.
`balance.Y.ssp`	Logical. For one-step `"Uni"` and `"BL-Uni"` methods, whether to allocate the expected sample size across `Y = 0` and `Y = 1` groups in a case-control style. Ignored for two-step optimality criteria and non-binary response families.
`balance.Y.all`	Logical. Whether to include all `Y = 1` observations and subsample from `Y = 0`. Ignored for non-binary response families.
`record.stage.time`	Logical. Whether to store timing for major internal stages in the returned object.
`rareFeature.index`	Rare-feature columns, given as numeric indices or character names. Numeric values index the covariates in the order they appear in the model, not the columns of the design matrix: if the model contains an intercept, the function internally shifts the indices to account for the intercept column. Character values are matched to design-matrix column names. If `NULL`, rare binary features are detected automatically using `rareThreshold`.
`rareThreshold`	Prevalence threshold used to automatically identify rare binary features, and to warn when user-supplied rare features have prevalence at or above the threshold.
`na.action`	Currently accepted for interface compatibility.
`...`	Additional arguments passed to `glm.fit()` or `lm.wfit()`.
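
The automatic detection described for `rareFeature.index` and `rareThreshold` can be pictured with the base-R sketch below. It only approximates the idea: the helper `find_rare_binary()` and the toy data are hypothetical, and the package's own rule (for example, its handling of factors or of boundary cases) may differ.

```r
# Hypothetical helper, not a function from the subsampling package.
# Flags design-matrix columns that take only the values 0 and 1 and whose
# prevalence of ones is positive but strictly below the threshold.
find_rare_binary <- function(X, threshold = 0.09) {
  is_binary <- apply(X, 2, function(col) all(col %in% c(0, 1)))
  prevalence <- colMeans(X)
  # The intercept column is binary but has prevalence 1, so it is not flagged.
  is_rare <- is_binary & prevalence > 0 & prevalence < threshold
  names(which(is_rare))
}

# Deterministic toy data: two rare binary columns, one common binary column,
# and one continuous column.
N <- 1000
dat <- data.frame(
  Z1 = rep(c(1, 0), times = c(30, 970)),         # prevalence 0.03
  Z2 = rep(c(0, 1, 0), times = c(500, 60, 440)), # prevalence 0.06
  D  = rep(c(0, 1), times = 500),                # prevalence 0.50
  X1 = seq_len(N) / N                            # continuous
)
X <- model.matrix(~ ., data = dat)  # columns: (Intercept), Z1, Z2, D, X1

rare_cols <- find_rare_binary(X)
print(rare_cols)

# With an intercept, covariate indices 1:2 correspond to design-matrix
# columns 2:3, which is the shift described for numeric rareFeature.index.
covariate_index <- 1:2
shifted_names <- colnames(X)[covariate_index + 1]
print(shifted_names)
```

In this toy setup both the threshold rule and the shifted numeric indices point at `Z1` and `Z2`, while the common binary column `D` and the continuous column `X1` are left out.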

Details

Two-step criteria (`"Lopt"`, `"Aopt"`, `"R-Lopt"`, and `"BL-Lopt"`) draw a pilot sample, compute second-step Poisson probabilities, fit the second-step weighted GLM, and then refit on the union of the pilot and second-step samples. One-step criteria (`"Uni"` and `"BL-Uni"`) draw a single Poisson subsample with expected size `n.plt + n.ssp`.
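
The one-step uniform case can be sketched in base R as follows. This is a simplified illustration rather than the package's implementation: it assumes a constant inclusion probability, fits the model with `glm()` and inverse-probability weights, and uses variable names chosen here for illustration. It does not reproduce the balance score or any of the optimality criteria.

```r
# Simplified sketch of uniform Poisson subsampling with an
# inverse-probability-weighted fit; not the package's internal code.
set.seed(3)
N <- 5000
X1 <- rnorm(N)
Y <- rbinom(N, 1, plogis(-0.5 + 0.8 * X1))
full <- data.frame(Y, X1)

n_expected <- 500                # plays the role of n.plt + n.ssp
pi_i <- rep(n_expected / N, N)   # uniform inclusion probabilities

# Poisson subsampling: each row is kept independently with probability pi_i,
# so the realized subsample size is random with mean n_expected.
keep <- runif(N) <= pi_i
sub <- full[keep, ]
sub$w <- 1 / pi_i[keep]          # inverse-probability weights

# quasibinomial avoids the warning that binomial can raise when the
# weights are not integers.
fit <- glm(Y ~ X1, data = sub, family = quasibinomial(), weights = w)

print(sum(keep))
print(coef(fit))
```

Because every row is selected independently, the realized subsample size differs from run to run, which is why the sample sizes in the arguments are described as expected sizes.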

Value

An object of class "ssp.glm.rF" containing fitted coefficients, covariance estimates, selected row indices, rare-feature counts, response-composition summaries, and optional stage timings.

Examples

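# Simulate a logistic model with two rare binary features (Z1, Z2)
# and two continuous covariates (X1, X2).
set.seed(2)
N <- 1000
Z1 <- rbinom(N, 1, 0.04)  # prevalence about 4%
Z2 <- rbinom(N, 1, 0.07)  # prevalence about 7%
X1 <- rnorm(N)
X2 <- rnorm(N)
eta <- 0.5 + 0.5 * Z1 + 0.5 * Z2 + 0.5 * X1 + 0.5 * X2
Y <- rbinom(N, 1, plogis(eta))
data <- data.frame(Y, Z1, Z2, X1, X2)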

# One-step balanced subsampling; rare features given by numeric index.
fit_bl <- ssp.glm.rF(
  Y ~ .,
  data = data,
  n.plt = 100,
  n.ssp = 150,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = 1:2
)
summary(fit_bl)

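# Two-step rareness-aware L-optimality with a balance-score pilot sample;
# rare features given by design-matrix column name.
fit_rl <- ssp.glm.rF(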
  Y ~ .,
  data = data,
  n.plt = 100,
  n.ssp = 150,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2")
)
summary(fit_rl)

subsampling documentation built on June 21, 2026, 5:10 p.m.