ssp.glm.rF: Balanced Subsampling Methods for Generalized Linear Models...

View source: R/glm_rF_main_function.R

ssp.glm.rFR Documentation

Balanced Subsampling Methods for Generalized Linear Models with Rare Features

Description

This function inherits the weighted objective function (likelihood = "weighted") and the poisson sampling method (sampling.method = "poisson") from ssp.glm, while additionally handling rare features in the model. Rare features refer to binary covariates with low prevalence of being one (expressed) and mostly zero. Such features create challenges for subsampling, because it is likely to miss these rare but informative observations. The balanced subsampling method upweights observations that contain expressed rare features, thereby preserving estimation efficiency for the coefficients of rare features.

A quick start guide is provided in the vignette: https://dqksnow.github.io/subsampling/articles/ssp-logit-rF.html.

Usage

ssp.glm.rF(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  family = "binomial",
  criterion = "BL-Uni",
  sampling.method = "poisson",
  likelihood = "weighted",
  balance.plt = TRUE,
  balance.Y = FALSE,
  rareFeature.index = NULL,
  control = list(...),
  contrasts = NULL,
  ...
)

Arguments

formula

A model formula object.

data

A data frame containing the variables in the model.

subset

An optional vector specifying a subset of observations to be used as the full dataset.

n.plt

The pilot sample size for computing the pilot estimator and estimating optimal subsampling probabilities.

n.ssp

The expected size of the optimal subsample. For sampling.method = "poisson", the actual sample size may vary, but the expected size equals n.ssp.

family

A character string naming a family. It can be a character string naming a family function, a family function or the result of a call to a family function.

criterion

The subsampling criterion. Choices include:

  • "BL-Uni" (default): probabilities proportional to the balance score.

  • "R-Lopt": rareness-aware L-optimality.

  • "Aopt": classical A-optimality, minimizing the trace of the asymptotic covariance matrix.

  • "Lopt": classical L-optimality, minimizing a transformed trace of the asymptotic covariance.

  • "BL-Lopt": balance score combined with L-optimality.

  • "Uni": uniform sampling.

sampling.method

The sampling method. Currently "poisson" is supported, which avoids drawing repeated observations.

likelihood

The objective function used for estimation. Currently "weighted" is supported. Each sampled observation is weighted by the inverse of its subsampling probability.

balance.plt

Logical. Whether to use "BL-Uni" to draw the pilot sample for two-step subsampling methods. Default is TRUE. A good pilot estimator significantly improves the performance of the second-step subsampling; a poor pilot estimator may cause failure.

balance.Y

Logical. Whether to balance the binary response variable in logistic regression. If TRUE, the pilot sampling probability combines the balance score with the case-control method, and the negative subsampling will be performed for the second-step subsampling method. That is, automatically include all Y=1 observations to the subsample, while performing subsampling for Y=0 observations.

rareFeature.index

Column indices of binary rare features in the design matrix, coded as 1 for the rare case. For example, c(4, 9) indicates that columns 4 and 9 of the design matrix contain rare features.

control

A list passed to glm.control(). It includes:

  • alpha: mixture weight between optimal and uniform probabilities, giving \pi = (1 - \alpha)\pi^{opt} + \alpha \pi^{uni}.

contrasts

Optional list specifying how categorical variables are encoded in the design matrix.

...

Additional arguments passed to svyglm().

Details

See the package vignette for more details and examples.

Value

An object of class "ssp.glm.rF" containing the following fields.

model.call

The original function call.

coef.plt

Pilot estimator obtained from the pilot subsample.

coef.ssp

Estimator obtained from the optimal subsample.

coef.cmb

Combined estimator using the union of pilot and optimal subsamples. If criterion = "BL-Uni" or "Uni", this equals the pilot and subsample estimators because only one sampling step is used.

cov.plt

Estimated covariance matrix of coef.plt.

cov.ssp

Estimated covariance matrix of coef.ssp.

cov.cmb

Estimated covariance matrix of coef.cmb.

N

Number of observations in the full dataset.

subsample.size.expect

Expected subsample size (n.ssp).

subsample.size.actual

Actual number of subsampled observations.

full.rare.count

For each rare feature, return the counts of ones in the full dataset.

rare.count.plt

Vector of length K giving, for each rare feature, the number of ones in the pilot subsample.

rare.count.ssp

Same as above, computed for the optimal subsample.

rare.count.cmb

Same as above, computed for the combined subsample.

index.plt

Row indices of the pilot subsample within the full dataset.

index.ssp

Row indices of the optimal subsample within the full dataset.

index.cmb

Union of pilot and optimal subsample row indices.

rareFeature.index

Column indices of rare features in the design matrix (same as the user input).

comp.time

Total computation time for computing sampling probabilities, drawing subsamples, and fitting the subsample estimator.

terms

The terms object for the fitted model.

Examples

# logistic regression
set.seed(2)
N <- 1e4
d_rare <- 3
d_cont <- 2
p_rare <- c(0.01, 0.02, 0.05)
beta0 <- c(0.5, rep(0.5, d_rare), rep(0.5, d_cont)) 
corr <- 0.5
sigmax  <- matrix(corr, d_cont, d_cont) + diag(1-corr, d_cont)
X <- MASS::mvrnorm(N, rep(0, d_cont), sigmax)
Z <- do.call(cbind, lapply(seq_along(p_rare), function(i) {
rbinom(N, 1, p_rare[i])
}))
X <- cbind(Z, X)
P <- 1 / (1 + exp(-(beta0[1] + X %*% beta0[-1])))
Y <- as.integer(rbinom(N, 1, P))
colnames(X) <- paste0("X", 1:(d_rare + d_cont))
rareFeature.index <- c(1:d_rare)
data <- data.frame(Y = Y, X)
formula <- Y ~ .
n.plt <- 300
n.ssp <- 2000
BL.Uni.results <- ssp.glm.rF(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'BL-Uni',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(BL.Uni.results)
R.Lopt.results <- ssp.glm.rF(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'R-Lopt',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(R.Lopt.results)

subsampling documentation built on March 11, 2026, 1:06 a.m.