ssp.glm: Optimal Subsampling Methods for Generalized Linear Models

View source: R/general_glm_main_function.R

ssp.glmR Documentation

Optimal Subsampling Methods for Generalized Linear Models

Description

Draw subsample from full dataset and fit a generalized linear model (GLM) on the subsample. For a quick start, refer to the vignette.

Usage

ssp.glm(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  family = "binomial",
  criterion = "optL",
  sampling.method = "poisson",
  likelihood = "weighted",
  control = list(...),
  contrasts = NULL,
  ...
)

Arguments

formula

A model formula object of class "formula" that describes the model to be fitted.

data

A data frame containing the variables in the model. Denote N as the number of observations in data.

subset

An optional vector specifying a subset of observations from data to use for the analysis. This subset will be viewed as the full data.

n.plt

The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.

n.ssp

The expected size of the optimal subsample (second-step subsample). For sampling.method = 'withReplacement', The exact subsample size is n.ssp. For sampling.method = 'poisson', n.ssp is the expected subsample size.

family

family can be a character string naming a family function, a family function or the result of a call to a family function.

criterion

The choices include optA, optL(default), LCC and uniform.

  • optA Minimizes the trace of the asymptotic covariance matrix of the subsample estimator.

  • optL Minimizes the trace of a transformation of the asymptotic covariance matrix. The computational complexity of optA is O(N d^2) while that of optL is O(N d).

  • LCC Local Case-Control sampling probability, used as a baseline subsampling strategy.

  • uniform Assigns equal subsampling probability \frac{1}{N} to each observation, serving as a baseline subsampling strategy.

sampling.method

The sampling method to use. Options include withReplacement and poisson (default). withReplacement draws exactly n.ssp subsamples from size N full dataset with replacement, using the specified subsampling probabilities. poisson draws observations independently by comparing each subsampling probability with a realization of uniform random variable U(0,1).

Differences between methods:

  • Sample size: withReplacement draws exactly n.ssp subsamples while poisson draws subsamples with expected size n.ssp, meaning the actual size may vary.

  • Memory usage: withReplacement requires the entire dataset to be loaded at once, while poisson allows for processing observations sequentially (will be implemented in future version).

  • Estimator performance: Theoretical results show that the poisson tends to get a subsample estimator with lower asymptotic variance compared to the withReplacement

likelihood

The likelihood function to use. Options include weighted (default) and logOddsCorrection. A bias-correction likelihood function is required for subsample since unequal subsampling probabilities introduce bias.

  • weighted Applies a weighted likelihood function where each observation is weighted by the inverse of its subsampling probability.

  • logOddsCorrection This lieklihood is available only for logistic regression model (i.e., when family is binomial or quasibinomial). It uses a conditional likelihood, where each element of the likelihood represents the probability of Y=1, given that this subsample was drawn.

control

The argument control contains two tuning parameters alpha and b.

  • alpha \in [0,1] is the mixture weight of the user-assigned subsampling probability and uniform subsampling probability. The actual subsample probability is \pi = (1-\alpha)\pi^{opt} + \alpha \pi^{uni}. This protects the estimator from extreme small subsampling probability. The default value is 0.

  • b is a positive number which is used to constaint the poisson subsampling probability. b close to 0 results in subsampling probabilities closer to uniform probability \frac{1}{N}. b=2 is the default value. See relevant references for further details.

contrasts

An optional list. It specifies how categorical variables are represented in the design matrix. For example, contrasts = list(v1 = 'contr.treatment', v2 = 'contr.sum').

...

A list of parameters which will be passed to svyglm().

Details

A pilot estimator for the unknown parameter \beta is required because both optA and optL subsampling probabilities depend on \beta. There is no "free lunch" when determining optimal subsampling probabilities. Fortunately the pilot estimator only needs to satisfy mild conditions. For logistic regression, this is achieved by drawing a size n.plt subsample with replacement from full dataset. The case-control subsample probability is applied, that is, \pi_i = \frac{1}{2N_1} for Y_i=1 and \pi_i = \frac{1}{2N_0} for Y_i=0, i=1,...,N, whereN_0 and N_1 are the counts of observations with Y = 0 and Y = 1, respectively. For other families, uniform subsampling probabilities are applied. Typically, n.plt is relatively small compared to n.ssp.

When criterion = 'uniform', there is no need to compute the pilot estimator. In this case, a size n.plt + n.ssp subsample will be drawn with uniform sampling probability and coef is the corresponding estimator.

As suggested by survey::svyglm(), for binomial and poisson families, use family=quasibinomial() and family=quasipoisson() to avoid a warning "In eval(family$initialize) : non-integer #successes in a binomial glm!". The quasi versions of the family objects give the same point estimates and suppress the warning. Since subsampling methods only rely on point estimates from svyglm() for further computation, using the quasi families does not introduce any issues.

For Gamma family, ssp.glm returns only the estimation of coefficients, as the dispersion parameter is not estimated.

Value

ssp.glm returns an object of class "ssp.glm" containing the following components (some are optional):

model.call

The original function call.

coef.plt

The pilot estimator. See Details for more information.

coef.ssp

The estimator obtained from the optimal subsample.

coef

The weighted linear combination of coef.plt and coef.ssp. The combination weights depend on the relative size of n.plt and n.ssp and the estimated covariance matrices of coef.plt and coef.ssp. We blend the pilot subsample information into optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are coef and the square root of diag(cov).

cov.ssp

The covariance matrix of coef.ssp.

cov

The covariance matrix of coef.

index.plt

Row indices of pilot subsample in the full dataset.

index.ssp

Row indices of of optimal subsample in the full dataset.

N

The number of observations in the full dataset.

subsample.size.expect

The expected subsample size, equals to n.ssp for ssp.glm. Note that for other functions, such as ssp.relogit, this value may differ.

terms

The terms object for the fitted model.

References

Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of machine learning research, 20(132), 1-59.

Ai, M., Yu, J., Zhang, H., & Wang, H. (2021). Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31(2), 749-772.

Wang, H., & Kim, J. K. (2022). Maximum sampled conditional likelihood for informative subsampling. Journal of machine learning research, 23(332), 1-50.

Examples

# logistic regression
set.seed(2)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax  <- matrix(corr, d, d) + diag(1-corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 500
n.ssp <- 1000
subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "logOddsCorrection")
summary(subsampling.results)
subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial', 
criterion = "optL",
sampling.method = 'withReplacement', 
likelihood = "weighted")
summary(subsampling.results)
Uni.subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial', 
criterion = 'uniform')
summary(Uni.subsampling.results)
####################
# poisson regression
set.seed(1)
N <-  1e4
beta0 <- rep(0.5, 7)
d <- length(beta0) - 1
X <- matrix(runif(N * d), N, d)
epsilon <- runif(N)
lambda <- exp(beta0[1] + X %*% beta0[-1])
Y <- rpois(N, lambda)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 600
subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = "optL", 
sampling.method = 'poisson',
likelihood = "weighted")
summary(subsampling.results)
subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson', 
criterion = "optL", 
sampling.method = 'withReplacement',
likelihood = "weighted")
summary(subsampling.results)
Uni.subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson', 
criterion = 'uniform')
summary(Uni.subsampling.results)
##################
# gamma regression
set.seed(1)
N <- 1e4
p <- 3
beta0 <- rep(0.5, p + 1)
d <- length(beta0) - 1
shape <- 2
X <- matrix(runif(N * d), N, d)
link_function <- function(X, beta0) 1 / (beta0[1] + X %*% beta0[-1])
scale <- link_function(X, beta0) / shape
Y <- rgamma(N, shape = shape, scale = scale)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 1000
subsampling.results <- ssp.glm(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'Gamma',
criterion = "optL", 
sampling.method = 'poisson',
likelihood = "weighted")
summary(subsampling.results)

subsampling documentation built on April 12, 2025, 1:50 a.m.