```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```r
library(subsampling)
```
This vignette introduces the usage of `ssp.relogit`. The statistical theory and algorithms behind this implementation can be found in the referenced papers.
Logistic regression estimates $\beta$ by maximizing the log-likelihood function

$$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left[y_i \beta^{\top} x_i - \log\left(1 + e^{\beta^\top x_i}\right) \right]. $$
- **Full dataset**: the whole dataset used as input.
- **Full data estimator**: the estimator obtained by fitting the model on the full dataset.
- **Subsample**: a subset of observations drawn from the full dataset.
- **Subsample estimator**: the estimator obtained by fitting the model on the subsample.
- **Subsampling probability** ($\pi$): the probability assigned to each observation for inclusion in the subsample.
- **Rare events**: observations with $Y=1$ (positive instances).
- **Non-rare events**: observations with $Y=0$ (negative instances).
The idea of subsampling methods is as follows: instead of fitting the model on the full dataset of size $N$, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator at reduced computational cost.

For logistic regression with rare events, @wang2021nonuniform shows that the available information is tied to the number of positive instances rather than to the full data size. Based on this insight, one can keep all the rare instances and subsample only the non-rare instances to reduce the computational cost, as sketched below.
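As a toy illustration of this idea (uniform sampling of the non-rare cases only; this is a sketch, not the package's implementation):

```r
# Keep every rare observation and subsample only the non-rare ones.
set.seed(1)
y <- rbinom(1e4, 1, 0.01)                    # about 1% rare events
n.neg <- 500                                 # expected non-rare subsample size
pi0 <- n.neg / sum(y == 0)                   # uniform probability for Y = 0
keep <- (y == 1) | (runif(length(y)) < pi0)  # rare events kept with prob. 1
table(y = y, kept = keep)                    # every y = 1 row is kept
```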
We introduce the basic usage of `ssp.relogit` with simulated data. $X$ contains $d=6$ covariates drawn from a multivariate normal distribution and $Y$ is the binary response variable. The full data size is $N = 2 \times 10^4$. Denote $N_{1} = \sum_{i=1}^N y_i$ as the count of rare observations and $N_{0} = N - N_{1}$ as the count of non-rare observations.
```r
set.seed(2)
N <- 2 * 1e4
beta0 <- c(-6, -rep(0.5, 6))
d <- length(beta0) - 1
corr <- 0.5
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))  # AR(1)-type covariance structure
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
print(paste('N: ', N))
print(paste('sum(Y): ', sum(Y)))
n.plt <- 200
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep = ""))
formula <- Y ~ .
```
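For later comparison, the full data estimator can be obtained by fitting `glm` on the complete dataset; this is the baseline that the subsample estimator approximates:

```r
# Full data estimator: logistic regression on all N observations.
fit.full <- glm(formula, family = binomial, data = data)
summary(fit.full)$coefficients
```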
The function usage is
```r
ssp.relogit(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  criterion = "optL",
  likelihood = "logOddsCorrection",
  control = list(...),
  contrasts = NULL,
  ...
)
```
The core functionality of `ssp.relogit` revolves around three key questions:

- How are subsampling probabilities computed? (Controlled by the `criterion` argument.)
- How is the subsample drawn?
- How is the likelihood adjusted to correct for bias? (Controlled by the `likelihood` argument.)
Unlike `ssp.glm`, which allows both `withReplacement` and `poisson` as options for `sampling.method`, `ssp.relogit` uses `poisson` as the default sampling method. `poisson` draws the subsample one observation at a time, comparing each subsampling probability with a realization of a uniform random variable $U(0,1)$. The actual size of the drawn subsample is random, but its expected size is `n.ssp`. A minimal sketch of this sampling step is shown below.
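This illustrative sketch (not the package's internal code) shows the mechanics of poisson sampling:

```r
# Each observation is included when a uniform draw falls below its
# subsampling probability, so the realized subsample size is random
# with expectation sum(prob).
prob <- rep(0.05, 100)                 # toy inclusion probabilities
included <- runif(length(prob)) < prob
which(included)                        # indices of the drawn observations
c(expected = sum(prob), actual = sum(included))
```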
### `criterion`
The choices of `criterion` include `optA`, `optL` (default), `LCC`, and `uniform`. The optimal subsampling criteria `optA` and `optL`, proposed by @wang2018optimal, are derived by minimizing the asymptotic covariance of the subsample estimator. `LCC` and `uniform` are baseline methods.
Note that for rare observations ($Y=1$) in the full data, the sampling probabilities are $1$. For non-rare observations, the sampling probabilities depend on the choice of `criterion`, as sketched below.
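As a rough sketch of how such probabilities could be formed, based on the $|y_i - p_i| \lVert x_i \rVert$ structure of `optL` in @wang2018optimal (the scaling here is illustrative, not the package's exact formula, and `beta.plt` denotes a pilot estimate):

```r
# optL-style probabilities in the rare-events setting: rare observations get
# probability 1; non-rare probabilities are proportional to |y - p| * ||x||
# evaluated at a pilot estimate, scaled so about n.ssp non-rare rows are drawn.
optL.prob <- function(X, y, beta.plt, n.ssp) {
  p <- plogis(drop(cbind(1, X) %*% beta.plt))  # pilot predicted probabilities
  score <- abs(y - p) * sqrt(rowSums(X ^ 2))   # |y - p| * ||x||
  prob <- n.ssp * score / sum(score[y == 0])   # scale the non-rare part
  prob[y == 1] <- 1                            # keep all rare events
  pmin(prob, 1)
}
```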
### `likelihood`
The available choices for `likelihood` include `weighted` and `logOddsCorrection` (default). Both likelihood functions yield unbiased estimators. Theoretical results indicate that `logOddsCorrection` is more efficient than `weighted` in the context of rare-events logistic regression; see @wang2021nonuniform.
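Both corrections can be sketched in base R. The following is an illustration of the estimating ideas under uniform negative sampling, following @wang2021nonuniform; it is not the package's internal code:

```r
# Simulate rare-event data and draw a subsample that keeps all rare events.
set.seed(3)
x <- rnorm(1e4)
y <- rbinom(1e4, 1, plogis(-5 + x))
pi0 <- 0.05                                   # inclusion prob. for y = 0
keep <- (y == 1) | (runif(1e4) < pi0)
y.s <- y[keep]; x.s <- x[keep]

# 'weighted': inverse-probability weighted pseudo-likelihood
# (quasibinomial avoids the non-integer-weights warning).
w <- ifelse(y.s == 1, 1, 1 / pi0)
fit.w <- glm(y.s ~ x.s, family = quasibinomial, weights = w)

# 'logOddsCorrection': conditional likelihood given inclusion. With all rare
# events kept, P(y = 1 | x, included) = p / (p + (1 - p) * pi0), i.e. a
# logistic model with offset -log(pi0) on every row.
fit.o <- glm(y.s ~ x.s, family = binomial, offset = rep(-log(pi0), sum(keep)))
rbind(weighted = coef(fit.w), logOdds = coef(fit.o))
```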
After drawing the subsample, `ssp.relogit` uses `survey::svyglm` to fit the model on the subsample, which ultimately calls `glm`. Arguments accepted by `svyglm` can be passed through `...` in `ssp.relogit`.
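For example, since `svyglm` forwards additional arguments to `glm`, starting values for the coefficients could be supplied via the standard `glm` argument `start` (illustrative usage; whether a given argument is accepted depends on `svyglm` and `glm`):

```r
# Pass a glm argument through '...': starting values for the d + 1 coefficients.
ssp.start <- ssp.relogit(formula = formula,
                         data = data,
                         n.plt = n.plt,
                         n.ssp = n.ssp,
                         start = rep(0, d + 1))
```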
Below is an example demonstrating the use of `ssp.relogit`.
```r
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.relogit(formula = formula,
                           data = data,
                           n.plt = n.plt,
                           n.ssp = n.ssp,
                           criterion = 'optA',
                           likelihood = 'logOddsCorrection')
```
The returned object contains estimation results and the indices of the drawn subsample in the full dataset.
```r
names(ssp.results)
```
Some key returned variables:

- `index.plt` and `index` are the row indices of the drawn pilot subsample and the optimal subsample in the full data (see the sketch after this list).
- `coef.ssp` is the subsample estimator of $\beta$, and `coef` is a linear combination of `coef.plt` (the pilot estimator) and `coef.ssp`.
- `cov.ssp` and `cov` are the estimated covariance matrices of `coef.ssp` and `coef`.
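For example, the returned indices can be used to retrieve the drawn subsample from the full data:

```r
# Rows of the full data that form the optimal subsample; since rare events
# are sampled with probability 1, all N1 positive cases should appear here.
sub.data <- data[ssp.results$index, ]
table(sub.data$Y)
```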
```r
summary(ssp.results)
```
In the printed results, `Expected Subsample Size` is the sum of the rare-event count ($N_{1}$) and the expected size of the negative subsample drawn from the $N_{0}$ non-rare observations. `Actual Subsample Size` is the sum of $N_{1}$ and the actual size of the negative subsample drawn from the $N_{0}$ non-rare observations.
The coefficients and standard errors printed by `summary()` are `coef` and the square roots of `diag(cov)`, which can also be computed directly, as shown below.
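These quantities can be reproduced from the returned components described above, for example to form Wald-type confidence intervals:

```r
# Standard errors from the covariance matrix of 'coef', plus 95% Wald CIs.
se <- sqrt(diag(ssp.results$cov))
cbind(estimate = ssp.results$coef,
      se = se,
      lower = ssp.results$coef - 1.96 * se,
      upper = ssp.results$coef + 1.96 * se)
```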