```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This vignette introduces the usage of `ssp.glm`, using logistic regression as an example of generalized linear models (GLMs). The statistical theory and algorithms behind this implementation can be found in the referenced papers.
Fitting a GLM amounts to solving
$$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left\{y_i u(\beta^{\top} x_i) - \psi \left[ u(\beta^{\top} x_i) \right] \right\}, $$
where $u$ and $\psi$ are known functions that depend on the chosen distribution from the exponential family. For the binomial distribution, the log-likelihood becomes
$$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left[y_i \beta^{\top} x_i - \log\left(1 + e^{\beta^\top x_i}\right) \right]. $$

The idea of subsampling methods is as follows: instead of fitting the model on the full dataset of size $N$, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator at a reduced computational cost.
You can install the development version of `subsampling` from GitHub with:

```r
# install.packages("devtools")
devtools::install_github("dqksnow/Subsampling")
```

```r
library(subsampling)
```
Key terms used in this vignette:

- Full dataset: the whole dataset used as input.
- Full data estimator: the estimator obtained by fitting the model on the full dataset.
- Subsample: a subset of observations drawn from the full dataset.
- Subsample estimator: the estimator obtained by fitting the model on the subsample.
- Subsampling probability ($\pi$): the probability assigned to each observation for inclusion in the subsample.
We illustrate the usage of `ssp.glm` with simulated data. $X$ contains $d=6$ covariates drawn from a multivariate normal distribution, and $Y$ is the binary response variable. The full dataset size is $N = 1 \times 10^4$.
```r
set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1 - corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
head(data)
```
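To make the subsampling idea above concrete, here is a quick sketch (independent of `ssp.glm`) that compares the full data fit with a fit on a uniform subsample of 600 rows; uniform subsampling is cheap but typically less efficient than the optimal strategies described below.

```r
# Baseline comparison: full-data fit vs. a fit on a uniform random subsample.
fit.full <- glm(formula, family = binomial(), data = data)
idx.unif <- sample(N, 600)                      # 600 rows drawn uniformly
fit.unif <- glm(formula, family = binomial(), data = data[idx.unif, ])
rbind(full = coef(fit.full), uniform = coef(fit.unif))
```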
The function usage is:

```r
ssp.glm(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  family = "quasibinomial",
  criterion = "optL",
  sampling.method = "poisson",
  likelihood = "weighted",
  control = list(...),
  contrasts = NULL,
  ...
)
```
The core functionality of `ssp.glm` revolves around three key questions:

- How are subsampling probabilities computed? (Controlled by the `criterion` argument.)
- How is the subsample drawn? (Controlled by the `sampling.method` argument.)
- How is the likelihood adjusted to correct for bias? (Controlled by the `likelihood` argument.)
### `criterion`

The choices for `criterion` are `optA`, `optL` (default), `LCC`, and `uniform`. The optimal subsampling criteria `optA` and `optL`, proposed by @wang2018optimal, are derived by minimizing the asymptotic covariance of the subsample estimator. `LCC` and `uniform` are baseline methods.
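As a rough illustration of what the `optL` criterion computes for logistic regression (following @wang2018optimal), the sketch below builds the probabilities by hand from a pilot estimate; `ssp.glm` performs these steps internally, so the exact implementation details may differ.

```r
# Sketch of L-optimal (optL) subsampling probabilities for logistic regression:
# pi_i is proportional to |y_i - p_i(beta.plt)| * ||x_i||, where p_i is fitted
# from a pilot estimate on a small uniform subsample.
pilot.idx <- sample(N, 200)
beta.plt  <- coef(glm(formula, family = quasibinomial(), data = data[pilot.idx, ]))
p.hat   <- as.vector(1 / (1 + exp(-cbind(1, X) %*% beta.plt)))
score   <- abs(Y - p.hat) * sqrt(rowSums(X^2))
pi.optL <- score / sum(score)                 # normalized subsampling probabilities
```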
### `sampling.method`

The options for the `sampling.method` argument are `withReplacement` and `poisson` (default). `withReplacement` draws `n.ssp` observations from the full dataset with replacement, using the specified subsampling probabilities. `poisson` draws observations one by one, keeping each one whose subsampling probability exceeds a realization of a uniform random variable $U(0,1)$; the expected subsample size is `n.ssp`. See @wang2019more for details.
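The difference between the two schemes can be sketched directly, reusing the `pi.optL` probabilities from the sketch above (again, `ssp.glm` handles this internally):

```r
n.ssp <- 600
# With-replacement sampling: draw exactly n.ssp indices according to pi.optL.
idx.wr <- sample(N, n.ssp, replace = TRUE, prob = pi.optL)
# Poisson sampling: keep row i if a U(0,1) draw falls below (roughly) n.ssp * pi_i;
# the realized subsample size is random, with expectation n.ssp.
idx.poi <- which(runif(N) <= pmin(n.ssp * pi.optL, 1))
length(idx.poi)
```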
### `likelihood`

The available choices for `likelihood` are `weighted` (default) and `logOddsCorrection`. Both likelihood functions yield an unbiased estimator. Theoretical results indicate that `logOddsCorrection` is more efficient than `weighted` in the context of logistic regression; see @wang2022maximum.
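The `weighted` option corresponds to an inverse-probability-weighted fit, sketched below for the with-replacement draw from the previous sketch; `logOddsCorrection` modifies the likelihood itself and is not sketched here. `ssp.glm` performs this step via `survey::svyglm`, so this is only a rough approximation of its internals.

```r
# Weighted pseudo-likelihood on the subsample: weight each drawn row by the
# inverse of its expected number of draws, n.ssp * pi_i.
w.ssp <- 1 / (n.ssp * pi.optL[idx.wr])
fit.weighted <- glm(formula, family = quasibinomial(),
                    data = data[idx.wr, ], weights = w.ssp)
coef(fit.weighted)
```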
After drawing the subsample, `ssp.glm` uses `survey::svyglm` to fit the model on the subsample, which in turn calls `glm`. Arguments accepted by `svyglm` can be passed through `...` in `ssp.glm`.
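For example, `svyglm()` accepts a `start` argument with initial coefficient values; a hypothetical call forwarding it through `...` (assuming the forwarding behaves as described above) might look like:

```r
# Hypothetical: pass svyglm()'s `start` argument through `...` of ssp.glm().
ssp.start <- ssp.glm(formula = formula, data = data,
                     n.plt = 200, n.ssp = 600,
                     family = "quasibinomial",
                     start = rep(0, 7))   # initial values for the 7 coefficients
```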
Below are two examples demonstrating the use of `ssp.glm` with different configurations.
```r
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optL",
                       sampling.method = "withReplacement",
                       likelihood = "weighted")
summary(ssp.results)
```
```r
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optA",
                       sampling.method = "poisson",
                       likelihood = "logOddsCorrection")
summary(ssp.results)
```
As recommended by `survey::svyglm`, when working with binomial models it is advisable to use `family = quasibinomial()` to avoid a warning issued by `glm`; see the Details section of the `svyglm()` help documentation. The 'quasi' versions of the family objects give the same point estimates.
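The warning in question arises because the weighted fit uses non-integer weights; a small sketch of the behaviour with plain `glm` (using arbitrary non-integer weights purely for illustration):

```r
w <- runif(nrow(data), 0.5, 1.5)   # arbitrary non-integer weights, for illustration
fit.bin   <- glm(formula, family = binomial(),      data = data, weights = w) # warns
fit.quasi <- glm(formula, family = quasibinomial(), data = data, weights = w) # silent
all.equal(coef(fit.bin), coef(fit.quasi))   # identical point estimates
```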
The object returned by `ssp.glm` contains the estimation results and the indices of the drawn subsample in the full dataset.
```r
names(ssp.results)
```
Some key returned variables:

- `index.plt` and `index` are the row indices of the drawn pilot subsample and optimal subsample in the full data.
- `coef.ssp` is the subsample estimator of $\beta$, and `coef` is a linear combination of `coef.plt` (the pilot estimator) and `coef.ssp`.
- `cov.ssp` and `cov` are the estimated covariance matrices of `coef.ssp` and `coef`.
The coefficients and standard errors printed by `summary()` are `coef` and the square roots of `diag(cov)`. See the help documentation of `ssp.glm` for details.
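For instance, assuming the element names listed above, the table reported by `summary()` can be reconstructed from the returned object:

```r
# Rebuild the summary() table from the returned components.
est <- ssp.results$coef
se  <- sqrt(diag(ssp.results$cov))
cbind(Estimate = est, `Std. Error` = se)
```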
We also provide examples for Poisson regression and Gamma regression in the help documentation of `ssp.glm`. Note that `likelihood = logOddsCorrection` is currently implemented only for logistic regression (`family = binomial` or `quasibinomial`).
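For instance, a minimal Poisson-regression call might look like the sketch below; the simulated data and the exact argument values are hypothetical, so consult the help page for the full, tested examples.

```r
# Hypothetical Poisson-regression sketch; see the ssp.glm help page for details.
set.seed(2)
X.poi <- matrix(rnorm(1e4 * 3), 1e4, 3)
colnames(X.poi) <- paste0("V", 1:3)
data.poi <- data.frame(Y = rpois(1e4, exp(1 + X.poi %*% rep(0.3, 3))), X.poi)
ssp.poi <- ssp.glm(Y ~ ., data = data.poi, n.plt = 200, n.ssp = 600,
                   family = "poisson", criterion = "optL",
                   sampling.method = "poisson", likelihood = "weighted")
summary(ssp.poi)
```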