View source: R/general_glm_main_function.R
ssp.glm | R Documentation |
Draw subsample from full dataset and fit a generalized linear model (GLM) on the subsample. For a quick start, refer to the vignette.
ssp.glm(
formula,
data,
subset = NULL,
n.plt,
n.ssp,
family = "binomial",
criterion = "optL",
sampling.method = "poisson",
likelihood = "weighted",
control = list(...),
contrasts = NULL,
...
)
formula |
A model formula object of class "formula" that describes the model to be fitted. |
data |
A data frame containing the variables in the model. Denote |
subset |
An optional vector specifying a subset of observations from |
n.plt |
The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities. |
n.ssp |
The expected size of the optimal subsample (second-step subsample). For |
family |
|
criterion |
The choices include
|
sampling.method |
The sampling method to use. Options include Differences between methods:
|
likelihood |
The likelihood function to use. Options include
|
control |
The argument
|
contrasts |
An optional list. It specifies how categorical variables are represented in the design matrix. For example, |
... |
A list of parameters which will be passed to |
A pilot estimator for the unknown parameter \beta
is required because both optA and
optL subsampling probabilities depend on \beta
. There is no "free lunch" when determining optimal subsampling probabilities. Fortunately the
pilot estimator only needs to satisfy mild conditions. For logistic regression, this
is achieved by drawing a size n.plt
subsample with replacement from full
dataset. The case-control subsample probability is applied, that is, \pi_i =
\frac{1}{2N_1}
for Y_i=1
and \pi_i = \frac{1}{2N_0}
for Y_i=0
,
i=1,...,N
, whereN_0
and N_1
are the counts of observations with Y = 0
and Y = 1
, respectively. For other
families, uniform subsampling probabilities are applied. Typically, n.plt
is
relatively small compared to n.ssp
.
When criterion = 'uniform'
, there is no need to compute the pilot estimator. In this case, a size n.plt + n.ssp
subsample will be drawn with uniform sampling probability and coef
is the corresponding estimator.
As suggested by survey::svyglm()
, for binomial and poisson families, use family=quasibinomial()
and family=quasipoisson()
to avoid a warning "In eval(family$initialize) : non-integer #successes in a binomial glm!". The quasi versions of the family objects give the same point estimates and suppress the warning. Since subsampling methods only rely on point estimates from svyglm() for further computation, using the quasi families does not introduce any issues.
For Gamma family, ssp.glm
returns only the estimation of coefficients, as the dispersion parameter is not estimated.
ssp.glm
returns an object of class "ssp.glm" containing the following components (some are optional):
The original function call.
The pilot estimator. See Details for more information.
The estimator obtained from the optimal subsample.
The weighted linear combination of coef.plt
and coef.ssp
. The combination weights depend on the relative size of n.plt
and n.ssp
and the estimated covariance matrices of coef.plt
and coef.ssp.
We blend the pilot subsample information into optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are coef
and the square root of diag(cov)
.
The covariance matrix of coef.ssp
.
The covariance matrix of coef
.
Row indices of pilot subsample in the full dataset.
Row indices of of optimal subsample in the full dataset.
The number of observations in the full dataset.
The expected subsample size, equals to n.ssp
for ssp.glm.
Note that for other functions, such as ssp.relogit, this value may differ.
The terms object for the fitted model.
Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of machine learning research, 20(132), 1-59.
Ai, M., Yu, J., Zhang, H., & Wang, H. (2021). Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31(2), 749-772.
Wang, H., & Kim, J. K. (2022). Maximum sampled conditional likelihood for informative subsampling. Journal of machine learning research, 23(332), 1-50.
# logistic regression
set.seed(2)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1-corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 500
n.ssp <- 1000
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "logOddsCorrection")
summary(subsampling.results)
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = "optL",
sampling.method = 'withReplacement',
likelihood = "weighted")
summary(subsampling.results)
Uni.subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'uniform')
summary(Uni.subsampling.results)
####################
# poisson regression
set.seed(1)
N <- 1e4
beta0 <- rep(0.5, 7)
d <- length(beta0) - 1
X <- matrix(runif(N * d), N, d)
epsilon <- runif(N)
lambda <- exp(beta0[1] + X %*% beta0[-1])
Y <- rpois(N, lambda)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 600
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "weighted")
summary(subsampling.results)
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = "optL",
sampling.method = 'withReplacement',
likelihood = "weighted")
summary(subsampling.results)
Uni.subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = 'uniform')
summary(Uni.subsampling.results)
##################
# gamma regression
set.seed(1)
N <- 1e4
p <- 3
beta0 <- rep(0.5, p + 1)
d <- length(beta0) - 1
shape <- 2
X <- matrix(runif(N * d), N, d)
link_function <- function(X, beta0) 1 / (beta0[1] + X %*% beta0[-1])
scale <- link_function(X, beta0) / shape
Y <- rgamma(N, shape = shape, scale = scale)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 1000
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'Gamma',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "weighted")
summary(subsampling.results)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.