View source: R/rare_events_logistic_main_function.R
ssp.relogit | R Documentation |
Draw a subsample from the full dataset and fit a logistic regression model on the subsample. For a quick start, refer to the vignette.
ssp.relogit(
formula,
data,
subset = NULL,
n.plt,
n.ssp,
criterion = "optL",
likelihood = "logOddsCorrection",
control = list(...),
contrasts = NULL,
...
)
formula |
A model formula object of class "formula" that describes the model to be fitted. |
data |
A data frame containing the variables in the model. Denote |
subset |
An optional vector specifying a subset of observations from |
n.plt |
The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities. |
n.ssp |
The expected subsample size (the second-step subsample
size) drawn from those samples with |
criterion |
The choices include
|
likelihood |
The likelihood function to use. Options include
|
control |
The argument
|
contrasts |
An optional list. It specifies how categorical variables are represented in the design matrix. For example, |
... |
A list of parameters which will be passed to |
'Rare event' means that the number of observations with Y=1 is small compared to the number with Y=0 in the full data. For logistic regression with rare events, Wang et al. (2021) show that the available information is tied to the number of positive instances rather than to the full data size. Based on this insight, one can keep all the rare instances and subsample only the non-rare instances to reduce the computational cost. When criterion = optA, optL or LCC, all observations with Y=1 are preserved and n.ssp subsamples are drawn from the observations with Y=0. When criterion = uniform, (n.plt+n.ssp) subsamples are drawn from the full sample with equal sampling probability.
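The idea of keeping every Y=1 observation and subsampling only the Y=0 observations can be illustrated with a minimal base-R sketch. It uses uniform negative sampling with a log odds correction of the intercept, which is a simplification and not the optA, optL or LCC procedure of ssp.relogit. The simulated data, the sampling rate rho and all object names are assumptions made for illustration.

```r
set.seed(42)
N <- 20000
x <- rnorm(N)
beta <- c(-5, 1)                           # illustrative true intercept and slope
y <- rbinom(N, 1, plogis(beta[1] + beta[2] * x))

rho <- 0.05                                # assumed sampling rate for Y = 0
keep <- (y == 1) | (runif(N) < rho)        # keep every Y = 1, a fraction rho of Y = 0
sub <- data.frame(y = y[keep], x = x[keep])

# Fit an ordinary logistic regression on the subsample
fit <- glm(y ~ x, family = binomial(), data = sub)

# Dropping negatives at rate rho inflates the odds of Y = 1 by 1 / rho,
# so the fitted intercept is shifted back by log(rho)
coef.corrected <- coef(fit)
coef.corrected[1] <- coef.corrected[1] + log(rho)
coef.corrected
```

Only the intercept needs correcting in this uniform case, because the sampling rate does not depend on the covariates.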
A pilot estimator for the unknown parameter \beta is required because both the optA and optL subsampling probabilities depend on \beta. The pilot subsample is drawn half from the rare observations and half from the non-rare observations.
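A balanced pilot draw of this kind might look like the following base-R sketch. It is not the internal code of ssp.relogit: the simulated data, the object names (idx.pos, idx.neg, idx.plt, coef.pilot), sampling without replacement, and the case-control intercept correction are all assumptions made for illustration.

```r
set.seed(7)
N <- 20000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-4 + x))          # illustrative rare-event response

n.plt <- 200
idx.pos <- which(y == 1)                   # rare observations
idx.neg <- which(y == 0)                   # non-rare observations

# Half of the pilot subsample from Y = 1 and half from Y = 0
# (sampling without replacement is an assumption of this sketch)
idx.plt <- c(sample(idx.pos, n.plt / 2), sample(idx.neg, n.plt / 2))

pilot <- data.frame(y = y[idx.plt], x = x[idx.plt])
fit.plt <- glm(y ~ x, family = binomial(), data = pilot)

# Balanced sampling inflates the odds of Y = 1 by N0 / N1, so the fitted
# intercept is shifted back by log(N0 / N1)
coef.pilot <- coef(fit.plt)
coef.pilot[1] <- coef.pilot[1] - log(length(idx.neg) / length(idx.pos))
coef.pilot
```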
Most of the arguments and returned variables have meanings similar to those in ssp.glm. Refer to vignette
ssp.relogit returns an object of class "ssp.relogit" containing the following components (some are optional):
The original function call.
The pilot estimator. See Details for more information.
The estimator obtained from the optimal subsample.
The weighted linear combination of coef.plt and coef.ssp. The combination weights depend on the relative size of n.plt and n.ssp and on the estimated covariance matrices of coef.plt and coef.ssp. The pilot subsample information is blended into the optimal subsample estimator because the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are coef and the square root of diag(cov).
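As a rough illustration of how two estimators can be blended using their covariance matrices, the sketch below applies generic inverse-covariance (precision) weighting to two toy estimates. This is a textbook combination rule, not necessarily the formula used by ssp.relogit, and it ignores the dependence on n.plt and n.ssp. All numbers and object names (coef.a, cov.a, coef.b, cov.b, coef.cmb, cov.cmb) are hypothetical stand-ins.

```r
# Toy stand-ins for a pilot estimate and a second-step estimate
coef.a <- c(-5.1, 0.8)
cov.a  <- diag(c(0.04, 0.01))
coef.b <- c(-4.9, 0.7)
cov.b  <- diag(c(0.01, 0.0025))

# Precision matrices act as the combination weights
prec.a <- solve(cov.a)
prec.b <- solve(cov.b)

# Combined covariance and precision-weighted combination of the estimates
cov.cmb  <- solve(prec.a + prec.b)
coef.cmb <- drop(cov.cmb %*% (prec.a %*% coef.a + prec.b %*% coef.b))

# Standard errors are the square roots of the diagonal of the covariance
se.cmb <- sqrt(diag(cov.cmb))
coef.cmb
se.cmb
```

The more precise estimate (here the second one) receives the larger weight, so the combined values sit closer to it.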
The covariance matrix of coef.ssp.
The covariance matrix of the combined estimator coef.
Row indices of pilot subsample in the full dataset.
Row indices of the optimal subsample in the full dataset.
The number of observations in the full dataset.
The expected subsample size.
The terms object for the fitted model.
Wang, H., Zhang, A., & Wang, C. (2021). Nonuniform negative sampling and log odds correction with rare events data. Advances in Neural Information Processing Systems, 34, 19847-19859.
# ssp.relogit is assumed here to be provided by the 'subsampling' package
library(subsampling)

set.seed(1)
N <- 2 * 1e4
beta0 <- c(-5, -rep(0.7, 6))   # an intercept of -5 makes Y = 1 rare
d <- length(beta0) - 1
corr <- 0.5
# Covariance matrix with entries corr^|i - j|, scaled by 1/4
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))
sigmax <- sigmax / 4
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
# Success probability is the logistic function of the linear predictor
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
print(paste('N: ', N))
print(paste('sum(Y): ', sum(Y)))
n.plt <- 200
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
subsampling.results <- ssp.relogit(formula = formula,
                                   data = data,
                                   n.plt = n.plt,
                                   n.ssp = n.ssp,
                                   criterion = 'optA',
                                   likelihood = 'logOddsCorrection')
summary(subsampling.results)