View source: R/glm_rF_main_function.R
| ssp.glm.rF | R Documentation |
This function inherits the weighted objective function (likelihood = "weighted") and the poisson sampling method (sampling.method = "poisson") from ssp.glm, while additionally handling rare features in the model. Rare features refer to binary covariates with low prevalence of being one
(expressed) and mostly zero. Such features create challenges for subsampling,
because it is likely to miss these rare but informative
observations. The balanced subsampling method upweights observations that
contain expressed rare features, thereby preserving estimation efficiency for
the coefficients of rare features.
A quick start guide is provided in the vignette: https://dqksnow.github.io/subsampling/articles/ssp-logit-rF.html.
ssp.glm.rF(
formula,
data,
subset = NULL,
n.plt,
n.ssp,
family = "binomial",
criterion = "BL-Uni",
sampling.method = "poisson",
likelihood = "weighted",
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = NULL,
control = list(...),
contrasts = NULL,
...
)
formula |
A model formula object. |
data |
A data frame containing the variables in the model. |
subset |
An optional vector specifying a subset of observations to be used as the full dataset. |
n.plt |
The pilot sample size for computing the pilot estimator and estimating optimal subsampling probabilities. |
n.ssp |
The expected size of the optimal subsample. For
|
family |
A character string naming a family. It can be a character string naming a family function, a family function or the result of a call to a family function. |
criterion |
The subsampling criterion. Choices include:
|
sampling.method |
The sampling method. Currently |
likelihood |
The objective function used for estimation. Currently
|
balance.plt |
Logical. Whether to use |
balance.Y |
Logical. Whether to balance the binary response variable in
logistic regression. If |
rareFeature.index |
Column indices of binary rare features in the design
matrix, coded as |
control |
A list passed to
|
contrasts |
Optional list specifying how categorical variables are encoded in the design matrix. |
... |
Additional arguments passed to |
See the package vignette for more details and examples.
An object of class "ssp.glm.rF" containing the following fields.
The original function call.
Pilot estimator obtained from the pilot subsample.
Estimator obtained from the optimal subsample.
Combined estimator using the union of pilot and optimal subsamples.
If criterion = "BL-Uni" or "Uni", this equals the pilot and subsample
estimators because only one sampling step is used.
Estimated covariance matrix of coef.plt.
Estimated covariance matrix of coef.ssp.
Estimated covariance matrix of coef.cmb.
Number of observations in the full dataset.
Expected subsample size (n.ssp).
Actual number of subsampled observations.
For each rare feature, return the counts of ones in the full dataset.
Vector of length K giving, for each rare feature, the number of ones in the pilot subsample.
Same as above, computed for the optimal subsample.
Same as above, computed for the combined subsample.
Row indices of the pilot subsample within the full dataset.
Row indices of the optimal subsample within the full dataset.
Union of pilot and optimal subsample row indices.
Column indices of rare features in the design matrix (same as the user input).
Total computation time for computing sampling probabilities, drawing subsamples, and fitting the subsample estimator.
The terms object for the fitted model.
# logistic regression
set.seed(2)
N <- 1e4
d_rare <- 3
d_cont <- 2
p_rare <- c(0.01, 0.02, 0.05)
beta0 <- c(0.5, rep(0.5, d_rare), rep(0.5, d_cont))
corr <- 0.5
sigmax <- matrix(corr, d_cont, d_cont) + diag(1-corr, d_cont)
X <- MASS::mvrnorm(N, rep(0, d_cont), sigmax)
Z <- do.call(cbind, lapply(seq_along(p_rare), function(i) {
rbinom(N, 1, p_rare[i])
}))
X <- cbind(Z, X)
P <- 1 / (1 + exp(-(beta0[1] + X %*% beta0[-1])))
Y <- as.integer(rbinom(N, 1, P))
colnames(X) <- paste0("X", 1:(d_rare + d_cont))
rareFeature.index <- c(1:d_rare)
data <- data.frame(Y = Y, X)
formula <- Y ~ .
n.plt <- 300
n.ssp <- 2000
BL.Uni.results <- ssp.glm.rF(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'BL-Uni',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(BL.Uni.results)
R.Lopt.results <- ssp.glm.rF(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'R-Lopt',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(R.Lopt.results)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.