View source: R/softmax_main_function.R
ssp.softmax (R Documentation)
Draw a subsample from the full dataset and fit a softmax (multinomial logistic) regression model on the subsample. Refer to the vignette for a quick start.
ssp.softmax(
formula,
data,
subset,
n.plt,
n.ssp,
criterion = "MSPE",
sampling.method = "poisson",
likelihood = "MSCLE",
constraint = "summation",
control = list(...),
contrasts = NULL,
...
)
formula
A model formula object of class "formula" that describes the model to be fitted.
data
A data frame containing the variables in the model. Denote the number of observations in data as N.
subset
An optional vector specifying a subset of observations from data to be used in the fitting process.
n.plt
The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.
n.ssp
The expected size of the optimal subsample (second-step subsample). For sampling.method = 'withReplacement', exactly n.ssp rows are drawn; for sampling.method = 'poisson', the realized subsample size is random with expectation n.ssp.
criterion
The criterion of optimal subsampling probabilities. Choices include MSPE (default), optA and optL, all of which depend on the pilot estimator; see Details.
sampling.method
The sampling method to use. Choices include poisson (default) and withReplacement; the two schemes are sketched after this argument list.
likelihood
A bias-correction likelihood function is required for the subsample, since unequal subsampling probabilities introduce bias. Choices include MSCLE (default, the maximum sampled conditional likelihood of Wang and Kim, 2022) and weighted (inverse-probability weighted likelihood).
constraint
The constraint for identifiability of the softmax model. Choices include summation (default) and baseline. The baseline constraint fixes the coefficients of a reference class at zero, while the summation constraint requires the coefficients to sum to zero across classes. Estimates under both constraints are returned; see Value.
control
A list of parameters for controlling the sampling process. There are two tuning parameters; see the package vignette for their definitions and default values.
contrasts
An optional list. It specifies how categorical variables are represented in the design matrix.
...
Additional parameters passed on to the underlying model-fitting function.
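The difference between the two sampling methods can be illustrated with base R. A minimal sketch, assuming pi holds subsampling probabilities scaled to sum to the target size (the uniform pi below is a hypothetical example, not the package's optimal probabilities):

# 'withReplacement' draws exactly n.ssp rows with chances proportional to pi;
# 'poisson' keeps each row independently, so the realized size is random
# with expectation sum(pi).
set.seed(1)
N <- 1e4
pi <- rep(1000 / N, N)          # hypothetical probabilities, sum(pi) = 1000
idx.rep <- sample(N, size = 1000, replace = TRUE, prob = pi)
idx.poi <- which(runif(N) <= pi)
length(idx.rep)                 # exactly 1000
length(idx.poi)                 # random, close to 1000 on average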
A pilot estimator for the unknown parameter \beta is required because the MSPE, optA and optL subsampling probabilities depend on \beta. There is no "free lunch" when determining optimal subsampling probabilities. For softmax regression, the pilot estimator is obtained by drawing a subsample of size n.plt with replacement from the full dataset with uniform sampling probabilities.
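A minimal sketch of this pilot step, assuming the data frame built in the Examples below; nnet::multinom stands in here for the package's internal fitting routine, which may differ:

# Draw a uniform pilot subsample with replacement and fit a softmax model
# on it; the resulting coefficients play the role of the pilot estimator
# from which the MSPE/optA/optL probabilities are computed.
library(nnet)
n.plt <- 500
index.plt <- sample(nrow(data), size = n.plt, replace = TRUE)
data.plt <- data[index.plt, ]
data.plt$Y <- factor(data.plt$Y)       # multinom() expects a factor response
pilot.fit <- multinom(Y ~ . - 1, data = data.plt, trace = FALSE)
coef.plt <- coef(pilot.fit)            # pilot estimate of beta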
ssp.softmax returns an object of class "ssp.softmax" containing the following components (some are optional):

model.call
The original function call.

coef.plt
The pilot estimator. See Details for more information.

coef.ssp
The estimator obtained from the optimal subsample.

coef
The weighted linear combination of coef.plt and coef.ssp, under the baseline constraint. The combination weights depend on the relative size of n.plt and n.ssp and on the estimated covariance matrices of coef.plt and coef.ssp. The pilot subsample information is blended into the optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are coef and the square roots of diag(cov).

coef.plt.sum
The pilot estimator under the summation constraint. coef.plt.sum = G %*% as.vector(coef.plt).

coef.ssp.sum
The estimator obtained from the optimal subsample under the summation constraint. coef.ssp.sum = G %*% as.vector(coef.ssp).

coef.sum
The weighted linear combination of coef.plt and coef.ssp, under the summation constraint. coef.sum = G %*% as.vector(coef). The mapping by G is sketched after this list.

cov.plt
The covariance matrix of coef.plt.

cov.ssp
The covariance matrix of coef.ssp.

cov
The covariance matrix of coef.

cov.plt.sum
The covariance matrix of coef.plt.sum.

cov.ssp.sum
The covariance matrix of coef.ssp.sum.

cov.sum
The covariance matrix of coef.sum.

index.plt
Row indices of the pilot subsample in the full dataset.

index.ssp
Row indices of the optimal subsample in the full dataset.

N
The number of observations in the full dataset.

subsample.size.expect
The expected subsample size.

terms
The terms object for the fitted model.
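The matrix G that appears in the formulas above converts baseline-constrained coefficients into their summation-constrained counterparts. A small sketch, assuming the baseline coefficient matrix stores only the K non-reference columns; the values are the true coefficients from the Examples, so the mapping can be verified directly:

# G maps the K free baseline columns to d x (K + 1) coefficients that
# sum to zero across classes, as in coef.sum = G %*% as.vector(coef).
d <- 3; K <- 2
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)
beta.baseline <- matrix(-1.5, d, K)    # non-reference columns; reference class is 0
beta.sum <- matrix(G %*% as.vector(beta.baseline), d, K + 1)
beta.sum                               # equals cbind(rep(1, d), 0.5 * matrix(-1, d, K))
rowSums(beta.sum)                      # zero for every covariate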
Yao, Y., & Wang, H. (2019). Optimal subsampling for softmax regression. Statistical Papers, 60, 585-599.
Han, L., Tan, K. M., Yang, T., & Zhang, T. (2020). Local uncertainty sampling for large-scale multiclass logistic regression. Annals of Statistics, 48(3), 1770-1788.
Wang, H., & Kim, J. K. (2022). Maximum sampled conditional likelihood for informative subsampling. Journal of Machine Learning Research, 23(332), 1-50.
Yao, Y., Zou, J., & Wang, H. (2023). Optimal Poisson subsampling for softmax regression. Journal of Systems Science and Complexity, 36(4), 1609-1625.
Yao, Y., Zou, J., & Wang, H. (2023). Model constraints independent optimal subsampling probabilities for softmax regression. Journal of Statistical Planning and Inference, 225, 188-201.
# softmax regression
library(subsampling)
d <- 3 # dimension of covariates
K <- 2 # K + 1 classes
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)  # baseline-to-summation map
N <- 1e4
beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K))       # reference class fixed at 0
beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K))  # sums to zero across classes
set.seed(1)
mu <- rep(0, d)
sigma <- matrix(0.5, nrow = d, ncol = d)
diag(sigma) <- rep(1, d)
X <- MASS::mvrnorm(N, mu, sigma)
prob <- exp(X %*% beta.true.summation)
prob <- prob / rowSums(prob)
Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row))
n.plt <- 500
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
head(data)
formula <- Y ~ . -1
WithRep.MSPE <- ssp.softmax(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'MSPE',
sampling.method = 'withReplacement',
likelihood = 'weighted',
constraint = 'baseline')
summary(WithRep.MSPE)
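For comparison, the same model can be fitted with the default choices, Poisson sampling with the MSCLE likelihood under the summation constraint; components listed under Value (names as reconstructed there) can then be extracted from either fit:

Poisson.MSCLE <- ssp.softmax(formula = formula,
                             data = data,
                             n.plt = n.plt,
                             n.ssp = n.ssp,
                             criterion = 'MSPE',
                             sampling.method = 'poisson',
                             likelihood = 'MSCLE',
                             constraint = 'summation')
summary(Poisson.MSCLE)
WithRep.MSPE$coef                      # combined estimator (baseline constraint)
sqrt(diag(WithRep.MSPE$cov))           # standard errors reported by summary()
head(WithRep.MSPE$index.ssp)           # rows used in the optimal subsample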