Nothing
#' Optimal Subsampling Method for Softmax (multinomial logistic) Regression Model
#' @description
#' Draw subsample from full dataset and fit softmax(multinomial logistic) regression model on the subsample. Refer to [vignette](https://dqksnow.github.io/subsampling/articles/ssp-softmax.html) for a quick start.
#'
#' @param formula A model formula object of class "formula" that describes the model to be fitted.
#' @param data A data frame containing the variables in the model. Denote \eqn{N} as the number of observations in `data`.
#' @param subset An optional vector specifying a subset of observations from `data` to use for the analysis. This subset will be viewed as the full data.
#' @param n.plt The pilot subsample size (first-step subsample size).
#' This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.
#' @param n.ssp The expected size of the optimal subsample (second-step subsample). For `sampling.method = 'withReplacement'`, The exact subsample size is `n.ssp`. For `sampling.method = 'poisson'`, `n.ssp` is the expected subsample size.
#' @param criterion The criterion of optimal subsampling probabilities.
#' Choices include \code{optA}, \code{optL}, \code{MSPE}(default), \code{LUC} and \code{uniform}.
#'
#' - `MSPE` Minimizes the mean squared prediction error between subsample estimator and full data estimator.
#'
#' - `optA` Minimizes the trace of the asymptotic covariance matrix of the subsample estimator.
#'
#' - `optL` Minimizes the trace of a transformation of the asymptotic covariance matrix, which reduces computational costs than `optA`.
#'
#' - `LUC` Local uncertainty sampling method, serving as a baseline subsampling strategy. See Wang and Kim (2022).
#'
#' - `uniform` Assigns equal subsampling probability
#' \eqn{\frac{1}{N}} to each observation, serving as a baseline subsampling strategy.
#'
#' @param sampling.method The sampling method to use.
#' Choices include \code{withReplacement} and \code{poisson}(default). `withReplacement` draws exactly `n.ssp`
#' subsamples from size \eqn{N} full dataset with replacement, using the specified
#' subsampling probabilities. `poisson` draws observations independently by
#' comparing each subsampling probability with a realization of uniform random
#' variable \eqn{U(0,1)}.
#' Differences between methods:
#'
#' - Sample size: `withReplacement` draws exactly `n.ssp` subsamples while `poisson` draws
#' subsamples with expected size `n.ssp`, meaning the actual size may vary.
#'
#' - Memory usage: `withReplacement` requires the entire dataset to be loaded at once, while `poisson`
#' allows for processing observations sequentially (will be implemented in future version).
#'
#' - Estimator performance: Theoretical results show that the `poisson` tends to get a
#' subsample estimator with lower asymptotic variance compared to the
#' `withReplacement`
#'
#' @param likelihood A bias-correction likelihood function is required for subsample since unequal subsampling probabilities introduce bias. Choices include
#' \code{weighted} and \code{MSCLE}(default).
#'
#' - `weighted` Applies a weighted likelihood function where each observation is weighted by the inverse of its subsampling probability.
#'
#' - `MSCLE` It uses a conditional likelihood, where each element of the likelihood represents the density of \eqn{Y_i} given that this observation was drawn.
#'
#' @param constraint The constraint for identifiability of softmax model. Choices include
#' \code{baseline} and \code{summation}(default). The baseline constraint assumes the coefficient for the baseline category are \eqn{0}. Without loss of generality, we set the category \eqn{Y=0} as the baseline category so that \eqn{\boldsymbol{\beta}_0=0}. The summation constraint \eqn{\sum_{k=0}^{K} \boldsymbol{\beta}_k} is also used in the subsampling method for the purpose of calculating subsampling probability. These two constraints lead to different interpretation of coefficients but are equal for computing \eqn{P(Y_{i,k} = 1 \mid \mathbf{x}_i)}. The estimation of coefficients returned by `ssp.softmax()` is under baseline constraint.
#'
#' @param contrasts An optional list. It specifies how categorical variables are represented in the design matrix. For example, \code{contrasts = list(v1 = 'contr.treatment', v2 = 'contr.sum')}.
#' @param control A list of parameters for controlling the sampling process. There are two tuning parameters `alpha` and `b`. Default is \code{list(alpha=0, b=2)}.
#'
#' - `alpha` \eqn{\in [0,1]} is the mixture weight of the user-assigned subsampling
#' probability and uniform subsampling probability. The actual subsample
#' probability is \eqn{\pi = (1-\alpha)\pi^{opt} + \alpha \pi^{uni}}. This protects the estimator from extreme small
#' subsampling probability. The default value is 0.
#'
#' - `b` is a positive number which is used to constaint the poisson subsampling probability. `b` close to 0 results in subsampling probabilities closer to uniform probability \eqn{\frac{1}{N}}. `b=2` is the default value. See relevant references for further details.
#'
#' @param ... A list of parameters which will be passed to \code{nnet::multinom()}.
#'
#' @return
#' ssp.softmax returns an object of class "ssp.softmax" containing the following components (some are optional):
#' \describe{
#' \item{model.call}{The original function call.}
#' \item{coef.plt}{The pilot estimator. See Details for more information.}
#' \item{coef.ssp}{The estimator obtained from the optimal subsample.}
#' \item{coef}{The weighted linear combination of `coef.plt` and `coef.ssp`, under baseline constraint. The combination weights depend on the relative size of `n.plt` and `n.ssp` and the estimated covariance matrices of `coef.plt` and `coef.ssp.` We blend the pilot subsample information into optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are `coef` and the square root of `diag(cov)`.}
#' \item{coef.plt.sum}{The pilot estimator under summation constrraint. `coef.plt.sum = G %*% as.vector(coef.plt)`.}
#' \item{coef.ssp.sum}{The estimator obtained from the optimal subsample under summation constrraint. `coef.ssp.sum = G %*% as.vector(coef.ssp)`.}
#' \item{coef.sum}{The weighted linear combination of `coef.plt` and `coef.ssp`, under summation constrraint. `coef.sum = G %*% as.vector(coef)`.}
#' \item{cov.plt}{The covariance matrix of \code{coef.plt}.}
#' \item{cov.ssp}{The covariance matrix of \code{coef.ssp}.}
#' \item{cov}{The covariance matrix of \code{coef.cmb}.}
#' \item{cov.plt.sum}{The covariance matrix of \code{coef.plt.sum}.}
#' \item{cov.ssp.sum}{The covariance matrix of \code{coef.ssp.sum}.}
#' \item{cov.sum}{The covariance matrix of \code{coef.sum}.}
#' \item{index.plt}{Row indices of pilot subsample in the full dataset.}
#' \item{index.ssp}{Row indices of of optimal subsample in the full dataset.}
#' \item{N}{The number of observations in the full dataset.}
#' \item{subsample.size.expect}{The expected subsample size.}
#' \item{terms}{The terms object for the fitted model.}
#' }
#'
#' @details
#'
#' A pilot estimator for the unknown parameter \eqn{\beta} is required because MSPE, optA and
#' optL subsampling probabilities depend on \eqn{\beta}. There is no "free lunch" when determining optimal subsampling probabilities. For softmax regression, this
#' is achieved by drawing a size `n.plt` subsample with replacement from full
#' dataset with uniform sampling probability.
#'
#'
#' @references
#' Yao, Y., & Wang, H. (2019). Optimal subsampling for softmax regression. \emph{Statistical Papers}, \strong{60}, 585-599.
#'
#' Han, L., Tan, K. M., Yang, T., & Zhang, T. (2020). Local uncertainty sampling for large-scale multiclass logistic regression. \emph{Annals of Statistics}, \strong{48}(3), 1770-1788.
#'
#' Wang, H., & Kim, J. K. (2022). Maximum sampled conditional likelihood for informative subsampling. \emph{Journal of machine learning research}, \strong{23}(332), 1-50.
#'
#' Yao, Y., Zou, J., & Wang, H. (2023). Optimal poisson subsampling for softmax regression. \emph{Journal of Systems Science and Complexity}, \strong{36}(4), 1609-1625.
#'
#' Yao, Y., Zou, J., & Wang, H. (2023). Model constraints independent optimal subsampling probabilities for softmax regression. \emph{Journal of Statistical Planning and Inference}, \strong{225}, 188-201.
#'
#' @examples
#' # softmax regression
#' d <- 3 # dim of covariates
#' K <- 2 # K + 1 classes
#' G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)
#' N <- 1e4
#' beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K))
#' beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K))
#' set.seed(1)
#' mu <- rep(0, d)
#' sigma <- matrix(0.5, nrow = d, ncol = d)
#' diag(sigma) <- rep(1, d)
#' X <- MASS::mvrnorm(N, mu, sigma)
#' prob <- exp(X %*% beta.true.summation)
#' prob <- prob / rowSums(prob)
#' Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row))
#' n.plt <- 500
#' n.ssp <- 1000
#' data <- as.data.frame(cbind(Y, X))
#' colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
#' head(data)
#' formula <- Y ~ . -1
#' WithRep.MSPE <- ssp.softmax(formula = formula,
#' data = data,
#' n.plt = n.plt,
#' n.ssp = n.ssp,
#' criterion = 'MSPE',
#' sampling.method = 'withReplacement',
#' likelihood = 'weighted',
#' constraint = 'baseline')
#' summary(WithRep.MSPE)
#' @export
ssp.softmax <- function(formula,
data,
subset,
n.plt,
n.ssp,
criterion = 'MSPE',
sampling.method = 'poisson',
likelihood = 'MSCLE',
constraint = 'summation',
control = list(...),
contrasts = NULL,
...
) {
model.call <- match.call()
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset"),
names(mf),
0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
mt <- attr(mf, "terms")
Y <- model.response(mf, "any")
if(length(dim(Y)) == 1L) {
nm <- rownames(Y)
dim(Y) <- NULL
if(!is.null(nm)) names(Y) <- nm
}
X <- model.matrix(mt, mf, contrasts)
if (attr(mt, "intercept") == 1) {
colnames(X)[1] <- "Intercept"
}
dimension <- dim(X)
N <- dimension[1]
d <- dimension[2]
K <- length(unique(Y)) - 1
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)
## G: transformation matrix
Y.matrix <- matrix(0, nrow = N, ncol = K)
Y.matrix[cbind(c(1:length(Y)), Y)] <- 1
criterion <- match.arg(criterion,c('optL', 'optA', 'MSPE', 'LUC', 'uniform'))
sampling.method <- match.arg(sampling.method, c('poisson', 'withReplacement'))
likelihood <- match.arg(likelihood, c('weighted', 'MSCLE'))
constraint <- match.arg(constraint, c('baseline', 'summation'))
control <- do.call("softmax.control", control)
## create a list to store variables
inputs <- list(X = X, Y = Y, Y.matrix = Y.matrix,
N = N, d = d, K = K, G = G,
n.plt = n.plt, n.ssp = n.ssp,
criterion = criterion, sampling.method = sampling.method,
likelihood = likelihood, constraint = constraint,
control = control
)
if (criterion %in% c('optL', 'optA', 'MSPE', 'LUC')) {
plt.estimate.results <- softmax.plt.estimate(inputs, ...)
p.plt <- plt.estimate.results$p.plt
beta.plt.b <- plt.estimate.results$beta.plt # under baseline constraint
P1.plt <- plt.estimate.results$P1.plt
ddL.plt <- plt.estimate.results$ddL.plt
dL.sq.plt <- plt.estimate.results$dL.sq.plt
index.plt <- plt.estimate.results$index.plt
cov.plt.b <- plt.estimate.results$cov.plt
Omega.plt <- plt.estimate.results$Omega.plt
Lambda.plt <- plt.estimate.results$Lambda.plt
## subsampling step
ssp.results <- softmax.subsampling(inputs,
p.plt = p.plt,
ddL.plt = ddL.plt,
P1.plt = P1.plt,
Omega.plt = Omega.plt,
index.plt = index.plt)
index.ssp <- ssp.results$index.ssp
p.ssp <- ssp.results$p.ssp
offsets <- ssp.results$offsets
## subsample estimating step
ssp.estimate.results <-
softmax.subsample.estimate(inputs,
index.ssp = index.ssp,
p.ssp = p.ssp[index.ssp],
offsets = offsets,
beta.plt = beta.plt,
...)
beta.ssp.b <- ssp.estimate.results$beta.ssp # under baseline constraint
ddL.ssp <- ssp.estimate.results$ddL.ssp
dL.sq.ssp <- ssp.estimate.results$dL.sq.ssp
Lambda.ssp <- ssp.estimate.results$Lambda.ssp
cov.ssp.b <- ssp.estimate.results$cov.ssp
## combining step
combining.results <- softmax.combining(inputs,
n.ssp = length(index.ssp),
ddL.plt = ddL.plt,
ddL.ssp = ddL.ssp,
dL.sq.plt = dL.sq.plt,
dL.sq.ssp = dL.sq.ssp,
Lambda.plt = Lambda.plt,
Lambda.ssp = Lambda.ssp,
beta.plt = beta.plt.b,
beta.ssp = beta.ssp.b)
beta.cmb.b <- combining.results$beta.cmb # under baseline constraint
cov.cmb.b <- combining.results$cov.cmb
P.cmb <- combining.results$P.cmb
beta.plt.sum <- G %*% as.vector(beta.plt.b)
beta.ssp.sum <- G %*% as.vector(beta.ssp.b)
beta.cmb.sum <- G %*% as.vector(beta.cmb.b)
cov.plt.sum <- G %*% cov.plt.b %*% t(G)
cov.ssp.sum <- G %*% cov.ssp.b %*% t(G)
cov.cmb.sum <- G %*% cov.cmb.b %*% t(G)
beta.plt <- matrix(beta.plt.b, nrow = d)
beta.ssp = matrix(beta.ssp.b, nrow = d)
beta <- matrix(beta.cmb.b, nrow = d)
rownames(beta.plt) <- rownames(beta.ssp) <- rownames(beta) <- colnames(X)
results <- list(model.call = model.call,
coef.plt = beta.plt,
coef.ssp = beta.ssp,
coef = beta,
coef.plt.sum = beta.plt.sum,
coef.ssp.sum = beta.ssp.sum,
coef.sum = beta.cmb.sum,
cov.plt = cov.plt.b,
cov.ssp = cov.ssp.b,
cov = cov.cmb.b,
cov.plt.sum = cov.plt.sum,
cov.sum = cov.cmb.sum,
cov.ssp.sum = cov.ssp.sum,
index.plt = index.plt,
index.ssp = index.ssp,
N = N,
subsample.size.expect = n.ssp,
terms = mt
)
class(results) <- c("ssp.softmax", "list")
return(results)
} else if (criterion == "uniform"){
n.uni <- n.plt + n.ssp
if (sampling.method == 'withReplacement') {
index.uni <- random.index(N, n.uni)
x.uni <- X[index.uni, ]
y.uni <- Y[index.uni]
results <- softmax.coef.estimate(x.uni, y.uni, ...)
beta.uni.b <- results$beta
P.uni <- results$P1
ddL.uni <- softmax_ddL_cpp(X = x.uni, P = P.uni[, -1], p = rep(1, n.uni),
K, d, scale = n.uni)
dL.sq.uni <- softmax_dL_sq_cpp(X = x.uni,
Y_matrix = Y.matrix[index.uni, ],
P = P.uni[, -1], p = rep(1, n.uni), K = K,
d = d, scale = n.uni)
c <- n.uni / N
cov.uni.b <- solve(ddL.uni) %*% (dL.sq.uni * (1+c)) %*%
solve(ddL.uni) / n.uni
} else if (sampling.method == 'poisson') {
index.uni <- poisson.index(N, n.uni/N)
p.uni <- rep(n.uni / N, length(index.uni))
x.uni <- X[index.uni, ]
y.uni <- Y[index.uni]
results <- softmax.coef.estimate(x.uni, y.uni, ...)
beta.uni.b <- results$beta
P.uni <- results$P1
ddL.uni <- softmax_ddL_cpp(X = x.uni, P = P.uni[, -1], p = p.uni, K, d,
scale = N)
dL.sq.uni <- softmax_dL_sq_cpp(X = x.uni,
Y_matrix = Y.matrix[index.uni, ],
P = P.uni[, -1], p = p.uni, K = K, d = d,
scale=(N^2))
cov.uni.b <- solve(ddL.uni) %*% dL.sq.uni %*% solve(ddL.uni)
}
beta.sum <- G %*% as.vector(beta.uni.b)
beta <- matrix(beta.uni.b, nrow = d)
cov.sum <- G %*% cov.uni.b %*% t(G)
rownames(beta) <- colnames(X)
results <- list(model.call = model.call,
index.ssp = index.uni,
coef = beta,
cov = cov.uni.b,
coef.sum = beta.sum,
cov.sum = cov.sum,
N = N,
subsample.size.expect = n.uni,
terms = mt
)
class(results) <- c("ssp.softmax", "list")
return(results)
}
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.