knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette introduces the usage of ssp.softmax
, which draws optimal
subsample from full data and fit softmax (multinomial) regression on the
subsample. The statistical theory and algorithms in this implementation can be
found in the relevant reference papers.
Denote $y$ as multi-category response variable and $K+1$ is the number of categories. $N$ is the number of observations in the full dataset. $X$ is the $N \times d$ covariates matrix. Softmax regression model assumes that $$ P(y_{i,k} = 1 \mid \mathbf{x}i) = \frac{\exp(\mathbf{x}_i^\top \boldsymbol{\beta}_k)}{\sum{l=0}^{K} \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_l)} $$ for $i = 1, \ldots, N$ and $k = 0, 1, \ldots, K$, where $\boldsymbol{\beta}_k$'s are $d$-dimensional unknown coefficients.
The log-likelihood function of softmax regression is
$$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=0}^{K} y_{i,k} \mathbf{x}i^\top \boldsymbol{\beta}_k - \ln \left{ \sum{l=0}^{K} \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_l) \right} \right]. $$
The idea of subsampling methods is as follows: instead of fitting the model on the size $N$ full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability ($\pi$): The probability assigned to each observation for inclusion in the subsample.
We introduce the usage of ssp.softmax
with simulated data. $X$ contains $d=3$
covariates drawn from multinormal distribution and $Y$ is the multicategory
response variable with $K+1=3$ categories. The full data size is $N = 1 \times
10^4$.
library(subsampling)
set.seed(1) d <- 3 K <- 2 G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d) N <- 1e4 beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K)) beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K)) mu <- rep(0, d) sigma <- matrix(0.5, nrow = d, ncol = d) diag(sigma) <- rep(1, d) X <- MASS::mvrnorm(N, mu, sigma) prob <- exp(X %*% beta.true.summation) prob <- prob / rowSums(prob) Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row)) data <- as.data.frame(cbind(Y, X)) colnames(data) <- c("Y", paste("V", 1:ncol(X), sep="")) head(data)
The function usage is
ssp.softmax( formula, data, subset, n.plt, n.ssp, criterion = "MSPE", sampling.method = "poisson", likelihood = "MSCLE", constraint = "summation", control = list(...), contrasts = NULL, ... )
The core functionality of ssp.softmax
revolves around three key questions:
How are subsampling probabilities computed? (Controlled by the criterion
argument)
How is the subsample drawn? (Controlled by the sampling.method
argument)
How is the likelihood adjusted to correct for bias? (Controlled by the
likelihood
argument)
criterion
The choices of criterion
include optA
, optL
, ,MSPE
(default), LUC
and
uniform
. The default criterion MSPE
minimizes the mean squared prediction
error between subsample estimator and full data estimator. Criterion optA
and
optL
are derived by minimizing the asymptotic covariance of subsample
estimator. LUC
and uniform
are baseline methods. See @yao2023model and @wang2022maximum for
details.
sampling.method
The options for sampling.method
include withReplacement
and poisson
(default). withReplacement.
stands for drawing $n.ssp$ subsamples from full
dataset with replacement, using the specified subsampling probability. poisson
stands for drawing subsamples one by one by comparing the subsampling
probability with a realization of uniform random variable $U(0,1)$. The expected
number of drawed samples are $n.ssp$.
likelihood
The available choices for likelihood
include weighted
and MSCLE
(default).
MSCLE
stands for maximum sampled conditional likelihood. Both of these
likelihood functions can derive an unbiased optimal subsample estimator. See
@wang2022maximum for details about MSCLE
.
constraint
Softmax model needs constraint on unknown coefficients for identifiability. The
options for constraint
include summation
and baseline
(default). The
baseline constraint assumes the coefficient for the baseline category are
$0$. Without loss of generality, ssp.softmax
sets the category $Y=0$ as the baseline
category so that $\boldsymbol{\beta}0=0$. The summation constraint
$\sum{k=0}^{K} \boldsymbol{\beta}k$ can also used in the subsampling method for
the purpose of calculating optimal subsampling probability. These two constraints lead
to different interpretation of coefficients but are equal for computing
$P(y{i,k} = 1 \mid \mathbf{x}_i)$. The estimation of coefficients returned by
ssp.softmax()
is under baseline constraint.
After drawing subsample, ssp.softmax
utilizes nnet::multinom
to fit the
model on the subsample. Arguments accepted by nnet::multinom
can be passed
through ...
in ssp.softmax
.
Below are two examples demonstrating the use of ssp.softmax
with different
configurations.
n.plt <- 200 n.ssp <- 600 formula <- Y ~ . -1 ssp.results1 <- ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, criterion = 'MSPE', sampling.method = 'withReplacement', likelihood = 'weighted', constraint = 'baseline' ) summary(ssp.results1)
summary(ssp.results1)
shows that it draws 600 observations out of 10000, where
the number of unique indices is less than 600 since we use sampling.method = 'withReplacement'
. After fitting softmax model on subsample using the choosen
weighted
likelihood function, we get coefficients estimation and standard
errors as above.
ssp.results2 <- ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, criterion = 'MSPE', sampling.method = 'poisson', likelihood = 'MSCLE', constraint = 'baseline' ) summary(ssp.results2)
The returned object contains estimation results and index of drawn subsamples in the full dataset.
names(ssp.results1)
Some key returned variables:
index.plt
and index
are the row indices of drawn pilot subsamples and
optimal subsamples in the full data.
coef.ssp
is the subsample estimator for $\beta$ and coef
is the linear
combination of coef.plt
(pilot estimator) and coef.ssp
.
cov.ssp
and cov
are estimated covariance matrices of coef.ssp
and
coef
.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.