maxSGA | R Documentation |
Stochastic Gradient Ascent–based optimizers
maxSGA(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )
maxAdam(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )
fn |
the function to be maximized. As the objective function
values are not directly used for optimization, this argument is
optional, given
|
grad |
gradient of the objective function.
It must have the parameter vector as the first argument, and it must
have an argument If |
hess |
Hessian matrix of the function. Mainly for compatibility
reasons, only used for computing the final Hessian if asked to do
so by setting |
start |
initial parameter values. If these have names, the names are also used for results. |
nObs |
number of observations. This is used to partition the data
into individual batches. The resulting batch
indices are forwarded to the |
constraints |
either |
finalHessian |
how (and if) to calculate the final Hessian. Either
The Hessian matrix is not often used for optimization problems where one applies SGA, but even if one is not interested in standard errors, it may provide useful information about the model performance. If computed by the finite-difference method, the Hessian computation may be very slow. |
fixed |
parameters to be treated as constants at their
|
control |
list of control parameters. The ones used by these optimizers are
Adam-specific parameters
General stochastic gradient parameters:
Stopping conditions:
See |
... |
further arguments to |
Gradient Ascent (GA) is an optimization method where the algorithm
repeatedly takes small steps in the gradient's direction: the
parameter vector \theta is updated as
\theta \leftarrow \theta + \mathrm{learning rate}\cdot \nabla f(\theta).
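The update rule can be illustrated with a minimal base-R sketch. It does not use maxLik; the data, the function name gradFull, and the learning rate 0.1 are assumptions chosen for illustration, and the gradient is averaged over observations to keep the step size stable.

```r
## Plain (full-batch) gradient ascent on the exponential log-likelihood.
## All names below are illustrative and not part of maxLik.
set.seed(1)
tObs <- rexp(100, 2)
## gradient of the mean log-likelihood mean(log(theta) - theta*tObs)
gradFull <- function(theta) 1/theta - mean(tObs)
theta <- 1            # starting value
learningRate <- 0.1   # assumed step size
for (i in 1:500) {
  ## theta <- theta + learning rate * gradient
  theta <- theta + learningRate * gradFull(theta)
}
theta          # approaches the ML estimate
1/mean(tObs)   # closed-form ML estimate for comparison
```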
In Stochastic GA (SGA), the gradient is not computed on the
full set of observations but on a small subset (a batch),
potentially a single observation only. In certain circumstances
this converges much faster
than when using all observations (see
Bottou et al., 2018).
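A minimal base-R sketch of minibatch updates follows. The shuffle-and-split batching scheme, the function name gradBatch, and the numeric settings are assumptions made for illustration; how maxSGA partitions the observations internally may differ.

```r
## Minibatch stochastic gradient ascent, written out by hand.
## Names and settings are illustrative, not maxLik internals.
set.seed(1)
tObs <- rexp(100, 2)
nObs <- length(tObs)
batchSize <- 10
learningRate <- 0.05
## mean gradient over the observations selected by 'index'
gradBatch <- function(theta, index) 1/theta - mean(tObs[index])
theta <- 1
for (epoch in 1:200) {
  ## assumed scheme: shuffle the indices, then cut them into batches
  shuffled <- sample(nObs)
  batches <- split(shuffled, ceiling(seq_along(shuffled)/batchSize))
  for (index in batches) {
    theta <- theta + learningRate * gradBatch(theta, index)
  }
}
theta   # fluctuates around the full-data estimate 1/mean(tObs)
```

Each epoch visits every observation once, but the parameter is updated after every batch rather than once per pass over the data.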
If SGA_momentum
is positive, the SGA algorithm updates the parameters
\theta
in two steps. First, the momentum is used to update
the “velocity” v
as
v \leftarrow \mathrm{momentum}\cdot v + \mathrm{learning rate}\cdot \nabla f(\theta)
and thereafter the parameter
\theta
is updated as
\theta \leftarrow \theta + v.
The initial velocity is set to 0.
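The two-step momentum update can be written out as a short base-R sketch. The data, the helper gradBatch, and the values momentum = 0.9 and learningRate = 0.01 are assumptions for illustration only.

```r
## SGA with momentum, following the two update steps described above.
## Names and settings are illustrative, not maxLik internals.
set.seed(1)
tObs <- rexp(100, 2)
gradBatch <- function(theta, index) 1/theta - mean(tObs[index])
momentum <- 0.9
learningRate <- 0.01
theta <- 1
v <- 0   # initial velocity
for (epoch in 1:200) {
  ## ten shuffled batches of ten observations each
  batches <- split(sample(100), rep(1:10, each = 10))
  for (index in batches) {
    v <- momentum * v + learningRate * gradBatch(theta, index)  # step 1
    theta <- theta + v                                          # step 2
  }
}
theta
```

With momentum 0, the velocity equals the current scaled gradient and the loop reduces to plain SGA.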
The Adam algorithm is more complex and uses first and second moments of stochastic gradients to automatically adjust the learning rate. See Goodfellow et al, 2016, page 301.
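For orientation, here is a base-R sketch of the Adam update as it is commonly stated in the literature, with the sign flipped for ascent. Whether maxAdam matches this form in every detail (for example, the placement of the small constant eps) is an assumption. The names beta1, beta2 and eps and their values are the usual textbook choices, not settings taken from this page.

```r
## Textbook-style Adam update, adapted to maximization.
## Names and settings are illustrative, not maxLik internals.
set.seed(1)
tObs <- rexp(100, 2)
gradBatch <- function(theta, index) 1/theta - mean(tObs[index])
learningRate <- 0.01
beta1 <- 0.9      # decay rate for the first moment
beta2 <- 0.999    # decay rate for the second moment
eps <- 1e-8       # guards against division by zero
theta <- 1
m <- 0; s <- 0; k <- 0
for (epoch in 1:200) {
  batches <- split(sample(100), rep(1:10, each = 10))
  for (index in batches) {
    k <- k + 1
    g <- gradBatch(theta, index)
    m <- beta1*m + (1 - beta1)*g      # first-moment estimate
    s <- beta2*s + (1 - beta2)*g^2    # second-moment estimate
    mHat <- m/(1 - beta1^k)           # bias corrections
    sHat <- s/(1 - beta2^k)
    theta <- theta + learningRate*mHat/(sqrt(sHat) + eps)
  }
}
theta
```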
The function fn
is not directly used for optimization, only
for printing or as a stopping condition. In this sense
it is up to the user to decide what the function
returns, if anything. For instance, it may be useful for fn
to compute the
objective function on either full training data, or on validation data,
and just ignore the index
argument. The latter is useful if
using patience-based stopping.
However, one may also
choose to select the observations determined by the index to
compute the objective function on the current data batch.
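The idea of evaluating fn on validation data and stopping on lack of improvement can be sketched in base R. This is a hand-written loop, not maxLik's own stopping rule; the names fnValid, gradTrain, patience and waited, and all numeric settings, are hypothetical.

```r
## Patience-based stopping with a validation objective (illustrative only).
set.seed(1)
tObs <- rexp(200, 2)
iTrain <- sample(200, 160)
tTrain <- tObs[iTrain]
tValid <- tObs[-iTrain]
## objective on validation data; 'index' is accepted but ignored
fnValid <- function(theta, index) mean(log(theta) - theta*tValid)
## gradient on the training observations selected by 'index'
gradTrain <- function(theta, index) 1/theta - mean(tTrain[index])
theta <- 1
best <- -Inf
patience <- 5   # hypothetical: epochs to wait for an improvement
waited <- 0
epochsRun <- 0
for (epoch in 1:500) {
  batches <- split(sample(length(tTrain)), rep(1:16, each = 10))
  for (index in batches) {
    theta <- theta + 0.01 * gradTrain(theta, index)
  }
  epochsRun <- epoch
  value <- fnValid(theta, index)
  if (value > best) {
    best <- value
    waited <- 0
  } else {
    waited <- waited + 1
  }
  if (waited >= patience) break   # no improvement for 'patience' epochs
}
c(epochs = epochsRun, theta = theta, bestValue = best)
```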
object of class "maxim". Data can be extracted through the following methods:
maxValue |
|
coef |
estimated parameter value. |
gradient |
vector, last calculated gradient value. Should be close to 0 in case of normal convergence. |
estfun |
matrix of gradients at parameter value |
hessian |
Hessian at the maximum (the last calculated value if not converged). |
storedValues |
return values stored at each epoch |
storedParameters |
return parameters stored at each epoch |
returnCode |
a numeric code that describes the convergence or error. |
returnMessage |
a short message, describing the return code. |
activePar |
logical vector, which parameters are optimized over.
Contains only |
nIter |
number of iterations. |
maximType |
character string, type of maximization. |
maxControl |
the optimization control parameters in the form of a
|
Ott Toomet, Arne Henningsen
Bottou, L.; Curtis, F.; Nocedal, J. (2018): Optimization Methods for Large-Scale Machine Learning, SIAM Review 60, 223–311.
Goodfellow, I.; Bengio, Y.; Courville, A. (2016): Deep Learning, MIT Press.
Henningsen, A.; Toomet, O. (2011): maxLik: A package for maximum likelihood estimation in R, Computational Statistics 26, 443–458.
A good starting point to learn about the usage of stochastic gradient ascent in the maxLik package is the vignette “Stochastic Gradient Ascent in maxLik”.
The other related functions are
maxNR
for Newton-Raphson, a popular Hessian-based maximization;
maxBFGS
for maximization using the BFGS, Nelder-Mead (NM),
and Simulated Annealing (SANN) methods (based on optim),
also supporting inequality constraints;
maxLik
for a general framework for maximum likelihood
estimation (MLE);
optim
for different gradient-based optimization
methods.
## estimate the exponential distribution parameter by ML
set.seed(1)
t <- rexp(100, 2)
loglik <- function(theta, index) sum(log(theta) - theta*t[index])
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta, index) sum(1/theta - t[index])
## Estimate with minibatches of 10 observations (SG_batchSize=10)
a <- maxSGA(loglik, gradlik, start=1, control=list(iterlim=1000,
SG_batchSize=10), nObs=100)
# note that loglik is not really needed, and is not used
# here, unless more print verbosity is asked
summary(a)
##
## demonstrate the usage of index, and using
## fn for computing the objective function on validation data.
## Create a linear model where variables are very unequally scaled
##
## OLS loglik function: compute the function value on validation data only
loglik <- function(beta, index) {
   ## 'index' is ignored: the value is always computed on validation data
   e <- yValid - XValid %*% beta
   -crossprod(e)/length(yValid)
}
## OLS gradient: compute it on training data only
## Use 'index' to select the subset corresponding to the minibatch
gradlik <- function(beta, index) {
   e <- yTrain[index] - XTrain[index,,drop=FALSE] %*% beta
   g <- t(-2*t(XTrain[index,,drop=FALSE]) %*% e)
   -g/length(index)
}
N <- 1000
## two random variables: one with scale 1, the other with 100
X <- cbind(rnorm(N), rnorm(N, sd=100))
beta <- c(1, 1) # true parameter values
y <- X %*% beta + rnorm(N, sd=0.2)
## training-validation split
iTrain <- sample(N, 0.8*N)
XTrain <- X[iTrain,,drop=FALSE]
XValid <- X[-iTrain,,drop=FALSE]
yTrain <- y[iTrain]
yValid <- y[-iTrain]
##
## do this without momentum: learning rate must stay small for the gradient not to explode
cat(" No momentum:\n")
a <- maxSGA(loglik, gradlik, start=c(10,10),
            control=list(printLevel=1, iterlim=50,
                         SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0),
            nObs=length(yTrain))
print(summary(a)) # the first component is off, the second one is close to the true value
## do with momentum 0.99
cat(" Momentum 0.99:\n")
a <- maxSGA(loglik, gradlik, start=c(10,10),
            control=list(printLevel=1, iterlim=50,
                         SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0.99),
            nObs=length(yTrain))
print(summary(a)) # close to true value