Description

Stochastic Gradient Ascent-based optimizers
Usage

maxSGA(fn = NULL, grad = NULL, hess = NULL, start, nObs,
       constraints = NULL, finalHessian = FALSE, fixed = NULL,
       control = NULL, ...)

maxAdam(fn = NULL, grad = NULL, hess = NULL, start, nObs,
        constraints = NULL, finalHessian = FALSE, fixed = NULL,
        control = NULL, ...)
Arguments

fn
the function to be maximized. As the objective function values are not directly used for optimization, this argument is optional, given grad is provided.
grad
gradient of the objective function. It must have the parameter vector as the first argument, and it must have an argument index to specify the observations in the current minibatch. If grad is not supplied, the gradient is computed numerically from fn.
hess
Hessian matrix of the objective function. Included mainly for compatibility reasons; it is only used for computing the final Hessian if asked to do so by setting finalHessian = TRUE.
start 
initial parameter values. If these have names, the names are also used for results. 
nObs
number of observations. This is used to partition the data into individual batches. The resulting batch indices are forwarded to the grad function through the argument index.
constraints
either NULL for unconstrained optimization (the default), or a list specifying the constraints; see maxNR for details.
finalHessian
how (and if) to calculate the final Hessian. Either FALSE (do not calculate), TRUE (calculate the Hessian), or "bhhh"/"BHHH" for the information-equality approximation. The Hessian matrix is not often used for the optimization problems where one applies SGA, but even if one is not interested in standard errors, it may provide useful information about model performance. If computed by the finite-difference method, the Hessian computation may be very slow.
fixed
parameters to be treated as constants at their start values.
control
list of control parameters. The ones used by these optimizers include:
Adam-specific parameters: Adam_momentum1 and Adam_momentum2, the decay rates for the first and second moments of the gradient.
General stochastic gradient parameters: SG_learningRate, SG_batchSize and SGA_momentum.
Stopping conditions: gradtol, iterlim, SG_patience and SG_patienceStep.
See maxControl for the complete list and the default values.
...
further arguments to fn, grad and hess.
Details

Gradient Ascent (GA) is an optimization method where the algorithm repeatedly takes small steps in the direction of the gradient: the parameter vector theta is updated as theta <- theta + learning rate * gradient of f(theta). In the case of Stochastic GA (SGA), the gradient is not computed on the full set of observations but on a small subset, a batch, potentially a single observation only. In certain circumstances this converges much faster than when using all observations (see Bottou et al., 2018).
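The basic update can be sketched in a few lines of plain R on a made-up quadratic objective (a toy illustration only, not the package implementation; a real SGA step would evaluate the gradient on a random minibatch of observations):

```r
## maximize f(theta) = -(theta - 3)^2 by plain gradient ascent
## (toy illustration; gradf is the gradient of this made-up objective)
gradf <- function(theta) -2*(theta - 3)
theta <- 0
learningRate <- 0.1
for(i in 1:100) {
   theta <- theta + learningRate*gradf(theta)   # step in the gradient direction
}
theta   # converges toward the maximizer 3
```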
If SGA_momentum is positive, the SGA algorithm updates the parameters theta in two steps. First, the momentum is used to update the "velocity" v as v <- momentum*v + learning rate * gradient of f(theta), and thereafter the parameter theta is updated as theta <- theta + v. The initial velocity is set to 0.
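This two-step update can be sketched in plain R on the same kind of made-up quadratic objective (a toy illustration, not the package code):

```r
## momentum update on f(theta) = -(theta - 3)^2 (toy illustration)
gradf <- function(theta) -2*(theta - 3)   # gradient of the made-up objective
theta <- 0
v <- 0                                    # initial velocity is 0
momentum <- 0.9
learningRate <- 0.01
for(i in 1:500) {
   v <- momentum*v + learningRate*gradf(theta)   # update the velocity
   theta <- theta + v                            # then step by the velocity
}
theta   # approaches the maximizer 3
```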
The Adam algorithm is more complex and uses the first and second moments of the stochastic gradients to automatically adjust the learning rate. See Goodfellow et al., 2016, page 301.
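In outline, Adam keeps exponentially decaying estimates of the gradient and of its square, and scales each step by their ratio. The following toy sketch of the update rule uses a made-up quadratic objective (not the maxLik implementation; the decay rates 0.9 and 0.999 follow Goodfellow et al., 2016):

```r
## Adam update rule in outline (toy sketch, not the maxLik implementation)
gradf <- function(theta) -2*(theta - 3)   # gradient of f(theta) = -(theta - 3)^2
theta <- 0
m <- 0; v <- 0                   # first and second moment estimates
beta1 <- 0.9; beta2 <- 0.999     # moment decay rates
lr <- 0.05; eps <- 1e-8
for(step in 1:2000) {
   g <- gradf(theta)
   m <- beta1*m + (1 - beta1)*g       # decaying average of gradients
   v <- beta2*v + (1 - beta2)*g^2     # decaying average of squared gradients
   mhat <- m/(1 - beta1^step)         # bias corrections
   vhat <- v/(1 - beta2^step)
   theta <- theta + lr*mhat/(sqrt(vhat) + eps)   # scaled ascent step
}
theta   # approaches the maximizer 3
```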
The function fn is not directly used for optimization, only for printing or as a stopping condition. In this sense it is up to the user to decide what the function returns, if anything. For instance, it may be useful for fn to compute the objective function on either the full training data or on validation data, and just ignore the index argument. The latter is useful if using patience-based stopping. However, one may also choose to select the observations determined by the index and compute the objective function on the current data batch.
Value

object of class "maxim". Data can be extracted through the following methods:

coef
estimated parameter value.
gradient
vector, last calculated gradient value. Should be close to 0 in case of normal convergence.
estfun
matrix of gradients at the estimated parameter value, evaluated at each observation.
hessian
Hessian at the maximum (the last calculated value if not converged).
storedValues
return values stored at each epoch.
storedParameters
return parameters stored at each epoch.
returnCode
a numeric code that describes the convergence or error.
returnMessage
a short message, describing the return code.
activePar
logical vector, which parameters are optimized over. Contains only TRUE-s if no parameters are fixed.
nIter
number of iterations.
maximType
character string, type of maximization.
maxControl
the optimization control parameters in the form of a MaxControl object.
Author(s)

Ott Toomet, Arne Henningsen
References

Bottou, L.; Curtis, F. & Nocedal, J. (2018): Optimization Methods for Large-Scale Machine Learning. SIAM Review 60, 223–311.

Goodfellow, I.; Bengio, Y.; Courville, A. (2016): Deep Learning. MIT Press.

Henningsen, A. and Toomet, O. (2011): maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26, 443–458.
See Also

A good starting point to learn about the usage of stochastic gradient ascent in the maxLik package is the vignette "Stochastic Gradient Ascent in maxLik".

The other related functions are maxNR for Newton-Raphson, a popular Hessian-based maximization; maxBFGS for maximization using the BFGS, Nelder-Mead (NM), and Simulated Annealing (SANN) methods (based on optim), also supporting inequality constraints; maxLik for a general framework for maximum likelihood estimation (MLE); and optim for different gradient-based optimization methods.
Examples

## estimate the exponential distribution parameter by ML
set.seed(1)
t <- rexp(100, 2)
loglik <- function(theta, index) sum(log(theta) - theta*t[index])
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta, index) sum(1/theta - t[index])
## Estimate with full-batch
a <- maxSGA(loglik, gradlik, start=1, control=list(iterlim=1000,
            SG_batchSize=10), nObs=100)
# note that loglik is not really needed, and is not used
# here, unless more print verbosity is asked
summary(a)
##
## demonstrate the usage of index, and using
## fn for computing the objective function on validation data.
## Create a linear model where variables are very unequally scaled
##
## OLS loglik function: compute the function value on validation data only
loglik <- function(beta, index) {
   e <- yValid - XValid %*% beta
   -crossprod(e)/length(y)
}
## OLS gradient: compute it on training data only
## Use 'index' to select the subset corresponding to the minibatch
gradlik <- function(beta, index) {
   e <- yTrain[index] - XTrain[index,,drop=FALSE] %*% beta
   g <- t(2*t(XTrain[index,,drop=FALSE]) %*% e)
   g/length(index)
}
N <- 1000
## two random variables: one with scale 1, the other with 100
X <- cbind(rnorm(N), rnorm(N, sd=100))
beta <- c(1, 1)  # true parameter values
y <- X %*% beta + rnorm(N, sd=0.2)
## training-validation split
iTrain <- sample(N, 0.8*N)
XTrain <- X[iTrain,,drop=FALSE]
XValid <- X[-iTrain,,drop=FALSE]
yTrain <- y[iTrain]
yValid <- y[-iTrain]
##
## do this without momentum: learning rate must stay small for the gradient not to explode
cat(" No momentum:\n")
a <- maxSGA(loglik, gradlik, start=c(10,10),
            control=list(printLevel=1, iterlim=50,
                         SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0
                         ), nObs=length(yTrain))
print(summary(a))  # the first component is off, the second one is close to the true value
## do with momentum 0.99
cat(" Momentum 0.99:\n")
a <- maxSGA(loglik, gradlik, start=c(10,10),
            control=list(printLevel=1, iterlim=50,
                         SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0.99
                         ), nObs=length(yTrain))
print(summary(a))  # close to true value

Example output:

Loading required package: miscTools

Please cite the 'maxLik' package as:
Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.

If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
https://r-forge.r-project.org/projects/maxlik/

Stochastic Gradient Ascent
Number of iterations: 1000
Return code: 4
Iteration limit exceeded (iterlim)
Function value:
Estimates:
estimate gradient
[1,] 2.099567 0.1279617

No momentum:
Initial function value: 199008.4

Iteration limit exceeded (iterlim)
50 iterations
estimate: 8.05622774286993, 1.56881067557959
Function value: 816.6898

Stochastic Gradient Ascent
Number of iterations: 50
Return code: 4
Iteration limit exceeded (iterlim)
Function value: 816.6898
Estimates:
estimate gradient
[1,] 8.056228 22.18443
[2,] 1.568811 10806.06327

Momentum 0.99:
Initial function value: 199008.4

Iteration limit exceeded (iterlim)
50 iterations
estimate: 8.02519802176155e+23, 8.64902046641081e+25
Function value: 1.835145e+55

Stochastic Gradient Ascent
Number of iterations: 50
Return code: 4
Iteration limit exceeded (iterlim)
Function value: 1.835145e+55
Estimates:
estimate gradient
[1,] 8.025198e+23 2.210386e+27
[2,] 8.649020e+25 1.822211e+30
