maxSGA: Stochastic Gradient Ascent
In maxLik: Maximum Likelihood Estimation and Related Tools

maxSGA

R Documentation

Stochastic Gradient Ascent

Description

Stochastic Gradient Ascent–based optimizers

Usage

maxSGA(fn = NULL, grad = NULL, hess = NULL, start,
       nObs,
       constraints = NULL, finalHessian = FALSE, 
       fixed = NULL, control=NULL, ... )
maxAdam(fn = NULL, grad = NULL, hess = NULL, start,
        nObs,
        constraints = NULL, finalHessian = FALSE, 
        fixed = NULL, control=NULL, ... )

Arguments

`fn`	the function to be maximized. As the objective function values are not directly used for optimization, this argument is optional, given `grad` is provided. It must have the parameter vector as the first argument, and it must have an argument `index` to specify the integer index of the selected observations. It must return either a single number, or a numeric vector (this is is summed internally). If the parameters are out of range, `fn` should return `NA`. See details for constant parameters. `fn` may also return attributes "gradient" and/or "hessian". If these attributes are set, the algorithm uses the corresponding values as gradient and Hessian.
`grad`	gradient of the objective function. It must have the parameter vector as the first argument, and it must have an argument `index` to specify the integer index of selected observations. It must return either a gradient vector of the objective function, or a matrix, where columns correspond to individual parameters. The column sums are treated as gradient components. If `NULL`, finite-difference gradients are computed. If `fn` returns an object with attribute `gradient`, this argument is ignored. If `grad` is not supplied, it is computed by finite-difference method using `fn`. However, this is only adviseable for small-scale tests, not for any production run. Obviously, `fn` must be correctly defined in that case.
`hess`	Hessian matrix of the function. Mainly for compatibility reasons, only used for computing the final Hessian if asked to do so by setting `finalHessian` to `TRUE`. It must have the parameter vector as the first argument and it must return the Hessian matrix of the objective function. If missing, either finite-difference Hessian, based on `gradient` or BHHH approach is computed if asked to do so.
`start`	initial parameter values. If these have names, the names are also used for results.
`nObs`	number of observations. This is used to partition the data into individual batches. The resulting batch indices are forwarded to the `grad` function through the argument `index`.
`constraints`	either `NULL` for unconstrained optimization or a list with two components. The components may be either `eqA` and `eqB` for equality-constrained optimization `A \theta + B = 0`; or `ineqA` and `ineqB` for inequality constraints `A \theta + B > 0`. More than one row in `ineqA` and `ineqB` corresponds to more than one linear constraint, in that case all these must be zero (equality) or positive (inequality constraints). The equality-constrained problem is forwarded to `sumt`, the inequality-constrained case to `constrOptim2`.
`finalHessian`	how (and if) to calculate the final Hessian. Either `FALSE` (do not calculate), `TRUE` (use analytic/finite-difference Hessian) or `"bhhh"`/`"BHHH"` for the information equality approach. The latter approach is only suitable when working with a log-likelihood function, and it requires the gradient/log-likelihood to be supplied by individual observations. Hessian matrix is not often used for optimization problems where one applies SGA, but even if one is not interested in standard errors, it may provide useful information about the model performance. If computed by finite-difference method, the Hessian computation may be very slow.
`fixed`	parameters to be treated as constants at their `start` values. If present, it is treated as an index vector of `start` parameters.
`control`	list of control parameters. The ones used by these optimizers are SGA_momentum 0, numeric momentum parameter for SGA. Must lie in interval `[0,1]`. See details. Adam-specific parameters Adam_momentum1 0.9, numeric in interval `(0,1)`, the first moment momentum Adam_momentum2 0.999, numeric in interval `(0,1)`, the second moment momentum General stochastic gradient parameters: SG_learningRate step size the SGA algorithm takes in the gradient direction. If 1, the step equals to the gradient value. A good value is often 0.01–0.3 SG_batchSize SGA batch size, an integer between 1 and `nObs`. If `NULL` (default), the full batch gradient is computed. SG_clip `NULL`, gradient clipping threshold. The algorithm ensures that `\|\|g(\theta)\|\|_2^2 \le \kappa` where `\kappa` is the `SG_clip` value. If the actual norm of the gradient exceeds (square root of) `\kappa`, the gradient will be scaled back accordingly while preserving its direction. `NULL` means no clipping. Stopping conditions: gradtol stopping condition. Stop if norm of the gradient is less than `gradtol`. Default 0, i.e. do not use this condition. This condition is useful if the objective is to drive full batch gradient to zero on training data. It is not a good objective in case of the stochastic gradient, and if the objective is to optimize the objective on validation data. SG_patience `NULL`, or integer. Stopping condition: the algorithm counts how many times the objective function has been worse than its best value so far, and if this exceeds `SG_patience`, the algorithm stops. SG_patienceStep 1L, integer. After how many epochs to check the patience value. `1` means to check at each epoch, and hence to compute the objective function. This may be undesirable if the objective function is costly to compute. iterlim stopping condition. Stop if more than `iterlim` epochs, return `code=4`. Epoch is a set of iterations that cycles through all observations. In case of full batch, iterations and epochs are equivalent. If `iterlim = 0`, does not do any learning and returns the initial values unchanged. printLevel this argument determines the level of printing which is done during the optimization process. The default value 0 means that no printing occurs, 1 prints the initial and final details, 2 prints all the main tracing information for every epoch. Higher values will result in even more output. storeParameters logical, whether to store and return the parameter values at each epoch. If `TRUE`, the stored values can be retrieved with `storedParameters`-method. The parameters are stored as a matrix with rows corresponding to the epochs and columns to the parameter components. There are `iterlim` + 1 rows, where the first one corresponds to the initial parameters. Default `FALSE`. storeValues logical, whether to store and return the objective function values at each epoch. If `TRUE`, the stored values can be retrieved with `storedValues`-method. There are `iterlim` + 1 values, where the first one corresponds to the value at the initial parameters. Default `FALSE`. See `maxControl` for more information.
`...`	further arguments to `fn`, `grad` and `hess`. To maintain compatibility with the earlier versions, ... also passes certain control options to the optimizers.

Details

Gradient Ascent (GA) is a optimization method where the algorithm repeatedly takes small steps in the gradient's direction, the parameter vector \theta is updated as \theta \leftarrow theta + \mathrm{learning rate}\cdot \nabla f(\theta). In case of Stochastic GA (SGA), the gradient is not computed on the full set of observations but on a small subset, batch, potentially a single observation only. In certain circumstances this converges much faster than when using all observation (see Bottou et al, 2018).

If SGA_momentum is positive, the SGA algorithm updates the parameters \theta in two steps. First, the momentum is used to update the “velocity” v as v \leftarrow \mathrm{momentum}\cdot v + \mathrm{learning rate}\cdot \nabla f(\theta), and thereafter the parameter \theta is updates as \theta \leftarrow \theta + v. Initial velocity is set to 0.

The Adam algorithm is more complex and uses first and second moments of stochastic gradients to automatically adjust the learning rate. See Goodfellow et al, 2016, page 301.

The function fn is not directly used for optimization, only for printing or as a stopping condition. In this sense it is up to the user to decide what the function returns, if anything. For instance, it may be useful for fn to compute the objective function on either full training data, or on validation data, and just ignore the index argument. The latter is useful if using patience-based stopping. However, one may also choose to select the observations determined by the index to compute the objective function on the current data batch.

Value

object of class "maxim". Data can be extracted through the following methods:

`maxValue`	`fn` value at maximum (the last calculated value if not converged.)
`coef`	estimated parameter value.
`gradient`	vector, last calculated gradient value. Should be close to 0 in case of normal convergence.
`estfun`	matrix of gradients at parameter value `estimate` evaluated at each observation (only if `grad` returns a matrix or `grad` is not specified and `fn` returns a vector).
`hessian`	Hessian at the maximum (the last calculated value if not converged).
`storedValues`	return values stored at each epoch
`storedParameters`	return parameters stored at each epoch
`returnCode`	a numeric code that describes the convergence or error.
`returnMessage`	a short message, describing the return code.
`activePar`	logical vector, which parameters are optimized over. Contains only `TRUE`-s if no parameters are fixed.
`nIter`	number of iterations.
`maximType`	character string, type of maximization.
`maxControl`	the optimization control parameters in the form of a `MaxControl` object.

Author(s)

Ott Toomet, Arne Henningsen

References

Bottou, L.; Curtis, F. & Nocedal, J.: Optimization Methods for Large-Scale Machine Learning SIAM Review, 2018, 60, 223–311.

Goodfellow, I.; Bengio, Y.; Courville, A. (2016): Deep Learning, MIT Press

Henningsen, A. and Toomet, O. (2011): maxLik: A package for maximum likelihood estimation in R Computational Statistics 26, 443–458

Examples

## estimate the exponential distribution parameter by ML
set.seed(1)
t <- rexp(100, 2)
loglik <- function(theta, index) sum(log(theta) - theta*t[index])
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta, index) sum(1/theta - t[index])
## Estimate with full-batch
a <- maxSGA(loglik, gradlik, start=1, control=list(iterlim=1000,
            SG_batchSize=10), nObs=100)
            # note that loglik is not really needed, and is not used
            # here, unless more print verbosity is asked
summary(a)
##
## demonstrate the usage of index, and using
## fn for computing the objective function on validation data.
## Create a linear model where variables are very unequally scaled
##
## OLS loglik function: compute the function value on validation data only
loglik <- function(beta, index) {
   e <- yValid - XValid %*% beta
   -crossprod(e)/length(y)
}
## OLS gradient: compute it on training data only
## Use 'index' to select the subset corresponding to the minibatch
gradlik <- function(beta, index)  {
   e <- yTrain[index] - XTrain[index,,drop=FALSE] %*% beta
   g <- t(-2*t(XTrain[index,,drop=FALSE]) %*% e)
   -g/length(index)
}
N <- 1000
## two random variables: one with scale 1, the other with 100
X <- cbind(rnorm(N), rnorm(N, sd=100))
beta <- c(1, 1)  # true parameter values
y <- X %*% beta + rnorm(N, sd=0.2)
## training-validation split
iTrain <- sample(N, 0.8*N)
XTrain <- X[iTrain,,drop=FALSE]
XValid <- X[-iTrain,,drop=FALSE]
yTrain <- y[iTrain]
yValid <- y[-iTrain]
##
## do this without momentum: learning rate must stay small for the gradient not to explode
cat("  No momentum:\n")
a <- maxSGA(loglik, gradlik, start=c(10,10),
           control=list(printLevel=1, iterlim=50,
                        SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0
                        ), nObs=length(yTrain))
print(summary(a))  # the first component is off, the second one is close to the true value
## do with momentum 0.99
cat("  Momentum 0.99:\n")
a <- maxSGA(loglik, gradlik, start=c(10,10),
           control=list(printLevel=1, iterlim=50,
                        SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0.99
                        # no momentum
                        ), nObs=length(yTrain))
print(summary(a))  # close to true value

maxLik documentation built on May 29, 2024, 2:32 a.m.