nrgkw: Enhanced Newton-Raphson Optimization for GKw Family Distributions

View source: R/RcppExports.R

nrgkw    R Documentation

Enhanced Newton-Raphson Optimization for GKw Family Distributions

Description

An industrial-strength implementation of maximum likelihood estimation (MLE) for the parameters of any distribution in the Generalized Kumaraswamy (GKw) family. This function incorporates multiple advanced numerical techniques including trust region methods, eigenvalue-based regularization, adaptive scaling, and sophisticated line search to ensure robust convergence even for challenging datasets or extreme parameter values.

Usage

nrgkw(
  start = NULL,
  data = as.numeric(c()),
  family = "gkw",
  tol = 1e-06,
  max_iter = 100L,
  verbose = FALSE,
  optimization_method = "trust-region",
  enforce_bounds = TRUE,
  min_param_val = 1e-05,
  max_param_val = 1e+05,
  adaptive_scaling = TRUE,
  use_stochastic_perturbation = TRUE,
  get_num_hess = FALSE,
  multi_start_attempts = 3L,
  eigenvalue_hessian_reg = TRUE,
  max_backtrack = 20L,
  initial_trust_radius = 1
)

Arguments

start

A numeric vector containing initial values for the parameters. If NULL, automatic initialization is used based on the dataset characteristics. The length and order must correspond to the selected family (e.g., c(alpha, beta, gamma, delta, lambda) for "gkw"; c(alpha, beta) for "kw"; c(gamma, delta) for "beta").

data

A numeric vector containing the observed data. All values must be strictly between 0 and 1.

family

A character string specifying the distribution family. One of "gkw", "bkw", "kkw", "ekw", "mc", "kw", or "beta". Default: "gkw".

tol

Convergence tolerance. The algorithm stops when the Euclidean norm of the gradient is below this value, or if relative changes in parameters or the negative log-likelihood are below this threshold across consecutive iterations. Default: 1e-6.

max_iter

Maximum number of iterations allowed. Default: 100.

verbose

Logical; if TRUE, prints detailed progress information during optimization, including iteration number, negative log-likelihood, gradient norm, and step adjustment details. Default: FALSE.

optimization_method

Character string specifying the optimization method: "trust-region" (default), "newton-raphson", or "hybrid" (starts with trust-region, switches to newton-raphson near convergence).

enforce_bounds

Logical; if TRUE, parameter values are constrained to stay within [min_param_val, max_param_val] (and \delta \ge 0) during optimization. Default: TRUE.

min_param_val

Minimum allowed value for parameters constrained to be strictly positive (\alpha, \beta, \gamma, \lambda). Default: 1e-5.

max_param_val

Maximum allowed value for all parameters. Default: 1e5.

adaptive_scaling

Logical; if TRUE, parameters are automatically scaled to improve numerical stability. Default: TRUE.

use_stochastic_perturbation

Logical; if TRUE, applies random perturbations when optimization stalls. Default: TRUE.

get_num_hess

Logical; if TRUE, computes and returns a numerical approximation of the Hessian at the solution. Default: FALSE.

multi_start_attempts

Integer specifying the number of different starting points to try if initial optimization fails to converge. Default: 3.

eigenvalue_hessian_reg

Logical; if TRUE, uses eigenvalue-based regularization for the Hessian matrix. Default: TRUE.

max_backtrack

Integer specifying the maximum number of backtracking steps in line search. Default: 20.

initial_trust_radius

Initial radius for trust region method. Default: 1.0.

Details

This enhanced algorithm provides robust parameter estimation for the Generalized Kumaraswamy distribution and its subfamilies. The function implements several advanced numerical optimization techniques to maximize the likelihood function reliably even in difficult cases.

The GKw Family of Distributions

The Generalized Kumaraswamy (GKw) distribution, introduced by Carrasco, Ferrari, and Cordeiro (2010), is a flexible five-parameter continuous distribution defined on the standard unit interval (0,1). Its probability density function is given by:

f(x; \alpha, \beta, \gamma, \delta, \lambda) = \frac{\lambda\alpha\beta x^{\alpha-1}}{B(\gamma, \delta+1)}(1-x^{\alpha})^{\beta-1}[1-(1-x^{\alpha})^{\beta}]^{\gamma\lambda-1}\{1-[1-(1-x^{\alpha})^{\beta}]^{\lambda}\}^{\delta}

where \alpha, \beta, \gamma, \lambda > 0 and \delta \geq 0 are the model parameters, and B(\gamma, \delta+1) is the beta function.

The GKw distribution encompasses several important special cases:

  • GKw (5 parameters): \alpha, \beta, \gamma, \delta, \lambda

  • BKw (4 parameters): \alpha, \beta, \gamma, \delta (with \lambda = 1)

  • KKw (4 parameters): \alpha, \beta, \delta, \lambda (with \gamma = 1)

  • EKw (3 parameters): \alpha, \beta, \lambda (with \gamma = 1, \delta = 0)

  • Mc (3 parameters): \gamma, \delta, \lambda (with \alpha = 1, \beta = 1)

  • Kw (2 parameters): \alpha, \beta (with \gamma = 1, \delta = 0, \lambda = 1)

  • Beta (2 parameters): \gamma, \delta (with \alpha = 1, \beta = 1, \lambda = 1)
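For concreteness, the density above can be transcribed directly into R. The sketch below is illustrative only (the package evaluates the likelihood in compiled C++ code) and assumes 0 < x < 1 with admissible parameter values:

# Hedged sketch: direct R transcription of the GKw density above
dgkw_sketch <- function(x, alpha, beta, gamma, delta, lambda) {
  u <- 1 - x^alpha                    # (1 - x^alpha)
  v <- 1 - u^beta                     # 1 - (1 - x^alpha)^beta
  w <- 1 - v^lambda                   # 1 - [1 - (1 - x^alpha)^beta]^lambda
  lambda * alpha * beta * x^(alpha - 1) * u^(beta - 1) *
    v^(gamma * lambda - 1) * w^delta / base::beta(gamma, delta + 1)
}

# Setting gamma = 1, delta = 0, lambda = 1 recovers the Kumaraswamy density;
# setting alpha = beta = lambda = 1 recovers the Beta(gamma, delta + 1) density.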

Trust Region Method with Levenberg-Marquardt Algorithm

The trust region approach restricts parameter updates to a region where a quadratic approximation of the objective function is trusted to be accurate. Writing f(\theta) = -\ell(\theta) for the negative log-likelihood being minimized, this algorithm implements the Levenberg-Marquardt variant, which solves the subproblem:

\min_p m_k(p) = \nabla f(\theta_k)^T p + \frac{1}{2}p^T H_k p

\text{subject to } \|p\| \leq \Delta_k

where \nabla f(\theta_k) is the gradient, H_k is the Hessian of f, and \Delta_k is the trust region radius at iteration k.

The Levenberg-Marquardt approach adds a regularization parameter \lambda to the Hessian, solving:

(H_k + \lambda I)p = -\nabla f(\theta_k)

The parameter \lambda controls the step size and direction:

  • When \lambda is large, the step approaches a scaled steepest descent direction.

  • When \lambda is small, the step approaches the Newton direction.

The algorithm dynamically adjusts \lambda based on the agreement between the quadratic model and the actual function:

\rho_k = \frac{f(\theta_k) - f(\theta_k + p_k)}{m_k(0) - m_k(p_k)}

The trust region radius is updated according to:

  • If \rho_k < 0.25, reduce the radius: \Delta_{k+1} = 0.5\Delta_k

  • If \rho_k > 0.75 and \|p_k\| \approx \Delta_k, increase the radius: \Delta_{k+1} = 2\Delta_k

  • Otherwise, leave the radius unchanged: \Delta_{k+1} = \Delta_k

The step is accepted if \rho_k > \eta (typically \eta = 0.1).
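As an illustration of these rules (not the package's internal C++ routine), a single damped update can be sketched in R as follows, assuming hypothetical helpers nll(), grad(), and hess() that return the negative log-likelihood, its gradient, and its Hessian:

# Hedged sketch of one Levenberg-Marquardt / trust-region update
lm_step <- function(theta, Delta, damp, eta = 0.1) {
  g <- grad(theta)                                  # gradient of f (hypothetical helper)
  H <- hess(theta)                                  # Hessian of f (hypothetical helper)
  p <- solve(H + damp * diag(length(theta)), -g)    # damped Newton step
  if (sqrt(sum(p^2)) > Delta) {                     # keep the step inside the radius
    p <- p * Delta / sqrt(sum(p^2))
  }
  pred <- -(sum(g * p) + 0.5 * sum(p * (H %*% p)))  # predicted decrease m_k(0) - m_k(p)
  rho  <- (nll(theta) - nll(theta + p)) / pred      # agreement ratio rho_k
  if (rho < 0.25) {
    Delta <- 0.5 * Delta                            # poor agreement: shrink the radius
  } else if (rho > 0.75 && sqrt(sum(p^2)) >= 0.99 * Delta) {
    Delta <- 2 * Delta                              # strong agreement at the boundary: grow
  }
  if (rho > eta) theta <- theta + p                 # accept the step only if rho_k > eta
  list(theta = theta, Delta = Delta)
}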

Eigenvalue-Based Hessian Regularization

For the Newton-Raphson step to be a reliable descent direction, the Hessian matrix must be positive definite. When it is not, this algorithm uses eigendecomposition to create a positive definite approximation that preserves the Hessian's structure:

H = Q\Lambda Q^T

where Q contains the eigenvectors and \Lambda is a diagonal matrix of eigenvalues.

The regularized Hessian is constructed by:

\tilde{H} = Q\tilde{\Lambda}Q^T

where \tilde{\Lambda} contains modified eigenvalues:

\tilde{\lambda}_i = \max(\lambda_i, \epsilon)

with \epsilon being a small positive threshold (typically 10^{-6}).

This approach is superior to diagonal loading (H + \lambda I) as it:

  • Preserves the eigenvector structure of the original Hessian

  • Modifies the eigenvalues only as much as needed to ensure positive definiteness

  • Better maintains the directional information in the Hessian
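A minimal R sketch of this eigenvalue clipping (illustrative; the package performs the equivalent operation in C++):

# Clip the eigenvalues of a (near-)symmetric Hessian from below at eps
regularize_hessian <- function(H, eps = 1e-6) {
  eig <- eigen((H + t(H)) / 2, symmetric = TRUE)   # symmetrize, then decompose
  lam <- pmax(eig$values, eps)                     # lambda_i <- max(lambda_i, eps)
  eig$vectors %*% diag(lam, length(lam)) %*% t(eig$vectors)
}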

Parameter Scaling for Numerical Stability

When parameters have widely different magnitudes, optimization can become numerically unstable. The adaptive scaling system transforms the parameter space to improve conditioning:

\theta_i^{scaled} = s_i \theta_i

where scaling factors s_i are determined by:

  • For large parameters (|\theta_i| > 100): s_i = 100/|\theta_i|

  • For small parameters (0 < |\theta_i| < 0.01): s_i = 0.01/|\theta_i|

  • Otherwise: s_i = 1

The optimization is performed in the scaled space, with appropriate transformations for the gradient and Hessian:

\nabla \ell(\theta^{scaled})_i = \frac{1}{s_i}\nabla \ell(\theta)_i

H(\theta^{scaled})_{ij} = \frac{1}{s_i s_j}H(\theta)_{ij}

The final results are back-transformed to the original parameter space before being returned.
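An illustrative R sketch of this scaling rule and the corresponding transformations; theta, g, and H are assumed to be the parameter vector, gradient, and Hessian in the original space:

# Compute scaling factors as described above
scaling_factors <- function(theta) {
  s <- rep(1, length(theta))
  big   <- abs(theta) > 100
  small <- abs(theta) < 0.01 & abs(theta) > 0
  s[big]   <- 100  / abs(theta[big])
  s[small] <- 0.01 / abs(theta[small])
  s
}

# s            <- scaling_factors(theta)
# theta_scaled <- s * theta            # optimize in the scaled space
# g_scaled     <- g / s                # gradient transformation
# H_scaled     <- H / outer(s, s)      # Hessian transformation
# theta        <- theta_scaled / s     # back-transform the final estimates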

Line Search with Wolfe Conditions

The line search procedure ensures sufficient decrease in the objective function when taking a step in the search direction. The implementation uses Wolfe conditions which include both:

  1. Sufficient decrease (Armijo) condition:

    f(\theta_k + \alpha p_k) \leq f(\theta_k) + c_1 \alpha \nabla f(\theta_k)^T p_k

  2. Curvature condition:

    |\nabla f(\theta_k + \alpha p_k)^T p_k| \leq c_2 |\nabla f(\theta_k)^T p_k|

where 0 < c_1 < c_2 < 1, typically c_1 = 10^{-4} and c_2 = 0.9.

The step length \alpha is determined using polynomial interpolation:

  • First iteration: quadratic interpolation

  • Subsequent iterations: cubic interpolation using function and derivative values

The cubic polynomial model has the form:

a\alpha^3 + b\alpha^2 + c\alpha + d

The coefficients are computed from function and derivative values at two trial points; the minimizer of this polynomial, safeguarded by bounds on the admissible step length, is taken as the next trial value of \alpha.
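For illustration, the two Wolfe conditions can be checked as follows (hedged sketch, again assuming hypothetical nll() and grad() helpers for the objective and its gradient):

# Check the Wolfe conditions for a step length alpha along direction p
wolfe_ok <- function(theta, p, alpha, c1 = 1e-4, c2 = 0.9) {
  f0 <- nll(theta)
  g0 <- sum(grad(theta) * p)                                    # directional derivative at 0
  armijo    <- nll(theta + alpha * p) <= f0 + c1 * alpha * g0   # sufficient decrease
  curvature <- abs(sum(grad(theta + alpha * p) * p)) <= c2 * abs(g0)
  armijo && curvature
}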

Adaptive Numerical Differentiation

When analytical derivatives are unreliable, the algorithm uses numerical differentiation with adaptive step sizes based on parameter magnitudes:

h_i = \max(h_{min}, \min(h_{base}, h_{base} \cdot |\theta_i|))

where h_{min} is a minimum step size (typically 10^{-8}), h_{base} is a base step size (typically 10^{-5}), and \theta_i is the parameter value.

For computing diagonal Hessian elements, the central difference formula is used:

\frac{\partial^2 f}{\partial \theta_i^2} \approx \frac{f(\theta + h_i e_i) - 2f(\theta) + f(\theta - h_i e_i)}{h_i^2}

For mixed partial derivatives:

\frac{\partial^2 f}{\partial \theta_i \partial \theta_j} \approx \frac{f(\theta + h_i e_i + h_j e_j) - f(\theta + h_i e_i - h_j e_j) - f(\theta - h_i e_i + h_j e_j) + f(\theta - h_i e_i - h_j e_j)}{4h_i h_j}

The algorithm validates numerical differentiation by comparing with existing gradients and adaptively adjusts step sizes when discrepancies are detected.
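A hedged R sketch of the adaptive central-difference scheme for the diagonal Hessian elements (f denotes the objective function; the package's internal implementation may differ in detail):

# Adaptive step sizes and central-difference second derivatives (diagonal only)
num_hess_diag <- function(f, theta, h_base = 1e-5, h_min = 1e-8) {
  h <- pmax(h_min, pmin(h_base, h_base * abs(theta)))   # h_i from the formula above
  vapply(seq_along(theta), function(i) {
    e <- replace(numeric(length(theta)), i, h[i])       # h_i * e_i
    (f(theta + e) - 2 * f(theta) + f(theta - e)) / h[i]^2
  }, numeric(1))
}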

Stochastic Perturbation

To escape flat regions or local minima, the algorithm implements controlled stochastic perturbation when progress stalls (detected by monitoring gradient norm changes):

\theta_i^{new} = \theta_i + \Delta\theta_i

where the perturbation \Delta\theta_i combines:

  • A directed component opposite to the gradient: -\text{sign}(\nabla \ell_i) \cdot 0.05 \cdot |\theta_i|

  • A random noise component: U(-0.05|\theta_i|, 0.05|\theta_i|)

The perturbation is applied when:

  • The relative change in gradient norm is below a threshold for several consecutive iterations

  • The algorithm appears to be stuck in a non-optimal region

The perturbation is accepted only if it improves the objective function value.
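A hedged sketch of the perturbation rule; g denotes the current gradient of the objective, and the constant 0.05 follows the description above (internal constants may differ):

# Directed move against the gradient plus bounded uniform noise
perturb <- function(theta, g, scale = 0.05) {
  directed <- -sign(g) * scale * abs(theta)                 # component opposite the gradient
  noise    <- runif(length(theta), -1, 1) * scale * abs(theta)
  theta + directed + noise
}
# A perturbed candidate would be kept only if it lowers the objective value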

Multi-Start Strategy

For particularly challenging optimization landscapes, the algorithm can employ multiple starting points:

  • Initial values are generated using moment-based estimation and domain knowledge about each distribution family

  • Each initial point is randomly perturbed to explore different regions of the parameter space

  • Independent optimization runs are performed from each starting point

  • The best result (based on likelihood value and convergence status) is returned

This approach increases the probability of finding the global optimum or a high-quality local optimum, particularly for complex models with many parameters.
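From the user's side, this behaviour is controlled through the documented multi_start_attempts argument; an illustrative call, assuming y is a numeric vector of observations in (0, 1):

fit_ms <- nrgkw(NULL, y, family = "gkw", multi_start_attempts = 5L)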

Advanced Parameter Initialization

Intelligent starting values are critical for convergence in complex models. The algorithm uses data-driven initialization based on:

  • Method of moments estimators for beta parameters:

    \alpha + \beta = \frac{\bar{x}(1-\bar{x})}{s^2} - 1

    \alpha = (\alpha + \beta)\bar{x}

  • Transformations to Kumaraswamy parameters:

    a_{Kw} = \sqrt{\alpha_{Beta}}

    b_{Kw} = \sqrt{\beta_{Beta}}

  • Adjustments based on data skewness (detected via mean relative to 0.5)

  • Corrections based on data dispersion (range relative to (0,1) interval)

The transformations between beta and Kumaraswamy parameters leverage the similarities between these distributions while accounting for their parametric differences.
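A hedged R sketch of the moment-based part of this initialization (illustrative only; the function's internal initialization additionally applies the skewness and dispersion adjustments described above):

# Method-of-moments starting values for the Beta case, with a square-root map
# toward Kumaraswamy-like starting values
init_from_moments <- function(x) {
  m  <- mean(x)
  v  <- var(x)
  ab <- m * (1 - m) / v - 1                 # alpha + beta
  alpha_beta <- c(ab * m, ab * (1 - m))     # (alpha, beta)
  list(beta_start = alpha_beta,
       kw_start   = sqrt(alpha_beta))       # a_Kw = sqrt(alpha), b_Kw = sqrt(beta)
}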

Hybrid Optimization Strategy

The algorithm can dynamically switch between trust region and Newton-Raphson methods based on optimization progress:

  • Early iterations: trust region method for stability and global convergence properties

  • Later iterations (when close to optimum): Newton-Raphson with line search for quadratic convergence rate

The switching criteria are based on iteration count and gradient norm, with additional logic to handle cases where one method consistently fails.
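An illustrative call requesting the hybrid strategy directly (assuming y holds observations in (0, 1)):

fit_hybrid <- nrgkw(NULL, y, family = "ekw", optimization_method = "hybrid")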

Value

A list object of class gkw_fit containing the following components:

parameters

A named numeric vector with the estimated parameters.

loglik

The maximized value of the log-likelihood function.

iterations

Number of iterations performed.

converged

Logical flag indicating whether the algorithm converged successfully.

param_history

A matrix where rows represent iterations and columns represent parameter values.

loglik_history

A vector of log-likelihood values at each iteration.

gradient

The gradient vector at the final parameter estimates.

hessian

The analytical Hessian matrix at the final parameter estimates.

std_errors

A named numeric vector of approximate standard errors for the parameters.

aic

Akaike Information Criterion.

bic

Bayesian Information Criterion.

aicc

AIC with small sample correction.

n

The sample size.

status

A character string indicating the termination status.

z_values

A named numeric vector of Z-statistics for parameter significance testing.

p_values

A named numeric vector of two-sided p-values corresponding to the Z-statistics.

param_names

A character vector of the parameter names.

family

The distribution family used.

optimization_method

The optimization method used.

numeric_hessian

The numerically approximated Hessian at the solution (if requested).

condition_number

The condition number of the final Hessian, a measure of parameter identifiability.

scaling_factors

The scaling factors used for parameters (if adaptive scaling was enabled).

Warning

Although this implementation is highly robust, fitting complex distributions can still be challenging. For best results:

  • Try multiple starting values if results seem suboptimal

  • Examine diagnostic information carefully, especially condition numbers and standard errors

  • Be cautious of parameter estimates at or very near boundaries

  • Consider model simplification if convergence is consistently problematic

  • For the full GKw model with 5 parameters, convergence may be sensitive to starting values

  • High condition numbers (>1e6) may indicate parameter redundancy or weak identifiability

Author(s)

Enhanced by Lopes, J. E.

References

Carrasco, J. M. F., Ferrari, S. L. P., & Cordeiro, G. M. (2010). A new generalized Kumaraswamy distribution. arXiv preprint arXiv:1004.0911.

Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.

Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust Region Methods. MPS-SIAM Series on Optimization.

Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2), 79-88.

Examples


# Generate sample data from a Beta(2,5) distribution for testing
set.seed(123)
sample_data <- rbeta_(200, 2, 5)

# Automatic initialization (recommended for beginners)
result_auto <- nrgkw(NULL, sample_data, family = "beta", verbose = FALSE)
print(result_auto$parameters)
print(result_auto$loglik)

# Compare different optimization methods
methods <- c("trust-region", "newton-raphson", "hybrid")
results <- list()

for (method in methods) {
  results[[method]] <- nrgkw(NULL, sample_data, family = "beta",
                               optimization_method = method)
  cat(sprintf("Method: %s, AIC: %.4f\n", method, results[[method]]$aic))
}

# Fit the full GKw model with diagnostic information
gkw_result <- nrgkw(NULL, sample_data, family = "gkw",
                      verbose = FALSE, get_num_hess = TRUE)

# Examine parameter identifiability through condition number
cat(sprintf("Condition number: %.2e\n", gkw_result$condition_number))
print(gkw_result)

# Compare with simpler models using information criteria
cat("Information criteria comparison:\n")
cat(sprintf("GKw: AIC=%.4f, BIC=%.4f\n", gkw_result$aic, gkw_result$bic))
cat(sprintf("Beta: AIC=%.4f, BIC=%.4f\n",
           results[["trust-region"]]$aic, results[["trust-region"]]$bic))

