---
title: 'gkwreg: An R Package for Generalized Kumaraswamy Regression Models for Bounded Data'
tags:
  - R
  - gkwreg
  - generalized kumaraswamy
  - regression
  - maximum likelihood
  - bounded data
  - TMB
authors:
  - name: José Evandeilton Lopes
    orcid: 0009-0007-5887-4084
    affiliation: "1"
  - name: Wagner Hugo Bonat
    orcid: 0000-0002-0349-7054
    affiliation: "1"
affiliations:
  - name: Paraná Federal University, Brazil
    index: 1
  # - name: Universidade Federal do Paraná (UFPR)
  #   index: 2
citation_author: Lopes, JE & Bonat, WH
date: '19 July 2025'
year: '2025'
bibliography: paper.bib
output: rticles::joss_article
csl: apa.csl
journal: JOSS
header-includes:
  - \usepackage{booktabs}
  - \usepackage{array}
  # - \usepackage{amsmath}
  - \usepackage{amssymb}
  - \usepackage{amsfonts}
  - \usepackage{multirow}
  - \usepackage{longtable}
  - \usepackage{caption}
  - \usepackage{float}
  # - \usepackage{colortbl} # Uncomment if necessary
  # - \usepackage{threeparttable} # Uncomment if necessary
---
gkwreg is an R package for fitting regression models to data restricted to the
unit interval $(0,1)$, such as proportions, rates, and indices.
The package implements the flexible five-parameter Generalized Kumaraswamy (GKw)
distribution and its seven main subfamilies, including the widely used Beta and
Kumaraswamy distributions. A key feature of gkwreg is its use of the Template
Model Builder (TMB) framework, which leverages automatic differentiation and
C++ templates to provide fast, stable, and accurate maximum likelihood estimation.
This overcomes the significant computational challenges typically associated
with such complex multiparametric models, making them accessible for practical
application. The package provides a user-friendly interface with standard R
methods for model specification, inference, and diagnostics.
Statistical modeling of data bounded in the interval $(0,1)$ is frequent across fields such as economics, epidemiology, and social sciences. Traditional methods like variable transformations followed by linear regression often present interpretability issues and fail near boundary points.
Direct modeling using distributions defined on $(0,1)$ is preferable. While the Beta distribution is commonly used, it can be insufficient for complex patterns and lacks a closed-form cumulative distribution function (CDF). The Kumaraswamy (Kw) distribution [@kumaraswamy1980] offers an analytically simple CDF, yet its two-parameter form may be overly restrictive. To overcome these limitations, the Generalized Kumaraswamy (GKw) distribution was introduced [@carrasco2010]: a flexible five-parameter family that contains the Beta and Kw distributions as special cases. However, practical applications of the GKw distribution in regression contexts have faced computational challenges. Its complex likelihood function makes Maximum Likelihood Estimation (MLE) computationally demanding and numerically unstable, which calls for efficient and user-friendly computational tools.
The R package gkwreg [@gkwreg] addresses this need. Built on the Template
Model Builder (TMB) package [@Kristensen2016], it leverages automatic
differentiation (AD) in C++ to efficiently compute gradients and Hessians,
significantly enhancing speed, accuracy, and stability of MLE, especially when
distribution parameters vary with covariates. gkwreg offers an intuitive
interface aligned with standard R modeling conventions. Its integration
with the multi-part formula syntax of the Formula package [@Zeileis2010]
allows flexible specification of regression structures. Additionally, it provides
comprehensive S3 methods (summary(), predict(), plot(), residuals())
and randomized quantile residuals [@Dunn1996] for model diagnostics, facilitating
robust goodness-of-fit assessments.
The Probability Density Function (PDF) of the five-parameter Generalized Kumaraswamy (GKw) distribution is given by:
$$f(y; \boldsymbol{\theta}) = \frac{\lambda\,\alpha\,\beta\,y^{\alpha-1}}{B(\gamma, \delta+1)}\,\bigl(1-y^\alpha\bigr)^{\beta-1}\,\bigl[1-\bigl(1-y^\alpha\bigr)^\beta\bigr]^{\gamma\lambda-1}\,\left\{1-\bigl[1-\bigl(1-y^\alpha\bigr)^\beta\bigr]^\lambda\right\}^\delta$$
where $\boldsymbol{\theta} = (\alpha, \beta, \gamma, \delta, \lambda)^\top$ is the vector of positive shape parameters and $B(\cdot, \cdot)$ is the beta function.
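As a quick sanity check on the density above, the following minimal base-R sketch transcribes the formula term by term and verifies numerically that it integrates to one and that it reduces to the Beta and Kw densities for particular parameter values. The function name `dgkw_sketch` and the parameter values are hypothetical and chosen only for illustration; this is not the package's internal implementation.

```r
# Direct transcription of the GKw density formula (illustrative sketch only).
dgkw_sketch <- function(y, alpha, beta, gamma, delta, lambda) {
  v <- 1 - y^alpha   # (1 - y^alpha)
  w <- 1 - v^beta    # 1 - (1 - y^alpha)^beta
  lambda * alpha * beta * y^(alpha - 1) / base::beta(gamma, delta + 1) *
    v^(beta - 1) * w^(gamma * lambda - 1) * (1 - w^lambda)^delta
}

# 1) The density should integrate to one over (0, 1).
total <- integrate(dgkw_sketch, lower = 0, upper = 1,
                   alpha = 2, beta = 3, gamma = 1.5,
                   delta = 0.5, lambda = 1.2)$value

# 2) With alpha = beta = lambda = 1 the formula is a Beta(gamma, delta + 1) density.
y_grid <- seq(0.05, 0.95, by = 0.05)
beta_case <- dgkw_sketch(y_grid, alpha = 1, beta = 1, gamma = 2,
                         delta = 3, lambda = 1)

# 3) With gamma = 1, delta = 0, lambda = 1 it is the Kumaraswamy density.
kw_case <- dgkw_sketch(y_grid, alpha = 2, beta = 3, gamma = 1,
                       delta = 0, lambda = 1)
kw_ref  <- 2 * 3 * y_grid^(2 - 1) * (1 - y_grid^2)^(3 - 1)
```

Comparing `beta_case` with `dbeta(y_grid, 2, 4)` and `kw_case` with `kw_ref` illustrates how the Beta and Kw distributions arise as special cases of the five-parameter family.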
gkwreg implements a comprehensive distributional regression framework where all relevant distribution parameters can be modeled as functions of covariates through flexible link functions. For a response variable $y_i \in (0,1)$ following a GKw family distribution, each parameter $\theta_{ip} \in \{\alpha_i, \beta_i, \gamma_i, \delta_i, \lambda_i\}$ is related to a linear predictor via:
$$g_p(\theta_{ip}) = \eta_{ip} = \mathbf{x}_{ip}^\top \boldsymbol{\beta}_p$$
where $g_p(\cdot)$ is a suitable link function, $\mathbf{x}_{ip}$ is the covariate vector for the $i$-th observation and $p$-th parameter, and $\boldsymbol{\beta}_p$ is the corresponding coefficient vector. The package employs an extended formula syntax allowing users to specify parameter-specific linear predictors through the notation y ~ alpha_predictors | beta_predictors | gamma_predictors | delta_predictors | lambda_predictors. Multiple link functions are supported: for positive parameters ($\alpha, \beta, \gamma, \lambda$), options include logarithmic (default), square root, inverse, and identity links; for probability parameters ($\delta \in (0,1)$), logit (default), probit, complementary log-log, and Cauchy links are available. Additionally, link scaling functionality allows control over transformation intensity, enhancing numerical stability in challenging optimization scenarios. Maximum likelihood estimation maximizes the log-likelihood function:
$$\ell(\boldsymbol{\Theta}; \mathbf{y}, \mathbf{X}) = \sum_{i=1}^{n} \log f(y_i; \boldsymbol{\theta}_i(\boldsymbol{\Theta}))$$
where TMB computes exact gradients $\nabla \ell$ and Hessian matrix $\mathbf{H}$ via automatic differentiation, enabling fast and stable optimization through efficient algorithms such as nlminb (default) or alternative methods like BFGS, Nelder-Mead, and L-BFGS-B, all configurable through the gkw_control() function.
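The chain from linear predictor to link function to log-likelihood can be illustrated with a minimal base-R sketch for the two-parameter Kw sub-family, where $\alpha_i$ depends on one covariate through a log link and $\beta$ is constant. This is not the package's TMB code: the sketch relies on `nlminb` with its default numerical derivatives instead of automatic differentiation, and the simulated data, coefficient values, and object names are assumptions made for illustration.

```r
set.seed(123)
n <- 2000
x <- runif(n)
X <- model.matrix(~ x)                    # design matrix for alpha
inv_log <- make.link("log")$linkinv       # inverse of the log link

# Hypothetical "true" coefficients on the link scale.
b_alpha_true <- c(0.5, 0.8)               # alpha_i = exp(0.5 + 0.8 * x_i)
b_beta_true  <- log(3)                    # beta = 3 (intercept only)

# Simulate Kw responses by inverting the CDF F(y) = 1 - (1 - y^alpha)^beta.
alpha_i <- inv_log(drop(X %*% b_alpha_true))
beta_i  <- inv_log(b_beta_true)
u <- runif(n)
y <- (1 - (1 - u)^(1 / beta_i))^(1 / alpha_i)

# Negative log-likelihood as a function of the stacked coefficient vector.
negll <- function(par) {
  a <- inv_log(drop(X %*% par[1:2]))      # observation-specific alpha
  b <- inv_log(par[3])                    # common beta
  -sum(log(a) + log(b) + (a - 1) * log(y) + (b - 1) * log1p(-y^a))
}

# Minimize the negative log-likelihood, starting from alpha = beta = 1.
fit <- nlminb(start = c(0, 0, 0), objective = negll)
est <- fit$par                            # estimates on the link scale
```

With a sample of this size, `est` should lie close to `c(b_alpha_true, b_beta_true)`. Exact derivatives from automatic differentiation replace the finite-difference approximations used here, which matters more as the number of parameters and the complexity of the likelihood grow.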
Model diagnostics in gkwreg are primarily based on randomized quantile residuals, defined as:
$$r_i^Q = \Phi^{-1}\bigl(F(y_i; \hat{\boldsymbol{\theta}}_i)\bigr)$$
where $F(y_i; \hat{\boldsymbol{\theta}}_i)$ is the fitted CDF evaluated at observation $y_i$ with estimated parameters $\hat{\boldsymbol{\theta}}_i$, and $\Phi^{-1}(\cdot)$ is the quantile function of the standard normal distribution. If the model is correctly specified, these residuals should follow a standard normal distribution. The package provides six diagnostic plot types to assess model adequacy: residuals versus indices for detecting autocorrelation, Cook's distance for identifying influential observations, leverage versus fitted values for flagging high-leverage points, residuals versus linear predictors for checking linearity and heteroscedasticity, half-normal plots with simulated envelopes for distributional assessment, and predicted versus observed plots for overall goodness-of-fit evaluation. These diagnostics are accessible through a unified plot() method supporting both base R graphics and ggplot2 visualization systems.
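The residual definition above can be sketched in base R for the Kw sub-family, whose CDF is $F(y) = 1 - (1 - y^\alpha)^\beta$. To keep the example self-contained, the true parameter values stand in for the fitted ones, which is an assumption of the sketch; `pkw_sketch` is a hypothetical helper and not a function exported by the package.

```r
set.seed(42)
n <- 2000
alpha <- 2   # assumed shape parameters, used in place of fitted values
beta  <- 3

# Kumaraswamy CDF (hypothetical helper for this sketch).
pkw_sketch <- function(y, alpha, beta) 1 - (1 - y^alpha)^beta

# Simulate from the Kw distribution by inverting the CDF.
u <- runif(n)
y <- (1 - (1 - u)^(1 / beta))^(1 / alpha)

# Quantile residuals: map the CDF values through the standard normal quantile.
r <- qnorm(pkw_sketch(y, alpha, beta))

# Under a correctly specified model these should look standard normal.
r_mean <- mean(r)
r_sd   <- sd(r)
```

A normal Q-Q plot of `r` (for example, `qqnorm(r); qqline(r)`) gives a visual counterpart to the numerical summaries; systematic departures from the reference line would point to a misspecified distribution or predictor.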