gkwreg: Fit Generalized Kumaraswamy Regression Models

View source: R/gkwreg-core.R

gkwregR Documentation

Fit Generalized Kumaraswamy Regression Models

Description

Fits regression models using the Generalized Kumaraswamy (GKw) family of distributions for modeling response variables strictly bounded in the interval (0, 1). The function provides a unified interface for fitting seven nested submodels of the GKw family, allowing flexible modeling of proportions, rates, and other bounded continuous outcomes through regression on distributional parameters.

Maximum Likelihood Estimation is performed via automatic differentiation using the TMB (Template Model Builder) package, ensuring computational efficiency and numerical accuracy. The interface follows standard R regression modeling conventions similar to lm, glm, and betareg, making it immediately familiar to R users.

Usage

gkwreg(
  formula,
  data,
  family = c("gkw", "bkw", "kkw", "ekw", "mc", "kw", "beta"),
  link = NULL,
  link_scale = NULL,
  subset = NULL,
  weights = NULL,
  offset = NULL,
  na.action = getOption("na.action"),
  contrasts = NULL,
  control = gkw_control(),
  model = TRUE,
  x = FALSE,
  y = TRUE,
  ...
)

Arguments

formula

An object of class Formula (or one that can be coerced to that class). The formula uses extended syntax to specify potentially different linear predictors for each distribution parameter:

y ~ model_alpha | model_beta | model_gamma | model_delta | model_lambda

where:

  • y is the response variable (must be in the open interval (0, 1))

  • model_alpha specifies predictors for the \alpha parameter

  • model_beta specifies predictors for the \beta parameter

  • model_gamma specifies predictors for the \gamma parameter

  • model_delta specifies predictors for the \delta parameter

  • model_lambda specifies predictors for the \lambda parameter

If a part is omitted or specified as ~ 1, an intercept-only model is used for that parameter. Parts corresponding to fixed parameters (determined by family) are automatically ignored. See Details and Examples for proper usage.

data

A data frame containing the variables specified in formula. Standard R subsetting and missing value handling apply.

family

A character string specifying the distribution family from the Generalized Kumaraswamy hierarchy. Must be one of:

"gkw"

Generalized Kumaraswamy (default). Five parameters: \alpha, \beta, \gamma, \delta, \lambda. Most flexible, suitable when data show complex behavior not captured by simpler families.

"bkw"

Beta-Kumaraswamy. Four parameters: \alpha, \beta, \gamma, \delta (fixes \lambda = 1). Combines Beta and Kumaraswamy flexibility.

"kkw"

Kumaraswamy-Kumaraswamy. Four parameters: \alpha, \beta, \delta, \lambda (fixes \gamma = 1). Alternative four-parameter generalization.

"ekw"

Exponentiated Kumaraswamy. Three parameters: \alpha, \beta, \lambda (fixes \gamma = 1, \delta = 0). Adds flexibility to standard Kumaraswamy.

"mc"

McDonald (Beta Power). Three parameters: \gamma, \delta, \lambda (fixes \alpha = 1, \beta = 1). Generalization of Beta distribution.

"kw"

Kumaraswamy. Two parameters: \alpha, \beta (fixes \gamma = 1, \delta = 0, \lambda = 1). Computationally efficient alternative to Beta with closed-form CDF.

"beta"

Beta distribution. Two parameters: \gamma, \delta (fixes \alpha = 1, \beta = 1, \lambda = 1). Standard choice for proportions and rates, corresponds to shape1 = \gamma, shape2 = \delta.

See Details for guidance on family selection.

link

Link function(s) for the distributional parameters. Can be specified as:

  • Single character string: Same link for all relevant parameters. Example: link = "log" applies log link to all parameters.

  • Named list: Parameter-specific links for fine control. Example: link = list(alpha = "log", beta = "log", delta = "logit")

Default links (used if link = NULL):

  • "log" for \alpha, \beta, \gamma, \lambda (positive parameters)

  • "logit" for \delta (parameter in (0, 1))

Available link functions:

"log"

Logarithmic link. Maps (0, \infty) \to (-\infty, \infty). Ensures positivity. Most common for shape parameters.

"logit"

Logistic link. Maps (0, 1) \to (-\infty, \infty). Standard for probability-type parameters like \delta.

"probit"

Probit link using normal CDF. Maps (0, 1) \to (-\infty, \infty). Alternative to logit, symmetric tails.

"cloglog"

Complementary log-log. Maps (0, 1) \to (-\infty, \infty). Asymmetric, useful for skewed probabilities.

"cauchy"

Cauchy link using Cauchy CDF. Maps (0, 1) \to (-\infty, \infty). Heavy-tailed alternative to probit.

"identity"

Identity link (no transformation). Use with caution; does not guarantee parameter constraints.

"sqrt"

Square root link. Maps x \to \sqrt{x}. Variance-stabilizing for some contexts.

"inverse"

Inverse link. Maps x \to 1/x. Useful for rate-type parameters.

"inverse-square"

Inverse squared link. Maps x \to 1/x^2.

link_scale

Numeric scale factor(s) controlling the transformation intensity of link functions. Can be:

  • Single numeric: Same scale for all parameters.

  • Named list: Parameter-specific scales for fine-tuning. Example: link_scale = list(alpha = 10, beta = 10, delta = 1)

Default scales (used if link_scale = NULL):

  • 10 for \alpha, \beta, \gamma, \lambda

  • 1 for \delta

Larger values produce more gradual transformations; smaller values produce more extreme transformations. For probability-type links (logit, probit), smaller scales (e.g., 0.5-2) create steeper response curves, while larger scales (e.g., 5-20) create gentler curves. Adjust if convergence issues arise or if you need different response sensitivities.

subset

Optional vector specifying a subset of observations to be used in fitting. Can be a logical vector, integer indices, or expression evaluating to one of these. Standard R subsetting rules apply.

weights

Optional numeric vector of prior weights (e.g., frequency weights) for observations. Should be non-negative. Currently experimental; use with caution and validate results.

offset

Optional numeric vector or matrix specifying an a priori known component to be included in the linear predictor(s). If a vector, it is applied to the first parameter's predictor. If a matrix, columns correspond to parameters in order (\alpha, \beta, \gamma, \delta, \lambda). Offsets are added to the linear predictor before applying the link function.

na.action

A function specifying how to handle missing values (NAs). Options include:

na.fail

Stop with error if NAs present (default via getOption("na.action"))

na.omit

Remove observations with NAs

na.exclude

Like na.omit but preserves original length in residuals/fitted values

See na.action for details.

contrasts

Optional list specifying contrasts for factor variables in the model. Format: named list where names are factor variable names and values are contrast specifications. See contrasts and the contrasts.arg argument of model.matrix.

control

A list of control parameters from gkw_control specifying technical details of the fitting process. This includes:

  • Optimization algorithm (method)

  • Starting values (start)

  • Fixed parameters (fixed)

  • Convergence tolerances (maxit, reltol, abstol)

  • Hessian computation (hessian)

  • Verbosity (silent, trace)

Default is gkw_control() which uses sensible defaults for most problems. See gkw_control for complete documentation of all options. Most users never need to modify control parameters.

model

Logical. If TRUE (default), the model frame (data frame containing all variables used in fitting) is returned as component model of the result. Useful for prediction and diagnostics. Set to FALSE to reduce object size.

x

Logical. If TRUE, the list of model matrices (one for each modeled parameter) is returned as component x. Default FALSE. Set to TRUE if you need direct access to design matrices for custom calculations.

y

Logical. If TRUE (default), the response vector (after processing by na.action and subset) is returned as component y. Useful for residual calculations and diagnostics.

...

Additional arguments. Currently used only for backward compatibility with deprecated arguments from earlier versions. Using deprecated arguments triggers informative warnings with migration guidance. Examples of deprecated arguments: plot, conf.level, method, start, fixed, hessian, silent, optimizer.control. These should now be passed via the control argument.

Details

Distribution Family Selection

The Generalized Kumaraswamy family provides a flexible hierarchy for modeling bounded responses. Selection should be guided by:

1. Start Simple: Begin with two-parameter families ("kw" or "beta") unless you have strong reasons to use more complex models.

2. Model Comparison: Use information criteria (AIC, BIC) and likelihood ratio tests to compare nested models:

  # Fit sequence of nested models
  fit_kw   <- gkwreg(y ~ x, data, family = "kw")
  fit_ekw  <- gkwreg(y ~ x, data, family = "ekw")
  fit_gkw  <- gkwreg(y ~ x, data, family = "gkw")

  # Compare via AIC
  AIC(fit_kw, fit_ekw, fit_gkw)

  # Formal test (nested models only)
  anova(fit_kw, fit_ekw, fit_gkw)

3. Family Characteristics:

  • Beta: Traditional choice, well-understood, good for symmetric or moderately skewed data

  • Kumaraswamy (kw): Computationally efficient alternative to Beta, closed-form CDF, similar flexibility

  • Exponentiated Kumaraswamy (ekw): Adds flexibility for extreme values and heavy tails

  • Beta-Kumaraswamy (bkw), Kumaraswamy-Kumaraswamy (kkw): Four-parameter alternatives when three parameters insufficient

  • McDonald (mc): Beta generalization via power parameter, useful for J-shaped distributions

  • Kumaraswamy-Kumaraswamy (kkw): Most flexible, use only when simpler families inadequate. It extends kw

  • Generalized Kumaraswamy (gkw): Most flexible, use only when simpler families inadequate

4. Avoid Overfitting: More different parameters better model. Use cross-validation or hold-out validation to assess predictive performance.

Formula Specification

The extended formula syntax allows different predictors for each parameter:

Basic Examples:

  # Same predictors for both parameters (two-parameter family)
  y ~ x1 + x2
  # Equivalent to: y ~ x1 + x2 | x1 + x2

  # Different predictors per parameter
  y ~ x1 + x2 | x3 + x4
  # alpha depends on x1, x2
  # beta depends on x3, x4

  # Intercept-only for some parameters
  y ~ x1 | 1
  # alpha depends on x1
  # beta has only intercept

  # Complex specification (five-parameter family)
  y ~ x1 | x2 | x3 | x4 | x5
  # alpha ~ x1, beta ~ x2, gamma ~ x3, delta ~ x4, lambda ~ x5

Important Notes:

  • Formula parts correspond to parameters in order: \alpha, \beta, \gamma, \delta, \lambda

  • Unused parts (due to family constraints) are automatically ignored

  • Use . to include all predictors: y ~ . | .

  • Standard R formula features work: interactions (x1:x2), polynomials (poly(x, 2)), transformations (log(x)), etc.

Link Functions and Scales

Link functions map the range of distributional parameters to the real line, ensuring parameter constraints are satisfied during optimization.

Choosing Links:

  • Defaults are usually best: The automatic choices (log for shape parameters, logit for delta) work well in most cases

  • Alternative links: Consider if you have theoretical reasons (e.g., probit for latent variable interpretation) or convergence issues

  • Identity link: Avoid unless you have constraints elsewhere; can lead to invalid parameter values during optimization

Link Scales: The link_scale parameter controls transformation intensity. Think of it as a "sensitivity" parameter:

  • Larger values (e.g., 20): Gentler response to predictor changes

  • Smaller values (e.g., 2): Steeper response to predictor changes

  • Default (10): Balanced, works well for most cases

Adjust only if:

  • Convergence difficulties arise

  • You need very steep or very gentle response curves

  • Predictors have unusual scales (very large or very small)

Optimization and Convergence

The default optimizer (method = "nlminb") works well for most problems. If convergence issues occur:

1. Check Data:

  • Ensure response is strictly in (0, 1)

  • Check for extreme outliers or influential points

  • Verify predictors aren't perfectly collinear

  • Consider rescaling predictors to similar ranges

2. Try Alternative Optimizers:

  # BFGS often more robust for difficult problems
  fit <- gkwreg(y ~ x, data,
                control = gkw_control(method = "BFGS"))

  # Nelder-Mead for non-smooth objectives
  fit <- gkwreg(y ~ x, data,
                control = gkw_control(method = "Nelder-Mead"))

3. Adjust Tolerances:

  # Increase iterations and loosen tolerance
  fit <- gkwreg(y ~ x, data,
                control = gkw_control(maxit = 1000, reltol = 1e-6))

4. Provide Starting Values:

  # Fit simpler model first, use as starting values
  fit_simple <- gkwreg(y ~ 1, data, family = "kw")
  start_vals <- list(
    alpha = c(coef(fit_simple)[1], rep(0, ncol(X_alpha) - 1)),
    beta  = c(coef(fit_simple)[2], rep(0, ncol(X_beta) - 1))
  )
  fit_complex <- gkwreg(y ~ x1 + x2 | x3 + x4, data, family = "kw",
                         control = gkw_control(start = start_vals))

5. Simplify Model:

  • Use simpler family (e.g., "kw" instead of "gkw")

  • Reduce number of predictors

  • Use intercept-only for some parameters

Standard Errors and Inference

By default, standard errors are computed via the Hessian matrix at the MLE. This provides valid asymptotic standard errors under standard regularity conditions.

When Standard Errors May Be Unreliable:

  • Small sample sizes (n < 30-50 per parameter)

  • Parameters near boundaries

  • Highly collinear predictors

  • Mis-specified models

Alternatives:

  • Bootstrap confidence intervals (more robust, computationally expensive)

  • Profile likelihood intervals via confint(..., type = "profile") (not yet implemented)

  • Cross-validation for predictive performance assessment

To skip Hessian computation (faster, no SEs):

  fit <- gkwreg(y ~ x, data,
                control = gkw_control(hessian = FALSE))

Model Diagnostics

Always check model adequacy using diagnostic plots:

  fit <- gkwreg(y ~ x, data, family = "kw")
  plot(fit)  # Six diagnostic plots

Key diagnostics:

  • Residual plots: Check for patterns, heteroscedasticity

  • Half-normal plot: Assess distributional adequacy

  • Cook's distance: Identify influential observations

  • Predicted vs observed: Overall fit quality

See plot.gkwreg for detailed interpretation guidance.

Computational Considerations

Performance Tips:

  • GKw family: Most computationally expensive (~2-5x slower than kw/beta)

  • Beta/Kw families: Fastest, use when adequate

  • Large datasets (n > 10,000): Consider sampling for exploratory analysis

  • TMB uses automatic differentiation: Fast gradient/Hessian computation

  • Disable Hessian (hessian = FALSE) for faster fitting without SEs

Memory Usage:

  • Set model = FALSE, x = FALSE, y = FALSE to reduce object size (but limits some post-fitting capabilities)

  • Hessian matrix scales as O(p²) where p = number of parameters

Value

An object of class "gkwreg", which is a list containing the following components. Standard S3 methods are available for this class (see Methods section).

Model Specification:

call

The matched function call

formula

The Formula object used

family

Character string: distribution family used

link

Named list: link functions for each parameter

link_scale

Named list: link scale values for each parameter

param_names

Character vector: names of parameters for this family

fixed_params

Named list: parameters fixed by family definition

control

The gkw_control object used for fitting

Parameter Estimates:

coefficients

Named numeric vector: estimated regression coefficients (on link scale). Names follow the pattern "parameter:predictor", e.g., "alpha:(Intercept)", "alpha:x1", "beta:(Intercept)", "beta:x2".

fitted_parameters

Named list: mean values for each distribution parameter (\alpha, \beta, \gamma, \delta, \lambda) averaged across all observations

parameter_vectors

Named list: observation-specific parameter values. Contains vectors alphaVec, betaVec, gammaVec, deltaVec, lambdaVec, each of length nobs

Fitted Values and Residuals:

fitted.values

Numeric vector: fitted mean values E[Y|X] for each observation

residuals

Numeric vector: response residuals (observed - fitted) for each observation

Inference:

vcov

Variance-covariance matrix of coefficient estimates. Only present if control$hessian = TRUE. NULL otherwise.

se

Numeric vector: standard errors of coefficients. Only present if control$hessian = TRUE. NULL otherwise.

Model Fit Statistics:

loglik

Numeric: maximized log-likelihood value

aic

Numeric: Akaike Information Criterion (AIC = -2loglik + 2npar)

bic

Numeric: Bayesian Information Criterion (BIC = -2*loglik + log(nobs)*npar)

deviance

Numeric: deviance (-2 * loglik)

df.residual

Integer: residual degrees of freedom (nobs - npar)

nobs

Integer: number of observations used in fit

npar

Integer: total number of estimated parameters

Diagnostic Statistics:

rmse

Numeric: Root Mean Squared Error of response residuals

efron_r2

Numeric: Efron's pseudo R-squared (1 - SSE/SST, where SSE = sum of squared errors, SST = total sum of squares)

mean_absolute_error

Numeric: Mean Absolute Error of response residuals

Optimization Details:

convergence

Logical: TRUE if optimizer converged successfully, FALSE otherwise

message

Character: convergence message from optimizer

iterations

Integer: number of iterations used by optimizer

method

Character: optimization method used (e.g., "nlminb", "BFGS")

Optional Components (returned if requested via model, x, y):

model

Data frame: the model frame (if model = TRUE)

x

Named list: model matrices for each parameter (if x = TRUE)

y

Numeric vector: the response variable (if y = TRUE)

Internal:

tmb_object

The raw object returned by MakeADFun. Contains the TMB automatic differentiation function and environment. Primarily for internal use and advanced debugging.

Methods

The following S3 methods are available for objects of class "gkwreg":

Basic Methods:

  • print.gkwreg: Print basic model information

  • summary.gkwreg: Detailed model summary with coefficient tables, tests, and fit statistics

  • coef.gkwreg: Extract coefficients

  • vcov.gkwreg: Extract variance-covariance matrix

  • logLik: Extract log-likelihood

  • AIC, BIC: Information criteria

Prediction and Fitted Values:

  • fitted.gkwreg: Extract fitted values

  • residuals.gkwreg: Extract residuals (multiple types available)

  • predict.gkwreg: Predict on new data

Inference:

  • confint.gkwreg: Confidence intervals for parameters

  • anova.gkwreg: Compare nested models via likelihood ratio tests

Diagnostics:

  • plot.gkwreg: Comprehensive diagnostic plots (6 types)

Author(s)

Lopes, J. E.

Maintainer: Lopes, J. E.

References

Generalized Kumaraswamy Distribution:

Cordeiro, G. M., & de Castro, M. (2011). A new family of generalized distributions. Journal of Statistical Computation and Simulation, 81(7), 883-898. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/00949650903530745")}

Kumaraswamy Distribution:

Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2), 79-88. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/0022-1694(80)90036-0")}

Jones, M. C. (2009). Kumaraswamy's distribution: A beta-type distribution with some tractability advantages. Statistical Methodology, 6(1), 70-81. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.stamet.2008.04.001")}

Beta Regression:

Ferrari, S. L. P., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799-815. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/0266476042000214501")}

Smithson, M., & Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54-71. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1037/1082-989X.11.1.54")}

Template Model Builder (TMB):

Kristensen, K., Nielsen, A., Berg, C. W., Skaug, H., & Bell, B. M. (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v070.i05")}

Related Software:

Zeileis, A., & Croissant, Y. (2010). Extended Model Formulas in R: Multiple Parts and Multiple Responses. Journal of Statistical Software, 34(1), 1-13. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v034.i01")}

See Also

Control and Inference: gkw_control for fitting control parameters, confint.gkwreg for confidence intervals, anova.gkwreg for model comparison

Methods: summary.gkwreg, plot.gkwreg, coef.gkwreg, vcov.gkwreg, fitted.gkwreg, residuals.gkwreg, predict.gkwreg

Distributions: dgkw, pgkw, qgkw, rgkw for the GKw distribution family functions

Related Packages: betareg for traditional beta regression, Formula for extended formula interface, MakeADFun for TMB functionality

Examples


# SECTION 1: Basic Usage - Getting Started
# Load packages and data
library(gkwreg)
library(gkwdist)
data(GasolineYield)

# Example 1.1: Simplest possible model (intercept-only, all defaults)
fit_basic <- gkwreg(yield ~ 1, data = GasolineYield, family = "kw")
summary(fit_basic)

# Example 1.2: Model with predictors (uses all defaults)
# Default: family = "gkw", method = "nlminb", hessian = TRUE
fit_default <- gkwreg(yield ~ batch + temp, data = GasolineYield)
summary(fit_default)

# Example 1.3: Kumaraswamy model (two-parameter family)
# Default link functions: log for both alpha and beta
fit_kw <- gkwreg(yield ~ batch + temp, data = GasolineYield, family = "kw")
summary(fit_kw)

par(mfrow = c(3, 2))
plot(fit_kw, ask = FALSE)

# Example 1.4: Beta model for comparison
# Default links: log for gamma and delta
fit_beta <- gkwreg(yield ~ batch + temp, data = GasolineYield, family = "beta")

# Compare models using AIC/BIC
AIC(fit_kw, fit_beta)
BIC(fit_kw, fit_beta)

# SECTION 2: Using gkw_control() for Customization

# Example 2.1: Change optimization method to BFGS
fit_bfgs <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  control = gkw_control(method = "BFGS")
)
summary(fit_bfgs)

# Example 2.2: Increase iterations and enable verbose output
fit_verbose <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  control = gkw_control(
    method = "nlminb",
    maxit = 1000,
    silent = FALSE, # Show optimization progress
    trace = 1 # Print iteration details
  )
)

# Example 2.3: Fast fitting without standard errors
# Useful for model exploration or large datasets
fit_fast <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  control = gkw_control(hessian = FALSE)
)
# Note: Cannot compute confint() without hessian
coef(fit_fast) # Point estimates still available

# Example 2.4: Custom convergence tolerances
fit_tight <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  control = gkw_control(
    reltol = 1e-10, # Tighter convergence
    maxit = 2000 # More iterations allowed
  )
)

# SECTION 3: Advanced Formula Specifications

# Example 3.1: Different predictors for different parameters
# alpha depends on batch, beta depends on temp
fit_diff <- gkwreg(
  yield ~ batch | temp,
  data = GasolineYield,
  family = "kw"
)
summary(fit_diff)

# Example 3.2: Intercept-only for one parameter
# alpha varies with predictors, beta is constant
fit_partial <- gkwreg(
  yield ~ batch + temp | 1,
  data = GasolineYield,
  family = "kw"
)

# Example 3.3: Complex model with interactions
fit_interact <- gkwreg(
  yield ~ batch * temp | temp + I(temp^2),
  data = GasolineYield,
  family = "kw"
)

# SECTION 4: Working with Different Families

# Example 4.1: Fit multiple families and compare
families <- c("beta", "kw", "ekw", "bkw", "gkw")
fits <- lapply(families, function(fam) {
  gkwreg(yield ~ batch + temp, data = GasolineYield, family = fam)
})
names(fits) <- families

# Compare via information criteria
comparison <- data.frame(
  Family = families,
  LogLik = sapply(fits, logLik),
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC),
  npar = sapply(fits, function(x) x$npar)
)
print(comparison)

# Example 4.2: Formal nested model testing
fit_kw <- gkwreg(yield ~ batch + temp, GasolineYield, family = "kw")
fit_ekw <- gkwreg(yield ~ batch + temp, GasolineYield, family = "ekw")
fit_gkw <- gkwreg(yield ~ batch + temp, GasolineYield, family = "gkw")
anova(fit_kw, fit_ekw, fit_gkw)

# SECTION 5: Link Functions and Scales

# Example 5.1: Custom link functions
fit_links <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  link = list(alpha = "sqrt", beta = "log")
)

# Example 5.2: Custom link scales
# Smaller scale = steeper response curve
fit_scale <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  link_scale = list(alpha = 5, beta = 15)
)

# Example 5.3: Uniform link for all parameters
fit_uniform <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  link = "log" # Single string applied to all
)

# SECTION 6: Prediction and Inference

# Fit model for prediction examples
fit <- gkwreg(yield ~ batch + temp, GasolineYield, family = "kw")

# Example 6.1: Confidence intervals at different levels
confint(fit, level = 0.95) # 95% CI
confint(fit, level = 0.90) # 90% CI
confint(fit, level = 0.99) # 99% CI

# SECTION 7: Diagnostic Plots and Model Checking

fit <- gkwreg(yield ~ batch + temp, GasolineYield, family = "kw")

# Example 7.1: All diagnostic plots (default)
par(mfrow = c(3, 2))
plot(fit, ask = FALSE)

# Example 7.2: Select specific plots
par(mfrow = c(3, 1))
plot(fit, which = c(2, 4, 5)) # Cook's distance, Residuals, Half-normal

# Example 7.3: Using ggplot2 for modern graphics
plot(fit, use_ggplot = TRUE, arrange_plots = TRUE)

# Example 7.4: Customized half-normal plot
par(mfrow = c(1, 1))
plot(fit,
  which = 5,
  type = "quantile",
  nsim = 200, # More simulations for smoother envelope
  level = 0.95
) # 95% confidence envelope

# Example 7.5: Extract diagnostic data programmatically
diagnostics <- plot(fit, save_diagnostics = TRUE)
head(diagnostics$data) # Residuals, Cook's distance, etc.

# SECTION 8: Real Data Example - Food Expenditure

# Load and prepare data
data(FoodExpenditure, package = "betareg")
food_data <- FoodExpenditure
food_data$prop <- food_data$food / food_data$income

# Example 8.1: Basic model
fit_food <- gkwreg(
  prop ~ persons | income,
  data = food_data,
  family = "kw"
)
summary(fit_food)

# Example 8.2: Compare with Beta regression
fit_food_beta <- gkwreg(
  prop ~ persons | income,
  data = food_data,
  family = "beta"
)

# Which fits better?
AIC(fit_food, fit_food_beta)

# Example 8.3: Model diagnostics
par(mfrow = c(3, 1))
plot(fit_food, which = c(2, 5, 6))

# Example 8.4: Interpretation via effects
# How does proportion spent on food change with income?
income_seq <- seq(min(food_data$income), max(food_data$income), length = 50)
pred_data <- data.frame(
  persons = median(food_data$persons),
  income = income_seq
)
pred_food <- predict(fit_food, newdata = pred_data, type = "response")

par(mfrow = c(1, 1))
plot(food_data$income, food_data$prop,
  xlab = "Income", ylab = "Proportion Spent on Food",
  main = "Food Expenditure Pattern"
)
lines(income_seq, pred_food, col = "red", lwd = 2)

# SECTION 9: Simulation Studies

# Example 9.1: Simple Kumaraswamy simulation
set.seed(123)
n <- 500
x1 <- runif(n, -2, 2)
x2 <- rnorm(n)

# True model: log(alpha) = 0.8 + 0.3*x1, log(beta) = 1.2 - 0.2*x2
eta_alpha <- 0.8 + 0.3 * x1
eta_beta <- 1.2 - 0.2 * x2
alpha_true <- exp(eta_alpha)
beta_true <- exp(eta_beta)

# Generate response
y <- rkw(n, alpha = alpha_true, beta = beta_true)
sim_data <- data.frame(y = y, x1 = x1, x2 = x2)

# Fit and check parameter recovery
fit_sim <- gkwreg(y ~ x1 | x2, data = sim_data, family = "kw")

# Compare estimated vs true coefficients
cbind(
  True = c(0.8, 0.3, 1.2, -0.2),
  Estimated = coef(fit_sim),
  SE = fit_sim$se
)

# Example 9.2: Complex simulation with all five parameters
set.seed(2203)
n <- 2000
x <- runif(n, -1, 1)

# True parameters
alpha <- exp(0.5 + 0.3 * x)
beta <- exp(1.0 - 0.2 * x)
gamma <- exp(0.7 + 0.4 * x)
delta <- plogis(0.0 + 0.5 * x) # logit scale
lambda <- exp(-0.3 + 0.2 * x)

# Generate from GKw
y <- rgkw(n,
  alpha = alpha, beta = beta, gamma = gamma,
  delta = delta, lambda = lambda
)
sim_data2 <- data.frame(y = y, x = x)

# Fit GKw model
fit_gkw <- gkwreg(
  y ~ x | x | x | x | x,
  data = sim_data2,
  family = "gkw",
  control = gkw_control(method = "L-BFGS-B", maxit = 2000)
)
summary(fit_gkw)

# SECTION 10: Handling Convergence Issues

# Example 10.1: Try different optimizers
methods <- c("nlminb", "BFGS", "Nelder-Mead", "CG")
fits_methods <- lapply(methods, function(m) {
  tryCatch(
    gkwreg(yield ~ batch + temp, GasolineYield,
      family = "kw",
      control = gkw_control(method = m, silent = TRUE)
    ),
    error = function(e) NULL
  )
})
names(fits_methods) <- methods

# Check which converged
converged <- sapply(fits_methods, function(f) {
  if (is.null(f)) {
    return(FALSE)
  }
  f$convergence
})
print(converged)

# Example 10.2: Verbose mode for debugging
fit_debug <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  control = gkw_control(
    method = "BFGS",
    silent = TRUE,
    trace = 0, # 2, Maximum verbosity
    maxit = 1000
  )
)

# SECTION 11: Memory and Performance Optimization

# Example 11.1: Minimal object for large datasets
fit_minimal <- gkwreg(
  yield ~ batch + temp,
  data = GasolineYield,
  family = "kw",
  model = FALSE, # Don't store model frame
  x = FALSE, # Don't store design matrices
  y = FALSE, # Don't store response
  control = gkw_control(hessian = FALSE) # Skip Hessian
)

# Much smaller object
object.size(fit_minimal)

# Trade-off: Limited post-fitting capabilities
# Can still use: coef(), logLik(), AIC(), BIC()
# Cannot use: predict(), some diagnostics

# Example 11.2: Fast exploratory analysis
# Fit many models quickly without standard errors
formulas <- list(
  yield ~ batch,
  yield ~ temp,
  yield ~ batch + temp,
  yield ~ batch * temp
)

fast_fits <- lapply(formulas, function(f) {
  gkwreg(f, GasolineYield,
    family = "kw",
    control = gkw_control(hessian = FALSE),
    model = FALSE, x = FALSE, y = FALSE
  )
})

# Compare models via AIC
sapply(fast_fits, AIC)

# Refit best model with full inference
best_formula <- formulas[[which.min(sapply(fast_fits, AIC))]]
fit_final <- gkwreg(best_formula, GasolineYield, family = "kw")
summary(fit_final)

# SECTION 12: Model Selection and Comparison

# Example 12.1: Nested model testing
fit1 <- gkwreg(yield ~ 1, GasolineYield, family = "kw")
fit2 <- gkwreg(yield ~ batch, GasolineYield, family = "kw")
fit3 <- gkwreg(yield ~ batch + temp, GasolineYield, family = "kw")

# Likelihood ratio tests
anova(fit1, fit2, fit3)

# Example 12.2: Information criteria table
models <- list(
  "Intercept only" = fit1,
  "Batch effect" = fit2,
  "Batch + Temp" = fit3
)

ic_table <- data.frame(
  Model = names(models),
  df = sapply(models, function(m) m$npar),
  LogLik = sapply(models, logLik),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC),
  Delta_AIC = sapply(models, AIC) - min(sapply(models, AIC))
)
print(ic_table)

# Example 12.3: Cross-validation for predictive performance
# 5-fold cross-validation
set.seed(2203)
n <- nrow(GasolineYield)
folds <- sample(rep(1:5, length.out = n))

cv_rmse <- sapply(1:5, function(fold) {
  train <- GasolineYield[folds != fold, ]
  test <- GasolineYield[folds == fold, ]

  fit_train <- gkwreg(yield ~ batch + temp, train,
    family = "kw"
  )
  pred_test <- predict(fit_train, newdata = test, type = "response")

  sqrt(mean((test$yield - pred_test)^2))
})

cat("Cross-validated RMSE:", mean(cv_rmse), "\n")


gkwreg documentation built on Nov. 27, 2025, 5:06 p.m.