A brief guide to GAReg

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

set.seed(12345)

Overview

GAreg implements genetic algorithm (GA) optimization for three core statistical workflows where discrete model structure must be learned from data:

  1. Best subset variable selection – choose a parsimonious set of predictors under information or validation criteria.

  2. Changepoint detection – estimate the number and locations of structural breaks in regression or time series.

  3. Optimal spline knot selection – place interior knots for splines or piecewise polynomials to balance fit and smoothness.

The package provides a consistent GA interface, unified S4 results (gareg), and penalty-aware objectives. It supports both varying‑knot and fixed‑knot modes, minimum spacing constraints, and unbalanced designs.

library(GAReg)

Best subset variable selection

We use the simulation function below for subset selection illustration. Here, n is the number of observations and p is the number of predictors. For the covairates, s0 represent number of truly active predictors, valued range from 0 to p. magnitudes_range specifies the range of significantly expressed coefficients that corresponding to the truly active predictors. If rho is specified with some values, the autoregressive structure is introduced into the error terms. If rho=NULL, we will have independent and identically distributed (IID) errors. We can also specify sigma for the error standard deviation.

sim_subset_data <- function(n = 60, p = 50, s0 = 25, sigma = 1.5,
                            magnitudes_range = c(0.5, 2),
                            rho = NULL,
                            seed = NULL) {
  stopifnot(n > 0, p > 0, s0 >= 0, s0 <= p, sigma >= 0)
  if (!is.null(seed)) set.seed(seed)

  X <- matrix(rnorm(n * p), n, p)

  # Active set and coefficients
  true_idx <- if (s0 > 0) sort(sample.int(p, s0)) else integer(0)
  signs <- if (s0 > 0) sample(c(-1, 1), s0, replace = TRUE) else numeric(0)
  magnitudes <- if (s0 > 0) runif(s0, magnitudes_range[1], magnitudes_range[2]) else numeric(0)

  beta_true <- numeric(p)
  if (s0 > 0) beta_true[true_idx] <- magnitudes * signs

  if (is.null(rho)) {
    e <- rnorm(n, sd = sigma)
  } else {
    sd_innov <- sigma * sqrt(1 - rho^2)
    burn_in <- 100
    z <- rnorm(n + burn_in, sd = sd_innov)
    e_full <- numeric(n + burn_in)
    for (t in 2:(n + burn_in)) e_full[t] <- rho * e_full[t - 1] + z[t]
    e <- e_full[(burn_in + 1):(burn_in + n)]
  }

  y <- as.numeric(X %*% beta_true + e)

  DF <- data.frame(y = y, as.data.frame(X))
  colnames(DF)[-1] <- paste0("X", seq_len(p))

  list(
    X = X,
    y = y,
    beta_true = beta_true,
    true_idx = true_idx,
    DF = DF,
    rho = if (is.null(rho)) NULL else rho,
    args = list(
      n = n, p = p, s0 = s0, sigma = sigma,
      magnitudes_range = magnitudes_range,
      rho = rho, seed = seed
    )
  )
}
sim <- sim_subset_data(n = 100, p = 50, s0 = 25, sigma = 1.5, rho = NULL, seed = 123)
y <- sim$y
X <- sim$X
ga <- gareg_subset(
  y = y, X = X, gaMethod = "GA", monitor = FALSE,
  gacontrol = list(
    popSize = 120,
    maxiter = 20000,
    run = 4000,
    pmutation = 0.02
  )
)
summary(ga)

res <- FDRCalc(truelabel = sim$true_idx, predlabel = ga@bestsol, N = 50)
# FALSE Discover Rate
res$fdr
# TRUE Positive Rate
res$tpr

Changepoint detection

The multiple changepoint detection can be conducted through changepointGA package (Li and Lu, 2024). The BIC penalized function is provided below for IID data. The related math details can be found in (Li et al., 2026).

BIC.cpt <- function(chromosome, Xt) {
  m <- chromosome[1]
  tau <- chromosome[2:(2 + m - 1)]
  N <- length(Xt)

  if (m == 0) {
    mu.hat <- mean(Xt)
    sigma.hatsq <- sum((Xt - mu.hat)^2) / N
    BIC.obj <- 0.5 * N * log(sigma.hatsq) + 2 * log(N)
  } else {
    tau.ext <- c(1, tau, N + 1)
    seg.len <- diff(tau.ext)
    ff <- rep(0:m, times = seg.len)
    Xseg <- split(Xt, ff)
    mu.seg <- unlist(lapply(Xseg, mean), use.names = F)
    mu.hat <- rep(mu.seg, seg.len)
    sigma.hatsq <- sum((Xt - mu.hat)^2) / N
    BIC.obj <- 0.5 * N * log(sigma.hatsq) + (2 * m + 2) * log(N)
  }
  return(BIC.obj)
}

# IID data
set.seed(1234)
n <- 200
et <- rnorm(n)
Xt <- et + rep(c(0, 2, 0, 2), each = n / 4)

library(changepointGA)
GA.res <- cptga(
  ObjFunc = BIC.cpt, N = n, popSize = 500,
  pcrossover = 0.95, pmutation = 0.3, pchangepoint = 10 / n,
  Xt = Xt
)
summary(GA.res)

Optimal spline knot selection

Example data set

This section will illustrate optimal spline knot placement on the classic motocycle impact dataset from Package \code{MASS} (Venables & Ripley, 2002).\code{MASS::mcycle} contains 133 observations from a simulated motorcycle crash test, recording head acceleration (in g) of a helmeted test subject over time (milliseconds). The series is non-linear with sharp transients and heteroskedastic noise, which makes it a canonical benchmark for smoothing and spline-based regression. We use function \code{gareg_knots} choose interior spline knots subject to a minimum separation in indices to avoid clustering. Here the truncated-power piecewise polynomials with polynomial degree of 3 is the used default.

library(MASS)
library(splines)

data(mcycle)
head(mcycle)

Different GA model set-up

g1 <- gareg_knots(
  y = mcycle$accel, x = mcycle$times,
  minDist = 5,
  gaMethod = "cptga",
  cptgactrl = cptgaControl(popSize = 200, pcrossover = 0.9, pmutation = 0.3),
  ic_method = "BIC"
)
summary(g1)

# knots location
g1@bestsol
g2 <- gareg_knots(
  y = mcycle$accel, x = mcycle$times,
  minDist = 5,
  gaMethod = "cptgaisl",
  cptgactrl = cptgaControl(
    numIslands = 5, popSize = 200, maxMig = 250,
    pcrossover = 0.9, pmutation = 0.3
  ),
  ic_method = "BIC"
)
summary(g2)
g3 <- gareg_knots(
  y = mcycle$accel, x = mcycle$times,
  fixedknots = 3,
  minDist = 5,
  gaMethod = "cptga",
  cptgactrl = cptgaControl(popSize = 200, pcrossover = 0.9, pmutation = 0.3),
  ic_method = "BIC"
)
summary(g3)
g4 <- gareg_knots(
  y = mcycle$accel, x = mcycle$times,
  fixedknots = 4,
  minDist = 5,
  gaMethod = "cptgaisl",
  cptgactrl = cptgaControl(
    numIslands = 5, popSize = 200, maxMig = 250,
    pcrossover = 0.9, pmutation = 0.3
  ),
  ic_method = "BIC"
)
summary(g4)
y <- mcycle$accel
x <- mcycle$times
x_unique <- unique(x)

tBIC.vary.ga <- g1@bestsol
tBIC.vary.gaisl <- g2@bestsol
tBIC.fix.3.ga <- g3@bestsol
tBIC.fix.4.gaisl <- g4@bestsol

bsfit.vary.ga <- lm(y ~ bs(x, knots = x_unique[g1@bestsol], Boundary.knots = range(x)))
bsfit.vary.gaisl <- lm(y ~ bs(x, knots = x_unique[g2@bestsol], Boundary.knots = range(x)))
bsfit.fix.3.ga <- lm(y ~ bs(x, knots = x_unique[g3@bestsol], Boundary.knots = range(x)))
bsfit.fix.4.gaisl <- lm(y ~ bs(x, knots = x_unique[g4@bestsol], Boundary.knots = range(x)))

plot(x, y, xlab = "Time (ms)", ylab = "Acceleration (g)")
ht <- seq(min(x), max(x), length.out = 200)
lines(ht, predict(bsfit.vary.ga, data.frame(x = ht)), col = "blue", lty = 5, lwd = 2)
lines(ht, predict(bsfit.vary.gaisl, data.frame(x = ht)), col = "orange", lty = 4, lwd = 2)
lines(ht, predict(bsfit.fix.3.ga, data.frame(x = ht)), col = "purple", lty = 3, lwd = 2)
lines(ht, predict(bsfit.fix.4.gaisl, data.frame(x = ht)), col = "#D55E00", lty = 2, lwd = 2)
legend("bottomright",
  legend = c(
    "Varying knots GA",
    "Varying knots island model GA",
    "Fixed 3 knots GA",
    "Fixed 4 knots island model GA"
  ),
  lty = 5:2, lwd = 2,
  col = c("blue", "orange", "purple", "#D55E00"), bty = "n"
)

Spline options: piecewise polynomials, natural cubic, and B-splines

This section illustrates how to build spline design matrices via \code{splineX()} for three common options:

We’ll use the motorcycle acceleration data \code{MASS::mcycle}, create interior knots at quantiles of \code{times}, and compare how different spline types/degrees behave. Here, we only illustrate through Varying number and locations of knots set-up (Let GA choose both how many knots and where they go).

g_pp3 <- gareg_knots(
  y = y, x = x,
  minDist = 5,
  gaMethod = "cptga",
  ObjFunc = NULL, # use default varyknotsIC
  type = "ppolys",
  degree = 3, # degree-3 piecewise polynomial
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_pp3)
g_ns <- gareg_knots(
  y = y, x = x,
  minDist = 5,
  gaMethod = "cptga",
  type = "ns", # natural cubic (degree ignored)
  degree = 3, # ignored for "ns"
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_ns)
g_bs1 <- gareg_knots(
  y = y, x = x,
  minDist = 5,
  gaMethod = "cptga",
  type = "bs",
  degree = 1, # linear B-splines
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_bs1)

Reproducibility

sessionInfo()

References

Li, M., & Lu, Q. (2025). changepointGA: An R package for Fast Changepoint Detection via Genetic Algorithm. arXiv preprint arXiv:2410.15571.

Mo Li, QiQi Lu, Robert Lund, & Xueheng Shi. (2026). Genetic Algorithms in Regression. In preparation.

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.



Try the GAReg package in your browser

Any scripts or data that you put into this service are public.

GAReg documentation built on March 29, 2026, 5:08 p.m.