R/gamlps.R


#' Bayesian generalized additive modeling with Laplace-P-splines.
#'
#' @description {Fits a generalized additive model (GAM) to data using an
#' approximate Bayesian inference technique based on penalized regression
#' splines and Laplace approximations. Smooth additive terms are specified as a
#' linear combination of a large number of cubic B-splines. To counterbalance
#' the roughness of the fit, a discrete penalty on neighboring spline
#' coefficients is imposed in the spirit of Eilers and Marx (1996). The
#' effective degrees of freedom of the smooth terms are also estimated.
#'
#' The optimal amount of smoothing is determined by a grid-based exploration of
#' the posterior penalty space when the number of smooth terms is small to
#' moderate. When the dimension of the penalty space is large, the optimal
#' smoothing parameter is chosen to be the value that maximizes the
#' (log-)posterior of the penalty vector. Approximate Bayesian credible
#' intervals for latent model variables and functions of latent model variables
#' are available.
#' }
#'
#' @usage gamlps(formula, data, K = 30, family = c("gaussian", "poisson", "bernoulli", "binomial"),
#'        gauss.scale, nbinom, penorder = 2, cred.int = 0.95)
#'
#' @param formula A formula object where the ~ operator separates the response
#'   from the covariates of the linear part \code{z1,z2,..} and the smooth
#'   terms. A smooth term is specified by using the notation \code{sm(.)}.
#'   For instance, the formula \code{y ~ z1+sm(x1)+sm(x2)} specifies a
#'   generalized additive model of the form
#'   \emph{g(mu) = b0+b1z1+f1(x1)+f2(x2)}, where \emph{b0, b1} are the
#'   regression coefficients of the linear part and \emph{f1(.)} and
#'   \emph{f2(.)} are smooth functions of the continuous covariates \emph{x1}
#'   and \emph{x2} respectively. The function \emph{g(.)} is the canonical
#'   link function.
#' @param data Optional. A data frame containing the variables named in the
#'   model formula.
#' @param K A positive integer specifying the number of cubic B-spline
#'   functions in the basis used to model the smooth terms. Default is
#'   \code{K = 30} and allowed values are \code{15 <= K <= 60}. The same basis
#'   dimension is used for each smooth term in the model. Note that the
#'   computational cost of fitting the model increases with \code{K}.
#' @param family The error distribution of the model. It is a character string
#'   that must partially match either \code{"gaussian"} for Normal data with
#'   an identity link, \code{"poisson"} for Poisson data with a log link,
#'   \code{"bernoulli"} or \code{"binomial"} for Bernoulli or Binomial data
#'   with a logit link.
#' @param gauss.scale The scale parameter to be specified when
#'   \code{family = "gaussian"}. It corresponds to the variance of the response.
#' @param nbinom The number of trials to be specified when
#'   \code{family = "binomial"}.
#' @param penorder The penalty order used on finite differences of the
#'  coefficients of contiguous B-splines. Can be either 2 for a second-order
#'  penalty (the default) or 3 for a third-order penalty.
#' @param cred.int The level of the pointwise credible interval to be computed
#'  for the coefficients in the linear part of the model.
#'
#' @details {The B-spline basis used to approximate a smooth additive component
#'  is computed with the function \code{\link{cubicbs}}. The lower (upper)
#'  bound of the B-spline basis is taken to be the minimum (maximum) value of
#'  the covariate associated with the smooth term. For identifiability
#'  purposes, the B-spline matrices (computed over the observed covariates)
#'  are centered. The centering consists in subtracting from each column of
#'  the B-spline matrix the corresponding column average of another B-spline
#'  matrix computed on a fine grid of equidistant values over the domain of
#'  the smooth term.
#'
#'  A hierarchical Gamma prior is imposed on the roughness penalty vector and
#'  a Newton-Raphson algorithm is used to compute the posterior mode of the
#'  log-penalty vector. The algorithm uses analytically derived versions of
#'  the gradient and Hessian. When the number of smooth terms in the model is
#'  smaller than or equal to 4, a grid-based strategy is used for posterior
#'  exploration of the penalty space. Above that threshold, the optimal amount
#'  of smoothness is determined by the maximum a posteriori estimate of the
#'  penalty vector. This strategy keeps the computational burden of model
#'  fitting relatively low while preserving good statistical performance.}
#'
#' @return An object of class \code{gamlps} containing several components from
#'   the fit. Details can be found in \code{\link{gamlps.object}}. Details on
#'   the output printed by \code{gamlps} can be found in
#'   \code{\link{print.gamlps}}. Fitted smooth terms can be visualized with the
#'   \code{\link{plot.gamlps}} routine.
#'
#' @author Oswaldo Gressani \email{oswaldo_gressani@hotmail.fr}.
#'
#' @seealso \code{\link{cubicbs}}, \code{\link{gamlps.object}},
#'  \code{\link{print.gamlps}}, \code{\link{plot.gamlps}}
#'
#' @examples
#'
#' set.seed(14)
#' sim <- simgamdata(n = 300, setting = 2, dist = "binomial", scale = 0.25)
#' dat <- sim$data
#' fit <- gamlps(y ~ z1 + z2 + sm(x1) + sm(x2), K = 15, data = dat,
#'               penorder = 2, family = "binomial", nbinom = 15)
#' fit
#'
#' # Check fit and compare with target (in red)
#' opar <- par(no.readonly = TRUE)
#' par(mfrow = c(1, 2))
#' domx <- seq(-1, 1, length = 200)
#' plot(fit, smoo.index = 1, cred.int = 0.95, ylim = c(-2, 2))
#' lines(domx, sim$f[[1]](domx), type = "l", lty = 2, lwd = 2, col = "red")
#' plot(fit, smoo.index = 2, cred.int = 0.95, ylim = c(-3, 3))
#' lines(domx, sim$f[[2]](domx), type = "l", lty = 2, lwd = 2, col = "red")
#' par(opar)
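#'
#' # A minimal sketch of the centered B-spline basis built internally for a
#' # smooth term (see Details); it mirrors the construction used by gamlps
#' # and relies only on the cubicbs() routine of the package.
#' x1 <- dat$x1
#' Bfine <- cubicbs(seq(min(x1), max(x1), length = 1000), lower = min(x1),
#'                  upper = max(x1), K = 15)$Bmatrix
#' B1 <- cubicbs(x1, lower = min(x1), upper = max(x1), K = 15)$Bmatrix
#' B1.centered <- B1 - matrix(colMeans(Bfine), nrow = nrow(B1), ncol = 15,
#'                            byrow = TRUE)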
#'
#' @references Hastie, T.J. and Tibshirani, R.J. (1990). Generalized additive
#'   models. \emph{Monographs on Statistics and Applied Probability}, 43, 335.
#' @references Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with
#'  B-splines and penalties. \emph{Statistical Science}, \strong{11}(2): 89-121.
#' @references Gressani, O. and Lambert, P. (2018). Fast Bayesian inference
#'  using Laplace approximations in a flexible promotion time cure model based
#'  on P-splines. \emph{Computational Statistics & Data Analysis}, \strong{124}:
#'  151-167.
#'
#' @export

gamlps <- function(formula, data, K = 30, family = c("gaussian", "poisson",
                    "bernoulli", "binomial"), gauss.scale, nbinom,
                    penorder = 2, cred.int = 0.95){

  if (!inherits(formula, "formula"))
    stop("Incorrect model formula")
  if (missing(data)) {
    mf <- stats::model.frame(formula) # Extract model frame from formula
    X  <- stats::model.matrix(mf)     # Full design matrix
  } else {
    mf <- stats::model.frame(formula, data = data)
    X  <- stats::model.matrix(mf, data = data)
  }
  colXnames <- colnames(X)
  smterms <- grepl("sm(", colnames(X), fixed = TRUE) # Flag smooth term columns
  # Reorder so that the linear part comes first, followed by the smooth
  # terms, and keep the column names consistent with the reordering
  X <- cbind(X[, !smterms], X[, smterms])
  colnames(X) <- c(colXnames[!smterms], colXnames[smterms])
  if(any(is.infinite(X)))
    stop("Covariates contain Inf values")
  q  <- sum(smterms) # Number of smooth terms in model
  if(q == 0)
    stop("Model does not contain any smooth terms")
  p  <- ncol(X) - q # Number of regression coefficients in linear part
  n  <- nrow(X)     # Sample size
  Z  <- scale(X[, 1:p], center = TRUE, scale = FALSE)  # Centered Z matrix
  Z[, 1] <- rep(1, n) # Column for intercept
  if(ncol(Z)==1) colnames(Z) <- "(Intercept)"
  y  <- as.numeric(stats::model.extract(mf, "response")) # Response vector
  if(any(is.infinite(y)) || any(is.na(y)))
    stop("Response contains Inf, NA or NaN values")
  if(!is.vector(K, mode = "numeric") || length(K) > 1 || is.na(K))
    stop("K must be numeric of length 1")
  if(K < 15 || K > 60)
    stop("K must be between 15 and 60")
  if(!(penorder %in% c(2, 3)))
    stop("Penalty order must be either 2 or 3")
  if(!missing(gauss.scale)){
    if(gauss.scale <= 0)
      stop("Gaussian scale parameter must be positive")
  }
  if(!missing(nbinom)){
    if(nbinom <= 1){
      stop("nbinom must be larger than one")
    }
  }

  # Chosen family
  family <- match.arg(family)
  link <- switch(family,
                 gaussian  = "identity",
                 poisson   = "log",
                 bernoulli = "logit",
                 binomial  = "logit")
  if (family == "gaussian" && missing(gauss.scale))
    stop("gauss.scale must be specified for the Gaussian family")
  if (family == "binomial" && missing(nbinom))
    stop("nbinom must be specified for the Binomial family")

  B.list <- list() # List of B-spline bases for smooth terms
  # Centered B-spline matrices
  for(j in 1:q){
    xj <- as.numeric(X[, p + j]) # values of the jth covariate
    min.xj <- min(xj) # minimum
    max.xj <- max(xj) # maximum
    xj.fine <- seq(min.xj, max.xj, length = 1000) # fine grid
    Bj.fine <- cubicbs(xj.fine, lower = min.xj, upper = max.xj, K = K)$Bmatrix
    Bj.fine.mean <- colMeans(Bj.fine)
    Bj <- cubicbs(xj, lower = min.xj, upper = max.xj, K = K)$Bmatrix
    Bj.centered <- Bj - matrix(rep(Bj.fine.mean, n), nrow = n, byrow = TRUE)
    B.list[[j]] <- Bj.centered
  }
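  # Drop the last B-spline column of each smooth term (identifiability
  # constraint, matching the column deleted from the difference matrix below)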
  B.list.trim <- lapply(B.list, function(x) x[, -K])
  B <- cbind(Z, do.call(cbind, B.list.trim)) # Design matrix ncol: q * (K-1) + p
  H <- p + (q * (K - 1))  # Latent field dimension
  if(n < H)
    warning("Number of coefficients to be estimated is larger than sample size")

  # Penalty matrix
  D <- diag(K) # Diagonal matrix
  for (k in 1:penorder) D <- diff(D) # Difference matrix of dimension K-r by K
  D <- D[, -K]                       # Delete last column for identifiability
  P <- t(D) %*% D                    # Penalty matrix of dimension K-1 by K-1
  P <- P + diag(1e-12, (K - 1))      # Perturbation to make P full rank
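  # With penorder = 2, rows of the difference matrix follow the (1, -2, 1)
  # pattern, so t(theta) %*% P %*% theta (up to the small ridge perturbation)
  # penalizes squared second-order differences of neighboring B-spline
  # coefficients (Eilers and Marx, 1996).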

  # Robust prior on penalty parameters
  a.delta <- 0.5
  b.delta <- 0.5
  nu <- 1
  prec.betas <- diag(1e-05, p) # Precision for linear regression coefficients

  ## Matrices related to the chosen family

  if(family == "gaussian") {
    gam.scale <- gauss.scale
    invgam.scale <- (1 / gam.scale)
    s.gam <- function(latent) .5 * (as.numeric(B %*% latent) ^ 2)
    s.gam.deriv <- function(pred) pred
    s.gam.deriv2 <- function(pred) 1
    W.gam <- function(latent) (1 / gam.scale) * diag(1, n)
    mu.gam <- function(latent) as.numeric(B %*% latent)
  }

  if(family == "poisson") {
    gam.scale <- 1
    invgam.scale <- 1
    s.gam <- function(latent) as.numeric(exp(B %*% latent))
    s.gam.deriv <- function(pred) exp(pred)
    s.gam.deriv2 <- function(pred) exp(pred)
    W.gam <- function(latent) diag(exp(as.numeric(B %*% latent)))
    mu.gam <- function(latent) exp(as.numeric(B %*% latent))
  }

  if(family == "bernoulli") {
    gam.scale <- 1
    invgam.scale <- 1
    s.gam <- function(latent) log(1 + exp(as.numeric(B %*% latent)))
    s.gam.deriv <- function(pred) exp(pred) / (1 + exp(pred))
    s.gam.deriv2 <- function(pred) exp(pred) / ((1 + exp(pred)) ^ 2)
    W.gam <- function(latent){
      predictor <- as.numeric(B %*% latent)
      val <- diag(exp(predictor) / ((1 + exp(predictor)) ^ 2))
      return(val)
    }
    mu.gam <- function(latent) {
      predictor <- as.numeric(B %*% latent)
      val <- exp(predictor) / (1 + exp(predictor))
      return(val)
    }
  }

  if(family == "binomial") {
    gam.scale <- 1
    invgam.scale <- 1
    s.gam <- function(latent) nbinom * log(1 + exp(as.numeric(B %*% latent)))
    s.gam.deriv <- function(pred) nbinom * (exp(pred) / (1 + exp(pred)))
    s.gam.deriv2 <- function(pred) nbinom * (exp(pred) / ((1 + exp(pred)) ^ 2))
    W.gam <- function(latent){
      predictor <- as.numeric(B %*% latent)
      val <- diag(nbinom * exp(predictor) / ((1 + exp(predictor)) ^ 2))
      return(val)
    }
    mu.gam <- function(latent) {
      predictor <- as.numeric(B %*% latent)
      val <- nbinom * exp(predictor) / (1 + exp(predictor))
      return(val)
    }
  }

  # Prior precision matrix of latent field for a given v
  Qv <- function(v) as.matrix(Matrix::bdiag(prec.betas, diag(exp(v), q) %x% P))
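  # Qv is block-diagonal: a diffuse precision block for the p linear
  # coefficients and, for each smooth term j, the penalty block exp(v_j) * P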

  BWD <- t(B) %*% diag(invgam.scale, n) # To compute log-likelihood gradient

  # Log conditional posterior of latent field given v
  logplat <- function(latent, v) {
    post <- invgam.scale * sum(y * as.numeric(B %*% latent) -
                          s.gam(latent)) - .5 * t(latent) %*% Qv(v) %*% latent
    return(post)
  }

  # Gradient of the log conditional posterior latent field
  grad.logplat <- function(latent, v) {
    grad.loglik <- BWD %*% (y - mu.gam(latent))
    grad.vec <- as.numeric(grad.loglik - Qv(v) %*% latent)
    return(grad.vec)
  }

  # Hessian of the log conditional posterior latent field
  Hess.logplat <- function(latent, v) {
    Hess.loglik <- (-1) * t(B) %*% W.gam(latent) %*% B
    Hess.mat <- Hess.loglik - Qv(v)
    return(Hess.mat)
  }

  # Laplace approximation to conditional latent field
  Laplace <- function(latent0, v){

    epsilon <- 1e-05 # Stop criterion
    maxiter <- 100   # Maximum iterations
    iter <- 0        # Iteration counter

    for (k in 1:maxiter) {
      dlat <- as.numeric((-1) * solve(Hess.logplat(latent0, v),
                                      grad.logplat(latent0, v)))
      lat.new <- latent0 + dlat
      step <- 1
      iter.halving <- 1
      logplat.current <- logplat(latent0, v)
      while (logplat(lat.new, v) <= logplat.current) {
        step <- step * .5
        lat.new <- latent0 + (step * dlat)
        iter.halving <- iter.halving + 1
        if (iter.halving > 30) {
          break
        }
      }
      dist <- sqrt(sum((lat.new - latent0) ^ 2))
      iter <- iter + 1
      latent0 <- lat.new
      if(dist < epsilon) break
    }

    latentstar <- latent0   # Posterior mode of Laplace approximation given v
    Wtilde <- W.gam(latentstar)
    Sigmastar <- solve(t(B) %*% Wtilde %*% B + Qv(v))
    return(list(latentstar = latentstar,
                Sigmastar = Sigmastar,
                Wtilde = Wtilde))
  }

  # Log-posterior of log-penalty vector
  logp.vD0 <- function(v, laplace.approx) {
    Q <- Qv(v)
    latentstar <- laplace.approx$latentstar
    Sigmastar <- laplace.approx$Sigmastar
    Wtilde <- laplace.approx$Wtilde
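    # diag(Wtilde) * B scales the rows of B by the diagonal weights, giving
    # B'WB without forming the full n x n weight matrix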
    L <- t(chol((t(B) %*% (diag(Wtilde) * B)) + Q))
    Mv <- chol2inv(t(L))
    Blat <- as.numeric(B %*% latentstar)

    lpost <- (-1) * sum(log(diag(L))) +
      .5 * (nu + K - 1) * sum(v) +
      invgam.scale * sum(y * Blat) -
      invgam.scale * sum(s.gam(latentstar)) -
      .5 * as.numeric(t(latentstar) %*% Q %*% latentstar) -
      (.5 * nu + a.delta) * sum(log(b.delta + .5 * nu * exp(v)))

    listout <- list(value = lpost,
                    latentstar = latentstar,
                    Sigmastar = Sigmastar)

    return(listout)
  }

  # Gradient of logp.vD
  gradient <- function(v, laplace.approx){

    Q <- Qv(v)
    latentstar <- laplace.approx$latentstar
    #Sigmastar <- laplace.approx$Sigmastar
    Wtilde <- laplace.approx$Wtilde
    Blat <- as.numeric(B %*% latentstar)
    wbar <- as.numeric(t(B) %*% (invgam.scale * (y - mu.gam(latentstar)) +
                                   Wtilde %*% Blat))
    L <- t(chol((t(B) %*% (diag(Wtilde) * B)) + Q))
    Mv <- chol2inv(t(L))

    Pvj <- function(j){
      as.matrix(Matrix::bdiag(matrix(0, nrow = p, ncol = p),
                              diag(diag(q)[, j] * exp(v), q) %x% P))
    }
    partial.v <- function(j) {
      Gamma.vj <- Mv %*% Pvj(j) %*% Mv
      Gwbar <- Gamma.vj %*% wbar
      Bgw <- as.numeric(B %*% Gwbar)
      Mwbar <- Mv %*% wbar
      val <- (-.5) * sum(Mv[((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1)),
                            ((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1))] *
                           (exp(v[j]) * P)) +
        (.5 * (nu + (K - 1))) -
        invgam.scale * sum(y * Bgw) +
        invgam.scale * sum(s.gam.deriv(as.numeric(B %*% Mwbar)) * Bgw) +
        as.numeric(t(wbar) %*% Gamma.vj %*% Q %*% Mwbar) -
        as.numeric(.5 * (t(wbar) %*% Gwbar)) -
        (.5 * nu + a.delta) / (1 + (2 * b.delta) / (nu * exp(v[j])))
      return(val)
    }

    partial.vec <- sapply(seq_len(q), partial.v)
    return(partial.vec)
  }

  # Hessian of logp.vD
  Hessian <- function(v, laplace.approx) {
    Q <- Qv(v)
    latentstar <- laplace.approx$latentstar
    #Sigmastar <- laplace.approx$Sigmastar
    Wtilde <- laplace.approx$Wtilde
    Blat <- as.numeric(B %*% latentstar)
    wbar <- as.numeric(t(B) %*% (invgam.scale * (y - mu.gam(latentstar)) +
                                   Wtilde %*% Blat))
    L <- t(chol((t(B) %*% (diag(Wtilde) * B)) + Q))
    Mv <- chol2inv(t(L))
    Pvj <- function(j){
      as.matrix(Matrix::bdiag(matrix(0, nrow = p, ncol = p),
                              diag(diag(q)[, j] * exp(v), q) %x% P))
    }
    Bmw <- as.numeric(B %*% Mv %*% wbar)

    partial.diag <- function(j) {
      Gamma.vj <- Mv %*% Pvj(j) %*% Mv
      MPMP <- Mv %*% Pvj(j) %*% Mv %*% Pvj(j)
      MPMP.min <- 2 * (MPMP %*% Mv) - Gamma.vj
      Gwbar <- Gamma.vj %*% wbar
      Mwbar <- Mv %*% wbar

      val <-  .5 * sum(diag((Mv %*% Pvj(j) %*% Mv %*% Pvj(j) - Mv %*% Pvj(j)))) +
        invgam.scale * sum(y * (B %*% MPMP.min %*% wbar)) -
        invgam.scale * sum((s.gam.deriv(Bmw) *
                              as.numeric(B %*% MPMP.min %*% wbar)) + (s.gam.deriv2(Bmw) *
                                                                        as.numeric(B %*% Gwbar) ^ 2)) +
        as.numeric(-2 * (t(wbar) %*% MPMP %*% Mv %*% Q %*% Mwbar) +
                     (t(wbar) %*% Gamma.vj %*% (Q + Pvj(j)) %*% Mwbar) -
                     (t(wbar) %*% Gamma.vj %*% Q %*% Gwbar)) +
        as.numeric(t(wbar) %*% MPMP %*% Mwbar) -
        as.numeric(.5 * (t(wbar) %*% Gwbar)) -
        (b.delta * (1 + (2 * a.delta) / nu) * exp(-v[j])) /
        ((1 + (2 * b.delta) / (nu * exp(v[j]))) ^ 2)
      return(val)
    }
    deriv.diag <- sapply(seq_len(q), partial.diag)

    if(q > 1){
      partial.offdiag <- function(s, j) {
        Gamma.vj <- Mv %*% Pvj(j) %*% Mv
        Gamma.vs <- Mv %*% Pvj(s) %*% Mv
        MPsMPj <- Mv %*% Pvj(s) %*% Mv %*% Pvj(j)
        GPM <- (Gamma.vs %*% Pvj(j) %*% Mv) + (Mv %*% Pvj(j) %*% Gamma.vs)
        Mwbar <- Mv %*% wbar

        val <- .5 * sum(diag(MPsMPj)) +
          invgam.scale * sum(y * (B %*% GPM %*% wbar)) -
          invgam.scale * sum((s.gam.deriv(Bmw) * (B %*% GPM %*% wbar)) +
                               (s.gam.deriv2(Bmw) * (B %*% Gamma.vs %*% wbar) *
                                  (B %*% Gamma.vj %*% wbar))) -
          as.numeric(t(wbar) %*% Gamma.vs %*% Pvj(j) %*% Mv %*% Q %*% Mwbar) -
          as.numeric(t(wbar) %*% Mv %*% Pvj(j) %*% Gamma.vs %*% Q %*% Mwbar) +
          as.numeric(t(wbar) %*% Gamma.vj %*% Pvj(s) %*% Mwbar) -
          as.numeric(t(wbar) %*% Gamma.vj %*% Q %*% Gamma.vs %*% wbar) +
          as.numeric(t(wbar) %*% Gamma.vs %*% Pvj(j) %*% Mwbar)
        return(val)
      }

      deriv.offdiag <- matrix(0, nrow = q, ncol = q)

      col.idx <- 2
      for (i in 1:(q - 1)) {
        for (j in col.idx:q) {
          offdiag.ij <- partial.offdiag(i, j) # Compute once, fill symmetrically
          deriv.offdiag[j, i] <- offdiag.ij
          deriv.offdiag[i, j] <- offdiag.ij
        }
        col.idx <- col.idx + 1
      }

      hessmatrix <- adjustPD(-(deriv.offdiag + diag(deriv.diag)))$PD
      hess.mat <- (-1) * hessmatrix
    } else{hess.mat <- deriv.diag}

    return(hess.mat)

  }

  #############################################################################
  ###       Newton-Raphson with step-halving to find max of logp.vD         ###
  #############################################################################

  NRaphson <- function(v0, epsilon = 1e-05, maxiter = 100){

    NRoutput <- matrix(0, nrow = maxiter + 1, ncol = q + 3)
    colnames(NRoutput) <- c("Iteration", paste0(rep("v", q), seq_len(q)), "f",
                            "Distance")
    NRoutput[1, 2 : (q + 1)] <- v0
    L.approx <- Laplace(rep(0, H), v0)
    NRoutput[1, (q + 2)] <- logp.vD0(v0, L.approx)$value
    NRoutput[1, (q + 3)] <- NA

    iter <- 0
    v <- v0

    for (k in 1:maxiter) {
      gradf <- gradient(v, L.approx) # Gradient of function at v
      Hessf <- Hessian(v, L.approx)  # Hessian of function at v
      dv <- (-1) * solve(Hessf, gradf) # Ascent direction
      dv <- (dv / sqrt(sum(dv ^ 2))) * 5 # Step damping
      vnew <- v + dv # New point
      step <- 1 # Initial step length
      iter.halving <- 1 # Counter for step halving
      fv <- logp.vD0(v, L.approx)$value       # Function value at v
      fvnew <- logp.vD0(vnew, L.approx)$value # Function value at vnew
      # Step halving
      while (fvnew <= fv) {
        step <- step * .5 # Decrease step size
        vnew <- v + (step * dv)
        fvnew <- logp.vD0(vnew, L.approx)$value
        iter.halving <- iter.halving + 1
        if (iter.halving > 30) break
      }
      dist <- sqrt(sum((vnew - v) ^ 2))
      iter <- iter + 1
      v <- vnew
      L.approx <- Laplace(L.approx$latentstar, v = v)
      NRoutput[k + 1, 1] <- iter
      NRoutput[k + 1, 2 : (q + 1)] <- vnew
      NRoutput[k + 1, (q + 2)] <- fvnew
      NRoutput[k + 1, (q + 3)] <- dist

      if(dist < epsilon){ # stop algorithm
        listout <- list(voptim = vnew,
                        info = NRoutput[1 : (iter + 1), ],
                        converge = TRUE,
                        iter = iter,
                        foptim = fvnew,
                        distance = dist,
                        L.approx = L.approx)
        attr(listout, "class") <- "NewtonRaphson"
        return(listout)
      }
    }

    if(dist >= epsilon){
      listout <- list(info = NRoutput,
                      converge = FALSE,
                      iter = iter)
      attr(listout, "class") <- "NewtonRaphson"
      return(listout)
    }

  }

  print.NewtonRaphson <- function(obj){
    if(obj$converge == TRUE){
      cat("Newton-Raphson algorithm converged after", obj$iter,
          "iterations.", "\n")
      cat("\n")
      cat("Maximum point:", format(round(obj$voptim, 4), nsmall = 4), "\n")
      cat("Value of objective function at maximum:",
          format(round(obj$foptim,4), nsmall = 4), "\n")
      cat("\n")
      print(as.data.frame(round(obj$info, 6)))
    }
    else cat("Newton-Raphson algorithm failed to converge")
  }

  NR.start <- rep(6, q)  # Newton-Raphson starting point
  NR.counter <- 1        # Number of Newton-Raphson runs
  NR.fail <- 0           # Indicator if Newton-Raphson fails

  newton <- NRaphson(NR.start)
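  # Restart Newton-Raphson from smaller starting values until every component
  # of the gradient at the reported optimum is below 0.5 in absolute value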

  while (sum(abs(gradient(newton$voptim,
                          Laplace(rep(0, H),newton$voptim))) < 0.5) != q) {
    NR.start <- NR.start - 2
    newton <- NRaphson(NR.start)
    NR.counter <- NR.counter + 1
    if (NR.counter >= 7) {
      NR.fail <- 1
      break
    }
  }
  if(NR.fail == 1)
    stop("Newton-Raphson did not converge")

  # Extracting info from newton
  L.approx <- newton$L.approx
  v.max <- newton$voptim
  hess.max <- Hessian(v.max, L.approx)
  logpvmax <- newton$foptim
  Cov.vmax <- solve(-hess.max)
  var.max <- diag(Cov.vmax)

  # Compute the covariance of theta_j at a v.max for all functions j=1,..,q
  post.max <-  logp.vD0(v.max, L.approx)
  latmaximum <- post.max$latentstar
  Covmaximum <- post.max$Sigmastar
  Wtilde.max <- L.approx$Wtilde
  BWB <- t(B) %*% Wtilde.max %*% B

  # Diagonal of hat matrix (as a function of v)
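  # i.e. diag((B'WB + Q(v))^{-1} B'WB), whose entries give the effective
  # degrees of freedom of each latent field component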
  diagHmatrix <- function(v) diag(solve(BWB + Qv(v)) %*% BWB)

  # Estimated effective degrees of freedom for each latent field component
  edf.full <- diagHmatrix(v.max)
  edf.smooths <- matrix(edf.full[-(1:p)], ncol = (K - 1), nrow = q, byrow = TRUE)
  edf <- c(edf.full[1:p], as.vector(t(edf.smooths)))
  names(edf) <- c(colnames(Z),
                  paste0(rep(colnames(X)[(p + 1):(p + q)], rep(K - 1, q)), ".",
                         seq_len(K - 1)))
  edf[1:p] <- round(edf[1:p], 3)

  # Estimated effective degrees of freedom of the smooth terms
  ED.functions <- c()
  for (j in 1:q) {
    fjdf <- edf[((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1))]
    edf.j <- sum(fjdf)
    if(edf.j < (penorder - 1)) edf.j <- sum(fjdf[fjdf > 0])
    if(edf.j < (penorder - 1)) edf.j <- (penorder - 1)
    ED.functions[j] <- edf.j
  }

  ED.global <- sum(edf[1:p]) + sum(ED.functions)

  # Log-posterior of log-penalty vector at Wtilde.max, wbar.max
  logp.vD <- function(v) {
    Q <- Qv(v)
    latentstar <- latmaximum
    Sigmastar <- Covmaximum
    Wtilde <- Wtilde.max
    L <- t(chol((t(B) %*% (diag(Wtilde) * B)) + Q))
    Mv <- chol2inv(t(L))
    Blat <- as.numeric(B %*% latentstar)

    lpost <- (-1) * sum(log(diag(L))) +
      .5 * (nu + K - 1) * sum(v) +
      (1 / gam.scale) * sum(y * Blat) -
      (1 / gam.scale) * sum(s.gam(latentstar)) -
      .5 * as.numeric(t(latentstar) %*% Q %*% latentstar) -
      (.5 * nu + a.delta) * sum(log(b.delta + .5 * nu * exp(v)))

    listout <- list(value = lpost,
                    latentstar = latentstar,
                    Sigmastar = Sigmastar)

    return(listout)
  }

  # Matrix to host sampled values of v for effective dimension HPD interval
  nv.sample <- 500 # Number of samples to draw for vector v
  v.sample <- matrix(0, nrow = nv.sample, ncol = q)

  #############################################################################
  ##########           Grid-based inference when q < 5                      ###
  #############################################################################
  pendist.params <- matrix(nrow = q, ncol = 3)

  if(q < 5){
    crit.val <- exp(-.5 * stats::qchisq(0.9, df = q, lower.tail = TRUE))
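    # crit.val is a chi-square based threshold (level 0.9): grid points whose
    # posterior ratio falls below it contribute negligibly; it bounds the
    # univariate exploration below and filters the mesh afterwards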
    mode.check <- 1

    mix.components <- function(){

      # Univariate grid construction
      if(q == 1){grid.size <- 10}
      if(q == 2){grid.size <- 8}
      if(q == 3){grid.size <- 5}
      if(q == 4){grid.size <- 5}
      univariate.grids <- list() # list in which to store univariate grids
      dvj <- c() # univariate grid width

      for(j in 1:q){
        # step size
        left.step.j  <- (-1) * diag(1, q)[j, ]
        right.step.j <- diag(1, q)[j, ]

        # Explore left direction in dimension j
        v.left.j <- v.max
        ratio <- 1
        while(ratio > crit.val){
          v.left.j <- v.left.j + left.step.j
          logpvleft <- logp.vD(v.left.j)$value
          if(logpvleft > logpvmax){
            mode.check <- 0
            break
          }
          ratio <- exp(logpvleft-logpvmax)
        }
        if(mode.check == 0){break}

        # Explore right direction in dimension j
        v.right.j <- v.max
        ratio <- 1
        while(ratio > crit.val){
          v.right.j <- v.right.j + right.step.j
          logpvright <- logp.vD(v.right.j)$value
          if(logpvright > logpvmax){
            mode.check <- 0
            break
          }
          ratio <- exp(logpvright-logpvmax)
        }
        if(mode.check == 0){break}

        lb.j <- v.left.j[j]  # lower bound in dimension j
        ub.j <- v.right.j[j] # upper bound in dimension j
        x.j <- seq(lb.j, ub.j, length = 20) # equidistant grid
        dj <- x.j[2]-x.j[1] # grid width
        vgrid.j <- matrix(rep(v.max, 20), ncol = q, byrow = TRUE)
        vgrid.j[, j] <- x.j
        y.j <- unlist(lapply(apply(vgrid.j, 1, logp.vD),'[[',1))
        y.j <- exp(y.j - max(y.j))
        cnorm <- 1 / sum(y.j * dj)
        y.j <- cnorm * y.j
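        # Match a skew-normal density to the normalized posterior profile of
        # v_j; snmatch returns its location, scale and shape parameters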
        snfit.j <- snmatch(x.j, y.j)
        v.sample[, j] <- sn::rsn(nv.sample, xi = snfit.j$location,
                             omega = snfit.j$scale,
                             alpha = snfit.j$shape)
        pendist.params[j, ] <- c(snfit.j$location, snfit.j$scale, snfit.j$shape)
        colnames(pendist.params) <- c("location", "scale", "shape")
        rownames(pendist.params) <- paste0(rep("SN.", q), "v", seq_len(q))
        q0.025 <- snfit.j$quant[1] # Lower bound of grid
        q0.975 <- snfit.j$quant[3] # Upper bound of grid
        grid.vals <- seq(q0.025, q0.975, length = grid.size) # Starting grid

        # Shift grid so that it goes through v.max[j]
        abs.dist <- abs(grid.vals - v.max[j])
        min.dist <- min(abs.dist)
        pos.close <- which.min(abs.dist)
        if(grid.vals[pos.close] - v.max[j] >= 0){
          grid.vals <- grid.vals - min.dist
          dir <- "left"}else{
            grid.vals <- grid.vals + min.dist
            dir <- "right"}
        if(dir == "left"){
          while(utils::tail(grid.vals, 1) < q0.975){
            grid.vals <- c(grid.vals, utils::tail(grid.vals, 1) + diff(grid.vals)[1])
          }
        }
        if(dir == "right"){
          while(utils::head(grid.vals, 1) > q0.025){
            grid.vals <- c(utils::head(grid.vals, 1) - diff(grid.vals)[1], grid.vals)
          }
        }
        univariate.grids[[j]] <- grid.vals
        dvj[j] <- grid.vals[2]-grid.vals[1]
      }

      if(mode.check == 1){ #continue and construct grid
        # Construction of the mesh (cartesian product of univariate grids)
        mesh.cartesian <- expand.grid(univariate.grids)
        eval.mesh.cartesian <- apply(mesh.cartesian, 1, logp.vD)
        eval.target <- exp(unlist(lapply(eval.mesh.cartesian, '[[', 1)) -
                             logpvmax)
        low.contrib <- which(eval.target < crit.val) # Low contribution points

        ## Computation of the mixture components
        if(length(low.contrib) > 0){ # Filter grid
          M <- nrow(mesh.cartesian)-length(low.contrib) # No. of points after filter
          mesh.log.image <- log(eval.target[-low.contrib])
          xistar.mat <- matrix(unlist(lapply(
            eval.mesh.cartesian,'[[', 2)[-low.contrib]),
            ncol = H, nrow = M, byrow = TRUE)
          var.regcoeffs <- matrix(unlist(lapply(lapply(
            eval.mesh.cartesian, '[[', 3)[-low.contrib], diag)),
            ncol = H, nrow = M, byrow = TRUE)[, 1:p]
        }else{ # No need to filter the grid
          M <- nrow(mesh.cartesian)
          mesh.log.image <- log(eval.target)
          xistar.mat <- matrix(unlist(lapply(
            eval.mesh.cartesian, '[[', 2)),
            ncol = H, nrow = M, byrow = TRUE)
          var.regcoeffs <- matrix(unlist(lapply(lapply(
            eval.mesh.cartesian,'[[',3), diag)),
            ncol = H, nrow = M, byrow = TRUE)[, 1:p]
        }
        omega.m <- exp(mesh.log.image)/sum(exp(mesh.log.image))

        list.out <- list(nquad = M, # Number of quadrature points
                        xistar.mat = xistar.mat, # Matrix of latent field values
                        var.regcoeffs = var.regcoeffs, # Variance of reg. coeffs
                        omega.m = omega.m,     # Mixture weights
                        v.sample = v.sample,   # Matrix of samples for v
                        pendist.params = pendist.params, # Parameters of sn fit
                        mode.check = mode.check) # Mode check
        return(list.out)
      } else{ # avoid grid
        return(list(mode.check = mode.check))
      }
    }

    mixture.comp <- mix.components()

    if(mixture.comp$mode.check == 1){
      pen.family <- "skew-normal"
      pendist.params <- mixture.comp$pendist.params
      v.sample <- mixture.comp$v.sample
      M <- mixture.comp$nquad
      xistar.mat <- mixture.comp$xistar.mat
      omega.m <- mixture.comp$omega.m
      var.regcoeffs <- as.matrix(mixture.comp$var.regcoeffs)
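      # The latent field posterior is approximated by a finite Gaussian
      # mixture with weights omega.m; the point estimate is the mixture mean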
      latent.hat <- colSums(omega.m * xistar.mat)
      betas.hat <- latent.hat[1:p]
      sdbeta.post <- sqrt(colSums(omega.m * (var.regcoeffs + (xistar.mat[, 1:p] -
                                  matrix(rep(betas.hat, M), nrow = M, ncol = p,
                                  byrow = TRUE)) ^ 2)))
      zscore <- betas.hat / sdbeta.post # Bayesian z-score

      # Credible intervals for regression coefficients
      betas.matCI <- matrix(0, nrow = p, ncol = 2)

      for(j in 1:p){
        support.j <- seq(betas.hat[j] - 4.5 * sdbeta.post[j],
                         betas.hat[j] + 4.5 * sdbeta.post[j],
                         length = 400)
        support.mat <- matrix(0, ncol = length(support.j), nrow = M)
        for(i in 1:length(support.j)){
          for(m in 1:M){
            support.mat[m, i] <- stats::dnorm(support.j[i], mean = xistar.mat[m,j],
                                       sd = sqrt(var.regcoeffs[m, j]))}
        }
        mix.image <- colSums(omega.m * support.mat)
        cumsum.mix <- cumsum(mix.image * diff(support.j)[1])
        lb.j <- support.j[sum(cumsum.mix < ((1 - cred.int) *.5))]
        ub.j <- support.j[sum(cumsum.mix < (1 - (1 - cred.int) * .5))]
        betas.matCI[j, ] <- c(lb.j, ub.j)
      }
    }
  }

  #############################################################################
  ###   When q >=5 inference is based on maximum a posteriori of v          ###
  #############################################################################

  if(q >= 5 || (q < 5 && mixture.comp$mode.check == 0)){
    pendist.params <- v.max
    pen.family <- "gaussian"
    latent.hat <- latmaximum
    betas.hat <- latent.hat[1:p]
    sdbeta.post <- sqrt(diag(Covmaximum)[1:p])
    zscore <- betas.hat / sdbeta.post # Bayesian z-score
    v.sample <- MASS::mvrnorm(n = nv.sample, mu = v.max, Sigma = Cov.vmax)

    # Credible intervals for regression coefficients
    betas.matCI <- matrix(0, nrow = p, ncol = 2)

    for(j in 1:p){

      betas.matCI[j, 1] <- betas.hat[j] - stats::qnorm(1-(1-cred.int) * .5) *
        sdbeta.post[j]
      betas.matCI[j, 2] <- betas.hat[j] + stats::qnorm(1-(1-cred.int) * .5) *
        sdbeta.post[j]
    }
  }

  #############################################################################
  ###                        Output results                                 ###
  #############################################################################

  # 95% HPD interval for effective dimension of smooth terms
  ED.functions.sample <- matrix(0, nrow = nv.sample, ncol = q)
  matdiagH <- matrix(0, nrow = nv.sample, ncol = H)

  for (s in 1:nv.sample) {
    dHs <- try(diagHmatrix(v.sample[s, ]), silent = TRUE)
    if (!inherits(dHs, "try-error")) {
      matdiagH[s, ] <- dHs
    } else {
      matdiagH[s, ] <- diagHmatrix(v.max) # Fall back to the posterior mode
    }
    edf.v <- matdiagH[s, ]
    for (j in 1:q) {
      edf.vj <- edf.v[((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1))]
      edf.vjsum <- sum(edf.vj)
      if(edf.vjsum < (penorder - 1)) edf.vjsum <- sum(edf.vj[edf.vj > 0])
      if(edf.vjsum < (penorder - 1)) edf.vjsum <- penorder - 1
      ED.functions.sample[s, j] <- edf.vjsum
    }
  }
  ED.functions.sample <- coda::as.mcmc(ED.functions.sample)
  ED.functions.HPD95 <- coda::HPDinterval(ED.functions.sample)[1:q, ]
  if (q == 1) ED.functions.HPD95 <- matrix(ED.functions.HPD95, ncol = 2)
  colnames(ED.functions.HPD95) <- c("lower .95", "upper .95")

  # Fitted values of gam
  fitted.values <- s.gam.deriv(as.numeric(B %*% latent.hat))

  # Response residuals
  residuals <- y - fitted.values

  # Adjusted R-squared
  r2.adj <- 1 - ((sum(residuals ^ 2) / (n - p)) / stats::var(y))

  Fmat <- solve(BWB + Qv(v.max)) %*% BWB
  edf2 <- diag(2 * Fmat - (Fmat %*% Fmat))
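  # Per-coefficient degrees of freedom based on diag(2F - FF); the per-smooth
  # sums calibrate the approximate significance test below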

  ED.functions2 <- c()
  for (j in 1:q) {
    ED.functions2[j] <- sum(edf2[((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1))])
  }

  # Approximate significance of smooth terms: test H0: fj=0
  approx.signif <- function(j){

    Vthetaj <- Covmaximum[((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1)),
                          ((p + 1) + (j - 1) * (K - 1)):(p + j * (K - 1))]

    Vfj <- B.list.trim[[j]] %*% Vthetaj %*% t(B.list.trim[[j]])
    fhatj <- as.numeric(B.list.trim[[j]] %*% thetahat.list[[j]])

    k <- floor(ED.functions2[j]) + 1
    nnu <- ED.functions2[j] - k + 1
    rrho <- sqrt(nnu *  (1 - nnu) * .5)

    eigdecomp <- RSpectra::eigs(Vfj, k = k, which = "LM")
    eigk <- eigdecomp$values
    U <- eigdecomp$vectors

    Lamb.tilde <- diag(utils::tail(eigk, 2) ^ (-.5))
    Brhonu <- matrix(c(1, rrho, rrho, nnu), ncol = 2, byrow = TRUE)
    Bpseudo <- Lamb.tilde %*% Brhonu %*% t(Lamb.tilde)

    if(k > 2) {
      Lambfull.tilde <- diag(0, k)
      Lambfull.tilde[1:(k - 2), 1:(k - 2)] <- diag(eigk[1:(k - 2)] ^ (-1), k - 2)
      Lambfull.tilde[(k - 1):k, (k - 1):k] <- Bpseudo
      Vfjinv <- U %*% Lambfull.tilde %*% t(U)
      Tr <- as.numeric(t(fhatj) %*% Vfjinv %*% fhatj)
    } else {
      Lambfull.tilde <- Bpseudo
      Vfjinv <- U %*% Lambfull.tilde %*% t(U)
      Tr <- as.numeric(t(fhatj) %*% Vfjinv %*% fhatj)
    }
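    # Under H0: fj = 0, the statistic Tr is compared to a chi-square
    # distribution with (fractional) ED.functions2[j] degrees of freedom,
    # via the Gamma(shape = edf/2, scale = 2) parameterization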
    pvalue <- stats::pgamma(Tr, shape = ED.functions2[j] * .5, scale = 2,
                            lower.tail = FALSE)
    return(list(Tr = Tr, pval = pvalue, dim = j))
  }


  # List of estimated B-spline coefficients
  thetahat.list <- as.list(as.data.frame(t(matrix(latent.hat[(p + 1):H],
                                                  ncol = (K - 1),
                                                  nrow = q, byrow = TRUE))))
  names(thetahat.list) <- paste0(rep("theta", q), seq_len(q))

  Approx.signif <- matrix(unlist(lapply(seq_len(q), approx.signif)),
                          ncol = 3, byrow = TRUE)[, 1:2]

  # The data frame (NULL when not supplied by the user)
  if (missing(data)) data <- NULL

  # Posterior estimates for linear regression coefficients
  betas_estim <- matrix(0, nrow = p, ncol = 5)
  betas_estim[, 1] <- betas.hat
  betas_estim[, 2] <- sdbeta.post
  betas_estim[, 3] <- zscore
  betas_estim[, 4] <- betas.matCI[, 1]
  betas_estim[, 5] <- betas.matCI[, 2]
  if(p > 1){
    rownames(betas_estim) <- colnames(X[, 1:p])
  }else{rownames(betas_estim) <- "(Intercept)"}
  colnames(betas_estim) <- c("Estimate", "sd.post", "z-score",
                             paste0("lower.", round(cred.int  * 100, 2)),
                             paste0("upper.", round(cred.int  * 100, 2)))

  # List of objects to return
  outlist <- list(formula = formula,          # Model formula
                  family = family,            # Gam family
                  link = link,                # Link used
                  n = n,                      # Sample size
                  q = q,                      # Number of smooth terms
                  K = K,                      # Number of B-splines
                  penalty.order = penorder,   # Chosen penalty order
                  latfield.dim = H,           # Latent field dimension
                  linear.coeff = betas_estim, # Estimated linear coeff.
                  spline.estim = thetahat.list,  # List of B-spline coeff.
                  edf = edf, # Degrees of freedom for each latent field variable
                  Approx.signif = Approx.signif, # Approximate significance smooth
                  EDf = ED.functions, # Effective degrees of freedom of smooths
                  EDfHPD.95 = ED.functions.HPD95, # 95% HPD for EDf
                  ED = ED.global,     # Effective degrees of freedom of model
                  vmap = v.max, # Maximum a posteriori of log(penalty) vector
                  Cov.vmap = Cov.vmax, # Covariance of log(penalty) vector at vmap
                  pen.family = pen.family, # Posterior distribution for v
                  pendist.params = pendist.params, # Parameters of posterior dist of v
                  Covmaximum = Covmaximum, # Covariance of lat field at v.max
                  latmaximum = latmaximum, # latent field at v.max
                  fitted.values = fitted.values, # Gam fitted values
                  residuals = residuals,         # Response residuals
                  r2.adj = r2.adj,               # Adjusted R-squared
                  data = data)                   # The data frame

  attr(outlist, "class") <- "gamlps"
  outlist

}
