R/bayesreg.R


#' Fit a linear or logistic regression model using Bayesian continuous shrinkage prior distributions. Handles ridge, lasso, horseshoe and horseshoe+ regression with logistic,
#' Gaussian, Laplace, Student-t, Poisson or geometric distributed targets. See \code{\link{bayesreg-package}} for more details on the features available in this package.
#'
#' @title Fitting Bayesian Regression Models with Continuous Shrinkage Priors
#' @param formula An object of class "\code{\link{formula}}": a symbolic description of the model to be fitted using the standard R formula notation.
#' @param data A data frame containing the variables in the model.
#' @param model The distribution of the target (y) variable. Continuous or numeric variables can be modelled with a Gaussian distribution (\code{model="gaussian"} 
#' or \code{model="normal"}), Laplace distribution (\code{model="laplace"} or \code{model="l1"}) or Student-t distribution (\code{model="studentt"} or \code{model="t"}). 
#' Integer or count data can be modelled with a Poisson distribution (\code{model="poisson"}) or geometric distribution (\code{model="geometric"}).
#' For binary targets (factors with two levels) either \code{model="logistic"} or \code{model="binomial"} should be used.
#' @param prior Which continuous shrinkage prior distribution over the regression coefficients to use. Options include ridge regression 
#' (\code{prior="rr"} or \code{prior="ridge"}), lasso regression (\code{prior="lasso"}), horseshoe regression (\code{prior="hs"} or \code{prior="horseshoe"}) and 
#' horseshoe+ regression (\code{prior="hs+"} or \code{prior="horseshoe+"}).
#' @param n.samples Number of posterior samples to generate.
#' @param burnin Number of burn-in samples.
#' @param thin Desired level of thinning.
#' @param t.dof Degrees of freedom for the Student-t distribution.
#' @section Details:
#' Draws a series of samples from the posterior distribution of a linear (Gaussian, Laplace or Student-t) or generalized linear (logistic binary, Poisson, geometric) regression model with the specified continuous 
#' shrinkage prior distribution (ridge regression, lasso, horseshoe or horseshoe+) using Gibbs sampling. The intercept parameter is always included and is never penalised.
#' 
#' While only \code{n.samples} samples are returned, the total number of samples generated is equal to \code{burnin}+\code{n.samples}*\code{thin}. To generate the samples 
#' of the regression coefficients, the code uses either Rue's algorithm (when the number of covariates is less than twice the number of samples) or the algorithm of 
#' Bhattacharya et al., as appropriate. Factor variables are automatically grouped together and 
#' additional shrinkage is applied to the set of indicator variables to which they expand.
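#' 
#' As a small worked example of the sampling schedule: with the default settings (\code{n.samples = 1e3}, \code{burnin = 1e3}, 
#' \code{thin = 5}) the sampler performs \code{1e3 + 1e3*5 = 6000} iterations in total, discards the first 1,000 and then 
#' retains every fifth of the remaining draws, returning 1,000 posterior samples.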
#' 
#' @return An object with S3 class \code{"bayesreg"} containing the results of the sampling process, plus some additional information.
#' \item{beta}{Posterior samples of the regression model coefficients.}
#' \item{beta0}{Posterior samples of the intercept parameter.}
#' \item{sigma2}{Posterior samples of the square of the scale parameter; for Gaussian distributed targets this is equal to the variance. For binary targets this is empty.}
#' \item{mu.beta}{The mean of the posterior samples for the regression coefficients.}
#' \item{mu.beta0}{The mean of the posterior samples for the intercept parameter.}
#' \item{mu.sigma2}{The mean of the posterior samples for the squared scale parameter.}
#' \item{tau2}{Posterior samples of the global shrinkage parameter.}
#' \item{t.stat}{Posterior t-statistics for each regression coefficient.}
#' \item{var.ranks}{Ranking of the covariates by their importance, with "1" denoting the most important covariate.}
#' \item{log.l}{The log-likelihood at the posterior means of the model parameters.}
#' \item{waic}{The Widely Applicable Information Criterion (WAIC) score for the model.}
#' \item{waic.dof}{The effective degrees-of-freedom of the model, as estimated by the WAIC.}
#' The returned object also stores the parameters/options used to run \code{bayesreg}:
#' \item{formula}{The object of type "\code{\link{formula}}" describing the fitted model.}
#' \item{model}{The distribution of the target (y) variable.}
#' \item{prior}{The shrinkage prior used to fit the model.}
#' \item{n.samples}{The number of samples generated from the posterior distribution.}
#' \item{burnin}{The number of burnin samples that were generated.}
#' \item{thin}{The level of thinning.}
#' \item{n}{The sample size of the data used to fit the model.}
#' \item{p}{The number of covariates in the fitted model.}
#' 
#' @references 
#' 
#' Makalic, E. & Schmidt, D. F.
#' High-Dimensional Bayesian Regularised Regression with the BayesReg Package
#' arXiv:1611.06649 [stat.CO], 2016 \url{https://arxiv.org/pdf/1611.06649.pdf}
#' 
#' Park, T. & Casella, G. 
#' The Bayesian Lasso 
#' Journal of the American Statistical Association, Vol. 103, pp. 681-686, 2008
#' 
#' Carvalho, C. M.; Polson, N. G. & Scott, J. G. 
#' The horseshoe estimator for sparse signals 
#' Biometrika, Vol. 97, 465-480, 2010
#' 
#' Makalic, E. & Schmidt, D. F. 
#' A Simple Sampler for the Horseshoe Estimator 
#' IEEE Signal Processing Letters, Vol. 23, pp. 179-182, 2016 \url{https://arxiv.org/pdf/1508.03884v4.pdf}
#' 
#' Bhadra, A.; Datta, J.; Polson, N. G. & Willard, B. 
#' The Horseshoe+ Estimator of Ultra-Sparse Signals 
#' Bayesian Analysis, 2016
#' 
#' Polson, N. G.; Scott, J. G. & Windle, J. 
#' Bayesian inference for logistic models using Polya-Gamma latent variables 
#' Journal of the American Statistical Association, Vol. 108, 1339-1349, 2013
#' 
#' Rue, H. 
#' Fast sampling of Gaussian Markov random fields 
#' Journal of the Royal Statistical Society (Series B), Vol. 63, 325-338, 2001
#' 
#' Bhattacharya, A.; Chakraborty, A. & Mallick, B. K. 
#' Fast sampling with Gaussian scale-mixture priors in high-dimensional regression 
#' arXiv:1506.04778, 2016
#' 
#' Schmidt, D.F. & Makalic, E.
#' Bayesian Generalized Horseshoe Estimation of Generalized Linear Models
#' ECML PKDD 2019: Machine Learning and Knowledge Discovery in Databases. pp 598-613, 2019
#' 
#' Stan Development Team, Stan Reference Manual (Version 2.26), Section 15.4, "Effective Sample Size",
#' \url{https://mc-stan.org/docs/2_18/reference-manual/effective-sample-size-section.html}
#'    
#' @note
#' To cite this toolbox please reference: 
#'   
#' Makalic, E. & Schmidt, D. F.
#' High-Dimensional Bayesian Regularised Regression with the BayesReg Package
#' arXiv:1611.06649 [stat.CO], 2016 \url{https://arxiv.org/pdf/1611.06649.pdf}
#' 
#' A MATLAB implementation of the bayesreg function is also available from:
#' 
#' \url{https://au.mathworks.com/matlabcentral/fileexchange/60823-flexible-bayesian-penalized-regression-modelling}
#' 
#' Copyright (C) Daniel F. Schmidt and Enes Makalic, 2016-2021
#' 
#' @seealso The prediction function \code{\link{predict.bayesreg}} and summary function \code{\link{summary.bayesreg}}
#' @examples 
#' # -----------------------------------------------------------------
#' # Example 1: Gaussian regression
#' X = matrix(rnorm(100*20),100,20)
#' b = matrix(0,20,1)
#' b[1:5] = c(5,4,3,2,1)
#' y = X %*% b + rnorm(100, 0, 1)
#' 
#' df <- data.frame(X,y)
#' rv.lm <- lm(y~.,df)                        # Regular least-squares
#' summary(rv.lm)
#' 
#' rv.hs <- bayesreg(y~.,df,prior="hs")       # Horseshoe regression
#' rv.hs.s <- summary(rv.hs)
#' 
#' # Expected squared prediction error for least-squares
#' coef_ls = coef(rv.lm)
#' as.numeric(sum( (as.matrix(coef_ls[-1]) - b)^2 ) + coef_ls[1]^2)
#' 
#' # Expected squared prediction error for horseshoe
#' as.numeric(sum( (rv.hs$mu.beta - b)^2 ) + rv.hs$mu.beta0^2)
#' 
#' 
#' # -----------------------------------------------------------------
#' # Example 2: Gaussian v Student-t robust regression
#' X = 1:10;
#' y = c(-0.6867, 1.7258, 1.9117, 6.1832, 5.3636, 7.1139, 9.5668, 10.0593, 11.4044, 6.1677);
#' df = data.frame(X,y)
#' 
#' # Gaussian ridge
#' rv.G <- bayesreg(y~., df, model = "gaussian", prior = "ridge", n.samples = 1e3)
#' 
#' # Student-t ridge
#' rv.t <- bayesreg(y~., df, model = "t", prior = "ridge", t.dof = 5, n.samples = 1e3)
#' 
#' # Plot the different estimates with credible intervals
#' plot(df$X, df$y, xlab="x", ylab="y")
#' 
#' yhat_G <- predict(rv.G, df, bayes.avg=TRUE)
#' lines(df$X, yhat_G[,1], col="blue", lwd=2.5)
#' lines(df$X, yhat_G[,3], col="blue", lwd=1, lty="dashed")
#' lines(df$X, yhat_G[,4], col="blue", lwd=1, lty="dashed")
#' 
#' yhat_t <- predict(rv.t, df, bayes.avg=TRUE)
#' lines(df$X, yhat_t[,1], col="darkred", lwd=2.5)
#' lines(df$X, yhat_t[,3], col="darkred", lwd=1, lty="dashed")
#' lines(df$X, yhat_t[,4], col="darkred", lwd=1, lty="dashed")
#' 
#' legend(1,11,c("Gaussian","Student-t (dof=5)"),lty=c(1,1),col=c("blue","darkred"),
#'        lwd=c(2.5,2.5), cex=0.7)
#' 
#' \dontrun{
#' # -----------------------------------------------------------------
#' # Example 3: Poisson/geometric regression example
#' 
#' X  = matrix(rnorm(100*5),100,5)
#' b  = c(0.5,-1,0,0,1)
#' nu = X%*%b + 1
#' y  = rpois(lambda=exp(nu),n=length(nu))
#'
#' df <- data.frame(X,y)
#'
#' # Fit a Poisson regression
#' rv.pois=bayesreg(y~.,data=df,model="poisson",prior="hs", burnin=1e4, n.samples=1e4)
#' summary(rv.pois)
#' 
#' # Fit a geometric regression
#' rv.geo=bayesreg(y~.,data=df,model="geometric",prior="hs", burnin=1e4, n.samples=1e4)
#' summary(rv.geo)
#' 
#' # Compare the two models in terms of their WAIC scores
#' cat(sprintf("Poisson regression WAIC=%g vs geometric regression WAIC=%g", 
#'             rv.pois$waic, rv.geo$waic))
#' # Poisson is clearly preferred to geometric, which is good as the data was generated from a Poisson!
#'  
#'  
#' # -----------------------------------------------------------------
#' # Example 4: Logistic regression on spambase
#' data(spambase)
#'   
#' # bayesreg expects binary targets to be factors
#' spambase$is.spam <- factor(spambase$is.spam)
#' 
#' # First take a subset of the data (1/10th) for training, reserve the rest for testing
#' spambase.tr  = spambase[seq(1,nrow(spambase),10),]
#' spambase.tst = spambase[-seq(1,nrow(spambase),10),]
#'   
#' # Fit a model using logistic horseshoe for 2,000 samples
#' rv <- bayesreg(is.spam ~ ., spambase.tr, model = "logistic", prior = "horseshoe", n.samples = 2e3)
#'   
#' # Summarise, sorting variables by their ranking importance
#' rv.s <- summary(rv,sort.rank=TRUE)
#'   
#' # Make predictions about testing data -- get class predictions and class probabilities
#' y_pred <- predict(rv, spambase.tst, type='class')
#'   
#' # Check how well our predictions did by generating a confusion matrix
#' table(y_pred, spambase.tst$is.spam)
#'   
#' # Calculate logarithmic loss on test data
#' y_prob <- predict(rv, spambase.tst, type='prob')
#' cat('Neg Log-Like for no Bayes average, posterior mean estimates: ', sum(-log(y_prob[,1])), '\n')
#' y_prob <- predict(rv, spambase.tst, type='prob', sum.stat="median")
#' cat('Neg Log-Like for no Bayes average, posterior median estimates: ', sum(-log(y_prob[,1])), '\n')
#' y_prob <- predict(rv, spambase.tst, type='prob', bayes.avg=TRUE)
#' cat('Neg Log-Like for Bayes average: ', sum(-log(y_prob[,1])), '\n')
#' }
#' 
#' @export
#' 
#' 
bayesreg <- function(formula, data, model='normal', prior='ridge', n.samples = 1e3, burnin = 1e3, thin = 5, t.dof = 5)
{
  VERSION = '1.2'
  groups  = NA
  display = F
  rankvars = T
  
  printf <- function(...) cat(sprintf(...))
  
  # Return object
  rv = list()

  # -------------------------------------------------------------------    
  # model/target distribution types
  SMN   = T
  MH    = F
  model = tolower(model)
  if (model == 'normal' || model == 'gaussian')
  {
    model = 'gaussian'
  }
  else if (model == 'laplace' || model == 'l1')
  {
    model = 'laplace'
  }
  else if (model == 'studentt' || model == 't')
  {
    model = 't'
  }
  else if (model == 'logistic' || model == "binomial")
  {
    model = 'logistic'
  }
  else if (model == 'poisson' || model == 'geometric')
  {
    SMN = F
    MH  = T
  }
  else
  {
    stop('Unknown target distribution \'',model,'\'')
  }
  
  # -------------------------------------------------------------------    
  # Process and set up the data from the model formula
  rv$terms <- stats::terms(x = formula, data = data)
  
  mf = stats::model.frame(formula = formula, data = data)
  rv$target.var = names(mf)[1]
  if (model == 'logistic')
  {
    # Check to ensure the target is a factor with exactly two levels
    if (!is.factor(mf[,1]) || length(levels(mf[,1])) != 2)
    {
      stop('Target variable must be a factor with two levels for logistic regression')
    }
    
    rv$ylevels <- levels(mf[,1])
    mf[,1] = as.numeric(mf[,1]) - 1
  }
  
  # If count regression, check targets are counts
  else if (model == 'poisson' || model == 'geometric')
  {
    if (any((floor(mf[,1]) != mf[,1])) || any(mf[,1]<0))
    {
      stop('Target variable must be non-negative integers for count regression')
    }
  }
  
  # Otherwise check to ensure target is numeric
  else if (!is.numeric(mf[,1]))
  {
    stop('Target variable must be numeric for linear regression; for binary factors use model = "logistic" instead')
  }
  
  # If not logistic, check if targets have only 2 unique values -- if so, give a warning
  if (model != 'logistic' && length(unique(mf[,1])) == 2)
  {
    warning('Target variable takes on only two distinct values -- should this be a binary regression?')
  }
  
  y = mf[,1]
  X = stats::model.matrix(formula, data=data)
  
  # Assign factors to groups if categorical variables have > 2 levels
  groups = matrix(NA,ncol(X),1)
  gnum = 1
  cn = colnames(X)
  assign = attr(X, "assign")
  for (j in 2:ncol(mf))
  {
    # If it is a factor, set up the groups as appropriate
    if (is.factor(mf[,j]) && length(levels(mf[,j]))>2)
    {
      groups[(assign==(j-1))] = gnum
      gnum = gnum+1
    }
  }
  groups = groups[2:length(groups)]  

  # Convert to a numeric matrix and drop the intercept column
  X = as.matrix(X)
  X = X[,-1,drop=FALSE]
  
  # Data dimensions
  n = nrow(X)
  p = ncol(X)

  # -------------------------------------------------------------------
  # Prior types
  if (prior == 'ridge' || prior == 'rr')
  {
    prior = 'rr'
  }
  else if (prior == 'horseshoe' || prior == 'hs')
  {
    prior = 'hs'
  }
  else if (prior == 'horseshoe+' || prior == 'hs+')
  {
    prior = 'hs+'
  }
  else if (prior != 'lasso')
  {
    stop('Unknown prior \'', prior, '\'')
  }
  
  # -------------------------------------------------------------------
  # Standardise the covariates
  std.X = bayesreg.standardise(X)
  X    = std.X$X
  
  # Initial values
  ydiff    = 0
  b0       = 0
  b        = matrix(0,p,1)
  omega2   = matrix(1,n,1)
  sigma2   = 1

  tau2     = 1
  xi       = 1
  
  lambda2  = matrix(1,p,1)
  nu       = matrix(1,p,1)

  eta2     = matrix(1,p,1)
  phi      = matrix(1,p,1)

  kappa    = y - 1/2
  z        = y

  # Quantities for computing WAIC 
  waicProb   = matrix(0,n,1) 
  waicLProb  = matrix(0,n,1)
  waicLProb2 = matrix(0,n,1)
  
  # Use Rue's MVN sampling algorithm
  mvnrue = T
  PrecomputedXtX = F
  if (p/n >= 2)
  {
    # Switch to Bhattacharya's MVN sampling algorithm
    mvnrue = F
  }
  
  # Precompute XtX?
  XtX  = NA
  Xty  = NA
  
  if (model == "gaussian" && mvnrue)
  {
    XtX = crossprod(X)
    Xty = crossprod(X,y)
    PrecomputedXtX = T
  }
  
  # Metropolis-Hastings gradient based sampling setup
  if (MH)
  {
    # If insufficient burn-in available
    if (burnin < 1e3)
    {
      stop('To use Metropolis-Hastings sampling you must use at least 1,000 burnin samples')
    }
    
    # Initialise betas and b0 with rough ridge solutions
    # Poisson
    if (model == 'poisson')
    {
      extra.model.params = NULL
      
      Xty = crossprod(X,y)
      rv.gd = bayesreg.fit.GLM.gd(X, y, model, 1, 10, 5e2)
      rv.mh = bayesreg.mgrad.L(y, X, as.vector(c(rv.gd$b,rv.gd$b0)), NULL, model, extra.model.params, Xty)
    }
    if (model == 'geometric')
    {
      extra.model.params = 1
      
      Xty = crossprod(X,y)
      rv.gd = bayesreg.fit.GLM.gd(X, y, model, 1, 10, 1e3)
      rv.mh = bayesreg.mgrad.L(y, X, as.vector(c(rv.gd$b,rv.gd$b0)), NULL, model, extra.model.params, Xty)
    }
    
    # Initialise variables
    b      = rv.gd$b
    b0     = rv.gd$b0
    L.b    = rv.mh$L
    grad.b = rv.mh$grad[1:p]
    eta    = rv.mh$eta
    H.b0   = rv.mh$H.b0
    
    extra.stats = NULL
    
    mh.tune = bayesreg.mh.initialise(75, 1e2, 1e-7, burnin, T)
    mh.tune$delta = 0.0002
  }

  # Set up group shrinkage parameters if required
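  # Covariates that are not part of a multi-level factor are collected into a final catch-all group;
  # decrementing ngroups below means the delta2 sampling loop leaves that group's shrinkage fixed at 1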
  ng = 0
  if (length(groups) > 1)
  {
    ngroups = length(unique(groups))
    groups[is.na(groups)] = ngroups
    delta2  = matrix(1,ngroups,1)
    chi     = matrix(1,ngroups,1)
    rho2    = matrix(1,ngroups,1)
    zeta    = matrix(1,ngroups,1)
    ngroups = ngroups-1
  }
  else
  {
    ngroups = 1
    groups  = matrix(1,1,p)
    delta2  = matrix(1,1,1)
    chi     = matrix(1,1,1)
    rho2    = matrix(1,1,1)
    zeta    = matrix(1,1,1)
  }

  # Setup return object
  rv$formula   = formula
  rv$model     = model
  rv$prior     = prior
  rv$n.samples = n.samples
  rv$burnin    = burnin
  rv$thin      = thin
  rv$groups    = groups
  rv$t.dof     = t.dof
  
  rv$n         = n
  rv$p         = p
  rv$tss       = sum((y - mean(y))^2)

  rv$beta0     = matrix(0,1,n.samples)
  rv$beta      = matrix(0,p,n.samples)
  rv$mu.beta   = matrix(0,p,1)
  rv$mu.beta0  = matrix(0,1,1)
  rv$sigma2    = matrix(0,1,n.samples)
  rv$mu.sigma2 = matrix(0,1,1)
  rv$tau2      = matrix(0,1,n.samples)
  #rv$xi       = matrix(0,1,n.samples)
  #rv$lambda2  = matrix(0,p,n.samples)
  #rv$nu       = matrix(0,p,n.samples)
  #rv$delta2   = matrix(0,ngroups,n.samples)
  #rv$chi      = matrix(0,ngroups,n.samples)
  
  rv$t.stat    = matrix(0,p,1)

  # Print the banner  
  if (display)
  {
    printf('==========================================================================================\n');
    printf('|                   Bayesian Penalised Regression Estimation ver. %s                    |\n', VERSION);
    printf('|                     (c) Enes Makalic, Daniel F Schmidt. 2016-2021                      |\n');
    printf('==========================================================================================\n');
  }

  # Main sampling loop
  k    = 0
  iter = 0
  while (k < n.samples)
  {
    # If using scale-mixture of normals sampler
    if (SMN)
    {
      # ===================================================================
      # Sample regression coefficients (beta)
      if (model == 'logistic')
      {
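        # Polya-Gamma data augmentation (Polson et al., 2013): kappa = y - 1/2 and omega2 holds the
        # reciprocals of the Polya-Gamma latent draws, so z is the working response for the Gaussian update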
        z = kappa * omega2
      }
  
      bs  = bayesreg.sample_beta(X, z, mvnrue, b0, sigma2, tau2, lambda2 * delta2[groups], omega2, XtX, Xty, model, PrecomputedXtX)
      muB = bs$m
      b   = bs$x
   
      # ===================================================================
      # Sample intercept (b0)
      W         = sum(1 / omega2)
      e         = y - X %*% b
      
      # Non-logistic models
      if (model != 'logistic')
      {
        muB0    = sum(e / omega2) / W
      }
      else
      {
        W       = sum(1 / omega2)
        muB0    = sum((z-(y-e)) / omega2) / W
      }
      v       = sigma2 / W
      
      # Sample b0 and update residuals
      b0      = stats::rnorm(1, muB0, sqrt(v))    
      e       = e-b0
  
      # ===================================================================
      # Sample the noise scale parameter sigma2
      mu.sigma2   = NA
      if (model != 'logistic')
      {
        #e         = y - X %*% b - b0
        shape    = (n + p)/2
        scale    = sum( (e^2 / omega2)/2 ) + sum( (b^2 / delta2[groups] / lambda2 / tau2)/2 )
        sigma2   = 1/stats::rgamma(1, shape=shape, scale=1/scale)
        
        mu.sigma2 = scale / (shape-1)
      }
    }
    # Else if using Metropolis-Hastings
    else if (MH)
    {
      mu.sigma2 = 1
      
      # Sample beta's
      mh.tune$D = mh.tune$D + 1
      
      rv.mh = bayesreg.mgrad.sample.beta(b, b0, L.b, grad.b, sigma2*tau2*lambda2*delta2[groups], mh.tune$delta, eta, H.b0, y, X, sigma2, Xty, model, extra.model.params, extra.stats)
      
      # Was the proposal accepted?
      if (rv.mh$accepted)
      {
        mh.tune$M = mh.tune$M + 1
        
        b      = rv.mh$b
        b0     = rv.mh$b0
        L.b    = rv.mh$L.b
        grad.b = rv.mh$grad.b
        eta    = rv.mh$eta
        H.b0   = rv.mh$H.b0
      }
      
      # Update as required
      muB = rv.mh$b
      muB0 = rv.mh$b0
      
      # Sample b0 using simple MH sampler (only every 25 ticks)
      if (iter %% 25 == 0)
      {
        b0.new = stats::rnorm(n=1, mean=b0, sd=sqrt(2.5/rv.mh$H.b0))
        if (model == "poisson" || model == "geometric")
        {
          rv.mh = bayesreg.mgrad.L(y, X, c(b,b0), eta-b0+b0.new, model, extra.model.params, Xty)
          #[L_bnew, grad_bnew, H_b0new, ~, extra_stats_new] = br_mGradL(y, X, [b;b0], eta-b0+b0_new, model, extra_model_params, Xty);
        }
        
        # Accept?
        if (stats::runif(1) < exp(-rv.mh$L/sigma2 + L.b/sigma2))
        {
          eta = eta - b0 + b0.new
          b0 = b0.new
          L.b = rv.mh$L
          H.b0 = rv.mh$H.b0
          grad.b = rv.mh$grad[1:p]
          extra.stats = rv.mh$extra.stats
        }
      }
    }
    
    # ===================================================================
    # Sample 'omega's
    
    # -------------------------------------------------------------------
    # Laplace
    if (model == 'laplace')
    {
      #e = y - X %*% b - b0
      mu = sqrt(2 * sigma2 / e^2)
      omega2 = 1 / bayesreg.rinvg(mu,1/2)
    }

    # -------------------------------------------------------------------
    # Student-t
    if (model == 't')
    {
      #e = y - X %*% b - b0
      shape = (t.dof+1)/2
      scale = (e^2/sigma2+t.dof)/2
      omega2 = as.matrix(1 / stats::rgamma(n, shape=shape, scale=1/scale), n, 1)
    }

    # -------------------------------------------------------------------
    # Logistic
    if (model == 'logistic')
    {
      #omega2 = 1 / rpg.devroye(num = n, n = 1, z = b0 + X %*% b)
      omega2 = as.matrix(1 / pgdraw::pgdraw(1, b0+X%*%b), n, 1)
    }

    # ===================================================================
    # Sample the global shrinkage parameter tau2 (and L.V. xi)
    # hs/hs+/ridge use tau2 ~ C+(0,1)
    if (prior == 'hs' || prior == 'hs+' || prior == 'rr')
    {
      shape = (p+1)/2
      scale = 1/xi + sum(b^2 / lambda2 / delta2[groups]) / 2 / sigma2
      tau2  = 1 / stats::rgamma(1, shape=shape, scale=1/scale)
  
      # Sample xi
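      # (equivalently xi ~ IG(1, 1 + 1/tau2); an IG(1, s) variate can be drawn as s divided by an Exp(1) draw)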
      scale = 1 + 1/tau2
      xi    = scale / stats::rexp(1)
    }
    
    # Lasso uses tau2 ~ IG(1,1)
    else if (prior == 'lasso')
    {
      shape = p/2+1
      scale = 1 + sum(b^2 / lambda2 / delta2[groups]) / 2 / sigma2
      tau2  = 1 / stats::rgamma(1, shape=shape, scale=1/scale)
    }
    
    # ===================================================================
    # Sample the lambda2's/nu's (if horseshoe, horseshoe+, horseshoe-grouped)
    if (prior == 'hs' || prior == 'hs+' || prior == 'hsge')
    {
      # Sample nu -- horseshoe
      if (prior == 'hs' || prior == 'hsge')
      {
        # Sample lambda2
        scale   = 1/nu + b^2 / 2 / tau2 / sigma2 / delta2[groups]
        lambda2 = scale / stats::rexp(p)

        scale = 1 + 1/lambda2
        nu    = scale / stats::rexp(p)
      }
      
      # Parameter expanded HS+ sampler
      else if (prior == 'hs+')
      {
        # Sample lambda2
        scale   = 1/nu + b^2 / 2 / tau2 / sigma2 / (delta2[groups]*eta2)
        lambda2 = scale / stats::rexp(p)

        # Sample nu
        scale = 1 + 1/lambda2
        nu    = scale / stats::rexp(p)
        
        # Sample eta2
        scale =  1/phi + b^2 / 2 / tau2 / sigma2 / (delta2[groups]*lambda2)
        eta2  = scale / stats::rexp(p)
        
        # Sample phi
        scale = 1 + 1/eta2
        phi   = scale / stats::rexp(p)
        
        lambda2 = lambda2*eta2
      }
    }
    
    
    # ===================================================================
    # Sample the lambda2's (if lasso)
    if (prior == 'lasso')
    {
      mu      = sqrt(2 * sigma2 * tau2 / (b^2 / delta2[groups]))
      lambda2 = 1 / bayesreg.rinvg(mu, 1/2)
    }
    
    
    # ===================================================================
    # Sample the delta2's (if grouped horseshoe, horseshoe+ or lasso)
    if (ngroups > 1)
    {
      # Sample delta2's
      for (i in 1:ngroups)
      {
        # Only sample delta2 for this group, if the group size is > 1
        ng = sum(groups == i)
        if (ng > 1)
        {
          # Grouped horseshoe/horseshoe+
          if (prior == 'hs')
          {  
            # Sample delta2
            shape = (ng+1)/2
            scale = 1 / chi[i] + sum((b[groups==i]^2) / lambda2[groups==i]) / 2 / sigma2 / tau2
            delta2[i] = 1 / stats::rgamma(1, shape=shape, scale=1/scale)
  
            # Sample chi
            scale  = 1 + 1/delta2[i]
            chi[i] = 1 / stats::rgamma(1, shape=1, scale=1/scale)
          }
          
          # Grouped horseshoe+
          else if (prior == 'hs+')
          {
            # Sample delta2
            shape = (ng+1)/2
            scale = 1 / chi[i] + sum((b[groups==i]^2) / 2 / sigma2 / tau2 / lambda2[groups==i] / rho2[i])
            delta2[i] = 1 / stats::rgamma(1, shape=shape, scale=1/scale)
            
            # Sample chi
            scale  = 1 + 1/delta2[i]
            chi[i] = scale / stats::rexp(1)
            
            # Sample rho2
            shape = (ng+1)/2
            scale =  1 / zeta[i] + sum((b[groups==i]^2) / 2 / sigma2 / tau2 / lambda2[groups==i] / delta2[i])
            rho2[i] = 1 / stats::rgamma(1, shape=shape, scale=1/scale)
            
            # Sample zeta
            scale = 1 + 1/rho2[i]
            zeta[i] = scale / stats::rexp(1)
            
            delta2[i] = delta2[i]*rho2[i]
          }
          
          # Grouped lasso
          else if (prior == 'lasso')
          {
            mu = sqrt(2 * sigma2 * tau2 / sum((b[groups==i]^2)/lambda2[groups==i]))
            delta2[i] = 1 / bayesreg.rinvg(mu, 0.5)
          }
        }
      }
    }
    
    # ===================================================================
    # Store the samples
    iter = iter+1
    if (iter > burnin)
    {
      # thinning
      if (!(iter %% thin))
      {
        # Store posterior samples
        k = k+1
        rv$beta0[k]     = b0
        rv$beta[,k]     = b
        rv$mu.beta      = rv$mu.beta + muB
        rv$mu.beta0     = rv$mu.beta0 + muB0
        rv$sigma2[k]    = sigma2
        rv$mu.sigma2    = rv$mu.sigma2 + mu.sigma2
        rv$tau2[k]      = tau2
        #rv$xi[k]       = xi
        #rv$lambda2[,k] = lambda2
        #rv$nu[,k]      = nu
        #rv$delta2[,k]  = delta2
        #rv$chi[,k]     = chi
        
        # Neg-log-likelihood scores for WAIC
        # Linear regression (Gaussian, t, Laplace)
        if (model != 'logistic' && model != 'poisson' && model != 'geometric')
        {
          rv.ll = bayesreg.linregnlike_e(model, e, sigma2, t.dof = t.dof)
        }
        # Logistic regression
        else if (model == 'logistic')
        {
          rv.ll = bayesreg.logregnlike_eta(y, y-e)
        }
        # Count regression
        else
        {
          rv.ll = bayesreg.countregnlike_eta(model, y, eta)
        }

        waicProb   = waicProb + rv.ll$prob
        waicLProb  = waicLProb + rv.ll$negll
        waicLProb2 = waicLProb2 + rv.ll$negll^2
      }
    }
    
    # Else we are in the burn-in phase; if we are using the MH sampler we need to perform step-size tuning
    else if (iter <= burnin && MH)
    {
      # Tuning step
      mh.tune = bayesreg.mgrad.tune(mh.tune)
    }
  }
  
  # Convert the accumulated sums into posterior means
  rv$mu.beta   = rv$mu.beta / n.samples
  rv$mu.beta0  = rv$mu.beta0 / n.samples
  rv$mu.sigma2 = rv$mu.sigma2 / n.samples
  
  # ===================================================================
  # Rank features, if requested
  if (rankvars)
  {
    # Run the BFR
    ranks = bayesreg.bfr(rv)

    # Determine the 75th percentile
    rv$var.ranks = rep(NA,p+1)
    q = apply(ranks,1,function(x) stats::quantile(x,0.75))
    O = sort(q,index.return = T)
    O = O$ix
    
    j = 1
    k = 1
    for (i in 1:p)
    {
      if (i >= 2)
      {
        if (q[O[i]] != q[O[i-1]])
        {
          j = j+k
          k = 1
        }
        else
        {
          k = k+1
        }
      }
      rv$var.ranks[O[i]] = j
    }
  }
  else
  {
    rv$var.ranks = rep(NA, p+1)
  }
  
  # ===================================================================
  # Compute the t-statistics
  for (i in 1:p)
  {
    rv$t.stat[i] = rv$mu.beta[i] / stats::sd(rv$beta[i,])
  }
  
  # ===================================================================
  # Compute other model statistics
  rv$yhat    = as.matrix(X) %*% matrix(rv$mu.beta, p, 1) + as.numeric(rv$mu.beta0)
  rv$r2      = 1 - sum((y - rv$yhat)^2)/rv$tss
  rv$rootmse = sqrt(mean((y - rv$yhat)^2))

  # ===================================================================
  # Compute WAIC score and log-likelihoods
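  # On the scale used here, WAIC = -lppd + p_WAIC; the effective degrees of freedom p_WAIC (waic.dof) is
  # estimated from the posterior variance of the pointwise log-likelihoods accumulated during sampling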
  rv$waic.dof = sum(waicLProb2/n.samples) - sum((waicLProb/n.samples)^2)
  rv$waic     = -sum(log(waicProb/n.samples)) + rv$waic.dof

  if (model == 'logistic')
  {
    rv$log.l  = -bayesreg.logregnlike(X, y, rv$mu.beta, rv$mu.beta0)
    rv$log.l0 = -bayesreg.logregnlike(X, y, matrix(0,p,1), log(sum(y)/(n-sum(y))))
  }
  # Count regression
  else if (model == 'poisson' || model == 'geometric')
  {
    rv$log.l  = -bayesreg.countregnlike(model, as.matrix(X), as.matrix(y), rv$mu.beta, rv$mu.beta0)
    rv$log.l0 = -bayesreg.countregnlike(model, X, y, matrix(0,p,1), log(mean(y)))
    
    # Compute overdispersion as well
    mu.eta = as.matrix(X) %*% as.matrix(rv$mu.beta) + as.numeric(rv$mu.beta0)
    if (model == 'poisson')
    {
      rv$over.dispersion = mean( (y-exp(mu.eta))^2 / exp(mu.eta) )
    }
    else
    {
      geo.p = 1/(exp(mu.eta)+1)
      rv$over.dispersion = mean( (y-exp(mu.eta))^2 / ((1-geo.p)/geo.p^2) )
    }
  }
  # Otherwise continuous targets
  else
  {
    rv$log.l  = -bayesreg.linregnlike(model, as.matrix(X), as.matrix(y), rv$mu.beta, rv$mu.beta0, rv$mu.sigma2, rv$t.dof)
  }

  # ===================================================================
  # Rescale the coefficients
  if (p == 1)
  {
    rv$beta  <- t(as.matrix(apply(t(rv$beta), 1, function(x)(x / std.X$std.X))))
  }
  else
  {
    rv$beta  <- as.matrix(apply(t(rv$beta), 1, function(x)(x / std.X$std.X)))
  }
  rv$beta0 <- rv$beta0 - std.X$mean.X %*% rv$beta
  
  rv$mu.beta  <- rv$mu.beta / t(std.X$std.X)
  rv$mu.beta0 <- rv$mu.beta0 - std.X$mean.X %*% rv$mu.beta

  rv$std.X = std.X
  
  # ===================================================================
  # Compute effective sample sizes
  rv$ess.frac = rep(NA, p)
  for (i in 1:p)
  {
    e = bayesreg.ess(rv$beta[i,])
    rv$ess.frac[i] = e$ess.frac
  }
  rv$ess.frac[ rv$ess.frac > 1 ] = 1
  rv$ess = ceiling(rv$ess.frac * n.samples)

  class(rv) = "bayesreg"
  return(rv)
}


# ============================================================================================================================
# Sample the intercept
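# The full conditional for the intercept is Gaussian: b0 | rest ~ N( sum((z - X b)/omega2)/W, sigma2/W ),
# where W = sum(1/omega2)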
bayesreg.sample_beta0 <- function(X, z, b, sigma2, omega2)
{
  rv      = list()
  W       = sum(1 / omega2)
  rv$muB0 = sum((z - X %*% b) / omega2) / W
  v       = sigma2 / W
  
  rv$b0  = stats::rnorm(1, rv$muB0, sqrt(v))

  # Done
  rv
}


# ============================================================================================================================
# Compute the effective sample size of a sequence of RVs using the algorithm from the Stan user manual
bayesreg.ess <- function(x)
{
  n = length(x)
  s = min(c(n - 1, 2e4))
  g = stats::acf(x, s, plot=F)

  i = seq(1,s-1,by=2)
  G = as.vector(g$acf[i]) + as.vector(g$acf[i+1])
  for (i in 2:length(G))
  {
    if (G[i] > G[i-1])
    {
      G[i] = G[i-1]
    }
  }
  Gz = G<0
  if (sum(Gz) == 0)
  {
    k = length(G)
  }
  else
  {
    for (i in 1:length(G))
    {
      if (Gz[i])
      {
        break
      }
    }
    k = i
  }

  rv = list()
  rv$ess = n/(-1 + 2*sum(G[1:k]))
  rv$ess.frac = rv$ess/n

  # G = as.vector(g$acf[2:(s-1)]) + as.vector(g$acf[3:s])
  # G = G < 0
  # for (i in 1:length(G))
  # {
  #   if (G[i])
  #   {
  #     break
  #   }
  # }
  # k = i
  # if (k >= 2)
  #   V = g$acf[1] + 2 * sum(g$acf[2:i])
  # else
  #   V = g$acf[1]
  # 
  # ACT = V / g$acf[1]
  # rv = list()
  # rv$ESS = n/ ACT
  # rv$ess.frac = rv$ESS / n
  
  return(rv)
}
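
# Illustrative check for bayesreg.ess (hypothetical snippet, not executed by the package): for (near)
# independent draws the ESS fraction is close to 1, while a strongly autocorrelated chain gives a much
# smaller value, e.g.
#   bayesreg.ess(rnorm(1e3))$ess.frac           # approximately 1
#   bayesreg.ess(cumsum(rnorm(1e3)))$ess.frac   # much smaller than 1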


# ============================================================================================================================
# Bayesian Feature Ranking
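# Returns a p x n.samples matrix of ranks; entry [j, i] is the rank of variable j in posterior sample i,
# with rank 1 assigned to the coefficient that has the largest absolute value in that sample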
bayesreg.bfr <- function(rv)
{
  ranks = matrix(0, rv$p, rv$n.samples)
  
  for (i in 1:rv$n.samples)
  {
    r = sort(-abs(rv$beta[,i]), index.return = T)
    ranks[r$ix,i] = 1:rv$p
  }
  
  return(ranks)
}   


#' Predict values based on Bayesian penalised regression (\code{\link{bayesreg}}) models.
#'
#' @title Prediction method for Bayesian penalised regression (\code{bayesreg}) models
#' @param object an object of class \code{"bayesreg"} created as a result of a call to \code{\link{bayesreg}}.
#' @param newdata A data frame providing the variables from which to produce predictions.
#' @param type The type of predictions to produce; if \code{type="linpred"} it will return the linear predictor for binary, 
#' count and continuous data. If \code{type="prob"} it will return predictive probability estimates for provided 'y' data (see below for more details).
#' If \code{type="response"} it will return the predicted conditional mean of the target (see below for more details).
#' If \code{type="class"} and the data is binary, it will return the best guess at the class of the target variable.
#' @param bayes.avg logical; whether to produce predictions using Bayesian averaging.
#' @param sum.stat The type of summary statistic to use; either \code{sum.stat="mean"} or \code{sum.stat="median"}.
#' @param CI The size (level, as a percentage) of the credible interval to report (default: 95, i.e. a 95\% credible interval)
#' @param ... Further arguments passed to or from other methods.
#' @section Details:
#' \code{predict.bayesreg} produces predicted values using variables from the specified data frame. The type of predictions produced 
#' depend on the value of the parameter \code{type}:
#'
#' \itemize{
#' \item If \code{type="linpred"}, the predictions that are returned will be the value of the linear predictor formed from the model 
#' coefficients and the provided data. 
#' 
#' \item If \code{type="response"}, the predictions will be the conditional mean for each data point. For Gaussian, Laplace and Student-t targets
#' the conditional mean is simply equal to the linear predictor; for binary data, the predictions will 
#' be the probability of the target being equal to the second level of the factor variable; for count data, the conditional mean
#' will be exp(linear predictor).
#' 
#' \item If \code{type="prob"}, the predictions will be probabilities. The specified data frame must include a column with the same name as the 
#' target variable on which the model was created. The predictions will then be the probability (density) values for these target values. 
#' 
#' \item If \code{type="class"} and the target variable is binary, the predictions will be the most likely class.
#' }
#' 
#' If \code{bayes.avg} is \code{FALSE} the predictions will be produced by using a summary of the posterior samples of the coefficients 
#' and scale parameters as estimates for the model. If \code{bayes.avg} is \code{TRUE}, the predictions will be produced by posterior 
#' averaging over the posterior samples of the coefficients and scale parameters, allowing the uncertainty in the estimation process to 
#' be explicitly taken into account in the prediction process. 
#' 
#' If \code{sum.stat="mean"} and \code{bayes.avg} is \code{FALSE}, the mean of the posterior samples will be used as point estimates for
#' making predictions. Likewise, if \code{sum.stat="median"} and \code{bayes.avg} is \code{FALSE}, the co-ordinate wise posterior medians 
#' will be used as estimates for making predictions. If \code{bayes.avg} is \code{TRUE} and \code{type!="prob"}, the posterior mean 
#' (median) of the predictions from each of the posterior samples will be used as predictions. The value of \code{sum.stat} has no effect 
#' if \code{type="prob"}.
#' @return 
#' \code{predict.bayesreg} produces a vector or matrix of predictions of the specified type. If \code{bayes.avg} is 
#' \code{FALSE} a matrix with a single column \code{pred} is returned, containing the predictions.
#'
#' If \code{bayes.avg} is \code{TRUE}, three additional columns are returned: \code{se(pred)}, which contains 
#' standard errors for the predictions, and two columns containing the credible intervals (at the specified level) for the predictions.
#' @seealso The model fitting function \code{\link{bayesreg}} and summary function \code{\link{summary.bayesreg}}
#' @examples
#' 
#' # -----------------------------------------------------------------
#' # Example 1: Fitting linear models to data and generating credible intervals
#' X = 1:10;
#' y = c(-0.6867, 1.7258, 1.9117, 6.1832, 5.3636, 7.1139, 9.5668, 10.0593, 11.4044, 6.1677);
#' df = data.frame(X,y)
#' 
#' # Laplace ridge
#' rv.L <- bayesreg(y~., df, model = "laplace", prior = "ridge", n.samples = 1e3)
#' 
#' # Plot the different estimates with credible intervals
#' plot(df$X, df$y, xlab="x", ylab="y")
#' 
#' yhat <- predict(rv.L, df, bayes.avg=TRUE)
#' lines(df$X, yhat[,1], col="blue", lwd=2.5)
#' lines(df$X, yhat[,3], col="blue", lwd=1, lty="dashed")
#' lines(df$X, yhat[,4], col="blue", lwd=1, lty="dashed")
#' yhat <- predict(rv.L, df, bayes.avg=TRUE, sum.stat = "median")
#' lines(df$X, yhat[,1], col="red", lwd=2.5)
#' 
#' legend(1,11,c("Posterior Mean (Bayes Average)","Posterior Median (Bayes Average)"),
#'        lty=c(1,1),col=c("blue","red"),lwd=c(2.5,2.5), cex=0.7)
#' 
#' 
#' # -----------------------------------------------------------------
#' # Example 2: Predictive density for continuous data
#' X = 1:10;
#' y = c(-0.6867, 1.7258, 1.9117, 6.1832, 5.3636, 7.1139, 9.5668, 10.0593, 11.4044, 6.1677);
#' df = data.frame(X,y)
#' 
#' # Gaussian ridge
#' rv.G <- bayesreg(y~., df, model = "gaussian", prior = "ridge", n.samples = 1e3)
#' 
#' # Produce predictive density for X=2
#' df.tst = data.frame(y=seq(-7,12,0.01),X=2)
#' prob_noavg_mean <- predict(rv.G, df.tst, bayes.avg=FALSE, type="prob", sum.stat = "mean")
#' prob_noavg_med  <- predict(rv.G, df.tst, bayes.avg=FALSE, type="prob", sum.stat = "median")
#' prob_avg        <- predict(rv.G, df.tst, bayes.avg=TRUE, type="prob")
#' 
#' # Plot the density
#' plot(NULL, xlim=c(-7,12), ylim=c(0,0.14), xlab="y", ylab="p(y)")
#' lines(df.tst$y, prob_noavg_mean[,1],lwd=1.5)
#' lines(df.tst$y, prob_noavg_med[,1], col="red",lwd=1.5)
#' lines(df.tst$y, prob_avg[,1], col="green",lwd=1.5)
#' 
#' legend(-7,0.14,c("Mean (no averaging)","Median (no averaging)","Bayes Average"),
#'        lty=c(1,1,1),col=c("black","red","green"),lwd=c(1.5,1.5,1.5), cex=0.7)
#' 
#' title('Predictive densities for X=2')
#' 
#' 
#' \dontrun{
#' # -----------------------------------------------------------------
#' # Example 3: Poisson (count) regression
#' 
#' X  = matrix(rnorm(100*5),100,5)
#' b  = c(0.5,-1,0,0,1)
#' nu = X%*%b + 1
#' y  = rpois(lambda=exp(nu),n=length(nu))
#'
#' df <- data.frame(X,y)
#'
#' # Fit a Poisson regression
#' rv.pois = bayesreg(y~.,data=df, model="poisson", prior="hs", burnin=1e4, n.samples=1e4)
#'  
#' # Make a prediction for the first five rows
#' # By default this predicts the log-rate (i.e., the linear predictor)
#' predict(rv.pois,df[1:5,]) 
#' 
#' # This is the response (i.e., conditional mean of y)
#' exp(predict(rv.pois,df[1:5,])) 
#' 
#' # Same as above ... compare to the actual targets
#' cbind(exp(predict(rv.pois,df[1:5,])), y[1:5])
#' 
#' # Slightly different as E[exp(x)]!=exp(E[x])
#' predict(rv.pois,df[1:5,], type="response", bayes.avg=TRUE) 
#' 
#' # 99% credible interval for response
#' predict(rv.pois,df[1:5,], type="response", bayes.avg=TRUE, CI=99) 
#' 
#' 
#' # -----------------------------------------------------------------
#' # Example 4: Logistic regression on spambase
#' data(spambase)
#'  
#' # bayesreg expects binary targets to be factors
#' spambase$is.spam <- factor(spambase$is.spam)
#'   
#' # First take a subset of the data (1/10th) for training, reserve the rest for testing
#' spambase.tr  = spambase[seq(1,nrow(spambase),10),]
#' spambase.tst = spambase[-seq(1,nrow(spambase),10),]
#'   
#' # Fit a model using logistic horseshoe for 2,000 samples
#' rv <- bayesreg(is.spam ~ ., spambase.tr, model = "logistic", prior = "horseshoe", n.samples = 2e3)
#'   
#' # Summarise, sorting variables by their ranking importance
#' rv.s <- summary(rv,sort.rank=TRUE)
#' 
#' # Make predictions about testing data -- get class predictions and class probabilities
#' y_pred <- predict(rv, spambase.tst, type='class')
#' y_prob <- predict(rv, spambase.tst, type='prob')
#' 
#' # Check how well our predictions did by generating a confusion matrix
#' table(y_pred, spambase.tst$is.spam)
#' 
#' # Calculate logarithmic loss on test data
#' y_prob <- predict(rv, spambase.tst, type='prob')
#' cat('Neg Log-Like for no Bayes average, posterior mean estimates: ', sum(-log(y_prob[,1])), '\n')
#' y_prob <- predict(rv, spambase.tst, type='prob', sum.stat="median")
#' cat('Neg Log-Like for no Bayes average, posterior median estimates: ', sum(-log(y_prob[,1])), '\n')
#' y_prob <- predict(rv, spambase.tst, type='prob', bayes.avg=TRUE)
#' cat('Neg Log-Like for Bayes average: ', sum(-log(y_prob[,1])), '\n')
#' }
#' @S3method predict bayesreg
#' @method predict bayesreg
#' @export
predict.bayesreg <- function(object, newdata, type = "linpred", bayes.avg = FALSE, sum.stat = "mean", CI = 95, ...)
{
  if (!inherits(object,"bayesreg")) stop("Not a valid Bayesreg object")

  # Error checking 
  if (sum.stat != "mean" && sum.stat != "median")
  {
    stop("The summary statistic must be either 'mean' or 'median'.")
  }
  if (type == "class" && object$model != "logistic")
  {
    stop("Class predictions are only available for logistic regressions.")
  }
  if (type != "linpred" && type != "prob" && type != "class" && type != "response")
  {
    stop("Type of prediction must be one of 'linpred', 'prob', 'class' or 'response'.")
  }
  if (CI <= 0 || CI >= 100)
  {
    stop("Credible interval level must be between 0 and 100 (exclusive).")
  }
  
  # Build the fully specified formula using the covariates that were fitted
  f <- stats::as.formula(paste("~",paste(attr(object$terms,"term.labels"),collapse="+")))
  
  # Extract the design matrix
  X = stats::model.matrix(f, data=newdata)
  X = as.matrix(X[,-1])
  n = nrow(X)
  p = ncol(X)

  # Get y-data if it has been passed and is not NA
  if (!any(names(newdata) == object$target.var) && type == "prob" && object$model != "logistic")
  {
    stop("You must provide a column of targets called '", object$target.var, "' to predict probabilities.")
  }
  
  if (any(names(newdata) == object$target.var) && type == "prob")
  {
    y <- as.matrix(newdata[object$target.var])
    if (any(is.na(y)) && type == "prob" && object$model != "logistic")
    {
      stop("Missing values in the target variable '", object$target.var, "' not allowed when predicting probabilities.")
    }
  }
  else {
    y <- NA
  }

  # Compute the linear predictor
  # If not averaging
  if (bayes.avg == F)
  {
    if (sum.stat == "median")
    {
      medBeta  = apply(object$beta,1,function(x) stats::quantile(x,c(0.5)))
      medBeta0 = stats::quantile(object$beta0,0.5)
      yhat = X %*% as.vector(medBeta) + as.numeric(medBeta0)
    }
    else
    {
      yhat = X %*% as.vector(object$mu.beta) + as.numeric(object$mu.beta0)
    }
  }
  # Otherwise we are averaging
  else
  {
    yhat = X %*% as.matrix(object$beta)
    yhat = t(as.matrix(apply(yhat,1,function(x) x + object$beta0)))
  }
  
  # Logistic reg -- class labels
  if (object$model == "logistic" && type == "class")
  {
    yhat = rowMeans(yhat)
    yhat[yhat>0] <- 2
    yhat[yhat<=0] <- 1
    yhat = factor(yhat, levels=c(1,2), labels=object$ylevels)

  # Logistic reg -- response prob
  } else if (object$model == "logistic" && (type == "prob" || type == "response") )
  {
    # Compute the response -- i.e., probability of Y=1
    eps = 1e-16
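    # eps nudges the fitted probabilities away from exactly 0 and 1 so that downstream calculations
    # (e.g., log-loss on the returned values) remain finite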
    yhat = (1 / (1+exp(-yhat)) + eps)/(1+2*eps)
  
    # If binary and type="prob", and y has been passed, compute probabilities for the specific y values
    if (!any(is.na(y)) && type == "prob")
    {
      yhat[y == object$ylevels[1],] <- 1 - yhat[y == object$ylevels[1],]
    }
  }
  
  # Continuous linear regression -- probability
  else if ( (object$model == "gaussian" || object$model == "laplace" || object$model == "t") && type == "prob" && !any(is.na(y)))
  {
    # Gaussian probabilities
    if (object$model == "gaussian")
    {
      if (bayes.avg == T)
      {
        yhat <- (as.matrix(apply(yhat, 2, function(x) (x - y)^2 )))
        yhat <- t(as.matrix(apply(yhat, 1, function(x) (1/2)*log(2*pi*object$sigma2) + x/2/object$sigma2)))
      }
      else
      {
        scale = as.numeric(object$mu.sigma2)
        yhat <- (1/2)*log(2*pi*scale) + (yhat - y)^2/2/scale
      }
    }

    # Laplace probabilities
    else if (object$model == "laplace")
    {
      if (bayes.avg == T)
      {
        yhat <- as.matrix(apply(yhat, 2, function(x) abs(x - y) ))
        scale <- as.matrix(sqrt(object$sigma2/2))
        yhat <- t(as.matrix(apply(yhat, 1, function(x) log(2*scale) + x/scale)))
      }
      else
      {
        scale <- sqrt(as.numeric(object$mu.sigma2)/2)
        yhat <- log(2*scale) + abs(yhat - y)/scale
      }
    }
    
    # Student-t probabilities
    else
    {
      nu = object$t.dof;
      if (bayes.avg == T)
      {
        yhat <- (as.matrix(apply(yhat, 2, function(x) x - y)))^2
        scale = as.matrix(object$sigma2)
        yhat <- t(as.matrix(apply(yhat, 1, function(x) (-lgamma((nu+1)/2) + lgamma(nu/2) + log(pi*nu*scale)/2) + (nu+1)/2*log(1 + 1/nu*x/scale) )))
      }
      else
      {
        yhat <- (-lgamma((nu+1)/2) + lgamma(nu/2) + log(pi*nu*as.numeric(object$mu.sigma2))/2) + (nu+1)/2*log(1 + 1/nu*(yhat-y)^2/as.numeric(object$mu.sigma2))
      }
    }
    
    yhat <- exp(-yhat)
  }
  
  # Count regression
  if ((object$model == "poisson" || object$model == "geometric") && type != "linpred")
  {
    # Response
    if (type == "response")
    {
      yhat <- exp(yhat)
    }
    
    # Otherwise, probabilities
    else
    {
      # Poisson
      if (object$model == "poisson")
      {
        if (bayes.avg == F)
        {
          yhat <- (exp(yhat)) - ( y*yhat ) + lgamma(y+1)
        }
        else
        {
          yhat = exp(yhat) - apply(yhat, 2, function(x) (x*y))
          yhat = apply(yhat, 2, function(x)(x+lgamma(y+1)))
        }
      }
      # Geometric
      if (object$model == "geometric")
      {
        eps = 1e-16
        if (bayes.avg == F)
        {
          geo.p = 1/(1+exp(yhat))
          geo.p = (geo.p+eps)/(1+2*eps)
          yhat  = -y*log(1-geo.p) - log(geo.p)
        }
        else
        {
          geo.p = 1/(1+exp(yhat))
          geo.p = (geo.p+eps)/(1+2*eps)
          
          yhat = apply(-log(1-geo.p), 2, function(x)(x*y))
          yhat = yhat - log(geo.p)
        }
      }
      
      yhat = exp(-yhat)
    }
  }

  # If not class labels and averaging, also compute SE's and CI's
  if (type != "class" && bayes.avg == T)
  {
    # Standard errors
    se = apply(yhat,1,stats::sd)
    
    # Credible intervals
    CI = round(CI,2)
    CI.lwr = (1 - CI/100)/2
    CI.upr = 1 - CI.lwr
    CI = apply(yhat,1,function(x) stats::quantile(x,c(CI.lwr,0.5,CI.upr)))
    
    # Store results
    r = matrix(0, n, 4)
    if (sum.stat == "median" && type != "prob")
    {
      r[,1] = as.matrix(CI[2,])
    }
    else
    {
      r[,1] = as.matrix(rowMeans(yhat))
    }
    r[,2] = as.matrix(se)
    r[,3] = as.matrix(CI[1,])
    r[,4] = as.matrix(CI[3,])
    colnames(r) <- c("pred","se(pred)",sprintf("CI %g%%", CI.lwr*100),sprintf("CI %g%%", CI.upr*100))
  }
  
  # Otherwise just return the class labels
  else
  {
    r = yhat
    if (object$model != "logistic")
    {
      colnames(r) <- "pred"
    }
  }
  
  return(r)
}


#' \code{summary} method for Bayesian regression models fitted using \code{\link{bayesreg}}.
#'
#' @title Summarization method for Bayesian penalised regression (\code{bayesreg}) models
#' @param object An object of class \code{"bayesreg"} created as a result of a call to \code{\link{bayesreg}}.
#' @param sort.rank logical; if \code{TRUE}, the variables in the summary will be sorted by their importance as determined by their rank estimated by 
#' the Bayesian feature ranking algorithm.
#' @param display.OR logical; if \code{TRUE}, the variables will be summarised in terms of their cross-sectional odds-ratios rather than their 
#' regression coefficients (logistic regression only).
#' @param CI numerical; the level of the credible interval reported in summary. Default is 95 (i.e., 95\% credible interval).
#' @param ... Further arguments passed to or from other methods.
#' @section Details:
#' The \code{summary} method computes a number of summary statistics and displays these for each variable in a table, along 
#' with suitable header information.
#' 
#' For continuous target variables, the header information includes a posterior estimate of the standard deviation of the random disturbances (errors), the \eqn{R^2} statistic
#' and the Widely Applicable Information Criterion (WAIC) statistic. For logistic regression models, the header information includes the negative 
#' log-likelihood at the posterior mean of the regression coefficients, the pseudo \eqn{R^2} score and the WAIC statistic. For count
#' data (Poisson and geometric), the header information includes an estimate of the degree of overdispersion (observed variance divided by expected variance around the conditional mean, with a value < 1 indicating underdispersion),
#' the pseudo \eqn{R^2} score and the WAIC statistic.
#' 
#' The main table summarises properties of the coefficients for each of the variables. The first column is the variable name. The 
#' second and third columns are either the mean and standard error of the coefficients, or the median and standard error of the 
#' cross-sectional odds-ratios if \code{display.OR=TRUE}. 
#' 
#' The fourth and fifth columns are the end-points of the credible intervals of the coefficients (odds-ratios). The sixth column displays the 
#' posterior \eqn{t}-statistic, calculated as the ratio of the posterior mean to the posterior standard deviation for the coefficient. 
#' The seventh column is the importance rank assigned to the variable by the Bayesian feature ranking algorithm. 
#' 
#' In between the seventh and eighth columns are up to two asterisks indicating significance; a variable scores a first asterisk if 
#' the 75\% credible interval does not include zero, and scores a second asterisk if the 95\% credible interval does not include zero. The 
#' final column gives an estimate of the effective sample size for the variable, ranging from 0 to n.samples, which indicates the 
#' effective number of i.i.d. draws from the posterior (if we could do this instead of using MCMC) represented by the samples
#' we have drawn. This quantity is computed using the algorithm presented in the Stan Bayesian sampling package documentation.
#' @return Returns an object with the following fields:
#'   
#' \item{log.l}{The log-likelihood of the model at the posterior mean estimates of the regression coefficients.}
#' \item{waic}{The Widely Applicable Information Criterion (WAIC) score of the model.}
#' \item{waic.dof}{The effective degrees-of-freedom of the model, as estimated by the WAIC.}
#' \item{r2}{For non-binary data, the R^2 statistic.}
#' \item{sd.error}{For non-binary data, the estimated standard deviation of the errors.}
#' \item{p.r2}{For binary data, the pseudo-R^2 statistic.}
#' \item{mu.coef}{The posterior means of the regression coefficients.}
#' \item{se.coef}{The posterior standard deviations of the regression coefficients.}
#' \item{CI.coef}{The posterior credible interval for the regression coefficients, at the level specified (default: 95\%).}
#' \item{med.OR}{For binary data, the posterior median of the cross-sectional odds-ratios.}
#' \item{se.OR}{For binary data, the posterior standard deviation of the cross-sectional odds-ratios.}
#' \item{CI.OR}{For binary data, the posterior credible interval for the cross-sectional odds-ratios.}
#' \item{t.stat}{The posterior t-statistic for the coefficients.}
#' \item{n.stars}{The significance level for the variable (see above).}
#' \item{rank}{The variable importance rank as estimated by the Bayesian feature ranking algorithm (see above).}
#' \item{ESS}{The effective sample size for the variable.}
#' \item{log.l0}{For binary data, the log-likelihood of the null model (i.e., with only an intercept).}
#' @seealso The model fitting function \code{\link{bayesreg}} and prediction function \code{\link{predict.bayesreg}}.
#' @examples
#' 
#' X = matrix(rnorm(100*20),100,20)
#' b = matrix(0,20,1)
#' b[1:9] = c(0,0,0,0,5,4,3,2,1)
#' y = X %*% b + rnorm(100, 0, 1)
#' df <- data.frame(X,y)
#' 
#' rv.hs <- bayesreg(y~.,df,prior="hs")       # Horseshoe regression
#' 
#' # Summarise without sorting by variable rank
#' rv.hs.s <- summary(rv.hs)
#' 
#' # Summarise sorting by variable rank and provide 75% credible intervals
#' rv.hs.s <- summary(rv.hs, sort.rank = TRUE, CI=75)
#'
#' @S3method summary bayesreg
#' @method summary bayesreg
#' @export
summary.bayesreg <- function(object, sort.rank = FALSE, display.OR = FALSE, CI = 95, ...)
{
    VERSION = '1.2'
    
    # Credible interval must be between 1 and 99
    CI = round(CI)
    if (CI < 1 || CI > 99)
    {
      stop("Credible interval level must be between 1 and 99.")
    }
    CI.low  = (1 - CI/100)/2
    CI.high = 1- CI.low
  
    if (!inherits(object,"bayesreg")) stop("Not a valid Bayesreg object.")
  
    printf <- function(...) cat(sprintf(...))
    repchar <- function(c, n) paste(rep(c,n),collapse="")

    n   = object$n
    px  = object$p
    
    PerSD = F

    rv = list()

    # Error checking
    if (display.OR == T && object$model != "logistic")
    {
      stop("Can only display cross-sectional odds-ratios for logistic models.")
    }
        
    # Banner
    printf('==========================================================================================\n');
    printf('|                   Bayesian Penalised Regression Estimation ver. %s                    |\n', VERSION);
    printf('|                     (c) Enes Makalic, Daniel F Schmidt. 2016-2021                      |\n');
    printf('==========================================================================================\n');
    
    # ===================================================================
    # Table symbols
    chline = '-'
    cvline = '|'
    cTT    = '+'
    cplus  = '+'
    bTT    = '+'
    
    # ===================================================================
    # Find length of longest variable name
    varnames = c(labels(object$mu.beta)[[1]], "_cons")
    maxlen   = 12
    nvars    = length(varnames)
    for (i in 1:nvars)
    {
      if (nchar(varnames[i]) > maxlen)
      {
        maxlen = nchar(varnames[i])
      }
    }
    
    # ===================================================================
    # Display pre-table stuff
    #printf('%s%s%s\n', repchar('=', maxlen+1), '=', repchar('=', 76))
    #printf('\n')
    
    if (object$model != 'logistic')
    {
      model = 'linear'
    } else {
      model = 'logistic'
    }
    
    if (object$model == "gaussian")
    {
      distr = "Gaussian"
    } else if (object$model == "laplace")
    {
      distr = "Laplace"
    }
    else if (object$model == "t")
    {
      distr = paste0("Student-t (DOF = ",object$t.dof,")")
    }
    else if (object$model == "logistic")
    {
      distr = "logistic"
    }
    else if (object$model == "poisson")
    {
      distr = "Poisson"
    }
    else if (object$model == "geometric")
    {
      distr = "geometric"
    }
    
    if (object$prior == 'rr' || object$prior == 'ridge') {
      prior = 'ridge'
    } else if (object$prior == 'lasso') {      
      prior = 'lasso'
    } else if (object$prior == 'hs' || object$prior == 'horseshoe') {
      prior = 'horseshoe'
    } else if (object$prior == 'hs+' || object$prior == 'horseshoe+') {
      prior = 'horseshoe+'
    }

    str = sprintf('Bayesian %s %s regression', distr, prior)
    printf('%-64sNumber of obs   = %8d\n', str, n)
    printf('%-64sNumber of vars  = %8.0f\n', '', px);
    
    if (object$model == 'gaussian' || object$model == 'laplace' || object$model == 't')
    {
      s2 = mean(object$sigma2)
      if (object$model == 't' && object$t.dof > 2)
      {
        s2 = object$t.dof/(object$t.dof - 2) * s2
      }
      else if (object$model == 't' && object$t.dof <= 2)
      {
        s2 = NA
      }

      str = sprintf('MCMC Samples   = %6.0f', object$n.samples);
      if (!is.na(s2))
      {
        printf('%-64sstd(Error)      = %8.5g\n', str, sqrt(s2));
      }
      else {
        printf('%-64sstd(Error)      =        -\n', str);
      }
          
      str = sprintf('MCMC Burnin    = %6.0f', object$burnin);
      printf('%-64sR-squared       = %8.4f\n', str, object$r2);    
      str = sprintf('MCMC Thinning  = %6.0f', object$thin);
      printf('%-64sWAIC            = %8.5g\n', str, object$waic);
      
      rv$sd.error  = sqrt(s2)
      rv$r2        = object$r2
      rv$waic      = object$waic
      rv$waic.dof  = object$waic.dof
      rv$log.l     = object$log.l
    }
    else if (object$model == 'logistic')
    {
      log.l = object$log.l
      log.l0 = object$log.l0
      r2 = 1 - log.l / log.l0
      
      str = sprintf('MCMC Samples   = %6.0f', object$n.samples);
      printf('%-64sLog. Likelihood = %8.5g\n', str, object$log.l);
      str = sprintf('MCMC Burnin    = %6.0f', object$burnin);
      printf('%-64sPseudo R2       = %8.4f\n', str, r2);    
      str = sprintf('MCMC Thinning  = %6.0f', object$thin);
      printf('%-64sWAIC            = %8.5g\n', str, object$waic);
      
      rv$log.l     = log.l
      rv$p.r2      = r2
      rv$waic      = object$waic
      rv$waic.dof  = object$waic.dof
    }
    else if (object$model == 'poisson' || object$model == 'geometric')
    {
      log.l = object$log.l
      log.l0 = object$log.l0
      r2 = 1 - log.l / log.l0
      
      str = sprintf('MCMC Samples   = %6.0f', object$n.samples);
      printf('%-64sOverdispersion  = %8.5g\n', str, object$over.dispersion);
      str = sprintf('MCMC Burnin    = %6.0f', object$burnin);
      printf('%-64sPseudo R2       = %8.4f\n', str, r2);    
      str = sprintf('MCMC Thinning  = %6.0f', object$thin);
      printf('%-64sWAIC            = %8.5g\n', str, object$waic);
      
      rv$log.l     = log.l
      rv$p.r2      = r2
      rv$waic      = object$waic
      rv$waic.dof  = object$waic.dof      
    }
    printf('\n')
    
    # ===================================================================
    # Table Header
    fmtstr = sprintf('%%%ds', maxlen);
    printf('%s%s%s\n', repchar(chline, maxlen+1), cTT, repchar(chline, 77))
    tmpstr = sprintf(fmtstr, 'Parameter');
    if (model == 'linear')
    {
      printf('%s %s  %10s %10s    [%d%% Cred. Interval] %10s %7s %10s\n', tmpstr, cvline, 'mean(Coef)', 'std(Coef)', CI, 'tStat', 'Rank', 'ESS');
    } else if (model == 'logistic' && display.OR == F) {
      printf('%s %s  %10s %10s    [%d%% Cred. Interval] %10s %7s %10s\n', tmpstr, cvline, 'med(Coef)', 'std(Coef)', CI, 'tStat', 'Rank', 'ESS');
    } else if (model == 'logistic' && display.OR == T) {
      printf('%s %s  %10s %10s    [%d%% Cred. Interval] %10s %7s %10s\n', tmpstr, cvline, 'med(OR)', 'std(OR)', CI, 'tStat', 'Rank', 'ESS');
    }
    printf('%s%s%s\n', repchar(chline, maxlen+1), cTT, repchar(chline, 77));

    # ===================================================================
    if (PerSD) {
      beta0 = object$beta0
      beta  = object$beta
      
      object$beta0   = beta0 + object$std.X$mean.X %*% object$beta
      object$beta    = apply(t(object$beta), 1, function(x)(x * object$std.X$std.X/sqrt(n)))    
      object$mu.beta = object$mu.beta * t(as.matrix(object$std.X$std.X)) / sqrt(n)
    }

    # ===================================================================
    # Return values
    rv$mu.coef   = matrix(0,px+1,1, dimnames = list(varnames))
    rv$se.coef   = matrix(0,px+1,1, dimnames = list(varnames))
    rv$CI.coef   = matrix(0,px+1,2, dimnames = list(varnames))
    
    if (model == "logistic")
    {
      rv$med.OR  = matrix(0,px+1,1, dimnames = list(varnames))
      rv$se.OR   = matrix(0,px+1,1, dimnames = list(varnames))
      rv$CI.OR   = matrix(0,px+1,2, dimnames = list(varnames))
    }

    rv$t.stat    = matrix(0,px+1,1, dimnames = list(varnames))
    rv$n.stars   = matrix(0,px+1,1, dimnames = list(varnames))
    rv$ESS       = matrix(0,px+1,1, dimnames = list(varnames))
    rv$rank     = matrix(0,px+1,1, dimnames = list(varnames))

    # Variable information
    rv$rank[1:(px+1)] = as.matrix(object$var.ranks)

    # If sorting by BFR ranks
    if (sort.rank == T)
    {
      O = sort(object$var.ranks, index.return = T)
      indices = O$ix
      indices[px+1] = px+1
    }    
    # Else sorted by the order they were passed in
    else {
      indices = (1:(px+1))
    }
    
    for (i in 1:(px+1))
    {
      k = indices[i]
      kappa = NA
      
      # Regression variable
      if (k <= px)
      {
        s = object$beta[k,]
        mu = object$mu.beta[k]

        # Posterior t-statistic for this coefficient
        kappa = object$t.stat[k]
      }
      
      # Intercept
      else if (k == (px+1))
      {
        s = object$beta0
        mu = mean(s)
      }
      
      # Compute credible intervals/standard errors for beta's
      std_err = stats::sd(s)
      qlin = stats::quantile(s, c(CI.low, CI.high))
      qlog = stats::quantile(exp(s), c(CI.low, CI.high))
      q = qlin

      rv$mu.coef[k]  = mu
      rv$se.coef[k]  = std_err
      rv$CI.coef[k,] = c(qlin[1],qlin[2])

      # Compute credible intervals/standard errors for OR's
      if (model == 'logistic')
      {
        med_OR = stats::median(exp(s))
        # Approximate SE from the CI width via a normal approximation (calibrated for a 95% interval)
        std_err_OR = (qlog[2]-qlog[1])/2/1.96
        
        rv$med.OR[k] = med_OR
        rv$se.OR[k]  = std_err_OR
        rv$CI.OR[k,] = c(qlog[1],qlog[2])

        # If display ORs, use these instead
        if (display.OR)
        {
          mu = med_OR
          std_err = std_err_OR
          q = qlog
        }
        # Otherwise use posterior medians of coefficients for stability
        else
        {
          mu = stats::median(s)
        }
      }
      rv$t.stat[k] = kappa
      
      # Display results
      tmpstr = sprintf(fmtstr, varnames[k])
      if (is.na(kappa))
        t.stat = '         .'
      else t.stat = sprintf('%10.3f', kappa)
      
      if (is.na(object$var.ranks[k]))
        rank = '      .'
      else
        rank = sprintf('%7d', object$var.ranks[k])

      printf('%s %s %11.5f %10.5f   %10.5f %10.5f %s %s ', tmpstr, cvline, mu, std_err, q[1], q[2], t.stat, rank);

      # Model selection scores
      qlin = stats::quantile(s, c(0.025, 0.125, 0.875, 0.975))
      qlog = stats::quantile(exp(s), c(0.025, 0.125, 0.875, 0.975))

      # Test if the 75% CI excludes 0
      if ( k <= px && ( (qlin[2] > 0 && qlin[3] > 0) || (qlin[2] < 0 && qlin[3] < 0) ) )    
      {
        printf('*')
        rv$n.stars[k] = rv$n.stars[k] + 1
      }
      else 
        printf(' ')
      
      # Test if the 95% CI excludes 0
      if ( k <= px && ( (qlin[1] > 0 && qlin[4] > 0) || (qlin[1] < 0 && qlin[4] < 0) ) )    
      {
        printf('*')
        rv$n.stars[k] = rv$n.stars[k] + 1
      }
      else
        printf(' ')

      # Display ESS-frac
      if(k > px)
        printf('%8s', '.')
      else
        printf('%8d', object$ess[k])

      rv$ESS[k] = object$ess[k]
      
      printf('\n');
    }
    
    printf('%s%s%s\n\n', repchar(chline, maxlen+1), cTT, repchar(chline, 77));
    
    if (model == 'logistic')
    {
      rv$log.l0 = object$log.l0
    }
    
    invisible(rv)
}
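
# Example (not run): a minimal sketch of using the list returned invisibly by
# summary.bayesreg(); `rv.hs` is the hypothetical horseshoe fit from the
# roxygen example above.
#   s <- summary(rv.hs)
#   s$mu.coef      # posterior means of the coefficients
#   s$CI.coef      # credible intervals at the requested level
#   s$rank         # variable importance ranks from the Bayesian feature ranking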


# ============================================================================================================================
# function to standardise columns of X to have mean zero and unit length
bayesreg.standardise <- function(X)
{
  n = nrow(X)
  p = ncol(X)
  
  # Store the original data along with the column means and scales
  r       = list()
  r$X     = X
  if (p > 1)
  {
    r$mean.X = colMeans(X)
  } else
  {
    r$mean.X = mean(X)
  }
  r$std.X  = t(apply(X,2,stats::sd)) * sqrt(n-1)
  
  # Perform the standardisation
  if (p == 1)
  {
    r$X <- as.matrix(apply(X,1,function(x)(x - r$mean.X)))
    r$X <- as.matrix(apply(r$X,1,function(x)(x / r$std.X)))
  } else
  {
    r$X <- t(as.matrix(apply(X,1,function(x)(x - r$mean.X))))
    r$X <- t(as.matrix(apply(r$X,1,function(x)(x / r$std.X))))
  }
  
  return(r)
}
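
# Example (not run): a small sketch checking the standardisation; `X.demo` is a
# hypothetical matrix, not part of the package.
#   X.demo <- matrix(rnorm(50 * 3), 50, 3)
#   std    <- bayesreg.standardise(X.demo)
#   colMeans(std$X)           # approximately 0 for every column
#   sqrt(colSums(std$X^2))    # approximately 1 for every column (unit length)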


# ============================================================================================================================
# Sample the regression coefficients
bayesreg.sample_beta <- function(X, z, mvnrue, b0, sigma2, tau2, lambda2, omega2, XtX, Xty, model, PrecomputedXtX)
{
  alpha  = (z - b0)
  Lambda = sigma2 * tau2 * lambda2
  sigma  = sqrt(sigma2)
  
  # Use Rue's algorithm
  if (mvnrue)
  {
    # If XtX is not precomputed
    if (!PrecomputedXtX)
    {
      #omega = sqrt(omega2)
      #X0    = apply(X,2,function(x)(x/omega))
      #bs    = bayesreg.fastmvg2_rue(X0/sigma, alpha/sigma/omega, Lambda)

      omega = sqrt(omega2)*sigma
      X0 = X
      # A column-wise loop is faster than apply() here
      for (i in 1:length(lambda2))
      {
        X0[,i] = X[,i]/omega
      }

      bs    = bayesreg.fastmvg2_rue.nongaussian(X0, alpha, Lambda, sigma2, omega)
    }
    
    # XtX is precomputed (Gaussian only)
    else {
      #bs    = bayesreg.fastmvg2_rue(X/sigma, alpha/sigma, Lambda, XtX/sigma2)
      bs    = bayesreg.fastmvg2_rue.gaussian(X, alpha, Lambda, XtX, Xty, sigma2, PrecomputedXtX=T)
      #bs    = bayesreg.fastmvg2_rue.gaussian.2(X, alpha, Lambda, XtX, sigma2, PrecomputedXtX=T)
    }
  }
  
  # Else use the algorithm of Bhattacharya et al.
  else
  {
    omega = sqrt(omega2)*sigma
    
    # Non-Gaussian models (handled as a heteroskedastic Gaussian via the latent scales omega2)
    if (model != 'gaussian')
    {
      #omega = sqrt(omega2)
      #X0    = apply(X,2,function(x)(x/omega))
      #bs    = bayesreg.fastmvg_bhat(X0/sigma, alpha/sigma/omega, Lambda)
      
      # A column-wise loop is faster than apply() here
      X0    = X
      for (i in 1:length(lambda2))
      {
        X0[,i] = X[,i]/omega
      }
    }
    # Gaussian 
    else
    {
      X0 = X/sigma
    }
    
    # Generate a sample
    bs    = bayesreg.fastmvg_bhat(X0, alpha/omega, Lambda)
  }
  
  return(bs)
}


# ============================================================================================================================
# function to generate multivariate normal random variates using Rue's algorithm for non-Gaussian noise
bayesreg.fastmvg2_rue.nongaussian <- function(Phi, alpha, d, sigma2, omega)
{
  Phi   = as.matrix(Phi)
  alpha = as.matrix(alpha)
  r     = list()

  p     = ncol(Phi)
  Dinv  = diag(as.vector(1/d), nrow = length(d))

  PtP   = t(Phi)%*%Phi
  
  L     = chol(PtP + Dinv)
  v     = forwardsolve(t(L), (crossprod(Phi,alpha/omega)))
  r$m   = backsolve(L, v)
  w     = backsolve(L, stats::rnorm(p,0,1))
  
  r$x   = r$m + w
  return(r)
}
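
# Example (not run): a sketch verifying that the returned mean solves
# (Phi'Phi + D^{-1}) m = Phi'(alpha/omega); all inputs below are hypothetical.
#   Phi   <- matrix(rnorm(30 * 4), 30, 4)
#   alpha <- rnorm(30); d <- rep(2, 4); omega <- rep(1, 30)
#   r     <- bayesreg.fastmvg2_rue.nongaussian(Phi, alpha, d, sigma2 = 1, omega = omega)
#   max(abs(r$m - solve(crossprod(Phi) + diag(1/d), crossprod(Phi, alpha/omega))))  # ~ 0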


# ============================================================================================================================
# function to generate multivariate normal random variates using Rue's algorithm for Gaussian noise
bayesreg.fastmvg2_rue.gaussian <- function(Phi, alpha, d, PtP = NA, Ptalpha = NA, sigma2, PrecomputedXtX=F)
{
  Phi   = as.matrix(Phi)
  alpha = as.matrix(alpha)
  r     = list()
  
  # If PtP not precomputed
  if (!PrecomputedXtX)
  {
    PtP = t(Phi) %*% Phi
  }

  p     = ncol(Phi)
  Dinv  = diag(as.vector(1/d), nrow = length(d))

  L     = chol(PtP/sigma2 + Dinv)
  
  v     = forwardsolve(t(L), (Ptalpha/sigma2))
  r$m   = backsolve(L, v)
  w     = backsolve(L, stats::rnorm(p,0,1))
  
  r$x   = r$m + w
  return(r)
}

# ============================================================================================================================
# function to generate multivariate normal random variates using the algorithm of Bhattacharya et al.
bayesreg.fastmvg_bhat <- function(Phi, alpha, d)
{
  #d     = as.matrix(d)
  p     = ncol(Phi)
  n     = nrow(Phi)
  r     = list()
  
  u     = as.matrix(stats::rnorm(p,0,1)) * sqrt(d)
  delta = as.matrix(stats::rnorm(n,0,1))
  
  v     = Phi %*% u + delta
  #Dpt   = (apply(Phi, 1, function(x)(x*d)))
  #
  #Dpt = Phi
  #for (i in 1:length(alpha))
  #{
  #  Dpt[i,] = Dpt[i,]*d
  #}
  #Dpt = t(Dpt)
    
  Dpt   = Phi
  for (i in 1:length(d))
  {
    Dpt[,i] = Dpt[,i]*d[i]
  }
  Dpt = t(Dpt)
  
  W     = Phi %*% Dpt + diag(1,n)

  #w     = solve(W,(alpha-v))

  L     = chol(W)
  vv    = forwardsolve(t(L), (alpha-v))
  w     = backsolve(L, vv)

  r$x   = u + Dpt %*% w
  r$m   = r$x
  
  return(r)
}
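
# Example (not run): a Monte Carlo sketch checking that draws from
# bayesreg.fastmvg_bhat() are centred on (Phi'Phi + D^{-1})^{-1} Phi'alpha, the
# same target as Rue's algorithm, at O(n^2 p) cost (useful when p >> n).
# All inputs below are hypothetical.
#   set.seed(1)
#   Phi   <- matrix(rnorm(10 * 50), 10, 50)    # n = 10 << p = 50
#   alpha <- rnorm(10); d <- rep(0.5, 50)
#   draws <- replicate(2000, as.vector(bayesreg.fastmvg_bhat(Phi, alpha, d)$x))
#   m.ref <- solve(crossprod(Phi) + diag(1/d), crossprod(Phi, alpha))
#   max(abs(rowMeans(draws) - m.ref))          # small; shrinks as the number of draws grows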

# ============================================================================================================================
# Draw random variates from the inverse Gaussian distribution
bayesreg.rinvg <- function(mu, lambda)
{
  lambda = 1/lambda
  p = length(mu)
  
  V = stats::rnorm(p,mean=0,sd=1)^2
  out = mu + 1/2*mu/lambda * (mu*V - sqrt(4*mu*lambda*V + mu^2*V^2))
  out[out<1e-16] = 1e-16
  z = stats::runif(p)
  
  l = z >= mu/(mu+out)
  out[l] = mu[l]^2 / out[l]
  
  return(out)
}
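
# Example (not run): a quick moment check. With the construction above,
# bayesreg.rinvg(mu, lambda) draws inverse Gaussian variates with mean mu and
# shape 1/lambda, so for the hypothetical inputs below the sample mean should
# be near 2 and the sample variance near mu^3 * lambda = 0.8.
#   set.seed(1)
#   draws <- bayesreg.rinvg(mu = rep(2, 1e5), lambda = rep(0.1, 1e5))
#   mean(draws)    # approximately 2
#   var(draws)     # approximately 0.8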

# ============================================================================================================================
# compute the negative log-likelihood for linear regression (Gaussian, Laplace or Student-t errors)
bayesreg.linregnlike <- function(error, X, y, b, b0, s2, t.dof = NA)
{
  n = nrow(y)

  y = as.matrix(y)
  X = as.matrix(X)
  b = as.matrix(b)
  b0 = as.numeric(b0)
  
  e = (y - X %*% as.matrix(b) - as.numeric(b0))
  
  if (error == 'gaussian')
  {
    negll = n/2*log(2*pi*s2) + 1/2/s2*t(e) %*% e
  }
  else if (error == 'laplace')
  {
    scale <- sqrt(as.numeric(s2)/2)
    negll = n*log(2*scale) + sum(abs(e))/scale
  }
  else if (error == 't')
  {
    nu = t.dof;
    negll = n*(-lgamma((nu+1)/2) + lgamma(nu/2) + log(pi*nu*s2)/2) + (nu+1)/2*sum(log(1 + 1/nu*e^2/as.numeric(s2)))
  }
  
  return(negll)
}
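
# Example (not run): a sketch checking the Gaussian branch against stats::dnorm;
# X.demo, y.demo and b.demo are hypothetical test objects.
#   X.demo <- matrix(rnorm(20 * 2), 20, 2); b.demo <- c(1, -1)
#   y.demo <- as.matrix(X.demo %*% b.demo + rnorm(20))
#   bayesreg.linregnlike('gaussian', X.demo, y.demo, b.demo, b0 = 0, s2 = 1)
#   -sum(dnorm(y.demo, X.demo %*% b.demo, 1, log = TRUE))    # same value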

# ============================================================================================================================
# compute the negative log-likelihood for logistic regression
bayesreg.logregnlike <- function(X, y, b, b0)
{
  y = as.matrix(y)
  X = as.matrix(X)
  b = as.matrix(b)
  b0 = as.numeric(b0)
  
  eps = exp(-36)
  lowerBnd = log(eps)
  upperBnd = -lowerBnd
  muLims = c(eps, 1-eps)
  
  eta = as.numeric(b0) + X %*% as.matrix(b)
  eta[eta < lowerBnd] = lowerBnd
  eta[eta > upperBnd] = upperBnd
  
  mu = 1 / (1 + exp(-eta))
  
  mu[mu < eps] = eps
  mu[mu > (1-eps)] = (1-eps)
  
  negll = -sum(y*log(mu)) -
    sum((1.0 - y)*log(1.0 - mu))
  
  return(negll)
}
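
# Example (not run): away from the clamping bounds this matches the usual
# Bernoulli negative log-likelihood; all inputs below are hypothetical.
#   X.demo <- matrix(rnorm(20 * 2), 20, 2); b.demo <- c(0.5, -0.5)
#   y.demo <- rbinom(20, 1, 0.5)
#   bayesreg.logregnlike(X.demo, y.demo, b.demo, b0 = 0)
#   -sum(dbinom(y.demo, 1, plogis(X.demo %*% b.demo), log = TRUE))    # same value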

# ============================================================================================================================
# compute the negative log-likelihood for count regression
bayesreg.countregnlike <- function(model, X, y, b, b0)
{
  n = nrow(y)
  
  y = as.matrix(y)
  X = as.matrix(X)
  b = as.matrix(b)
  b0 = as.numeric(b0)
  
  eta = X %*% as.matrix(b) + as.numeric(b0)

  if (model == 'poisson')
  {
    negll = sum(-y*eta + exp(eta) + lgamma(y+1))
  }
  else if (model == 'geometric')
  {
    mu = exp(eta)
    geo.p = 1/(1+exp(eta))
    
    eps = exp(-36)
    geo.p[geo.p < eps] = eps
    geo.p[geo.p > (1-eps)] = (1-eps)    
    
    negll = sum(-y*log(1-geo.p) - log(geo.p))
  }

  return(negll)
}
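
# Example (not run): the Poisson branch agrees with -sum(dpois(..., log = TRUE))
# and the geometric branch with -sum(dgeom(..., log = TRUE)) up to the
# probability clamping; inputs below are hypothetical.
#   X.demo <- matrix(rnorm(20), 20, 1); b.demo <- 0.3
#   y.demo <- as.matrix(rpois(20, exp(0.3 * X.demo)))
#   bayesreg.countregnlike('poisson', X.demo, y.demo, b.demo, b0 = 0)
#   -sum(dpois(y.demo, exp(X.demo %*% b.demo), log = TRUE))    # same value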

# ============================================================================================================================
# Compute individual likelihoods for linear regression using precomputed residuals
bayesreg.linregnlike_e <- function(error, e, s2, t.dof = NA)
{
  n = nrow(e)

  if (error == 'gaussian')
  {
    negll = 1/2*log(2*pi*s2) + 1/2/s2*e^2
  }
  else if (error == 'laplace')
  {
    scale <- sqrt(as.numeric(s2)/2)
    negll = 1*log(2*scale) + abs(e)/scale
  }
  else if (error == 't')
  {
    nu = t.dof;
    negll = 1*(-lgamma((nu+1)/2) + lgamma(nu/2) + log(pi*nu*s2)/2) + (nu+1)/2*(log(1 + 1/nu*e^2/as.numeric(s2)))
  }

  rv = list(negll = negll, prob = exp(-negll))
  
  return(rv)
}

# ============================================================================================================================
# Compute individual likelihoods for logistic regression using precomputed predictor
bayesreg.logregnlike_eta <- function(y, eta)
{
  y = as.matrix(y)
  b = as.matrix(eta)

  eps = exp(-36)
  lowerBnd = log(eps)
  upperBnd = -lowerBnd
  muLims = c(eps, 1-eps)
  
  # Constrain the linear predictor
  eta[eta < lowerBnd] = lowerBnd
  eta[eta > upperBnd] = upperBnd
  
  mu = 1 / (1 + exp(-eta))
  
  # Constrain probabilities 
  mu[mu < eps] = eps
  mu[mu > (1-eps)] = (1-eps)
  
  # Return quantities
  negll = -y*log(mu) - (1.0-y)*log(1.0-mu)
  
  rv = list(negll = negll, prob = exp(-negll))

  return(rv)
}

# ============================================================================================================================
# Compute individual likelihoods for count regression using precomputed predictor
bayesreg.countregnlike_eta <- function(model, y, eta)
{
  if (model == 'poisson')
  {
    negll = -y*eta + exp(eta) + lgamma(y+1)
  }
  else if (model == 'geometric')
  {
    mu = exp(eta)
    geo.p = 1/(1+exp(eta))

    eps = exp(-36)
    geo.p[geo.p < eps] = eps
    geo.p[geo.p > (1-eps)] = (1-eps)    

    negll = -y*log(1-geo.p) - log(geo.p)
  }
  
  rv = list(negll = negll, prob = exp(-negll))

  return(rv)
}

# ============================================================================================================================
# Simple gradient descent fitting of generalized linear models
bayesreg.fit.GLM.gd <- function(X, y, model, xi, tau2, max.iter = 5e2)
{
  nx = nrow(X)
  px = ncol(X)
  theta = matrix(data=0, nrow = px+1, ncol = 1)
  
  # Precompute statistics
  Xty = crossprod(X,y)
  
  # Learning rate
  kappa = 1
  
  # Error checking
  # ...  
  
  # Any special initialisation
  if (model == 'gaussian')
  {
    theta[px+1] = mean(y)
  }
  
  # Optimise
  rv.L = bayesreg.grad.L(y, X, theta, model, Xty, xi, tau2)
  for (i in 1:max.iter)
  {
    # Update estimates
    kappa.vec = matrix(kappa,px+1,1)
    kappa.vec[px+1] = kappa.vec[px+1]/sqrt(nx)
    theta.new = theta - kappa.vec*rv.L$grad
    
    # Have we improved?
    rv.L.new = bayesreg.grad.L(y, X, theta.new, model, Xty, xi, tau2)
    if (rv.L.new$L < rv.L$L)
    {
      theta = theta.new
      rv.L  = rv.L.new
    } 
    else 
    {
      # If not, halve the learning rate
      kappa = kappa/2
    }
  }
  
  # Return
  rv = list()
  rv$b       = theta[1:px]
  rv$b0      = theta[px+1]
  rv$grad.b  = rv.L$grad[1:px]
  rv$grad.b0 = rv.L$grad[px+1]
  rv$L       = rv.L$L
  
  rv
}
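
# Example (not run): a rough sanity check. With a very diffuse prior (large
# tau2) the penalised gradient-descent fit should be close to the unpenalised
# Poisson MLE from stats::glm(). All inputs below are hypothetical.
#   set.seed(1)
#   X.demo <- matrix(rnorm(200), 200, 1)
#   y.demo <- rpois(200, exp(0.5 + 0.8 * X.demo))
#   fit    <- bayesreg.fit.GLM.gd(X.demo, y.demo, 'poisson', xi = NULL, tau2 = 1e6)
#   c(fit$b0, fit$b)                               # intercept and slope ...
#   coef(glm(y.demo ~ X.demo, family = poisson))   # ... roughly matching these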

# ============================================================================================================================
# Simple likelihood/gradient function for gradient descent
bayesreg.grad.L <- function(y, X, theta, model, Xty, xi, tau2)
{
  px = length(theta)-1
  rv = list()
  
  # Poisson regression
  if (model == 'poisson')
  {
    # Form the linear predictor
    eta = X %*% theta[1:px] + theta[px+1]
    eta = pmin(eta, 500)
    mu  = as.vector(exp(eta))
    
    # Poisson likelihood (up to constants)
    rv$L = sum(mu) - sum( y*eta ) + sum(theta[1:px]^2)/2/tau2
    
    # Poisson gradient
    rv$grad = matrix(0, px+1, 1)
    rv$grad[1:px] = crossprod(X,mu) - Xty + theta[1:px]/tau2;
    rv$grad[px+1] = sum(mu) - sum(y)
  }
  
  else if (model == 'geometric')
  {
    r = xi[1]
    
    # Form the linear predictor
    eta = X %*% theta[1:px] + theta[px+1]
    eta = pmin(eta, 500)
    mu  = as.vector(exp(eta))
    
    # Geometric likelihood (up to constants)
    rv$L = -sum(eta*y) + sum(log(mu+r)*(y+1)) + sum(theta[1:px]^2)/2/tau2
    
    # Geometric gradient
    rv$grad = matrix(0, px+1, 1)
    c = as.vector((mu*(y+1)/(mu+r)))
    rv$grad[1:px] = crossprod(X,c) - Xty + theta[1:px]/tau2
    rv$grad[px+1] = sum(c) - sum(y)
  }
  
  rv
}
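
# Example (not run): a finite-difference sketch checking the analytic gradient
# returned by bayesreg.grad.L() for the Poisson case; all inputs are hypothetical.
#   X.demo <- matrix(rnorm(50), 50, 1); y.demo <- rpois(50, 1)
#   theta  <- c(0.2, 0.1); Xty <- crossprod(X.demo, y.demo)
#   g      <- bayesreg.grad.L(y.demo, X.demo, theta, 'poisson', Xty, NULL, tau2 = 10)$grad
#   eps    <- 1e-6
#   num.g  <- sapply(1:2, function(j) {
#     tp <- theta; tp[j] <- tp[j] + eps
#     (bayesreg.grad.L(y.demo, X.demo, tp, 'poisson', Xty, NULL, 10)$L -
#        bayesreg.grad.L(y.demo, X.demo, theta, 'poisson', Xty, NULL, 10)$L) / eps
#   })
#   max(abs(g - num.g))    # small (finite-difference error only)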

# ============================================================================================================================
# Gradient and likelihoods for GLM models (for use with mgrad tools)
bayesreg.mgrad.L <- function(y, X, theta, eta, model, xi, Xty)
{
  rv = list()
  px = ncol(X)
  
  # Form the linear predictor if needed
  if (is.null(eta))
  {
    rv$eta = as.vector(X %*% theta[1:px] + theta[px+1])
  }  
  else
  {
    rv$eta = eta
  }
  
  # Poisson regression
  if (model == 'poisson')
  {
    # Form the linear predictor
    #rv$eta = pmin(rv$eta, 500);
    rv$eta[rv$eta>500] = 500
    mu  = exp(rv$eta)
  
    # Poisson likelihood (up to constants)
    rv$L = sum(mu) - sum( y*rv$eta )
  
    # Poisson gradient
    rv$grad = matrix(0, px+1, 1)
    rv$grad[1:px] = crossprod(X,mu) - Xty
    rv$grad[px+1]   = sum(mu) - sum(y)
    
    rv$H.b0 = sum(mu)
  }
  
  # Geometric regression
  if (model == "geometric")
  {
    r = xi[1]
    
    rv$eta[rv$eta>500] = 500
    mu  = exp(rv$eta)
    
    # Geometric likelihood (up to constants)
    rv$L = -sum(rv$eta*y) + sum(log(mu+r)*(y+1))
    
    # Geometric gradient
    rv$grad = matrix(0, px+1, 1)
    c = as.vector((mu*(y+1)/(mu+r)))
    rv$grad[1:px] = crossprod(X,c) - Xty
    rv$grad[px+1] = sum(c) - sum(y)
    
    rv$H.b0 = sum(mu/(mu+1))
  }
  
  #
  rv$extra.stats = NULL
  
  rv
}

# ============================================================================================================================
# Create a tuning structure for adaptive step-size tuning
bayesreg.mh.initialise <- function(window, delta.max, delta.min, burnin, display)
{
  tune = list()
  
  # Step-size setup
  tune$M      = 0
  tune$window = window
  
  tune$delta.max = delta.max
  tune$delta.min = delta.min
  
  # Start in phase 1
  tune$iter            = 0
  tune$burnin          = burnin
  tune$W.phase         = 1
  tune$W.burnin        = 0
  tune$nburnin.windows = floor(burnin/window)
  tune$m.window        = rep(0, tune$nburnin.windows)
  tune$n.window        = rep(0, tune$nburnin.windows)
  tune$delta.window    = rep(0, tune$nburnin.windows)
  tune$display         = display
  
  tune$phase.cnt       = c(0,0,0)
  
  tune$M               = 0
  tune$D               = 0
  
  tune$b.tune          = NULL
  
  # Start at the maximum delta (phase 1)
  tune$delta           = delta.max
  
  tune
}

# ============================================================================================================================
# Sample the betas using the mGrad algorithm
bayesreg.mgrad.sample.beta <- function(b, b0, L.b, grad.b, D, delta, eta, H.b0, y, X, sigma2, Xty, model, extra.model.params, extra.stats)
{
  px = ncol(X)

  # Get quantities needed for the proposal
  rv.prop = bayesreg.mgrad.update.proposal(px, D, delta)
  
  # Generate proposal from marginal proposal distribution    
  #bnew = normrnd( mGrad_prop_2.*((2/delta)*(b - (delta/2)*grad_b/sigma2)), sqrt((2/delta)*mGrad_prop_1+mGrad_prop_2) );
  b.new = rnorm( n=px, mean=rv.prop$prop.2*((2/delta)*(b - (delta/2)*grad.b/sigma2)), sd=sqrt((2/delta)*rv.prop$prop.1+rv.prop$prop.2) );
  
  # Accept/reject?
  extra.stats.new = NULL
  if (model == 'poisson')
  {
    rv.mh.new = bayesreg.mgrad.L(y, X, c(b.new,b0), NULL, 'poisson', extra.model.params, Xty)
  }
  if (model == 'geometric')
  {
    rv.mh.new = bayesreg.mgrad.L(y, X, c(b.new,b0), NULL, 'geometric', extra.model.params, Xty)
  }
  grad.b.new = rv.mh.new$grad[1:px]
  
  h1 = bayesreg.mgrad.hfunc(b, b.new, -grad.b.new/sigma2, rv.prop$prop.2, delta)
  h2 = bayesreg.mgrad.hfunc(b.new, b, -grad.b/sigma2, rv.prop$prop.2, delta);
  
  mhprob = min(exp(-(rv.mh.new$L - L.b)/sigma2 + h1 - h2), 1)

  #if (is.na(mhprob) || is.nan(mhprob))
  #{
  #  mhprob
  #}
  
  # Check for acceptance  
  rv = list()
  rv$accepted = F
  if (runif(1) < mhprob && !any(is.infinite(b.new)))
  {
    rv$accepted = T
    
    rv$b = b.new;
    rv$b0 = b0
    rv$L.b = rv.mh.new$L
    rv$grad.b = grad.b.new
    rv$eta = rv.mh.new$eta
    rv$H.b0 = rv.mh.new$H.b0
    rv$extra.stats = extra.stats.new
  }
  # If reject, stay where we are
  else
  {
    rv$b = b
    rv$b0 = b0
    rv$L.b = L.b
    rv$grad.b = grad.b
    rv$eta = eta
    rv$H.b0 = H.b0
    rv$extra.stats = extra.stats
  }
  
  rv
}

# ============================================================================================================================
# Update the quantities used for MH proposal
bayesreg.mgrad.update.proposal <- function(px, Lambda, delta)
{
  rv = list()
  
  # Update proposal information
  rv$prop.2 = (1/(1/Lambda + (2/delta)))
  rv$prop.1 = rv$prop.2^2
  
  rv
}
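
# Note: prop.2 is the elementwise marginal proposal variance 1/(1/Lambda + 2/delta),
# i.e. the prior precision 1/Lambda combined with 2/delta (which plays the role of the
# auxiliary-variable precision in the mGrad construction); prop.1 = prop.2^2 then enters
# the proposal variance (2/delta)*prop.1 + prop.2 used in bayesreg.mgrad.sample.beta().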

# ============================================================================================================================
# 'h'-function for mGrad MH acceptance probability
bayesreg.mgrad.hfunc <- function(x, y, grad.y, v, delta)
{
  h = t((x - (2/delta)*v*(y + (delta/4)*grad.y))*(1/((2/delta)*v+1))) %*% grad.y
}

# ============================================================================================================================
# Tune the MH sampler step-size
bayesreg.mgrad.tune <- function(tune)
{
  tune$iter = tune$iter + 1

  DELTA.MAX = exp(40)
  DELTA.MIN = exp(-40)
  NUM.PHASE.1 = 15
  
  # Perform a tuning step, if necessary
  if ( (tune$iter %% tune$window) == 0)
  {
    #W_burnin, W_phase, delta, delta_min, delta_max, mc, nc, delta_window, m_window, n_window)

    # Store the measured acceptance probability
    tune$W.burnin                    = tune$W.burnin + 1;
    tune$delta.window[tune$W.burnin] = log(tune$delta);
    tune$m.window[tune$W.burnin]     = tune$M+1
    tune$n.window[tune$W.burnin]     = tune$D+2
    
    # If in phase 1, we are exploring the space uniformly
    if (tune$W.phase == 1)
    {
      tune$phase.cnt[1] = tune$phase.cnt[1]+1;
      #d = linspace(log(DELTA_MIN),log(DELTA_MAX),NUM_PHASE_1);
      d = seq(log(DELTA.MIN), log(DELTA.MAX), length.out = NUM.PHASE.1)
      
      tune$delta = exp(d[tune$phase.cnt[1]]);
      
      if (tune$delta < tune$delta.min)
      {
        tune$delta.min = tune$delta
      }
      if (tune$delta > tune$delta.max)
      {
        tune$delta.max = tune$delta
      }
      
      # If we have exhausted the phase-1 grid, switch to phase 2
      if (tune$phase.cnt[1] == NUM.PHASE.1)
      {
        tune$W.phase = 2
      }
    }    
    # Else in phase 2, we probe randomly, guided by the fitted acceptance model
    else {
      tune$phase.cnt[2] = tune$phase.cnt[2]+1
      
      # Fit a logistic regression to the response and generate new random probe point
      YY          = matrix(c(tune$m.window[1:tune$W.burnin], tune$n.window[1:tune$W.burnin]), tune$W.burnin, 2)
      tune$b.tune = suppressWarnings(glm(y ~ x, data=data.frame(y=YY[,1]/YY[,2],x=tune$delta.window[1:tune$W.burnin]), family=binomial, weights=YY[,2]))
      
      probe.p     = runif(1)*0.7 + 0.15
      tune$delta  = exp( -(log(1/probe.p-1) + tune$b.tune$coefficients[1])/tune$b.tune$coefficients[2] )
      tune$delta  = min(tune$delta, DELTA.MAX);
      tune$delta  = max(tune$delta, DELTA.MIN);
      
      if (tune$delta > tune$delta.max)
      {
        tune$delta.max = tune$delta
      }
      else if (tune$delta < tune$delta.min)
      {
        tune$delta.min = tune$delta
      }
      
      if (tune$delta == tune$delta.max || tune$delta == tune$delta.min)
      {
        tune$delta = exp(runif(1)*(log(tune$delta.max) - log(tune$delta.min)) + log(tune$delta.min))
      }
    }
    
    #
    tune$M = 0
    tune$D = 0
  }    

  # If we have reached the last sample of the burn-in, select a suitable delta
  if (tune$iter == tune$burnin)
  {
    # If the algorithm has not explored the space sufficiently, give an error
    if (tune$phase.cnt[2] < 100)
    {
      stop('Metropolis-Hastings sampler has not explored the step-size space sufficiently; please increase the number of burnin samples');
    }
    
    #tune$b.tune = glmfit(tune.delta_window(1:tune.W_burnin), [tune.m_window(1:tune.W_burnin), tune.n_window(1:tune.W_burnin)], 'binomial');
    YY          = matrix(c(tune$m.window[1:tune$W.burnin], tune$n.window[1:tune$W.burnin]), tune$W.burnin, 2)
    tune$b.tune = suppressWarnings(glm(y ~ x, data=data.frame(y=YY[,1]/YY[,2],x=tune$delta.window[1:tune$W.burnin]), family=binomial, weights=YY[,2]))
    
    # Select the final delta to use
    tune$delta  = exp( -(log(1/0.55-1) + tune$b.tune$coefficients[1])/tune$b.tune$coefficients[2] )
    #if (tune.delta == 0 || isinf(tune.delta))
    #    tune.delta = log(tune.delta_max)/2;
    #end
    
    if (tune$display)
    {
      #df = data.frame(log.delta=tune$delta.window[1:tune$W.burnin], p=YY[,1]/YY[,2])
      #df.2 = data.frame(y=predict(tune$b.tune,newdata=data.frame(x=seq(log(tune$delta.min)-5,log(tune$delta.max)+5,length.out=100)),type="response"), x=seq(log(tune$delta.min)-5,log(tune$delta.max)+5,length.out=100))
      #print(ggplot(df, aes(x=log.delta,y=p)) + geom_point() + geom_line(data=df.2,aes(x=x,y=y,color="red")) + labs(title=paste("mGrad Burnin Tuning: Final delta = ", sprintf("%.3g",tune$delta), sep=""), x="log(delta)", y="Estimated Probability of Acceptance") + theme(legend.position = "none"))
    }
  }
  
  # Done
  tune    
}
