R/fpsden.r

Defines functions fpsden lpsden nlpsden cvpsden iwlspsden

Documented in cvpsden fpsden iwlspsden lpsden nlpsden

#' @export
#' 
#' @title MLE Fitting of P-splines Density Estimator
#'
#' @description Maximum likelihood estimation for P-splines density estimation. Histogram binning
#' produces frequency counts, which are modelled by constrained B-splines in a Poisson regression. A penalty
#' based on differences in the sequences B-spline coefficients is used to smooth/interpolate the counts.
#' Iterated weighted least squares (IWLS) for a mixed model representation of the P-splines regression,
#' conditional on a particular penalty coefficient, is used for estimating the B-spline coefficients.
#' Leave-one-out cross-validation deviances are available for estimation of the penalty coefficient.
#' 
#' @param lambdaseq   vector of \eqn{\lambda}'s (or scalar) to be considered in profile likelihood. Required.
#' @param ord         order of difference used in the penalty term
#' @param lambda      penalty coefficient
#' @param breaks      histogram breaks (as in \code{\link[graphics:hist]{hist}} function)
#' @param counts      counts from histogram binning
#' @param bsplines    matrix of B-splines
#' @inheritParams     psden
#' @inheritParams     fgpd
#' 
#' @details The P-splines density estimator is fitted using maximum likelihood estimation, following
#' the approach of Eilers and Marx (1996). Histogram binning produces frequency counts, which are
#' modelled by constrained B-splines in a Poisson regression. A penalty
#' based on differences in the sequences B-spline coefficients is used to smooth/interpolate the counts.
#'
#' The B-splines are defined as in Eiler and Marx (1996), so that those are meet the boundary are simply
#' shifted and truncated version of the internal B-splines. No renormalisation is carried out. They are not
#' "natural" B-spline which are also commonly in use. Note that atural B-splines can be obtained by suitable
#' linear combinations of these B-splines. Hence, in practice there is little difference in the fit obtained
#' from either B-spline definition, even with the penalty constraining the coefficients. If the user desires
#' they can force the use of natural B-splines, by prior specification of the \code{design.knots}
#' with appropriate replication of the boundaries, see \code{\link[evmix:psden]{dpsden}}.
#' 
#' Iterated weighted least squares (IWLS) for a mixed model representation of the P-splines regression,
#' conditional on a particular penalty coefficient, is used for estimating the B-spline coefficients which
#' is equivalent to maximum likelihood estimation. Leave-one-out cross-validation deviances are available
#' for estimation of the penalty coefficient.
#' 
#' The parameter vector is the B-spline coefficients \code{beta}, no matter whether the penalty coefficient is
#' fixed or estimated. The penalty coefficient \code{lambda} is treated separately.
#' 
#' The log-likelihood functions \code{\link[evmix:fpsden]{lpsden}} and \code{\link[evmix:fpsden]{nlpsden}}
#' evaluate the likelihood for the original dataset, using the fitted P-splines density estimator. The
#' log-likelihood is output as \code{nllh} from the fitting function \code{\link[evmix:fpsden]{fpsden}}.
#' They do not provide the likelihood for the Poisson regression of the histogram counts, which is usually
#' evaluated using the deviance. The deviance (via CVMSE for Poisson counts) is also output as \code{cvlambda}
#' from the fitting function \code{\link[evmix:fpsden]{fpsden}}.
#' 
#' The \code{\link[evmix:fpsden]{iwlspsden}} function performs the IWLS. The 
#' \code{\link[evmix:fpsden]{cvpsden}} function calculates the leave-one-out cross-validation 
#' sum of the squared errors. They are not designed to be used directly by users. No checks of the
#' inputs are carried out.
#' 
#' @return Log-likelihood for original data is given by \code{\link[evmix:fpsden]{lpsden}} and it's
#'   wrappers for negative log-likelihood from \code{\link[evmix:fpsden]{nlpsden}}. Cross-validation 
#'   sum of square of errors is provided by \code{\link[evmix:fpsden]{cvpsden}}. Poisson regression
#'   fitting by IWLS is carried out in \code{\link[evmix:fpsden]{iwlspsden}}. Fitting function
#'   \code{\link[evmix:fpsden]{fpsden}} returns a simple list with the
#'   following elements
#'
#' \tabular{ll}{
#'  \code{call}:                \tab \code{optim} call\cr
#'  \code{x}:                   \tab data vector \code{x}\cr
#'  \code{xrange}:              \tab range of support of B-splines\cr
#'  \code{degree}:              \tab degree of B-splines\cr
#'  \code{nseg}:                \tab number of internal segments\cr
#'  \code{design.knots}:        \tab knots used in \code{\link[splines:splineDesign]{splineDesign}}\cr
#'  \code{ord}:                 \tab order of penalty term\cr
#'  \code{binned}:              \tab histogram results\cr
#'  \code{breaks}:              \tab histogram breaks\cr
#'  \code{mids}:                \tab histogram mid-bins\cr
#'  \code{counts}:              \tab histogram counts\cr
#'  \code{nbinwidth}:           \tab scaling factor to convert counts to density\cr
#'  \code{bsplines}:            \tab B-splines matrix used for binned counts\cr
#'  \code{databsplines}:        \tab B-splines matrix used for data\cr
#'  \code{counts}:              \tab histogram counts\cr
#'  \code{lambdaseq}:           \tab \eqn{\lambda} vector for profile likelihood or scalar for fixed \eqn{\lambda}\cr
#'  \code{cvlambda}:            \tab CV MSE for each \eqn{\lambda}\cr
#'  \code{mle} and \code{beta}: \tab vector of MLE of coefficients\cr
#'  \code{nllh}:                \tab negative log-likelihood for original data\cr
#'  \code{n}:                   \tab total original sample size\cr
#'  \code{lambda}:              \tab Estimated or fixed \eqn{\lambda}\cr
#' }
#' 
#' @note The data are both vectors. Infinite and missing sample values are dropped.
#' 
#' No initial values for the coefficients are needed.
#' 
#' It is advised to specify the range of support \code{xrange}, using finite end-points. This is 
#' especially important when the support is bounded. By default \code{xrange} is simply the range of the
#' input data \code{range(x)}.
#' 
#' Further, it is advised to always set the histogram bin \code{breaks}, expecially if the support is bounded.
#' By default \code{10*ln(n)} equi-spaced bins are defined between \code{xrange}.
#' 
#' @references
#' \url{http://www.math.canterbury.ac.nz/~c.scarrott/evmix}
#' 
#' \url{http://en.wikipedia.org/wiki/Cross-validation_(statistics)}
#' 
#' \url{http://en.wikipedia.org/wiki/B-spline}
#' 
#' \url{http://statweb.lsu.edu/faculty/marx/}
#' 
#' Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties.
#' Statistical Science 11(2), 89-121.
#' 
#' @author Alfadino Akbar and Carl Scarrott \email{carl.scarrott@@canterbury.ac.nz}
#'
#' @section Acknowledgments: The Poisson regression and leave-one-out cross-validation functions
#' are based on the code of Eilers and Marx (1996) available from Brian Marx's website 
#' \url{http://statweb.lsu.edu/faculty/marx/}, which is gratefully acknowledged.
#' 
#' @seealso \code{\link[evmix:kden]{kden}}.
#'  
#' @aliases fpsden lpsden nlpsden iwlspsden cvpsden
#' @family  psden
#' @family  fpsden
#' 
#' @examples
#' \dontrun{
#' set.seed(1)
#' par(mfrow = c(1, 1))
#' 
#' x = rnorm(1000)
#' xx = seq(-4, 4, 0.01)
#' y = dnorm(xx)
#' 
#' # Plenty of histogram bins (100)
#' breaks = seq(-4, 4, length.out=101)
#' 
#' # P-spline fitting with cubic B-splines, 2nd order penalty and 10 internal segments
#' # CV search for penalty coefficient. 
#' fit = fpsden(x, lambdaseq = 10^seq(-5, 5, 0.25), breaks = breaks,
#'              xrange = c(-4, 4), nseg = 10, degree = 3, ord = 2)
#' psdensity = exp(fit$bsplines %*% fit$mle)
#' 
#' hist(x, freq = FALSE, breaks = seq(-4, 4, length.out=101), xlim = c(-6, 6))
#' lines(xx, y, col = "black") # true density
#' 
#' lines(fit$mids, psdensity/fit$nbinwidth, lwd = 2, col = "blue") # P-splines density
#' 
#' # check density against dpsden function
#' with(fit, lines(xx, dpsden(xx, beta, nbinwidth, design = design.knots),
#'                 lwd = 2, col = "red", lty = 2))
#'
#' # vertical lines for all knots
#' with(fit, abline(v = design.knots, col = "red"))
#'
#' # internal knots
#' with(fit, abline(v = design.knots[(degree + 2):(length(design.knots) - degree - 1)], col = "blue"))
#'   
#' # boundary knots (support of B-splines)
#' with(fit, abline(v = design.knots[c(degree + 1, length(design.knots) - degree)], col = "green"))
#'
#' legend("topright", c("True Density","P-spline density","Using dpsdens function"),
#'   col=c("black", "blue", "red"), lty = c(1, 1, 2))
#' legend("topleft", c("Internal Knots", "Boundaries", "Extra Knots"),
#'   col=c("blue", "green", "red"), lty = 1)
#' }
#'   

# maximum likelihood fitting for P-splines density estimate
fpsden <- function(x, lambdaseq = NULL, breaks = NULL, xrange = NULL,
                   nseg = 10, degree = 3, design.knots = NULL, ord = 2) {

  call <- match.call()
    
  # Check properties of inputs
  check.quant(x, allowna = TRUE, allowinf = TRUE)
  check.param(lambdaseq, allowvec = TRUE, allownull = TRUE)
  
  if (any(!is.finite(x))) {
    warning("non-finite cases have been removed")
    x = x[is.finite(x)] # ignore missing and infinite cases
  }

  check.quant(x)
  n = length(x)

  # Check B-spline specification
  check.param(xrange, allowvec = TRUE, allownull = TRUE)
  check.n(nseg) # optional, if knots provided
  check.n(degree, allowzero = TRUE)
  check.param(design.knots, allowvec = TRUE, allownull = TRUE) # optional
  check.n(ord)

  # Two options for specifying knots:
  #    1) design.knots vector
  #    2) xrange, nseg and degree
  # if both provided then design.knots is used
  if (is.null(design.knots) & is.null(xrange)) {
    xrange = range(x)
    
    # wee buffer on support for machine precision issues in splineDesign range checks
    # But also checks for most common lower and upper bounds
    if (isTRUE(all.equal(xrange[1], 0)) & (xrange[1] > 0)) {
      xrange[1] = 0
    } else if (isTRUE(all.equal(xrange[1], -1)) & (xrange[1] > -1)) {
      xrange[1] = -1
    } else {
      xrange[1] = xrange[1] - 8*.Machine$double.eps    
    }
    if (isTRUE(all.equal(xrange[2], 1)) & (xrange[2] < 1)) {
      xrange[2] = 1
    } else if (isTRUE(all.equal(xrange[2], 100)) & (xrange[2] < 100)) {
      xrange[2] = 100
    } else {
      xrange[2] = xrange[2] + 8*.Machine$double.eps
    }
  }

  if (is.null(design.knots)) {
    # check x-range
    if (length(xrange) != 2) stop("knot range in xrange must be vector of length 2")
    if (diff(xrange) <= 0) stop("knot range in xrange must have positive width")
    
    # consistent with Eilers and Marx the "P-spline masters":
    # defaults to regular knots and not natural B-splines,
    # so each B-spline is just shifted/spliced version of each other
    dx = diff(xrange)/nseg # regular knot spacing
    design.knots = seq(xrange[1] - degree * dx, xrange[2] + degree * dx, by = dx)
    
  } else {
    # if knots specified, they must be sorted
    if (is.unsorted(design.knots)) {
      design.knots = sort(design.knots)
    } else {
      if (design.knots[1] > design.knots[length(design.knots)])
        design.knots = rev(design.knots)
    }
    
    # cannot check degree as beta coefficients not provided (as checked in psden functions)
    
    # number of segments and degree determine design.knots, and vice-versa, so stop if different
    # different behaviour to psden function, which can check degree in advance using beta
    if (nseg != (length(design.knots) - 1 - degree*2)) {
      stop(paste("Number of segments = ", nseg, " and degree = ", degree, 
                    " are inconsistent with design knots where length(design.knots) = ",
                    length(design.knots), " should be nseg + 1 + degree*2 = ",
                    ", so ignoring nseg and degree", sep=""))
    }

    # xrange also determined by design knots and degree
    if (!is.null(xrange)) {
      if (!isTRUE(all.equal(xrange, design.knots[c(degree + 1, length(design.knots) - degree)]))) {
        warning(paste("Interpolation range (", xrange[1], ", ", xrange[2], ") is inconsistent with design knots (",
                      design.knots[degree + 1], ", " , design.knots[length(design.knots) - degree],
                      "), so xrange reset", sep=""))
        xrange = design.knots[c(degree + 1, length(design.knots) - degree)]
      }
    } else {
      xrange = design.knots[c(degree + 1, length(design.knots) - degree)]
    }
  }
  
  # Calculate histogram counts
  if (is.null(breaks)) {
    breaks = ceiling(10 * log(n)) # Eilers and Marx default setting for # breaks
  } else {
    # if breaks specified, they must be sorted
    if (is.unsorted(breaks)) {
      breaks = sort(breaks)
    } else {
      if (breaks[1] > breaks[length(breaks)])
        breaks = rev(breaks)
    }
  }
  
  # If breaks are not prescribed, then set sequence from min to max of data
  # Different to Eilers and Marx (1996) who go slightly beyond the data
  # Our approach makes more sense when data is bounded (density cannot go beyond min and max of data)
  # Can be overridden by specifying xrange
  if (length(breaks) == 1) {
    breaks = seq(xrange[1], xrange[2], length.out = breaks)
  }

  # Slightly modified from Eilers and Marx (1996) code:
  #   1) x-locations are mid-points of bins rather than lower boundary
  #   2) if # of breaks is provided (rather than actual breaks) then 
  #      actual breaks are specified by default behaviour of hist function
    
  # breaks must cover all data (hist function will check this as well)
  if (any((x < breaks[1]) | (x > breaks[length(breaks)])))
    stop(paste("breaks range (", breaks[1], ", ", breaks[length(breaks)], 
               ") must cover range of input data", sep=""))

  binned = hist(x, breaks, plot = FALSE)
  mids = binned$mids
  counts = binned$counts
  binwidth = binned$breaks[2] - binned$breaks[1]
  nbinwidth = n * binwidth
  
  # B-spline basis at the bin mid-points
  bsplines = as.matrix.csr(splineDesign(design.knots, mids, ord = degree + 1))
  
  # Check if profile likelihood or fixed lambda is being used
  # and determine initial values for parameters in each case
  if (is.null(lambdaseq)) { # not profile or fixed
      lambda = 10 # same as Eilers and Marx (1996)
      cvlambda = cvpsden(lambda, counts = counts, bsplines = bsplines, ord = ord)
  } else { # profile or fixed
    
    # CVMSE for lambda or scalar given
    if (length(lambdaseq) != 1) {
      
      cvlambda = sapply(lambdaseq, cvpsden, counts = counts, bsplines = bsplines, ord = ord)
      
      if (all(!is.finite(cvlambda))) stop("lambdas are all invalid")
      lambda = lambdaseq[which.min(cvlambda)]

    } else {
      lambda = lambdaseq
      cvlambda = cvpsden(lambda, counts = counts, bsplines = bsplines, ord = ord)
    }
  }

  # IWLS for estimating B-spline coefficients, given penalty coefficient
  beta = iwlspsden(counts = counts, bsplines = bsplines, ord = ord, lambda = lambda)

  # B-spline basis at the actual data needed to evaluate log-likelihood
  databsplines = as.matrix.csr(splineDesign(design.knots, x, ord = degree + 1))

  nllh = nlpsden(beta, x, databsplines, nbinwidth)

  list(call = call, x = as.vector(x), xrange = xrange, degree = degree, nseg = nseg, 
       design.knots = design.knots, ord = ord, binned = binned,
       breaks = breaks, mids = mids, counts = counts, nbinwidth = nbinwidth, bsplines = bsplines,
       databsplines = databsplines, lambdaseq = lambdaseq, cvlambda = cvlambda,
       mle = beta, nllh = nllh, n = n, beta = beta, lambda = lambda)
}

#' @export
#' @aliases fpsden lpsden nlpsden iwlspsden cvpsden
#' @rdname  fpsden

# log-likelihood function for P-splines density estimate
lpsden <- function(x, beta = NULL, bsplines = NULL, nbinwidth = 1, log = TRUE) {

  # Check properties of inputs
  check.quant(x, allowna = TRUE, allowinf = TRUE)
  check.param(beta, allowvec = TRUE)
  check.posparam(nbinwidth, allowvec = TRUE)
  check.logic(log)
  
  if (!(is.matrix.csr(bsplines) | (is.matrix(bsplines))))
    stop("bsplines input must be matrix, with each column a B-spline")

  if (any(!is.finite(x))) {
    warning("non-finite cases have been removed")
    x = x[is.finite(x)] # ignore missing and infinite cases
  }

  check.quant(x)
  n = length(x)
  np = length(beta)
  
  if (dim(bsplines)[1] != n)
    stop("bsplines input must be matrix, with row for each datapoint in x")

  if (dim(bsplines)[2] != np)
    stop("bsplines input must be matrix, with column for each B-spline")

  pred = bsplines %*% beta
  l = sum(pred) - n * log(nbinwidth)
  
  if (!log) l = exp(l)  
  
  l
}

#' @export
#' @aliases fpsden lpsden nlpsden iwlspsden cvpsden
#' @rdname  fpsden
  
# negative log-likelihood function for P-splines density estimate
# (wrapper for likelihood, inputs and checks designed for optimisation)
nlpsden <- function(pvector, x, bsplines = NULL, nbinwidth = 1, finitelik = FALSE) {

  # Check properties of inputs
  check.quant(x, allowna = TRUE, allowinf = TRUE)
  check.param(pvector, allowvec = TRUE)
  check.posparam(nbinwidth)
  check.logic(finitelik)

  if (!(is.matrix.csr(bsplines) | (is.matrix(bsplines))))
    stop("bsplines input must be matrix, with each column a B-spline")

  n = length(x)
  np = length(pvector)
    
  if (dim(bsplines)[1] != n)
    stop("bsplines input must be matrix, with row for each datapoint in x")

  if (dim(bsplines)[2] != np)
    stop("bsplines input must be matrix, with column for each B-spline")
  
  beta = pvector
  
  nllh = -lpsden(x, beta, bsplines, nbinwidth) 
  
  if (finitelik & is.infinite(nllh)) {
    nllh = sign(nllh) * 1e6
  }

  nllh
}

#' @export
#' @aliases fpsden lpsden nlpsden iwlspsden cvpsden
#' @rdname  fpsden

# leave one cross-validation RMSE for P-splines density estimate, with given B-splines
cvpsden = function(lambda = 1, counts, bsplines, ord = 2) {

  # Adapted from poisson.cv function of Paul Eilers and Brian Marx (2007)
  # Not designed to be called by users, so no input checking (to make it more efficient)
  
  nc = length(counts)
  np = dim(bsplines)[2]
  
  cv.pred <- function(i, counts, bsplines, ord, lambda) {
    exp(bsplines[i, ] %*% iwlspsden(counts[-i], bsplines[-i, ], ord, lambda))
  }

  if (lambda <= 0) {
    cv = Inf
  } else {
    pred = sapply(1:nc, cv.pred, counts = counts, bsplines = bsplines, ord = ord, lambda = lambda) 
    cv = mean((counts - pred)^2) # CVMSE
  }
  
  return(cv)
}

#' @export
#' @aliases fpsden lpsden nlpsden iwlspsden cvpsden
#' @rdname  fpsden

# iterative weighted least squares fitting for P-splines density estimate, with given B-splines
iwlspsden = function(counts, bsplines, ord = 2, lambda = 10) {

  # Adapted from ps.poisson function of Paul Eilers and Brian Marx (2007)
  # Not designed to be called by users, so no input checking (to make it more efficient)

  nc = length(counts)
  np = dim(bsplines)[2]

  # Penalty matrix from differences
  P = as.matrix.csr(sqrt(lambda) * diff(diag(np), diff = ord))
  X = rbind.matrix.csr(bsplines, P)

  npen = np - ord # same as dim(P)[1] (not checked)
  
  # Initialize
  newpred = log(counts + 0.0001)
  pred = rep(1, nc)

  # Iterations
  niter = 1
  diffpred = sum(abs(pred - newpred))
  while ((niter < 20) & (diffpred > 1e-5)) {
    mu = exp(pred)
    
    # determine weights and score
    w = ifelse(mu < 1e-16, 0, as.vector(mu)) # practical definition of zero predicted density

    newy = ifelse(w == 0, 0, as.vector((counts - mu) / w + pred))
    
    Y = c(newy, rep(0, npen))

    # Mixed model representation
    beta = slm.wfit(X, Y, weights = c(w, rep(1, npen)))$coef

    newpred = bsplines %*% beta # on log-link scale

    niter = niter + 1
    diffpred = sum(abs(pred - newpred))

    pred = newpred
  }
  
  if (diffpred > 1e-5) warning("IRLS fitting did not converge in 20 iterations, check all inputs")
  
  return(beta)
}

Try the evmix package in your browser

Any scripts or data that you put into this service are public.

evmix documentation built on Sept. 3, 2019, 5:07 p.m.