R/dffit.R


#' Fit a generative distribution function, such as a galaxy mass function
#'
#' This function fits a galaxy mass function (MF) to a discrete set of \code{N} galaxies with noisy data. More generally, \code{dffit} finds the most likely \code{P}-dimensional distribution function (DF) generating \code{N} objects \code{i=1,...,N} with uncertain measurements of \code{P} observables. For instance, if the objects are galaxies, it can fit a MF (\code{P=1}), a mass-size distribution (\code{P=2}) or the mass-spin-morphology distribution (\code{P=3}). A full description of the algorithm can be found in Obreschkow et al. (2017).
#'
#' @importFrom akima interpp
#'
#' @param x Normally \code{x} is a \code{N}-element vector, representing the log-masses (log10(M/Msun)) of \code{N} galaxies. More generally, \code{x} can be either a vector of \code{N} elements or an \code{N-by-P} matrix, containing the values of one or \code{P} observables of \code{N} objects, respectively.
#' @param selection Specifies the effective volume \code{Veff(xval)} in which a galaxy of log-mass \code{xval} can be observed; or, more generally, the volume in which an object of observed values \code{xval[1:P]} can be observed. This volume can be specified in five ways: (1) If \code{selection} is a single positive number, it is interpreted as a constant volume, \code{Veff(xval)=selection}, in which all galaxies are fully observable. \code{Veff(xval)=0} is assumed outside the "observed domain". This domain is defined as \code{min(x)<=xval<=max(x)} for one observable (\code{P=1}), or as \code{min(x[,j])<=xval[j]<=max(x[,j])} for all \code{j=1,...,P} if \code{P>1}. This mode can be used for volume-complete surveys or for simulated galaxies in a box. (2) If \code{selection} is a vector of \code{N} elements, its elements are interpreted as the effective volumes of the \code{N} galaxies. \code{Veff(xval)} is interpolated (linearly in \code{1/Veff}) for other values \code{xval}. \code{Veff(xval)=0} is assumed outside the observed domain. (3) \code{selection} can be a function of \code{P} variables, which directly specifies the effective volume for any \code{xval}, i.e. \code{Veff(xval)=selection(xval)}. (4) \code{selection} can also be a list (\code{selection = list(Veff.values, Veff.fct)}) of an \code{N}-element vector \code{Veff.values} and a \code{P}-dimensional function \code{Veff.fct}. In this case, the effective volume is computed using a hybrid scheme of modes (2) and (3): \code{Veff(xval)} is interpolated from the \code{N} values of \code{Veff.values} inside the observed domain, but set equal to \code{Veff.fct} outside this domain. (5) Finally, \code{selection} can be a list of two functions and an optional two-element vector: \code{selection = list(f, dVdr)} or \code{selection = list(f, dVdr, c(rmin, rmax))}, where \code{f = function(xval,r)} is the isotropic selection function, \code{dVdr = function(r)} is the derivative of the total survey volume with respect to comoving distance \code{r}, and \code{rmin} and \code{rmax} are optional distance limits of the survey (defaulting to \code{0} and \code{Inf}). A hypothetical mode-5 selection is sketched at the end of the examples.
#' @param x.err Optional vector or array specifying the observational errors of \code{x}. If \code{x} is a vector then \code{x.err} must also be a vector of same length. Its elements are interpreted as the standard deviations of Gaussian uncertainties in \code{x}. If \code{x} is a \code{N-by-P} matrix representing \code{N} objects with \code{P} observables, then \code{x.err} must be either a \code{N-by-P} matrix or a \code{N-by-P-by-P} array. In the first case, the elements \code{x.err[i,]} are interpreted as the standard deviations of Gaussian uncertainties on \code{x[i,]}. In the second case, the \code{P-by-P} matrices \code{x.err[i,,]} are interpreted as the covariance matrices of the \code{P} observed values \code{x[i,]}.
#' @param distance Optional vector of \code{N} elements specifying the comoving distances of the \code{N} galaxies. This vector is only needed if \code{correct.lss.bias = TRUE}.
#' @param phi Either a string or a function specifying the DF to be fitted. A string is interpreted as the name of a predefined mass function. Available options are \code{'Schechter'} for a Schechter function (3 parameters), \code{'PL'} for a power law (2 parameters), and \code{'MRP'} for an MRP function (4 parameters). Alternatively, \code{phi = function(xval,p)} can be any function of the \code{P} observable(s) \code{xval} and a list of parameters \code{p}. IMPORTANT: The function \code{phi(xval,p)} must be fully vectorized in \code{xval}, i.e. it must output a vector of \code{N} elements if \code{xval} is an \code{N-by-P} array (such as \code{x}). Note that if \code{phi} is given as a function, the argument \code{p.initial} is mandatory (see the sketch at the end of the examples).
#' @param p.initial Initial model parameters for fitting the DF.
#' @param n.iterations Maximum number of iterations in the repeated fit-and-debias algorithm to evaluate the maximum likelihood.
#' @param n.resampling Integer (>=10) specifying the number of resampling iterations of the most likely DF, used to evaluate realistic parameter uncertainties via quantiles. If \code{n.resampling = NULL}, no resampling is performed.
#' @param correct.mle.bias The maximum likelihood estimator (MLE) of a finite dataset can be biased, which is a general property of the ML approach. If \code{TRUE}, \code{dffit} also outputs parameters in which this estimator bias has been corrected, to first order in \code{1/N}, using jackknifing.
#' @param correct.lss.bias If \code{TRUE} the \code{distance} values are used to correct for the observational bias due to galaxy clustering (large-scale structure).
#' @param lss.sigma Cosmic variance of survey volume. Explicitly, \code{lss.sigma} is interpreted as the expected relative RMS on the total number of galaxies in the survey volume. This value must be determined from a cosmological model.
#' @param write.fit If \code{TRUE}, the best-fitting parameters are displayed in the console.
#' @param x.grid Sets the grid on which the numerical integration is performed. This grid must be specified as \code{x.grid = list(x1, ..., xp)}, where \code{xi} is an equally spaced vector with the grid values for the i-th observable. In the case of a MF (a 1-dimensional DF), \code{x.grid = list(seq(xmin,xmax,by=dx))}, where \code{xmin} and \code{xmax} are the minimum/maximum values of the log-mass considered in the numerical integration and \code{dx} is the grid spacing.
#'
#' @return Returns a structured list. The sublist \code{input} contains all relevant input arguments. The sublist \code{fit} contains all the output arguments of the MLE algorithm. The output can be visualized using \code{\link{mfplot}}, \code{\link{plot.df}} and \code{\link{dfwrite}}.
#'
#' @keywords schechter function
#' @keywords mass function
#' @keywords fit
#'
#' @examples
#' # basic example
#' data = mfdata()
#' df = dffit(data$x, data$selection)
#' mfplot(df, xlim=c(2e6,5e10))
#'
#' # show fitted parameter PDFs and covariances
#' dfplotcov(df)
#'
#' # show fitted effective volume function
#' dfplotveff(df)
#'
#' # include measurement errors in fit and determine Schechter function uncertainties from resampling
#' df = dffit(data$x, data$selection, data$x.err, n.resampling = 1e2)
#' mfplot(df, uncertainty.type=3, nbins=10, bin.xmin=6.5, bin.xmax=9.5, xlim=c(2e6,5e10), ylim=c(2e-3,1.5))
#'
#' # evaluate bias corrected maximum-likelihood estimator and add to plot
#' df = dffit(data$x, data$selection, data$x.err, write.fit=FALSE, correct.mle.bias=TRUE)
#' mfplot(df, show.uncertainties=FALSE, nbins=0, show.bias.correction=TRUE, col.bias.correction="blue", add=TRUE)
#'
#' # evaluate posteriors of the mass measurements and visualize the change in the mass mode between observation and posterior
#' # i.e. the Eddington bias correction
#' df = posterior.data(df)
#' plot(data$x,10^df$posterior$x.mode.correction,log='x',pch=20,xlab='Mass',ylab='Eddington bias correction = posterior/observed mass mode')
#' lines(10^seq(5,10,by=0.01),df$fit$functions$source.count(seq(5,10,by=0.01))*1e-2+1)
#' abline(h=1,v=5.4e7)
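#'
#' # sketch of a distance-dependent selection (mode 5 of the selection argument);
#' # the completeness function f and volume derivative dVdr below are hypothetical
#' # placeholders for a real survey (f must be vectorized in r and free of if-statements)
#' # f = function(xval,r) {1/(1+exp(r-10^(xval-6)))}
#' # dVdr = function(r) {4*pi*r^2}
#' # df = dffit(data$x, selection = list(f, dVdr, c(0,500)))
#'
#' # sketch of a user-defined DF; phi must be vectorized in xval and p.initial is then
#' # mandatory (this two-parameter power law is purely illustrative)
#' # phi.custom = function(xval,p) {10^(p[1]+p[2]*(xval-9))}
#' # df = dffit(data$x, data$selection, phi = phi.custom, p.initial = c(-2,-1))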
#'
#' @author Danail Obreschkow
#'
#' @export

dffit <- function(x, # normally log-mass, but can be multi-dimensional
                  selection = NULL,
                  x.err = NULL, # Gaussian standard deviations of x
                  distance = NULL,
                  phi = 'Schechter',
                  p.initial = NULL,
                  n.iterations = 100,
                  n.resampling = NULL,
                  correct.mle.bias = FALSE,
                  correct.lss.bias = FALSE,
                  lss.sigma = NULL,
                  write.fit = TRUE,
                  x.grid = list(seq(4,12,0.01))) {

  tStart.global = Sys.time() # timer
  wt.fitting <<- 0 # global accumulator for the walltime spent fitting (updated in .corefit)
  
  # Input handling ########################################################################

  # Handle x
  if (length(x)<1) stop('Give at least one data point.')
  if (is.null(dim(x))) x = cbind(as.vector(x)) # make col-vector
  if (length(dim(x))!=2) stop('x cannot have more than two dimensions.')
  n.data = dim(x)[1]
  n.dim = dim(x)[2]

  # Handle x.err
  if (!is.null(x.err)) {
    if (is.null(dim(x.err))) x.err = cbind(as.vector(x.err)) # make col-vector
    if (length(dim(x.err))==2) {
      if (any(dim(x)!=dim(x.err))) stop('Size of x.err not compatible with size of x.')
    } else if (length(dim(x.err))==3) {
      if (n.dim==1) stop('For a one-dimensional distribution function, x.err cannot have three dimensions.')
      if (!(dim(x.err)[1]==n.data & dim(x.err)[2]==n.dim & dim(x.err)[3]==n.dim)) {
        stop('Size of x.err not compatible with size of x.')
      }
    } else {
      stop('x.err cannot have more than three dimensions.')
    }
    if (length(dim(x.err))==2 && min(x.err)<=0) stop('All values of x.err must be positive.')
  }

  # Handle distance
  if (!is.null(distance)) {
    distance = as.vector(distance)
    if (length(distance)!=n.data) stop('The number of distance values must be equal to the number of data points (= number of rows of x).')
    if (min(distance)<=0) stop('All distance values must be positive.')
  }

  # Handle selection
  if (is.null(selection)) stop('A selection function must be given.')
  veff.list = .make.veff(selection,x)
  veff.function = veff.list$veff.function

  # Handle phi
  if (is.function(phi)) {
    phi.function = phi
    phi.equation = NA
    if (is.null(p.initial)) stop('For user-defined distribution functions initial parameters must be given.')
  } else {
    phi.function <- function(x,p) {dfmodel(x, p, type = phi)}
    phi.equation = dfmodel(output = 'equation', type = phi)
    if (is.null(p.initial)) p.initial = dfmodel(output = 'initial', type = phi)
  }

  # Handle n.iterations
  if (is.null(n.iterations)) stop('n.iterations must be a positive integer.')
  if (n.iterations<1) stop('n.iterations must be a positive integer.')

  # Handle n.resampling
  if (!is.null(n.resampling)) {
    if (n.resampling<10) stop('n.resampling must be 10 or larger.')
  }

  # Handle correct.lss.bias
  if (correct.lss.bias) {
    if (is.null(distance)) stop('Distances must be given if correct.lss.bias = TRUE.')
    if (is.null(lss.sigma)) stop('lss.sigma must be given if correct.lss.bias = TRUE.')
    if (lss.sigma<0) stop('lss.sigma cannot be negative.')
    veff.function.lss = 0 # TODO: LSS-corrected effective volume not yet implemented
  }

  # Handle correct.mle.bias
  if (correct.mle.bias) {
    if (n.data<2) stop('At least two data points must be given if correct.mle.bias = TRUE.')
  }

  # Handle x.grid
  if (!is.list(x.grid)) stop('x.grid must be a list of p vectors, where p is the number of columns of x.')
  if (length(x.grid)!=n.dim) stop('x.grid must be a list of p vectors, where p is the number of columns of x.')
  dx = nx = array(NA,n.dim)
  for (i in seq(n.dim)) {
    dx[i] = x.grid[[i]][2]-x.grid[[i]][1]
    if (dx[i]<=0) stop('x.grid must be made of vectors with monotonically increasing elements.')
    nx[i] = length(x.grid[[i]])
    if (nx[i]<2) stop('x.grid must be made of vectors with at least 2 elements, each.')
    if (nx[i]>2) {
      dtmp = x.grid[[i]][3:nx[i]]-x.grid[[i]][2:(nx[i]-1)]
      if (any(abs(dtmp-dx[i])>dx[i]*1e-10)) stop('x.grid must be made of vectors with equally spaced elements.')
    }
  }
  x.mesh.dv = prod(dx)
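  # build the evaluation mesh as the Cartesian product of the grid vectors: column i cycles
  # through x.grid[[i]] with period equal to the product of the preceding grid lengths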
  x.mesh = array(NA,c(prod(nx),n.dim))
  k1 = 1
  k2 = prod(nx)
  for (i in seq(n.dim)) {
    k2 = k2/nx[i]
    x.mesh[,i] = rep(rep(x.grid[[i]],each=k1),k2)
    k1 = k1*nx[i]
  }
  veff.mesh = veff.function(x.mesh)

  # Find most likely generative model ######################################################
  best.fit = .corefit(x, x.err,
                      veff.mesh,
                      phi.function, p.initial,
                      x.mesh, x.mesh.dv,
                      n.iterations = n.iterations)

  # Initialize output parameters ###########################################################
  phi.fit <- function(x) {phi.function(x,best.fit$p.optimal)}
  source.count <- function(x) {phi.fit(x)*veff.function(x)}
  input = list(data = list(x = x, x.err = x.err, distance = distance),
               selection = list(veff.function = veff.function,
                                veff.values = veff.list$veff.values,
                                veff.fct = veff.list$veff.fct,
                                veff.mesh = veff.mesh),
               distribution.function = list(phi = phi.function,
                                            phi.equation = phi.equation),
               options = list(p.initial = p.initial,
                              n.iterations = n.iterations, n.resampling = n.resampling,
                              correct.lss.bias = correct.lss.bias, lss.sigma = lss.sigma,
                              correct.mle.bias = correct.mle.bias, x.grid = x.grid,
                              x.mesh = x.mesh, x.mesh.dv = x.mesh.dv))
  fit = list(parameters = list(p.optimal = best.fit$p.optimal,
                               p.covariance = best.fit$covariance),
             functions = list(phi.fit = phi.fit,
                              source.count = source.count,
                              logL = best.fit$logL),
             evaluation = list(x = x.mesh,
                               y = apply(x.mesh,1,phi.fit)),
             status = list(n.iterations = best.fit$n.fit.and.debias,
                           converged = best.fit$converged,
                           chain = best.fit$chain))
  df = list(input = input, fit = fit)
  
  # Additional processing for converged solutions ###############################################
  if (df$fit$status$converged) {

    # Bias correction
    if (correct.mle.bias) {df = .correct.mle.bias(df)}

    # Determine uncertainties
    df = .add.Gaussian.errors(df)
    if (!is.null(n.resampling)) {df = .resample(df)}

  }

  # Finalize ###########################################################
  if (write.fit) dfwrite(df)
  df$fit$status$walltime.fitting = wt.fitting
  df$fit$status$walltime.total = as.double(Sys.time())-as.double(tStart.global)
  
  invisible(df)

}

#' @export
#'
.corefit <- function(x, x.err,
                     veff.mesh,
                     phi.function, p.initial,
                     x.mesh, x.mesh.dv,
                     n.iterations, debiasing.first.iteration = TRUE,
                     supress.warning = FALSE) {
  
  tStart.fitting = Sys.time()

  # Input handling
  n.data = dim(x)[1]
  n.dim = dim(x)[2]
  n.mesh = dim(x.mesh)[1]
  if (is.null(x.err)) n.iterations = 1

  if (!is.null(x.err)) {
    
    # Make inverse covariances
    invC = array(NA,c(n.data,n.dim,n.dim))
    if (length(dim(x.err))==2) {
      if (!(dim(x.err)[1]==n.data & dim(x.err)[2]==n.dim)) {
        stop('Unknown format for x.err in .corefit.')
      }
      if (n.dim==1) {
        for (i in seq(n.data)) {
          invC[i,,] = 1/x.err[i,]^2
        }
      } else {
        for (i in seq(n.data)) {
          invC[i,,] = diag(1/x.err[i,]^2)
        }
      }
    } else if (length(dim(x.err))==3) {
      if (n.dim==1) stop('Unknown format for x.err in .corefit.')
      if (!(dim(x.err)[1]==n.data & dim(x.err)[2]==n.dim & dim(x.err)[3]==n.dim)) {
        stop('Unknown format for x.err in .corefit.')
      }
      for (i in seq(n.data)) {
        invC[i,,] = solve(x.err[i,,])
      }
    } else {
      stop('Unknown format for x.err in .corefit.')
    }
    
    # make a priori observed PDFs
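    # (i.e. the unnormalized Gaussian likelihood exp(-d'*invC*d/2) of each observation,
    # evaluated at all mesh points)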
    rho.observed = list()
    for (i in seq(n.data)) {
      d = x[i,]-t(x.mesh)
      rho.observed[[i]] = exp(-colSums(d*(invC[i,,]%*%d))/2)
    }
  }

  # Iterative algorithm
  running = TRUE
  k = 0
  value.old = Inf
  chain = array(NA,c(n.iterations,length(p.initial)+1))
  
  while (running) {

    k = k+1

    # make unbiased source density function
    rho.unbiased = array(0,n.mesh)
    if (is.null(x.err)) {
      for (i in seq(n.data)) {
        difference = colSums(abs(t(x.mesh)-x[i,]))
        index = which.min(difference)[1]
        rho.unbiased[index] = rho.unbiased[index]+1/x.mesh.dv
      }
    } else {
      if (k==1 & !debiasing.first.iteration) {
        prior = array(1,n.mesh)
      } else {
        # predicted source counts (up to a factor x.mesh.dv)
        prior = phi.function(c(x.mesh),p.initial)*veff.mesh
        prior[!is.finite(prior)] = 0
        prior = pmax(0,prior)
      }
      for (i in seq(n.data)) {
        rho.corrected = rho.observed[[i]]*prior
        rho.corrected = rho.corrected/(sum(rho.corrected)*x.mesh.dv)
        rho.unbiased = rho.unbiased+rho.corrected
      }
    }

    # make -ln(L)
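    # Poisson point-process form: -ln(L) = int[phi(x)*Veff(x) - rho.unbiased(x)*ln(phi(x))]dx,
    # up to terms independent of the parameters p (cf. Obreschkow et al. 2017)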
    neglogL = function(p) {
      phi = phi.function(x.mesh,p)
      # safety operations (adding about 50% computation time)
      phi[!is.finite(phi)] = 0
      phi = pmax(.Machine$double.xmin,phi) # also vectorizes the array
      # end safety operations
      return(sum(phi*veff.mesh-log(phi)*rho.unbiased)*x.mesh.dv)
    }

    # maximize ln(L)
    tStart.fitting = Sys.time()
    opt = optim(p.initial,neglogL,hessian=TRUE,control=list(parscale=rep(1,length(p.initial)),reltol=1e-10,abstol=1e-10,maxit=1e3))
    chain[k,] = c(opt$par,opt$value)
    wt.fitting <<- wt.fitting + as.double(Sys.time())-as.double(tStart.fitting)
    
    # assess convergence
    if (is.null(x.err)) { # exit without extra iterations
      converged = opt$convergence==0
      running = FALSE
    } else {
      if (k==1) {
        d = 1e98
        d.old = 1e99
      } else {
        d = abs(opt$value-value.old)/n.data
      }
      converged = d>=d.old
      value.old = opt$value
      d.old = d

      if (converged) {
        running = FALSE
      } else if (k == n.iterations) {
        converged = FALSE
        running = FALSE
        if (!supress.warning) {
          cat('WARNING: Maximum number of iterations reached. Consider increasing n.iterations and/or providing better initial parameters.\n')
        }
      }

      # prepare initial values for next iteration
      p.initial = opt$par
    }
  }

  # make output
  if (det(opt$hessian)<1e-12) {
    converged = FALSE
    cov = NULL
    if (!supress.warning) {
      cat('WARNING: Fit ill-conditioned. Consider providing better initial parameters or selection arguments.\n')
    }
  } else {
    cov = solve(opt$hessian)
  }
  
  return(list(p.optimal = opt$par, covariance = cov, n.fit.and.debias = k,
              converged = converged,
              chain = chain[1:k,],
              logL = function(p) -neglogL(p)))

}

#' @export
#'
.make.veff = function(selection,x) {

  # Generates the function veff.function(xval) from various selection function types.
  # The argument xval must be an M-by-P array; veff.function(xval) returns a vector of
  # M elements.

  if (length(dim(x))==0) stop('x must be an N-by-P array.')
  n.data = dim(x)[1]
  n.dim = dim(x)[2]
  xmin = apply(x,2,min)
  xmax = apply(x,2,max)
  mode = NULL
  veff.values = NULL
  veff.fct = NULL

  # Mode 1: Constant effective volume inside observed domain
  if (is.double(selection)) {
    if (length(selection)==1) {
      if (selection<=0) stop('A constant selection (= Veff) must be positive.')
      mode = 1
      veff.function.elemental = function(xval) {
        if (any(xval<xmin) | any(xval>xmax)) {
          return(0)
        } else {
          return(selection)
        }
      }
    }
  }

  # Mode 2: Interpolated effective volume inside observed domain
  if (is.double(selection)) {
    if (length(selection)==n.data) {
      if (min(selection)<=0) stop('All values of selection (=Veff) must be positive.')
      mode = 2
      veff.values = selection
      if (n.dim==1) {
        f = approxfun(x[,1],1/veff.values,rule=2) # build interpolator once, linear in 1/Veff
        vapprox = function(xval) {return(1/f(xval))}
      } else if (n.dim==2) {
        vapprox = function(xval) {
          z = 1/(akima::interpp(x[,1],x[,2],1/veff.values,xval[1],xval[2],duplicate='mean'))$z
          if (is.na(z)) {return(0)} else {return(z)}
        }
      } else {
        stop('Linear interpolation of Veff not implemented for DF with more than 2 dimensions. Use a different selection type.')
      }
      veff.function.elemental = function(xval) {
        if (any(xval<xmin) | any(xval>xmax)) {
          return(0)
        } else {
          return(vapprox(xval))
        }
      }
    }
  }

  # Mode 3: Effective volume given directly
  if (is.function(selection)) {
    mode = 3
    veff.function.elemental = selection
    veff.fct = selection
  }

  # Mode 4: Hybrid of 2 and 3
  if (is.list(selection)) {
    if (length(selection)==2) {
      if (is.double(selection[[1]]) & is.function(selection[[2]])) {
        veff.values = selection[[1]]
        veff.fct = selection[[2]]
        if (min(veff.values)<=0) stop('All values of selection (=Veff) must be positive.')
        mode = 4
        if (n.dim==1) {
          f = approxfun(x[,1],1/veff.values,rule=2) # build interpolator once, linear in 1/Veff
          vapprox = function(xval) {return(1/f(xval))}
        } else if (n.dim==2) {
          vapprox = function(xval) {
            z = 1/(akima::interpp(x[,1],x[,2],1/veff.values,xval[1],xval[2],duplicate='mean'))$z
            if (is.na(z)) {return(veff.fct(xval))} else {return(z)}
          }
        } else {
          stop('Linear interpolation of Veff not implemented for DF with more than 2 dimensions. Use a different selection type.')
        }
        veff.function.elemental = function(xval) {
          if (any(xval<xmin) | any(xval>xmax)) {
            return(veff.fct(xval))
          } else {
            return(vapprox(xval))
          }
        }
      }
    }
  }

  # Mode 5: selection given as {f(x,r), dVdr(r)}
  if (is.list(selection)) {
    if (length(selection)>=2 & length(selection)<=3) {
      if (is.function(selection[[1]]) & is.function(selection[[2]])) {
        if (length(selection)==3) {
          if (is.double(selection[[3]])) {
            if (length(selection[[3]])==2) {
              mode = 5.1
              rmin = selection[[3]][1]
              rmax = selection[[3]][2]
            }
          }
        } else {
          mode = 5
          rmin = 0
          rmax = Inf
        }
        test = try(selection[[1]](NA,NA)*selection[[2]](NA),silent=TRUE)
        if (!is.double(test)) stop('In the argument selection = list(f, dVdr, ...), the functions f(xval,r) and dVdr(r) must not contain if-statements and work for r being a vector.')
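        # effective volume as a distance integral: Veff(xval) = int f(xval,r)*dVdr(r) dr over [rmin,rmax]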
        veff.function.elemental = function(xval) {
          f = function(r) {selection[[1]](xval,r)*selection[[2]](r)}
          return(integrate(f,rmin,rmax)$value)
        }
      }
    }
  }

  if (is.null(mode)) stop('Unknown selection.')

  # apply to all elements
  veff.function = function(xval) {
    if (length(dim(xval))==2) {
      return(apply(xval,1,veff.function.elemental))
    } else if (length(dim(xval))==0) {
      if (n.dim==1) {
        return(sapply(xval,veff.function.elemental))
      } else {
        if (length(xval)!=n.dim) stop('xval must be a vector of P elements or an M-by-P array.')
        return(veff.function.elemental(xval))
      }
    } else {
      stop('xval must be a vector of P elements or an M-by-P array.')
    }
  }

  return(list(veff.function = veff.function,
              veff.values = veff.values,
              veff.fct = veff.fct))
}

.add.Gaussian.errors <- function(df) {

  # make Gaussian uncertainties of parameters
  cov = df$fit$parameters$p.covariance
  df$fit$parameters$p.sigma = sqrt(diag(cov))

  # make Gaussian uncertainties of DF
  eig = eigen(cov)
  np = length(df$fit$parameters$p.optimal)
  n.mesh = dim(df$fit$evaluation$x)[1]
  nsteps = 6 # a larger number of steps leads to a more accurate sampling of the covariance ellipsoid
  y.new = array(NA,c(n.mesh,nsteps^np))
  k = seq(-1,1,length=nsteps)
  step = array(1,np)
  step[1] = 0
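  # step walks through a regular lattice of direction coefficients k[step]; each valid
  # parameter vector below lies on the 1-sigma ellipsoid of the parameter covariance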

  for (i in seq(nsteps^np)) {

    # next step
    overflow = TRUE
    j = 1
    while (overflow) {
      step[j] = step[j]+1
      if (step[j]>nsteps) {
        step[j] = 1
        j = j+1
      } else {
        overflow = FALSE
      }
    }

    # evaluate DF
    if (any(k[step]!=0)) {
      e = k[step]/sqrt(sum(k[step]^2))
      v = array(0,np)
      for (j in seq(np)) {
        v = v+e[j]*sqrt(eig$values[j])*eig$vectors[,j]
      }
      p.new = df$fit$parameters$p.optimal+v
      y.new[,i] = df$input$distribution.function$phi(df$fit$evaluation$x,p.new)
      if (any(!is.finite(y.new[,i]))) {
        y.new[,i] = NA
      } else if (any(y.new[,i]<0)) {
        y.new[,i] = NA
      }

    } else {
      y.new[,i] = NA
    }
  }

  df$fit$evaluation$y.error.neg = array(NA,n.mesh)
  df$fit$evaluation$y.error.pos = array(NA,n.mesh)
  for (i in seq(n.mesh)) {
    df$fit$evaluation$y.error.neg[i] = df$fit$evaluation$y[i]-min(c(y.new[i,],Inf),na.rm=TRUE)
    df$fit$evaluation$y.error.pos[i] = max(c(y.new[i,],-Inf),na.rm=TRUE)-df$fit$evaluation$y[i]
  }

  return(df)
}

.correct.mle.bias <- function(df) {
  n = dim(df$input$data$x)[1]
  np = length(df$fit$parameters$p.optimal)
  if (n<2) {
    stop('Bias correction requires at least two objects.')
  } else if (n>=1e3) {
    cat('WARNING: bias correction normally not relevant for more than 1000 objects.\n')
  }
  p.new = array(NA,c(np,n))
  for (i in seq(n)) {
    keep = setdiff(seq(n),i) # leave out object i
    if (is.null(df$input$data$x.err)) {
      x.err.keep = NULL
    } else if (length(dim(df$input$data$x.err))==3) {
      x.err.keep = df$input$data$x.err[keep,,,drop=FALSE]
    } else {
      x.err.keep = df$input$data$x.err[keep,,drop=FALSE]
    }
    cf = .corefit(df$input$data$x[keep,,drop=FALSE], x.err.keep,
                 df$input$selection$veff.mesh*(n-1)/n,
                 df$input$distribution.function$phi,
                 df$fit$parameters$p.optimal,
                 df$input$options$x.mesh, df$input$options$x.mesh.dv,
                 n.iterations = 1, debiasing.first.iteration = TRUE,
                 supress.warning = TRUE)
    p.new[,i] = cf$p.optimal
  }
  p.reduced = apply(p.new, 1, mean, na.rm = TRUE)
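  # first-order jackknife correction: p_corrected = n*p_mle - (n-1)*mean(leave-one-out estimates)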
  df$fit$parameters$p.optimal.bias.corrected = n*df$fit$parameters$p.optimal-(n-1)*p.reduced
  return(df)
}

.resample <- function(df, seed = 1) {

  # randomly resample and refit the DF
  if (dim(df$input$data$x)[2]>1) stop('Resampling is only available for one-dimensional distribution functions.')
  set.seed(seed)
  np = length(df$fit$parameters$p.optimal)
  x = df$input$options$x.grid[[1]]
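  # build the discrete CDF of the expected source counts for inverse transform sampling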
  density = pmax(0,df$fit$evaluation$y*df$input$selection$veff.function(x))
  cum = cumsum(density/sum(density))
  n.data = dim(df$input$data$x)[1]
  p.new = array(NA,c(df$input$options$n.resampling,np))
  for (iteration in seq(df$input$options$n.resampling)) {
    n.new = max(2,rpois(1,n.data))
    x.obs = array(NA,n.new)
    r = runif(n.new)
    for (i in seq(n.new)) {
      index = which.min(abs(cum-r[i]))
      x.obs[i] = x[index]
    }
    p.new[iteration,] = .corefit(cbind(x.obs), NULL, df$input$selection$veff.mesh,
                                 df$input$distribution.function$phi,
                                 df$fit$parameters$p.optimal,
                                 df$input$options$x.mesh, df$input$options$x.mesh.dv,
                                 n.iterations = 1)$p.optimal
  }

  # make parameter quantiles
  q = c(0.02,0.16,0.84,0.98)
  p.quant = array(NA,c(length(q),np))
  for (i in seq(np)) {
    p.quant[,i] = quantile(p.new[,i],q,names=FALSE)
  }
  p.quantile = list(p.quantile.02 = p.quant[1,], p.quantile.16 = p.quant[2,],
                    p.quantile.84 = p.quant[3,], p.quantile.98 = p.quant[4,])

  # make DF quantiles
  s = array(NA,c(df$input$options$n.resampling,length(x)))
  for (iteration in seq(df$input$options$n.resampling)) {
    s[iteration,] = df$input$distribution.function$phi(x,p.new[iteration,])
  }
  y.quant = array(NA,c(4,length(x)))
  for (i in seq(length(x))) {
    keep = is.finite(s[,i]) & (s[,i]>0)
    y.quant[,i] = quantile(s[keep,i],q,names=FALSE)
  }
  y.quantile = list(y.quantile.02 = y.quant[1,], y.quantile.16 = y.quant[2,],
                    y.quantile.84 = y.quant[3,], y.quantile.98 = y.quant[4,])

  # output parameters
  df$fit$parameters = append(df$fit$parameters, p.quantile)
  df$fit$evaluation = append(df$fit$evaluation, y.quantile)
  return(df)

}