dftools: Fitting distribution functions such as galaxy mass functions

Documented in dfmockdata

#' Generate mock data
#'
#' This function produces a mock survey with observed log-masses \code{x.obs} with Gaussian uncertainties and distances \code{r}, using a custom mass function (MF) and selection function.
#' 
#' @importFrom pracma erf
#' @importFrom stats rnorm runif uniroot integrate rpois optim approxfun
#'
#' @param n Number of objects (galaxies) to be generated. If \code{n=NULL}, the number is determined from the mass function (\code{gdf}) and the selection criteria (specified by \code{f} and \code{dVdr}). Otherwise, the survey volume (specified by the derivative \code{dVdr}) is automatically multiplied by the scaling factor required to obtain the requested number of objects \code{n}.
#' @param seed An interger number used as seed for the random number generator. If you wish to generate different realizations, with the same survey specifications, it suffices to vary this number.
#' @param veff is the effective volume function \code{veff(x)}, definied as the cosmic volume in which sources of log-mass \code{x} can be detected by the survey. If this function is specified, \code{f}, \code{dVdr} and \code{g} cannot be specified.
#' @param f is the selection function \code{f(x,r)}, giving the ratio between the expected number of detected galaxies and true galaxies of log-mass \code{x} and comoving distance \code{r}. Normally this function is bound between 0 and 1. It takes the value 1 at distances, where objects of mass \code{x} are easily detected, and 0 at distances, where such objects are impossible to detect. A rapid, continuous drop from 1 to 0 normally occurs at the limting distance \code{rmax}, at which a galaxy of log-mass \code{x} can be picked up. \code{f(x,r)} can never by smaller than 0, but values larger than 1 are conceivable, if there is a large number of false positive detections in the survey. The default is \code{f = function(x,r) erf((1-1e3*r/sqrt(10^x))*20)*0.5+0.5}, which mimiks a sensitivity-limited survey with a fuzzy limit.
#' @param dVdr is the function \code{dVdr(r)}, spedifying the derivative of the survey volume \code{V(r)} as a function of comoving distance \code{r}. This survey volume is simply the total observed volume, irrespective of the detection probability, which is already specified by the function \code{f}. Normally, the survey volume is given by \code{V(r)=Omega*r^3/3}, where \code{Omega} is the solid angle of the survey. Hence, the derivative is \code{dVdr(r)=Omega*r^2}. The default is \code{Omega=2.13966} [sterradians], chosen such that the expected number of galaxies is exactly 1000 when combined with the default selection function \code{f(x,r)}.
#' @param gdf is the 'generative distribution function', i.e. the underlying mass function, from which the galaxies are drawn. This function is a function of log-mass \code{x}. It returns the expected number of galaxies per unit of cosmic volume \code{V} and log-mass \code{x}. The default is a Schechter function.
#' @param g function of distance \code{r} descibing the number-density variation of galaxies due to cosmic large-scale structure (LSS). Explicitly, \code{g(r)>0} is the number-density at \code{r}, relative to the number-density without LSS. Values between 0 and 1 are underdense regions, values larger than 1 are overdense regions. In the absence of LSS, \code{g(r)=1}. Note that g is automatically rescaled, such that its average value in the survey volume is 1.
#' @param sigma Gaussian observing errors in log-mass \code{x}, which are automatically added to the survey. \code{sigma} can either be (1) a scalar, (2) a vector of \code{n} elements, or a function of the true log-mass \code{x}.
#' @param rmin,rmax Minimum and maximum distance of the survey. Outside these limits the function \code{f(x,r)} will automatically be assumed to be 0.
#' @param xmin,xmax Minimum and maximum log-mass in the survey. For optimal performance, specify these boubdaries in such a way that they certainly contain all sources generated by the survey, but don't span a much larger range.
#' @param shot.noise Logical flag. If set to \code{TRUE}, the number of galaxies in the survey can differ from the expected number, following a Poisson distribution.
#' @param verbose Logical flag. If set to \code{TRUE}, some information will be displayed in the console while generating the mock survey.
#' 
#' @return \code{dfmockdata} returns a list of arrays and scalars:
#' \item{x}{Array of observed log-mass.}
#' \item{x.err}{Gaussian uncertainties on x.}
#' \item{x.true}{Array of true log-masses, i.e. the values of \code{x} before they were perturbed by random uncertainties \code{x.err}.}
#' \item{r}{Array of comoving distances, only available if a function \code{f} is given.}
#' \item{f}{Selection function provided as input argument.}
#' \item{g}{Cosmic LSS function provided as input argument.}
#' \item{dVdr}{Derivative of survey volume provided as input argument, but rescaled to the requested number of galaxies \code{n}.}
#' \item{veff}{Function returning the effective volume as a function of log-mass \code{x}.}
#' \item{veff.values}{Array of effective volumes for each galaxy.}
#' \item{scd}{Function returning the expected source count density as a function of log-mass \code{x}.}
#' \item{rmin,rmax}{Range of comoving distances \code{r}, spanned by the survey. Same as input arguments.}
#' \item{xmin,xmax}{Range of log-masses \code{x} provided as input argument. This range is generally larger than the range spanned by the values of \code{x} and is meant to span the maximally conceivable range of \code{x} given the survey specifications.}
#' \item{rescaling.factor}{Value of rescaling factor applied to the cosmic volume to match the requested number of galaxies \code{n}.}
#' 
#' @examples
#' # draw 1000 galaxies with mass errors of 0.3 dex from a Schechter function
#' # with parameters (-2,11,-1.3) and a preset selection function
#' mock = dfmockdata(sigma = 0.3)
#' 
#' # plot the distance-log(mass) relation of observed data, true data, and approximate survey limit
#' plot(mock$r,mock$x,col='blue')
#' points(mock$r,mock$x.true,pch=20)
#' x = seq(5,11,0.01)
#' lines(1e-2*sqrt(10^x),x,col='red')
#'
#' # These data can then be used to fit a MF in several ways. For instance,
#' # assuming that the effective volume function Veff(x) is known:
#' selection = mock$veff
#' survey = dffit(mock$x, selection, mock$x.err)
#' 
#' # or assuming that Veff is known only on a galaxy-by-galaxy basis
#' selection = mock$veff.values
#' dffit(mock$x, selection, mock$x.err)
#' 
#' # or assuming that Veff is known on a galaxy-by-balaxy basis, but approximate analytically
#' # outside the range of observed galaxy masses
#' selection = list(mock$veff.values, mock$veff)
#'dffit(mock$x, selection, mock$x.err)
#' 
#' # or assuming that the full selection function f(x,r) and the observing volume
#' # derivative dVdr(r) are known
#' selection = list(mock$f, mock$dVdr, mock$rmin,mock$rmax)
#' dffit(mock$x, selection, mock$x.err)
#'
#' @seealso \code{\link{dffit}}
#'
#' @author Danail Obreschkow
#'
#' @export

dfmockdata <- function(n = NULL,
                       seed = 1,
                       veff = NULL,
                       f = NULL,
                       dVdr = NULL,
                       gdf = function(x) dfmodel(x,c(-2,11,-1.3),type='Schechter'),
                       g = NULL,
                       sigma = 0.0,
                       rmin = 0, rmax = 20,
                       xmin = 2, xmax = 13,
                       shot.noise = FALSE,
                       verbose = FALSE
                       ) {
  
  # check input
  if (is.null(f)!=is.null(dVdr)) stop('The functions f(x,r) and dVdr(r) must both be given or both not be given.')
  if (!is.null(f) & !is.null(veff)) stop('Either veff(x) or the pair {f(x,r) and dVdr(r)} must be given, but not both.')
  if (is.null(f) & is.null(veff)) {
    f = function(x,r) pracma::erf((1-1e3*r/sqrt(10^x))*20)*0.5+0.5
    dVdr = function(r) 2.13966*r^2
  }
  lss = !is.null(g)
  if (lss & is.null(f)) stop('If g(r) is given, the selection must be specified by f(x,r) and dVdr(r), not by veff(x).')
  test = try(gdf(xmin),silent=TRUE)
  if (!is.finite(test)) {
    stop('gdf cannot be evaluated for parameter-vector p.\n')
  }
  set.seed(seed+1)
  
  # evaluate effective volume and source count density without LSS
  if (is.null(veff)) {
    veff.elemental = function(x) {
      fct = function(r) f(x,r)*dVdr(r)
      return(integrate(fct,rmin,rmax,stop.on.error=FALSE)$value)
    }
  } else {
    veff.elemental = veff
  }
  veff = Vectorize(veff.elemental)
  scd = function(x) veff(x)*gdf(x)
  
  if (lss) {
    
    # renormalize function g(r), such that the average value of g in the survey volume is 1
    integrand = function(r) dVdr(r)*g(r)
    gnorm = integrate(integrand,rmin,rmax,stop.on.error=FALSE)$value/integrate(dVdr,rmin,rmax,stop.on.error=FALSE)$value
    tmp.g = g; g = function(r) tmp.g(r)/gnorm
    
    # evaluate effective volume and source count density with LSS
    veff.lss.elemental = function(x) {
      fct = function(r) f(x,r)*g(r)*dVdr(r)
      return(integrate(fct,rmin,rmax,stop.on.error=FALSE)$value)
    }
    veff.lss = Vectorize(veff.lss.elemental)
    scd.lss = function(x) veff.lss(x)*gdf(x)
    
  } else {
    
    veff.lss = veff
    scd.lss = scd
    
  }
  
  # compute expected number of galaxies (accounting for lss if present)
  n.expected = integrate(scd.lss,xmin,xmax,stop.on.error=FALSE)$value
  n.expected.large = integrate(scd.lss,2*xmin-xmax,2*xmax-xmin,stop.on.error=FALSE)$value
  if (n.expected.large>1.001*n.expected) stop('A non-negligible number of galaxies lies outside the range xmin-xmax. Please change this range.')
  
  # rescale effective volume to match the (optional) requested number of galaxies
  if (is.null(n)) {
    if (n.expected<2) stop('Input arguments imply less than two sources in the survey.')
    rescaling.factor = 1
  } else {
    if (n<2) stop('Number of sources must be at least 2.')
    rescaling.factor = n/n.expected
    if (is.null(dVdr)) {
      tmp = veff; veff = function(x) tmp(x)*rescaling.factor
    } else {
      tmp = dVdr; dVdr = function(r) tmp(r)*rescaling.factor
    }
    n.expected = n
  }
  
  # make actual number of sources
  if (shot.noise) {
    n = max(1,rpois(1,n.expected))
  } else {
    n = round(n.expected)
  }
  if (verbose) {
    cat(sprintf('Number of sources in the mock survey (expected): %.3f\n',n.expected))
    cat(sprintf('Number of sources in the mock survey (selected): %d\n',n))
  }
  
  # sample masses (x)
  dx = min(0.005,(xmax-xmin)/1000)
  xgrid = seq(xmin,xmax,dx)
  cdf = cumsum(scd.lss(xgrid)) # cumulative distribution function of source count density
  cdf.min = cdf[1]
  cdf.max = cdf[length(cdf)]
  i.min = max(which(cdf==cdf.min))
  i.max = min(which(cdf==cdf.max))
  nd = !duplicated(cdf[i.min:i.max])
  qnf = approxfun((cdf[i.min:i.max])[nd],(xgrid[i.min:i.max])[nd]) # quantile function of source count density
  x = qnf(runif(n,cdf[1],cdf[length(cdf)]))
  
  # add mass observing errors (x.err)
  if (!is.null(sigma)) {
    if (is.function(sigma)) {
      x.err = sapply(x,sigma)
    } else if (is.numeric(sigma)) {
      if (length(sigma)>1) {
        if (length(sigma)<n) {
          stop('dfmockdata: if sigma is a vector, it must have >=n elements.')
        } else {
          x.err = sigma[1:n]
        }
      } else {
        x.err = rep(sigma,n)
      }
    } else {
      stop('dfmockdata: unknown type for argument sigma.')
    }
    x.obs = x+rnorm(n)*x.err
  } else {
    x.obs = x
    x.err = NULL
  }
  
  # make effective volumes for each observation, that an observer would assign, not knowning the observational error in x
  veff.values = veff(x.obs)
  
  if (!is.null(f)) {
    
    # find maximum of fg(x,r) = f(x,r)*g(r)
    if (lss) {
      fg = function(x,r) f(x,r)*g(r)
    } else {
      fg = f
    }
    xseq = seq(xmin,xmax,length=100)
    rseq = seq(rmin,rmax,length=100)
    xrgrid = cbind(rep(xseq,10),rep(rseq,each=10))
    fct = function(p) -fg(p[1],p[2])
    q = apply(xrgrid,1,fct)
    if (max(q)>0) stop('f*g can never by smaller than 0.')
    para = xrgrid[which.min(q),]
    opt = optim(para,fct,method="L-BFGS-B",lower=c(xmin,rmin),upper=c(xmax,rmax))
    fgmax = -opt$value
    if (fgmax>5 & verbose) {
      if (lss) {
        cat(sprintf('The maximum of f(r)*g(r) (=%f) is significantly larger than 1. Check if this is intended.\n',fgmax))
      } else {
        cat(sprintf('The maximum of f(r) (=%f) is significantly larger than 1. Check if this is intended.\n',fgmax))
      }
    }
    
    # sample distances (r) using cumsum algorithm
    r = array(NA,n)
    dr = min(0.005,(rmax-rmin)/1000)
    rgrid = seq(rmin,rmax,dr)
    cdf = cumsum(dVdr(rgrid)) # cumulative volume out to r
    qnf = approxfun(cdf,rgrid) # quantile function of source count density
    list = seq(n)
    m = n
    count = 0
    while (m>0 & count<100) {
      count = count+1
      r[list] = qnf(runif(m,cdf[1],cdf[length(rgrid)]))
      rejected = fg(x[list],r[list])<runif(m)*fgmax
      list = list[rejected]
      m = length(list)
    }
    
    # sample distances (r) using deterministic uniroot algorithm to avoid iterating forever
    if (m>0) {
      get_random_r = function(x) { 
        h = function(r) fg(x,r)*dVdr(r)
        H = Vectorize(function(r) {integrate(h,rmin,r)$value})
        H.inv = function(y){uniroot(function(x){H(x)-y},interval=c(rmin,rmax))$root}
        return(H.inv(runif(1)*H(rmax)))
      }
      for (i in list) {
        r[i] = get_random_r(x[i])
      }
    }
    
  } else {
    r = NULL
  }
  
  # return
  return(list(x = x.obs, x.err = x.err, x.true = x, r = r,
              f = f, dVdr = dVdr, veff = veff, veff.lss = veff.lss,
              veff.values = veff.values, scd = scd,
              gdf = gdf, g = g,
              rmin = rmin, rmax = rmax,
              xmin = xmin, xmax = xmax,
              rescaling.factor = rescaling.factor,
              n = n, n.expected = n.expected))
  
}