dffit: Fit a generative distribution function, such as a galaxy mass...

Description Usage Arguments Details Value Author(s) Examples

View source: R/dffit.R

Description

This function finds the most likely P-dimensional model parameters of a D-dimensional distribution function (DF) generating an observed set of N objects with D-dimensional observables x, accounting for measurement uncertainties and a user-defined selection function. For instance, if the objects are galaxies, dffit can fit a mass function (D=1), a mass-size distribution (D=2) or the mass-spin-morphology distribution (D=3). A full description of the algorithm can be found in Obreschkow et al. (2017).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
dffit(
  x,
  selection = NULL,
  x.err = NULL,
  r = NULL,
  gdf = "Schechter",
  p.initial = NULL,
  prior = NULL,
  obs.selection = NULL,
  obs.sel.cov = NULL,
  n.iterations = 100,
  method = "Nelder-Mead",
  reltol = 1e-10,
  correct.lss.bias = FALSE,
  lss.weight = NULL,
  lss.errors = TRUE,
  n.bootstrap = NULL,
  n.jackknife = NULL,
  xmin = NULL,
  xmax = NULL,
  dx = 0.01,
  keep.eddington.bias = FALSE,
  write.fit = FALSE,
  add.gaussian.errors = TRUE,
  make.posteriors = TRUE
)

Arguments

x

is an N-by-D matrix (or N-element vector if D=1) containing the observed D-dimensional properties of N objects. For instance, x can be N-element vector containing the logarithmig masses of N galaxies, to which a mass function should be fitted.

selection

Specifies the effective volume V(xval) in which an object of (D-dimensional) true property (e.g. true log-mass) xval can be observed. This volume can account for complex selection criteria, as long as they can be expressed as a function of the true properties, i.e. before measurement errors occur. This volume can be specified in five ways:

(1) selection can be a single positive number. This number will be interpreted as a constant volume, V(xval)=selection, in which all objects are fully observable. V(xval)=0 is assumed outside the "observed domain". This domain is defined as min(x)<=xval<=max(x) for a scalar observable (D=1), or as min(x[,j])<=xval[j]<=max(x[,j]) for all j=1,...,D if D>1. This mode can be used for volume-complete surveys or for simulated galaxies in a box.

(2) selection can be an N-element vector. The elements will be interpreted as the volumes of each galaxy. V(xval) is interpolated (linearly in 1/V) for other values xval. V(xval)=0 is assumed outside the observed domain, except if D=1 (as in the case of fitting mass functions), where V(xval)=0 is assumed only if xval<min(x), whereas for xval>max(x), V(xval) is set equal to the maximum effective volume, i.e. the maximum of selection.

(3) selection can be a function of D variables, which directly specifies the effective volume for any xval, i.e. Veff(xval)=selection(xval).

(4) selection can be a list (selection = list(veff.values, veff.userfct)) of an N-element vector veff.values and a D-dimensional function veff.userfct. In this case, the effective volume is computed using a hybrid scheme of modes (2) and (3): V(xval) will be interpolated from the N values of veff.values inside the observed domain, but set equal to veff.userfct outside this domain.

(5) selection can be a list of two functions and one 2-element vector: selection = list(f, dVdr, rmin, rmax), where f = function(xval,r) is the isotropic selection function and dVdr = function(r) is the derivative of the total survey volume as a function of comoving distance r. The scalars rmin and rmax (can be 0 and Inf) are the minimum and maximum comoving distance limits of the survey. Outside these limits V(xval)=0 will be assumed.

x.err

specifies the observational uncertainties of the N data points. These uncertainties can be specified in four ways:

(1) If x.err = NULL, the measurements x will be considered exact.

(2) If x.err is a N-by-D matrix, the scalars x.err[i,j] are interpreted as the standard deviations of Gaussian uncertainties on x[i,j].

(3) If x.err is a N-by-D-by-D array, the D-by-D matrices x.err[i,,] are interpreted as the covariance matrices of the D observed values x[i,].

(4) x.err can also be a function of a D-vector x and an integer i, such that x.err(x,i) is the prior probability distribution function of the data point i to have the true value x. This function must be vectorized in the first argument, such that calling x.err(x,i) with x being a N-by-D matrix returns an N-element vector.

r

Optional N-element vector specifying the comoving distances of the N objects (e.g. galaxies). This vector is only needed if correct.lss.bias = TRUE.

gdf

Either a string or a function specifying the DF to be fitted. A string is interpreted as the name of a predefined mass function (i.e. functions of one obervable, D=1). Available options are 'Schechter' for Schechter function (3 parameters), 'PL' for a power law (2 parameters), or 'MRP' for an MRP function (4 parameters). Alternatively, gdf = function(xval,p) can be any function of the P observable(s) xval and a list of parameters p. The function gdf(xval,p) must be fully vectorized in xval, i.e. it must output a vector of N elements if xval is an N-by-P array (such as the input argument x). Note that if gdf is given as a function, the argument p.initial is mandatory.

p.initial

is a P-vector specifying the initial model parameters for fitting the DF.

prior

is an optional function specifying the priors on the P model parameters. This function must take a P-dimensional vector p as input argument and return a scalar proportional to the natural logarithm of the prior probability of the P-dimensional model parameter p. In other words, prior(p) is an additative term for the log-likelihood. Note that prior(p) must be smooth and finite for the full parameter space. If prior=NULL (default), uniform priors are assumed.

obs.selection

is an optional selection function of a D-vector x, which specifies the fraction (between 0 and 1) of data with observed values x, whose true value passes the selection function selection can be observed. Normally this fraction is assumed to be 1 and all the selections are assumed to be specified relative to the true value in selection. However, if the data were selected, for example, with a sharp cut in the observed values, this can be specified in obs.selection. The function obs.selection(x) must be vectorized and return a vector of N elements, if x is an N-by-D matrix (such as the input argument x).

obs.sel.cov

is an optional D-by-D matrix, only used if obs.selection is set. It specifies the mean covariance matrix of the data to estimate how much data has been scattered outside the observing range. If obs.selection is set, but obs.sel.cov is not specified, the code estimates obs.sel.cov from the mean errors of the data (only works for 1D data).

n.iterations

Maximum number of iterations in the repeated fit-and-debias algorithm to evaluate the maximum likelihood.

method

optimization method argument of optim routine

reltol

relative tolerance parameter of optim routine

correct.lss.bias

If TRUE the distance values are used to correct for the observational bias due to galaxy clustering (large-scale structure). The overall normalization of the effective volume is chosen such that the expected mass contained in the survey volume is the same as for the uncorrected effective volume.

lss.weight

If correct.lss.bias==TRUE, this optional function of a P-vector is the weight-function used for the mass normalization of the effective volume. For instance, to preserve the number of galaxies, choose lss.weight = function(x) 1, or to perserve the total mass, choose lss.weight = function(x) 10^x (if the data x are log10-masses).

lss.errors

is a logical flag specifying whether uncertainties computed via resampling should include errors due to the uncerainty of large-scale structure (LSS). If TRUE the parameter uncerainties are estimated by refitting the LSS correction at each resampling iteration. This argument is only considered if correct.lss.bias=TRUE and n.bootstrap>0.

n.bootstrap

If n.bootstrap is an integer larger than one, the data is resampled n.bootstrap times using a non-parametric bootstrapping method to produce more accurate covariances. These covariances are given as matrix and as parameter quantiles in the output list. If n.bootstrap = NULL, no resampling is performed.

n.jackknife

If n.jackknife is an integer larger than one, the data is jackknife-resampled n.jackknife times, removing exactly one data point from the observed set at each iteration. This resampling adds model parameters, maximum likelihood estimator (MLE) bias corrected parameter estimates (corrected to order 1/N). If n.jackknife is larger than the number of data points N, it is automatically reduced to N. If n.jackknife = NULL, no such parameters are deterimed.

xmin, xmax, dx

are P-element vectors (i.e. scalars for 1-dimensional DF) specifying the points (seq(xmin[i],xmax[i],by=dx[i])) used for some numerical integrations.

keep.eddington.bias

If TRUE, the data is not corrected for Eddington bias. In this case no fit-and-debias iterations are performed and the argument n.iterations will be ignored.

write.fit

If TRUE, the best-fitting parameters are displayed in the console.

add.gaussian.errors

If TRUE, Gaussian estimates of the 16 and 84 percentiles of the fitted generative distribution function (gdf) are included in the sublist grid of the output.

make.posteriors

If TRUE, posterior probability distributions of the observed data are evaluated from the best fitting model.

Details

For a detailed description of the method, please refer to the peer-reviewed publication by Obreschkow et al. 2017 (in prep.).

Value

The routine dffit returns a structured list, which can be interpreted by other routines, such as dfwrite, dfplot, dfplotcov, dfplotveff. The list contains the following sublists:

data

contains the input arguments x, x.err, r and lss.weight, as well as the integers n.data and n.dim specifying the number of objects (N) to be fitted and the dimension (D) of the observables. In other words, n.data and n.dim are the number of rows and columns of x, respectively.

selection

is a list describing the selection function of the data used when fitting the generative DF. The most important entries in this list are:

veff is a function of a D-dimensional vector, specifying the effective volume associated with an object of properties xval. If LSS is corrected for, i.e. if the argument correct.lss.bias is set to TRUE, this function is the final effecive volume, including the effect of LSS.

veff.no.lss is a function of a D-dimensional vector, specifying the effecive volume associate with an object, if LSS were not accounted for. This function is indentical to veff, if LSS is not accounted for in the fit, i.e. if the argument correct.lss.bias is set to FALSE.

mode is an integer specifying the format of the input argument selection.

fit

is a list describing the fitted generative distribution function. Its most important entries are:

p.best is a P-vector giving the most likely model parameters according to the MML method.

p.sigma and p.covariance are the standard deviations and covariance matrices of the best-fitting parameters in the Gaussian approximation from the Hessian matrix of the modified likelihood function.

gdf is a function of a D-dimensional vector, which is the generative DF, evaluated at the parameters p.best.

scd is a function of a D-dimensional vector, which gives the predicted source counts of the most likely model, i.e. scd(x)=gdf(x)*veff(x).

posteriors

is a list specifying the posterior PDFs of the observed data, given the best-fitting model. It contains the following entries:

x.mean is an N-by-D dimensional array giving the D-dimensional means of the posterior PDFs of the N objects.

x.mode is an N-by-D dimensional array giving the D-dimensional modes of the posterior PDFs of the N objects.

x.stdev is an N-by-D dimensional array giving the D-dimensional standard deviations of the posterior PDFs of the N objects.

x.random is an N-by-D dimensional array giving one random D-dimensional value drawn from the posterior PDFs of each of the N objects.

model

is a list describing the generative DF used to model the data. The main entries of this list are:

gdf(xval,p) is the generative DF to be fitted, written as phi(x|theta) in the reference publication.

gdf.equation is a text-string representing the analytical equation of gdf.

parameter.names is a P-vector of expressions (see expression), specifying the names of the parameters.

n.para is an integer specifying the number P of model parameters.

grid

is a list of arrays with numerical evaluations of different functions on a grid in the D-dimensional observable space. This grid is used for numerical integrations and graphical representations. The most important list entries are:

x is a M-by-D array of M points, defining a regular Cartesian grid in the D-dimensional observable space.

xmin and xmax are D-vectors specifying the lower and upper boundary of the grid in the D-dimensional observable space.

dx is a D-vector specifying the steps between grid points.

dvolume is a number specifying the D-dimensional volume associated with each grid point.

n.points is the number M of grid points.

gdf is an M-vector of values of the best-fitting generative DF at each grid point. The additional entries gdf.error.neg and gdf.error.pos define the 68%-confidence range in the Hessian approximation of the parameter covariances. The optional entries gdf.quantile.# are different quantiles, generated if the input argument n.bootstrap is set.

veff is an M-vector of effective volumes at each grid point.

scd is an M-vector of predicted source counts according to the best-fitting model.

scd.posterior is an M-vector of observed source counts derived from the posterior PDFs of each object.

effective.counts is an M-vector of fractional source counts derived from the posterior PDFs of each object.

options

is a list of various optional input arguments of dffit.

Author(s)

Danail Obreschkow

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
# For a quick overview of some key functionalities run
dfexample()
# with varying integer arguments 1, 2, 3, 4.

# The following examples introduce the basics of dftools step-by step.
# First, generate a mock sample of 1000 galaxies with 0.5dex mass errors, drawn from
# a Schechter function with the default parameters (-2,11,-1.3):
dat = dfmockdata(n=1000, sigma=0.5)

# show the observed and true log-masses (x and x.true) as a function of true distance r
plot(dat$r,dat$x,col='grey'); points(dat$r,dat$x.true,pch=20)

# fit a Schechter function to the mock sample without accounting for errors
survey1 = dffit(dat$x, dat$veff)

# plot fit and add a black dashed line showing the input MF
ptrue = dfmodel(output='initial')
mfplot(survey1, xlim=c(1e6,2e12), ylim=c(2e-4,2),
show.data.histogram = TRUE, p = ptrue, col.fit = 'purple')

# now, do the same again, while accountting for measurement errors in the fit
# this time, the posterior data, corrected for Eddington bias, is shown as black points
survey2 = dffit(dat$x, dat$veff, dat$x.err)
mfplot(survey2, show.data.histogram = NA, add = TRUE)

# show fitted parameter PDFs and covariances with true input parameters as black points
dfplotcov(list(survey2,survey1,ptrue),pch=c(20,20,3),col=c('blue','purple','black'),nstd=15)

# show effective volume function
dfplotveff(survey2)

# now create a smaller survey of only 30 galaxies with 0.5dex mass errors
dat = dfmockdata(n=30, sigma=0.5)

# fit a Schechter function and determine uncertainties by resampling the data
survey = dffit(dat$x, dat$veff, dat$x.err, n.bootstrap = 30)

# show best fit with 68% Gaussian uncertainties from Hessian and posterior data as black points
mfplot(survey, show.data.histogram = TRUE, uncertainty.type = 1)

# show best fit with 68% and 95% resampling uncertainties and posterior data as black points
mfplot(survey, show.data.histogram = TRUE, uncertainty.type = 3)

# add input model as dashed lines
lines(10^survey$grid$x, survey$model$gdf(survey$grid$x,ptrue), lty=2)

obreschkow/dftools documentation built on June 25, 2021, 10:45 p.m.