Description Usage Arguments Details Value Parameter Estimation Cost Functions Minimization Algorithms See Also Examples
LNRE model constructor. Returns an object representing a LNRE model with the specified parameters, or estimates the parameters automatically from an observed frequency spectrum.
type | class of LNRE model to use (see "LNRE Models" below)
spc | observed frequency spectrum used to estimate model parameters
debug | if
cost | cost function for measuring the "distance" between observed and expected vocabulary size and frequency spectrum. Parameters are estimated by minimizing this cost function (see "Cost Functions" below for a list of built-in cost functions and details on user-defined cost functions).
m.max | number of spectrum elements considered by the cost function (see "Cost Functions" below for more information). If unspecified, the default is automatically adjusted to avoid small spectrum elements that may be mathematically unreliable.
runs | number of parameter optimization runs with random initialization. Parameters from the run that achieves the smallest value of the cost function will be selected. Currently not supported for
method | algorithm used for parameter estimation, by minimizing the value of the cost function (see "Parameter Estimation" below for details, and "Minimization Algorithms" for descriptions of the available algorithms)
exact | if
sampling | type of random sampling model to use.
bootstrap | number of bootstrap samples used to estimate confidence intervals for estimated model parameters. Recommended values are
parallel | whether to use parallelisation for the bootstrapping procedure (highly recommended). See
verbose | if
... | all further named arguments are interpreted as parameter values for the chosen LNRE model (see the respective manpages for names and descriptions of the model parameters)
Currently, the following LNRE models are supported by the zipfR package:
The Zipf-Mandelbrot (ZM) LNRE model (see lnre.zm for details).
The finite Zipf-Mandelbrot (fZM) LNRE model (see lnre.fzm for details).
The Generalized Inverse Gauss-Poisson (GIGP) LNRE model (see lnre.gigp for details).
If explicit model parameters are specified in addition to an observed frequency spectrum spc, these parameters are fixed to the given values and are excluded from the estimation procedure. This feature can be useful if fully automatic parameter estimation leads to a poor or counterintuitive fit.
An object of a suitable subclass of lnre, depending on the type argument (e.g. lnre.fzm for type="fzm").
This object represents a LNRE model of the selected type with the specified parameter values, or with parameter values estimated from the observed frequency spectrum spc.
The internal structure of lnre objects is described on the lnre.details manpage (intended for developers).
Automatic parameter estimation for LNRE models is performed by matching the expected vocabulary size and frequency spectrum of the model against the observed data passed in the spc argument.
For this purpose, a cost function has to be defined as a measure of the "distance" between observed and expected frequency spectrum. Parameters are then estimated by applying a minimization algorithm in order to find those parameter values that lead to the smallest possible cost.
Parameter estimation is a crucial and often delicate step in the application of LNRE models. Depending on the shape of the observed frequency spectrum, the automatic estimation procedure may result in a poor or counterintuitive fit, or may fail altogether.
Usually, multiple runs of the minimization are performed with different random start values. An error will only be reported if all the estimation runs fail. Such multiple runs have not been implemented for the Custom minimization method yet; please specify runs=1 in this case.
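The multi-run strategy described above can be sketched in base R as follows. This is a toy illustration, not the zipfR implementation: the two-parameter power-law model, the invented "observed" values and the names toy.expected, toy.cost and n.runs are all assumptions made for this example.

```r
set.seed(42)                                   # reproducible random start values

m <- 1:10
observed <- 500 / m^1.8                        # invented toy data, not a real spectrum
toy.expected <- function(par) par[1] / m^par[2]
toy.cost <- function(par) mean((observed - toy.expected(par))^2)

n.runs <- 5
results <- lapply(seq_len(n.runs), function(i) {
  start <- c(runif(1, 100, 1000), runif(1, 0.5, 3))   # random initialization
  ## a failed run yields NULL instead of aborting the whole procedure
  tryCatch(
    optim(start, toy.cost, method = "Nelder-Mead",
          control = list(parscale = c(100, 1))),
    error = function(e) NULL
  )
})

## report an error only if every run failed
ok <- Filter(Negate(is.null), results)
if (length(ok) == 0) stop("all estimation runs failed")

## keep the run with the smallest value of the cost function
values <- vapply(ok, function(r) r$value, numeric(1))
best <- ok[[which.min(values)]]
best$par
```

The same pattern (random start, guarded optimization, selection by smallest cost) applies regardless of which cost function or optimizer is plugged in.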
Users can influence parameter estimation by choosing from a range of predefined cost functions and from several minimization algorithms, as described in the following sections. Some experimentation with the cost, m.max and method arguments will often help to resolve estimation failures and may result in a considerably better goodness-of-fit.
The following cost functions are available and can be selected with the cost argument. All functions are based on the differences between observed and expected values for vocabulary size and the first elements of the frequency spectrum (V_1, …, V_m, where m is given by the m.max argument):
gof: the multivariate chi-squared statistic used for goodness-of-fit testing (lnre.goodness.of.fit). This cost function corresponds (almost) to maximum-likelihood parameter estimation and is used by default.
chisq: cost function based on a simplified version of the multivariate chi-squared test for goodness-of-fit (assuming independence between the random variables V_m).
linear: linear cost function, which sums over the absolute differences between observed and expected values. This cost function puts more weight on fitting the vocabulary size and the first few elements of the frequency spectrum (where absolute differences are much larger than for higher spectrum elements).
smooth.linear: modified version of the linear cost function, which smooths the kink of the absolute value function at a difference of 0 (since non-differentiable cost functions might be problematic for gradient-based minimization algorithms).
mse: mean squared error cost function, averaging over the squares of differences between observed and expected values. This cost function penalizes large absolute differences more heavily than linear cost (and therefore puts even greater weight on fitting vocabulary size and the first spectrum elements).
exact: this "virtual" cost function attempts to match the observed vocabulary size and first spectrum elements exactly, ignoring differences for all higher spectrum elements. This is achieved by adjusting the value of m.max automatically, depending on the number of free parameters that are estimated (in general, the number of constraints that can be satisfied by estimating parameters is the same as the number of free parameters). Having adjusted m.max, the mse cost function is used to determine parameter values, so that the estimation procedure will not fail even if the constraints cannot be matched exactly.
Alternatively, a user-defined cost function can be passed as a function object with signature 'cost(model, spc, m.max)', which compares the LNRE model 'model' against the observed frequency spectrum 'spc' and returns a cost value (i.e. lower cost indicates a better fit).
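As a hedged sketch of what such a user-defined cost function might look like, the code below compares observed and expected values on a log scale. The names cost.logdiff and my.cost are hypothetical, and the log-scale comparison is an arbitrary illustration rather than a recommended cost. The accessors N, V, Vm, EV and EVm are the zipfR methods mentioned on this manpage; they are called with the zipfR:: prefix so that defining the function does not require the package to be attached.

```r
## helper on plain numeric vectors: sum of absolute log-scale differences
## (the +1 offset is an arbitrary choice that avoids log(0))
cost.logdiff <- function(observed, expected) {
  sum(abs(log(observed + 1) - log(expected + 1)))
}

## user-defined cost function with the signature cost(model, spc, m.max)
my.cost <- function(model, spc, m.max) {
  n <- zipfR::N(spc)                                   # sample size of the spectrum
  observed <- c(zipfR::V(spc), zipfR::Vm(spc, 1:m.max))
  expected <- c(zipfR::EV(model, n), zipfR::EVm(model, 1:m.max, n))
  cost.logdiff(observed, expected)                     # lower value = better fit
}

## the helper can be checked without zipfR: identical vectors give zero cost
cost.logdiff(c(100, 40, 15), c(100, 40, 15))
```

With zipfR attached and Dickens.spc loaded, such a function would be passed as lnre("gigp", Dickens.spc, cost=my.cost).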
Several different minimization algorithms can be used for parameter estimation and are selected with the method argument:
Nelder-Mead: the Nelder-Mead algorithm, implemented by the optim function, performs minimization without using derivatives. Parameter estimation is therefore very robust, while almost as fast and accurate as the NLM method. Nelder-Mead is the default algorithm and is also used internally by most custom minimization procedures (see below).
NLM: a standard Newton-type algorithm for nonlinear minimization, implemented by the nlm function, which makes use of numerical derivatives of the cost function. NLM minimization converges quickly and obtains very precise parameter estimates (for a local minimum of the cost function), but it is not very stable and may cause parameter estimation to fail altogether.
SANN: minimization by simulated annealing, also provided by the optim function. Like Nelder-Mead, this algorithm is very robust because it avoids numerical derivatives, but convergence is extremely slow. In some cases, SANN might produce a better fit than Nelder-Mead (if the latter converges to a suboptimal local minimum).
BFGS: a quasi-Newton method developed by Broyden, Fletcher, Goldfarb and Shanno. This minimization algorithm is efficient, but should be applied with care as it will often overshoot the valid range of parameter values.
Custom: a custom estimation procedure provided for certain types of LNRE model, which may exploit special mathematical properties of the model in order to calculate one or more of the parameter values directly. For example, one parameter of the ZM and fZM models can easily be determined from the constraint E[V] = V (but note that this additional constraint leads to a different fit than is obtained by plain minimization of the cost function!). Custom estimation might also apply special configuration settings to improve convergence of the minimization process, based on knowledge about the valid ranges and "behaviour" of model parameters. If no custom estimation procedure has been implemented for the selected LNRE model, lnre falls back on the Nelder-Mead or NLM algorithm.
See the nlm and optim manpages for more information about the minimization algorithms used and key references.
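To give a feel for how these algorithms differ, the sketch below applies them to a simple toy cost function in base R, calling optim and nlm directly. The function toy.cost and the start values are invented for this illustration; how lnre configures these optimizers internally is not reproduced here.

```r
set.seed(1)                                    # SANN is stochastic

## toy cost function with its minimum at (1, 2)
toy.cost <- function(par) (par[1] - 1)^2 + (par[2] - 2)^2
start <- c(0, 0)

fit.nm   <- optim(start, toy.cost, method = "Nelder-Mead")  # derivative-free
fit.bfgs <- optim(start, toy.cost, method = "BFGS")         # quasi-Newton
fit.sann <- optim(start, toy.cost, method = "SANN")         # simulated annealing
fit.nlm  <- nlm(toy.cost, start)                            # Newton-type

## parameter estimates and final cost for each method
rbind(
  "Nelder-Mead" = c(fit.nm$par,      cost = fit.nm$value),
  "BFGS"        = c(fit.bfgs$par,    cost = fit.bfgs$value),
  "SANN"        = c(fit.sann$par,    cost = fit.sann$value),
  "NLM"         = c(fit.nlm$estimate, cost = fit.nlm$minimum)
)
```

On a smooth, well-behaved function like this one all four methods should end up near the same minimum; the differences described above matter for the less regular cost surfaces that arise in LNRE estimation.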
Detailed descriptions of the different LNRE models provided by zipfR and their parameters can be found on the manpages lnre.zm, lnre.fzm and lnre.gigp.
Useful methods for trained models are lnre.spc, lnre.vgc, EV, EVm, VV, VVm. Suitable implementations of the print and summary methods are also provided (see print.lnre for details), as well as for plotting (see plot.lnre). Note that the methods N, V and Vm can be applied to LNRE models with estimated parameters and return information about the observed frequency spectrum used for parameter estimation.
If bootstrapping samples have been generated (bootstrap > 0), confidence intervals for the model parameters can be determined with confint.lnre. See lnre.bootstrap for more information on the bootstrapping procedure and implementation.
The lnre.details manpage gives details about the implementation of LNRE models and the internal structure of lnre objects, while estimate.model has more information on the parameter estimation procedure (both manpages are intended for developers).
See lnre.goodness.of.fit for a complete description of the goodness-of-fit test that is automatically performed after parameter estimation (and which is reported in the summary of the LNRE model). This function can also be used to evaluate the predictions of the LNRE model on a different data set than the one used for parameter estimation.
library(zipfR)
## load Dickens dataset
data(Dickens.spc)
## estimate parameters of GIGP model and show summary
m <- lnre("gigp", Dickens.spc)
m
## N, V and V1 of spectrum used to compute model
## (should be the same as for Dickens.spc)
N(m)
V(m)
Vm(m,1)
## expected V and V_m and their variances for arbitrary N
EV(m,100e6)
VV(m,100e6)
EVm(m,1,100e6)
VVm(m,1,100e6)
## use only 10 instead of 15 spectrum elements to estimate model
## (note how fit improves for V and V1)
m.10 <- lnre("gigp", Dickens.spc, m.max=10)
m.10
## experiment with different cost functions
m.mse <- lnre("gigp", Dickens.spc, cost="mse")
m.mse
m.exact <- lnre("gigp", Dickens.spc, cost="exact")
m.exact
## NLM minimization algorithm is faster but less robust
m.nlm <- lnre("gigp", Dickens.spc, method="NLM")
m.nlm
## ZM and fZM LNRE models have special estimation algorithms
m.zm <- lnre("zm", Dickens.spc)
m.zm
m.fzm <- lnre("fzm", Dickens.spc)
m.fzm
## estimation is much faster if approximations are allowed
m.approx <- lnre("fzm", Dickens.spc, exact=FALSE)
m.approx
## specify parameters of LNRE models directly
m <- lnre("zm", alpha=.5, B=.01)
lnre.spc(m, N=1000, m.max=10)
m <- lnre("fzm", alpha=.5, A=1e-6, B=.01)
lnre.spc(m, N=1000, m.max=10)
m <- lnre("gigp", gamma=-.5, B=.01, C=.01)
lnre.spc(m, N=1000, m.max=10)
## bootstrapped confidence intervals for model parameters
## Not run:
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=40)
confint(model, "alpha") # Zipf slope
confint(model, "S") # population diversity
confint(model, "S", method="normal") # Gaussian approx works well in this case
## speed up with parallelisation (see ?lnre.bootstrap for more information)
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=40,
              parallel=8) # on Linux / MacOS with 8 available cores
## End(Not run)