lnre: LNRE Models (zipfR)

Description

LNRE model constructor, returns an object representing a LNRE model with the specified parameters, or allows parameters to be estimated automatically from an observed frequency spectrum.

Usage

  lnre(type=c("zm", "fzm", "gigp"),
       spc=NULL, debug=FALSE,
       cost=c("gof", "chisq", "linear", "smooth.linear", "mse", "exact"),
       m.max=15, runs=5,
       method=c("Nelder-Mead", "NLM", "BFGS", "SANN", "Custom"),
       exact=TRUE, sampling=c("Poisson", "multinomial"),
       bootstrap=0, verbose=TRUE, parallel=1L,
       ...)

Arguments

type

class of LNRE model to use (see "Details" below for the supported models)

spc

observed frequency spectrum used to estimate model parameters

debug

if TRUE, detailed debugging information will be printed during parameter estimation

cost

cost function for measuring the "distance" between observed and expected vocabulary size and frequency spectrum. Parameters are estimated by minimizing this cost function (see "Cost Functions" below for a list of built-in cost functions and details on user-defined cost functions).

m.max

number of spectrum elements considered by the cost function (see "Cost Functions" below for more information). If unspecified, the default is automatically adjusted to avoid small spectrum elements that may be mathematically unreliable.

runs

number of parameter optimization runs with random initialization. Parameters from the run that achieves the smallest value of the cost function will be selected. Multiple runs are currently not supported for method="Custom"; use runs=1 in this case.

method

algorithm used for parameter estimation, by minimizing the value of the cost function (see "Parameter Estimation" below for details, and "Minimization Algorithms" for descriptions of the available algorithms)

exact

if FALSE, certain LNRE models will be allowed to use approximations when calculating expected values and variances, in order to improve performance and numerical stability. However, the computed values might be inaccurate or inconsistent in "extreme" situations. In particular, E[V] might be larger than N when N is very small; ∑_m E[V_m] can be larger than E[V] at the same N; and ∑_m m · E[V_m] can be larger than N.

sampling

type of random sampling model to use. Poisson sampling is mathematically simpler and allows fast and robust calculations, while multinomial sampling is more accurate especially for very small samples. Poisson sampling is the default and should be unproblematic for sample sizes N ≥ 10000. NB: The multinomial sampling option has not been implemented yet.

bootstrap

number of bootstrap samples used to estimate confidence intervals for estimated model parameters. Recommended values are bootstrap=100 or bootstrap=200. Bootstrapping can be very time-consuming and should not be used if the underlying sample size is very large (roughly, more than 1 million tokens). See lnre.bootstrap for further information and warnings.

parallel

whether to use parallelisation for the bootstrapping procedure (highly recommended). See lnre.bootstrap for details.

verbose

if TRUE, a progress bar will be shown in the R console during the bootstrapping procedure

...

all further named arguments are interpreted as parameter values for the chosen LNRE model (see the respective manpages for names and descriptions of the model parameters)
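
The consistency conditions listed for the exact argument can be checked directly for a given model. The following is a minimal sketch, assuming the zipfR package is installed; the parameter values, the sample size N.small and the range m.range are arbitrary choices for illustration. Since the sums are truncated at m = 50, a TRUE result only shows that the partial sums are consistent.

```r
# Sketch only: requires the zipfR package; all numeric values are arbitrary.
if (requireNamespace("zipfR", quietly = TRUE)) {
  library(zipfR)

  # fZM model with explicitly specified parameters, approximations allowed
  model <- lnre("fzm", alpha = 0.5, A = 1e-6, B = 0.01, exact = FALSE)

  N.small <- 100   # deliberately small sample size (illustrative)
  m.range <- 1:50  # spectrum elements included in the partial sums

  E.V  <- EV(model, N.small)            # expected vocabulary size
  E.Vm <- EVm(model, m.range, N.small)  # expected spectrum elements

  # each entry is TRUE when the corresponding condition holds
  checks <- c(
    "E[V] <= N"           = E.V <= N.small,
    "sum E[V_m] <= E[V]"  = sum(E.Vm) <= E.V,
    "sum m*E[V_m] <= N"   = sum(m.range * E.Vm) <= N.small
  )
  print(checks)
}
```

Repeating the check with exact=TRUE shows whether any violation is caused by the approximations.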

Details

Currently, the following LNRE models are supported by the zipfR package:

The Zipf-Mandelbrot (ZM) LNRE model (see lnre.zm for details).

The finite Zipf-Mandelbrot (fZM) LNRE model (see lnre.fzm for details).

The Generalized Inverse Gauss-Poisson (GIGP) LNRE model (see lnre.gigp for details).

If explicit model parameters are specified in addition to an observed frequency spectrum spc, these parameters are fixed to the given values and are excluded from the estimation procedure. This feature can be useful if fully automatic parameter estimation leads to a poor or counterintuitive fit.
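
As an illustration of fixing a parameter, the sketch below holds alpha of an fZM model at a given value and estimates the remaining parameters from the spectrum. It assumes the zipfR package and its Dickens.spc data set are available. The value 0.6 is an arbitrary choice and may well produce a worse fit than fully automatic estimation.

```r
# Sketch only: requires the zipfR package.
if (requireNamespace("zipfR", quietly = TRUE)) {
  library(zipfR)
  data(Dickens.spc)

  # alpha is fixed at 0.6 (arbitrary illustrative value);
  # the remaining fZM parameters are estimated from the spectrum
  m.fixed <- lnre("fzm", Dickens.spc, alpha = 0.6)
  print(m.fixed)

  # fully automatic estimation of all parameters, for comparison
  m.free <- lnre("fzm", Dickens.spc)
  print(m.free)
}
```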

Value

An object of a suitable subclass of lnre, depending on the type argument (e.g. lnre.fzm for type="fzm"). This object represents a LNRE model of the selected type with the specified parameter values, or with parameter values estimated from the observed frequency spectrum spc.

The internal structure of lnre objects is described on the lnre.details manpage (intended for developers).

Parameter Estimation

Automatic parameter estimation for LNRE models is performed by matching the expected vocabulary size and frequency spectrum of the model against the observed data passed in the spc argument.

For this purpose, a cost function has to be defined as a measure of the "distance" between observed and expected frequency spectrum. Parameters are then estimated by applying a minimization algorithm in order to find those parameter values that lead to the smallest possible cost.

Parameter estimation is a crucial and often delicate step in the application of LNRE models. Depending on the shape of the observed frequency spectrum, the automatic estimation procedure may result in a poor or counterintuitive fit, or may fail altogether.

Usually, multiple runs of the minimization are performed with different random start values. An error will only be reported if all the estimation runs fail. Such multiple runs have not been implemented for the Custom minimization method yet; please specify runs=1 in this case.
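
The multi-start strategy described above can be sketched in base R as follows. This is not zipfR's internal code: toy.cost is a hypothetical stand-in for an LNRE cost function, and multi.start is a helper name invented for this illustration. It keeps the run with the smallest cost and reports an error only if every run fails.

```r
# Hypothetical stand-in for a cost function (minimum at par = c(1, -2))
toy.cost <- function(par) {
  (par[1] - 1)^2 + (par[2] + 2)^2
}

# Run the minimization several times from random start values and
# keep the result with the smallest cost (illustrative helper)
multi.start <- function(cost, n.par, runs = 5) {
  best <- NULL
  for (i in seq_len(runs)) {
    start <- rnorm(n.par)  # random initialization
    res <- tryCatch(
      optim(start, cost, method = "Nelder-Mead"),
      error = function(e) NULL  # a failed run is skipped
    )
    if (is.null(res)) next
    if (is.null(best) || res$value < best$value) best <- res
  }
  if (is.null(best)) stop("parameter estimation failed in all runs")
  best
}

set.seed(42)
fit <- multi.start(toy.cost, n.par = 2, runs = 5)
print(fit$par)
```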

Users can influence parameter estimation by choosing from a range of predefined cost functions and from several minimization algorithms, as described in the following sections. Some experimentation with the cost, m.max and method arguments will often help to resolve estimation failures and may result in a considerably better goodness-of-fit.

Cost Functions

The following cost functions are available and can be selected with the cost argument. All functions are based on the differences between observed and expected values for vocabulary size and the first elements of the frequency spectrum (V_1, …, V_m, where m is given by the m.max argument):

gof:

the multivariate chi-squared statistic used for goodness-of-fit testing (lnre.goodness.of.fit). This cost function corresponds (almost) to maximum-likelihood parameter estimation and is used by default.

chisq:

cost function based on a simplified version of the multivariate chi-squared test for goodness-of-fit (assuming independence between the random variables V_m).

linear:

linear cost function, which sums over the absolute differences between observed and expected values. This cost function puts more weight on fitting the vocabulary size and the first few elements of the frequency spectrum (where absolute differences are much larger than for higher spectrum elements).

smooth.linear:

modified version of the linear cost function, which smooths the kink of the absolute value function for a difference of 0 (since non-differentiable cost functions might be problematic for gradient-based minimization algorithms)

mse:

mean squared error cost function, averaging over the squares of differences between observed and expected values. This cost function penalizes large absolute differences more heavily than linear cost (and therefore puts even greater weight on fitting vocabulary size and the first spectrum elements).

exact:

this "virtual" cost function attempts to match the observed vocabulary size and first spectrum elements exactly, ignoring differences for all higher spectrum elements. This is achieved by adjusting the value of m.max automatically, depending on the number of free parameters that are estimated (in general, the number of constraints that can be satisfied by estimating parameters is the same as the number of free parameters). Having adjusted m.max, the mse cost function is used to determine parameter values, so that the estimation procedure will not fail even if the constraints cannot be matched exactly.
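
To make the difference between the linear, smooth.linear and mse costs concrete, the following base R sketch evaluates simplified versions of them on made-up observed and expected values. The formulas are assumptions for illustration, not zipfR's exact implementation; in particular, the smoothing sqrt(d^2 + eps) is only one common way to round off the kink of the absolute value.

```r
# Made-up observed and expected values for V and V_1 ... V_3
observed <- c(V = 1000, V1 = 500, V2 = 150, V3 = 80)
expected <- c(V = 990,  V1 = 510, V2 = 148, V3 = 83)
d <- observed - expected

# Simplified cost functions (assumed forms, for illustration only)
cost.linear <- function(d) sum(abs(d))
cost.smooth.linear <- function(d, eps = 1) sum(sqrt(d^2 + eps))
cost.mse <- function(d) mean(d^2)

costs <- c(
  linear        = cost.linear(d),
  smooth.linear = cost.smooth.linear(d),
  mse           = cost.mse(d)
)
print(costs)
```

The large differences for V and V1 dominate all three values, and most strongly the squared-error cost, which matches the weighting described above.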

Alternatively, a user-defined cost function can be passed as a function object with signature 'cost(model, spc, m.max)', which compares the LNRE model 'model' against the observed frequency spectrum 'spc' and returns a cost value (i.e. lower cost indicates a better fit).
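
A user-defined cost function with this signature might look like the sketch below, which assumes the zipfR package and its Dickens.spc data set are available. The function rel.cost and its relative-error formula are hypothetical choices made for this example; only the signature cost(model, spc, m.max) comes from the description above.

```r
# Sketch only: requires the zipfR package.
if (requireNamespace("zipfR", quietly = TRUE)) {
  library(zipfR)
  data(Dickens.spc)

  # Hypothetical cost: sum of squared relative errors for V and V_1 ... V_m.max
  rel.cost <- function(model, spc, m.max) {
    n <- N(spc)  # sample size of the observed spectrum
    observed <- c(V(spc), Vm(spc, 1:m.max))
    expected <- c(EV(model, n), EVm(model, 1:m.max, n))
    # pmax() guards against division by zero for empty spectrum elements
    sum(((expected - observed) / pmax(observed, 1))^2)
  }

  m.rel <- lnre("gigp", Dickens.spc, cost = rel.cost, m.max = 10)
  print(m.rel)
}
```

Relative errors give the higher spectrum elements more influence than the absolute differences used by the linear and mse costs.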

Minimization Algorithms

Several different minimization algorithms can be used for parameter estimation and are selected with the method argument:

Nelder-Mead:

the Nelder-Mead algorithm, implemented by the optim function, performs minimization without using derivatives. Parameter estimation is therefore very robust, while almost as fast and accurate as the NLM method. Nelder-Mead is the default algorithm and is also used internally by most custom minimization procedures (see below).

NLM:

a standard Newton-type algorithm for nonlinear minimization, implemented by the nlm function, which makes use of numerical derivatives of the cost function. NLM minimization converges quickly and obtains very precise parameter estimates (for a local minimum of the cost function), but it is not very stable and may cause parameter estimation to fail altogether.

SANN:

minimization by simulated annealing, also provided by the optim function. Like Nelder-Mead, this algorithm is very robust because it avoids numerical derivatives, but convergence is extremely slow. In some cases, SANN might produce a better fit than Nelder-Mead (if the latter converges to a suboptimal local minimum).

BFGS:

a quasi-Newton method developed by Broyden, Fletcher, Goldfarb and Shanno. This minimization algorithm is efficient, but should be applied with care as it will often overshoot the valid range of parameter values.

Custom:

a custom estimation procedure provided for certain types of LNRE model, which may exploit special mathematical properties of the model in order to calculate one or more of the parameter values directly. For example, one parameter of the ZM and fZM models can easily be determined from the constraint E[V] = V (but note that this additional constraint leads to a different fit than is obtained by plain minimization of the cost function!). Custom estimation might also apply special configuration settings to improve convergence of the minimization process, based on knowledge about the valid ranges and "behaviour" of model parameters. If no custom estimation procedure has been implemented for the selected LNRE model, lnre falls back on the Nelder-Mead or NLM algorithm.

See the nlm and optim manpages for more information about the minimization algorithms used and key references.
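
The general-purpose optimizers named above can be compared outside zipfR on a toy problem. The base R sketch below minimizes the Rosenbrock function, a standard test function chosen here for illustration and not an LNRE cost function, with optim and nlm. The SANN result depends on the random seed.

```r
# Rosenbrock test function (global minimum 0 at c(1, 1))
rosenbrock <- function(p) 100 * (p[2] - p[1]^2)^2 + (1 - p[1])^2
start <- c(-1.2, 1)

set.seed(1)  # SANN is stochastic
fits <- list(
  "Nelder-Mead" = optim(start, rosenbrock, method = "Nelder-Mead"),
  "BFGS"        = optim(start, rosenbrock, method = "BFGS"),
  "SANN"        = optim(start, rosenbrock, method = "SANN")
)
nlm.fit <- nlm(rosenbrock, start)  # Newton-type minimization

# Smallest cost value reached by each algorithm
minima <- c(sapply(fits, function(f) f$value), NLM = nlm.fit$minimum)
print(signif(minima, 3))
```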

See Also

Detailed descriptions of the different LNRE models provided by zipfR and their parameters can be found on the manpages lnre.zm, lnre.fzm and lnre.gigp.

Useful methods for trained models are lnre.spc, lnre.vgc, EV, EVm, VV, VVm. Suitable implementations of the print and summary methods are also provided (see print.lnre for details), as well as for plotting (see plot.lnre). Note that the methods N, V and Vm can be applied to LNRE models with estimated parameters and return information about the observed frequency spectrum used for parameter estimation.

If bootstrapping samples have been generated (bootstrap > 0), confidence intervals for the model parameters can be determined with confint.lnre. See lnre.bootstrap for more information on the bootstrapping procedure and implementation.

The lnre.details manpage gives details about the implementation of LNRE models and the internal structure of lnre objects, while estimate.model has more information on the parameter estimation procedure (both manpages are intended for developers).

See lnre.goodness.of.fit for a complete description of the goodness-of-fit test that is automatically performed after parameter estimation (and which is reported in the summary of the LNRE model). This function can also be used to evaluate the predictions of the LNRE model on a different data set than the one used for parameter estimation.

Examples

## load zipfR and the Dickens dataset
library(zipfR)
data(Dickens.spc)

## estimate parameters of GIGP model and show summary
m <- lnre("gigp", Dickens.spc)
m


## N, V and V1 of spectrum used to compute model
## (should be the same as for Dickens.spc)
N(m)
V(m)
Vm(m,1)


## expected V and V_m and their variances for arbitrary N 
EV(m,100e6)
VV(m,100e6)
EVm(m,1,100e6)
VVm(m,1,100e6)

## use only 10 instead of 15 spectrum elements to estimate model
## (note how fit improves for V and V1)
m.10 <- lnre("gigp", Dickens.spc, m.max=10)
m.10

## experiment with different cost functions
m.mse <- lnre("gigp", Dickens.spc, cost="mse")
m.mse
m.exact <- lnre("gigp", Dickens.spc, cost="exact")
m.exact


## NLM minimization algorithm is faster but less robust
m.nlm <- lnre("gigp", Dickens.spc, method="NLM")
m.nlm

## ZM and fZM LNRE models have special estimation algorithms
m.zm <- lnre("zm", Dickens.spc)
m.zm
m.fzm <- lnre("fzm", Dickens.spc)
m.fzm


## estimation is much faster if approximations are allowed
m.approx <- lnre("fzm", Dickens.spc, exact=FALSE)
m.approx


## specify parameters of LNRE models directly
m <- lnre("zm", alpha=.5, B=.01)
lnre.spc(m, N=1000, m.max=10)

m <- lnre("fzm", alpha=.5, A=1e-6, B=.01)
lnre.spc(m, N=1000, m.max=10)

m <- lnre("gigp", gamma=-.5, B=.01, C=.01)
lnre.spc(m, N=1000, m.max=10)

## bootstrapped confidence intervals for model parameters
## Not run: 
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=40)
confint(model, "alpha") # Zipf slope
confint(model, "S")     # population diversity
confint(model, "S", method="normal") # Gaussian approx works well in this case

## speed up with parallelisation (see ?lnre.bootstrap for more information)
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=40, 
              parallel=8) # on Linux / MacOS with 8 available cores
## End(Not run)

zipfR documentation built on Jan. 8, 2021, 2:37 a.m.