gaitdpoisson: Generally Altered, Inflated, Truncated and Deflated Poisson...

View source: R/family.gaitd.R

gaitdpoissonR Documentation

Generally Altered, Inflated, Truncated and Deflated Poisson Regression

Description

Fits a generally altered, inflated, truncated and deflated Poisson regression by MLE. The GAITD combo model having 7 types of special values is implemented. This allows mixtures of Poissons on nested and/or partitioned support as well as a multinomial logit model for (nonparametric) altered, inflated and deflated values. Truncation may include the upper tail.

Usage

gaitdpoisson(a.mix = NULL, i.mix = NULL, d.mix = NULL,
     a.mlm = NULL, i.mlm = NULL, d.mlm = NULL,
     truncate = NULL, max.support = Inf,
     zero = c("pobs", "pstr", "pdip"),
     eq.ap = TRUE, eq.ip = TRUE, eq.dp = TRUE,
     parallel.a = FALSE, parallel.i = FALSE, parallel.d = FALSE,
     llambda.p = "loglink", llambda.a = llambda.p,
     llambda.i = llambda.p, llambda.d = llambda.p,
     type.fitted = c("mean", "lambdas", "pobs.mlm", "pstr.mlm",
     "pdip.mlm", "pobs.mix", "pstr.mix", "pdip.mix",
     "Pobs.mix", "Pstr.mix", "Pdip.mix", "nonspecial",
     "Numer", "Denom.p", "sum.mlm.i", "sum.mix.i",
     "sum.mlm.d", "sum.mix.d", "ptrunc.p",
     "cdf.max.s"), gpstr.mix = ppoints(7) / 3,
     gpstr.mlm = ppoints(7) / (3 + length(i.mlm)),
     imethod = 1, mux.init = c(0.75, 0.5, 0.75),
     ilambda.p = NULL, ilambda.a = ilambda.p,
     ilambda.i = ilambda.p, ilambda.d = ilambda.p,
     ipobs.mix = NULL, ipstr.mix = NULL, ipdip.mix = NULL,
     ipobs.mlm = NULL, ipstr.mlm = NULL, ipdip.mlm = NULL,
     byrow.aid = FALSE, ishrinkage = 0.95, probs.y = 0.35)

Arguments

truncate, max.support

Vector of truncated values, i.e., nonnegative integers. For the first seven arguments (for the special values) a NULL stands for an empty set, and the seven sets must be mutually disjoint. Argument max.support enables RHS-truncation, i.e., something equivalent to truncate = (U+1):Inf for some upper support point U specified by max.support.

a.mix, i.mix, d.mix

Vector of altered and inflated values corresponding to finite mixture models. These are described as parametric or structured.

The parameter lambda.p is always estimated. If length(a.mix) is 1 or more then the parameter pobs.mix is estimated. If length(i.mix) is 1 or more then the parameter pstr.mix is estimated. If length(d.mix) is 1 or more then the parameter pdip.mix is estimated.

If length(a.mix) is 2 or more then the parameter lambda.a is estimated. If length(i.mix) is 2 or more then the parameter lambda.i is estimated. If length(d.mix) is 2 or more then the parameter lambda.d is estimated.

If length(a.mix) == 1, length(i.mix) == 1 or length(d.mix) == 1 then lambda.a, lambda.i and lambda.d are unidentifiable and therefore ignored. In such cases it would be equivalent to moving a.mix into a.mlm, etc.

Due to its great flexibility, it is easy to misuse this function and ideally the values of the above arguments should be well justified by the application on hand. Adding inappropriate or unnecessary values to these arguments willy-nilly is a recipe for disaster, especially for i.mix and d.mix. Using a.mlm effectively removes a subset of the data from the main analysis, therefore may result in a substantial loss of efficiency. For seeped values, a.mix, a.mlm, d.mix and d.mlm can be used only. Heaped values can be handled by i.mlm and i.mix, as well as a.mix and a.mlm. Because of the NBP reason below, it sometimes may be necessary to specify deflated values to altered values.

a.mlm, i.mlm, d.mlm

Vector of altered, inflated and deflated values corresponding to the multinomial logit model (MLM) probabilities of observing those values—see multinomial. These are described as nonparametric or unstructured.

llambda.p, llambda.a, llambda.i, llambda.d

Link functions for the parent, altered, inflated and deflated distributions respectively. See Links for more choices and information.

eq.ap, eq.ip, eq.dp

Single logical each. Constrain the rate parameters to be equal? See CommonVGAMffArguments for information. Having all three arguments TRUE gives greater stability in the estimation because of fewer parameters and therefore fewer initial values needed, however if so then one should try relax some of the arguments later.

For the GIT–Pois submodel, after plotting the responses, if the distribution of the spikes above the nominal probabilities has roughly the same shape as the ordinary values then setting eq.ip = TRUE would be a good idea so that lambda.i == lambda.p. And if i.mix is of length 2 or a bit more, then TRUE should definitely be entertained. Likewise, for heaped or seeped data, setting eq.ap = TRUE (so that lambda.p == lambda.p) would be a good idea for the GAT–Pois if the shape of the altered probabilities is roughly the same as the parent distribution.

parallel.a, parallel.i, parallel.d

Single logical each. Constrain the MLM probabilities to be equal? If so then this applies to all length(a.mlm) pobs.mlm probabilities or all length(i.mlm) pstr.mlm probabilities or all length(d.mlm) pdip.mlm probabilities. See CommonVGAMffArguments for information. The default means that the probabilities are generally unconstrained and unstructured and will follow the shape of the data. See constraints.

type.fitted

See CommonVGAMffArguments and below for information. The first value is the default, and this is usually the unconditional mean. Choosing an irrelevant value may result in an NA being returned and a warning, e.g., "pstr.mlm" for a nonparametric GAT model.

The choice "lambdas" returns a matrix with at least one column and up to three others, corresponding to all those estimated. In order, their colnames are "lambda.p", "lambda.a", "lambda.i" and "lambda.d". For other distributions such as gaitdlog type.fitted = "shapes" is permitted and the colnames are "shape.p", "shape.a", "shape.i" and "shape.d", etc.

Option "Pobs.mix" provides more detail about "pobs.mix" by returning a matrix whose columns correspond to each altered value; the row sums (rowSums) of this matrix is "pobs.mix". Likewise "Pstr.mix" about "pstr.mix" and "Pdip.mix" about "pdip.mix".

The choice "cdf.max.s" is the CDF evaluated at max.support using the parent distribution, e.g., ppois(max.support, lambda.p) for gaitdpoisson. The value should be 1 if max.support = Inf (the default). The choice "nonspecial" is the probability of a nonspecial value. The choices "Denom.p" and "Numer" are quantities found in the GAITD combo PMF and are for convenience only.

The choice type.fitted = "pobs.mlm" returns a matrix whose columns are the altered probabilities (Greek symbol \omega_s). The choice "pstr.mlm" returns a matrix whose columns are the inflated probabilities (Greek symbol \phi_s). The choice "pdip.mlm" returns a matrix whose columns are the deflated probabilities (Greek symbol \psi_s).

The choice "ptrunc.p" returns the probability of having a truncated value with respect to the parent distribution. It includes any truncated values in the upper tail beyond max.support. The probability of a value less than or equal to max.support with respect to the parent distribution is "cdf.max.s".

The choice "sum.mlm.i" adds two terms. This gives the probability of an inflated value, and the formula can be loosely written down as something like "pstr.mlm" + "Numer" * dpois(i.mlm, lambda.p) / "Denom.p". The other three "sum.m*" arguments are similar.

gpstr.mix, gpstr.mlm

See CommonVGAMffArguments for information. Gridsearch values for the two parameters. If failure occurs try a finer grid, especially closer to 0, and/or experiment with mux.init.

imethod, ipobs.mix, ipstr.mix, ipdip.mix

See CommonVGAMffArguments for information. Good initial values are difficult to compute because of the great flexibility of GAITD regression, therefore it is often necessary to use these arguments. A careful examination of a spikeplot of the data should lead to good choices.

ipobs.mlm, ipstr.mlm, ipdip.mlm

See CommonVGAMffArguments for information.

mux.init

Numeric, of length 3. General downward multiplier for initial values for the sample proportions (MLEs actually). This is under development and more details are forthcoming. In general, 1 means unchanged and values should lie in (0, 1], and values about 0.5 are recommended. The elements apply in order to altered, inflated and deflated (no distinction between mix and MLM).

ilambda.p, ilambda.a, ilambda.i, ilambda.d

Initial values for the rate parameters; see CommonVGAMffArguments for information.

probs.y, ishrinkage

See CommonVGAMffArguments for information.

byrow.aid

Details are at Gaitdpois.

zero

See CommonVGAMffArguments for information. By default, all the MLM probabilities are modelled as simple as possible (intercept-only) to help avoid numerical problems, especially when there are many covariates. The Poisson means are modelled by the covariates, and the default zero vector is pruned of any irrelevant values. To model all the MLM probabilities with covariates set zero = NULL, however, the number of regression coefficients could be excessive.

For the MLM probabilities, to model pobs.mix only with covariates set zero = c('pstr', 'pobs.mlm', 'pdip'). Likewise, to model pstr.mix only with covariates set zero = c('pobs', 'pstr.mlm', 'pdip').

It is noted that, amongst other things, zipoisson and zipoissonff differ with respect to zero, and ditto for zapoisson and zapoissonff.

Details

The full GAITD–Pois combo model may be fitted with this family function. There are seven types of special values and all arguments for these may be used in a single model. Here, the MLM represents the nonparametric while the Pois refers to the Poisson mixtures. The defaults for this function correspond to an ordinary Poisson regression so that poissonff is called instead. A MLM with only one probability to model is equivalent to logistic regression (binomialff and logitlink).

The order of the linear/additive predictors is best explained by an example. Suppose a combo model has length(a.mix) > 2 and length(i.mix) > 2, length(d.mix) > 2, a.mlm = 3:5, i.mlm = 6:9 and d.mlm = 10:12, say. Then loglink(lambda.p) is the first. The second is multilogitlink(pobs.mix) followed by loglink(lambda.a) because a.mix is long enough. The fourth is multilogitlink(pstr.mix) followed by loglink(lambda.i) because i.mix is long enough. The sixth is multilogitlink(pdip.mix) followed by loglink(lambda.d) because d.mix is long enough. Next are the probabilities for the a.mlm values. Then are the probabilities for the i.mlm values. Lastly are the probabilities for the d.mlm values. All the probabilities are estimated by one big MLM and effectively the "(Others)" column of left over probabilities is associated with the nonspecial values. These might be called the nonspecial baseline probabilities (NBP). The dimension of the vector of linear/additive predictors here is M=17.

Two mixture submodels that may be fitted can be abbreviated GAT–Pois or GIT–Pois. For the GAT model the distribution being fitted is a (spliced) mixture of two Poissons with differing (partitioned) support. Likewise, for the GIT model the distribution being fitted is a mixture of two Poissons with nested support. The two rate parameters may be constrained to be equal using eq.ap and eq.ip.

A good first step is to apply spikeplot for selecting candidate values for altering, inflating and deflating. Deciding between parametrically or nonparametrically can also be determined from examining the spike plot. Misspecified a.mix/a.mlm/i.mix/i.mlm/d.mix/d.mlm will result in convergence problems (setting trace = TRUE is a very good idea.) This function currently does not handle multiple responses. Further details are at Gaitdpois.

A well-conditioned data–model combination should pose no difficulties for the automatic starting value selection being successful. Failure to obtain initial values from this self-starting family function indicates the degree of inflation/deflation may be marginal and/or a misspecified model. If this problem is worth surmounting the arguments to focus on especially are mux.init, gpstr.mix, gpstr.mlm, ipdip.mix and ipdip.mlm. See below for the stepping-stone trick.

Apart from the order of the linear/additive predictors, the following are (or should be) equivalent: gaitdpoisson() and poissonff(), gaitdpoisson(a.mix = 0) and zapoisson(zero = "pobs0"), gaitdpoisson(i.mix = 0) and zipoisson(zero = "pstr0"), gaitdpoisson(truncate = 0) and pospoisson(). Likewise, if a.mix and i.mix are assigned a scalar then it effectively moves that scalar to a.mlm and i.mlm because there is no lambda.a or lambda.i being estimated. Thus gaitdpoisson(a.mix = 0) and gaitdpoisson(a.mlm = 0) are the effectively same, and ditto for gaitdpoisson(i.mix = 0) and gaitdpoisson(i.mlm = 0).

Value

An object of class "vglmff" (see vglmff-class). The object is used by modelling functions such as vglm, rrvglm and vgam.

The fitted.values slot of the fitted object, which should be extracted by the generic function fitted, returns the mean \mu by default. See the information above on type.fitted.

Warning

Amateurs tend to be overzealous fitting zero-inflated models when the fitted mean is low—the warning of ziP should be heeded. For GAITD regression the warning applies more strongly and generally; here to all i.mix, i.mlm, d.mix and d.mlm values, not just 0. Even one misspecified special value usually will cause convergence problems.

Default values for this and similar family functions may change in the future, e.g., eq.ap and eq.ip. Important internal changes might occur too, such as the ordering of the linear/additive predictors and the quantities returned as the fitted values.

Using i.mlm requires more caution than a.mlm because gross inflation is ideally needed for it to work safely. Ditto for i.mix versus a.mix. Data exhibiting deflation or little to no inflation will produce numerical problems, hence set trace = TRUE to monitor convergence. More than c.10 IRLS iterations should raise suspicion.

Ranking the four operators by difficulty, the easiest is truncation followed by alteration, then inflation and the most difficult is deflation. The latter needs good initial values and the current default will probably not work on some data sets. Studying the spikeplot is time very well spent. In general it is very easy to specify an overfitting model so it is a good idea to split the data into training and test sets.

This function is quite memory-hungry with respect to length(c(a.mix, i.mix, d.mix, a.mlm, i.mlm, d.mlm)). On consuming something different, because all values of the NBP vector need to be positive it pays to be economical with respect to d.mlm especially so that one does not consume up probabilities unnecessarily so to speak.

It is often a good idea to set eq.ip = TRUE, especially when length(i.mix) is not much more than 2 or the values of i.mix are not spread over the range of the response. This way the estimation can borrow strength from both the inflated and non-inflated values. If the i.mix values form a single small cluster then this can easily create estimation difficulties—the idea is somewhat similar to multicollinearity. The same holds for d.mix.

Note

Numerical problems can easily arise because of the exceeding flexibility of this distribution and/or the lack of sizeable inflation/deflation; it is a good idea to gain experience with simulated data first before applying it to real data. Numerical problems may arise if any of the special values are in remote places of the support, e.g., a value y such that dpois(y, lambda.p) is very close to 0. This is because the ratio of two tiny values can be unstable.

Good initial values may be difficult to obtain using self-starting procedures, especially when there are covariates. If so, then it is advisable to use a trick: fit an intercept-only model first and then use etastart = predict(int.only.model) to fit the model with covariates. This uses the simpler model as a stepping-stone.

The labelling of the linear/additive predictors has been abbreviated to reduce space. For example, multilogitlink(pobs.mix) and multilogitlink(pstr.mix) would be more accurately multilogitlink(cbind(pobs.mix, pstr.mix)) because one grand MLM is fitted. This shortening may result in modifications needed in other parts of VGAM to compensate.

Because estimation involves a MLM, the restricted parameter space means that if the dip probabilities are large then the NBP may become too close to 0. If this is so then there are tricks to avoid a negative NBP. One of them is to model as many values of d.mlm as d.mix, hence the dip probabilities become modelled via the deflation distribution instead. Another trick to alter those special values rather than deflating them if the dip probabilities are large.

Due to its complexity, the HDE test hdeff is currently unavailable for GAITD regressions.

Randomized quantile residuals (RQRs) are available; see residualsvglm.

Author(s)

T. W. Yee

References

Yee, T. W. and Ma, C. (2023) Generally altered, inflated, truncated and deflated regression. In preparation.

See Also

Gaitdpois, multinomial, rootogram4, specials, plotdgaitd, spikeplot, meangaitd, KLD, goffset, Trunc, gaitdnbinomial, gaitdlog, gaitdzeta, multilogitlink, multinomial, residualsvglm, poissonff, zapoisson, zipoisson, pospoisson, CommonVGAMffArguments, simulate.vlm.

Examples

i.mix <- c(5, 10)  # Inflate these values parametrically
i.mlm <- c(14, 15)  # Inflate these values
a.mix <- c(1, 13)  # Alter these values
tvec <- c(3, 11)   # Truncate these values
pstr.mlm <- 0.1  # So parallel.i = TRUE
pobs.mix <- pstr.mix <- 0.1
max.support <- 20; set.seed(1)
gdata <- data.frame(x2 = runif(nn <- 1000))
gdata <- transform(gdata, lambda.p = exp(2 + 0.0 * x2))
gdata <- transform(gdata,
  y1 = rgaitdpois(nn, lambda.p, a.mix = a.mix, i.mix = i.mix,
                  pobs.mix = pobs.mix, pstr.mix = pstr.mix,
                  i.mlm = i.mlm, pstr.mlm = pstr.mlm,
                  truncate = tvec, max.support = max.support))
gaitdpoisson(a.mix = a.mix, i.mix = i.mix, i.mlm = i.mlm)
with(gdata, table(y1))
fit1 <- vglm(y1 ~ 1, crit = "coef", trace = TRUE, data = gdata,
             gaitdpoisson(a.mix = a.mix, i.mix = i.mix,
                          i.mlm = i.mlm, parallel.i = TRUE,
                          eq.ap = TRUE, eq.ip = TRUE, truncate =
                          tvec, max.support = max.support))
head(fitted(fit1, type.fitted = "Pstr.mix"))
head(predict(fit1))
t(coef(fit1, matrix = TRUE))  # Easier to see with t()
summary(fit1)  # No HDE test by default but HDEtest = TRUE is ideal
## Not run:  spikeplot(with(gdata, y1), lwd = 2)
plotdgaitd(fit1, new.plot = FALSE, offset.x = 0.2, all.lwd = 2)  
## End(Not run)

VGAM documentation built on Sept. 19, 2023, 9:06 a.m.