gaitdpoisson | R Documentation |
Fits a generally altered, inflated, truncated and deflated Poisson regression by MLE. The GAITD combo model having 7 types of special values is implemented. This allows mixtures of Poissons on nested and/or partitioned support as well as a multinomial logit model for (nonparametric) altered, inflated and deflated values. Truncation may include the upper tail.
gaitdpoisson(a.mix = NULL, i.mix = NULL, d.mix = NULL,
a.mlm = NULL, i.mlm = NULL, d.mlm = NULL,
truncate = NULL, max.support = Inf,
zero = c("pobs", "pstr", "pdip"),
eq.ap = TRUE, eq.ip = TRUE, eq.dp = TRUE,
parallel.a = FALSE, parallel.i = FALSE, parallel.d = FALSE,
llambda.p = "loglink", llambda.a = llambda.p,
llambda.i = llambda.p, llambda.d = llambda.p,
type.fitted = c("mean", "lambdas", "pobs.mlm", "pstr.mlm",
"pdip.mlm", "pobs.mix", "pstr.mix", "pdip.mix",
"Pobs.mix", "Pstr.mix", "Pdip.mix", "nonspecial",
"Numer", "Denom.p", "sum.mlm.i", "sum.mix.i",
"sum.mlm.d", "sum.mix.d", "ptrunc.p",
"cdf.max.s"), gpstr.mix = ppoints(7) / 3,
gpstr.mlm = ppoints(7) / (3 + length(i.mlm)),
imethod = 1, mux.init = c(0.75, 0.5, 0.75),
ilambda.p = NULL, ilambda.a = ilambda.p,
ilambda.i = ilambda.p, ilambda.d = ilambda.p,
ipobs.mix = NULL, ipstr.mix = NULL, ipdip.mix = NULL,
ipobs.mlm = NULL, ipstr.mlm = NULL, ipdip.mlm = NULL,
byrow.aid = FALSE, ishrinkage = 0.95, probs.y = 0.35)
truncate , max.support |
Vector of truncated values, i.e., nonnegative integers.
For the first seven arguments (for the special values)
a |
a.mix , i.mix , d.mix |
Vector of altered and inflated values corresponding to finite mixture models. These are described as parametric or structured. The parameter If If Due to its great flexibility, it is easy to misuse this function
and ideally the values of the above arguments should be well
justified by the application on hand.
Adding inappropriate or
unnecessary values to these arguments willy-nilly
is a recipe for disaster, especially for
|
a.mlm , i.mlm , d.mlm |
Vector of altered, inflated and deflated values corresponding
to the multinomial logit model (MLM) probabilities of
observing those values—see |
llambda.p , llambda.a , llambda.i , llambda.d |
Link functions for the parent,
altered, inflated and deflated distributions respectively.
See |
eq.ap , eq.ip , eq.dp |
Single logical each.
Constrain the rate parameters to be equal?
See For the GIT–Pois submodel,
after plotting the responses,
if the distribution of the spikes
above the nominal probabilities
has roughly the same shape
as the ordinary values then setting
|
parallel.a , parallel.i , parallel.d |
Single logical each.
Constrain the MLM probabilities to be equal?
If so then this applies to all
|
type.fitted |
See The choice Option The choice The choice The choice The choice |
gpstr.mix , gpstr.mlm |
See |
imethod , ipobs.mix , ipstr.mix , ipdip.mix |
See |
ipobs.mlm , ipstr.mlm , ipdip.mlm |
See |
mux.init |
Numeric, of length 3. General downward multiplier for initial values for the sample proportions (MLEs actually). This is under development and more details are forthcoming. In general, 1 means unchanged and values should lie in (0, 1], and values about 0.5 are recommended. The elements apply in order to altered, inflated and deflated (no distinction between mix and MLM). |
ilambda.p , ilambda.a , ilambda.i , ilambda.d |
Initial values for the rate parameters;
see |
probs.y , ishrinkage |
See |
byrow.aid |
Details are at |
zero |
See For the MLM probabilities,
to model It is noted that, amongst other things,
|
The full
GAITD–Pois combo model
may be fitted with this family function.
There are seven types of special values and all arguments for these
may be used in a single model.
Here, the MLM represents the nonparametric while the Pois
refers to the Poisson mixtures.
The defaults for this function correspond to an ordinary Poisson
regression so that poissonff
is called instead.
A MLM with only one probability to model is equivalent to
logistic regression
(binomialff
and logitlink
).
The order of the linear/additive predictors is best
explained by an example.
Suppose a combo model has
length(a.mix) > 2
and
length(i.mix) > 2
,
length(d.mix) > 2
,
a.mlm = 3:5
,
i.mlm = 6:9
and
d.mlm = 10:12
, say.
Then loglink(lambda.p)
is the first.
The second is multilogitlink(pobs.mix)
followed
by loglink(lambda.a)
because a.mix
is long enough.
The fourth is multilogitlink(pstr.mix)
followed
by loglink(lambda.i)
because i.mix
is long enough.
The sixth is multilogitlink(pdip.mix)
followed
by loglink(lambda.d)
because d.mix
is long enough.
Next are the probabilities for the a.mlm
values.
Then are the probabilities for the i.mlm
values.
Lastly are the probabilities for the d.mlm
values.
All the probabilities are estimated by one big MLM
and effectively
the "(Others)"
column of left over probabilities is
associated with the nonspecial values.
These might be called the
nonspecial baseline probabilities (NBP).
The dimension of the vector of linear/additive predictors here
is M=17
.
Two mixture submodels that may be fitted can be abbreviated
GAT–Pois or
GIT–Pois.
For the GAT model
the distribution being fitted is a (spliced) mixture
of two Poissons with differing (partitioned) support.
Likewise, for the GIT model
the distribution being fitted is a mixture
of two Poissons with nested support.
The two rate parameters may be constrained to be equal using
eq.ap
and eq.ip
.
A good first step is to apply spikeplot
for selecting
candidate values for altering, inflating and deflating.
Deciding between
parametrically or nonparametrically can also be determined from
examining the spike plot. Misspecified
a.mix/a.mlm/i.mix/i.mlm/d.mix/d.mlm
will result in convergence problems
(setting trace = TRUE
is a very good idea.)
This function currently does not handle multiple responses.
Further details are at Gaitdpois
.
A well-conditioned data–model combination should pose no
difficulties for the automatic starting value selection
being successful.
Failure to obtain initial values from this self-starting
family function indicates the degree of inflation/deflation
may be marginal and/or a misspecified model.
If this problem is worth surmounting
the arguments to focus on especially are
mux.init
,
gpstr.mix
, gpstr.mlm
,
ipdip.mix
and ipdip.mlm
.
See below for the stepping-stone trick.
Apart from the order of the linear/additive predictors,
the following are (or should be) equivalent:
gaitdpoisson()
and poissonff()
,
gaitdpoisson(a.mix = 0)
and zapoisson(zero = "pobs0")
,
gaitdpoisson(i.mix = 0)
and zipoisson(zero = "pstr0")
,
gaitdpoisson(truncate = 0)
and pospoisson()
.
Likewise, if
a.mix
and i.mix
are assigned a scalar then
it effectively moves that scalar to a.mlm
and i.mlm
because there is no lambda.a
or lambda.i
being estimated.
Thus
gaitdpoisson(a.mix = 0)
and gaitdpoisson(a.mlm = 0)
are the effectively same, and ditto for
gaitdpoisson(i.mix = 0)
and gaitdpoisson(i.mlm = 0)
.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions
such as vglm
,
rrvglm
and vgam
.
The fitted.values
slot of the fitted object,
which should be extracted by the
generic function fitted
,
returns the mean \mu
by default.
See the information above on type.fitted
.
Amateurs tend to be overzealous fitting
zero-inflated models when the
fitted mean is low—the warning of
ziP
should be heeded.
For GAITD regression the warning applies more
strongly and generally; here to all
i.mix
, i.mlm
, d.mix
and
d.mlm
values, not just 0. Even one
misspecified special value usually will cause
convergence problems.
Default values for this and similar family
functions may change in the future, e.g.,
eq.ap
and eq.ip
. Important
internal changes might occur too, such as the
ordering of the linear/additive predictors and
the quantities returned as the fitted values.
Using i.mlm
requires more caution
than a.mlm
because gross inflation
is ideally needed for it to work safely.
Ditto for i.mix
versus a.mix
.
Data exhibiting deflation or little to no
inflation will produce numerical problems,
hence set trace = TRUE
to monitor
convergence. More than c.10 IRLS iterations
should raise suspicion.
Ranking the four operators by difficulty, the easiest is truncation followed by alteration, then inflation and the most difficult is deflation. The latter needs good initial values and the current default will probably not work on some data sets. Studying the spikeplot is time very well spent. In general it is very easy to specify an overfitting model so it is a good idea to split the data into training and test sets.
This function is quite memory-hungry with
respect to length(c(a.mix, i.mix, d.mix,
a.mlm, i.mlm, d.mlm))
. On consuming something
different, because all values of the NBP vector
need to be positive it pays to be economical
with respect to d.mlm
especially so
that one does not consume up probabilities
unnecessarily so to speak.
It is often a good idea to set eq.ip =
TRUE
, especially when length(i.mix)
is not much more than 2 or the values of
i.mix
are not spread over the range
of the response. This way the estimation
can borrow strength from both the inflated
and non-inflated values. If the i.mix
values form a single small cluster then this
can easily create estimation difficulties—the
idea is somewhat similar to multicollinearity.
The same holds for d.mix
.
Numerical problems can easily arise because
of the exceeding flexibility of this
distribution and/or the lack of sizeable
inflation/deflation; it is a good idea to gain
experience with simulated data first before
applying it to real data.
Numerical problems may arise if any of the
special values are in remote places of the
support, e.g., a value y
such that
dpois(y, lambda.p)
is very close to
0. This is because the ratio of two tiny values
can be unstable.
Good initial values may be difficult to obtain
using self-starting procedures, especially
when there are covariates. If so, then it is
advisable to use a trick: fit an intercept-only
model first and then use etastart =
predict(int.only.model)
to fit the model
with covariates. This uses the simpler model
as a stepping-stone.
The labelling of the linear/additive predictors
has been abbreviated to reduce space. For
example, multilogitlink(pobs.mix)
and
multilogitlink(pstr.mix)
would be more
accurately multilogitlink(cbind(pobs.mix,
pstr.mix))
because one grand MLM is fitted.
This shortening may result in modifications
needed in other parts of VGAM to
compensate.
Because estimation involves a MLM, the restricted
parameter space means that if the dip probabilities
are large then the NBP may become too close to 0.
If this is so then there are tricks to avoid a
negative NBP.
One of them is to model as many values of d.mlm
as d.mix
, hence the dip probabilities become
modelled via the deflation distribution instead.
Another trick to alter those special values rather than
deflating them if the dip probabilities are large.
Due to its complexity,
the HDE test hdeff
is currently
unavailable for GAITD regressions.
Randomized quantile residuals (RQRs) are
available; see residualsvglm
.
T. W. Yee
Yee, T. W. and Ma, C. (2024). Generally altered, inflated, truncated and deflated regression. Statistical Science, 39 (in press).
Gaitdpois
,
multinomial
,
rootogram4
,
specials
,
plotdgaitd
,
spikeplot
,
meangaitd
,
KLD
,
goffset
,
Trunc
,
gaitdnbinomial
,
gaitdlog
,
gaitdzeta
,
multilogitlink
,
multinomial
,
residualsvglm
,
poissonff
,
zapoisson
,
zipoisson
,
pospoisson
,
CommonVGAMffArguments
,
simulate.vlm
.
## Not run:
i.mix <- c(5, 10) # Inflate these values parametrically
i.mlm <- c(14, 15) # Inflate these values
a.mix <- c(1, 13) # Alter these values
tvec <- c(3, 11) # Truncate these values
pstr.mlm <- 0.1 # So parallel.i = TRUE
pobs.mix <- pstr.mix <- 0.1
max.support <- 20; set.seed(1)
gdata <- data.frame(x2 = runif(nn <- 1000))
gdata <- transform(gdata, lambda.p = exp(2 + 0.0 * x2))
gdata <- transform(gdata,
y1 = rgaitdpois(nn, lambda.p, a.mix = a.mix, i.mix = i.mix,
pobs.mix = pobs.mix, pstr.mix = pstr.mix,
i.mlm = i.mlm, pstr.mlm = pstr.mlm,
truncate = tvec, max.support = max.support))
gaitdpoisson(a.mix = a.mix, i.mix = i.mix, i.mlm = i.mlm)
with(gdata, table(y1))
fit1 <- vglm(y1 ~ 1, crit = "coef", trace = TRUE, data = gdata,
gaitdpoisson(a.mix = a.mix, i.mix = i.mix,
i.mlm = i.mlm, parallel.i = TRUE,
eq.ap = TRUE, eq.ip = TRUE, truncate =
tvec, max.support = max.support))
head(fitted(fit1, type.fitted = "Pstr.mix"))
head(predict(fit1))
t(coef(fit1, matrix = TRUE)) # Easier to see with t()
summary(fit1) # No HDE test by default but HDEtest = TRUE is ideal
spikeplot(with(gdata, y1), lwd = 2)
plotdgaitd(fit1, new.plot = FALSE, offset.x = 0.2, all.lwd = 2)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.