# hurdle: Hurdle Models for Count Data Regression In pscl: Political Science Computational Laboratory

## Description

Fit hurdle regression models for count data via maximum likelihood.

## Usage

 ```1 2 3 4 5 6``` ```hurdle(formula, data, subset, na.action, weights, offset, dist = c("poisson", "negbin", "geometric"), zero.dist = c("binomial", "poisson", "negbin", "geometric"), link = c("logit", "probit", "cloglog", "cauchit", "log"), control = hurdle.control(...), model = TRUE, y = TRUE, x = FALSE, ...) ```

## Arguments

 `formula` symbolic description of the model, see details. `data, subset, na.action` arguments controlling formula processing via `model.frame`. `weights` optional numeric vector of weights. `offset` optional numeric vector with an a priori known component to be included in the linear predictor of the count model. See below for more information on offsets. `dist` character specification of count model family. `zero.dist` character specification of the zero hurdle model family. `link` character specification of link function in the binomial zero hurdle (only used if `zero.dist = "binomial"`. `control` a list of control arguments specified via `hurdle.control`. `model, y, x` logicals. If `TRUE` the corresponding components of the fit (model frame, response, model matrix) are returned. `...` arguments passed to `hurdle.control` in the default setup.

## Details

Hurdle count models are two-component models with a truncated count component for positive counts and a hurdle component that models the zero counts. Thus, unlike zero-inflation models, there are not two sources of zeros: the count model is only employed if the hurdle for modeling the occurrence of zeros is exceeded. The count model is typically a truncated Poisson or negative binomial regression (with log link). The geometric distribution is a special case of the negative binomial with size parameter equal to 1. For modeling the hurdle, either a binomial model can be employed or a censored count distribution. The outcome of the hurdle component of the model is the occurrence of a non-zero (positive) count. Thus, for most models, positive coefficients in the hurdle component indicate that an increase in the regressor increases the probability of a non-zero count. Binomial logit and censored geometric models as the hurdle part both lead to the same likelihood function and thus to the same coefficient estimates. A censored negative binomial model for the zero hurdle is only identified if there is at least one non-constant regressor with (true) coefficient different from zero (and if all coefficients are close to zero the model can be poorly conditioned).

The `formula` can be used to specify both components of the model: If a `formula` of type `y ~ x1 + x2` is supplied, then the same regressors are employed in both components. This is equivalent to `y ~ x1 + x2 | x1 + x2`. Of course, a different set of regressors could be specified for the zero hurdle component, e.g., `y ~ x1 + x2 | z1 + z2 + z3` giving the count data model `y ~ x1 + x2` conditional on (`|`) the zero hurdle model `y ~ z1 + z2 + z3`.

Offsets can be specified in both parts of the model pertaining to count and zero hurdle model: `y ~ x1 + offset(x2) | z1 + z2 + offset(z3)`, where `x2` is used as an offset (i.e., with coefficient fixed to 1) in the count part and `z3` analogously in the zero hurdle part. By the rule stated above `y ~ x1 + offset(x2)` is expanded to `y ~ x1 + offset(x2) | x1 + offset(x2)`. Instead of using the `offset()` wrapper within the `formula`, the `offset` argument can also be employed which sets an offset only for the count model. Thus, `formula = y ~ x1` and `offset = x2` is equivalent to `formula = y ~ x1 + offset(x2) | x1`.

All parameters are estimated by maximum likelihood using `optim`, with control options set in `hurdle.control`. Starting values can be supplied, otherwise they are estimated by `glm.fit` (the default). By default, the two components of the model are estimated separately using two `optim` calls. Standard errors are derived numerically using the Hessian matrix returned by `optim`. See `hurdle.control` for details.

The returned fitted model object is of class `"hurdle"` and is similar to fitted `"glm"` objects. For elements such as `"coefficients"` or `"terms"` a list is returned with elements for the zero and count components, respectively. For details see below.

A set of standard extractor functions for fitted model objects is available for objects of class `"hurdle"`, including methods to the generic functions `print`, `summary`, `coef`, `vcov`, `logLik`, `residuals`, `predict`, `fitted`, `terms`, `model.matrix`. See `predict.hurdle` for more details on all methods.

## Value

An object of class `"hurdle"`, i.e., a list with components including

 `coefficients` a list with elements `"count"` and `"zero"` containing the coefficients from the respective models, `residuals` a vector of raw residuals (observed - fitted), `fitted.values` a vector of fitted means, `optim` a list (of lists) with the output(s) from the `optim` call(s) for minimizing the negative log-likelihood(s), `control` the control arguments passed to the `optim` call, `start` the starting values for the parameters passed to the `optim` call(s), `weights` the case weights used, `offset` a list with elements `"count"` and `"zero"` containing the offset vectors (if any) from the respective models, `n` number of observations (with weights > 0), `df.null` residual degrees of freedom for the null model (= `n - 2`), `df.residual` residual degrees of freedom for fitted model, `terms` a list with elements `"count"`, `"zero"` and `"full"` containing the terms objects for the respective models, `theta` estimate of the additional theta parameter of the negative binomial model(s) (if negative binomial component is used), `SE.logtheta` standard error(s) for log(theta), `loglik` log-likelihood of the fitted model, `vcov` covariance matrix of all coefficients in the model (derived from the Hessian of the `optim` output(s)), `dist` a list with elements `"count"` and `"zero"` with character strings describing the respective distributions used, `link` character string describing the link if a binomial zero hurdle model is used, `linkinv` the inverse link function corresponding to `link`, `converged` logical indicating successful convergence of `optim`, `call` the original function call, `formula` the original formula, `levels` levels of the categorical regressors, `contrasts` a list with elements `"count"` and `"zero"` containing the contrasts corresponding to `levels` from the respective models, `model` the full model frame (if `model = TRUE`), `y` the response count vector (if `y = TRUE`), `x` a list with elements `"count"` and `"zero"` containing the model matrices from the respective models (if `x = TRUE`).

## Author(s)

Achim Zeileis <Achim.Zeileis@R-project.org>

## References

Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.

Cameron, A. Colin and Pravin K. Trivedi 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Mullahy, J. 1986. Specification and Testing of Some Modified Count Data Models. Journal of Econometrics. 33:341–365.

Zeileis, Achim, Christian Kleiber and Simon Jackman 2008. “Regression Models for Count Data in R.” Journal of Statistical Software, 27(8). URL http://www.jstatsoft.org/v27/i08/.

`hurdle.control`, `glm`, `glm.fit`, `glm.nb`, `zeroinfl`

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24``` ```## data data("bioChemists", package = "pscl") ## logit-poisson ## "art ~ ." is the same as "art ~ . | .", i.e. ## "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment" fm_hp1 <- hurdle(art ~ ., data = bioChemists) summary(fm_hp1) ## geometric-poisson fm_hp2 <- hurdle(art ~ ., data = bioChemists, zero = "geometric") summary(fm_hp2) ## logit and geometric model are equivalent coef(fm_hp1, model = "zero") - coef(fm_hp2, model = "zero") ## logit-negbin fm_hnb1 <- hurdle(art ~ ., data = bioChemists, dist = "negbin") summary(fm_hnb1) ## negbin-negbin ## (poorly conditioned zero hurdle, note the standard errors) fm_hnb2 <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero = "negbin") summary(fm_hnb2) ```

### Example output

```Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis

Call:
hurdle(formula = art ~ ., data = bioChemists)

Pearson residuals:
Min      1Q  Median      3Q     Max
-2.4105 -0.8913 -0.2817  0.5530  7.0324

Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.67114    0.12246   5.481 4.24e-08 ***
femWomen    -0.22858    0.06522  -3.505 0.000457 ***
marMarried   0.09649    0.07283   1.325 0.185209
kid5        -0.14219    0.04845  -2.934 0.003341 **
phd         -0.01273    0.03130  -0.407 0.684343
ment         0.01875    0.00228   8.222  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.23680    0.29552   0.801   0.4230
femWomen    -0.25115    0.15911  -1.579   0.1144
marMarried   0.32623    0.18082   1.804   0.0712 .
kid5        -0.28525    0.11113  -2.567   0.0103 *
phd          0.02222    0.07956   0.279   0.7800
ment         0.08012    0.01302   6.155 7.52e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 12
Log-likelihood: -1605 on 12 Df

Call:
hurdle(formula = art ~ ., data = bioChemists, zero.dist = "geometric")

Pearson residuals:
Min      1Q  Median      3Q     Max
-2.4105 -0.8913 -0.2817  0.5530  7.0324

Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.67114    0.12246   5.481 4.24e-08 ***
femWomen    -0.22858    0.06522  -3.505 0.000457 ***
marMarried   0.09649    0.07283   1.325 0.185209
kid5        -0.14219    0.04845  -2.934 0.003341 **
phd         -0.01273    0.03130  -0.407 0.684343
ment         0.01875    0.00228   8.222  < 2e-16 ***
Zero hurdle model coefficients (censored geometric with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.23680    0.29552   0.801   0.4230
femWomen    -0.25115    0.15911  -1.579   0.1144
marMarried   0.32623    0.18082   1.804   0.0712 .
kid5        -0.28525    0.11113  -2.567   0.0103 *
phd          0.02222    0.07956   0.279   0.7800
ment         0.08012    0.01302   6.155 7.52e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 12
Log-likelihood: -1605 on 12 Df
(Intercept)      femWomen    marMarried          kid5           phd
-2.831069e-15  6.106227e-16 -5.606626e-15  1.665335e-16 -1.161918e-14
ment
-1.643130e-14

Call:
hurdle(formula = art ~ ., data = bioChemists, dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-1.2581 -0.8036 -0.2497  0.4745  6.2753

Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.355125   0.196832   1.804  0.07120 .
femWomen    -0.244672   0.097218  -2.517  0.01184 *
marMarried   0.103417   0.109430   0.945  0.34463
kid5        -0.153260   0.072229  -2.122  0.03385 *
phd         -0.002933   0.048067  -0.061  0.95134
ment         0.023738   0.004287   5.537 3.07e-08 ***
Log(theta)   0.603472   0.224995   2.682  0.00731 **
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.23680    0.29552   0.801   0.4230
femWomen    -0.25115    0.15911  -1.579   0.1144
marMarried   0.32623    0.18082   1.804   0.0712 .
kid5        -0.28525    0.11113  -2.567   0.0103 *
phd          0.02222    0.07956   0.279   0.7800
ment         0.08012    0.01302   6.155 7.52e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 1.8285
Number of iterations in BFGS optimization: 15
Log-likelihood: -1553 on 13 Df

Call:
hurdle(formula = art ~ ., data = bioChemists, dist = "negbin", zero.dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-1.2492 -0.7819 -0.2537  0.4688  6.2509

Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.355125   0.196832   1.804  0.07120 .
femWomen    -0.244672   0.097218  -2.517  0.01184 *
marMarried   0.103417   0.109430   0.945  0.34463
kid5        -0.153260   0.072229  -2.122  0.03385 *
phd         -0.002933   0.048067  -0.061  0.95134
ment         0.023738   0.004287   5.537 3.07e-08 ***
Log(theta)   0.603472   0.224995   2.682  0.00731 **
Zero hurdle model coefficients (censored negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)   8.0696    65.9791   0.122    0.903
femWomen     -2.3699    16.2661  -0.146    0.884
marMarried    2.8619    19.5755   0.146    0.884
kid5         -2.3976    16.3776  -0.146    0.884
phd           0.0543     0.7769   0.070    0.944
ment          0.8460     5.7738   0.147    0.884
Log(theta)   -2.5955     6.8229  -0.380    0.704
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 1.8285, zero = 0.0746
Number of iterations in BFGS optimization: 139
Log-likelihood: -1551 on 14 Df
```

pscl documentation built on March 26, 2020, 7:36 p.m.