hltest: Modified Hosmer-Lemeshow Test for Large Samples

Description Usage Arguments Details Value Methods (by class) Examples

View source: R/hltest_functions.R

Description

hltest implements a modified Hosmer-Lemeshow test to assess the goodness of fit of logistic regression models in large samples.

Usage

hltest(...)

## S3 method for class 'numeric'
hltest(y, prob, G = 10, outsample = FALSE,
  epsilon0 = NULL, conf.level = 0.95, citype = "one.sided",
  cimethod = ifelse(citype == "one.sided", NULL, "symmetric"), ...)

## S3 method for class 'glm'
hltest(glmObject, ...)

Arguments

...

Additional arguments (ignored).

y, prob

Numeric vectors with binary responses and predicted probabilities to be evaluated. The vectors must have equal length. Missing values are dropped.
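As an illustration of the missing-value handling, the following base R sketch drops every observation for which either the response or the predicted probability is missing. It assumes pairwise deletion; the internal code of hltest may handle this differently. The vectors and the names y_clean and prob_clean are hypothetical.

```r
# Hypothetical input vectors with missing values in different positions
y    <- c(1, 0, NA, 1, 0)
prob <- c(0.8, 0.2, 0.5, NA, 0.1)

# Keep only the positions where both y and prob are observed
# (assumed behaviour, not taken from the package source)
keep       <- complete.cases(y, prob)
y_clean    <- y[keep]
prob_clean <- prob[keep]
```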

G

Number of groups to be used in the Hosmer-Lemeshow statistic. By default, G=10.

outsample

A logical value specifying whether the model was fit on the data provided (outsample=FALSE, default) or developed on an external sample (outsample=TRUE). The Hosmer-Lemeshow statistic is assumed to have G-2 degrees of freedom if outsample=FALSE and G degrees of freedom if outsample=TRUE.
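The roles of G and outsample can be sketched in base R, without the package. The helper hl_statistic below is hypothetical and assumes that groups are formed from quantiles of the predicted probabilities (the usual "deciles of risk" construction when G=10); the grouping used by hltest itself may differ. The p-values shown are those of the traditional test, which uses a central chi-squared reference distribution.

```r
# Hypothetical helper: Hosmer-Lemeshow statistic from responses and predictions.
# Assumption: groups are defined by quantiles of prob, with no tied break points.
hl_statistic <- function(y, prob, G = 10) {
  breaks   <- quantile(prob, probs = seq(0, 1, length.out = G + 1))
  group    <- cut(prob, breaks = breaks, include.lowest = TRUE)
  observed <- tapply(y, group, sum)      # observed events per group
  expected <- tapply(prob, group, sum)   # expected events per group
  size     <- tapply(y, group, length)   # number of observations per group
  sum((observed - expected)^2 / (expected * (1 - expected / size)))
}

# Small simulated example (fake data)
set.seed(1)
n    <- 2000
x    <- rnorm(n)
y    <- rbinom(n, size = 1, prob = plogis(-1 + x))
fit  <- glm(y ~ x, family = binomial)
prob <- fitted(fit)

G    <- 10
stat <- hl_statistic(y, prob, G = G)

# Degrees of freedom depend on where the model was developed
dof_insample  <- G - 2   # outsample = FALSE: model fit on these data
dof_outsample <- G       # outsample = TRUE: model developed elsewhere

p_insample  <- pchisq(stat, df = dof_insample,  lower.tail = FALSE)
p_outsample <- pchisq(stat, df = dof_outsample, lower.tail = FALSE)
```

For the same value of the statistic, the larger number of degrees of freedom used with outsample=TRUE yields a larger p-value.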

epsilon0

Value of the parameter epsilon0, which characterizes the models to be considered acceptable in terms of goodness of fit. By default (NULL), epsilon0 is set to the value of epsilon expected from a model whose traditional Hosmer-Lemeshow test attains a p-value of 0.05 in a sample of one million observations. The case epsilon0=0 corresponds to the traditional Hosmer-Lemeshow test. See the section "Details" for further information.

conf.level

Confidence level for the confidence interval of epsilon. Defaults to 0.95.

citype

Type of confidence interval of epsilon to be computed: one-sided (citype="one.sided", default) or two-sided (citype="two.sided").

cimethod

Method to be used to compute the two-sided confidence interval: symmetric (cimethod="symmetric", default) or central (cimethod="central"). See section "Details" for further information.

glmObject

As an alternative to the vectors y and prob, a fitted glm object containing the model to be evaluated.

Details

The modification of the Hosmer-Lemeshow test evaluates the hypotheses:

H0: epsilon <= epsilon0 vs. Ha: epsilon > epsilon0,

where epsilon is a parameter that measures the goodness of fit of a model. This parameter is based on a standardization of the noncentrality parameter that characterizes the distribution of the Hosmer-Lemeshow statistic. The case epsilon=0 corresponds to a model with perfect fit.

Because the null hypothesis of the traditional Hosmer-Lemeshow test is the condition of perfect fit, it can be interpreted as a test for H0: epsilon = 0 vs. Ha: epsilon > 0. Therefore, the traditional Hosmer-Lemeshow test can be performed by setting the argument epsilon0=0.

If epsilon0>0, the implemented test evaluates whether the fit of a model is "acceptable", albeit not perfect. The value of epsilon0 defines what is meant by "acceptable" in terms of goodness of fit. By default, epsilon0 is the value of epsilon expected from a model whose traditional Hosmer-Lemeshow test attains a p-value of 0.05 in a sample of one million observations. In other words, the test assesses whether the fit of a model is worse than the fit of a model that would be considered borderline significant (i.e., attaining a p-value of 0.05) in a sample of one million observations.
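The difference between the two tests can be sketched with base R. According to the section "Value", the p-value is computed from a noncentral chi-squared distribution with given degrees of freedom (dof) and noncentrality parameter (lambda). The helper hl_pvalue below is hypothetical, and it takes the noncentrality parameter lambda0 directly as input: the conversion from epsilon0 to lambda is not reproduced here. The numeric values are illustrative only.

```r
# Hypothetical helper: upper-tail p-value of an observed Hosmer-Lemeshow
# statistic under a chi-squared distribution with noncentrality lambda0.
# lambda0 = 0 gives the central chi-squared of the traditional test.
hl_pvalue <- function(stat, dof, lambda0 = 0) {
  pchisq(stat, df = dof, ncp = lambda0, lower.tail = FALSE)
}

stat <- 20   # illustrative observed statistic
dof  <- 8    # G - 2 with G = 10 and outsample = FALSE

p_traditional <- hl_pvalue(stat, dof)                # H0: perfect fit
p_modified    <- hl_pvalue(stat, dof, lambda0 = 10)  # illustrative lambda0 > 0
```

Because the noncentral chi-squared distribution shifts to the right as the noncentrality parameter grows, the same statistic is less extreme under a null hypothesis that tolerates a small lack of fit, so p_modified is larger than p_traditional.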

The function also estimates the parameter epsilon and constructs its confidence interval. The confidence interval of this parameter is based on the confidence interval of the noncentrality parameter that characterizes the distribution of the Hosmer-Lemeshow statistic, which is noncentral chi-squared. Two types of two-sided confidence intervals are implemented: symmetric (default) and central. See Kent and Hainsworth (1995) for further details.
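A confidence interval for the noncentrality parameter can be obtained by inverting the noncentral chi-squared distribution function. The sketch below is a minimal base R illustration under the assumption that the "central" interval is the equal-tailed one; it is not the package's implementation, the helper ncp_ci_central is hypothetical, and the transformation from lambda to epsilon is not shown.

```r
# Hypothetical helper: equal-tailed confidence interval for the noncentrality
# parameter of a chi-squared distribution, given one observed statistic.
ncp_ci_central <- function(stat, dof, conf.level = 0.95) {
  alpha <- 1 - conf.level

  # pchisq(stat, dof, ncp) is decreasing in ncp, so each bound is the
  # value of ncp at which the distribution function equals a target level.
  solve_ncp <- function(target) {
    f <- function(ncp) pchisq(stat, df = dof, ncp = ncp) - target
    if (f(0) <= 0) return(0)   # bound truncated at zero
    uniroot(f, lower = 0, upper = max(50, 5 * stat),
            extendInt = "downX", tol = 1e-8)$root
  }

  c(lower = solve_ncp(1 - alpha / 2),
    upper = solve_ncp(alpha / 2))
}

# Illustrative values: statistic of 30 with 8 degrees of freedom
ci <- ncp_ci_central(stat = 30, dof = 8)
```

A bound is set to zero whenever the observed statistic is too small for the corresponding equation to have a nonnegative solution.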

References:

Kent, J. T., & Hainsworth, T. J. (1995). Confidence intervals for the noncentral chi-squared distribution. Journal of Statistical Planning and Inference, 46(2), 147–159.

Nattino, G., Pennell, M. L., & Lemeshow, S. Assessing the goodness of fit of logistic regression models in large samples: A modification of the Hosmer-Lemeshow test. In preparation.

Value

A list of class htest containing the following components:

null.value

The value of epsilon0 used in the test.

statistic

The value of the Hosmer-Lemeshow statistic.

p.value

The p-value of the test.

parameter

A vector with the parameters of the noncentral chi-squared distribution used to compute the p-value: degrees of freedom (dof) and noncentrality parameter (lambda).

lambdaHat

The estimate of the noncentrality parameter lambda.

estimate

The estimate of epsilon.

conf.int

The confidence interval of epsilon.

Methods (by class)

Examples

#Generate fake data with two variables: one continuous and one binary.
set.seed(1234)
dat <- data.frame(x1 = rnorm(5e5),
                 x2 = rbinom(5e5, size=1, prob=.5))
#The true probabilities of the response depend on a negligible interaction
dat$prob <- 1/(1+exp(-(-1 + dat$x1 + dat$x2 + 0.05*dat$x1*dat$x2)))
dat$y <- rbinom(5e5, size = 1, prob = dat$prob)

#Fit an acceptable model (does not include the negligible interaction)
model <- glm(y ~ x1 + x2, data = dat, family = binomial(link="logit"))

#Check: predicted probabilities are very close to true probabilities
dat$phat <- predict(model, type = "response")
boxplot(abs(dat$prob-dat$phat))

#Traditional Hosmer-Lemeshow test: reject H0
hltest(model, epsilon0 = 0)

#Modified Hosmer-Lemeshow test: fail to reject H0
hltest(model)

#Same output with vectors of responses and predicted probabilities
hltest(y=dat$y, prob=dat$phat)

gnattino/largesamplehl documentation built on March 22, 2021, 3:48 p.m.