hosmer_test: Hosmer-Lemeshow Goodness of Fit Test

View source: R/hosmer_test.R


Hosmer-Lemeshow Goodness of Fit Test

Description

The Hosmer-Lemeshow goodness of fit test is used to check the quality of logistic regression models. Note that this function divides the data into subgroups in its own way. See Details.

Usage

hosmer_test(model, g = 10, simple = FALSE, force = FALSE)

Arguments

model

a glm object with a binomial family.

g

numeric, the number of subgroups into which the data should be divided.

simple

logical. If TRUE, the expected values, sorted in decreasing order, are divided evenly into the specified number of subgroups. If FALSE, identical expected values are placed in the same subgroup, and the number of subgroups is adjusted to make each subgroup as homogeneous as possible. See Details.

force

logical. If TRUE, all possible combinations are evaluated and the combination that minimizes the variance of the subgroup sizes is selected; in other words, the number of values in each subgroup is adjusted to be as equal as possible. If FALSE, the function only approximately minimizes the variance of the subgroup sizes, prioritizing calculation speed rather than evaluating every combination.

Details

The Hosmer-Lemeshow goodness of fit test computes its statistic by dividing the observed and expected values into several arbitrary subgroups. The observed and expected values are generally divided into subgroups based on quantiles of the expected values, for example by taking deciles of the expected values. This method is used in the hoslem.test() function of the ResourceSelection package and the performance_hosmer() function of the performance package. It has been suggested that dividing the subgroups by quantiles such as deciles may be more accurate.
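
For reference, the conventional quantile-based grouping can be sketched as follows. This is only an illustration, not the code of hoslem.test() or performance_hosmer(); model is assumed to be a binomial glm (for example the one fitted in the Examples section), and the variable names are made up.

fitted_p <- fitted(model)                    # model: a binomial glm fitted elsewhere
g <- 10                                      # number of subgroups (deciles)
breaks <- quantile(fitted_p, probs = seq(0, 1, length.out = g + 1))
groups <- cut(fitted_p, breaks = unique(breaks), include.lowest = TRUE)
table(groups)                                # subgroup sizes are only roughly equal

With heavily tied fitted values some quantile breaks may coincide, so fewer than g subgroups can result; the grouping described below handles tied expected values explicitly.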

However, there are several variations on how to divide the subgroups, and this function uses a method in which the expected values are ordered from smallest to largest and then divided so that each subgroup contains as equal a number of samples as possible.

If simple is TRUE, the expected values are simply sorted in decreasing order and divided into the specified number of subgroups so that the subgroups are as evenly sized as possible.
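
A minimal sketch of this simple division (illustrative only, not the package's internal code; model and the variable names are assumptions):

fitted_p <- fitted(model)                    # model: a binomial glm fitted elsewhere
g <- 10
ord <- order(fitted_p, decreasing = TRUE)    # expected values in decreasing order
groups <- integer(length(fitted_p))
groups[ord] <- as.integer(ceiling(seq_along(ord) * g / length(ord)))  # consecutive blocks 1..g
table(groups)                                # block sizes differ by at most one

Note that identical expected values can end up in different subgroups here, which is exactly what simple = FALSE avoids.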

If simple is FALSE, identical expected values are placed in the same subgroup, and the number of subgroups is adjusted so that the minimum number of values in a subgroup is maximized and the variance of the numbers of values in the subgroups is minimized. In other words, the function strives to keep the number of values in each subgroup as equal as possible while ensuring that identical expected values stay in the same subgroup. The algorithm starts from the disjoint state in which each distinct expected value forms its own subgroup. The subgroup containing the fewest values is merged with one of its neighbouring subgroups (the one with the next smaller or next larger expected values), whichever merge gives the smaller variance of subgroup sizes, and the result becomes a new subgroup. This step is then repeated with the subgroup that now contains the fewest values, and so on. The procedure yields subgroups of homogeneous size, in the expected number, when the counts of the distinct expected values are relatively disparate, but it may not produce the requested number of subgroups when those counts are nearly homogeneous (for example, when each distinct expected value occurs only once or twice).
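
The merging step can be sketched roughly as follows. This is one reading of the description above, not the package's internal code; greedy_groups is a made-up name, and stopping once g subgroups remain is an assumption about the stopping rule.

greedy_groups <- function(expected, g = 10) {
  sizes <- as.integer(table(expected))   # one subgroup per distinct expected value, in increasing order
  while (length(sizes) > g) {            # assumption: merge until g subgroups remain
    i <- which.min(sizes)                # subgroup with the fewest values
    var_after <- function(j) {           # variance of subgroup sizes after merging i into j
      merged <- sizes
      merged[j] <- merged[j] + merged[i]
      stats::var(merged[-i])
    }
    left  <- if (i > 1)             var_after(i - 1) else Inf
    right <- if (i < length(sizes)) var_after(i + 1) else Inf
    j <- if (left <= right) i - 1 else i + 1
    sizes[j] <- sizes[j] + sizes[i]      # adopt the neighbour giving the smaller variance
    sizes <- sizes[-i]
  }
  sizes                                  # resulting subgroup sizes
}
greedy_groups(fitted(model))             # model: a binomial glm fitted elsewhere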

However, this greedy algorithm does not necessarily minimize the variance. For this reason, force can be set to TRUE to obtain the value by brute force. This requires a large amount of computation, may consume a large amount of memory, and can slow down the process considerably before the result is obtained.
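
The brute-force idea can be illustrated as follows (again only an illustration; brute_force_groups is a made-up name, and restricting the search to contiguous splits of the ordered distinct-value counts is an assumption). Every way of cutting the ordered counts into g contiguous blocks is enumerated and the split whose block sizes have the smallest variance is kept; the number of combinations grows rapidly with the number of distinct expected values, which is why this can be slow and memory hungry.

brute_force_groups <- function(expected, g = 10) {
  counts <- as.integer(table(expected))        # counts per distinct expected value, in increasing order
  n <- length(counts)
  stopifnot(n >= g)
  cs <- cumsum(counts)
  cuts <- utils::combn(n - 1, g - 1)           # every choice of g - 1 cut points
  sizes_of  <- function(cut) diff(c(0L, cs[c(cut, n)]))
  all_sizes <- apply(cuts, 2, sizes_of)        # g x (number of splits) matrix of block sizes
  best <- which.min(apply(all_sizes, 2, stats::var))
  all_sizes[, best]                            # subgroup sizes of the minimum-variance split
}
brute_force_groups(fitted(model))              # model: a binomial glm fitted elsewhere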

Value

A list with class "htest" containing the following components:

statistic

the value of the chi-squared test statistic, sum((observed - expected)^2 / expected); see the sketch after this list for how the statistic and the degrees of freedom combine into the p-value.

parameter

the degrees of freedom of the approximate chi-squared distribution of the test statistic (g - 2).

p.value

the p-value for the test.

method

a character string indicating the test performed.

data.name

the expression (object) on which the logistic regression analysis was performed.

observed

the observed frequencies in a g-by-2 contingency table.

expected

the expected frequencies in a g-by-2 contingency table.
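
As a minimal sketch of how these components fit together (not the package's code; it assumes observed and expected are returned as g-by-2 tables, as documented above):

ht <- hosmer_test(model)                       # model: a binomial glm
g  <- nrow(ht$observed)                        # number of subgroups actually used
statistic <- sum((ht$observed - ht$expected)^2 / ht$expected)
p.value   <- pchisq(statistic, df = g - 2, lower.tail = FALSE)
## statistic and p.value should agree with ht$statistic and ht$p.value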

References

Hosmer, D.W. and Lemesbow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10), 1043-1069. doi:10.1080/03610928008827941

Hosmer, D.W., Hosmer, T., Le Cessie, S. and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16(9), 965-980. doi:10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O

Examples

data("Titanic")
df <- data.frame(Titanic)
df <- data.frame(Class = rep(df$Class, df$Freq),
                 Sex = rep(df$Sex, df$Freq),
                 Age = rep(df$Age, df$Freq),
                 Survived = rep(df$Survived, df$Freq))
model <- glm(Survived ~ . ,data = df, family = binomial())
hosmer_test(model)
