A generalized linear model is characterized by three components: 1) the \textit{random component}, which defines the conditional distribution of the response variable $Y_i$ given the realization of the explanatory variables $\boldsymbol{x_i}$; 2) the \textit{systematic component}, determined by the \textit{linear predictor} $\eta$ (which specifies how the explanatory variables enter the model linearly); and 3) the \textit{link function} $g$, which relates the expected response to the linear predictor.
The random component of a GLM for a categorical response with $J$ categories is the multinomial distribution with probability vector $(\pi_1,\ldots,\pi_{J})$, where $\sum_r \pi_r = 1$. The linear predictor $(\eta_1,\ldots,\eta_{J-1})$ can be written as the product of the design matrix $Z$ and the unknown parameter vector $\beta$. The link function that characterizes this model is given by the equation $g(\boldsymbol{\pi}) = Z\beta$, with $J-1$ equations $g_j = \eta_j$. @peyhardi_new proposed to write the link function as
\begin{equation}
g_j = F^{-1} \circ r_j \Leftrightarrow r_j = F(\eta_j), \quad j = 1,2,\ldots,J-1,
\end{equation}
where $F$ is a cumulative distribution function (cdf) and $r=(r_1,\ldots,r_{J-1})$ is a transformation of the expected value vector. In the following, we describe in more detail the components $(r,F,Z)$ and their modalities.
The linear predictor is not directly related to the expectation $\boldsymbol{\pi}$; instead, the two are related through a particular transformation $\boldsymbol{r}$ of the vector $\boldsymbol{\pi}$, which is called the ratio. @peyhardi_new proposed four ratios that gather the alternatives for modeling categorical response data:
\begin{center}
\begin{tabular}{lcccc}
\hline
 & Cumulative & Sequential & Adjacent & Reference \\
\hline
$r_j(\boldsymbol{\pi})$ & $\pi_1+\ldots+\pi_j$ & $\dfrac{\pi_j}{\pi_j +\ldots+ \pi_J}$ & $\dfrac{\pi_j}{\pi_j + \pi_{j+1}}$ & $\dfrac{\pi_j}{\pi_j + \pi_J}$ \\
$Y$ & \multicolumn{3}{c}{ordinal} & nominal \\
\hline
\end{tabular}
\end{center}
Each component $r_j(\boldsymbol{\pi})$ can be viewed as a (conditional) probability. For the reference ratio, each category $j$ is compared to the reference category $J$. For the adjacent ratio, each category $j$ is compared to its adjacent category $j+1$. For the cumulative ratio, the probabilities of the categories are cumulated. For the sequential ratio, each category $j$ is compared to the set of subsequent categories $j+1, \ldots, J$. The adjacent, cumulative and sequential ratios all rely on an ordering assumption among the categories, whereas the reference ratio is devoted to nominal responses.
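As a worked illustration of how a triplet defines a model, combine the reference ratio with the logistic cdf $F(\eta) = e^{\eta}/(1+e^{\eta})$; the following is a sketch derived directly from the definitions above:
\begin{equation*}
r_j(\boldsymbol{\pi}) = \frac{\pi_j}{\pi_j + \pi_J} = \frac{e^{\eta_j}}{1+e^{\eta_j}}
\quad \Longleftrightarrow \quad
\log\left(\frac{\pi_j}{\pi_J}\right) = \eta_j, \qquad j = 1,\ldots,J-1,
\end{equation*}
which is the link of the classical multinomial logit model.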
The cdfs (distributions) available in glmcat to fit the models are: logistic, normal, Cauchy, Student (with any degrees of freedom), Gompertz and Gumbel. The logistic and normal distributions are the symmetric distributions most commonly used to define link functions in generalized linear models. However, for specific scenarios, the use of other distributions may result in a more accurate fit. An example is presented by @peyhardi_travel, where the use of the Student cdf led to a better fit in a modeling exercise on travel choice data. For the asymmetric case, the Gumbel and Gompertz distributions are the most commonly used.
It is possible to impose restrictions on the thresholds, or on the effects of the covariates; for example, the effects may be allowed to vary across the response categories or constrained to be constant.
It is plausible for a predictor to have a specific level of impact on each category of the response. In that case, the $J-1$ linear predictors are of the form $\eta_j = \alpha_j + x^t \delta_j$, with $\beta = (\alpha_1,\ldots, \alpha_{J-1}, \delta_1^t, \ldots, \delta_{J-1}^t)$, and the associated design matrix is:
\begin{equation}
Z_c=\left(
\begin{array}{cccccc}
1 & & & x^t & & \\
 & \ddots & & & \ddots & \\
 & & 1 & & & x^t
\end{array}
\right)_{(J-1) \times (J-1)(1 + p)}
\end{equation}
Another option is to constrain the effects of the covariates to be constant across the response categories, so that there is only a global effect that is not specific to the categories. This is known as the parallelism assumption, for which the constrained space is represented by:
\begin{equation}
Z_p=\left(
\begin{array}{cccc}
1 & & & x^t \\
 & \ddots & & \vdots \\
 & & 1 & x^t
\end{array}
\right)_{(J-1) \times (J-1 + p)}
\end{equation}
The first case $(Z_c)$ is named by @peyhardi_new the \textit{complete} design, whereas the second $(Z_p)$ is the \textit{parallel} design. These two matrices are sufficient to define all the classical models. A third option is to consider both kinds of effects, complete and parallel; this is known as the \textit{partial parallel} design:
\begin{equation}
Z=\left(
\begin{array}{ccccccc}
1 & & & x_k^t & & & x_l^t \\
 & \ddots & & & \ddots & & \vdots \\
 & & 1 & & & x_k^t & x_l^t
\end{array}
\right)_{(J-1) \times ((J-1)(1 + K) + L)}
\end{equation}
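To make the notation concrete, here is a small instantiation of the definitions above (our own illustration, not taken from @peyhardi_new) for $J = 3$ response categories and a single covariate $x$ ($p = 1$):
\begin{equation*}
Z_c=\left(\begin{array}{cccc} 1 & 0 & x & 0 \\ 0 & 1 & 0 & x \end{array}\right), \qquad
Z_p=\left(\begin{array}{ccc} 1 & 0 & x \\ 0 & 1 & x \end{array}\right),
\end{equation*}
so that $Z_c$ yields category-specific slopes, $\eta_j = \alpha_j + x\,\delta_j$, whereas $Z_p$ yields a common slope, $\eta_j = \alpha_j + x\,\delta$.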
All the classical models for categorical response data can be written as an $(r,F,Z)$ triplet; for example, the multinomial logit model corresponds to (Reference, Logistic, Complete) and the proportional odds model to (Cumulative, Logistic, Parallel).
The glmcat package
We used the 223 observations of the boys' disturbed dreams benchmark dataset, drawn from a study that cross-classified boys by their age $x$ and the severity of their disturbed dreams $y$ [@maxwell1961Analyzing]. The data is available as the object DisturbedDreams in the package glmcat. For more information, see the manual entry for the DisturbedDreams data: help(DisturbedDreams).
# devtools::load_all()
library(GLMcat)
data("DisturbedDreams") summary(DisturbedDreams)
We will fit the model $(Reference, Logistic, Complete)$ to the DisturbedDreams data using the function glmcat. We save the fitted model in the object mod_ref_log_c and print it by simply typing its name:
DisturbedDreams$Level <- as.factor(as.character(DisturbedDreams$Level))
mod_ref_log_c <- glmcat(
  formula = Level ~ Age,
  ratio = "reference",
  cdf = "logistic",
  ref_category = "Very.severe",
  data = DisturbedDreams
)
mod_ref_log_c
The most common R functions that describe different model features are available for glmcat objects:
summary(mod_ref_log_c)
nobs(mod_ref_log_c)
coef(mod_ref_log_c)
logLik(mod_ref_log_c)
AIC(mod_ref_log_c)
BIC(mod_ref_log_c)
Predictions can be obtained with the function predict_glmcat or, equivalently, with the generic predict method. We are going to predict the response for 3 randomly selected observations:
# Random observations
set.seed(13)
ind <- sample(x = 1:nrow(DisturbedDreams), size = 3)
# Probabilities
predict(mod_ref_log_c, newdata = DisturbedDreams[ind, ], type = "prob")
# Linear predictor
predict(mod_ref_log_c, newdata = DisturbedDreams[ind, ], type = "linear.predictor")
Now we illustrate how to predict for a set of new observations. Suppose we want to predict the severity of dreams for 3 individuals whose ages are 5, 9.5 and 15, respectively:
# New data: the ages of the three new individuals
new_data <- data.frame(Age = c(5, 9.5, 15))
predict(mod_ref_log_c, newdata = new_data, type = "prob")
Assume that we are interested in making the effect of the predictor variable parallel; to that end, we pass the name of the predictor variable to the parameter parallel. The model to fit corresponds to the triplet (Reference, Logistic, parallel):
mod2 <- glmcat(
  formula = Level ~ Age,
  cdf = "logistic",
  parallel = "Age",
  ref_category = "Very.severe",
  data = DisturbedDreams
)
summary(mod2)
logLik(mod2)
Another variation of the reference model is obtained by changing the cdf. Let's now fit the model (Reference, Student (0.5), Complete):
mod3 <- glmcat(
  formula = Level ~ Age,
  ref_category = "Very.severe",
  data = DisturbedDreams,
  cdf = list("student", 0.5)
)
summary(mod3)
logLik(mod3)
The equivalence between the $(Adjacent, Logistic, Complete)$ and $(Reference, Logistic, Complete)$ models is shown by comparing the log-likelihoods of the two models:
logLik(mod_ref_log_c) # recall (Reference, Logistic, Complete)
mod_adj_log_c <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  data = DisturbedDreams,
  cdf = "logistic"
)
logLik(mod_adj_log_c)
summary(mod_adj_log_c)
Remark that although the log-likelihoods are equal, the parameter estimates are different $(\alpha \neq \alpha')$. Defining the matrix $A^T$ as follows:
library(xtable)
print(xtableMatharray(matrix(c(1, -1, 0, 0, 1, -1, 0, 0, 1), nrow = 3)), type = "latex")
$$
A^T = \left(\begin{array}{rrr}
1 & 0 & 0 \\
-1 & 1 & 0 \\
0 & -1 & 1
\end{array}\right)
$$
we can check that $A^T \alpha = \alpha'$.
Note: The adjacent models are stable under the reverse permutation.
$(Adjacent, Cauchy, Complete)$
mod_adj_cau_c <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "cauchy",
  categories_order = c("Not.severe", "Severe.1", "Severe.2", "Very.severe"),
  data = DisturbedDreams
)
logLik(mod_adj_cau_c)
summary(mod_adj_cau_c)
(Adjacent, Cauchy, Complete) with reversed order
mod_adj_cau_c_rev <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "cauchy",
  categories_order = c("Very.severe", "Severe.2", "Severe.1", "Not.severe"),
  data = DisturbedDreams
)
logLik(mod_adj_cau_c_rev)
summary(mod_adj_cau_c_rev)
The log-likelihoods of the last two models are the same because the Cauchy cdf is symmetric; for non-symmetric distributions this is no longer true. Note that if the Gumbel cdf is used with the reverse order, its log-likelihood is equal to that of the model using the Gompertz cdf, because the Gumbel distribution is the mirror image of the Gompertz distribution; the parameter estimates, however, are reversed:
(Adjacent, Gumbel, parallel)
adj_gumbel_p <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "gumbel",
  categories_order = c("Not.severe", "Severe.1", "Severe.2", "Very.severe"),
  parallel = c("(Intercept)", "Age"),
  data = DisturbedDreams
)
logLik(adj_gumbel_p)
summary(adj_gumbel_p)
(Adjacent, Gompertz, parallel)
adj_gompertz_rev <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "gompertz",
  categories_order = c("Very.severe", "Severe.2", "Severe.1", "Not.severe"),
  parallel = c("(Intercept)", "Age"),
  data = DisturbedDreams
)
logLik(adj_gompertz_rev)
summary(adj_gompertz_rev)
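The equality of the two log-likelihoods above reflects the mirror relation between the two cdfs. Under the standard parameterizations (our assumption; glmcat may use location-scale variants),
\begin{equation*}
F_{\mathrm{Gumbel}}(w) = \exp\{-\exp(-w)\}, \qquad F_{\mathrm{Gompertz}}(w) = 1 - \exp\{-\exp(w)\},
\end{equation*}
so that $F_{\mathrm{Gompertz}}(w) = 1 - F_{\mathrm{Gumbel}}(-w)$; reversing the category order therefore exchanges one cdf for the other.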
The sequential ratio assumes a binary process at each transition: higher levels can be reached only if the previous levels were reached at an earlier stage.
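A sketch of this interpretation in formulas, derived from the definition of the sequential ratio: since $r_j(\boldsymbol{\pi}) = \pi_j/(\pi_j + \ldots + \pi_J) = P(Y = j \mid Y \geq j) = F(\eta_j)$, the category probabilities factor into a product of binary transitions,
\begin{equation*}
\pi_j = F(\eta_j) \prod_{l=1}^{j-1} \{ 1 - F(\eta_l) \}, \qquad j = 1, \ldots, J-1.
\end{equation*}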
(Sequential, Normal, Complete)
seq_probit_c <- glmcat(
  formula = Level ~ Age,
  ratio = "sequential",
  cdf = "normal",
  data = DisturbedDreams
)
logLik(seq_probit_c)
summary(seq_probit_c)
(Cumulative, Logistic, Complete)
cum_log_co <- glmcat(
  formula = Level ~ Age,
  cdf = "logistic",
  ratio = "cumulative",
  data = DisturbedDreams
)
logLik(cum_log_co)
summary(cum_log_co)
The function glmcat has special features for cumulative models. One option of interest for this family is to constrain the thresholds to be equidistant:
(Cumulative, Logistic, Equidistant)
cum_log_co_e <- glmcat(
  formula = Level ~ Age,
  cdf = "logistic",
  ratio = "cumulative",
  data = DisturbedDreams,
  parallel = "Age",
  threshold = "equidistant"
)
logLik(cum_log_co_e)
summary(cum_log_co_e)
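Under the equidistant constraint the thresholds are assumed to be equally spaced; a presumed parameterization (our reading of the option, not a statement quoted from the package documentation) is
\begin{equation*}
\alpha_j = \alpha_1 + (j-1)\,\theta, \qquad j = 1, \ldots, J-1,
\end{equation*}
so that only the first threshold $\alpha_1$ and the common spacing $\theta$ are estimated.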
If we have a preliminary idea of the coefficients of the model, we can specify an initialization vector through the parameter beta_init of control_glmcat():
cum_log_c <- glmcat(
  formula = Level ~ Age,
  cdf = list("student", 0.8),
  ratio = "cumulative",
  data = DisturbedDreams,
  control = control_glmcat(beta_init = coef(cum_log_co))
)
logLik(cum_log_c)
summary(cum_log_c)
The equivalence between the $(Cumulative, Gompertz, parallel)$ and $(Sequential, Gompertz, parallel)$ models was demonstrated by @laara_matthes; we verify it here by fitting both models:
cum_gom_p <- glmcat(
  formula = Level ~ Age,
  cdf = "gompertz",
  ratio = "cumulative",
  data = DisturbedDreams,
  parallel = "Age"
)
logLik(cum_gom_p)
summary(cum_gom_p)
seq_gom_p <- glmcat(
  formula = Level ~ Age,
  cdf = "gompertz",
  ratio = "sequential",
  data = DisturbedDreams,
  parallel = "Age"
)
logLik(seq_gom_p)
summary(seq_gom_p)
Models for categorical response data have evolved in different fields of research under different names; some of them are fairly similar or even identical. Until recently, there was no methodology that encompassed these models in a comparable scheme. glmcat is based on the specification of a generalized linear model given by the (r,F,Z)-triplet, which groups together the proposed methodologies for modeling categorical responses. glmcat thus offers a full picture of the spectrum of models, where the user has three components to combine in order to obtain a model that meets the specifications of the problem.