A generalized linear model is characterized by three components: 1) the \textit{random component}, which defines the conditional distribution of the response variable $Y_i$ given the realization of the explanatory variables $\boldsymbol{x_i}$; 2) the \textit{systematic component}, determined by the \textit{linear predictor} $\eta$ (which specifies how the explanatory variables enter the model linearly); and 3) the \textit{link function} $g$, which relates the expected response to the linear predictor.
The random component of a GLM for a categorical response with $J$ categories is the multinomial distribution with probability vector $(\pi_1,\ldots,\pi_{J})$, where $\sum_r \pi_r = 1$. The linear predictor $(\eta_1,\ldots,\eta_{J-1})$ can be written as the product of the design matrix $Z$ and the unknown parameter vector $\beta$. The link function that characterizes this model is given by the equation $g(\boldsymbol{\pi}) = Z\beta$, with $J-1$ equations $g_j = \eta_j$. @peyhardi_new proposed to write the link function as
\begin{equation}
g_j = F^{-1} \circ r_j \Leftrightarrow r_j = F(\eta_j), \quad j = 1,2,\ldots,J-1,
\end{equation}
where $F$ is a cumulative distribution function (cdf) and $r=(r_1,\ldots,r_{J-1})$ is a transformation of the expected value vector. In the following, we describe in more detail the components $(r,F,Z)$ and their modalities.
The linear predictor is not directly related to the expectation $\boldsymbol{\pi}$; instead, the two are related through a particular transformation $\boldsymbol{r}$ of the vector $\boldsymbol{\pi}$, which is called the ratio. @peyhardi_new proposed four ratios that gather the alternatives for modeling categorical response data:
\begin{center}
\begin{tabular}{lcccc}
\hline
 & Cumulative & Sequential & Adjacent & Reference \\
\hline
$r_j(\boldsymbol{\pi})$ & $\pi_1+\ldots+\pi_j$ & $\dfrac{\pi_j}{\pi_j +\ldots+ \pi_J}$ & $\dfrac{\pi_j}{\pi_j + \pi_{j+1}}$ & $\dfrac{\pi_j}{\pi_j + \pi_J}$ \\
$Y$ & \multicolumn{3}{c}{ordinal} & nominal \\
\hline
\end{tabular}
\end{center}
Each component $r_j(\boldsymbol{\pi})$ can be viewed as a (conditional) probability. For the reference ratio, each category $j$ is compared to the reference category $J$. For the adjacent ratio, each category $j$ is compared to its adjacent category $j+1$. For the cumulative ratio, the probabilities of the categories are cumulated. For the sequential ratio, each category $j$ is compared to the set of subsequent categories $j+1, \ldots, J$. The adjacent, cumulative and sequential ratios all rely on an ordering assumption among the categories, whereas the reference ratio is devoted to nominal responses.
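As a worked illustration of how a triplet defines a model, combine the reference ratio with the logistic cdf $F(\eta) = e^{\eta}/(1+e^{\eta})$; the following is a sketch derived directly from the definitions above:
\begin{equation*}
r_j(\boldsymbol{\pi}) = \frac{\pi_j}{\pi_j + \pi_J} = \frac{e^{\eta_j}}{1+e^{\eta_j}}
\quad \Longleftrightarrow \quad
\log\left(\frac{\pi_j}{\pi_J}\right) = \eta_j, \qquad j = 1,\ldots,J-1,
\end{equation*}
which is the link of the classical multinomial logit model.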
The cdfs (distributions) available in glmcat to fit the models are: logistic, normal, Cauchy, Student (with any degrees of freedom), Gompertz and Gumbel. The logistic and normal distributions are the symmetric distributions most commonly used to define link functions in generalized linear models. However, for specific scenarios, the use of other distributions may result in a more accurate fit. An example is presented by @peyhardi_travel, where the use of the Student cdf led to a better fit in a modeling exercise on travel choice data. For the asymmetric case, the Gumbel and Gompertz distributions are the most commonly used.
It is possible to impose restrictions on the thresholds, or on the effects of the covariates; for example, the effects may be allowed to vary across the response categories or constrained to be constant.
It is plausible for a predictor to have a specific level of impact on each category of the response. In that case, the $J-1$ linear predictors are of the form $\eta_j = \alpha_j + x^t \delta_j$, with $\beta = (\alpha_1,\ldots, \alpha_{J-1}, \delta_1^t, \ldots, \delta_{J-1}^t)$, and the associated design matrix is:
\begin{equation}
Z_c=\left(
\begin{array}{cccccc}
1 & & & x^t & & \\
 & \ddots & & & \ddots & \\
 & & 1 & & & x^t
\end{array}
\right)_{(J-1) \times (J-1)(1 + p)}
\end{equation}
Another option is to constrain the effects of the covariates to be constant across the response categories, so that there is only a global effect that is not specific to the categories. This is known as the parallelism assumption, for which the constrained space is represented by:
\begin{equation}
Z_p=\left(
\begin{array}{cccc}
1 & & & x^t \\
 & \ddots & & \vdots \\
 & & 1 & x^t
\end{array}
\right)_{(J-1) \times (J-1 + p)}
\end{equation}
The first case $(Z_c)$ is named by @peyhardi_new the \textit{complete} design, whereas the second $(Z_p)$ is the \textit{parallel} design. These two matrices are sufficient to define all the classical models. A third option is to consider both kinds of effects, complete and parallel; this is known as the \textit{partial parallel} design:
\begin{equation}
Z=\left(
\begin{array}{ccccccc}
1 & & & x_k^t & & & x_l^t \\
 & \ddots & & & \ddots & & \vdots \\
 & & 1 & & & x_k^t & x_l^t
\end{array}
\right)_{(J-1) \times ((J-1)(1 + K) + L)}
\end{equation}
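To make the notation concrete, here is a small instantiation of the definitions above (our own illustration, not taken from @peyhardi_new) for $J = 3$ response categories and a single covariate $x$ ($p = 1$):
\begin{equation*}
Z_c=\left(\begin{array}{cccc} 1 & 0 & x & 0 \\ 0 & 1 & 0 & x \end{array}\right), \qquad
Z_p=\left(\begin{array}{ccc} 1 & 0 & x \\ 0 & 1 & x \end{array}\right),
\end{equation*}
so that $Z_c$ yields category-specific slopes, $\eta_j = \alpha_j + x\,\delta_j$, whereas $Z_p$ yields a common slope, $\eta_j = \alpha_j + x\,\delta$.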
All the classical models for categorical response data can be written as an $(r,F,Z)$ triplet; for example, the multinomial logit model corresponds to (Reference, Logistic, Complete) and the proportional odds model to (Cumulative, Logistic, Parallel).
The glmcat package
We used the 223 observations of the boys' disturbed dreams benchmark dataset, drawn from a study that cross-classified boys by their age $x$ and the severity of their disturbed dreams $y$ [@maxwell1961Analyzing]. The data is available as the object DisturbedDreams in the package glmcat. For more information, see the manual entry for the DisturbedDreams data: help(DisturbedDreams).
# devtools::load_all()
library(GLMcat)
data("DisturbedDreams") summary(DisturbedDreams)
We will fit the model $(Reference, Logistic, Complete)$ to the DisturbedDreams data using the function glmcat. We save the fitted model in the object mod_ref_log_c and print it by simply typing its name:
DisturbedDreams$Level <- as.factor(as.character(DisturbedDreams$Level))
mod_ref_log_c <- glmcat(
  formula = Level ~ Age,
  ratio = "reference",
  cdf = "logistic",
  ref_category = "Very.severe",
  data = DisturbedDreams
)
mod_ref_log_c
The most common R functions that describe different model features are available for glmcat objects:
summary(mod_ref_log_c)
nobs(mod_ref_log_c)
coef(mod_ref_log_c)
logLik(mod_ref_log_c)
AIC(mod_ref_log_c)
BIC(mod_ref_log_c)
Predictions can be obtained with the function predict_glmcat or, equivalently, with the generic predict method. We are going to predict the response for 3 randomly selected observations:
# Random observations
set.seed(13)
ind <- sample(x = 1:nrow(DisturbedDreams), size = 3)
# Probabilities
predict(mod_ref_log_c, newdata = DisturbedDreams[ind, ], type = "prob")
# Linear predictor
predict(mod_ref_log_c, newdata = DisturbedDreams[ind, ], type = "linear.predictor")
Now we illustrate how to predict for a set of new observations. Suppose we want to predict the severity of dreams for 3 individuals whose ages are 5, 9.5 and 15, respectively:
# New data: the ages of the three new individuals
new_data <- data.frame(Age = c(5, 9.5, 15))
predict(mod_ref_log_c, newdata = new_data, type = "prob")
Assume that we are interested in making the effect of the predictor variable parallel; to that end, we pass the name of the predictor variable to the parameter parallel. The model to fit corresponds to the triplet (Reference, Logistic, parallel):
mod2 <- glmcat(
  formula = Level ~ Age,
  cdf = "logistic",
  parallel = "Age",
  ref_category = "Very.severe",
  data = DisturbedDreams
)
summary(mod2)
logLik(mod2)
Another variation of the reference model is obtained by changing the cdf. Let's now fit the model (Reference, Student (0.5), Complete):
mod3 <- glmcat(
  formula = Level ~ Age,
  ref_category = "Very.severe",
  data = DisturbedDreams,
  cdf = list("student", 0.5)
)
summary(mod3)
logLik(mod3)
The equivalence between the $(Adjacent, Logistic, Complete)$ and $(Reference, Logistic, Complete)$ models is shown by comparing the log-likelihoods of the two models:
logLik(mod_ref_log_c) # recall (Reference, Logistic, Complete)
mod_adj_log_c <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  data = DisturbedDreams,
  cdf = "logistic"
)
logLik(mod_adj_log_c)
summary(mod_adj_log_c)
Remark that although the log-likelihoods are equal, the parameter estimates are different $(\alpha \neq \alpha')$. Defining the matrix $A^T$ as follows:
library(xtable)
print(xtableMatharray(matrix(c(1, -1, 0, 0, 1, -1, 0, 0, 1), nrow = 3)), type = "latex")
$$
A^T = \left(\begin{array}{rrr}
1 & 0 & 0 \\
-1 & 1 & 0 \\
0 & -1 & 1
\end{array}\right)
$$
we can check that $A^T \alpha = \alpha'$.
Note: The adjacent models are stable under the reverse permutation.
$(Adjacent, Cauchy, Complete)$
mod_adj_cau_c <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "cauchy",
  categories_order = c("Not.severe", "Severe.1", "Severe.2", "Very.severe"),
  data = DisturbedDreams
)
logLik(mod_adj_cau_c)
summary(mod_adj_cau_c)
(Adjacent, Cauchy, Complete) with reversed order
mod_adj_cau_c_rev <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "cauchy",
  categories_order = c("Very.severe", "Severe.2", "Severe.1", "Not.severe"),
  data = DisturbedDreams
)
logLik(mod_adj_cau_c_rev)
summary(mod_adj_cau_c_rev)
The log-likelihoods of the last two models are the same because the Cauchy cdf is symmetric; for non-symmetric distributions this is no longer true. Note that if the Gumbel cdf is used with the reverse order, its log-likelihood is equal to that of the model using the Gompertz cdf, because the Gumbel distribution is the mirror image of the Gompertz distribution; the parameter estimates, however, are reversed:
(Adjacent, Gumbel, parallel)
adj_gumbel_p <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "gumbel",
  categories_order = c("Not.severe", "Severe.1", "Severe.2", "Very.severe"),
  parallel = c("(Intercept)", "Age"),
  data = DisturbedDreams
)
logLik(adj_gumbel_p)
summary(adj_gumbel_p)
(Adjacent, Gompertz, parallel)
adj_gompertz_rev <- glmcat(
  formula = Level ~ Age,
  ratio = "adjacent",
  cdf = "gompertz",
  categories_order = c("Very.severe", "Severe.2", "Severe.1", "Not.severe"),
  parallel = c("(Intercept)", "Age"),
  data = DisturbedDreams
)
logLik(adj_gompertz_rev)
summary(adj_gompertz_rev)
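The equality of the two log-likelihoods above reflects the mirror relation between the two cdfs. Under the standard parameterizations (our assumption; glmcat may use location-scale variants),
\begin{equation*}
F_{\mathrm{Gumbel}}(w) = \exp\{-\exp(-w)\}, \qquad F_{\mathrm{Gompertz}}(w) = 1 - \exp\{-\exp(w)\},
\end{equation*}
so that $F_{\mathrm{Gompertz}}(w) = 1 - F_{\mathrm{Gumbel}}(-w)$; reversing the category order therefore exchanges one cdf for the other.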
The sequential ratio assumes a binary process at each transition: higher levels can be reached only if the previous levels were reached at an earlier stage.
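A sketch of this interpretation in formulas, derived from the definition of the sequential ratio: since $r_j(\boldsymbol{\pi}) = \pi_j/(\pi_j + \ldots + \pi_J) = P(Y = j \mid Y \geq j) = F(\eta_j)$, the category probabilities factor into a product of binary transitions,
\begin{equation*}
\pi_j = F(\eta_j) \prod_{l=1}^{j-1} \{ 1 - F(\eta_l) \}, \qquad j = 1, \ldots, J-1.
\end{equation*}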
(Sequential, Normal, Complete)
seq_probit_c <- glmcat(
  formula = Level ~ Age,
  ratio = "sequential",
  cdf = "normal",
  data = DisturbedDreams
)
logLik(seq_probit_c)
summary(seq_probit_c)
(Cumulative, Logistic, Complete)
cum_log_co <- glmcat(
  formula = Level ~ Age,
  cdf = "logistic",
  ratio = "cumulative",
  data = DisturbedDreams
)
logLik(cum_log_co)
summary(cum_log_co)
The function glmcat has special features for cumulative models. One option of interest for this family is to constrain the thresholds to be equidistant:
(Cumulative, Logistic, Equidistant)
cum_log_co_e <- glmcat(
  formula = Level ~ Age,
  cdf = "logistic",
  ratio = "cumulative",
  data = DisturbedDreams,
  parallel = "Age",
  threshold = "equidistant"
)
logLik(cum_log_co_e)
summary(cum_log_co_e)
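Under the equidistant constraint the thresholds are assumed to be equally spaced; a presumed parameterization (our reading of the option, not a statement quoted from the package documentation) is
\begin{equation*}
\alpha_j = \alpha_1 + (j-1)\,\theta, \qquad j = 1, \ldots, J-1,
\end{equation*}
so that only the first threshold $\alpha_1$ and the common spacing $\theta$ are estimated.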
If we have a preliminary idea of the coefficients of the model, we can specify an initialization vector through the parameter beta_init of control_glmcat():
cum_log_c <- glmcat(
  formula = Level ~ Age,
  cdf = list("student", 0.8),
  ratio = "cumulative",
  data = DisturbedDreams,
  control = control_glmcat(beta_init = coef(cum_log_co))
)
logLik(cum_log_c)
summary(cum_log_c)
The equivalence between the $(Cumulative, Gompertz, parallel)$ and $(Sequential, Gompertz, parallel)$ models was demonstrated by @laara_matthes; we verify it here by fitting both models:
cum_gom_p <- glmcat(
  formula = Level ~ Age,
  cdf = "gompertz",
  ratio = "cumulative",
  data = DisturbedDreams,
  parallel = "Age"
)
logLik(cum_gom_p)
summary(cum_gom_p)
seq_gom_p <- glmcat(
  formula = Level ~ Age,
  cdf = "gompertz",
  ratio = "sequential",
  data = DisturbedDreams,
  parallel = "Age"
)
logLik(seq_gom_p)
summary(seq_gom_p)
Models for categorical response data have evolved in different fields of research under different names; some of them are fairly similar or even identical. Until recently, there was no methodology that encompassed these models in a comparable scheme. glmcat is based on the specification of a generalized linear model given by the (r,F,Z)-triplet, which groups together the proposed methodologies for modeling categorical responses. glmcat thus offers a full picture of the spectrum of models, where the user has three components to combine in order to obtain a model that meets the specifications of the problem.