knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(riskclustr)
In epidemiologic studies polytomous logistic regression is commonly used in the study of etiologic heterogeneity when data are from a case-control study, and the method has good statistical properties. Although polytomous logistic regression can be implemented using available software, the additional calculations needed to perform a thorough analysis of etiologic heterogeneity are cumbersome. To facilitate use of this method we provide functions eh_test_subtype()
and eh_test_marker()
to address two key questions regarding etiologic heterogeneity:
Whether disease subtypes are pre-specified or formed by cross-classification of individual disease markers, the resulting polytomous logistic regression model is the same. Let $i$ index study subjects, $i = 1, \ldots, N$, let $m$ index disease subtypes, $m = 0, \ldots M$, where $m=0$ denotes control subjects, and let $p$ index risk factors, $p = 1, \ldots, P$. The polytomous logistic regression model is specified as
$$\Pr(Y = m | \mathbf{X}) = \frac{\exp(\mathbf{X}^T \boldsymbol{\beta}{\boldsymbol{\cdot} m})}{\mathbf{1} + \exp(\mathbf{X}^T \boldsymbol{\beta}) \mathbf{1}}$$ where $\mathbf{X}$ is the $(P+1) \times N$ matrix of risk factor values, with the first row all ones for the intercept, and $\boldsymbol{\beta}$ is the $(P+1) \times M$ matrix of regression coefficients. $\boldsymbol{\beta}{\boldsymbol{\cdot} m}$ indicates the $m$th column of the matrix $\boldsymbol{\beta}$ and $\mathbf{1}$ represents a vector of ones of length $M$.
If disease subtypes are pre-specified, either based on clustering high-dimensional disease marker data or based on a single disease marker or combinations of disease markers, then statistical tests for etiologic heterogeneity according to each risk factor can be conducted using the eh_test_subtype()
function.
Estimates of the parameters of interest related to the question of whether risk factor effects differ across subtypes of disease, $\hat{\boldsymbol{\beta}}$, and the associated estimated variance-covariance matrix, $\widehat{cov}(\hat{\boldsymbol{\beta}})$, are obtained directly from the resulting polytomous logistic regression model. Each $\beta_{pm}$ parameter represents the log odds ratio for a one-unit change in risk factor $p$ for subtype $m$ disease versus controls. Hypothesis tests for the question of whether a specific risk factor effect differs across subtypes of disease can be conducted separately for each risk factor $p$ using a Wald test of the hypothesis
$$H_{0_{\beta_{p.}}}: \beta_{p1} = \dots = \beta_{pM}$$
Using the subtype_data
simulated dataset, we can examine the influence of risk factors x1
, x2
, and x3
on the 4 pre-specified disease subtypes in variable subtype
using the following code:
library(riskclustr) mod1 <- eh_test_subtype( label = "subtype", M = 4, factors = list("x1", "x2", "x3"), data = subtype_data)
See the function documentation for details of function arguments.
The resulting estimates $\hat{\boldsymbol{\beta}}$ can be accessed with
mod1$beta
knitr::kable(mod1$beta)
the associated standard deviation estimates $\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}$ with
mod1$beta_se
knitr::kable(mod1$beta_se)
and the heterogeneity p-values with
mod1$eh_pval
knitr::kable(mod1$eh_pval)
An overall formatted dataframe containing $\hat{\boldsymbol{\beta}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\Big)$ and heterogeneity p-values p_het
to test the null hypotheses $H_{0_{\beta_{p.}}}$ can be obtained as
mod1$beta_se_p
knitr::kable(mod1$beta_se_p)
Because it is often of interest to examine associations in a case-control study on the odds ratio (OR) scale rather than the original parameter estimate scale, it is also possible to obtain a matrix containing $OR=\exp(\hat{\boldsymbol{\beta}})$, along with 95\% confidence intervals and heterogeneity p-values p_het
to test the null hypotheses $H_{0_{\beta_{p.}}}$ using
mod1$or_ci_p
knitr::kable(mod1$or_ci_p)
If disease subtypes are formed by cross-classifying individual binary disease markers, then statistical tests for associations between risk factors and individual disease markers can be conducted using the eh_test_marker()
funtion.
Let $k$ index disease markers, $k = 1, \ldots, K$. Here the $M$ disease subtypes are formed by cross-classification of the $K$ binary disease markers, so that we have $M = 2^K$ disease subtypes.
To evaluate the independent influences of individual disease markers, it is convenient to transform the parameters in $\boldsymbol{\beta}$ using the one-to-one linear transformation
$$\hat{\boldsymbol{\gamma}} = \frac{\hat{\boldsymbol{\beta}} \mathbf{L}}{M/2}.$$ Here $\mathbf{L}$ is an $M \times K$ contrast matrix such that the entries are -1 if disease marker $k$ is absent for disease subtype $m$ and 1 if disease marker $k$ is present for disease subtype $m$. $\boldsymbol{\gamma}$ is then the $(P+1) \times K$ matrix of parameters that reflect the independent effects of distinct disease markers. Each element of the $\boldsymbol{\gamma}$ parameters represents the average of differences in log odds ratios between disease subtypes defined by different levels of the $k$th disease marker with respect to the $p$th risk factor when the other disease markers are held constant. Variance estimates corresponding to each $\hat{\gamma}_{pk}$ are obtained using
$$\widehat{var}(\hat{\gamma}{pk}) = \left(\frac{M}{2}\right)^{-2} \mathbf{L}{\boldsymbol{\cdot} k}^T \widehat{cov}(\hat{\boldsymbol{\beta}}{p \boldsymbol{\cdot}}^T) \mathbf{L}{\boldsymbol{\cdot} k}$$ where $\mathbf{L}{\boldsymbol{\cdot} k}$ is the $k$th column of the $\mathbf{L}$ matrix and the estimated variance-covariance matrix $\widehat{cov}(\hat{\boldsymbol{\beta}}{p \boldsymbol{\cdot}})$ for each risk factor $p$ is obtained directly from the polytomous logistic regression model.
Hypothesis tests for the question of whether a risk factor effect differs across levels of each individual disease marker of which the disease subtypes are comprised can be conducted separately for each combination of risk factor $p$ and disease marker $k$ using a Wald test of the hypothesis
$$H_{0_{{\gamma_{pk}}}}: \gamma_{pk} = 0.$$
Using the subtype_data
simulated dataset, we can examine the influence of risk factors x1
, x2
, and x3
on the two individual disease markers marker1
and marker2
. These two binary disease markers will be cross-classified to form four disease subtypes that will be used as the outcome in the polytomous logistic regression model to obtain the $\hat{\boldsymbol{\beta}}$ estimates, which are then transformed in order to obtain estimates and hypothesis tests related to the individual disease markers.
library(riskclustr) mod2 <- eh_test_marker( markers = list("marker1", "marker2"), factors = list("x1", "x2", "x3"), case = "case", data = subtype_data)
See the function documentation for details of function arguments.
The resulting estimates $\hat{\boldsymbol{\gamma}}$ can be accessed with
mod2$gamma
knitr::kable(mod2$gamma)
the associated standard deviation estimates $\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}$ with
mod2$gamma_se
knitr::kable(mod2$gamma_se)
and the associated p-values with
mod2$gamma_pval
knitr::kable(mod2$gamma_pval)
An overall formatted dataframe containing the $\hat{\boldsymbol{\gamma}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\Big)$ and p-values to test the null hypotheses $H_{0_{\gamma_{pk}}}$ can be obtained as
mod2$gamma_se_p
knitr::kable(mod2$gamma_se_p)
The estimates and heterogeneity p-values for disease subtypes formed by cross-classifying these individual disease markers can also be accessed in objects beta_se_p
and or_ci_p
, as described in the section on Pre-specified subtypes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.