sAUC: Semi-parametric Area Under the Curve (AUC) regression

Perform AUC analyses with discrete covariates and a semi-parametric estimation

This package has not been submitted to CRAN yet. You can install its development version from GitHub:

```r
# install.packages("devtools")
devtools::install_github("sbohora/sAUC")
```

If you encounter a bug, please raise an issue with a minimal reproducible example on GitHub.

In many applications, a statistical analysis needs to compare two groups while adjusting for multiple covariates. In clinical trials, for instance, covariate adjustment is needed to improve the precision of the treatment comparison and to assess effect modification. sAUC is a semi-parametric AUC regression model that compares two treatment groups on a non-normal outcome while adjusting for discrete covariates. What it is and why it was proposed are described in more detail in this paper. A major motivation for the method is that it is computationally simple: both the parameter estimates and their standard errors have closed forms.

We consider applications that compare a response variable y between two groups (A and B) while adjusting for k categorical covariates $X_1,X_2,...,X_k$. The response variable y is a continuous or ordinal variable that is not normally distributed. Without loss of generality, we assume each covariate is coded such that $X_i=1,...,n_i$, for $i=1,...,k$. For each combination of the levels of the covariates, we define the Area Under the ROC curve (AUC) in the following way:

$\pi_{x_1 x_2...x_k}=P(Y^A>Y^B|X_1=x_1,X_2=x_2,...,X_k=x_k)+\frac{1}{2} P(Y^A=Y^B|X_1=x_1,X_2=x_2,...,X_k=x_k),$

where $x_1=1,...,n_1,...,x_k=1,...,n_k$, and $Y^A$ and $Y^B$ are two randomly chosen observations from Groups A and B, respectively. The second term in the above equation accounts for ties.

For each covariate $X_i$, without loss of generality, we use the last category as the reference category and define ($n_i-1$) dummy variables $X_i^{(1)},X_i^{(2)},...,X_i^{(n_i-1)}$ such that

$X_i^{(j)}(x)=1$ if $j=x$, and $X_i^{(j)}(x)=0$ if $j \ne x$,

where $i=1,...,k; j=1,...,n_i-1; x=1,...,n_i$. We model the association between the AUC $\pi_{x_1 x_2...x_k}$ and the covariates using a logistic model. Such a model specifies that the logit of $\pi_{x_1 x_2...x_k}$ is a linear combination of terms that are products of the dummy variables defined above. Specifically,

$logit(\pi_{x_1 x_2...x_k})=Z_{(x_1 x_2...x_k)} \boldsymbol{\beta},$

where $Z_{(x_1 x_2...x_k)}$ is a row vector whose elements are zeroes or ones and are products of $X_1^{(1)}(x_1),...,X_1^{(n_1-1)}(x_1),...,X_k^{(1)}(x_k),...,X_k^{(n_k-1)}(x_k)$, and $\boldsymbol{\beta}$ is a column vector of nonrandom unknown parameters. Now, define a column vector $\boldsymbol{\pi}$ by stacking up $\pi_{x_1 x_2...x_k}$ and a matrix Z by stacking up $Z_{(x_1 x_2...x_k)}$, as $x_i$ ranges from 1 to $n_i$, $i=1,...,k$. Our final model is

$logit(\boldsymbol{\pi})=Z\boldsymbol{\beta} \qquad (1)$

We use a logit transformation of the AUC, rather than the original AUC, for variance stabilization. We will illustrate the above general model using examples.
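As a small illustration of how a design matrix Z with last-category reference coding can be built, here is a minimal base R sketch. The covariate names `X1` and `X2`, their numbers of levels, and the choice of a main-effects model with an intercept are assumptions made for this example only; they are not taken from the package.

```r
# Hypothetical setup: X1 has 2 levels and X2 has 3 levels.
# expand.grid() lists every covariate cell (x_1, x_2), one per row.
cells <- expand.grid(X1 = factor(1:2), X2 = factor(1:3))

# Use the LAST level of each factor as the reference category,
# matching the dummy-variable definition in the text.
contrasts(cells$X1) <- contr.treatment(2, base = 2)
contrasts(cells$X2) <- contr.treatment(3, base = 3)

# Assumed model: intercept plus main effects only.
# Each row of Z is the 0/1 row vector for one cell.
Z <- model.matrix(~ X1 + X2, data = cells)
print(Z)
```

The row for the cell where both covariates sit at their last level contains only the intercept, so the remaining columns measure departures from that reference cell on the logit scale.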

First, we denote the number of observations with covariates $X_1=i_1,...,X_k=i_k$ in groups A and B by $N_{i_1...i_k}^A$ and $N_{i_1...i_k}^B$, respectively. We assume both $N_{i_1...i_k}^A$ and $N_{i_1...i_k}^B$ are greater than zero in the following development. An unbiased estimator of $\pi_{i_1...i_k}$ proposed by Mann and Whitney (1947) is

$\hat{\pi}_{i_1...i_k}=\frac{\sum_{l=1}^{N_{i_1...i_k}^A} \sum_{j=1}^{N_{i_1...i_k}^B} I_{lj}}{N_{i_1...i_k}^A N_{i_1...i_k}^B},$

where $I_{lj}$ is shorthand for $I_{i_1...i_k; lj}$, with

$I_{i_1...i_k; lj} = 1$ if $Y_{i_1...i_k;l}^A > Y_{i_1...i_k;j}^B$,

$I_{i_1...i_k; lj} = \frac{1}{2}$ if $Y_{i_1...i_k;l}^A = Y_{i_1...i_k;j}^B$,

$I_{i_1...i_k; lj} = 0$ if $Y_{i_1...i_k;l}^A < Y_{i_1...i_k;j}^B$,

and $Y_{i_1...i_k; l}^A$ and $Y_{i_1...i_k; j}^B$ are observations with $X_1=i_1,...,X_k=i_k$ in groups A and B, respectively. DeLong, DeLong and Clarke-Pearson (1988) have shown that

$\hat{\pi}_{i_1...i_k} \approx N(\pi_{i_1...i_k},\sigma_{i_1...i_k}^2).$
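To make the Mann-Whitney estimator concrete, the following base R sketch computes it for a single covariate cell. The function name `auc_hat` and the data are made up for illustration; this is not the package's own implementation.

```r
# Empirical AUC for one cell: the average of I_lj over all (l, j) pairs.
auc_hat <- function(y_a, y_b) {
  # outer() builds the N_A x N_B matrix with entries
  # 1 (A > B), 0.5 (tie), or 0 (A < B).
  i_mat <- outer(y_a, y_b, function(a, b) (a > b) + 0.5 * (a == b))
  mean(i_mat)
}

y_a <- c(3, 5, 7)  # made-up Group A responses
y_b <- c(2, 5, 6)  # made-up Group B responses

# Pairwise scores sum to 5.5 over 9 pairs, so this is 5.5 / 9.
auc_hat(y_a, y_b)
```

The single tie (5 versus 5) contributes one half, which is the role of the second term in the AUC definition.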

In order to obtain an estimator for $\sigma_{i_1...i_k}^2$ , they first computed

$V_{i_1...i_k; l}^A=\frac{1}{N_{i_1...i_k}^B } \sum_{j=1}^{N_{i_1...i_k}^B} I_{lj}, l=1,...,N_{i_1...i_k}^A$

and

$V_{i_1...i_k;j}^B=\frac{1}{N_{i_1...i_k}^A } \sum_{l=1}^{N_{i_1...i_k}^A} I_{lj}, j=1,...,N_{i_1...i_k}^B$ .

Then, an estimate of the variance of the nonparametric AUC was

$\hat{\sigma}_{i_1...i_k}^2=\frac{(s_{i_1...i_k}^A)^2}{N_{i_1...i_k}^A} + \frac{(s_{i_1...i_k}^B)^2}{N_{i_1...i_k}^B},$

where $(s_{i_1...i_k}^A)^2$ and $(s_{i_1...i_k}^B)^2$ were the sample variances of $V_{i_1...i_k; l}^A; l=1,...,N_{i_1...i_k}^A$ and $V_{i_1...i_k; j}^B; j=1,...,N_{i_1...i_k}^B,$ respectively. Clearly, both $N_{i_1...i_k}^A$ and $N_{i_1...i_k}^B$ need to be at least two in order to compute $\hat{\sigma}_{i_1...i_k}^2$.
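The variance estimate above can be sketched in a few lines of base R. The function name `delong_var` and the data are hypothetical, and the sketch assumes each group has at least two observations so that the sample variances exist.

```r
# DeLong-style variance estimate of the empirical AUC in one cell.
delong_var <- function(y_a, y_b) {
  stopifnot(length(y_a) >= 2, length(y_b) >= 2)
  # Pairwise indicator matrix I_lj (rows: Group A, columns: Group B).
  i_mat <- outer(y_a, y_b, function(a, b) (a > b) + 0.5 * (a == b))
  v_a <- rowMeans(i_mat)  # V^A_l: average over Group B for each l
  v_b <- colMeans(i_mat)  # V^B_j: average over Group A for each j
  # Sample variances of the V's, each divided by its group size.
  var(v_a) / length(y_a) + var(v_b) / length(y_b)
}

y_a <- c(3, 5, 7)  # made-up Group A responses
y_b <- c(2, 5, 6)  # made-up Group B responses
delong_var(y_a, y_b)
```

Because `var()` uses the n - 1 denominator, a group with a single observation would give an undefined variance, which is why the function checks the group sizes first.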

Now, in order to estimate parameters in Model (1), we first derive the asymptotic variance of $\hat{\gamma}_{i_1...i_k}$ using the delta method, which results in

$\hat{\gamma}_{i_1...i_k}=logit(\hat{\pi}_{i_1...i_k}) \approx N(logit(\pi_{i_1...i_k}),\tau_{i_1...i_k}^2),$

where $\tau_{i_1...i_k}^2$ is estimated by $\hat{\tau}_{i_1...i_k}^2=\frac{\hat{\sigma}_{i_1...i_k}^2}{\hat{\pi}_{i_1...i_k}^2 (1-\hat{\pi}_{i_1...i_k})^2}$.
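The delta-method step can be checked numerically. Since the derivative of logit(p) is 1 / (p (1 - p)), the variance on the logit scale is the variance of the AUC estimate divided by p^2 (1 - p)^2. The sketch below, with the hypothetical helper `logit_auc` and made-up inputs, applies this to one cell.

```r
# Logit-transform an AUC estimate and propagate its variance
# with the delta method. Inputs are per-cell estimates.
logit_auc <- function(pi_hat, sigma2_hat) {
  stopifnot(all(pi_hat > 0 & pi_hat < 1))  # logit is undefined at 0 and 1
  list(
    gamma = qlogis(pi_hat),                          # log(p / (1 - p))
    tau2  = sigma2_hat / (pi_hat^2 * (1 - pi_hat)^2) # delta-method variance
  )
}

# Made-up values: AUC estimate 0.8 with variance 0.01.
res <- logit_auc(pi_hat = 0.8, sigma2_hat = 0.01)
res$gamma  # log(4)
res$tau2   # 0.01 / 0.0256 = 0.390625
```

A cell whose estimated AUC is exactly 0 or 1 has no finite logit, so this sketch stops in that case; how such cells should be handled is left open here.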

Rewriting the above model, we obtain

$\hat{\gamma}_{i_1...i_k}=logit(\hat{\pi}_{i_1...i_k}) =Z_{i_1...i_k} \boldsymbol{\beta} + \epsilon_{i_1...i_k},$

where $\epsilon_{i_1...i_k} \approx N(0,\tau_{i_1...i_k}^2)$. Then, by stacking up the $\hat{\gamma}_{i_1...i_k}$ to be $\boldsymbol{\hat{\gamma}}$, $Z_{i_1...i_k}$ to be $\boldsymbol{Z}$, and $\epsilon_{i_1...i_k}$ to be $\boldsymbol{\epsilon}$, we have

$\boldsymbol{\hat{\gamma}} =logit \boldsymbol{\hat{\pi}} = \boldsymbol{Z\beta + \epsilon}$ ,

where $E(\boldsymbol{\epsilon})=0$ and $\hat{T}=Var(\boldsymbol{\epsilon})=diag(\hat{\tau}_{i_1...i_k}^2)$, which is a diagonal matrix. Finally, using the generalized least squares method, we estimate the parameters $\boldsymbol{\beta}$ and their variance-covariance matrix as follows:

$\boldsymbol{\hat{\beta}} ={(\boldsymbol{Z}^T \hat{T}^{-1} \boldsymbol{Z})}^{-1} \boldsymbol{Z}^T \hat{T}^{-1} \boldsymbol{\hat{\gamma}}$

and $\hat{V}(\boldsymbol{\hat{\beta}}) = {(\boldsymbol{Z}^T \hat{T}^{-1} \boldsymbol{Z})}^{-1}.$
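The closed-form generalized least squares step can be written directly with matrix algebra in base R. The helper name `gls_fit` and the four-cell example are assumptions for illustration, not the package's code.

```r
# Closed-form GLS: beta = (Z' T^-1 Z)^-1 Z' T^-1 gamma,
# with variance-covariance matrix (Z' T^-1 Z)^-1.
gls_fit <- function(gamma_hat, tau2_hat, Z) {
  # T is diagonal, so its inverse is the diagonal matrix of 1 / tau^2.
  t_inv  <- diag(1 / tau2_hat, nrow = length(tau2_hat))
  v_beta <- solve(t(Z) %*% t_inv %*% Z)
  beta   <- v_beta %*% t(Z) %*% t_inv %*% gamma_hat
  list(beta = drop(beta), vcov = v_beta)
}

# Made-up example with four usable cells and one binary covariate.
Z         <- cbind(intercept = 1, x = c(1, 1, 0, 0))
gamma_hat <- c(1.0, 1.2, 0.4, 0.6)  # logit AUC per cell
tau2_hat  <- c(0.1, 0.1, 0.2, 0.2)  # delta-method variances per cell

fit <- gls_fit(gamma_hat, tau2_hat, Z)
fit$beta  # intercept 0.5, slope 0.6
fit$vcov
```

In this example the weights are equal within each level of `x`, so the fit reproduces the two group means of `gamma_hat` (0.5 and 1.1), which gives a quick sanity check on the algebra.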

The above equations can be used to construct 100(1-α)% Wald confidence intervals for $\beta_i$ using the formula

$\hat{\beta}_i \pm Z_{1-\frac{\alpha}{2}} \sqrt{\hat{V}(\hat{\beta}_i)},$

where $Z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})^{th}$ quantile of the standard normal distribution. Equivalently, we reject

$H_0:\beta_i = 0$ if $|\hat{\beta}_i| > Z_{1-\frac{\alpha}{2}} \sqrt{\hat{V}(\hat{\beta}_i)}.$

The p-value for testing $H_0$ is $2 * P(Z > |\hat{\beta}_i|/\sqrt{\hat{V}(\hat{\beta}_i)})$, where Z is a random variable with the standard normal distribution.

Now, the total number of cells (combinations of covariates $X_1,...,X_k$) is $n_1 n_2...n_k$. As mentioned earlier, for a cell to be usable in the estimation, it needs at least two observations from Group A and two observations from Group B. As long as the total number of usable cells is larger than the dimension of $\boldsymbol{\beta}$, the matrix $\boldsymbol{Z}^T \hat{T}^{-1} \boldsymbol{Z}$ is invertible and, consequently, $\boldsymbol{\hat{\beta}}$ is computable and model (1) is identifiable.
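The Wald interval and p-value formulas translate into base R as follows. The helper `wald_summary` and the numeric inputs are hypothetical; the inputs stand in for an estimate of the coefficients and its variance-covariance matrix.

```r
# Wald confidence intervals and two-sided p-values for each coefficient.
wald_summary <- function(beta_hat, v_beta, alpha = 0.05) {
  se     <- sqrt(diag(v_beta))    # standard errors from the diagonal
  z_crit <- qnorm(1 - alpha / 2)  # (1 - alpha/2) standard normal quantile
  z_stat <- beta_hat / se
  data.frame(
    estimate = beta_hat,
    lower    = beta_hat - z_crit * se,
    upper    = beta_hat + z_crit * se,
    # 2 * P(Z > |z|), computed from the upper tail for accuracy.
    p_value  = 2 * pnorm(abs(z_stat), lower.tail = FALSE)
  )
}

# Made-up inputs for two coefficients.
beta_hat <- c(0.5, 0.6)
v_beta   <- matrix(c(0.10, -0.10, -0.10, 0.15), nrow = 2)

out <- wald_summary(beta_hat, v_beta)
print(out)
```

A coefficient is rejected at level alpha exactly when its interval excludes zero, which is the same event as its p-value falling below alpha.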