# dx: Diagnostics for binomial regression

*In dardisco/LogisticDx: Diagnostic Tests for Models with a Binomial Response*

## Description

Diagnostics for binomial regression

Returns diagnostic measures for a binary regression model, by covariate pattern.

## Usage

```r
dx(x, ...)

## S3 method for class 'glm'
dx(x, ..., byCov = TRUE)
```

## Arguments

- `x` — A regression model with class `glm` and `x$family$family == "binomial"`.
- `...` — Additional arguments which can be passed to `?stats::model.matrix`, e.g. `contrasts.arg`, which can be used for `factor` coding.
- `byCov` — Return values by covariate pattern, rather than by individual observation.
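As an illustration of the effect of `contrasts.arg`, the sketch below calls `stats::model.matrix` directly; `dx` simply forwards `...` to it. The factor `f` is invented here for the example:

```r
## Sketch: how contrasts.arg changes factor coding in the design matrix.
f <- factor(c("a", "b", "c", "a"))

m1 <- model.matrix(~ f)                                        # default treatment coding
m2 <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum")) # sum-to-zero coding

colnames(m1)  # "(Intercept)" "fb" "fc"
colnames(m2)  # "(Intercept)" "f1" "f2"
```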

## Value

A `data.table`, with rows sorted by `dBhat`.

If `byCov==TRUE`, there is one row per covariate pattern with at least one observation.
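To make the idea of a covariate pattern concrete, the grouping can be reproduced with base R alone. A minimal sketch using the built-in `mtcars` data as a stand-in (not the package's example data):

```r
## Sketch: one row per covariate pattern (unique combination of predictors),
## with y = observed successes and n = observations in that pattern.
d <- data.frame(y = mtcars$vs, cyl = mtcars$cyl, am = mtcars$am)

y_by_pat <- aggregate(y ~ cyl + am, data = d, FUN = sum)     # successes per pattern
n_by_pat <- aggregate(y ~ cyl + am, data = d, FUN = length)  # observations per pattern

pats <- data.frame(y_by_pat[c("cyl", "am")], y = y_by_pat$y, n = n_by_pat$y)
pats  # only patterns with at least one observation appear
```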

The initial columns give the predictor variables 1 ... p.
Subsequent columns are labelled as follows:

- `y` — The observed number of observations with `y=1` in the model data.
- `P` — The probability of this covariate pattern. This is given by the inverse of the link function, `x$family$linkinv`. See `?stats::family`.
- `n` — The number of observations with these covariates. If `byCov=FALSE` this is 1 for all observations.
- `yhat` — The predicted number of observations having a response of `y=1`, according to the model:

      yhat[i] = n[i] * P[i]

- `h` — Leverage, the diagonal of the hat matrix used to generate the model:

      H = V^0.5 X (X'VX)^-1 X'V^0.5

  Here `^-1` is the inverse and `'` is the transpose of a matrix. `X` is the matrix of predictors, given by `stats::model.matrix`. `V` is an N x N sparse matrix; all elements are 0 except for the diagonal, which is:

      v[i][i] = n[i] * P[i] * (1 - P[i])

  The matrix `(X'VX)^-1` is the estimated covariance matrix of `Bhat`. Leverage is a measure of the influence of covariate pattern i on the model and is approximately

      h[i] = x[i] - mean(x), for 0.1 < P[i] < 0.9

  That is, leverage is approximately the distance of covariate pattern i from the mean `mean(x)`. For values of `P` which are large (>0.9) or small (<0.1) this relationship no longer holds.
- `Pr` — The Pearson residual, a measure of influence:

      Pr[i] = (y[i] - ybar) / SD[y]

  where `ybar` and `SD[y]` are the mean and standard deviation of a binomial distribution (`SD^2 = Var`, the variance):

      E(y) = ybar = yhat = n * P
      SD[y] = (n * P * (1 - P))^0.5

  Thus:

      Pr[i] = (y[i] - n[i] * P[i]) / (n[i] * P[i] * (1 - P[i]))^0.5

- `dr` — The deviance residual, a measure of influence:

      dr[i] = sign(y[i] - yhat[i]) * d[i]^0.5

  where `d[i]` is the contribution of observation i to the model deviance, and the sign is:

      y[i] > yhat[i] --> sign(i) = 1
      y[i] = yhat[i] --> sign(i) = 0
      y[i] < yhat[i] --> sign(i) = -1

  In logistic regression this is:

      y[i] = 1 --> dr[i] = (2 * (log(1 + e^f(x)) - f(x)))^0.5
      y[i] = 0 --> dr[i] = -(2 * log(1 + e^f(x)))^0.5

  where `f(x)` is the linear function of the predictors 1 ... p:

      f(x) = B[0] + B[1] * x[1][i] + ... + B[p] * x[p][i]

  Equivalently:

      dr[i] = sign(y[i] - yhat[i]) *
              [2 * (y[i] * log(y[i] / (n[i] * P[i])) +
                    (n[i] - y[i]) * log((n[i] - y[i]) / (n[i] * (1 - P[i]))))]^0.5

  To avoid the problem of division by zero:

      y[i] = 0 --> dr[i] = -(2 * n[i] * |log(1 - P[i])|)^0.5

  Similarly, to avoid `log(Inf)`:

      y[i] = n[i] --> dr[i] = (2 * n[i] * |log(P[i])|)^0.5

  These last two equations are used when calculating `dr[i]` by covariate group.
- `sPr` — The standardized Pearson residual. The residual is standardized by the leverage `h[i]`:

      sPr[i] = Pr[i] / (1 - h[i])^0.5

- `sdr` — The standardized deviance residual, standardized by the leverage as above:

      sdr[i] = dr[i] / (1 - h[i])^0.5

- `dChisq` — The change in the Pearson chi-square statistic with observation i removed:

      dChisq[i] = sPr[i]^2 = Pr[i]^2 / (1 - h[i])

  where `sPr[i]` is the standardized Pearson residual, `Pr[i]` is the Pearson residual and `h[i]` is the leverage. `dChisq[i]` should be <4 if the observation has little influence on the model.
- `dDev` — The change in the deviance statistic `D = SUM dr[i]^2` with observation i excluded, scaled by the leverage `h[i]` as above:

      dDev[i] = sdr[i]^2 = dr[i]^2 / (1 - h[i])

- `dBhat` — The change in `Bhat` with observation i excluded, scaled by the leverage as above:

      dBhat[i] = h[i] * sPr[i]^2 / (1 - h[i])

  where `sPr[i]` is the standardized Pearson residual. `dBhat[i]` should be <1 if the observation has little influence on the model coefficients.
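The per-observation (`byCov = FALSE`) versions of these quantities can be reproduced with base R alone. A minimal sketch, using the built-in `mtcars` data as a stand-in for the package's example data; here every `n[i] = 1`:

```r
## Sketch: per-observation analogues of dx()'s columns, base R only.
fit <- glm(vs ~ wt + disp, family = binomial, data = mtcars)

P  <- fitted(fit)                       # P[i], via x$family$linkinv
h  <- hatvalues(fit)                    # leverage, diagonal of H
Pr <- residuals(fit, type = "pearson")  # (y - P) / (P * (1 - P))^0.5 when n = 1
dr <- residuals(fit, type = "deviance")

sPr <- Pr / sqrt(1 - h)                 # standardized Pearson residual
sdr <- dr / sqrt(1 - h)                 # standardized deviance residual

dChisq <- sPr^2                         # change in Pearson chi-square
dDev   <- sdr^2                         # change in deviance
dBhat  <- h * sPr^2 / (1 - h)           # change in Bhat

## The built-in Pearson residual matches the formula above:
y <- mtcars$vs
all.equal(unname(Pr), unname((y - P) / sqrt(P * (1 - P))))  # TRUE
```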

## Note

By default, values for the statistics are calculated by covariate pattern. Different values may be obtained if calculated for each individual observation (e.g. rows in a `data.frame`).

Generally, the values calculated by covariate pattern are preferred, particularly where the number of observations in a group is >5.
In this case, Pearson's chi-square and the deviance statistic should follow a chi-squared distribution with i - p degrees of freedom.
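To see why the two calculations differ, consider the same logistic model fitted per observation and per covariate pattern: the coefficients agree, but the residuals (and hence the diagnostic values) do not. A minimal sketch on simulated data; all names here are illustrative:

```r
## Sketch: per-observation vs per-covariate-pattern fits of the same model.
set.seed(1)
d <- data.frame(x = rep(c(0, 1, 2), each = 20))  # 3 covariate patterns
d$y <- rbinom(nrow(d), 1, plogis(-1 + d$x))

fit_obs <- glm(y ~ x, family = binomial, data = d)   # one row per observation

grp   <- aggregate(y ~ x, data = d, FUN = sum)       # successes per pattern
grp$n <- as.vector(table(d$x))                       # observations per pattern
fit_pat <- glm(cbind(y, n - y) ~ x, family = binomial, data = grp)

all.equal(coef(fit_obs), coef(fit_pat), tolerance = 1e-6)  # TRUE: same fit
sum(residuals(fit_obs, "pearson")^2)  # per-observation Pearson chi-square
sum(residuals(fit_pat, "pearson")^2)  # per-pattern chi-square: differs
```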

## See Also

`plot.glm`

## Examples

```r
## H&L 2nd ed. Table 5.8. Page 182.
## Pattern nos. 31, 477, 468
data(uis)
uis <- within(uis, {
    NDRGFP1 <- 10 / (NDRGTX + 1)
    NDRGFP2 <- NDRGFP1 * log((NDRGTX + 1) / 10)
})
(d1 <- dx(g1 <- glm(DFREE ~ AGE + NDRGFP1 + NDRGFP2 +
                    IVHX + RACE + TREAT + SITE +
                    AGE:NDRGFP1 + RACE:SITE,
                    family=binomial, data=uis)))
d1[519:521, ]
```