dx: Diagnostics for binomial regression

In dardisco/LogisticDx: Diagnostic Tests for Models with a Binomial Response

Description

Diagnostics for binomial regression

Returns diagnostic measures for a binary regression model by covariate pattern

Usage

```
dx(x, ...)

## S3 method for class 'glm'
dx(x, ..., byCov = TRUE)
```

Arguments

 `x` A regression model with class `glm` and `x$family$family == "binomial"`.

 `...` Additional arguments which can be passed to ?stats::model.matrix e.g. `contrasts.arg`, which can be used for `factor` coding.

 `byCov` Return values by covariate pattern, rather than by individual observation.

Value

A `data.table`, with rows sorted by `dBhat`.

If `byCov==TRUE`, there is one row per covariate pattern with at least one observation.

The initial columns give the predictor variables 1 ... p.
Subsequent columns are labelled as follows:

 `y` The observed number of observations with y=1 for this covariate pattern in the model data.

 `P` Probability of y=1 for this covariate pattern. This is given by the inverse of the link function, `x$family$linkinv`. See: ?stats::family

 `n` Number of observations with these covariates. If `byCov=FALSE` then this will be =1 for all observations.

 `yhat` The predicted number of observations having a response of y=1, according to the model. This is: yhat[i] = n[i] * P[i]

 `h` Leverage, the diagonal of the hat matrix used to generate the model: H = V^0.5 X (X'VX)^-1 X'V^0.5 Here ^-1 is the inverse and ' is the transpose of a matrix. X is the matrix of predictors, given by `stats::model.matrix`. V is an N * N sparse matrix. All elements are =0 except for the diagonal, which is: v[i][i] = n[i] * P[i] * (1 - P[i]) The matrix (X'VX)^-1 is also the estimated covariance matrix of Bhat. Leverage is a measure of the influence of this covariate pattern on the model and is approximately h[i] = x[i] - mean(x), for 0.1 < P[i] < 0.9 That is, leverage is approximately equal to the distance of covariate pattern i from the mean mean(x). For values of P which are large (>0.9) or small (<0.1) this relationship no longer holds.

 `Pr` The Pearson residual, a measure of influence. This is: Pr[i] = (y[i] - ybar) / SD[y] where ybar and SD[y] refer to the mean and standard deviation of a binomial distribution. SD^2 = Var is the variance. E(y=1) = ybar = yhat = nP and SD[y] = (nP(1-P))^0.5 Thus: Pr[i] = (y[i] - n[i]P[i]) / (n[i]P[i](1 - P[i]))^0.5

 `dr` The deviance residual, a measure of influence: dr[i] = sign(y[i] - yhat[i]) * d[i]^0.5 where d[i] is the contribution of observation i to the model deviance, and the sign is: y[i] > yhat[i] --> sign(i) = 1; y[i] = yhat[i] --> sign(i) = 0; y[i] < yhat[i] --> sign(i) = -1. In logistic regression this is: y[i] = 1 --> dr[i] = (2 * (log(1 + e^f(x)) - f(x)))^0.5 y[i] = 0 --> dr[i] = -(2 * log(1 + e^f(x)))^0.5 where f(x) is the linear function of the predictors 1 ... p: f(x) = B[0] + B[1] * x[1][i] + ... + B[p] * x[p][i] This is also: dr[i] = sign(y[i] - yhat[i]) * [2 * (y[i] * log(y[i] / (n[i] * P[i])) + (n[i] - y[i]) * log((n[i] - y[i]) / (n[i] * (1 - P[i]))))]^0.5 To avoid division by zero when y[i] = 0: dr[i] = -(2 * n[i] * |log(1 - P[i])|)^0.5 Similarly, to avoid log(Inf) when y[i] = n[i]: dr[i] = (2 * n[i] * |log(P[i])|)^0.5 The above equations are used when calculating dr[i] by covariate group.

 `sPr` The standardized Pearson residual. The residual is standardized by the leverage h[i]: sPr[i] = Pr[i] / (1 - h[i])^0.5

 `sdr` The standardized deviance residual. The residual is standardized by the leverage, as above: sdr[i] = dr[i] / (1 - h[i])^0.5

 `dChisq` The change in the Pearson chi-square statistic with observation i removed. Given by: dChisq[i] = sPr[i]^2 = Pr[i]^2 / (1 - h[i]) where sPr[i] is the standardized Pearson residual, Pr[i] is the Pearson residual and h[i] is the leverage. dChisq[i] should be <4 if the observation has little influence on the model.

 `dDev` The change in the deviance statistic D = SUM dr[i]^2 with observation i excluded. It is scaled by the leverage h[i] as above: dDev[i] = sdr[i]^2 = dr[i]^2 / (1 - h[i])

 `dBhat` The change in Bhat with observation i excluded. This is scaled by the leverage as above: dBhat[i] = h[i] * sPr[i]^2 / (1 - h[i]) where sPr[i] is the standardized Pearson residual. dBhat[i] should be <1 if the observation has little influence on the model coefficients.

Note

By default, values for the statistics are calculated by covariate pattern. Different values may be obtained if calculated for each individual observation (e.g. rows in a `data.frame`).

Generally, the values calculated by covariate pattern are preferred, particularly where the number of observations in a group is >5.
In this case Pearson's chi-squared and the deviance statistic should follow a chi-squared distribution with i - p degrees of freedom, where i is the number of covariate patterns and p the number of model parameters.
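To make the covariate-pattern grouping concrete, here is a small Python sketch (hypothetical data, not the package's implementation) that collapses individual binary observations into patterns, yielding the per-pattern counts n and y described under Value:

```python
from collections import Counter

# Individual observations: (covariate tuple, binary outcome).
# Rows sharing the same covariate values form one covariate pattern.
obs = [((35, 0), 1), ((35, 0), 0), ((35, 0), 1),
       ((42, 1), 0), ((42, 1), 0),
       ((29, 1), 1)]

n = Counter()  # observations per pattern
y = Counter()  # outcomes equal to 1 per pattern
for pattern, outcome in obs:
    n[pattern] += 1
    y[pattern] += outcome

# Diagnostics computed by covariate pattern use these grouped counts;
# with byCov=FALSE each row is instead its own "pattern" with n = 1.
```

Here six observations collapse into three covariate patterns, e.g. pattern (35, 0) has n = 3 and y = 2.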

See Also

`plot.glm`
Examples

```
## H&L 2nd ed. Table 5.8. Page 182.
## Pattern nos. 31, 477, 468
data(uis)
uis <- within(uis, {
    NDRGFP1 <- 10 / (NDRGTX + 1)
    NDRGFP2 <- NDRGFP1 * log((NDRGTX + 1) / 10)
})
(d1 <- dx(g1 <- glm(DFREE ~ AGE + NDRGFP1 + NDRGFP2 + IVHX + RACE +
                    TREAT + SITE + AGE:NDRGFP1 + RACE:SITE,
                    family=binomial, data=uis)))
d1[519:521, ]
```