dx: Diagnostics for binomial regression

Description Usage Arguments Value Note See Also Examples

Description

Diagnostics for binomial regression

Returns diagnostic measures for a binary regression model by covariate pattern

Usage

1
2
3
4
dx(x, ...)

## S3 method for class 'glm'
dx(x, ..., byCov = TRUE)

Arguments

x

A regression model with class glm and x$family$family == "binomial".

...

Additional arguments which can be passed to:
?stats::model.matrix
e.g. contrasts.arg which can be used for factor coding.

byCov

Return values by covariate pattern, rather than by individual observation.

Value

A data.table, with rows sorted by dBhat.

If byCov==TRUE, there is one row per covariate pattern with at least one observation.

The initial columns give the predictor variables 1 ... p.
Subsequent columns are labelled as follows:

y

The actual number of observations with y=1 in the model data.

P

Probability of this covariate pattern.
This is given by the inverse of the link function, x$family$linkinv. See:
?stats::family

n

Number of observations with these covariates.
If byCov=FALSE then this will be =1 for all observations.

yhat

The predicted number of observations having a response of y=1, according to the model.
This is:

yhat[i] = n[i] * P[i]

h

Leverage, the diagonal of the hat matrix used to generate the model:

H = V^0.5 X (X'VX)^-1 X'V^0.5

Here ^-1 is the inverse and ' is the transpose of a matrix.
X is the matrix of predictors, given by stats::model.matrix.
V is an N * N sparse matrix. All elements are =0 except for the diagonal, which is:

v[i][i] = n[i]P[i] * (1 - P[i])

Leverage H is also the estimated covariance matrix of Bhat.
Leverage is measure of the influence of this covariate pattern on the model and is approximately

h[i] = x[i] - mean(x), 0.1 < P[i] < 0.9

That is, leverage is approximately equal to the distance of the covariate pattern i from the mean mean(x).
For values of p which are large (>0.9) or small (<0.1) this relationship no longer holds.

Pr

The Pearson residual, a measure of influence. This is:

Pr[i] = (y[i] - ybar) / SD[y]

where ybar and SD[y] refer to the mean and standard deviation of a binomial distribution.
SD^2 = Var, is the variance.

E(y=1) = ybar = yhat = nP and SE[y] = (nP(1-P))^0.5

Thus:

Pr[i] = (y[i] - n[i]P[i]) / (n[i]P[i](1 - P[i])^0.5)

dr

The deviance residual, a measure of influence:

dr[i] = sign(y[i] - yhat[i]) * d[i]^0.5

d[i] is the contribution of observation i to the model deviance.
The sign above is:

  • y[i] > yhat --> sign(i) = 1

  • y[i] = yhat --> sign(i) = 0

  • y[i] < yhat --> sign(i) = -1

In logistic regression this is:

y[i] = 1 --> dr[i] = (2 * log (1 + e^f(x) - f(x)))^0.5

y[i] = 0 --> dr[i] = (2 * log (1 + e^f(x)))

where f(x) is the linear function of the predictors 1 ... p:

f(x) = B[0] + B[1][i] * x[1][i] + ... + B[p][i] * x[p][i]

this is also:

dr[i] = sign(y[i] - yhat[i]) [2 * (y[i] * log(y[i] / n[i] * p[i])) + (n[i] - y[i]) * log((n[i] - y[i]) / (n[i] * (1 - p[i])))]^0.5

To avoid the problem of division by zero:

y[i] = 0 --> dr[i] = (2 * n[i] * | log(1 - P[i]) |)^0.5

Similarly to avoid log(Inf):

y[i] = n[i] --> dr[i] = (2 * n[i] * | log(P[i]) |)^0.5

The above equations are used when calculating dr[i] by covariate group.

sPr

The standardized Pearson residual.
The residual is standardized by the leverage h[i]:

sPr[i] = Pr[i] / (1 - h[i])^0.5

sdr

The standardized deviance residual.
The residual is standardized by the leverage, as above:

sdr[i] = dr[i] / (1 - h[i])^0.5

dChisq

The change in the Pearson chi-square statistic with observation i removed. Given by:

dChi^2 = sPr[i]^2 = Pr[i]^2 / (1 - h[i])

where sPr[i] is the standardized Pearson residual, Pr[i] is the Pearson residual and h[i] is the leverage.
dChisq[i] should be <4 if the observation has little influence on the model.

dDev

The change in the deviance statistic D = SUM dr[i] with observation i excluded.
It is scaled by the leverage h[i] as above:

dDev[i] = sdr[i]^2 = dr[i]^2 / (1 - h[i])

dBhat

The change in Bhat with observation i excluded.
This is scaled by the leverage as above:

dBhat = h[i] * sPr[i]^2 / (1 - h[i])

where sPR[i] is the standardized Pearson residual.
dBhat[i] should be <1 if the observation has little influence on the model coefficients.

Note

By default, values for the statistics are calculated by covariate pattern. Different values may be obtained if calculated for each individual obervation (e.g. rows in a data.frame).

Generally, the values calculated by covariate pattern are preferred, particularly where the number of observations in a group is >5.
In this case Pearsons chi-squared and the deviance statistic should follow a chi-squared distribution with i - p degrees of freedom.

See Also

plot.glm

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
## H&L 2nd ed. Table 5.8. Page 182.
## Pattern nos. 31, 477, 468
data(uis)
uis <- within(uis, {
    NDRGFP1 <- 10 / (NDRGTX + 1)
    NDRGFP2 <- NDRGFP1 * log((NDRGTX + 1) / 10)
})
(d1 <- dx(g1 <- glm(DFREE ~ AGE + NDRGFP1 + NDRGFP2 + IVHX +
                    RACE + TREAT + SITE +
                    AGE:NDRGFP1 + RACE:SITE,
                    family=binomial, data=uis)))
d1[519:521, ]

dardisco/LogisticDx documentation built on May 14, 2019, 6:08 p.m.