lmInfl: Checks and analyzes leave-one-out (LOO) p-values and a variety of influence measures in linear regression

View source: R/lmInfl.R


Checks and analyzes leave-one-out (LOO) p-values and a variety of influence measures in linear regression

Description

This function calculates leave-one-out (LOO) p-values for all data points and identifies those resulting in "significance reversal", i.e., those whose removal makes the p-value of the model's slope traverse the user-defined \alpha-level. It also extends the classical influence measures from influence.measures with a few newer ones (e.g., 'Hadi's measure', 'Coefficient of determination ratio' and 'Pena's Si') within an output format where each outlier is marked when it exceeds the measure's specific threshold, as defined in the literature. Belsley, Kuh & Welsch's dfstat criterion is also included.

Usage

lmInfl(model, alpha = 0.05, cutoff = c("BKW", "R"), verbose = TRUE, ...) 

Arguments

model

the linear model of class lm.

alpha

the \alpha-level to use as the threshold border.

cutoff

use the cutoff-values from Belsley, Kuh & Welsch or the R-internal ones. See 'Details'.

verbose

logical. If TRUE, results are displayed on the console.

...

other arguments to lm.

Details

The algorithm
1) calculates the p-value of the full model (all points),
2) calculates a LOO-p-value for each point removed,
3) checks for significance reversal in all data points and
4) returns all models as well as classical influence.measures with LOO-p-values, \Delta p-values, slopes and standard errors attached (see the sketch below).
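
A minimal base-R sketch of steps 1) to 3), for illustration only (not the package's internal code):

## LOO p-values of the slope and check for significance reversal
set.seed(123)
x <- 1:20; y <- 5 + 0.08 * x + rnorm(20)
x <- c(x, 25); y <- c(y, 10)                       # append an influencer
fullP <- summary(lm(y ~ x))$coefficients[2, 4]     # p-value of the full model's slope
looP <- sapply(seq_along(x), function(i)
  summary(lm(y[-i] ~ x[-i]))$coefficients[2, 4])   # p-value with point i removed
alpha <- 0.05
which((fullP < alpha) != (looP < alpha))           # significance reversers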

The idea of p-value influencers was first introduced by Belsley, Kuh & Welsch, who described an influence measure pertaining directly to the change in t-statistics that will "show whether the conclusions of hypothesis testing would be affected", termed dfstat in [1, 2, 3] or dfstud in [4]:

\rm{dfstat}_{ij} \equiv \frac{\hat{\beta}_j}{s\sqrt{(X'X)^{-1}_{jj}}}-\frac{\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X'_{(i)}X_{(i)})^{-1}_{jj}}}

where \hat{\beta}_j is the j-th estimate, s is the residual standard error, X is the design matrix and (i) denotes the i-th observation deleted.
dfstat, which for the regression's slope \beta_1 is the difference of t-statistics

\Delta t = t_{\beta_1} - t_{\beta_{1(i)}} = \frac{\beta_1}{\mathrm{s.e.}(\beta_1)} - \frac{\beta_{1(i)}}{\mathrm{s.e.}(\beta_{1(i)})}

is inextricably linked to the changes in p-value \Delta p, calculated from

\Delta p = p_{\beta_1} - p_{\beta_{1(i)}} = 2\left(1 - P_t(t_{\beta_1}, \nu)\right) - 2\left(1 - P_t(t_{\beta_{1(i)}}, \nu - 1)\right)

where P_t is the Student's t cumulative distribution function with \nu degrees of freedom, and where significance reversal is attained when \alpha \in [p_{\beta_1}, p_{\beta_{1(i)}}]. Interestingly, this seemingly mandatory check of the influence of single data points on statistical inference has fallen into oblivion: apart from [1-4], there is, to the best of our knowledge, no reference to dfstat or \Delta p in the current literature on influence measures.
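
For illustration, \Delta t and \Delta p for a single deleted observation can be computed in base R as follows (a sketch on the same toy data as above; the two-sided p-values are computed from |t|):

## Delta-t and Delta-p when deleting observation i
set.seed(123)
x <- 1:20; y <- 5 + 0.08 * x + rnorm(20)
x <- c(x, 25); y <- c(y, 10)
i <- 21                                       # the appended influencer
cf  <- summary(lm(y ~ x))$coefficients
cfi <- summary(lm(y[-i] ~ x[-i]))$coefficients
nu <- length(x) - 2                           # residual df of the full model
dt <- cf[2, 3] - cfi[2, 3]                    # difference of slope t-statistics
dp <- 2 * (1 - pt(abs(cf[2, 3]), nu)) -
      2 * (1 - pt(abs(cfi[2, 3]), nu - 1))    # Delta-p as defined above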

Cut-off values for the different influence measures are by default (cutoff = "BKW") those defined in Belsley, Kuh & Welsch (1980) and additional literature.

dfbeta slope: | \Delta\beta1_i | > 2/\sqrt{n} (page 28)
dffits: | \mathrm{dffits}_i | > 2\sqrt{2/n} (page 28)
covratio: |\mathrm{covr}_i - 1| > 3k/n (page 23)
Cook's D: D_i > Q_F(0.5, k, n - k) (Cook & Weisberg, 1982)
leverage: h_{ii} > 2k/n (page 17)
studentized residual: t_i > Q_t(0.975, n - k - 1) (page 20)

If cutoff = "R", the criteria from influence.measures are employed (a sketch computing both cutoff sets follows the list below):

dfbeta slope: | \Delta\beta1_i | > 1
dffits: | \mathrm{dffits}_i | > 3\sqrt{(k/(n - k))}
covratio: |1 - \mathrm{covr}_i| > 3k/(n - k)
Cook's D: D_i > Q_F(0.5, k, n - k)
leverage: h_{ii} > 3k/n
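
Both cutoff sets can be reproduced with a few lines of base R (a sketch; loCut is a hypothetical helper, not the package's internal code):

## Cutoff values under both conventions (k = #coefficients, n = #observations)
loCut <- function(model, cutoff = c("BKW", "R")) {
  cutoff <- match.arg(cutoff)
  k <- length(coef(model)); n <- nobs(model)
  if (cutoff == "BKW")
    c(dfbeta = 2/sqrt(n), dffits = 2 * sqrt(2/n), covratio = 3 * k/n,
      cooksD = qf(0.5, k, n - k), leverage = 2 * k/n,
      rstudent = qt(0.975, n - k - 1))
  else
    c(dfbeta = 1, dffits = 3 * sqrt(k/(n - k)), covratio = 3 * k/(n - k),
      cooksD = qf(0.5, k, n - k), leverage = 3 * k/n)
}
loCut(lm(dist ~ speed, data = cars), "BKW")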

The influence output also includes the following more "recent" measures:
Hadi's measure (column "hadi"):

H_i^2 = \frac{h_{ii}}{1 - h_{ii}} + \frac{p}{1 - h_{ii}}\frac{d_i^2}{(1-d_i^2)}

where h_{ii} are the diagonals of the hat matrix (leverages), p = 2 in univariate linear regression and d_i = e_i/\sqrt{\rm{SSE}}, and threshold value \mathrm{Med}(H_i^2) + 2 \cdot \mathrm{MAD}(H_i^2).
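
A base-R sketch of Hadi's measure, assuming MAD is computed with R's mad() (which scales by 1.4826 by default; use constant = 1 for the raw median absolute deviation):

## Hadi's measure H_i^2 and its Med + 2*MAD cutoff
hadiMeasure <- function(model) {
  h <- hatvalues(model)                   # leverages h_ii
  e <- residuals(model)
  p <- length(coef(model))
  d <- e / sqrt(sum(e^2))                 # d_i = e_i / sqrt(SSE)
  H2 <- h/(1 - h) + p/(1 - h) * d^2/(1 - d^2)
  list(H2 = H2, cutoff = median(H2) + 2 * mad(H2))
}
hadiMeasure(lm(dist ~ speed, data = cars))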

Coefficient of Determination Ratio (column "cdr"):

\mathrm{CDR}_i = \frac{R_{(i)}^2}{R^2}

with R_{(i)}^2 being the coefficient of determination without value i, and threshold

\frac{B_{\alpha,p/2,(n-p-2)/2}}{B_{\alpha,p/2,(n-p-1)/2}}
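
A brute-force sketch of the ratio via n leave-one-out refits (cdr is a hypothetical helper; the threshold line assumes that B_{\alpha,a,b} denotes the \alpha-quantile of a Beta(a, b) distribution):

## CDR_i = R^2_(i) / R^2
cdr <- function(model) {
  mf <- model.frame(model)
  R2 <- summary(model)$r.squared
  sapply(seq_len(nrow(mf)), function(i)
    summary(update(model, data = mf[-i, ]))$r.squared / R2)
}
fit <- lm(dist ~ speed, data = cars)
head(cdr(fit))
p <- length(coef(fit)); n <- nobs(fit); alpha <- 0.05
qbeta(alpha, p/2, (n - p - 2)/2) / qbeta(alpha, p/2, (n - p - 1)/2)  # assumed threshold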

Pena's Si (column "Si"):

S_i = \frac{\mathbf{s}'_i\mathbf{s}_i}{p\widehat{\mathrm{var}}(\hat{y}_i)}

where \mathbf{s}_i is the vector of differences between the fitted value \hat{y}_i from the original model and the corresponding fitted values after single-point deletion, \hat{y}_i - \hat{y}_{i(-1)}, \ldots, \hat{y}_i - \hat{y}_{i(-n)}, p is the number of parameters, and \widehat{\mathrm{var}}(\hat{y}_i) = s^2 h_{ii} with s^2 = (\mathbf{e}'\mathbf{e})/(n - p), \mathbf{e} being the residuals. In this package, a cutoff value of 0.9 is used, as the published criterion of |\mathbf{S}_i - \mathrm{Med}(\mathbf{S})| \ge 4.5\,\mathrm{MAD}(\mathbf{S}) seemed too conservative. Results from this function were verified by Prof. Daniel Pena through personal communication.
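
A brute-force sketch of S_i via n leave-one-out refits (penaSi is a hypothetical helper; the package may use a closed-form computation instead):

## Pena's S_i = s_i' s_i / (p * var(yhat_i))
penaSi <- function(model) {
  mf   <- model.frame(model)
  n    <- nrow(mf); p <- length(coef(model))
  h    <- hatvalues(model)
  s2   <- sum(residuals(model)^2) / (n - p)    # s^2 = e'e / (n - p)
  yhat <- fitted(model)
  ## column j holds yhat_i - yhat_i(-j) for all i
  D <- sapply(seq_len(n), function(j)
    yhat - predict(update(model, data = mf[-j, ]), newdata = mf))
  rowSums(D^2) / (p * s2 * h)                  # s_i's_i divided by p * s^2 * h_ii
}
head(penaSi(lm(dist ~ speed, data = cars)))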

Value

A list with the following items:

origModel

the original model with all data points.

finalModels

a list of final models with the influencer(s) removed.

infl

a matrix with the original data, classical influence.measures, studentized residuals, leverages, dfstat, LOO-p-values, LOO-slopes/intercepts and their \Delta's, LOO-standard errors and R^2s. Influence measures that exceed their specific threshold (see inflPlot) are marked with asterisks.

raw

same as infl, but with pure numeric data.

sel

a vector with the influencers' indices.

alpha

the selected \alpha-level.

origP

the original model's p-value.
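
A short illustration of accessing the returned components (using the data of Example #1 from below, and assuming only the components documented above):

set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
a <- c(a, 25); b <- c(b, 10)
res <- lmInfl(lm(b ~ a), verbose = FALSE)
res$sel          # indices of significance reversers
res$origP        # p-value of the full model
head(res$raw)    # numeric influence matrix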

Author(s)

Andrej-Nikolai Spiess

References

For dfstat / dfstud:
Regression diagnostics: Identifying influential data and sources of collinearity.
Belsley DA, Kuh E, Welsch RE.
John Wiley, New York, USA (2004).

Econometrics, 5th ed.
Baltagi B.
Springer-Verlag Berlin, Germany (2011).

Growth regressions and what the textbooks don't tell you.
Temple J.
Bull Econom Res, 52, 2000, 181-205.

Robust Regression and Outlier Detection.
Rousseeuw PJ & Leroy AM.
John Wiley & Sons, New York, NY (1987).

Hadi's measure:
A new measure of overall potential influence in linear regression.
Hadi AS.
Comp Stat & Data Anal, 14, 1992, 1-27.

Coefficient of determination ratio:
On the detection of influential outliers in linear regression analysis.
Zakaria A, Howard NK, Nkansah BK.
Am J Theor Appl Stat, 3, 2014, 100-106.

On the Coefficient of Determination Ratio for Detecting Influential Outliers in Linear Regression Analysis.
Zakaria A, Gordor BK, Nkansah BK.
Am J Theor Appl Stat, 11, 2022, 27-35.

Pena's measure:
A New Statistic for Influence in Linear Regression.
Pena D.
Technometrics, 47, 2005, 1-12.

Examples

## Example #1 with single influencer and significant model (p = 0.0089).
## Removal of #21 results in p = 0.115!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
a <- c(a, 25); b <- c(b, 10)
LM1 <- lm(b ~ a)
lmInfl(LM1) 

## Example #2 with single influencer and insignificant model (p = 0.115).
## Removal of #18 results in p = 0.0227!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM2 <- lm(b ~ a)
lmInfl(LM2) 

## Example #3 with multiple influencers and significant model (p = 0.0269).
## Removal of #2, #17, #18 or #20 results in crossing p = 0.05!
set.seed(125)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM3 <- lm(b ~ a)
lmInfl(LM3) 

## Large Example #4 with top 10 influencers and significant model (p = 6.72E-8).
## Not possible to achieve a crossing of alpha with any point despite strong noise.
set.seed(123)
a <- 1:100
b <- 5 + 0.08 * a + rnorm(100, 0, 5)
LM4 <- lm(b ~ a)
lmInfl(LM4) 
