lmInfl (R Documentation)
This function calculates leave-one-out (LOO) p-values for all data points and identifies those resulting in "significance reversal", i.e. in the p-value of the model's slope traversing the user-defined \alpha-level. It also extends the classical influence measures from influence.measures with a few newer ones (e.g. Hadi's measure, the coefficient of determination ratio and Pena's Si) in an output format where each outlier is marked when exceeding the measure's specific threshold, as defined in the literature. Belsley, Kuh & Welsch's dfstat criterion is also included.
lmInfl(model, alpha = 0.05, cutoff = c("BKW", "R"), verbose = TRUE, ...)
model: the linear model of class lm.
alpha: the \alpha-level to test against.
cutoff: use the cutoff values from "BKW" (Belsley, Kuh & Welsch, the default) or "R" (those of influence.measures).
verbose: logical. If TRUE, results are displayed on the console.
...: other arguments to lm.
The algorithm
1) calculates the p-value of the full model (all points),
2) calculates a LOO p-value with each point removed in turn,
3) checks for significance reversal in all data points, and
4) returns all models as well as the classical influence.measures with LOO p-values, \Delta p-values, slopes and standard errors attached.
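As an illustration, steps 1)-3) can be sketched in a few lines of base R. This is a simplified sketch, not the package's implementation; the function name looP is made up:

looP <- function(model, alpha = 0.05) {
  data <- model.frame(model)
  ## 1) slope p-value of the full model
  pFull <- summary(model)$coefficients[2, 4]
  ## 2) LOO slope p-value with each point removed in turn
  pLOO <- sapply(seq_len(nrow(data)), function(i)
    summary(update(model, data = data[-i, ]))$coefficients[2, 4])
  ## 3) reversal when alpha lies between full and LOO p-value
  reversal <- (pFull - alpha) * (pLOO - alpha) < 0
  data.frame(pFull, pLOO, dP = pFull - pLOO, reversal)
}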
The idea of p-value influencers was first introduced by Belsley, Kuh & Welsch, and described as an influence measure pertaining directly to the change in t-statistics that will "show whether the conclusions of hypothesis testing would be affected", termed dfstat in [1, 2, 3] or dfstud in [4]:

\mathrm{dfstat}_{ij} \equiv \frac{\hat{\beta}_j}{s\sqrt{(X'X)^{-1}_{jj}}} - \frac{\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X'_{(i)}X_{(i)})^{-1}_{jj}}}

where \hat{\beta}_j is the j-th estimate, s is the residual standard error, X is the design matrix and (i) denotes the deletion of the i-th observation.
dfstat, which for the regression's slope \beta_1 is the difference of t-statistics

\Delta t = t_{\beta_1} - t_{\beta_1(i)} = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)} - \frac{\hat{\beta}_{1(i)}}{\mathrm{s.e.}(\hat{\beta}_{1(i)})}

is inextricably linked to the change in p-value, \Delta p, calculated from

\Delta p = p_{\beta_1} - p_{\beta_1(i)} = 2\left(1 - P_t(t_{\beta_1}, \nu)\right) - 2\left(1 - P_t(t_{\beta_1(i)}, \nu - 1)\right)

where P_t is the Student's t cumulative distribution function with \nu degrees of freedom, and where significance reversal is attained when \alpha \in [p_{\beta_1}, p_{\beta_1(i)}].
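For concreteness, \Delta p can be computed directly from the two t-statistics with pt(); the values below are invented for illustration:

tFull <- 2.9; tLOO <- 1.6; nu <- 19       ## hypothetical t-statistics and df
pFull <- 2 * (1 - pt(abs(tFull), nu))     ## p-value, full model
pLOO  <- 2 * (1 - pt(abs(tLOO), nu - 1))  ## p-value, one point removed
dP    <- pFull - pLOO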
Interestingly, the seemingly mandatory check of the influence of single data points on statistical inference is living in oblivion: apart from [1-4], there is, to the best of our knowledge, no reference to dfstat or \Delta p in the current literature on influence measures.
Cut-off values for the different influence measures are, per default (cutoff = "BKW"), those defined in Belsley, Kuh & Welsch (1980) and additional literature:

dfbeta slope: |\Delta\beta_{1(i)}| > 2/\sqrt{n} (page 28)
dffits: |\mathrm{dffits}_i| > 2\sqrt{2/n} (page 28)
covratio: |\mathrm{covr}_i - 1| > 3k/n (page 23)
Cook's D: D_i > Q_F(0.5, k, n - k) (Cook & Weisberg, 1982)
leverage: h_{ii} > 2k/n (page 17)
studentized residual: t_i > Q_t(0.975, n - k - 1) (page 20)
If cutoff = "R", the criteria from influence.measures are employed:

dfbeta slope: |\Delta\beta_{1(i)}| > 1
dffits: |\mathrm{dffits}_i| > 3\sqrt{k/(n - k)}
covratio: |1 - \mathrm{covr}_i| > 3k/(n - k)
Cook's D: D_i > Q_F(0.5, k, n - k)
leverage: h_{ii} > 3k/n
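The classical measures and the R-style flags can be inspected directly with base R's influence.measures(); a short usage sketch, assuming a fitted model fit built from vectors a and b as in the examples below:

fit <- lm(b ~ a)            ## any fitted lm model
im <- influence.measures(fit)
head(im$infmat)             ## dfbetas, dffits, covratio, Cook's D, hat values
head(im$is.inf)             ## TRUE where a measure exceeds its R cutoff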
The influence output also includes the following more "recent" measures:
Hadi's measure (column "hadi"):

H_i^2 = \frac{h_{ii}}{1 - h_{ii}} + \frac{p}{1 - h_{ii}}\frac{d_i^2}{1 - d_i^2}

where h_{ii} are the diagonals of the hat matrix (leverages), p = 2 in univariate linear regression and d_i = e_i/\sqrt{\mathrm{SSE}}, with threshold value \mathrm{Med}(H_i^2) + 2 \cdot \mathrm{MAD}(H_i^2).
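A minimal sketch of this measure for a fitted model fit (not the package code; note that R's mad() includes the 1.4826 consistency constant, which may differ from the raw MAD used in the literature):

hadi <- function(fit) {
  h <- hatvalues(fit)
  d <- residuals(fit) / sqrt(sum(residuals(fit)^2))  ## d_i = e_i / sqrt(SSE)
  p <- length(coef(fit))                             ## p = 2 in univariate regression
  H2 <- h / (1 - h) + p / (1 - h) * d^2 / (1 - d^2)
  list(H2 = H2, cutoff = median(H2) + 2 * mad(H2))
}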
Coefficient of Determination Ratio (column "cdr"):

\mathrm{CDR}_i = \frac{R_{(i)}^2}{R^2}

with R_{(i)}^2 being the coefficient of determination without value i, and threshold

\frac{B_{\alpha, p/2, (n-p-2)/2}}{B_{\alpha, p/2, (n-p-1)/2}}

where B_{\alpha, a, b} is the \alpha-quantile of the Beta distribution with shape parameters a and b.
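The ratio itself (threshold omitted) can be obtained by brute-force LOO refitting; a sketch assuming a fitted model fit:

cdr <- function(fit) {
  data <- model.frame(fit)
  R2 <- summary(fit)$r.squared
  R2loo <- sapply(seq_len(nrow(data)), function(i)
    summary(update(fit, data = data[-i, ]))$r.squared)
  R2loo / R2                 ## CDR_i = R^2_(i) / R^2
}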
Pena's Si (column "Si"):

S_i = \frac{\mathbf{s}'_i \mathbf{s}_i}{p\,\widehat{\mathrm{var}}(\hat{y}_i)}

where \mathbf{s}_i is the vector of differences between the fitted value \hat{y}_i from the original model and the corresponding fitted values after single-point deletion, (\hat{y}_i - \hat{y}_{i(-1)}, \ldots, \hat{y}_i - \hat{y}_{i(-n)}), p is the number of parameters, and \widehat{\mathrm{var}}(\hat{y}_i) = s^2 h_{ii} with s^2 = (\mathbf{e}'\mathbf{e})/(n - p), \mathbf{e} being the residuals. In this package, a cutoff value of 0.9 is used, as the published criterion of |S_i - \mathrm{Med}(\mathbf{S})| \ge 4.5\,\mathrm{MAD}(\mathbf{S}) seemed too conservative. Results from this function were verified by Prof. Daniel Pena through personal communication.
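A brute-force sketch of S_i by LOO refitting, assuming a fitted model fit (for illustration only; Pena's paper derives expressions that avoid the n refits):

penaSi <- function(fit) {
  data <- model.frame(fit)
  n <- nrow(data); p <- length(coef(fit))
  yhat <- fitted(fit)
  ## column j holds yhat_i - yhat_{i(-j)} for all points i
  D <- sapply(seq_len(n), function(j)
    yhat - predict(update(fit, data = data[-j, ]), newdata = data))
  s2 <- sum(residuals(fit)^2) / (n - p)      ## s^2 = e'e / (n - p)
  rowSums(D^2) / (p * s2 * hatvalues(fit))   ## S_i = s_i's_i / (p * var(yhat_i))
}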
A list with the following items:

origModel: the original model with all data points.
finalModels: a list of final models with the influencer(s) removed.
infl: a matrix with the original data, the classical influence.measures, the newer measures described above, and the LOO p-values, \Delta p-values, slopes and standard errors.
raw: same as infl, but unformatted.
sel: a vector with the influencers' indices.
alpha: the selected \alpha-level.
origP: the original model's p-value.
Andrej-Nikolai Spiess
For dfstat / dfstud:

[1] Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Belsley DA, Kuh E, Welsch RE. John Wiley, New York, USA (2004).

[2] Econometrics, 5ed. Baltagi B. Springer-Verlag, Berlin, Germany (2011).

[3] Growth regressions and what the textbooks don't tell you. Temple J. Bull Econom Res, 52, 2000, 181-205.

[4] Robust Regression and Outlier Detection. Rousseeuw PJ & Leroy AM. John Wiley & Sons, New York, NY (1987).

Hadi's measure:

A new measure of overall potential influence in linear regression. Hadi AS. Comp Stat & Data Anal, 14, 1992, 1-27.

Coefficient of determination ratio:

On the detection of influential outliers in linear regression analysis. Zakaria A, Howard NK, Nkansah BK. Am J Theor Appl Stat, 3, 2014, 100-106.

On the Coefficient of Determination Ratio for Detecting Influential Outliers in Linear Regression Analysis. Zakaria A, Gordor BK, Nkansah BK. Am J Theor Appl Stat, 11, 2022, 27-35.

Pena's measure:

A New Statistic for Influence in Linear Regression. Pena D. Technometrics, 47, 2005, 1-12.
## Example #1 with single influencer and significant model (p = 0.0089).
## Removal of #21 results in p = 0.115!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
a <- c(a, 25); b <- c(b, 10)
LM1 <- lm(b ~ a)
lmInfl(LM1)
## Example #2 with single influencer and insignificant model (p = 0.115).
## Removal of #18 results in p = 0.0227!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM2 <- lm(b ~ a)
lmInfl(LM2)
## Example #3 with multiple influencers and significant model (p = 0.0269).
## Removal of #2, #17, #18 or #20 results in crossing p = 0.05!
set.seed(125)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM3 <- lm(b ~ a)
lmInfl(LM3)
## Large Example #4 with top 10 influencers and significant model (p = 6.72E-8).
## Not possible to achieve a crossing of alpha with any point despite strong noise.
set.seed(123)
a <- 1:100
b <- 5 + 0.08 * a + rnorm(100, 0, 5)
LM4 <- lm(b ~ a)
lmInfl(LM4)