lmInfl | R Documentation |
This function calculates leave-one-out (LOO) p-values for all data points and identifies those resulting in "significance reversal", i.e. in the p-value of the model's slope traversing the user-defined α-level.
lmInfl(model, alpha = 0.05, verbose = TRUE, ...)
model |
the linear model of class |
alpha |
the α-level to use as the threshold border. |
verbose |
logical. If |
... |
other arguments to |
The algorithm
1) calculates the p-value of the full model (all points),
2) calculates a LOO-p-value for each point removed,
3) checks for significance reversal in all data points and
4) returns all models as well as classical influence.measures
with LOO-p-values, Δp-values, slopes and standard errors attached.
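Steps 1 to 3 above can be sketched in base R as follows. This is a minimal illustration on the simulated data from Example #1, not the internal implementation of lmInfl; the helper name slope_p and the object names are assumptions made for this sketch.

```r
# Minimal base-R sketch of steps 1-3 (illustrative, not the lmInfl internals).
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
dat <- data.frame(a = a, b = b)
alpha <- 0.05

# Hypothetical helper: two-sided p-value of the slope of a fitted lm.
slope_p <- function(fit) coef(summary(fit))["a", "Pr(>|t|)"]

# 1) p-value of the full model (all points)
p_full <- slope_p(lm(b ~ a, data = dat))

# 2) LOO p-value for each point removed
p_loo <- sapply(seq_len(nrow(dat)), function(i) {
  slope_p(lm(b ~ a, data = dat[-i, ]))
})

# 3) significance reversal: full-model and LOO p-values lie on
#    opposite sides of alpha
reversal <- (p_full <= alpha) != (p_loo <= alpha)
which(reversal)  # indices of candidate influencers
```

Refitting the model n times is the simplest way to obtain the LOO p-values; it is adequate for small data sets, although closed-form deletion formulas would avoid the refits.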
The idea of p-value influencers was first introduced by Belsley, Kuh & Welsch, and described as an influence measure pertaining directly to the change in t-statistics, that will "show whether the conclusions of hypothesis testing would be affected", termed dfstat in [1, 2, 3] or dfstud in [4]:
\rm{dfstat}_{ij} \equiv \frac{\hat{β}_j}{s\sqrt{(X'X)^{-1}_{jj}}}-\frac{\hat{β}_{j(i)}}{s_{(i)}\sqrt{(X'_{(i)}X_{(i)})^{-1}_{jj}}}
where \hat{β}_j is the j-th estimate, s is the residual standard error, X is the design matrix and (i) denotes the i-th observation deleted.
dfstat, which for the regression's slope β_1 is the difference of t-statistics
Δ t = t_{β1} - t_{β1(i)} = \frac{β_1}{\rm{s.e.}(β_1)} - \frac{β_{1(i)}}{\rm{s.e.}(β_{1(i)})}
is inextricably linked to the changes in p-value Δ p, calculated from
Δ p = p_{β1} - p_{β1(i)} = 2\left(1-P_t(t_{β1}, ν)\right) - 2\left(1-P_t(t_{β1(i)}, ν-1)\right)
where P_t is the cumulative distribution function of Student's t distribution with ν degrees of freedom. Significance reversal is attained when α \in [p_{β1}, p_{β1(i)}]. Interestingly, the seemingly mandatory check of how single data points influence statistical inference has been largely neglected: apart from [1-4], there is, to the best of our knowledge, no reference to dfstat or Δ p in the current literature on influence measures.
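As a worked illustration of the Δt and Δp formulas, the following base-R sketch computes both quantities for a single deleted observation. The choice of i = 18 and all object names are assumptions for this example; abs() is added so that the two-sided p-value is also correct for a negative slope, which the formula above does not spell out.

```r
# Sketch: delta-t and delta-p for one deleted observation (illustrative only).
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
dat <- data.frame(a = a, b = b)
i <- 18        # observation to delete (arbitrary choice for illustration)
alpha <- 0.05

fit_full <- lm(b ~ a, data = dat)
fit_loo  <- lm(b ~ a, data = dat[-i, ])

# t-statistics of the slope: estimate divided by its standard error
t_full <- coef(summary(fit_full))["a", "t value"]
t_loo  <- coef(summary(fit_loo))["a", "t value"]

# Degrees of freedom: nu for the full model, nu - 1 after deletion
nu <- df.residual(fit_full)

# Two-sided p-values from the Student's t distribution function
p_full <- 2 * (1 - pt(abs(t_full), nu))
p_loo  <- 2 * (1 - pt(abs(t_loo), nu - 1))

delta_t <- t_full - t_loo
delta_p <- p_full - p_loo

# Significance reversal: alpha lies between the two p-values
reversal <- alpha >= min(p_full, p_loo) && alpha <= max(p_full, p_loo)
```

The p-values computed this way should agree with the "Pr(>|t|)" column of summary() for the respective fits.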
The influence output also includes the more recent Hadi's measure (column "hadi"):
H_i^2 = \frac{p_{ii}}{1 - p_{ii}} + \frac{k}{1 - p_{ii}}\frac{d_i^2}{(1-d_i^2)}
where p_{ii} are the diagonals of the hat matrix (leverages), k = 2 in univariate linear regression and d_i = e_i/\sqrt{\rm{SSE}}.
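Hadi's measure can be computed directly from a fitted model with base-R functions. The sketch below follows the formula above term by term on simulated data; the object names are assumptions, and the result is not guaranteed to match the "hadi" column of the lmInfl output to the last digit.

```r
# Sketch: Hadi's measure from a fitted lm, following the formula above.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
fit <- lm(b ~ a)

p_ii <- hatvalues(fit)       # diagonals of the hat matrix (leverages)
e    <- residuals(fit)       # ordinary residuals e_i
d    <- e / sqrt(sum(e^2))   # d_i = e_i / sqrt(SSE)
k    <- 2                    # intercept and slope in univariate regression

# First term: potential (leverage); second term: residual contribution
hadi <- p_ii / (1 - p_ii) + k / (1 - p_ii) * d^2 / (1 - d^2)

# Observations ranked by decreasing overall potential influence
order(hadi, decreasing = TRUE)
```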
A list with the following items:
origModel |
the original model with all data points. |
finalModels |
a list of final models with the influencer(s) removed. |
infl |
a matrix with the original data, classical |
raw |
same as |
sel |
a vector with the influencers' indices. |
alpha |
the selected α-level. |
origP |
the original model's p-value. |
stab |
the stability measure, see |
Andrej-Nikolai Spiess
For dfstat / dfstud :
1. Regression diagnostics: Identifying influential data and sources of collinearity.
Belsley DA, Kuh E, Welsch RE.
John Wiley, New York, USA (2004).
2. Econometrics, 5ed.
Baltagi B.
Springer-Verlag Berlin, Germany (2011).
3. Growth regressions and what the textbooks don't tell you.
Temple J.
Bull Econom Res, 52, 2000, 181-205.
4. Robust Regression and Outlier Detection.
Rousseeuw PJ & Leroy AM.
John Wiley & Sons, New York, NY (1987).
Hadi's measure:
A new measure of overall potential influence in linear regression.
Hadi AS.
Comp Stat & Data Anal, 14, 1992, 1-27.
## Example #1 with single influencers and insignificant model (p = 0.115).
## Removal of #18 results in p = 0.0227!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM1 <- lm(b ~ a)
res1 <- lmInfl(LM1)
lmPlot(res1)
pvalPlot(res1)
inflPlot(res1)
slsePlot(res1)
stability(res1)

## Example #2 with multiple influencers and significant model (p = 0.0269).
## Removal of #2, #17, #18 or #20 result in crossing p = 0.05!
set.seed(125)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM2 <- lm(b ~ a)
res2 <- lmInfl(LM2)
lmPlot(res2)
pvalPlot(res2)
inflPlot(res2)
slsePlot(res2)
stability(res2)

## Large Example #3 with top 10 influencers and significant model (p = 6.72E-8).
## Not possible to achieve a crossing of alpha with any point despite strong noise.
set.seed(123)
a <- 1:100
b <- 5 + 0.08 * a + rnorm(100, 0, 5)
LM3 <- lm(b ~ a)
res3 <- lmInfl(LM3)
lmPlot(res3)
stability(res3)

## Example #4 with replicates and single influencer (p = 0.114).
## Removal of #58 results in p = 0.039.
set.seed(123)
a <- rep(1:20, each = 3)
b <- 5 + 0.08 * a + rnorm(20, 0, 2)
LM4 <- lm(b ~ a)
res4 <- lmInfl(LM4)
lmPlot(res4)
pvalPlot(res4)
inflPlot(res4)
slsePlot(res4)
stability(res4)

## As Example #1, but with weights.
## Removal of #18 results in p = 0.04747.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM5 <- lm(b ~ a, weights = 1:20)
res5 <- lmInfl(LM5)
lmPlot(res5)
stability(res5)