This function detects outliers using a user-specified method and fits a linear regression model with the outliers removed. The object returned by this function can be used for valid inference corrected for outlier removal through generic functions such as summary, confint, and predict.
formula, |
an object of class |
data, |
an optional data frame, list or environment containing the variables in the model, the same
syntax as in |
method, |
the outlier detection method, must be one of |
cutoff, |
the cutoff of the outlier detection method. If |
sigma, |
the noise level. Must be one of |
x, |
an object of class |
digits, |
the number of significant digits to use when printing. |
..., |
other arguments. |
This function uses the same syntax as lm for the formula and data arguments. Users can access the original "lm" objects through $fit.full (the fit with all the data) and $fit.rm (the fit with the detected outliers removed).
Common generic functions for lm, including coef, confint, plot, predict and summary, are re-written so that they can be used to extract useful features of the object returned by this function.
Currently, this function supports three outlier detection methods. For "cook", the i-th observation is considered an outlier when its Cook's distance is greater than cutoff/n, where n is the number of observations. For "dffits", the i-th observation is considered an outlier when the square of its DFFITS measure is greater than cutoff*p/(n-p), where p is the number of variables (including the intercept). The rule-of-thumb cutoff for both methods is 4, which is the default value when one sets cutoff = NULL.
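The two thresholds above can be reproduced with base R diagnostics. The following is a minimal sketch, not the package's implementation: it assumes that the Cook's distance and DFFITS measures agree with stats::cooks.distance and stats::dffits applied to an ordinary lm fit, and it uses the stackloss data with the rule-of-thumb cutoff of 4. All variable names are illustrative.

```r
# Sketch only: flag outliers with base R diagnostics (not outference internals).
fit_full <- lm(stack.loss ~ ., data = stackloss)
n <- nrow(stackloss)              # number of observations
p <- length(coef(fit_full))       # number of variables, including the intercept
cutoff <- 4                       # rule-of-thumb cutoff

# "cook" rule: Cook's distance greater than cutoff / n
cook_outliers <- which(cooks.distance(fit_full) > cutoff / n)

# "dffits" rule: squared DFFITS greater than cutoff * p / (n - p)
dffits_outliers <- which(dffits(fit_full)^2 > cutoff * p / (n - p))

unname(cook_outliers)
unname(dffits_outliers)
```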
The outlier detection event of both methods can be characterized as a set of quadratic constraints
in the response y:
\bigcap_{i \in [n]} {y^T Q_i y ≥ 0},
and the constraint returned by this function is the list of Q_i matrices.
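To make the quadratic characterization concrete, the sketch below checks whether a response vector satisfies every constraint y^T Q_i y ≥ 0 in a list of matrices. The helper name satisfies_quadratic and the toy matrices are hypothetical and are not part of the package; in practice the list would be the constraint component of a fit obtained with "cook" or "dffits".

```r
# Sketch only: check whether a response y satisfies y^T Q_i y >= 0 for all i.
# `Q_list` stands for a list of n-by-n matrices, `y` for a numeric vector.
satisfies_quadratic <- function(Q_list, y) {
  vals <- vapply(Q_list, function(Q) drop(crossprod(y, Q %*% y)), numeric(1))
  all(vals >= 0)
}

# Toy illustration with made-up 2-by-2 matrices (not produced by outference).
Q_toy <- list(diag(2), matrix(c(1, 0, 0, -1), 2, 2))
satisfies_quadratic(Q_toy, c(2, 1))   # second form: 2^2 - 1^2 = 3
satisfies_quadratic(Q_toy, c(1, 2))   # second form: 1^2 - 2^2 = -3
```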
For "lasso", we assume the mean-shift model y = X β + u + ε, where u is the vector of "outlying coefficients" and ε ~ N(0, σ^2 I) is the noise. We solve the following program:
(\hat β, \hat u) = argmin ||y-Xβ-u||_2^2 + cutoff*||u||_1.
The i-th observation is considered an outlier when \hat u_i differs from 0. The default cutoff for "lasso" is 0.75*E[||X^T ε||_∞]/n, which is a less conservative choice than the prediction-optimal cutoff 2*E[||X^T ε||_∞]/n. This cutoff is computed by Monte Carlo simulation, and σ is replaced by an estimate when the true noise level is unknown. The outlier detection event of "lasso" can be characterized as a set of affine constraints in the response y:
A y ≥ b,
where "≥" is interpreted element-wise. The constraint returned by this function is then a list of (A, b).
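The mean-shift program above can be illustrated with a small solver. The following is a sketch under stated assumptions, not the package's code: it minimizes the objective exactly as written, by alternating a least-squares step for β with a soft-thresholding step for u (at level cutoff/2, which follows from the first-order condition in u). The function names, the toy data and the cutoff value are all made up for illustration.

```r
# Sketch only: block coordinate descent for
#   min over (beta, u) of ||y - X beta - u||_2^2 + cutoff * ||u||_1.
# This is an assumed solver for illustration, not the package's code.
soft_threshold <- function(r, t) sign(r) * pmax(abs(r) - t, 0)

mean_shift_lasso <- function(X, y, cutoff, max_iter = 500, tol = 1e-10) {
  qrX <- qr(X)                      # factor X once for repeated least squares
  u <- numeric(length(y))
  for (iter in seq_len(max_iter)) {
    beta <- qr.coef(qrX, y - u)     # beta step: least squares on y - u
    r <- drop(y - X %*% beta)       # residuals before the mean shift
    u_new <- soft_threshold(r, cutoff / 2)  # u step: threshold at cutoff / 2
    converged <- max(abs(u_new - u)) < tol
    u <- u_new
    if (converged) break
  }
  list(beta = beta, u = u, outliers = which(u != 0))
}

# Toy data (made up): a clean line with one shifted response.
set.seed(1)
x <- seq(0, 1, length.out = 30)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(30, sd = 0.1)
y[5] <- y[5] + 10
toy <- mean_shift_lasso(X, y, cutoff = 3)
toy$outliers
```

Observations with a nonzero entry in toy$u are the ones this rule would flag; a larger cutoff shrinks more entries of u to exactly zero and so flags fewer observations.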
This function returns an object of class "outference".
The function summary is used to obtain and print a summary (including p-values) of the results. The generic functions coef, confint, plot, and predict are used to extract useful features of the object returned by this function.
An object of class "outference" is a list containing the following components:
fit.full, |
an |
fit.rm, |
an |
method, |
the method used for outlier detection. |
cutoff, |
the cutoff of the method. |
outlier.det, |
indexes of detected outliers. |
magnitude, |
a measure of "outlying-ness". For |
constraint, |
the constraint in the response that characterizes the outlier detection event.
For |
sigma, |
the noise level used in the fit. |
call, |
the function call. |
Shuxiao Chen <sc2667@cornell.edu>
Lee, Jason D., et al. "Exact post-selection inference, with application to the lasso." The Annals of Statistics 44.3 (2016): 907-927.
S. Chen and J. Bien. “Valid Inference Corrected for Outlier Removal”. arXiv preprint arXiv:1711.10635 (2017).
summary.outference for summaries;
coef.outference for extracting coefficients;
confint.outference for confidence intervals of regression coefficients;
plot.outference for plotting the outlying measure;
predict.outference for making predictions.
## Brownlee’s Stack Loss Plant Data
data("stackloss")
head(stackloss) # look at the dataset
## fit the model
## detect outlier using Cook's distance with cutoff = 4
fit <- outference(stack.loss ~ ., data = stackloss, method = "cook", cutoff = 4)
plot(fit) # plot the Cook's distance of each observation
## observation 21 is considered as an outlier with cutoff = 4
summary(fit$fit.full) # look at the fit with all the data
summary(fit$fit.rm) # look at the fit with observation 21 deleted
summary(fit) # extract the corrected p-values after outlier removal