This function detects outliers using a user-specified method, and
fits a linear regression model with outliers removed. The object
returned by this function can be used for valid inference corrected
for outlier removal through generic functions such as summary,
confint, and predict.
formula: an object of class
data: an optional data frame, list or environment containing the variables in the model, the same syntax as in
method: the outlier detection method, must be one of
cutoff: the cutoff of the outlier detection method. If
sigma: the noise level. Must be one of
x: an object of class
digits: the number of significant digits to use when printing.
...: other arguments.
This function uses the same syntax as lm for the formula and data arguments.
Users can access the original "lm" objects through $fit.full and $fit.rm.
Common generic functions for lm, including coef, confint,
plot, predict and summary are re-written so that
they can be used to extract useful features of the object returned by this function.
Currently, this function supports three outlier detection methods. For "cook", the i-th
observation is considered as an outlier when its Cook's distance is greater than cutoff/n,
where n is the number of observations. For "dffits", the i-th observation is
considered as an outlier when the square of its DFFITS measure is greater than cutoff*p/(n-p),
where p is the number of variables (including the intercept). The rule-of-thumb cutoff
for both methods is 4, which is the default value when one sets cutoff = NULL.
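The two detection rules above can be sketched with base R diagnostics. This is an illustration only, assuming the rules are applied to the ordinary full-data lm fit; it is not the package's internal code, and the variable names are made up for the example. The refit at the end is the naive fit with flagged rows dropped, without any correction for the selection step.

```r
# Full-data fit on the stackloss data (ships with base R).
fit_full <- lm(stack.loss ~ ., data = stackloss)
n <- nobs(fit_full)            # number of observations
p <- length(coef(fit_full))    # number of variables, including the intercept
cutoff <- 4                    # rule-of-thumb cutoff

# "cook": flag observation i when its Cook's distance exceeds cutoff / n.
cook_outliers <- which(cooks.distance(fit_full) > cutoff / n)

# "dffits": flag observation i when DFFITS_i^2 exceeds cutoff * p / (n - p).
dffits_outliers <- which(dffits(fit_full)^2 > cutoff * p / (n - p))

# Naive refit with the Cook's-distance outliers removed (no inference correction).
keep <- setdiff(seq_len(n), cook_outliers)
fit_rm <- lm(stack.loss ~ ., data = stackloss[keep, ])

cook_outliers
dffits_outliers
```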
The outlier detection event of both methods can be characterized as a set of quadratic constraints
in the response y:
\bigcap_{i \in [n]} {y^T Q_i y ≥ 0},
and the constraint returned by this function is the list of Q_i matrices.
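To see why the "cook" rule is quadratic in y, the sketch below builds one possible matrix Q_i from the textbook formula for Cook's distance, D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2), and checks that the sign of y^T Q_i y agrees with the rule D_i > cutoff/n. The parameterization and the sign convention are assumptions made for this illustration; the matrices stored by the package may be scaled or signed differently (for example, negated for observations that are not flagged).

```r
fit_full <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(fit_full)
y <- stackloss$stack.loss
n <- nrow(X)
p <- ncol(X)
cutoff <- 4

H <- X %*% solve(crossprod(X), t(X))  # hat matrix
M <- diag(n) - H                      # residual-forming matrix, e = M y
h <- diag(H)                          # leverages

# y^T Q_i y = p * s^2 * (D_i - cutoff / n), so it is positive exactly
# when observation i is flagged by the "cook" rule.
quad_form <- sapply(seq_len(n), function(i) {
  Q_i <- (h[i] / (1 - h[i])^2) * outer(M[, i], M[, i]) -
    (cutoff * p / (n * (n - p))) * M
  drop(t(y) %*% Q_i %*% y)
})

flag_quadratic <- quad_form > 0
flag_direct <- unname(cooks.distance(fit_full) > cutoff / n)
all(flag_quadratic == flag_direct)
```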
For "lasso", we assume the mean-shift model
y = X β + u + ε, where u is the vector of "outlying coefficients" and
ε ~ N(0, σ^2 I) is the noise. We solve the following program:
(\hat β, \hat u) = argmin ||y-Xβ-u||_2^2 + cutoff*||u||_1.
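A minimal sketch of this program, taking the displayed objective literally: for fixed u, β is the least-squares fit to y - u, and for fixed β, each u_i is the soft-thresholded residual with threshold cutoff/2. Alternating the two steps solves the (convex) problem. The helper names soft_threshold and mean_shift_lasso and the value lambda = 6 are hypothetical, and the package may scale the objective or standardize X differently, so its cutoff values need not be comparable to lambda here.

```r
# Soft-thresholding operator: shrinks z toward 0 by t.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Hypothetical helper (not part of the package): block coordinate descent
# for  min ||y - X beta - u||_2^2 + lambda * ||u||_1.
mean_shift_lasso <- function(X, y, lambda, max_iter = 5000, tol = 1e-10) {
  qr_X <- qr(X)
  u <- rep(0, length(y))
  for (iter in seq_len(max_iter)) {
    beta <- qr.coef(qr_X, y - u)                              # least squares given u
    u_new <- soft_threshold(y - drop(X %*% beta), lambda / 2) # update u given beta
    converged <- max(abs(u_new - u)) < tol
    u <- u_new
    if (converged) break
  }
  list(beta = beta, u = u, outliers = which(u != 0))
}

X <- model.matrix(stack.loss ~ ., data = stackloss)
y <- stackloss$stack.loss
fit_ms <- mean_shift_lasso(X, y, lambda = 6)  # lambda = 6 is an arbitrary illustrative value
fit_ms$outliers                               # observations with nonzero u
```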
The i-th observation
is considered as an outlier when \hat u_i differs from 0. The default cutoff for
"lasso" is 0.75*E[||X^T ε||_∞]/n, which is a less conservative choice
than the prediction-optimal cutoff 2*E[||X^T ε||_∞]/n. This cutoff is computed
by Monte Carlo simulation and σ is replaced by an estimate when the true noise level
is unknown. The outlier detection event of "lasso" can be characterized as a
set of affine constraints in the response y:
A y ≥ b,
where "≥" is interpreted element-wise. The constraint returned by this function is
then a list of (A, b).
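The Monte Carlo step behind the default "lasso" cutoff described above can be sketched as follows. The number of draws, the seed, and the use of the full-fit residual standard error as the estimate of σ are assumptions made for this illustration; the package may estimate σ differently or standardize X first.

```r
set.seed(1)  # arbitrary seed, for reproducibility of the sketch

fit_full <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(fit_full)
n <- nrow(X)

# Assumption: plug in the residual standard error of the full fit for sigma.
sigma_hat <- summary(fit_full)$sigma

# Draw noise vectors and record the sup-norm of X^T eps for each draw.
n_sim <- 2000
sup_norms <- replicate(n_sim, {
  eps <- rnorm(n, sd = sigma_hat)
  max(abs(crossprod(X, eps)))
})
expected_sup <- mean(sup_norms)  # Monte Carlo estimate of E[||X^T eps||_inf]

cutoff_default  <- 0.75 * expected_sup / n  # less conservative default
cutoff_pred_opt <- 2 * expected_sup / n     # prediction-optimal choice
c(default = cutoff_default, prediction_optimal = cutoff_pred_opt)
```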
This function returns an object of class "outference".
The function summary is used to obtain and print a summary (including p-values)
of the results. The generic functions coef, confint, plot,
predict are used to extract useful features of the object returned by this function.
An object of class "outference" is a list containing the following components:
fit.full: an
fit.rm: an
method: the method used for outlier detection.
cutoff: the cutoff of the method.
outlier.det: indexes of detected outliers.
magnitude: a measure of "outlying-ness". For
constraint: the constraint in the response that characterizes the outlier detection event. For
sigma: the noise level used in the fit.
call: the function call.
Shuxiao Chen <sc2667@cornell.edu>
Lee, Jason D., et al. "Exact post-selection inference, with application to the lasso." The Annals of Statistics 44.3 (2016): 907-927.
S. Chen and J. Bien. "Valid Inference Corrected for Outlier Removal." arXiv preprint arXiv:1711.10635 (2017).
summary.outference for summaries;
coef.outference for extracting coefficients;
confint.outference for confidence intervals of regression coefficients;
plot.outference for plotting the outlying measure;
predict.outference for making predictions.
## Brownlee’s Stack Loss Plant Data
library(outference) # load the package that provides outference()
data("stackloss")
head(stackloss) # look at the dataset
## fit the model
## detect outlier using Cook's distance with cutoff = 4
fit <- outference(stack.loss ~ ., data = stackloss, method = "cook", cutoff = 4)
plot(fit) # plot the Cook's distance of each observation
## observation 21 is considered as an outlier with cutoff = 4
summary(fit$fit.full) # look at the fit with all the data
summary(fit$fit.rm) # look at the fit with observation 21 deleted
summary(fit) # extract the corrected p-values after outlier removal