outference: Fit a linear model with outliers detected and removed


Description

This function detects outliers using a user-specified method and fits a linear regression model with the outliers removed. The returned object supports valid inference, corrected for the outlier removal, through generic functions such as summary, confint, and predict.

Usage

outference(formula, data, method = c("cook", "dffits", "lasso"),
  cutoff = NULL, sigma = NULL)

## S3 method for class 'outference'
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments

formula,

an object of class "formula", the same syntax as in lm.

data,

an optional data frame, list or environment containing the variables in the model, the same syntax as in lm.

method,

the outlier detection method, must be one of "cook", "dffits", "lasso". See also 'details'.

cutoff,

the cutoff of the outlier detection method. If cutoff = NULL, then this function uses the default values. For "cook" or "dffits", the default cutoff is 4; for "lasso", the default cutoff is 0.75*E[||X^T ε||_∞]/n. See also 'details'.

sigma,

the noise level. Must be one of NULL, "estimate", or a positive scalar value. If sigma = NULL, then the inference assumes the noise level is unknown; if sigma = "estimate", then the inference is based on an estimated noise level.

x,

an object of class "outference".

digits,

the number of significant digits to use when printing.

...,

other arguments.

Details

This function uses the same syntax as lm for the formula and data arguments. Users can access the original "lm" objects through $fit.full and $fit.rm. Common generic functions for lm, including coef, confint, plot, predict and summary are re-written so that they can be used to extract useful features of the object returned by this function.

Currently, this function supports three outlier detection methods. For "cook", the i-th observation is considered an outlier when its Cook's distance is greater than cutoff/n, where n is the number of observations. For "dffits", the i-th observation is considered an outlier when the square of its DFFITS measure is greater than cutoff*p/(n-p), where p is the number of variables (including the intercept). The rule-of-thumb cutoff for both methods is 4, which is the default value when cutoff = NULL. The outlier detection event of both methods can be characterized as a set of quadratic constraints in the response y:

\bigcap_{i \in [n]} {y^T Q_i y ≥ 0},

and the constraint returned by this function is the list of Q_i matrices. For "lasso", we assume the mean-shift model y = X β + u + ε, where u is the "outlying coefficients" and ε ~ N(0, σ^2 I) is the noise. We solve the following program:

(\hat β, \hat u) = argmin ||y-Xβ-u||_2^2 + cutoff*||u||_1.

The i-th observation is considered as an outlier when \hat u_i differs from 0. The default cutoff for "lasso" is 0.75*E[||X^T ε||_∞]/n, which is a less conservative choice than the prediction-optimal cutoff 2*E[||X^T ε||_∞]/n. This cutoff is computed by Monte Carlo simulation and σ is replaced by an estimate when the true noise level is unknown. The outlier detection event of "lasso" can be characterized as a set of affine constraints in the response y:

A y ≥ b,

where the "≥" is interpreted as element-wise. The constraint returned by this function is then a list of (A, b).
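The mean-shift program above can be illustrated in base R. The following is a minimal sketch, not the package's implementation: it solves the stated objective by alternating between a least-squares update of β and a soft-thresholding update of u. The helper names soft_threshold and mean_shift_lasso, the simulated data, and the value lambda = 2 are all assumptions made for illustration; the package may scale the penalty differently, so this cutoff is not comparable to the package default.

```r
# Soft-thresholding: for fixed r, the minimizer over u of
# (r - u)^2 + lambda * |u| is soft_threshold(r, lambda / 2).
soft_threshold <- function(r, t) sign(r) * pmax(abs(r) - t, 0)

# Hypothetical helper (not part of the outference package).
# Minimizes ||y - X beta - u||_2^2 + lambda * ||u||_1 by alternating updates.
mean_shift_lasso <- function(X, y, lambda, n_iter = 500) {
  u <- numeric(length(y))
  for (k in seq_len(n_iter)) {
    beta <- lm.fit(X, y - u)$coefficients                 # least squares given u
    u <- soft_threshold(drop(y - X %*% beta), lambda / 2) # shrink residuals given beta
  }
  list(beta = beta, u = u, outliers = which(u != 0))
}

# Simulated data (an assumption for illustration): one grossly shifted response.
set.seed(1)
n <- 30
x <- seq(-1, 1, length.out = n)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(n, sd = 0.1)
y[15] <- y[15] + 10

fit <- mean_shift_lasso(X, y, lambda = 2)
fit$outliers   # indices i with a nonzero estimated u_i
```

Observations whose estimated u_i is exactly zero are treated as non-outliers, which mirrors the detection rule described for "lasso".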

Value

This function returns an object of class "outference".

The function summary is used to obtain and print a summary (including p-values) of the results. The generic functions coef, confint, plot, predict are used to extract useful features of the object returned by this function.

An object of class "outference" is a list containing the following components:

fit.full,

an "lm" object representing the fit using the full data (no outliers are removed).

fit.rm,

an "lm" object representing the fit using the data after outlier removal.

method,

the method used for outlier detection.

cutoff,

the cutoff of the method.

outlier.det,

indexes of detected outliers.

magnitude,

a measure of "outlying-ness". For "cook" and "dffits", this is the vector of the Cook's distance or DFFITS for all observations; for "lasso", this is the vector of "outlying coefficients" estimated by lasso. See also 'details'.

constraint,

the constraint in the response that characterizes the outlier detection event. For "cook" and "dffits", this is a list of n by n matrices; for "lasso", this is a list of (A, b), where A is a matrix and b is a vector. See also 'details'.

sigma,

the noise level used in the fit.

call,

the function call.

Author(s)

Shuxiao Chen <sc2667@cornell.edu>

References

Lee, Jason D., et al. "Exact post-selection inference, with application to the lasso." The Annals of Statistics 44.3 (2016): 907-927.

Chen, Shuxiao, and Jacob Bien. "Valid inference corrected for outlier removal." arXiv preprint arXiv:1711.10635 (2017).

See Also

summary.outference for summaries;

coef.outference for extracting coefficients;

confint.outference for confidence intervals of regression coefficients;

plot.outference for plotting the outlying measure;

predict.outference for making predictions.

Examples

## Brownlee’s Stack Loss Plant Data
data("stackloss")
head(stackloss)       # look at the dataset
## fit the model
## detect outlier using Cook's distance with cutoff = 4
fit <- outference(stack.loss ~ ., data = stackloss, method = "cook", cutoff = 4)
plot(fit)             # plot the Cook's distance of each observation
## observation 21 is considered as an outlier with cutoff = 4
summary(fit$fit.full) # look at the fit with all the data
summary(fit$fit.rm)   # look at the fit with observation 21 deleted
summary(fit)          # extract the corrected p-values after outlier removal
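
The detection rules described in Details for "cook" and "dffits" can be cross-checked with base R alone, without the outference package. This is a sketch of the thresholds as stated on this page (Cook's distance greater than cutoff/n, squared DFFITS greater than cutoff*p/(n-p)); the variable names are illustrative, and the package's internal computation may differ.

```r
## Cross-check of the detection thresholds using only the stats package
data("stackloss")
fit_full <- lm(stack.loss ~ ., data = stackloss)

n <- nrow(stackloss)          # number of observations
p <- length(coef(fit_full))   # number of variables, including the intercept
cutoff <- 4                   # rule-of-thumb cutoff

## "cook": flag observations whose Cook's distance exceeds cutoff / n
cook_flag <- which(cooks.distance(fit_full) > cutoff / n)

## "dffits": flag observations whose squared DFFITS exceeds cutoff * p / (n - p)
dffits_flag <- which(dffits(fit_full)^2 > cutoff * p / (n - p))

cook_flag     # should include observation 21, as noted in the example above
dffits_flag
```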

shuxiaoc/outference documentation built on July 8, 2019, 8:30 p.m.