This vignette illustrates the usage of the outference
package, which is a tool for doing valid statistical inference corrected for outlier removal in linear regression. We refer the reader to https://arxiv.org/abs/1711.10635 for technical details.
The main function in this package is outference
, which is written in a similar fashion to the lm
function and returns an object of S3 class outference
. This function detects outliers using a user-specified method, and fits a linear regression model with outliers removed. Common generic functions for the lm
object are overwritten to extract useful inferential results from the outference
object.
We now illustrate the package using a synthetic example.
We generate the data from a "mean-shift model". That is, we first generate responses from a classical linear model, and then shift several responses, which are considered as outliers.
set.seed(2667) n <- 100; p <- 6 # generate the design matrix X <- matrix(rnorm(n*(p-1)), n, (p-1)) X <- scale(X, FALSE, TRUE)*sqrt(n/(n-1)) # col of X and has norm sqrt(n) my.data <- as.data.frame(X) X <- cbind(rep(1, n), X) # generate the response with noise i.i.d. from N(0, 1) beta <- rnorm(p) y <- X %*% beta + rnorm(n) # assume the first five observations are outliers, shifed upwards by 5 units y[1:5] <- y[1:5] + 5 # format the dataset my.data <- cbind(y = y, my.data) head(my.data)
We next detect the outliers in the synthetic data. For illustration purposes, we use the classical Cook's distance, and claim an observation is an outlier if its Cook's distance is greater than $4/n$. Our method accommodates both cases when the noise level $\sigma$ is known and unknown. In the $\sigma$ known case, our method is the generalization of the classical $z$-test (testing the regression coefficients) and $\chi^2$-test (tesing for group structures or "nested models"), as well as the classical $z$-intervals (for regression coefficients). When $\sigma$ is unknown, we can either plug in an estimate of $\sigma$, or we resort to the generalization of the classical $t$-test (testing the regression coefficients) and $F$-test (tesing for group structures or "nested models"). For now our method does not provide the corresponding generalization of $t$-intervals. In this example, we will focus on the case where $\sigma$ is estimated.
library(outference) fit <- outference(y ~ ., data = my.data, method = "cook", cutoff = 4, sigma = "estimate") fit
Notice that the syntax and the output are very similar to those of the lm
function.
To visualize how the outliers are detected, we can plot the Cook's distance for each observation.
plot(fit)
We see that the first five observations are successfully detected, while the 42-nd observation is a false positive. As usual, we can use the coef
function to extract the regression coefficients in the model with detected outliers removed.
coef(fit)
To test the significance of regression coefficients and to test for the "global null" (i.e. test if the model only consists of the intercept), we use the summary
function. Again notice the similarity of the output with that of summary.lm
.
summary(fit)
In the output above, we note that the column of $p$-values are adjusted to account for the outlier removal, as does the $p$-value in the last row of the output.
One calls the confint
function to extract the confidence intervals for each coefficient.
confint(fit, level = 0.95)
One uses the predict
function to extract confidence intervals for the regression surfaces, as well as the prediction intervals.
# new data points for prediction new.data <- t(rnorm(p-1)) colnames(new.data) <- colnames(my.data)[-1] new.data <- as.data.frame(new.data) predict(fit, newdata = new.data, interval = "confidence", level = 0.95) predict(fit, newdata = new.data, interval = "prediction", level = 0.95)
Apart from the outference
function, there are two other functions that are also exported: coeftest
and grptest
. Those two functions are internally called by the summary
function, and they correspond to testing for regression coefficients and testing the group structures. The aim is to allow more flexibility for experienced users (since users can access more information from the output of those two functions).
coeftest(fit, index = 2) # test the significance of "V1" grptest(fit, group = 2:3) # test if "V1" and "V2" are simultaneously zero
We now illustrate the package by analyzing a real data example. This example reproduces some of the results given in https://arxiv.org/pdf/1711.10635.pdf. We consider the Brownlee's Stack Loss Plant Data, which involves measures on an industrial plant's operation and has $21$ observations and three covariates. In fact, this dataset is readily available in R:
data("stackloss") head(stackloss)
Interested readers can run ?stackloss
for detailed information on this dataset. Air.Flow
is the rate of operation of the plant, Water.Temp
is the temperature of cooling water circulated through coils in the absorption tower, and Acid.Conc
is the concentration of the acid circulating, minus 50, times 10. The response, stack.loss
is an inverse measure of the overall efficiency of the plant. This data set is considered by many papers in the outlier detection literature. The general consensus is that observations $1, 3, 4$ and $21$ are outliers.
We now run the outfernce
command, using Cook's distance to help us to identify outliers. We use the cutoff of $4$. That is, any observations whose Cook's distance is greater than $4/n$, with $n$ being the number of observations, are considered as outliers. For this example, we do not specify sigma = "estimate"
, so that we are actually doing selective $t$-tests and selective $F$-tests.
require(outference) stack.fit <- outference(stack.loss ~ ., data = stackloss, method = "cook", cutoff = 4)
We first plot each observation's Cook's distance as well as our cutoff value.
plot(stack.fit) # show which observations are considered as outliers stack.fit$outlier.det
We see that observation 21 is detected with the current cutoff value. We remove this observation and proceed to statistical inference.
summary(stack.fit)
Notice that the output of the summary
closely matches with those produced by native lm
functionalities. As a comparison, we print the summary for the original full model (i.e., without removing any outliers), as well as the detect-and-forget method (i.e., outliers are removed but the inference is not corrected):
# full model summary(stack.fit$fit.full) # detect-and-forget summary(stack.fit$fit.rm)
Comparing the above results, we can see that removing the 21st observation increases the goodness-of-fit of the linear model (adjusted R-squared goes from $0.8983$ to $0.9392$). And the corrected $p$-values may (or may not) be different from the naive, uncorrected ones.
We now construct confidence intervals for each regression coefficient (this may take a while due to computational burdens):
confint(stack.fit, level = 0.95)
We can compare the corrected CIs with the detect-and-forget ones:
confint(stack.fit$fit.rm, level = 0.95)
Notice that the corrected CIs are somewhat wider than detect-and-forget intervals.
We can also form CIs for regression surfaces, as well as prediction intervals, using the predict
function:
# say we want to predict with the first observation newdata = stackloss[1, 1:3] # CI for regression surface (a.k.a., short prediction intervals) predict(stack.fit, newdata = newdata, interval = "confidence", level = 0.95) # prediction intervals (a.k.a., wide prediction intervals) predict(stack.fit, newdata = newdata, interval = "prediction", level = 0.95) # predict with the whole dataset is also possible by running the following command, # but it may take a while # predict(stack.fit, interval = "confidence", level = 0.95)
We now investigate the other two outlier detection methods considered in our paper. We first consider using a Lasso regression to detect outliers:
stack.lasso = outference(stack.loss ~ ., data = stackloss, method = "lasso") plot(stack.lasso) stack.lasso$outlier.det
Now we detect two outliers. We proceed to the inference.
summary(stack.lasso)
We then consider using DFFITS to detect outliers. We expect the results to be similar to the one produced by using Cook's distance, since DFFITS is approximately a constant multiple of Cook's distance.
stack.dffits = outference(stack.loss ~ ., data = stackloss, method = "dffits") plot(stack.dffits) stack.dffits$outlier.det
We proceed to the inference.
summary(stack.dffits)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.