This vignette illustrates the usage of the outference package, a tool for performing valid statistical inference corrected for outlier removal in linear regression. We refer the reader to https://arxiv.org/abs/1711.10635 for technical details.

The Main Function

The main function in this package is outference, which is written in a similar fashion to the lm function and returns an object of S3 class outference. This function detects outliers using a user-specified method, and fits a linear regression model with the outliers removed. Methods for common lm generics (such as summary, coef, confint, predict, and plot) are provided, so that useful inferential results can be extracted from the outference object.
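
For orientation, here is a minimal sketch of the typical call pattern; every function named below is demonstrated in the examples that follow, and the data objects are placeholders.

# sketch of the workflow (placeholder data; see the worked examples below)
# fit <- outference(y ~ ., data = my.data, method = "cook", cutoff = 4)
# plot(fit)       # visualize the outlier detection
# summary(fit)    # corrected tests for regression coefficients
# confint(fit)    # corrected confidence intervals
# predict(fit, newdata = new.data, interval = "confidence")  # corrected prediction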

A Synthetic Example

We now illustrate the package using a synthetic example.

Generating the Data

We generate the data from a "mean-shift model". That is, we first generate responses from a classical linear model, and then shift several of the responses; these shifted observations play the role of outliers.
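
In symbols, a standard way to write this mean-shift model (stated here for concreteness; see the paper linked above for the exact formulation used by the package) is $y = X\beta + u + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$, where the shift vector $u \in \mathbb{R}^n$ satisfies $u_i \neq 0$ exactly when observation $i$ is an outlier.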

set.seed(2667)
n <- 100; p <- 6
# generate the design matrix
X <- matrix(rnorm(n*(p-1)), n, (p-1))
X <- scale(X, FALSE, TRUE)*sqrt(n/(n-1)) # each column of X has norm sqrt(n)
my.data <- as.data.frame(X)
X <- cbind(rep(1, n), X)
# generate the response with noise i.i.d. from N(0, 1)
beta <- rnorm(p)
y <- X %*% beta + rnorm(n)
# assume the first five observations are outliers, shifted upwards by 5 units
y[1:5] <- y[1:5] + 5
# format the dataset
my.data <- cbind(y = y, my.data)
head(my.data)

Fitting the Model

We next detect the outliers in the synthetic data. For illustration purposes, we use the classical Cook's distance, and declare an observation an outlier if its Cook's distance is greater than $4/n$. Our method accommodates both the case where the noise level $\sigma$ is known and the case where it is unknown. When $\sigma$ is known, our method generalizes the classical $z$-test (for testing regression coefficients) and $\chi^2$-test (for testing group structures, or "nested models"), as well as the classical $z$-intervals for regression coefficients. When $\sigma$ is unknown, we can either plug in an estimate of $\sigma$, or resort to the generalization of the classical $t$-test (for testing regression coefficients) and $F$-test (for testing group structures, or "nested models"). For now, our method does not provide the corresponding generalization of $t$-intervals. In this example, we focus on the case where $\sigma$ is estimated.
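
As a hedged aside, the paragraph above also covers the case where $\sigma$ is known; whether outference accepts a numeric value for its sigma argument in that case is an assumption on our part, so the corresponding call is left commented out.

# assumed syntax for the sigma-known case (not run; numeric sigma is an assumption)
# fit.known <- outference(y ~ ., data = my.data, method = "cook", cutoff = 4, sigma = 1)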

library(outference)
fit <- outference(y ~ ., data = my.data, method = "cook", cutoff = 4, sigma = "estimate")
fit

Notice that the syntax and the output are very similar to those of the lm function.

Extracting the Information

To visualize how the outliers are detected, we can plot the Cook's distance for each observation.

plot(fit)
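
The indices of the detected outliers are also stored in the fitted object; the outlier.det component accessed here is the same one used in the real data example later in this vignette.

fit$outlier.det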

We see that the first five observations are successfully detected, while the 42nd observation is a false positive. As usual, we can use the coef function to extract the regression coefficients in the model with the detected outliers removed.

coef(fit)

To test the significance of the regression coefficients and to test the "global null" (i.e., to test whether the model consists of only the intercept), we use the summary function. Again, notice the similarity of the output to that of summary.lm.

summary(fit)

In the output above, note that the column of $p$-values is adjusted to account for the outlier removal, as is the $p$-value in the last row of the output.
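
For comparison, the uncorrected (detect-and-forget) summary, in which the detected outliers are removed but no correction is applied, can be extracted from the fitted object; the fit.rm component used here is the same one accessed in the real data example below.

summary(fit$fit.rm)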

We call the confint function to extract confidence intervals for each coefficient.

confint(fit, level = 0.95)

We use the predict function to obtain confidence intervals for the regression surface, as well as prediction intervals.

# new data points for prediction
new.data <- t(rnorm(p-1))
colnames(new.data) <- colnames(my.data)[-1]
new.data <- as.data.frame(new.data)
predict(fit, newdata = new.data, interval = "confidence", level = 0.95)
predict(fit, newdata = new.data, interval = "prediction", level = 0.95)

Other Exported Functions

Apart from the outference function, two other functions are exported: coeftest and grptest. These two functions are called internally by the summary function, and they correspond to testing regression coefficients and testing group structures, respectively. Exporting them allows more flexibility for experienced users, since more information can be accessed from their output.

coeftest(fit, index = 2) # test the significance of "V1"
grptest(fit, group = 2:3) # test if "V1" and "V2" are simultaneously zero
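
To see what additional information is returned, one can inspect the structure of these objects directly; str works on any R object, so this makes no assumption about the component names.

str(coeftest(fit, index = 2))
str(grptest(fit, group = 2:3))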

A Real Data Example

We now illustrate the package by analyzing a real data example. This example reproduces some of the results given in https://arxiv.org/pdf/1711.10635.pdf. We consider Brownlee's Stack Loss Plant Data, which consists of measurements of an industrial plant's operation and has $21$ observations and three covariates. This dataset is readily available in R:

data("stackloss")
head(stackloss)
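
A quick check of the dimensions stated above:

# 21 observations; 3 covariates plus the response stack.loss
dim(stackloss)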

Interested readers can run ?stackloss for detailed information on this dataset. Air.Flow is the rate of operation of the plant, Water.Temp is the temperature of the cooling water circulated through coils in the absorption tower, and Acid.Conc. is the concentration of the circulating acid (minus 50, times 10). The response, stack.loss, is an inverse measure of the overall efficiency of the plant. This dataset has been considered in many papers in the outlier detection literature, and the general consensus is that observations $1, 3, 4$ and $21$ are outliers.

We now run the outference command, using Cook's distance to help us identify outliers. We use a cutoff of $4$; that is, any observation whose Cook's distance is greater than $4/n$, with $n$ the number of observations, is considered an outlier. For this example, we do not specify sigma = "estimate", so we are actually doing selective $t$-tests and selective $F$-tests.

require(outference)
stack.fit <- outference(stack.loss ~ ., data = stackloss, method = "cook", cutoff = 4)

We first plot each observation's Cook's distance as well as our cutoff value.

plot(stack.fit)
# show which observations are considered as outliers
stack.fit$outlier.det

We see that observation 21 is detected with the current cutoff value. We remove this observation and proceed to statistical inference.

summary(stack.fit)

Notice that the output of the summary closely matches that produced by the native lm functions. As a comparison, we print the summary for the original full model (i.e., without removing any outliers), as well as for the detect-and-forget method (i.e., the outliers are removed but the inference is not corrected):

# full model
summary(stack.fit$fit.full)
# detect-and-forget
summary(stack.fit$fit.rm)

Comparing the above results, we see that removing the 21st observation improves the goodness-of-fit of the linear model (the adjusted R-squared increases from $0.8983$ to $0.9392$), and that the corrected $p$-values may (or may not) differ from the naive, uncorrected ones.
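
These adjusted R-squared values can also be extracted programmatically; this assumes, as the summaries above suggest, that the fit.full and fit.rm components behave like ordinary lm fits.

# adjusted R-squared of the full fit and of the fit with observation 21 removed
c(full = summary(stack.fit$fit.full)$adj.r.squared,
  removed = summary(stack.fit$fit.rm)$adj.r.squared)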

We now construct confidence intervals for each regression coefficient (this may take a while due to the computational burden):

confint(stack.fit, level = 0.95)

We can compare the corrected CIs with the detect-and-forget ones:

confint(stack.fit$fit.rm, level = 0.95)

Notice that the corrected CIs are somewhat wider than the detect-and-forget intervals.
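
To quantify the difference, we can compare the interval widths directly; this sketch assumes that both confint calls return a two-column matrix of lower and upper endpoints, as confint.lm does.

ci.corrected <- confint(stack.fit, level = 0.95)
ci.naive <- confint(stack.fit$fit.rm, level = 0.95)
# width of each interval: upper endpoint minus lower endpoint
cbind(corrected = ci.corrected[, 2] - ci.corrected[, 1],
      naive = ci.naive[, 2] - ci.naive[, 1])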

We can also form CIs for regression surfaces, as well as prediction intervals, using the predict function:

# say we want to predict with the first observation 
newdata <- stackloss[1, 1:3]
# CI for regression surface (a.k.a., short prediction intervals)
predict(stack.fit, newdata = newdata, interval = "confidence", level = 0.95)
# prediction intervals (a.k.a., wide prediction intervals) 
predict(stack.fit, newdata = newdata, interval = "prediction", level = 0.95)

# predicting with the whole dataset is also possible by running the following command,
# but it may take a while

# predict(stack.fit, interval = "confidence", level = 0.95)

We now investigate the other two outlier detection methods considered in our paper. We first consider using a Lasso regression to detect outliers:

stack.lasso <- outference(stack.loss ~ ., data = stackloss, method = "lasso")
plot(stack.lasso)
stack.lasso$outlier.det

Now we detect two outliers. We proceed to the inference.

summary(stack.lasso)

We then consider using DFFITS to detect outliers. We expect the results to be similar to those produced by Cook's distance, since the squared DFFITS is approximately a constant multiple of Cook's distance.
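
To make the "constant multiple" relationship concrete, here is a quick numerical check using base R influence measures (not the package): Cook's distance is roughly DFFITS squared divided by $p$, the number of estimated coefficients, up to the ratio of the leave-one-out and full residual variance estimates.

# compare Cook's distance with DFFITS^2 / p on the full least-squares fit
full.lm <- lm(stack.loss ~ ., data = stackloss)
p.lm <- length(coef(full.lm))
round(cbind(cooks = cooks.distance(full.lm),
            dffits.sq.over.p = dffits(full.lm)^2 / p.lm), 4)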

stack.dffits <- outference(stack.loss ~ ., data = stackloss, method = "dffits")
plot(stack.dffits)
stack.dffits$outlier.det

We proceed to the inference.

summary(stack.dffits)

