Introduction

We consider the Poisson distribution. For more details we refer to Chapter 4 of @Kohl2005.

We first load the package.

library(rmx)

Influence functions

We compute the influence function (IF) for estimating the Poisson mean. We first compute and plot the IF of the maximum likelihood (ML) estimator, which is the arithmetic mean.

library(ggplot2)
IF.ML <- optIF(radius = 0, model = "pois", lambda = 7)
IF.ML
checkIF(IF.ML)
summary(IF.ML)
gg <- plot(IF.ML)
gg + ggtitle("IF of ML estimator for Poisson(lambda = 7)")

From a theoretical point of view, the IF is unbounded. In practical applications, however, no observations smaller than 0 can occur, so the IF is effectively bounded from below, while it remains unbounded from above.
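
For reference, the IF of the ML estimator has a simple closed form: the score of Poisson(lambda) is x/lambda - 1 and the Fisher information is 1/lambda, hence IF(x) = x - lambda. A minimal base-R sketch, independent of optIF, visualizes this:

## IF of the ML estimator for Poisson(lambda = 7): IF(x) = x - lambda;
## linear in x, bounded below by -lambda (since x >= 0), unbounded above
lambda <- 7
x <- 0:30
plot(x, x - lambda, type = "h", xlab = "x", ylab = "IF(x)",
     main = "IF of ML estimator (closed form)")
abline(h = 0, lty = 2)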

Next, we compute and plot the IF of one of our optimally robust asymptotically linear (AL) estimators, where we choose a neighborhood radius of 0.1, corresponding to a gross-error probability of 0.1/sqrt(n). This IF can also be regarded as the IF of a radius-minimax (RMX) estimator with minimax radius 0.1.

IF.AL <- optIF(radius = 0.1, model = "pois", lambda = 7)
IF.AL
checkIF(IF.AL)
summary(IF.AL)
plot(IF.AL)

The IF is obviously bounded.
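
To make the neighborhood radius more tangible: a radius of r = 0.1 translates into a gross-error probability of r/sqrt(n), for example:

## gross-error probability eps = r/sqrt(n) for some sample sizes
r <- 0.1
n <- c(25, 100, 400)
r/sqrt(n)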

Finally, we plot the IF of the minimum bias (MB) estimator, which represents the most conservative estimate in our class of AL estimators.

IF.MB <- optIF(radius = Inf, model = "pois", lambda = 7)
IF.MB
checkIF(IF.MB)
summary(IF.MB)
plot(IF.MB)

In the case of discrete models, the MB estimator is usually already attained at a finite radius, the so-called lower-case radius [@Kohl2005].

RMX estimates

We consider the estimation of a Poisson mean and use the dataset by @Rutherford1910.

data(rutherford)

We compute the observed relative frequencies.

signif(table(rutherford)/length(rutherford), 3)
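
For a first comparison, we can also compute the Poisson probabilities implied by the sample mean (a plug-in sketch; the function dPlot used below provides a proper graphical comparison):

## fitted Poisson probabilities with the sample mean as plug-in estimate
signif(dpois(0:max(rutherford), lambda = mean(rutherford)), 3)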

We compute the ML estimate.

library(MASS)
mean(rutherford)
## SE of ML estimate
sqrt(mean(rutherford)/length(rutherford))
## for comparison
fitdistr(rutherford, densfun = "Poisson")

We compute the variance of the data.

var(rutherford)

The variance is slightly smaller than the mean, which indicates a slight underdispersion of the data. We will ignore this and use the Poisson distribution to model the data.
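
The index of dispersion makes this comparison explicit:

## index of dispersion: values below 1 indicate underdispersion,
## values above 1 overdispersion
var(rutherford)/mean(rutherford)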

We compute the RMX estimator. It is constructed as a k-step estimate with a default of k = 3, where a Cramér-von-Mises minimum distance (MD) estimate serves as initial estimate. We assume between 0 and 5% outliers in the data, which is typical of routine data; see Section 1.2c of @Hampel1986. The finite-sample correction is based on Monte-Carlo simulations that are performed during the computation of the estimator. Since this is computationally expensive, we omit it in our first approach.

rmx.x <- rmx(rutherford, model = "pois", eps.lower = 0, eps.upper = 0.05, 
             fsCor = FALSE)
rmx.x

The estimated Poisson mean is larger than for the ML estimator. We take a closer look at the result.

checkIF(rmx.x)
coef(rmx.x)
vcov(rmx.x)
sqrt(vcov(rmx.x))
bias(rmx.x)
mse(rmx.x)

As also noted in the output, the covariance as well as the standard errors are asymptotic values. Bias and MSE (mean squared error) correspond to the maximum asymptotic bias and MSE, respectively. A more comprehensive output containing the above values can be generated with function summary.

summary(rmx.x)
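
As an aside, the k-step construction mentioned above is conceptually simple. The following minimal sketch is for illustration only and is not the package's implementation; it uses the unbounded ML IF x - lambda, for which a single step from any starting value already yields the arithmetic mean. rmx() instead uses the optimally robust IF and the CvM MD estimate as starting point.

## k-step estimator: repeatedly add the mean of the IF evaluated at
## the current parameter value (kstep is a hypothetical helper)
kstep <- function(x, theta0, IF, k = 3) {
  theta <- theta0
  for (i in seq_len(k)) theta <- theta + mean(IF(x, theta))
  theta
}
kstep(rutherford, theta0 = median(rutherford),
      IF = function(x, lambda) x - lambda)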

In a second approach, we now take the finite-sample correction into consideration. To speed up the computations, one can use parallel computing. Even so, the computations may take up to a few minutes on a reasonably well-equipped computer.

rmx.x2 <- rmx(rutherford, model = "pois", eps.lower = 0, eps.upper = 0.05, 
              fsCor = TRUE, parallel = TRUE)
rmx.x2
summary(rmx.x2)

Since we expect outliers in the data, we should also be aware of the bias that is usually unavoidable in this case; this leads to the consideration of the MSE instead of the (co)variance alone. For simplicity we restrict ourselves to the MSE, but one could also consider a whole class of convex risks [@Ruckdeschel2004].
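
The risk can be made explicit: for a (shrinking) contamination neighborhood of radius r, the maximal asymptotic MSE of an AL estimator with IF psi takes the form

maxMSE(psi, r) = asVar(psi) + r^2 * sup|psi|^2,

where r * sup|psi| is the maximum asymptotic bias; minimizing this risk therefore balances the variance against the squared maximum bias [@Kohl2005].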

Next, we perform some diagnostics. We start with the calculation of the cniper and outlier regions.

cniper(rmx.x)
outlier(rmx.x)

That is, observations of 0 as well as of 7 or larger can be considered cniper points. If such points are present in the data, the RMX estimator can be expected to be (asymptotically) more precise than the respective ML estimator; see @Kohl2005 or @Kohl2010.

Any observation of 12 or larger is very unlikely under the assumed model and should be considered an outlier; the probability of observing such a value is smaller than 0.1%.
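
This tail probability can be verified directly, using the estimated mean as plug-in:

## P(X >= 12) under the fitted Poisson model
ppois(11, lambda = coef(rmx.x), lower.tail = FALSE)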

table(getCnipers(rmx.x)$values)
getOutliers(rmx.x)

The dataset includes 284 cniper points and two outliers. We can also plot the cniper region together with the data points.

plot(cniper(rmx.x))

There are several more diagnostic plots. First, we plot the influence function.

ifPlot(rmx.x)

The plot shows the influence function (IF) for estimating the Poisson mean, together with the (absolute) information (info) associated with each observation. The (absolute) information corresponds to the squared length of the IF evaluated at the respective data point. For our RMX estimators, the length of the IF and hence the information is bounded; consequently, a single observation can only have a bounded influence on the estimates. The orange and red vertical lines represent the boundaries of the cniper and outlier regions, respectively.
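
To illustrate why the information stays bounded, consider a Hampel-type IF, which clips the (standardized) classical IF at a bound b; its square, the absolute information, is then bounded by b^2. The constants A, a and b below are purely illustrative and not those computed by optIF:

## clipping the classical IF x - lambda at an illustrative bound b
lambda <- 7; A <- 1; a <- 0; b <- 3
x <- 0:30
IF.clipped <- pmin(pmax(A*(x - lambda - a), -b), b)
max(IF.clipped^2)  # absolute information is bounded by b^2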

Next, we plot the (absolute) information.

aiPlot(rmx.x)

We can also compare the (absolute) information our RMX estimator attributes to the data with the respective information of the ML estimator.

iiPlot(rmx.x)

The orange line represents the maximum information our RMX estimator attributes to a single observation. The black line corresponds to y = x.

Furthermore, the density of the estimated model can be compared with the empirical density.

dPlot(rmx.x)

Based on the diagnostic plots, the Poisson distribution seems to fit the data quite well; that is, our inference with the RMX approach should be reliable. Hence, we finally compute confidence intervals for the estimates.

confint(rmx.x)
confint(rmx.x, method = "as.bias")

The first confidence interval ignores a potential bias and is based only on the asymptotic (co)variance. The second interval also takes the maximum asymptotic bias into consideration and is therefore more conservative; i.e., it leads to longer confidence intervals.
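
A minimal sketch of the usual bias-aware construction, under the assumption that vcov returns the (co)variance already on the scale of the estimate and bias the maximum asymptotic bias on the sqrt(n) scale; confint(..., method = "as.bias") is the authoritative implementation and its conventions may differ:

## hypothetical bias-aware 95% interval: widen the Gaussian interval
## by the (rescaled) maximum asymptotic bias
n   <- length(rutherford)
est <- coef(rmx.x)
SE  <- sqrt(vcov(rmx.x))
b   <- bias(rmx.x)/sqrt(n)
c(lower = est - qnorm(0.975)*SE - b, upper = est + qnorm(0.975)*SE + b)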

As an alternative approach, we have also implemented a bootstrap option based on the functions of package boot.

confint(rmx.x, method = "boot", R = 599, type = "stud") # low R to reduce computation time
## with package parallel
#confint(rmx.x, method = "boot", parallel = TRUE)

The bootstrap results are similar to the results of the asymptotic intervals without considering bias.

We can also use rmx to compute ML estimates.

rmx.x.ML <- rmx(rutherford, model = "pois", eps = 0)
rmx.x.ML
checkIF(rmx.x.ML)
confint(rmx.x.ML)
confint(rmx.x.ML, method = "boot", R = 599, type = "stud")

In addition, we can compute the most robust estimator (for some given sample size).

rmx.x.MB <- rmx(rutherford, model = "pois", eps = 0.5)
rmx.x.MB
checkIF(rmx.x.MB)
confint(rmx.x.MB)
confint(rmx.x.MB, method = "as.bias")
confint(rmx.x.MB, method = "boot", R = 599, type = "stud")

For high-dimensional datasets such as omics data, there are also functions for row- and column-wise computation of RMX estimators, as developed for @Kohl2010a. We use simulated data (100 samples of size 20).

set.seed(123) # for reproducibility of the simulated data
M <- matrix(rpois(2000, lambda = 7), ncol = 20)
rowRmx(M, model = "pois", eps.lower = 0, eps.upper = 0.05)
## with package parallel
#rowRmx(M, model = "pois", eps.lower = 0, eps.upper = 0.05, 
#       parallel = TRUE)
colRmx(t(M), model = "pois", eps.lower = 0, eps.upper = 0.05)

sessionInfo

sessionInfo()

References


