We consider the binomial distribution, where we assume the size of the distribution to be known. For more details we refer to Chapter 3 of @Kohl2005.
We first load the package.
```r
library(rmx)
```
We compute the influence function (IF) for estimating the probability. We first compute and plot the IF of the maximum likelihood (ML) estimator, which is the arithmetic mean divided by size. For this demonstration we use size = 10.
```r
library(ggplot2)
IF.ML <- optIF(radius = 0, model = "binom", prob = 0.05, size = 10)
IF.ML
checkIF(IF.ML)
summary(IF.ML)
gg <- plot(IF.ML)
gg + ggtitle("IF of ML estimator for Binom(prob = 0.05, size = 10)")
```
From a theoretical point of view the IF is unbounded. However, since the size is known, no observations smaller than 0 or larger than size will occur in practical applications, so the IF is in fact bounded. Hence, the ML estimator (arithmetic mean divided by size) can already be regarded as robust. Nevertheless, there can be deviations from the model for which our RMX estimators are more precise than the ML estimator.
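To see this boundedness concretely, here is a minimal base-R sketch; it assumes the classical IF of the ML estimator for prob, namely (x - size*prob)/size, which is not taken from the rmx package.

```r
## classical IF of the ML estimator (arithmetic mean divided by size);
## assumed formula for illustration, not part of the rmx package
IF_ML <- function(x, prob, size) (x - size*prob)/size
## on the support {0, ..., size} the IF takes only finitely many values
IF_ML(0:10, prob = 0.05, size = 10)
range(IF_ML(0:10, prob = 0.05, size = 10))
```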
Next, we compute and plot an IF of our optimally robust asymptotically linear (AL) estimators, where we choose a neighborhood radius of 0.1 corresponding to a gross-error probability of 0.1/sqrt(n). The IF can also be regarded as the IF of a radius-minimax (RMX) estimator with minimax-radius 0.1.
```r
IF.AL <- optIF(radius = 0.1, model = "binom", prob = 0.05, size = 10)
IF.AL
checkIF(IF.AL)
summary(IF.AL)
plot(IF.AL)
```
The IF is clearly bounded, and observations with values of 2 or larger are downweighted so that they have less influence on the estimate.
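Recall that a neighborhood radius r corresponds to a gross-error probability of r/sqrt(n). This is easy to evaluate by hand; for instance, assuming a sample size of n = 100 for illustration:

```r
## gross-error probability corresponding to radius r for sample size n
r <- 0.1
n <- 100   # assumed sample size for illustration
r/sqrt(n)  # i.e., 1% contamination
```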
Finally, we plot the IF of the minimum bias (MB) estimator, which represents the most conservative estimator in our class of AL estimators.
```r
IF.MB <- optIF(radius = Inf, model = "binom", prob = 0.05, size = 10)
IF.MB
checkIF(IF.MB)
summary(IF.MB)
plot(IF.MB)
```
In case of discrete models, the MB estimator is usually already attained at a finite radius, the so-called lower-case radius [@Kohl2005]. In case of the Bernoulli model (size = 1) the lower-case radius is even 0; that is, the MB estimator as well as the AL estimators for all radii coincide with the ML estimator. Consequently, in the Bernoulli model the ML estimator (relative frequency) is also optimally robust for all radii. The following plot shows the respective IFs.
```r
library(gridExtra)
IF.ML <- optIF(radius = 0, model = "binom", prob = 0.1, size = 1)
IF.MB <- optIF(radius = 1e-10, model = "binom", prob = 0.1, size = 1)
ggML <- plot(IF.ML, plot = FALSE) + ggtitle("IF for ML estimator (prob = 0.1)")
ggMB <- plot(IF.MB, plot = FALSE) + ggtitle("IF for MB estimator (prob = 0.1)")
grid.arrange(ggML, ggMB, nrow = 1)
```
Thus, if we assume gross errors or other deviations from the ideal binomial model, we would either have to allow observations smaller than zero or larger than one, or we need a size of the binomial distribution of at least two to be able to improve on the ML estimator.
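As a quick illustration that size = 2 already suffices, we can reuse optIF from above; the parameter values below are chosen for demonstration only.

```r
## optimally robust IF for size = 2; parameters chosen for demonstration
IF.size2 <- optIF(radius = 0.1, model = "binom", prob = 0.1, size = 2)
plot(IF.size2)
```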
We consider the estimation of a binomial proportion and use some simulated data. Our ideal (undisturbed) model is Binom(size = 10, prob = 0.05), where in 3% of the cases the value 4 is observed instead.
```r
set.seed(123)
## indicator for contaminated observations (3% of cases)
ind <- rbinom(100, size = 1, prob = 0.03)
## ideal observations are replaced by the value 4 where ind == 1
x <- (1 - ind)*rbinom(100, size = 10, prob = 0.05) + ind*4
```
We compare the observed relative frequencies with the theoretical probabilities of Binom(size = 10, prob = 0.05).
```r
table(x)/length(x)
signif(dbinom(0:4, size = 10, prob = 0.05), 3)
```
We do not observe any 3, but observe the value 4 twice, which is clearly more frequent than can be expected under the ideal model.
We compute the ML estimate.
```r
mean(x)/10
```
We now compute the RMX estimator, where we assume between 0 and 5% outliers in the data, which is typical of routine data; see Section 1.2c in @Hampel1986. The RMX estimator is constructed as a k-step estimator with default k = 3, using a Cramér-von Mises minimum distance (MD) estimate as initial estimate. The computation of the finite-sample correction is based on Monte Carlo simulations and is performed during the computation of the estimator. Since the computation time is quite high, we omit this step in our first approach.
```r
rmx.x <- rmx(x, model = "binom", eps.lower = 0, eps.upper = 0.05,
             size = 10, fsCor = FALSE)
rmx.x
```
The estimated probability of success is smaller than that of the ML estimator. We take a closer look at the result.
```r
checkIF(rmx.x)
coef(rmx.x)
vcov(rmx.x)
sqrt(vcov(rmx.x))
bias(rmx.x)
mse(rmx.x)
```
As is also noted in the output, the covariance as well as the standard errors are asymptotic values. Bias and MSE (mean squared error) correspond to the maximum asymptotic bias and the maximum asymptotic MSE, respectively. A more comprehensive output containing the above values can be generated with the summary function.
```r
summary(rmx.x)
```
In a second approach, we now take the finite-sample correction into consideration. To speed up the computations, one can use parallel computing. Even so, the computations may take up to a few minutes on a reasonably well-equipped computer.
```r
rmx.x2 <- rmx(x, model = "binom", eps.lower = 0, eps.upper = 0.05,
              size = 10, fsCor = TRUE, parallel = TRUE)
rmx.x2
summary(rmx.x2)
```
Since we expect outliers in the data, we must also be aware of the bias that is usually unavoidable in this case; this leads us to consider the MSE instead of the (co)variance alone. For simplicity we restrict ourselves to the MSE, but one could also consider a whole class of convex risks [@Ruckdeschel2004].
Next, we perform some diagnostics. We start with the computation of the cniper and outlier regions, respectively.
```r
cniper(rmx.x)
outlier(rmx.x)
```
That is, observations of 2 or larger can be considered cniper points. If such points exist, the RMX estimator can be expected to be (asymptotically) more precise than the respective maximum likelihood (ML) estimator; see @Kohl2005 or @Kohl2010.
Any observation of 5 or larger is very unlikely under the assumed model and should be considered an outlier. The probability of observing a value of 5 or larger is clearly smaller than 0.1%.
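This tail probability can be verified directly in base R:

```r
## P(X >= 5) under the ideal model Binom(size = 10, prob = 0.05)
pbinom(4, size = 10, prob = 0.05, lower.tail = FALSE)
```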
```r
getCnipers(rmx.x)
getOutliers(rmx.x)
```
The dataset includes eleven cniper and no outlier points. We can also plot the cniper region including the data points.
```r
plot(cniper(rmx.x))
```
There are several more diagnostic plots. First, we plot the influence functions.
```r
ifPlot(rmx.x)
```
The plot shows the influence function (IF) for estimating the probability of success. It also shows the (absolute) information (info) associated with each observation; the (absolute) information corresponds to the square of the length of the IF evaluated at the respective data point. In case of our RMX estimators the length of the IF, and hence the information, is bounded. Consequently, a single observation can only have a bounded influence on the estimates. The orange and red vertical lines represent the boundaries of the cniper and outlier regions, respectively.
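To make the notion of (absolute) information concrete, here is a minimal sketch that evaluates it for the classical ML influence function from above; note that ifPlot and aiPlot use the IF of the fitted RMX estimator instead.

```r
## absolute information = squared length of the IF at each observation;
## illustrated with the classical ML influence function (assumed formula)
info_ML <- function(x, prob, size) ((x - size*prob)/size)^2
info_ML(x, prob = mean(x)/10, size = 10)[1:10]  # first 10 observations
```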
Next, we plot the (absolute) information.
```r
aiPlot(rmx.x)
```
We can also compare the (absolute) information our RMX estimator attributes to the data with the respective information of the ML estimator.
```r
iiPlot(rmx.x)
```
The orange line represents the maximum information our RMX estimator attributes to a single observation. The black line corresponds to y = x.
Furthermore, the density of the estimated model can be compared with the empirical density.
```r
dPlot(rmx.x)
```
Based on the diagnostic plot, the binomial distribution seems to fit the data well except for the observed frequency of 4; that is, our inference with the RMX approach should be reliable. Hence, we finally compute confidence intervals for the estimates.
```r
confint(rmx.x)
confint(rmx.x, method = "as.bias")
```
The first confidence interval ignores a potential bias and is only based on the asymptotic (co)variance. The second interval also takes the maximum asymptotic bias into consideration and is therefore more conservative; i.e., it leads to longer confidence intervals.
As an alternative approach, we have also implemented a bootstrap option based on the functions of the boot package.
```r
confint(rmx.x, method = "boot", R = 599) # low R to reduce computation time
## with package parallel
# confint(rmx.x, method = "boot", parallel = TRUE)
```
The bootstrap intervals lie between the asymptotic intervals with and without consideration of bias.
We can also use rmx to compute ML estimates.
```r
rmx.x.ML <- rmx(x, model = "binom", eps = 0, size = 10)
rmx.x.ML
checkIF(rmx.x.ML)
confint(rmx.x.ML)
confint(rmx.x.ML, method = "boot")
```
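As a quick consistency check, the result should agree with the ML estimate computed above (assuming coef returns the estimated probability of success):

```r
## should reproduce the ML estimate (arithmetic mean divided by size)
coef(rmx.x.ML)
mean(x)/10
```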
In addition, we can compute the most robust estimator (for some given sample size).
```r
rmx.x.MB <- rmx(x, model = "binom", eps = 0.5, size = 10)
rmx.x.MB
checkIF(rmx.x.MB)
confint(rmx.x.MB)
confint(rmx.x.MB, method = "as.bias")
confint(rmx.x.MB, method = "boot", R = 599)
```
For high-dimensional datasets such as omics data, there are also functions for row- and column-wise computation of RMX estimators, as developed for @Kohl2010a. We use some simulated data (100 samples of sample size 20).
```r
M <- matrix(rbinom(2000, size = 10, prob = 0.3), ncol = 20)
rowRmx(M, model = "binom", eps.lower = 0, eps.upper = 0.05, size = 10)
## with package parallel
# rowRmx(M, model = "binom", eps.lower = 0, eps.upper = 0.05, size = 10,
#        parallel = TRUE)
colRmx(t(M), model = "binom", eps.lower = 0, eps.upper = 0.05, size = 10)
```
```r
sessionInfo()
```