Replicates the experiment presented in Cerioli et al. (2009), Tables 1 and 3, for a wider variety of estimators.
table13sim.parallel(cl, p, nn, N, B = 250, alpha = c(0.01, 0.025, 0.05), lgf = "", mlgf = "", maxtries = 100)
cl |
A cluster object, e.g., returned from |
p |
The dimension of the data used in each simulated run. |
nn |
The number of observations used in each simulated run. |
N |
The number of simulations to run. |
B |
The batch/block size: the number of simulations to run
in each block. This is useful when running very large
simulation runs ( |
alpha |
The significance level to use for detecting outliers. Can be a vector; the outlier detection tests will be run at each level. |
lgf |
Path to log file into which logging information should be written. |
mlgf |
Not used at this time. |
maxtries |
The maximum number of times to retry failed blocks. The Rocke S-estimator can fail when n/p is small if it gets a bad random sample, so we restart such blocks. Default is 100. |
This is a work function designed for use in replicating Tables 1 and 3 of Cerioli et al. (2009), pages 344–346, but using the asymptotic method of Green and Martin (2014) instead of the Hardin-Rocke method. The experiment investigates how many false positives certain Mahalanobis-based tests of outlyingness produce, compared to the nominal Type I error rate α.
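As a rough illustration of the quantity being measured, the sketch below counts false positives for an individual outlier test on outlier-free normal data. It is not the package's code: the classical sample mean and covariance stand in for the robust estimators purely to keep the example in base R, and the values of p, nn, and alpha are arbitrary.

```r
# Sketch only: classical estimates replace the robust (MCD, OGK, S) estimates.
set.seed(1)
p <- 4        # dimension (illustrative value)
nn <- 300     # observations per run (illustrative value)
alpha <- 0.05 # nominal Type I error rate

# Outlier-free multivariate normal data, so every flagged point is a false positive
x <- matrix(rnorm(nn * p), nrow = nn, ncol = p)

# Squared Mahalanobis distances from the estimated center and scatter
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Individual test: flag observations above the chi-squared (1 - alpha) quantile
cutoff <- qchisq(1 - alpha, df = p)
false.positive.rate <- mean(d2 > cutoff)
false.positive.rate
```

A well-calibrated test should give a rate close to alpha; the simulation studies how far the robust-distance tests drift from it.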
For the simultaneous outlier tests, some of the reweighted MCD estimates use
a Bonferroni-corrected quantile to compute the inclusion/exclusion threshold.
The significance level used is α/nn, and the quantile used is the
( 1 - α/nn ) quantile of the reference distribution (e.g., chi-squared).
This calculation currently requires the use of the function covMcd2.
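The Bonferroni-corrected threshold can be sketched in base R as follows, assuming a chi-squared reference distribution with p degrees of freedom. The values of p, nn, and alpha are illustrative, and this is not the code path used inside covMcd2.

```r
p <- 4        # dimension (illustrative value)
nn <- 300     # number of observations (illustrative value)
alpha <- 0.05 # nominal simultaneous level

# Bonferroni: test each observation at level alpha / nn
level.bonf <- alpha / nn
cutoff.bonf <- qchisq(1 - level.bonf, df = p)

# Uncorrected individual-test cutoff, for comparison
cutoff.indiv <- qchisq(1 - alpha, df = p)

c(individual = cutoff.indiv, bonferroni = cutoff.bonf)
```

The corrected cutoff is larger, so fewer observations are excluded at the reweighting step.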
Green and Martin also considered some simultaneous outlier tests using quantiles
computed based on the distribution of a maximum of iid random variables. The significance
level is 1 - ((1-α)^(1/nn)) and the quantile used is the
( 1 - α )^(1/nn) quantile of the reference distribution. These tests
are indicated by the suffix “ALT” in the return value of table13sim.parallel.
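A minimal base R sketch of this alternative threshold, again assuming a chi-squared reference distribution with p degrees of freedom and illustrative values for p, nn, and alpha:

```r
p <- 4        # dimension (illustrative value)
nn <- 300     # number of observations (illustrative value)
alpha <- 0.05 # nominal simultaneous level

# Per-observation level implied by the maximum of nn iid statistics
level.alt <- 1 - (1 - alpha)^(1 / nn)
cutoff.alt <- qchisq((1 - alpha)^(1 / nn), df = p)

# Check: under independence, P(max of nn statistics <= cutoff) equals 1 - alpha
prob.no.flag <- pchisq(cutoff.alt, df = p)^nn

# The per-observation level is never smaller than the Bonferroni level alpha / nn
c(alt = level.alt, bonferroni = alpha / nn)
```

Since 1 - (1-α)^(1/nn) is at least α/nn, the ALT cutoff is never larger than the Bonferroni cutoff, although the two are very close for small α.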
Internally the simulation function does B runs at a time. Blocks of size B are distributed across the cluster. Set B smaller if your machines have less memory or you have lots of cluster nodes.
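The blocking idea can be sketched as below. This is an assumption about the general pattern, not the function's actual internals: in particular, how table13sim.parallel treats a final partial block when N is not a multiple of B is not documented here, and run.block is a hypothetical stand-in for one block of simulations.

```r
N <- 500 # total number of simulations (illustrative value)
B <- 50  # block size (illustrative value)

# Split N runs into blocks of size B, plus one smaller block for any remainder
# (assumed behavior, shown for illustration only)
block.sizes <- c(rep(B, N %/% B), if (N %% B > 0) N %% B)

# Hypothetical worker: returns one row of results per simulated run in the block
run.block <- function(b) matrix(0, nrow = b, ncol = 2)

# lapply stands in for parallel::parLapply(cl, block.sizes, run.block),
# which would hand the blocks to the cluster nodes
block.results <- lapply(block.sizes, run.block)
all.results <- do.call(rbind, block.results)
dim(all.results)
```

Smaller blocks mean less memory per task and more tasks to balance across nodes, at the cost of more communication overhead.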
A three-dimensional array:
The results of each of the N simulation runs appear along the first dimension.
The various estimators and tests appear along the second dimension. Results with suffix “T1” correspond to Table 1 of Cerioli et al. (2009) (the individual outlier tests) while those with suffix “T3” correspond to Table 3 (the simultaneous outlier tests). Currently the 74 columns appear in the following order.
Column Name | Covariance Estimate | Test Statistic |
"OGK.T1" | OGK estimate (β = 0.9) | chi-squared |
"ROGK.T1" | Reweighted OGK estimate | chi-squared |
"SEST.BS.T1" | S-estimate using bisquare ρ-function | chi-squared |
"SEST.RK.T1" | Rocke S-estimate (arp = 0.05 ) | chi-squared |
"MCDMBP.RAW.T1" | MCD (max. breakdown pt.) | chi-squared |
"MCDMBP.GMRAW.T1" | MCD (max. breakdown pt.) | Green-Martin |
"MCDMBP.GMADJ.T1" | MCD (max. breakdown pt.) | Green-Martin (adj.) |
"RMCDMBP.T1" | Reweighted MCD (max. breakdown pt.) | chi-squared |
"MCD75.RAW.T1" | MCD(0.75) | chi-squared |
"MCD75.GMRAW.T1" | MCD(0.75) | Green-Martin |
"MCD75.GMADJ.T1" | MCD(0.75) | Green-Martin (adj.) |
"RMCD75.T1" | Reweighted MCD(0.75) | chi-squared |
"MCD95.RAW.T1" | MCD(0.95) | chi-squared |
"MCD95.GMRAW.T1" | MCD(0.95) | Green-Martin |
"MCD95.GMADJ.T1" | MCD(0.95) | Green-Martin (adj.) |
"RMCD95.T1" | Reweighted MCD(0.95) | chi-squared |
"OGK.T3" | OGK estimate | chi-squared |
"ROGK.T3" | Reweighted OGK estimate | chi-squared |
"ROGK.CH.T3" | Reweighted OGK estimate using Bonferroni corrected β | chi-squared |
"SEST.BS.T3" | S-estimate using bisquare ρ-function | chi-squared |
"SEST.RK.T3" | Rocke S-estimate | chi-squared |
"MCDMBP.RAW.T3" | MCD (max. breakdown pt.) | chi-squared |
"MCDMBP.GMRAW.T3" | MCD (max. breakdown pt.) | Green-Martin |
"MCDMBP.GMADJ.T3" | MCD (max. breakdown pt.) | Green-Martin (adj.) |
"MCDMBP.HRADJ.T3" | MCD (max. breakdown pt.) | Hardin-Rocke (adj.) |
"RMCDMBP.T3" | Reweighted MCD (max. breakdown pt.) | chi-squared |
"RMCDMBP.CH.T3" | Reweighted MCD (max. breakdown pt.) with Bonferroni correction | chi-squared |
"MCD75.RAW.T3" | MCD(0.75) | chi-squared |
"MCD75.GMRAW.T3" | MCD(0.75) | Green-Martin |
"MCD75.GMADJ.T3" | MCD(0.75) | Green-Martin (adj.) |
"RMCD75.T3" | Reweighted MCD(0.75) | chi-squared |
"RMCD75.CH.T3" | Reweighted MCD(0.75) with Bonferroni correction | chi-squared |
"MCD95.RAW.T3" | MCD(0.95) | chi-squared |
"MCD95.GMRAW.T3" | MCD(0.95) | Green-Martin |
"MCD95.GMADJ.T3" | MCD(0.95) | Green-Martin (adj.) |
"RMCD95.T3" | Reweighted MCD(0.95) | chi-squared |
"RMCD95.CH.T3" | Reweighted MCD(0.95) with Bonferroni correction | chi-squared |
"MCDMBP.HRRAW.T1" | MCD (max. breakdown pt.) | Hardin-Rocke |
"MCDMBP.HRADJ.T1" | MCD (max. breakdown pt.) | Hardin-Rocke (adj.) |
"MCDMBP.HRRAW.T3" | MCD (max. breakdown pt.) | Hardin-Rocke |
"MCD75.HRRAW.T1" | MCD(0.75) | Hardin-Rocke |
"MCD75.HRADJ.T1" | MCD(0.75) | Hardin-Rocke (adj.) |
"MCD75.HRRAW.T3" | MCD(0.75) | Hardin-Rocke |
"MCD75.HRADJ.T3" | MCD(0.75) | Hardin-Rocke (adj.) |
"MCD95.HRRAW.T1" | MCD(0.95) | Hardin-Rocke |
"MCD95.HRADJ.T1" | MCD(0.95) | Hardin-Rocke (adj.) |
"MCD95.HRRAW.T3" | MCD(0.95) | Hardin-Rocke |
"MCD95.HRADJ.T3" | MCD(0.95) | Hardin-Rocke (adj.) |
"OGK.T3.ALT" | OGK estimate | chi-squared |
"ROGK.T3.ALT" | Reweighted OGK estimate | chi-squared |
"ROGK.CH.T3.ALT" | Reweighted OGK estimate with Bonferroni correction | chi-squared |
"SEST.BS.T3.ALT" | S-estimate with bisquare ρ-function | chi-squared |
"SEST.RK.T3.ALT" | Rocke S-estimate | chi-squared |
"MCDMBP.RAW.T3.ALT" | MCD (max. breakdown pt.) | chi-squared |
"MCDMBP.GMRAW.T3.ALT" | MCD (max. breakdown pt.) | Green-Martin |
"MCDMBP.HRRAW.T3.ALT" | MCD (max. breakdown pt.) | Hardin-Rocke |
"MCDMBP.GMADJ.T3.ALT" | MCD (max. breakdown pt.) | Green-Martin (adj.) |
"MCDMBP.HRADJ.T3.ALT" | MCD (max. breakdown pt.) | Hardin-Rocke (adj.) |
"RMCDMBP.T3.ALT" | Reweighted MCD (max. breakdown pt.) | chi-squared |
"RMCDMBP.CH.T3.ALT" | Reweighted MCD (max. breakdown pt.) with Bonferroni correction | chi-squared |
"MCD75.RAW.T3.ALT" | MCD(0.75) | chi-squared |
"MCD75.GMRAW.T3.ALT" | MCD(0.75) | Green-Martin |
"MCD75.HRRAW.T3.ALT" | MCD(0.75) | Hardin-Rocke |
"MCD75.GMADJ.T3.ALT" | MCD(0.75) | Green-Martin (adj.) |
"MCD75.HRADJ.T3.ALT" | MCD(0.75) | Hardin-Rocke (adj.) |
"RMCD75.T3.ALT" | Reweighted MCD(0.75) | chi-squared |
"RMCD75.CH.T3.ALT" | Reweighted MCD(0.75) with Bonferroni correction | chi-squared |
"MCD95.RAW.T3.ALT" | MCD(0.95) | chi-squared |
"MCD95.GMRAW.T3.ALT" | MCD(0.95) | Green-Martin |
"MCD95.HRRAW.T3.ALT" | MCD(0.95) | Hardin-Rocke |
"MCD95.GMADJ.T3.ALT" | MCD(0.95) | Green-Martin (adj.) |
"MCD95.HRADJ.T3.ALT" | MCD(0.95) | Hardin-Rocke (adj.) |
"RMCD95.T3.ALT" | Reweighted MCD(0.95) | chi-squared |
"RMCD95.CH.T3.ALT" | Reweighted MCD(0.95) with Bonferroni correction | chi-squared |
The adjusted versions of the Hardin-Rocke tests remove the finite sample correction when the sample size is 100 or greater. Empirical tests suggested that Hardin and Rocke did not use this correction factor.
The specified values of alpha correspond to the third dimension; the dimnames will be of the form “alpha” + alpha.
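To show how an array with this layout can be indexed and summarized, the sketch below builds a small mock array. The data are random placeholders, only three of the 74 column names are used, and the third-dimension names assume that “alpha” + alpha means paste0("alpha", alpha); the actual dimnames produced by the function may be formatted differently.

```r
set.seed(1)
alpha <- c(0.01, 0.025, 0.05)
col.names <- c("OGK.T1", "ROGK.T1", "OGK.T3") # subset of the real column names

# Mock results: 5 simulation runs x 3 estimators/tests x 3 alpha levels
mock <- array(runif(5 * 3 * 3), dim = c(5, 3, 3),
              dimnames = list(NULL, col.names, paste0("alpha", alpha)))

# All runs for one test at one significance level
one.test <- mock[, "OGK.T1", "alpha0.05"]

# Average over simulation runs: one value per (test, alpha) pair
avg <- apply(mock, c(2, 3), mean)
avg
```

Applying mean or sd over dimensions 2 and 3, as in the Examples section, gives one summary value per test and significance level.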
Written and maintained by Christopher G. Green <christopher.g.green@gmail.com>
Andrea Cerioli, Marco Riani, and Anthony C. Atkinson. Controlling the size of multivariate outlier tests with the MCD estimator of scatter. Statistics and Computing, 19:341-353, 2009.
C. G. Green and R. Douglas Martin. An extension of a method of Hardin and Rocke, with an application to multivariate outlier detection via the IRMCD method of Cerioli. Working Paper, 2014. Available from http://students.washington.edu/cggreen/uwstat/papers/cerioli_extension.pdf
J. Hardin and D. M. Rocke. The distribution of robust distances. Journal of Computational and Graphical Statistics, 14:928-946, 2005.
## Not run:
# This runs a small experiment and assumes a cluster is available.
# The vignette provides a better recipe for
# replicating Cerioli et al. (2009).
require( parallel )
require( CerioliOutlierDetection )
require( HardinRockeExtensionSimulations )
# we use a socket cluster on Windows;
# change to your preferred method of
# creating a cluster
thecluster <- makePSOCKcluster(4)
N.SIM <- 500
B.SIM <- 50
# lgf is left at its default (""); set logfile to a path to write a log
logfile <- ""
# initialize each node by loading the required packages
tmp.rv <- clusterEvalQ( cl = thecluster, {
  require(abind, quietly=TRUE)
  require(rrcov, quietly=TRUE)
  require(mvtnorm, quietly=TRUE)
  require(CerioliOutlierDetection, quietly=TRUE)
  require(HardinRockeExtensionSimulations, quietly=TRUE)
  Sys.sleep(30)
  invisible(NULL)
})
# run N.SIM simulations in blocks of B.SIM
results <- table13sim.parallel(cl=thecluster, p = 4, nn = 300,
  N=N.SIM, B=B.SIM, lgf=logfile)
stopCluster(thecluster)
# calculate some statistics for each test and alpha level
apply(results,c(2,3),mean)
apply(results,c(2,3),sd)
## End(Not run)