Description
This is a worker function that calculates, via simulation, the reciprocal consistency correction “c” and the Wishart degrees of freedom “m” needed in the Hardin-Rocke approximate distribution for the minimum covariance determinant (MCD) covariance estimate. This function generalizes their work to MCD fits using data fractions other than the maximum breakdown point case.
Usage

hr.cm(p, n, N, B = 10000,
      mcd.alpha = max.bdp.mcd.alpha, logfile = "hr_cm.log")
Arguments

p: The dimension of the data used in each simulated run.

n: The number of observations used in each simulated run.

N: The number of simulations to run.

B: The batch/block size: the number of simulations to run in each block. This is useful when running very large simulations.

alpha: The significance level to use for detecting outliers.

mcd.alpha: The fraction of the data to use in computing the MCD. Defaults to the maximum breakdown point fraction. Can be a vector of values, in which case the calculations will be done for each value of mcd.alpha.

logfile: Name of the file into which to write logging information.
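The maximum breakdown point fraction that mcd.alpha defaults to is floor((n + p + 1)/2)/n, the formula that also appears (with v for p) in the example below. A minimal sketch of that default; max_bdp_alpha here is a hypothetical stand-in for the package's max.bdp.mcd.alpha, not the package function itself:

```r
# Maximum-breakdown-point MCD fraction, floor((n + p + 1)/2)/n.
# max_bdp_alpha is an illustrative helper, not the package's
# max.bdp.mcd.alpha object.
max_bdp_alpha <- function(n, p) floor((n + p + 1) / 2) / n

max_bdp_alpha(100, 3)  # 0.52
max_bdp_alpha(50, 5)   # 0.56
```

As n grows with p fixed, this fraction approaches 1/2, the familiar maximum-breakdown subset size for the MCD.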
Details

For each simulated data set, the covariance matrix of the MCD subset is computed, and the sum of its diagonal entries and the sum of the squares of those entries are saved. The estimate of “c”, the reciprocal consistency constant, is then given by the mean of the sums of the diagonal entries. The estimate of “m” uses this estimated “c” together with an estimate of the variance of the diagonal elements.
Based on Johanna Hardin's cm.R and mcd.est.R code.
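The moment matching described above can be sketched in a few lines of base R. This is an illustration only, not the hr.cm implementation: the MCD subset is found by a naive random-start C-step search, dividing the mean trace by p is an assumed normalization for “c”, and “m” is recovered from Var[S_ii] = 2c^2/m, the diagonal-element variance implied by approximating the MCD covariance as (c/m) times a Wishart_p(m, I) matrix:

```r
# Illustrative sketch only, not the package's hr.cm code.
# Naive MCD subset: random starts followed by C-steps (each C-step
# cannot increase the covariance determinant, so the loop terminates).
mcd_subset <- function(x, h, nstart = 10) {
  n <- nrow(x); best <- NULL; bestdet <- Inf
  for (s in seq_len(nstart)) {
    idx <- sample(n, h)
    repeat {
      mu <- colMeans(x[idx, , drop = FALSE])
      S  <- cov(x[idx, , drop = FALSE])
      newidx <- order(mahalanobis(x, mu, S))[seq_len(h)]
      if (setequal(newidx, idx)) break
      idx <- newidx
    }
    dt <- det(cov(x[idx, , drop = FALSE]))
    if (dt < bestdet) { bestdet <- dt; best <- idx }
  }
  best
}

set.seed(42)
p <- 3; n <- 100; h <- floor(0.75 * n); N <- 40
# Per run, save the sum of the diagonal entries of the MCD subset
# covariance and the sum of their squares, as in the Details above.
sums <- t(replicate(N, {
  x <- matrix(rnorm(n * p), n, p)
  d <- diag(cov(x[mcd_subset(x, h), , drop = FALSE]))
  c(sum(d), sum(d^2))
}))
c.hat <- mean(sums[, 1]) / p             # mean diagonal entry E[S_ii]
v.hat <- mean(sums[, 2]) / p - c.hat^2   # Var[S_ii]
m.hat <- 2 * c.hat^2 / v.hat             # from Var[S_ii] = 2*c^2/m
```

Because the raw (uncorrected) MCD covariance of clean normal data underestimates the identity, c.hat lands below 1; hr.cm returns this “c” and the matched “m” for each requested mcd.alpha.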
Value

A data frame with the following columns:

c: The estimate of “c”, the reciprocal consistency constant.

m: The estimated Wishart degrees of freedom “m”.

mcd.alpha: The input data fraction(s).
Author(s)

Written and maintained by Christopher G. Green <christopher.g.green@gmail.com>. Based on code by Johanna S. Hardin.
References

J. Hardin and D. M. Rocke. The distribution of robust distances. Journal of Computational and Graphical Statistics, 14:928-946, 2005.
J. Hardin. R code to estimate the MCD (mcd.est.r) and to estimate c and m (cm.r). Available from http://pages.pomona.edu/~jsh04747/Research/cm.r and http://pages.pomona.edu/~jsh04747/Research/mcd.est.r, respectively. 2005. Accessed December 10, 2015.
Examples

## Not run:
# Monte Carlo simulation to compute the
# Wishart degrees of freedom m associated
# with the Hardin-Rocke approximation to
# the MCD covariance estimate
#
# See Hardin and Rocke (2005) for the
# original simulation in the maximum
# breakdown point case.
#
# This script extends their work to
# sampling fractions greater than the
# maximum breakdown point case
# floor( ( n+v+1 )/2 )/n.
#
require(parallel)
# on Windows we use socket clusters
# use whatever parallel backend works best for you
# change '4' to the number of nodes/workers/etc. you
# want to use
thecluster <- makePSOCKcluster(4)
# make this reproducible
clusterSetRNGStream(cl = thecluster, 2014)
# initialize each node
tmp.rv <- clusterEvalQ( cl = thecluster, {
require( CerioliOutlierDetection )
require( HardinRockeExtensionSimulations )
N.SIM <- 5000
B.SIM <- 250
my.pid <- Sys.getpid()
cat("My pid is ", my.pid, "\n")
logfile <- paste("Simulation_AllAlphas_Parallel_logfile_",my.pid,".txt",sep="")
cat("Initialized\n\n", file=logfile)
invisible(NULL)
})
# build the pairs of sample size n and dimension p
hr.cm.params <- expand.grid(
list(
p=c(3,5,7,10,15,20),
n=c(50,100,250,500,750,1000)
)
)
# adding more coverage for small sample sizes
hr.cm.params <- rbind( hr.cm.params, within(
expand.grid(list(p=c(3,5,7,10,15,20), ratio=c( 3,5,7,9,11 ) )),
{
n <- p * ratio
rm(ratio)
}
))
# remove any duplicates
hr.cm.params <- unique(hr.cm.params)
# want to run most expensive cases first
hr.cm.params <- hr.cm.params[ order( hr.cm.params$n, hr.cm.params$p, decreasing=TRUE ), ]
# add maximum breakdown point case to the params data set
hr.cm.params[,"mbp"] <- apply( hr.cm.params, 1, function(x) floor( (x[2] + x[1] + 1)/2 )/x[2] )
# want each case to be a column so that we can use parLapply
hr.cm.params <- data.frame(t(as.matrix(hr.cm.params)))
mcd.alphas <- c(0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,0.99,0.995)
clusterExport(cl = thecluster, "hr.cm.params")
clusterExport(cl = thecluster, "mcd.alphas")
#
# using parLapply here to prevent simplification of the
# results (as parApply would attempt to do)
#
cat("Starting run at ", format(Sys.time()), "\n")
hr.cm.results.all.pre <- parLapply(cl = thecluster,
# hr.cm.params[1:5] runs only the first five cases for illustration;
# use X = hr.cm.params to run the full grid
X = hr.cm.params[1:5], function(pn) {
cat("Starting case p = ",pn[1]," n = ",pn[2]," at time ",
format(Sys.time()), " \n",file=logfile,append=TRUE)
results <- hr.cm(p = pn[1] , n = pn[2], N=N.SIM, B=B.SIM,
mcd.alpha=unique(c(pn[3],mcd.alphas)), logfile=logfile)
cat("Finished case p = ",pn[1]," n = ",pn[2]," at time ",
format(Sys.time()), " \n",file=logfile,append=TRUE)
data.frame(p=pn[1],n=pn[2],mbp=pn[3],results)
}
)
cat("Run completed at ", format(Sys.time()), "\n")
stopCluster(thecluster)
hr.cm.results.all <- do.call("rbind", hr.cm.results.all.pre )
save("hr.cm.results.all", file="hr.cm.results.all.final.rda")
## End(Not run)