# hr.cm: Hardin and Rocke (2005) corrections for general MCD estimates

In christopherggreen/HardinRockeExtensionSimulations: Replication and Extension of Hardin and Rocke (2005), Cerioli et al. (2009)

## Description

This is a worker function that calculates, via simulation, the (reciprocal) consistency correction “c” and the Wishart degrees of freedom “m” needed in the Hardin and Rocke approximate distribution for the minimum covariance determinant (MCD) covariance estimate. This function generalizes their work to MCD estimates computed with data fractions other than the maximum breakdown point case.

## Usage

```r
hr.cm(p, n, N, B = 10000, mcd.alpha = max.bdp.mcd.alpha, logfile = "hr_cm.log")
```

## Arguments

- `p`: The dimension of the data used in each simulated run.
- `n`: The number of observations used in each simulated run.
- `N`: The number of simulations to run.
- `B`: The batch/block size: the number of simulations to run in each block. This is useful for very large simulation runs (`N` very large) where memory is a concern.
- `alpha`: The significance level to use for detecting outliers.
- `mcd.alpha`: The fraction of the data to use in computing the MCD. Defaults to the maximum breakdown point fraction. Can be a vector of values, in which case the calculations are done for each value of `mcd.alpha`.
- `logfile`: Name of the file into which to write logging information.

## Details

For each simulated data set, the function computes the covariance matrix of the MCD subset, then saves the sum of the diagonal entries of that covariance matrix and the sum of the squares of the diagonal entries. The estimate of “c”, the reciprocal consistency constant, is the mean of the sums of the diagonal entries; the estimate of “m” is derived from this estimated “c” and an estimate of the variance of the diagonal elements.

Based on Johanna Hardin's cm.R and mcd.est.R code.
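The moment-matching idea behind “c” and “m” can be illustrated with a self-contained toy sketch. This is not the package's simulation: it draws scaled Wishart matrices directly via `stats::rWishart` instead of computing MCD subsets, and the moment formulas used (mean and variance of the diagonal entries of a Wishart matrix) are one standard way to fit a Wishart distribution, which may differ in detail from `hr.cm`'s internals.

```r
# Toy illustration: recover "c" and "m" from the diagonal entries of
# scaled Wishart draws. If S ~ Wishart(m, (c/m) * I_p), then each
# diagonal element d satisfies E[d] = c and Var[d] = 2 c^2 / m, so
#   c.hat = mean(d),  m.hat = 2 * c.hat^2 / var(d).
set.seed(2014)
p      <- 5
m      <- 50    # true Wishart degrees of freedom
c.true <- 0.7   # true (reciprocal) consistency constant

# rWishart returns a p x p x n array of draws
draws <- stats::rWishart(20000, df = m, Sigma = diag(c.true / m, p))
d     <- apply(draws, 3, diag)   # p x 20000 matrix of diagonal entries

c.hat <- mean(d)
m.hat <- 2 * c.hat^2 / var(as.vector(d))

c.hat   # should be close to 0.7
m.hat   # should be close to 50
```

With MCD in place of the direct Wishart draws, the same moment matching yields the “c” and “m” that `hr.cm` reports for each `mcd.alpha`.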

## Value

A data frame with the following columns:

- `c`: The estimate of “c”, the reciprocal consistency constant.
- `m`: The estimated Wishart degrees of freedom “m”.
- `mcd.alpha`: The input data fraction(s) `mcd.alpha`.

## Author(s)

Written and maintained by Christopher G. Green <[email protected]>. Based on code by Johanna S. Hardin.

## References

J. Hardin and D. M. Rocke. The distribution of robust distances. Journal of Computational and Graphical Statistics, 14:928-946, 2005.

J. Hardin. R code to estimate the MCD (mcd.est.r) and to estimate c and m (cm.r). Available from http://pages.pomona.edu/~jsh04747/Research/cm.r and http://pages.pomona.edu/~jsh04747/Research/mcd.est.r, respectively. 2005. Accessed December 10, 2015.

## Examples

```r
## Not run:
# Monte Carlo simulation to compute the
# Wishart degrees of freedom m associated
# with the Hardin-Rocke approximation to
# the MCD covariance estimate
#
# See Hardin and Rocke (2005) for the
# original simulation in the maximum
# breakdown point case.
#
# This script extends their work to
# sampling fractions greater than the
# maximum breakdown point case
# floor( ( n+v+1 )/2 )/n.
#
require(parallel)

# on Windows we use socket clusters
# use whatever parallel backend works best for you
# change '4' to the number of nodes/workers/etc. you
# want to use
thecluster <- makePSOCKcluster(4)

# make this reproducible
clusterSetRNGStream(cl = thecluster, 2014)

# initialize each node
tmp.rv <- clusterEvalQ( cl = thecluster, {
  require( CerioliOutlierDetection )
  require( HardinRockeExtensionSimulations )
  N.SIM <- 5000
  B.SIM <- 250

  my.pid <- Sys.getpid()
  cat("My pid is ", my.pid, "\n")
  logfile <- paste("Simulation_AllAlphas_Parallel_logfile_",
    my.pid, ".txt", sep="")
  cat("Initialized\n\n", file=logfile)

  invisible(NULL)
})

# build the pairs of sample size n and dimension p
hr.cm.params <- expand.grid( list(
  p=c(3,5,7,10,15,20),
  n=c(50,100,250,500,750,1000)
) )
# adding more coverage for small sample sizes
hr.cm.params <- rbind( hr.cm.params,
  within( expand.grid(list(p=c(3,5,7,10,15,20),
    ratio=c( 3,5,7,9,11 ) )), {
      n <- p * ratio
      rm(ratio)
  })
)
# remove any duplicates
hr.cm.params <- unique(hr.cm.params)
# want to run most expensive cases first
hr.cm.params <- hr.cm.params[ order( hr.cm.params$n,
  hr.cm.params$p, decreasing=TRUE ), ]

# add maximum breakdown point case to the params data set
hr.cm.params[,"mbp"] <- apply( hr.cm.params, 1,
  function(x) floor( (x[2] + x[1] + 1)/2 )/x[2] )

# want each case to be a column so that we can use parLapply
hr.cm.params <- data.frame(t(as.matrix(hr.cm.params)))

mcd.alphas <- c(0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,0.99,0.995)

clusterExport(cl = thecluster, "hr.cm.params")
clusterExport(cl = thecluster, "mcd.alphas")

#
# using parLapply here to prevent simplification of the
# results (as parApply would attempt to do)
#
cat("Starting run at ", format(Sys.time()), "\n")
hr.cm.results.all.pre <- parLapply(cl = thecluster,
  X = hr.cm.params[1:5], function(pn) {
    cat("Starting case p = ", pn[1], " n = ", pn[2], " at time ",
      format(Sys.time()), " \n", file=logfile, append=TRUE)
    results <- hr.cm(p = pn[1], n = pn[2], N=N.SIM, B=B.SIM,
      mcd.alpha=unique(c(pn[3], mcd.alphas)), logfile=logfile)
    cat("Finished case p = ", pn[1], " n = ", pn[2], " at time ",
      format(Sys.time()), " \n", file=logfile, append=TRUE)
    data.frame(p=pn[1], n=pn[2], mbp=pn[3], results)
  }
)
cat("Run completed at ", format(Sys.time()), "\n")

stopCluster(thecluster)

hr.cm.results.all <- do.call("rbind", hr.cm.results.all.pre )
save("hr.cm.results.all", file="hr.cm.results.all.final.rda")

## End(Not run)
```

christopherggreen/HardinRockeExtensionSimulations documentation built on May 13, 2019, 7:04 p.m.