View source: R/CDFtestingSuite.R
The suite is a system for determining the utility of differentially private cumulative distribution function (DP-CDF) algorithm implementations. The system can empirically evaluate and provide visualizations for several DP-CDF algorithms simultaneously, under various parameters. It can also be set to focus strictly on data collection, rather than spending time on visualization.
It comes with several pre-loaded adjustable synthetic datasets, and can also analyze functions on user-defined datasets.
dpCDF implementations to be tested must accept the following arguments:
data, epsilon, granularity, range, and any number of other inputs.
See "?functionH" for an example of an implementation that draws on C++ files
through Rcpp.
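As a rough illustration of the argument list above, here is a minimal sketch of a function with that shape. It is not part of the package: the name laplaceHistCDF is hypothetical, and it assumes that range is c(min, max), that granularity is the bin width, and that the expected return value is one CDF value per bin. It uses a plain Laplace-noised histogram, with neighboring datasets assumed to differ by adding or removing one record.

```r
# Hypothetical DP-CDF implementation with the argument list described above.
# Assumptions (not taken from the package): `range` is c(min, max),
# `granularity` is the bin width, and the result is one CDF value per bin.
laplaceHistCDF <- function(data, epsilon, granularity, range, ...) {
  # Bin edges covering the domain in steps of `granularity`
  breaks <- seq(range[1], range[2], by = granularity)
  if (tail(breaks, 1) < range[2]) breaks <- c(breaks, range[2])
  # Clamp the data into the domain, then count records per bin
  clamped <- pmin(pmax(data, range[1]), range[2])
  counts <- hist(clamped, breaks = breaks, plot = FALSE,
                 include.lowest = TRUE)$counts
  # Laplace(0, 1/epsilon) noise as the difference of two exponentials
  # (assumes add/remove-one-record neighbors, i.e. sensitivity 1)
  noise <- rexp(length(counts), rate = epsilon) -
    rexp(length(counts), rate = epsilon)
  noisy <- counts + noise
  # Normalize the noisy cumulative counts into an approximate CDF
  cumsum(noisy) / sum(noisy)
}

set.seed(1)
out <- laplaceHistCDF(runif(1000, 0, 100), epsilon = 1,
                      granularity = 10, range = c(0, 100))
```

The output is not guaranteed to be monotone or to stay within [0, 1]; with very small samples the noisy total can even be non-positive, which a real implementation would need to guard against.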
USERS SHOULD NOTE: the following included diagnostic functions are under development:
SkewDiffpdf, KurtDiffpdf, and StdDiffpdf, which measure the error in the
skewness, kurtosis, and standard deviation generated from dpCDFs.
This is evident in their occasionally returning NA.
CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist, Fnameslist,
  epslist = c(0.05, 0.1, 1), datalist, Dnameslist, synthsets = NULL, range,
  gran = 1, granlist = c(1), samplesize = 0, nlist = (10000),
  cdfstep = 1, reps = 5, ExtraTests_CDF = list(),
  ExtraTests_PDF = list(), setseed = -100, comments = "none",
  SmoothAll = FALSE, EmpiricBounds = FALSE, AnalyticBounds = FALSE,
  AnalyticProbSleeve = FALSE, SuppressRealCDF = FALSE,
  SuppressDPCDF = FALSE, SuppressLegends = FALSE, ...)
Visualization | Sets the testing suite into Visualization mode (default,
OutputDirectory | The location of the folder which will hold the output (
functlist | A list of CDF-computing functions to be tested on the
Fnameslist | A vector of function names corresponding to the functions
epslist | A vector of epsilon values for differential privacy
datalist | A list containing vectors of data, each to be used in a test
Dnameslist | A list of dataset names corresponding to the data/variables being tested; used for labelling the output
synthsets | This script generates pre-defined synthetic datasets upon request, and fully incorporates them into testing. To call them, users should input a string vector containing the names of the sets they desire. For example,
range | The range of the domain as a vector
gran | FOR Visualization MODE ONLY. Refer to
granlist | FOR Data Collection MODE ONLY. Refer to
samplesize | FOR Visualization MODE ONLY. Refer to
nlist | FOR Data Collection MODE ONLY. Refer to
cdfstep | The step size used in outputting the approximate CDF;
reps | The number of times to repeat each diagnostic. Higher
ExtraTests_CDF | If a user wishes to add extra diagnostics, the proper
ExtraTests_PDF | See above
setseed | In the function, each combination of data, epsilon, and function is executed with a separate seed, which by default is randomly generated and reported. Users interested in replicating specific results can locate the reported seed and parameter combination to replicate tests.
comments | Comments written here print to a log in Excel
SmoothAll | Applies L2 monotonicity post-processing to every DP-CDF
EmpiricBounds | FOR Visualization MODE ONLY. When TRUE, output graphs depict the minimum and maximum values taken by each bin across reps
AnalyticBounds | FOR Visualization MODE ONLY. This is a flag and should be set to
AnalyticProbSleeve | FOR Visualization MODE ONLY. When
SuppressRealCDF | FOR Visualization MODE ONLY. When
SuppressDPCDF | FOR Visualization MODE ONLY. When
SuppressLegends | FOR Visualization MODE ONLY. When
... | Optionally add additional parameters. This is primarily used to allow automated execution of varied diagnostic functions.
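The SmoothAll argument above refers to L2 monotonicity post-processing. The package's own routine is not shown on this page; one common way to obtain the least-squares nondecreasing fit to a noisy CDF is isotonic regression, available in base R as stats::isoreg. The sketch below uses made-up noisy values and additionally clamps the result to [0, 1], which is an assumption rather than documented behavior.

```r
# Made-up noisy CDF values with a few monotonicity violations
noisy_cdf <- c(0.05, 0.02, 0.30, 0.28, 0.55, 0.90, 0.85, 1.02)

# Isotonic regression: the nondecreasing sequence closest in L2 distance
smoothed <- isoreg(noisy_cdf)$yf

# Assumption: also clamp into [0, 1] so the result is a valid CDF
smoothed <- pmin(pmax(smoothed, 0), 1)
```

Because this step only transforms an already-private output, it does not consume any additional privacy budget.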
If Visualization = TRUE, a list containing:
...$means containing mean diagnostic
results for each diagnostic across reps iterations for each parameter combination;
...$medians containing median diagnostic
results for each diagnostic across reps iterations for each parameter combination;
...$yourCDFoutput containing a single dpCDF iteration for each parameter combination;
...$yourPDFoutput containing a single dpPDF iteration for each parameter combination;
...$realCDFoutput containing the real (non-DP) CDF output for each relevant parameter combination;
...$realPDFoutput containing the real (non-DP) PDF output for each relevant parameter combination;
...$databins containing the domain used to construct the CDFs;
...$TestPack_CDF containing the definitions of diagnostic functions used on dpCDFs;
...$TestPack_PDF containing the definitions of diagnostic functions used on dpPDFs;
...$allscores containing all raw diagnostic output;
...$seed containing the list of seeds used in the test;
...$permetric holding a rearranged dataframe (ordered by parameter
combinations) useful for plotting.
A .pdf file:
with boxplots showing the distributions of diagnostic outputs,
and categorized plots of DP-CDF function output. Each such graph will
show one arbitrary CDF iteration and the empirical boundaries.
The empirical boundaries are the max and min values reached by that
function (and parameters) during the test.
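The empirical boundaries described above amount to a per-bin minimum and maximum over repeated runs. A minimal sketch of that computation follows; the simulated runs matrix is a hypothetical stand-in for repeated DP-CDF outputs and does not come from CDFtest.

```r
set.seed(42)
reps <- 5
bins <- 10

# Hypothetical stand-in for repeated DP-CDF outputs: one row per rep,
# one column per bin (a true CDF plus small Gaussian perturbations)
true_cdf <- seq(0.1, 1, length.out = bins)
runs <- t(replicate(reps, true_cdf + rnorm(bins, sd = 0.03)))

# Empirical boundaries: min and max reached in each bin across reps
lower <- apply(runs, 2, min)
upper <- apply(runs, 2, max)
```

Any single iteration, such as runs[1, ], lies between lower and upper by construction.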
A .csv file:
containing the mean and median scores of each diagnostic on each
combination of data, eps, and function, along with the seed list for reproduction.
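The reported seeds are what make a given result reproducible. The sketch below illustrates the general mechanism with base R's set.seed only; run_once is a hypothetical stand-in for one randomized test, and nothing here reflects how CDFtest handles seeds internally.

```r
# Hypothetical stand-in for one randomized test: draws Laplace-style noise
run_once <- function(seed, epsilon) {
  set.seed(seed)
  rexp(3, rate = epsilon) - rexp(3, rate = epsilon)
}

# A randomly generated seed, recorded so the run can be repeated later
seed <- sample.int(.Machine$integer.max, 1)

first  <- run_once(seed, epsilon = 0.1)
second <- run_once(seed, epsilon = 0.1)

# Re-running with the recorded seed and parameters gives the same draws
same <- identical(first, second)
```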
Notes on Visualization mode: Both the .pdf and .csv components are named with a time stamp index,
in the form YearMonthDayHourMinuteSecond. To locate particular tests,
look at CDFtestindexchart.csv, which automatically records the
parameters and index of each test. These files can be found in the folder specified by
OutputDirectory, which defaults to the R temporary directory, tempdir().
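For orientation, a time stamp of the YearMonthDayHourMinuteSecond form can be produced in base R as shown below, and the index chart can be read like any other .csv file. This is a sketch only: the exact format string used by the package and the column layout of CDFtestindexchart.csv are not given on this page, so the code assumes nothing about them beyond the file name.

```r
# A 14-digit time stamp such as 20240131153045 (assumed format string)
index <- format(Sys.time(), "%Y%m%d%H%M%S")

# Default output location according to the notes above
out_dir <- tempdir()
chart_path <- file.path(out_dir, "CDFtestindexchart.csv")

# Read the index chart only if a previous test has created it
if (file.exists(chart_path)) {
  chart <- read.csv(chart_path)
  print(head(chart))
}
```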
Alternatively, in Data Collection mode (Visualization = FALSE), a list containing:
...$allscores holding the output of each combination of parameters.
Each eps in epslist is run with granularity and sample size fixed at the first values
specified in granlist and nlist. The same is true for varying
granularity and sample size. In that way, only one variable is varied at a time
while the other two are held fixed. All such combinations of parameters are
executed on all combinations of data and function (specified within
datalist and functlist);
...$seed holding the list of seeds used in the test.
A .csv file containing the entire (raw) results (across reps iterations)
of diagnostic functions on DP-CDF algorithms for each combination of
dataset and function, looped over epsilon, granularity, and sample size values
as described directly above.
This mode was designed for collecting metric data for subsequent supervised
learning modelling.
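The one-variable-at-a-time scheme described for Data Collection mode can be sketched as a small parameter table. The values below are illustrative, and the assumption that the held-fixed epsilon is also the first element of epslist follows from "the same is true" in the description rather than from the package source.

```r
epslist  <- c(0.05, 0.1, 1)
granlist <- c(2500, 1250, 1000)
nlist    <- c(100, 1000, 10000)

# Vary one parameter at a time; hold the other two at their first values
one_at_a_time <- rbind(
  data.frame(varied = "eps",  eps = epslist,    gran = granlist[1], n = nlist[1]),
  data.frame(varied = "gran", eps = epslist[1], gran = granlist,    n = nlist[1]),
  data.frame(varied = "n",    eps = epslist[1], gran = granlist[1], n = nlist)
)
```

This yields 9 rows here instead of the 27 a full grid would need. The baseline combination (first eps, first gran, first n) appears once in each group; whether the package runs it once or three times is not stated on this page.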
CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist = c(functionH),
  Fnameslist = c("H"), epslist = c(.1, .01), datalist = list(),
  Dnameslist = c(), synthsets = list(list("wage", 100000, "uniform"),
  list("wage", 100000, "sparse"), list("wage", 100000, "bimodal")),
  range = c(1, 500000), gran = 1000, granlist = c(2500, 1250, 1000, 500),
  samplesize = 0, nlist = c(100, 1000, 10000, 100000, 1000000),
  cdfstep = 0, reps = 5, ExtraTests_CDF = list(), ExtraTests_PDF = list(),
  setseed = c(-100),
  comments = "x", SmoothAll = FALSE, EmpiricBounds = FALSE,
  AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE,
  SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE)