View source: R/CDFtestingSuite.R
The suite is a system for determining the utility of differentially private cumulative distribution function (DP-CDF) algorithm implementations. The system can empirically evaluate and provide visualizations for several DP-CDF algorithms simultaneously, under various parameters. It can also be set to focus strictly on data collection, rather than spending time on visualization.
It comes with several pre-loaded adjustable synthetic datasets, and can also analyze functions on user-defined datasets.
dpCDF implementations to be tested must take the following arguments: data, epsilon, granularity, range, and any number of other inputs.
Use "?functionH" for an example of an implementation drawing on C++ files
through Rcpp.
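As a rough illustration of the required signature, the following is a minimal sketch of a function that accepts data, epsilon, granularity, range, and extra inputs. The name laplaceCDF is hypothetical and not part of the suite, the Laplace-noised histogram is a generic technique chosen for illustration, and the return format the suite expects is not stated here, so a plain numeric vector is assumed. The sketch also assumes a count sensitivity of 1 and treats the sample size as public.

```r
# Hypothetical example, not part of the package: a Laplace-noised histogram
# accumulated into a CDF. Only the argument names come from the text above.
laplaceCDF <- function(data, epsilon, granularity, range, ...) {
  # Bin edges spanning the domain in steps of `granularity`
  breaks <- seq(range[1], range[2], by = granularity)
  if (breaks[length(breaks)] < range[2]) breaks <- c(breaks, range[2])

  # Clamp the data into the stated domain so every value falls in a bin
  clamped <- pmin(pmax(data, range[1]), range[2])
  counts <- hist(clamped, breaks = breaks, plot = FALSE)$counts

  # Laplace(0, 1/epsilon) noise, drawn as the difference of two exponentials
  # (assumes each record changes one count by at most 1)
  noise <- rexp(length(counts), rate = epsilon) -
    rexp(length(counts), rate = epsilon)

  # Accumulate noisy counts and normalize by the (assumed public) sample size
  cumsum(counts + noise) / length(data)
}

set.seed(1)
example_cdf <- laplaceCDF(c(1, 2, 3, 4, 5), epsilon = 1,
                          granularity = 1, range = c(0, 5))
```

The output is not forced to be monotone or bounded in [0, 1]; post-processing such as the SmoothAll option would address that separately.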
USERS SHOULD NOTE: the following included diagnostic functions are under development: SkewDiffpdf, KurtDiffpdf, StdDiffpdf, corresponding to error measurements of skewness, kurtosis, and standard deviation generated from dpCDFs. This is evident in the occasional result of NA.
CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist, Fnameslist,
  epslist = c(0.05, 0.1, 1), datalist, Dnameslist, synthsets = NULL, range,
  gran = 1, granlist = c(1), samplesize = 0, nlist = (10000),
  cdfstep = 1, reps = 5, ExtraTests_CDF = list(),
  ExtraTests_PDF = list(), setseed = -100, comments = "none",
  SmoothAll = FALSE, EmpiricBounds = FALSE, AnalyticBounds = FALSE,
  AnalyticProbSleeve = FALSE, SuppressRealCDF = FALSE,
  SuppressDPCDF = FALSE, SuppressLegends = FALSE, ...)
Visualization |
Sets the testing suite into Visualization mode (default,
|
OutputDirectory |
The location of the folder which will hold the
output ( |
functlist |
A list of CDF-computing functions to be tested on the
|
Fnameslist |
A vector of function names corresponding to the functions |
epslist |
A vector of epsilon values for differential privacy |
datalist |
A list containing vectors of data, each to be used in a test |
Dnameslist |
A list of dataset names corresponding to the data/variables being tested; used for labelling the output |
synthsets |
This script generates pre-defined synthetic datasets upon
request, and fully incorporates them into testing. To call them, users
should input a string vector containing the names of the sets they desire.
For example, |
range |
The range of the domain as a vector |
gran |
FOR Visualization MODE ONLY. refer to |
granlist |
FOR Data Collection MODE ONLY. refer to |
samplesize |
FOR Visualization MODE ONLY. refer to |
nlist |
FOR Data Collection MODE ONLY. refer to |
cdfstep |
The step size used in outputting the approximate CDF; |
reps |
The number of times to repeat each diagnostic. higher |
ExtraTests_CDF |
If a user wishes to add extra diagnostics, the proper
|
ExtraTests_PDF |
See above |
setseed |
In the function, each combination of data, epsilon, and function is executed with a separate seed, which by default is randomly generated and reported. Users interested in replicating specific results can locate the reported seed and parameter combination to replicate tests. |
comments |
Comments written here are printed to a log in Excel |
SmoothAll |
Applies L2 monotonicity post-processing to every DP-CDF |
EmpiricBounds |
FOR Visualization MODE ONLY. When TRUE, outputted graphs depict the minimum and maximum values taken by each bin across reps |
AnalyticBounds |
FOR Visualization MODE ONLY. This is a flag and should
be set to |
AnalyticProbSleeve |
FOR Visualization MODE ONLY. When |
SuppressRealCDF |
FOR Visualization MODE ONLY. When |
SuppressDPCDF |
FOR Visualization MODE ONLY. When |
SuppressLegends |
FOR Visualization MODE ONLY. When |
... |
Optionally add additional parameters. This is primarily used to allow automated execution of varied diagnostic functions. |
If Visualization = TRUE, a list containing:
...$means
Contains mean diagnostic
results for each diagnostic across reps iterations for each parameter combination;
...$medians
Contains median diagnostic
results for each diagnostic across reps iterations for each parameter combination;
...$yourCDFoutput
Containing a single dpCDF iteration for each parameter combination;
...$yourPDFoutput
...$realCDFoutput
Containing the real (non-DP) CDF output for each relevant parameter combination;
...$realPDFoutput
Containing the real (non-DP) PDF output for each relevant parameter combination;
...$databins
Containing the domain used to construct the CDFs;
...$TestPack_CDF
Containing the definitions of diagnostic functions used on dpCDFs;
...$TestPack_PDF
Containing the definitions of diagnostic functions used on dpPDFs;
...$allscores
Containing all raw diagnostic output;
...$seed
Containing the list of seeds used in the test;
...$permetric
Holding a rearranged data frame (ordered by parameter
combinations) useful for plotting.
A .pdf file:
with boxplots showing the distributions of diagnostic outputs,
and categorized plots of dp-CDF function output. Each such graph will
show one arbitrary CDF iteration and the empirical boundaries.
The empirical boundaries are the max and min values reached by that
function (and parameters) during the test.
A .csv file:
containing the mean and median scores of each diagnostic on each
combination of data, eps, and function, along with the seedlist for reproduction.
Notes on Visualization mode: Both the .pdf and .csv components are named with a time stamp index,
in the form of YearMonthDayHourMinuteSecond. To locate particular tests,
look at the CDFtestindexchart.csv, which automatically records the
parameters and index of each test. These can be found in the folder specified by
OutputDirectory, which defaults to the R temporary directory, tempdir().
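Two details of Visualization mode described above can be sketched in a few lines: the per-bin empirical boundaries (minimum and maximum across reps) and a time stamp index of the form YearMonthDayHourMinuteSecond. The matrix of runs is made-up illustrative data, and the format string is an assumption about how such an index could be produced, not a statement of what the suite does internally.

```r
# Hypothetical input: one row per repetition, one column per CDF bin
runs <- rbind(c(0.10, 0.45, 0.80, 1.00),
              c(0.05, 0.50, 0.85, 0.98),
              c(0.12, 0.40, 0.90, 1.02))

# Empirical boundaries: smallest and largest value reached in each bin
lower <- apply(runs, 2, min)
upper <- apply(runs, 2, max)

# A time stamp index in YearMonthDayHourMinuteSecond form (assumed format)
index <- format(Sys.time(), "%Y%m%d%H%M%S")
```

Here lower and upper each hold one value per bin, which is the information an envelope around a plotted dpCDF needs.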
Alternatively, in Data Collection mode (Visualization = FALSE), a list containing:
...$allscores
holding the output of each combination of parameters.
Each eps in epslist is tested with the first values
specified in granlist and nlist held fixed. The same is true for varying
granularity and sample size. In that way, only one variable is varied at a time
while the other two are held fixed. All such combinations of parameters are
executed on all combinations of data and function (specified within
datalist and functlist);
...$seed
holding the list of seeds used in the test.
A .csv file containing the entire (raw) results (across reps iterations)
of diagnostic functions on DP-CDF algorithms for each combination of
dataset and function, looped over epsilon, granularity, and sample size values
as described directly above.
This mode was designed for collecting metric data for subsequent supervised
learning modelling.
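The one-variable-at-a-time scheme used in Data Collection mode can be pictured as a small schedule of parameter rows. This is a sketch only: the variable schedule and its column names are hypothetical, and it assumes the two fixed values are always the first entries of their lists, as the description above suggests. Whether the suite removes the repeated baseline row is not stated here.

```r
epslist  <- c(0.05, 0.1, 1)
granlist <- c(2500, 1250, 1000)
nlist    <- c(100, 1000, 10000)

# Each block varies one parameter and pins the other two to their first values
schedule <- rbind(
  data.frame(varied = "epsilon",     eps = epslist,    gran = granlist[1], n = nlist[1]),
  data.frame(varied = "granularity", eps = epslist[1], gran = granlist,    n = nlist[1]),
  data.frame(varied = "samplesize",  eps = epslist[1], gran = granlist[1], n = nlist)
)

# Each row would then be run for every dataset and function pair
print(schedule)
```

With three values per list this gives nine rows rather than the 27 of a full grid, which keeps the number of runs linear in the list lengths.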
CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist = c(functionH),
  Fnameslist = c("H"), epslist = c(.1, .01), datalist = list(),
  Dnameslist = c(), synthsets = list(list("wage", 100000, "uniform"),
    list("wage", 100000, "sparse"), list("wage", 100000, "bimodal")),
  range = c(1, 500000), gran = 1000, granlist = c(2500, 1250, 1000, 500),
  samplesize = 0, nlist = c(100, 1000, 10000, 100000, 1000000),
  cdfstep = 0, reps = 5, ExtraTests_CDF = list(), ExtraTests_PDF = list(),
  setseed = c(-100),
  comments = "x", SmoothAll = FALSE, EmpiricBounds = FALSE,
  AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE,
  SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE)