CDFtest: Comprehensively evaluate and visualize the utility of...
In CDF.PSIdekick: Evaluate Differentially Private Algorithms for Publishing Cumulative Distribution Functions

Description Usage Arguments Value Examples

The suite is a system for determining the utility of differentially private cumulative distribution function (DP-CDF) algorithm implementations. The system can empirically evaluate and provide visualizations for several DP-CDF algorithms simultaneously, under various parameters. It can also be set to focus strictly on data collection, rather than spending time on visualization.

It comes with several pre-loaded adjustable synthetic datasets, and can also analyze functions on user-defined datasets.

dpCDF implementations to be tested must accept the following arguments: data, epsilon, granularity, and range, plus any number of other inputs. See "?functionH" for an example of an implementation that draws on C++ files through Rcpp.

USERS SHOULD NOTE: the following included diagnostic functions are under development: SkewDiffpdf, KurtDiffpdf, and StdDiffpdf, which measure error in the skewness, kurtosis, and standard deviation generated from dpCDFs. As a result, they occasionally return NA.
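
As a rough illustration of the required signature, the sketch below builds a DP-CDF from a Laplace-noised histogram. It is not the package's functionH: the function name `noisyHistCDF`, the exact argument names, the sensitivity of 2, and the choice to return a plain numeric vector are all assumptions made for this example.

```r
# Hypothetical DP-CDF implementation (not part of CDF.PSIdekick).
# Assumed argument names: data, epsilon, granularity, range.
noisyHistCDF <- function(data, epsilon, granularity, range, ...) {
  # Bin edges over the declared domain c(min, max).
  breaks <- seq(range[1], range[2], by = granularity)
  if (breaks[length(breaks)] < range[2]) breaks <- c(breaks, range[2])

  # Clamp values so that every record falls inside the declared domain.
  clamped <- pmin(pmax(data, range[1]), range[2])
  counts <- hist(clamped, breaks = breaks, plot = FALSE)$counts

  # Laplace noise drawn as the difference of two exponentials.
  # Scale 2 / epsilon assumes neighbouring datasets differ by replacing one record.
  rate <- epsilon / 2
  noisy <- counts + rexp(length(counts), rate) - rexp(length(counts), rate)

  # Cumulate, normalise by the noisy total, and clip to [0, 1] (post-processing).
  cdf <- cumsum(noisy) / sum(noisy)
  pmin(pmax(cdf, 0), 1)
}

set.seed(42)
ages <- round(runif(5000, min = 0, max = 100))
est <- noisyHistCDF(ages, epsilon = 1, granularity = 1, range = c(0, 100))
length(est)  # one value per bin
```

The output of this sketch is not guaranteed to be monotone; the SmoothAll argument is the option this page describes for monotonicity post-processing.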

CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist, Fnameslist,
  epslist = c(0.05, 0.1, 1), datalist, Dnameslist, synthsets = NULL, range,
  gran = 1, granlist = c(1), samplesize = 0, nlist = (10000),
  cdfstep = 1, reps = 5, ExtraTests_CDF = list(),
  ExtraTests_PDF = list(), setseed = -100, comments = "none",
  SmoothAll = FALSE, EmpiricBounds = FALSE, AnalyticBounds = FALSE,
  AnalyticProbSleeve = FALSE, SuppressRealCDF = FALSE,
  SuppressDPCDF = FALSE, SuppressLegends = FALSE, ...)

`Visualization`	Sets the testing suite to Visualization mode (default, `Visualization = TRUE`) or Data Collection mode (`Visualization = FALSE`). In Visualization mode, the suite produces a `.csv` file containing the mean and median results (across `reps` iterations) of the diagnostic functions on the DP-CDF algorithms for each combination of dataset, function, and epsilon, and a `.pdf` file containing one graphical example DP-CDF for each combination of dataset, function, and epsilon, as well as a set of boxplots showing the distribution of all diagnostic results for all combinations of parameters. In Data Collection mode, the suite produces a `.csv` file containing the entire (raw) results (across `reps` iterations) of the diagnostic functions on the DP-CDF algorithms for each combination of dataset and function, looped separately over all epsilons, then all granularities, then all sample sizes.
`OutputDirectory`	The location of the folder which will hold the output (`.csv` and `.pdf` files). This defaults to the `tempdir()` directory.
`functlist`	A list of CDF-computing functions to be tested on the `CDFtestTrack` (if `Visualization = TRUE`) or `CDFtestTrackx` (if `Visualization = FALSE`)
`Fnameslist`	A vector of function names corresponding to the functions in `functlist`
`epslist`	A vector of epsilon values for differential privacy
`datalist`	A list containing vectors of data, each to be used in a test
`Dnameslist`	A list of dataset names corresponding to the data/variables being tested; used for labelling the output
`synthsets`	The suite generates pre-defined synthetic datasets on request and fully incorporates them into testing. To request them, supply a list of lists, each describing one dataset, for example `synthsets = list(list(type,size,shape),list(type,size,shape))`. There is no limit on the number of datasets included. Available settings: type: `"age"` (which ranges from about 0 to 100, `gran = 1`) and `"wage"` (which ranges from 0 to 500k); size: any positive integer, typed in exact numerical form (e.g., for ten thousand use 10000, not 10k and not 10^4); shape: gaussian, sparse, uniform, bimodal. It is assumed that the data input is rounded to the granularity.
`range`	The range of the domain as a vector `c(min, max)`. Defined based on user intuition. To preserve differential privacy, the domain is constructed using this range. Setting the min too high will bias output upward; likewise, setting the max too low will bias output downward. However, setting min too low and max too high could reveal the true limits of your data, compromising some privacy.
`gran`	FOR Visualization MODE ONLY; ignored in Data Collection mode (use `granlist` to set granularities, and thus domain sizes, there). The granularity of the domain between the min and max, i.e., if age is measured per 1 year of age, `gran = 1`. The same granularity is applied to all datasets, so using comparable (or scaled) data is necessary.
`granlist`	FOR Data Collection MODE ONLY; ignored in Visualization mode (use `gran` to set the granularity there). A list of granularities of the domain between the min and max, i.e., if age is measured per 1 year of age, `gran = 1`.
`samplesize`	FOR Visualization MODE ONLY; ignored in Data Collection mode (use `nlist` to select sample sizes there). When set to zero, the entire dataset is used. Otherwise, the specified sample size is randomly selected from each dataset without replacement.
`nlist`	FOR Data Collection MODE ONLY; ignored in Visualization mode (use `samplesize` to select the sample size there). Sets the absolute sample sizes to draw from each dataset, with replacement. Any vector of integer values is appropriate.
`cdfstep`	The step size used in outputting the approximate CDF.
`reps`	The number of times to repeat each diagnostic. Higher `reps` lends greater accuracy, but consumes time and computing power. Author recommends `reps = 10` for quick examples and `reps = 100` for more robust examinations.
`ExtraTests_CDF`	If a user wishes to add extra diagnostics, the proper format is `ExtraTests_CDF = list(functionName1=function1, functionName2=function2)`. Diagnostic functions should have inputs such as `Y` for a public CDF, `est` for a DP representation of that CDF, `range`, and `gran`, and the output should be a single value.
`ExtraTests_PDF`	Same format as `ExtraTests_CDF`; see above.
`setseed`	In the function, each combination of data, epsilon, and function is executed with a separate seed, which by default is randomly generated and reported. Users interested in replicating specific results can locate the reported seed and parameter combination to replicate tests.
`comments`	Comments written here print to a log in Excel.
`SmoothAll`	Applies L2 monotonicity post-processing to every DP-CDF
`EmpiricBounds`	FOR Visualization MODE ONLY. When TRUE, outputted graphs depict the minimum and maximum values taken by each bin across reps
`AnalyticBounds`	FOR Visualization MODE ONLY. This is a flag and should be set to `TRUE` if the functions being tested are expected to output analytical variance bounds. The proper output form for such a function is `output = list(DPCDFvector, LowerBoundVector, UpperBoundVector)`.
`AnalyticProbSleeve`	FOR Visualization MODE ONLY. When `TRUE`, outputted DP-CDFs will have a 'fuzzy' analytic sleeve around them, approximating the probability density for each point given by DP. This also requires the function format specified above in the description for `AnalyticBounds`.
`SuppressRealCDF`	FOR Visualization MODE ONLY. When `TRUE`, outputted graphs will not include real (non-private) CDFs.
`SuppressDPCDF`	FOR Visualization MODE ONLY. When `TRUE`, outputted graphs will not include DP-CDFs (but if `SmoothAll = TRUE`, monotonized DP CDFs still appear).
`SuppressLegends`	FOR Visualization MODE ONLY. When `TRUE`, outputted graphs will not include legends
`...`	Optionally add additional parameters. This is primarily used to allow automated execution of varied diagnostic functions.
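
To make the `ExtraTests_CDF` format concrete, here is a minimal sketch of a custom diagnostic. The name `MaxAbsErr` and the choice of metric (the largest pointwise gap between the two CDFs) are illustrative assumptions; only the argument names `Y`, `est`, `range`, and `gran` come from the argument description above.

```r
# Hypothetical diagnostic: largest absolute gap between the real CDF (Y)
# and its DP estimate (est). range and gran are accepted but unused here.
MaxAbsErr <- function(Y, est, range, gran) {
  stopifnot(length(Y) == length(est))
  max(abs(Y - est))  # a single numeric value
}

# Toy check on hand-made vectors (no DP algorithm involved).
Y   <- c(0.10, 0.40, 0.70, 1.00)
est <- c(0.12, 0.35, 0.74, 1.00)
MaxAbsErr(Y, est, range = c(0, 3), gran = 1)

# Named list in the form described for ExtraTests_CDF.
extra <- list(MaxAbsErr = MaxAbsErr)
```

A list such as `extra` would then be supplied as `ExtraTests_CDF = extra`; the list names presumably serve as labels for the diagnostics in the output.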

If Visualization = TRUE, a list containing:

...$means Contains mean diagnostic results for each diagnostic across reps iterations for each parameter combination;

...$medians Contains median diagnostic results for each diagnostic across reps iterations for each parameter combination;

...$yourCDFoutput Containing a single dpCDF iteration for each parameter combination;

...$yourPDFoutput Containing a single dpPDF iteration for each parameter combination;

...$realCDFoutput Containing the real (non-DP) CDF output for each relevant parameter combination;

...$realPDFoutput Containing the real (non-DP) PDF output for each relevant parameter combination;

...$databins Containing the domain used to construct the CDFs;

...$TestPack_CDF Containing the definitions of diagnostic functions used on dpCDFs;

...$TestPack_PDF Containing the definitions of diagnostic functions used on dpPDFs;

...$allscores Containing all raw diagnostic output.

...$seed Containing the list of seeds used in the test

...$permetric holding a rearranged dataframe (ordered by parameter combinations) useful for plotting.

A .pdf file: with boxplots showing the distributions of diagnostic outputs, and categorized plots of dp-CDF function output. Each such graph will show one arbitrary CDF iteration and empirical boundaries. The empirical boundaries are the maximum and minimum values reached by that function (with those parameters) during the test.
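
The empirical boundaries can be pictured as a per-bin minimum and maximum over repeated runs. The sketch below uses simulated stand-in data rather than output from CDFtest; the matrix layout (one row per repetition, one column per bin) and all variable names are assumptions.

```r
set.seed(1)
reps  <- 5
nbins <- 10
trueCDF <- seq_len(nbins) / nbins

# Stand-in for reps DP-CDF outputs: the true CDF plus small random noise.
runs <- t(replicate(reps, trueCDF + rnorm(nbins, sd = 0.02)))

# Empirical bounds: min and max reached in each bin across repetitions.
lower <- apply(runs, 2, min)
upper <- apply(runs, 2, max)

# Any single iteration lies inside the band by construction.
all(runs[1, ] >= lower & runs[1, ] <= upper)
```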

A .csv file: containing the mean and median scores of each diagnostic on each combination of data, eps, function, and the seedlist for reproduction.

Notes on Visualization mode: Both the .pdf and .csv components are named with a time stamp index, in the form of YearMonthDayHourMinuteSecond. To locate particular tests, look at CDFtestindexchart.csv, which automatically records the parameters and index of each test. These files can be found in the folder specified by OutputDirectory, which defaults to the R temporary directory, tempdir().

Alternatively in Data Collection mode (Visualization = FALSE), a list containing:

...$allscores holding the output of each combination of parameters. Each eps in epslist is run with granularity and sample size fixed at the first values specified in granlist and nlist; granularity and sample size are varied in the same way. Thus only one variable is varied at a time while the other two are held fixed. All such combinations of parameters are executed on all combinations of data and function (specified in datalist and functlist);
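
The one-at-a-time scheme described above can be sketched as a small parameter grid. This is an illustration only, not the package's internal code: the column names are invented, and it assumes that eps is held at the first value of epslist while granularity or sample size is varied.

```r
epslist  <- c(0.05, 0.1, 1)
granlist <- c(2500, 1250, 1000)
nlist    <- c(100, 1000, 10000)

# Vary one parameter at a time; hold the other two at their first listed value.
grid <- rbind(
  data.frame(varied = "eps",  eps = epslist,    gran = granlist[1], n = nlist[1]),
  data.frame(varied = "gran", eps = epslist[1], gran = granlist,    n = nlist[1]),
  data.frame(varied = "n",    eps = epslist[1], gran = granlist[1], n = nlist)
)
print(grid)
nrow(grid)  # length(epslist) + length(granlist) + length(nlist)
```

Under this reading the number of parameter settings grows with the sum of the list lengths rather than their product, which keeps Data Collection runs manageable.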

...$seed holding the list of seeds used in the test.

A .csv file containing the entire (raw) results (across reps iterations) of diagnostic functions on DP-CDF algorithms for each combination of dataset and function, looped over epsilon, granularity, and sample size values as described directly above. This mode was designed for collecting metric data for subsequent supervised learning modelling.

CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist = c(functionH),
  Fnameslist = c("H"), epslist = c(.1, .01), datalist = list(),
  Dnameslist = c(),
  synthsets = list(list("wage", 100000, "uniform"),
                   list("wage", 100000, "sparse"),
                   list("wage", 100000, "bimodal")),
  range = c(1, 500000), gran = 1000, granlist = c(2500, 1250, 1000, 500),
  samplesize = 0, nlist = c(100, 1000, 10000, 100000, 1000000),
  cdfstep = 0, reps = 5, ExtraTests_CDF = list(), ExtraTests_PDF = list(),
  setseed = c(-100), comments = "x", SmoothAll = FALSE, EmpiricBounds = FALSE,
  AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE,
  SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE)

CDF.PSIdekick documentation built on May 30, 2017, 5:09 a.m.
