```r
suppressMessages(library(PrivacyUnbiased))
suppressMessages(library(ggplot2))
```
This package implements methods developed in:
\begin{itemize}
\item Evans, Georgina, and Gary King (2020): \textit{"Statistically Valid Inferences from Differentially Private Data Releases"}. URL: \url{https://gking.harvard.edu/dpd}.
\end{itemize}
In a major development for research data sharing, data providers are beginning to supplement insecure privacy protection strategies, such as "de-identification", with a formal approach called "differential privacy". One version of differential privacy adds specially calibrated random noise to a dataset, which is then released to researchers. This offers mathematical guarantees for the privacy of research subjects while still making it possible to learn about aggregate patterns of interest. Unfortunately, adding random noise creates measurement error, which, if ignored, induces statistical bias, including (depending on the situation) attenuation, exaggeration, switched signs, and incorrect uncertainty estimates. The procedures implemented in `PrivacyUnbiased` account for these biases, producing statistically consistent point estimates from differentially private data.
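To make the attenuation problem concrete, here is a minimal self-contained sketch (not part of the package) showing how classical measurement error on a covariate shrinks an OLS slope toward zero:

```r
# Classical errors-in-variables: the true slope is 2, but noise with SD 1 on a
# unit-variance covariate attenuates the OLS estimate by a factor of 1/(1 + 1)
set.seed(42)
z <- rnorm(1e4)                 # private covariate, variance 1
y <- 2 * z + rnorm(1e4)         # outcome generated from the private covariate
x <- z + rnorm(1e4, sd = 1)     # noisy, DP-style release of z
coef(lm(y ~ x))["x"]            # roughly 1, i.e., half the true slope
```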
`PrivacyUnbiased`, which corrects statistical problems with privacy-protective procedures added to data, is designed to complement `UnbiasedPrivacy`, which corrects statistical problems with privacy-protective procedures added to the results of statistical analyses [@dp1].
To install `PrivacyUnbiased`, run:

```r
devtools::install_github("georgieevans/PrivacyUnbiased")
library(PrivacyUnbiased)
```
We demonstrate the capabilities of `PrivacyUnbiased` by simulating the scenario described above. We start with a hypothetical private data set (`private_data`). We then add random error to every cell of the data by drawing errors, $\epsilon_{ik}$, from a mean $0$ normal distribution, $\epsilon_{ik} \sim \mathcal{N}(0, \sigma_k^2)$, where $\sigma_k$ is set separately for each of the $k$ columns of the data. This produces a differentially private data set (`dp_data`). In practice, the data analyst would not have access to `private_data` and would only be able to observe `dp_data`.
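For intuition, here is a minimal sketch of the kind of noise-addition mechanism just described. The noise standard deviations below are hypothetical placeholders; the `dp_data` shipped with the package was generated by the package authors, so this sketch is illustrative only.

```r
# Illustrative only: construct a DP-style release from the private data frame
# (the sigma values are hypothetical placeholders, not those used for dp_data)
set.seed(1)
sigma <- c(Y = 0, X1 = 0.7, X2 = 0.5, X3 = 0.5)
n <- nrow(private_data)
noisy <- mapply(function(z, s) z + rnorm(n, mean = 0, sd = s),
                private_data, sigma)
dp_sim <- as.data.frame(rbind(sigma, noisy))  # row 1 records the noise SDs
names(dp_sim) <- names(sigma)
```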
This example data can be loaded into the R environment (after loading the package) by running the following code:
```r
# Load the private data
data("private_data")
# Load the DP data
data("dp_data")
```
`lmdp()` is the primary function of the package. It returns bias-corrected coefficient estimates from differentially private data, alongside several other quantities. Users interact with it much as they would with `lm()`. There are only two required inputs, the `formula` and the `data`. For instance:
```r
lmdp_test <- lmdp(Y ~ X1 + X2 + X3, data = dp_data)
```
You can read the documentation for `lmdp()` by running:

```r
?lmdp
```
An important distinction between `lmdp()` and `lm()` is that the first row of `data` must contain the standard deviations of the DP error added to the rest of the data matrix. For instance, looking at row $1$ of `dp_data`, we see that no noise was added to $Y$, the standard deviation of the noise added to $X1$ was $0.7$, and so on.
```r
head(dp_data)
```
An exception to this rule is when the argument `noise` is set to something other than its default (`NULL`). If `noise = x` (where `x` is any real number), then `lmdp()` will automatically set the error standard deviation for every column to `x`. In this case, the first row of the data matrix is ignored.
The output from `lmdp()` can be summarized using `summary()`, just like a standard `lm` object.
```r
summary(lmdp_test)
```
The additional output from `lmdp()` is stored in a list that can be accessed as follows:

```r
# Summarize the structure of an lmdp object
str(lmdp_test)
# For instance, we can access the variance-covariance matrix as follows
lmdp_test$beta_tilde_vcov
```
It is informative to compare `lmdp()` estimates to estimates produced by `lm()`, which does not adjust for the random error in `dp_data`:
```r
lm_test <- lm(Y ~ X1 + X2 + X3, data = dp_data)
# Biased OLS estimates
round(summary(lm_test)$coef, 4)
```
```r
# Notice that if we set noise = 0, lmdp() gives the same point estimates as lm()
# Standard errors differ since we use a different estimation procedure
lmdp_test_0 <- lmdp(Y ~ X1 + X2 + X3, data = dp_data, noise = 0)
summary(lmdp_test_0)
```
We can compare the `lmdp()` estimates and the `lm()` estimates to the unbiased estimates obtained from the private data.
```r
lm_true <- lm(Y ~ Z1 + Z2 + Z3, data = private_data)
# We see that the lmdp estimates are very close to the lm estimates on the private data
# In contrast, the lm estimates on dp_data appear biased
round(summary(lm_true)$coef, 4)
```
The default setting of `lmdp()` is to estimate the standard errors using the simulation method developed in @dp2. We also offer the option to bootstrap the standard errors by setting the argument `bootstrap_var` to `TRUE`. In general, the two methods will produce very similar estimates. The advantage of the simulation method is computational: for large datasets, the bootstrap is essentially infeasible without access to large amounts of computing power, whereas the computation time of our simulation procedure scales only slowly with dataset size.
```r
# Timing simulation variance estimation
system.time(simulation <- lmdp(Y ~ X1 + X2 + X3, data = dp_data))
# Timing bootstrap variance estimation
system.time(bootstrap <- lmdp(Y ~ X1 + X2 + X3, data = dp_data, bootstrap_var = TRUE))
# The bootstrap takes ~30 times longer than simulation for a dataset of size N = 100000
# The standard error estimates are similar between the two methods:
# Bootstrap Std. Error
summary(bootstrap)[, "Std. Error"]
# Simulation Std. Error
summary(simulation)[, "Std. Error"]
```
On small datasets with a relatively large amount of DP error, the variance-covariance matrix we estimate as a parameter for drawing random variables may not be positive definite. If this happens, we use the function `nearPD()` from the package `Matrix`, which finds a close positive definite matrix [@matrix]. `lmdp()` will produce the warning message `VC matrix not positive definite` to alert users to this. The function also returns an indicator recording whether the matrix was positive definite, which can be accessed as follows:
```r
lmdp_test$vc_pos_def
```
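If the flag indicates that the matrix had to be repaired, one reasonable cross-check (our suggestion, not a documented package recommendation) is to compare against the bootstrap standard errors described above:

```r
# Assuming vc_pos_def is a logical flag (TRUE when the matrix was already
# positive definite), cross-check against bootstrap SEs when it was repaired
if (!isTRUE(lmdp_test$vc_pos_def)) {
  lmdp_check <- lmdp(Y ~ X1 + X2 + X3, data = dp_data, bootstrap_var = TRUE)
  summary(lmdp_check)
}
```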
As discussed in @dp2, transforming variables with random error poses additional complications for estimation. `PrivacyUnbiased` can currently accommodate two types of variable transformation: interaction variables and squared variables. For example:
```r
# Interaction variable
lmdp_interaction <- lmdp(Y ~ X1 + X2 + X3 + X1*X2, data = dp_data)
summary(lmdp_interaction)
# lmdp with interactions produces estimates similar to lm on the private data
# Standard errors are lower since the Z's do not contain random noise
lm_interaction <- lm(Y ~ Z1 + Z2 + Z3 + Z1*Z2, data = private_data)
round(summary(lm_interaction)$coefficients, 4)
```
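A squared variable can be included analogously. The `I()` notation below follows standard `lm()` formula conventions and is our assumption about the expected syntax; consult `?lmdp` for the exact specification the package supports.

```r
# Squared variable (I() syntax assumed from lm() conventions; see ?lmdp)
lmdp_squared <- lmdp(Y ~ X1 + I(X1^2) + X2, data = dp_data)
summary(lmdp_squared)
```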
Other variable transformations, or multiple variable transformations, are not allowed in this version of the package; their inclusion will trigger an error message. Note also that `lmdp()` currently supports only bootstrap variance estimation when the model includes transformed variables. For future releases, we are working on expanding the set of admissible variable transformations and extending the simulation approach to variance estimation to these cases.
`PrivacyUnbiased` also includes a function to calculate estimates of descriptive statistics of the \textit{private data}, using the moment estimation methods described in @dp2. The primary function for descriptive statistics is `descriptiveDP()`, which takes two arguments: the variable name and the data frame.
```r
# Estimate descriptive statistics
descriptiveDP(X3, dp_data)
# We can compare these estimates to the true values from private_data:
true_descriptive <- round(c(mean(private_data$Z3), sd(private_data$Z3),
                            moments::skewness(private_data$Z3),
                            moments::kurtosis(private_data$Z3)), 4)
names(true_descriptive) <- c('Mean', 'Std.Dev', 'Skewness', 'Kurtosis')
true_descriptive
```
We will look at simulated heteroskedastic data with zero-inflated covariates to demonstrate the methods described in @dp2 for data of this nature. The simulated data can be loaded as follows:
```r
# Load the private data
data("private_data2")
# Load the differentially private data
data("dp_data2")
```
The function `distributionDP()` estimates private histograms from the differentially private data. It takes three required arguments: `variable` should be the name of the column in the data, `data` is the DP data frame, and `distributions` is a vector of distributions to parameterize. The remaining arguments have defaults but can be altered by the user.
```r
dist_test <- distributionDP(variable = X1, data = dp_data2,
                            distributions = c('Normal', 'Poisson', 'ZIP', 'NB', 'ZINB'),
                            moments_fit = 6, plot = TRUE, plot_dp = TRUE)
# Check that moments were estimated with sufficient precision
dist_test$Normal$moment_precision
# It looks like the ZINB fits the data well
# Here we can check this is right by comparing the parameterized distribution to the private data
# Normally we would not have access to the private data and would rely on the moment fit
dist_test$ZINB$plot
```
```r
ggplot2::ggplot(private_data2) +
  ggplot2::geom_histogram(ggplot2::aes(x = Z1), alpha = .7, bins = 50, fill = 'navyblue') +
  ggplot2::theme_bw() +
  ggplot2::theme(legend.position = c(0.9, 0.9),
                 legend.justification = c(0.9, 0.9),
                 panel.grid.major = ggplot2::element_blank(),
                 panel.grid.minor = ggplot2::element_blank()) +
  ggplot2::labs(x = '', title = 'Private data') +
  ggplot2::scale_fill_manual(values = c('navyblue')) +
  ggplot2::xlim(c(-5, 20))
```
The function `diagnosticsDP()` runs regression diagnostics of the kind that would be run after fitting OLS on the private data. It takes a single argument, which must be an `lmdp()` object.
```r
# First we run a bias-corrected regression
lmdp_obj <- lmdp(Y ~ X1, data = dp_data2)
# Now we can run regression diagnostics
diagnostics <- diagnosticsDP(lmdp_obj)
# We see that there is no strong evidence of non-normality
# We also detect heteroskedasticity, since the X1 coefficient deviates significantly from 0
# To validate, we can check these results against the private data
# Calculate the true errors
true_coef <- c(2, 8)
true_errors <- private_data2$Y - cbind(1, private_data2$Z1) %*% true_coef
# Validate that they are ~ normally distributed
# Skewness of a normal is 0
moments::skewness(true_errors)
# Kurtosis of a normal is 3
moments::kurtosis(true_errors)
# Test for heteroskedasticity (similar coefficients to above confirm heteroskedasticity)
summary(lm(true_errors^2 ~ private_data2$Z1))$coefficients
```