Installation and loading the package can be done using following codes.
require(devtools) install_github("SOCR/DataSifter") library(DataSifter.lite)
Here we show a small example to share synthetic data with DataSifter. We generate 5 predictors that follows uniform distribution and a corresponding outcome variable.
set.seed(1234) x1<-runif(1000) x2<-runif(1000) x3<-runif(1000) x4<-runif(1000) x5<-runif(1000) data1<-data.frame(x_1=x1,x_2=x2,x_3=x3,x_4=x4,x_5=x5) data1$y=1+x1+x2-0.5*x3-2*x4+0.5*x5
Then, proceed to generate synthetic datasets under different levels of obfuscations. Note that under the "indep" level, DataSifter creates each variable independently from their empirical distribution in the original data.
set.seed(1234) siftedata_s<-DataSifter.lite::dataSifter(level = "small",data=data1,nomissing = TRUE) siftedata_m<-DataSifter.lite::dataSifter(level = "medium",data=data1,nomissing = TRUE) siftedata_l<-DataSifter.lite::dataSifter(level = "large",data=data1,nomissing = TRUE) siftedata_i<-DataSifter.lite::dataSifter(level = "indep",data=data1,nomissing = TRUE)
We utilize the pctMatch()
to examine privacy protection ability. pctMatch()
compares each record in the original and "sifted" data and outputs a list of Percent of Identical Feature Values (PIFV) for all records.
PIFV_s <- DataSifter.lite::pctMatch(data1,siftedata_s) PIFV_m <- DataSifter.lite::pctMatch(data1,siftedata_m) PIFV_l <- DataSifter.lite::pctMatch(data1,siftedata_l) PIFV_i <- DataSifter.lite::pctMatch(data1,siftedata_i)
Let's visualize the results.
library(ggplot2) PIFV <- data.frame(Levels=rep(c("small","medium","large","indep"),each=1000), PIFVs=c(PIFV_s,PIFV_m,PIFV_l,PIFV_i)) PIFV$Levels <- factor(PIFV$Levels,levels=c("small","medium","large","indep")) ggplot(data = PIFV,aes(x=Levels,y=PIFVs,fill=Levels))+ geom_boxplot()+ scale_fill_brewer(palette="RdBu")+ ggtitle("PIFVs under different levels of obfuscations")
As shown in the box plots, there is an increasing effect of privacy protection for higher levels of obfuscations.
Let's fit linear models to investigate the preservation of the original joint distribution.
original<-lm(y~x1+x2+x3+x4+x5,data = data1) sum_o <- summary(original)$coefficients small <- lm(y~x1+x2+x3+x4+x5,data = siftedata_s) sum_s <- summary(small)$coefficients medium <- lm(y~x1+x2+x3+x4+x5,data = siftedata_m) sum_m <- summary(medium)$coefficients large <- lm(y~x1+x2+x3+x4+x5,data = siftedata_l) sum_l <- summary(large)$coefficients indep <- lm(y~x1+x2+x3+x4+x5,data = siftedata_i) sum_i <- summary(indep)$coefficients summary_models <- cbind.data.frame(sum_o[,1],sum_s[,1],sum_m[,1],sum_l[,1],sum_i[,1]) colnames(summary_models) <- c("Original","Small","Medium","Large","Indep") library(knitr) kable(summary_models)
The original data utility measured by the linear model is diminishing when level of obfuscation is higher. Overall, under the "medium" level of obfuscation, we can achieve a good balance between patient privacy and data utility.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.