dataSifter: Data Sifter Algorithm

View source: R/dataSifter.R

Description

Create an informative, privacy-preserving dataset that protects subjects' privacy while preserving the information contained in the original dataset.

Usage

dataSifter(level = "indep", data, subjID = NULL, col_preserve = 0.5,
  col_pct = 0.7, nomissing = FALSE, maxiter = 1)

Arguments

level

One of "none", "small", "medium", "large", or "indep". The user-defined level of obfuscation for the data sifter algorithm. Later values in this list represent higher levels of obfuscation. The default is "indep", the highest level, which produces independent variables that follow the empirical distributions of the original data.

data

Original data to be processed.

subjID

Character vector naming the subject ID variables. These variables will be removed for privacy protection.

col_preserve

The maximum fraction of columns that can be deleted due to massive missingness.

col_pct

Criterion for column deletion due to massive missingness. If the fraction of missing values in a column is larger than this threshold, the column is deleted.

nomissing

Indicator of missingness in the original dataset. If nomissing=TRUE, the original data contain no missing values.
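
To make the missingness-related arguments concrete, the following base R sketch computes the per-column fraction of missing values, flags the columns that exceed a col_pct threshold, and derives a value that could be passed as nomissing. It illustrates the criteria described above and is not the package's internal code; the toy data frame and the names toy, miss_frac, and flagged are hypothetical.

```r
# Toy data frame; column names and values are made up for illustration.
toy <- data.frame(
  age    = c(34, NA, 51, 47, 29),
  weight = c(NA, NA, NA, NA, 70),
  sex    = c("F", "M", "M", NA, "F")
)

col_pct <- 0.7  # same value as the default shown in Usage

# Fraction of missing values in each column.
miss_frac <- colMeans(is.na(toy))

# Columns whose missing fraction is larger than the threshold.
flagged <- names(miss_frac)[miss_frac > col_pct]

# TRUE only when the data contain no missing values at all.
nomissing <- !anyNA(toy)

print(miss_frac)
print(flagged)
print(nomissing)
```

Here only weight is flagged, because four of its five values are missing. Checking the data this way before calling dataSifter() can help in choosing col_pct and nomissing.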

Details

When level="indep", each variable in the sifted dataset is generated independently from its empirical distribution in the original data. In contrast, level="none" returns the original dataset. When a factor contains a level with the empty value " ", the function is likely to raise an "out of bounds" error.
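
The idea behind level="indep" can be sketched in base R by resampling each column separately, with replacement, from its observed values. This is a minimal illustration of sampling from an empirical distribution under that assumption, not the package's implementation; the data frame orig and the helper names are hypothetical. The sketch also includes a simple pre-check for factor levels that are empty or blank, since the note above warns that such levels are likely to cause an error.

```r
set.seed(1)

# Toy data; names and values are made up for illustration.
orig <- data.frame(
  x = c(1.2, 3.4, 2.2, 5.1, 4.0),
  g = factor(c("a", "b", "a", "b", "b"))
)

# Pre-check: does any factor column have an empty or blank level?
empty_lvl <- vapply(
  orig,
  function(col) is.factor(col) && any(trimws(levels(col)) == ""),
  logical(1)
)

# Resample every column on its own, with replacement. Each marginal
# distribution is preserved in expectation, while the row-wise link
# between columns is broken.
indep <- as.data.frame(
  lapply(orig, function(col) sample(col, length(col), replace = TRUE))
)

print(empty_lvl)
print(indep)
```

Because the columns are drawn separately, any association between x and g in orig is not carried over to indep, which is what makes this the strongest form of obfuscation among the listed levels.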

The process can take a while to run on large datasets. Two messages indicate its progress: "Artifical missingness and imputation done" and "Obfuscation step done".

Value

Returns the sifted dataset.


SOCR/DataSifter documentation built on Dec. 11, 2021, 2:55 p.m.