randomize: Create Column-Wise Randomized Test Data for Non-Statistical...

View source: R/randomize.R

randomize    R Documentation

Create Column-Wise Randomized Test Data for Non-Statistical Validation

Description

randomize() draws n samples from the values in each column of a data frame, independently for each column, and returns the randomized data. This destroys all statistical information in the dataset, both univariate and multivariate. However, because every output value is drawn from the input values, the output never contains a value that is absent from the input. If n is greater than or equal to the number of observations, the minimum and maximum of numeric columns are preserved, as are the unique values of all columns.

Usage

randomize(.data, n = NULL, .groups = NULL)

Arguments

.data

A data frame or data frame extension (e.g. a tibble)

n

The desired number of observations in the returned dataset; the default is the number of observations in the input

.groups

How to handle grouping variables; see the documentation of the .groups parameter of dplyr::summarize() for more information

Details

randomize() can perform up- and down-sampling of the input data. Downsampling is simple random sampling without replacement. Upsampling samples without replacement up to the size of the input data, then samples with replacement for the remaining observations. This approach ensures that all values from the input data appear at least once if n is greater than or equal to the number of observations in the data.
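
The sampling scheme described above can be sketched in base R as follows. This is a minimal illustration only, not the package's implementation: sample_column() and sketch_randomize() are hypothetical names introduced here, and the sketch assumes each column is sampled independently by row position.

```r
# Hypothetical helper, not part of coviData.
# Draw n values from x: without replacement up to length(x),
# then with replacement for any remaining draws.
sample_column <- function(x, n = length(x)) {
  size <- length(x)
  # Sample positions rather than values to avoid sample()'s
  # special case for a length-1 numeric vector
  first <- sample.int(size, min(n, size))
  extra <- sample.int(size, max(n - size, 0L), replace = TRUE)
  x[c(first, extra)]
}

# Apply the helper independently to every column (hypothetical name)
sketch_randomize <- function(data, n = nrow(data)) {
  as.data.frame(lapply(data, sample_column, n = n), stringsAsFactors = FALSE)
}

set.seed(1)
input <- data.frame(id = 1:5, grp = c("a", "a", "b", "b", "c"))
up <- sketch_randomize(input, n = 8)    # upsampling: every input value appears
down <- sketch_randomize(input, n = 3)  # downsampling: no row position drawn twice
```

Because the first length(x) draws form a permutation of the column, every input value is guaranteed to appear whenever n is at least the number of observations.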

A stratified version that restricts randomization to occur within strata can be obtained by grouping the data using group_by() prior to calling randomize(). In this case, the relative proportions of the groups within the dataset remain the same; this allows the user to retain portions of the data's structure while destroying the remaining information.

Note that randomization only provides anonymity when the number of unique values for quasi-identifiers (within each group) is large and unique identifiers are handled separately. Also note that when groups are defined, the grouping variables themselves are not randomized: both their values and the relationships among them are left unchanged.
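
The stratified variant described above can be illustrated with a base R sketch. This is an assumption-laden illustration rather than the package's code: sketch_stratified() is a hypothetical name, it handles a single grouping column, and it keeps the number of observations in each stratum equal to the input.

```r
# Hypothetical helper, not part of coviData:
# shuffle every non-grouping column independently within each stratum.
sketch_stratified <- function(data, group_col) {
  strata <- split(data, data[[group_col]])
  shuffled <- lapply(strata, function(stratum) {
    others <- setdiff(names(stratum), group_col)
    for (col in others) {
      # Permute by position to avoid sample()'s length-1 special case
      stratum[[col]] <- stratum[[col]][sample.int(nrow(stratum))]
    }
    stratum
  })
  out <- do.call(rbind, shuffled)
  rownames(out) <- NULL
  out
}

set.seed(1)
input <- data.frame(
  site = c("a", "a", "a", "b", "b"),
  age  = c(10, 20, 30, 70, 80),
  sex  = c("f", "m", "f", "m", "m")
)
result <- sketch_stratified(input, "site")
table(result$site)  # group sizes match the input: 3 rows for "a", 2 for "b"
```

Values never cross strata in this sketch, so the ages observed at site "a" stay at site "a". This is the sense in which grouping retains part of the data's structure while the relationships among the remaining columns are destroyed.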

Value

A tibble containing the randomized test data


jesse-smith/coviData documentation built on Jan. 14, 2023, 11:08 a.m.