#' Targeted Record Swapping
#'
#' Applies targeted record swapping on micro data considering the identification
#' risk of each record as well the geographic topology.
#'
#' @details The procedure accepts a `data.frame` or `data.table`
#' containing all necessary information for the record swapping, e.g
#' parameter `hid`, `similar`, `hierarchy`, etc ...
#' First, the micro data in `data` is ordered by `hid` and the identification
#' risk is calculated for each record in each hierarchy level. As of right
#' now only counts is used as identification risk and the inverse of counts
#' is used as sampling probability.
#' NOTE: It will be possible to supply an identification risk for each record
#' and hierarchy level which will be passed down to the C++-function. This
#' is however not fully implemented.
#'
#' With the parameter `k_anonymity` a k-anonymity rule is applied to define
#' risky households in each hierarchy level. A household is set to risky
#' if counts < k_anonymity in any hierarchy level and the household needs
#' to be swapped across this hierarchy level.
#' For instance, having a geographic hierarchy of NUTS1 > NUTS2 > NUTS3 the
#' counts are calculated for each geographic variable and defined
#' `risk_variables`. If the counts for a record falls below `k_anonymity`
#' for hierarchy county (NUTS1, NUTS2, ...) then this record needs to be swapped
#' across counties.
#' Setting `k_anonymity = 0` disables this feature and no risky households
#' are defined.
#'
#' After that the targeted record swapping is applied, starting from the highest
#' to the lowest hierarchy level and cycling through all possible geographic
#' areas at each hierarchy level, e.g every county, every municipality in
#' every county, etc, ...
#'
#' At each geographic area, a set of values is created for records to be
#' swapped. In all but the lowest hierarchy level, this is ONLY made out
#' of all records which do not fulfil the k-anonymity and have not already
#' been swapped. Those records are swapped with records not belonging to
#' the same geographic area, which have not already been swapped beforehand.
#' Swapping refers to the interchange of geographic variables defined in
#' `hierarchy`. When a record is swapped all other records containing the
#' same `hid` are swapped as well.
#'
#' At the lowest hierarchy level in every geographic area, the set of records to
#' be swapped is made up of all records which do not fulfil the k-anonymity
#' as well as the remaining number of records such that the proportion of
#' swapped records of the geographic area is in coherence with the `swaprate`.
#' If due to the k-anonymity condition, more records have already been swapped
#' in this geographic area then only the records which do not fulfil the
#' k-anonymity are swapped.
#'
#' Using the parameter `similar` one can define similarity profiles.
#' `similar` needs to be a list of vectors with each list entry containing
#' column indices of `data`. These entries are used when searching for donor
#' households, meaning that for a specific record the set of all donor
#' records is made out of records which have the same values in
#' `similar[[1]]`. It is however important to note, that these variables
#' can only be variables related to households (not persons!). If no suitable
#' donor can be found the next similarity profile is used, `similar[[2]]` and
#' the set of all donors is then made up out of all records which have the
#' same values in the column indices in `similar[[2]]`. This procedure
#' continues until a donor record was found or all the similarity profiles
#' have been used.
#'
#' `swaprate` sets the swaprate of households to be swapped, where a single
#' swap counts for swapping 2 households, the sampled household and the
#' corresponding donor. Prior to the procedure, the swaprate is applied on
#' the lowest hierarchy level, to determine the target number of swapped
#' households in each of the lowest hierarchies. If the target numbers of a
#' decimal point they will randomly be rounded up or down such that the
#' number of households swapped in total is in coherence to the swaprate.
#'
#' @param data must be either a micro data set in the form of a
#' `data.table` or `data.frame`, or an `sdcObject`, see
#' \link[sdcMicro]{createSdcObj}.
#' @param hid column index or column name in `data` which refers
#' to the household identifier.
#' @param hierarchy column indices or column names of variables in
#' `data` which refer to the geographic hierarchy in the micro data
#' set. For instance county > municipality > district.
#' @param similar vector or list of integer vectors or column names
#' containing similarity profiles, see details for more explanations.
#' @param swaprate double between 0 and 1 defining the proportion of
#' households which should be swapped, see details for more explanations
#' @param risk either column indices or column names in `data` or
#' `data.table`, `data.frame` or `matrix` indicating risk of each record
#' at each hierarchy level. If `risk`-matrix is supplied to swapping procedure
#' will not use the k-anonymity rule but the values found in this matrix
#' for swapping.
#' When using the risk parameter is expected to have assigned a maximum value
#' in a household for each member of the household. If this condition is not
#' satisfied, the risk parameter is automatically adjusted to comply with this
#' condition.
#' If risk parameter is provided then k-anonymity rule is suppressed.
#' @param risk_threshold single numeric value indicating when a household is
#' considered "high risk", e.g. when this household must be swapped. Is only
#' used when `risk` is not `NULL`.
#' Risk threshold indicates households that have to be swapped, but be aware
#' that households with risk lower than threshold, but with still high enough
#' risk may be swapped as well. Only households with risk set to 0 are not swapped.
#' Risk and risk threshold must be equal or bigger then 0.
#' @param k_anonymity integer defining the threshold of high risk households
#' (counts<k) for using k-anonymity rule
#' @param risk_variables column indices or column names of variables in `data`
#' which will be considered for estimating the risk. Only used when k-anonymity
#' rule is applied.
#' @param carry_along integer vector indicating additional variables to swap
#' besides to hierarchy variables. These variables do not interfere with the
#' procedure of finding a record to swap with or calculating risk. This
#' parameter is only used at the end of the procedure when swapping the
#' hierarchies. We note that the variables to be used as `carry_along` should
#' be at household level. In case it is detected that they are at individual
#' level (different values within `hid`), a warning is given.
#' @param return_swapped_id, boolean if `TRUE` the output includes an
#' additional column showing the `hid` with which a record was swapped with.
#' The new column will have the name `paste0(hid,"_swapped")`.
#' @param log_file_name character, path for writing a log file. The log
#' file contains a list of household IDs (`hid`) which could not have been
#' swapped and is only created if any such households exist.
#' @param seed integer defining the seed for the random number generator, for
#' reproducibility. if `NULL` a random seed will be set using `sample(1e5,1)`.
#' @param ... parameters passed to `recordSwap.default()`
#' @author Johannes Gussenbauer
#'
#' @return `data.table` with swapped records.
#'
#' @examples
#' # generate 10000 dummy households
#' library(data.table)
#' seed <- 2021
#' set.seed(seed)
#' nhid <- 10000
#' \donttest{
#' dat <- sdcMicro::createDat(nhid)
#'
#' # define paramters for swapping
#' k_anonymity <- 1
#' swaprate <- .05 # 5%
#' similar <- list(c("hsize"))
#' hier <- c("nuts1", "nuts2")
#' risk_variables <- c("ageGroup", "national")
#' hid <- "hid"
#'
#' ## apply record swapping
#' #dat_s <- recordSwap(
#' # data = dat,
#' # hid = hid,
#' # hierarchy = hier,
#' # similar = similar,
#' # swaprate = swaprate,
#' # k_anonymity = k_anonymity,
#' # risk_variables = risk_variables,
#' # carry_along = NULL,
#' # return_swapped_id = TRUE,
#' # seed = seed
#' #)
#' #
#' ## number of swapped households
#' #dat_s[hid != hid_swapped, uniqueN(hid)]
#' #
#' ## hierarchies are not consistently swapped
#' #dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
#' #
#' ## use parameter carry_along
#' #dat_s <- recordSwap(
#' # data = dat,
#' # hid = hid,
#' # hierarchy = hier,
#' # similar = similar,
#' # swaprate = swaprate,
#' # k_anonymity = k_anonymity,
#' # risk_variables = risk_variables,
#' # carry_along = c("nuts3", "lau2"),
#' # return_swapped_id = TRUE,
#' # seed = seed)
#' #
#' #dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
#' }
#' @export
recordSwap <- function(data,...){
UseMethod("recordSwap")
}
#' @rdname recordSwap
#' @export
recordSwap.sdcMicroObj <- function(data, ...){
hid <- hierarchy <- similar <- similar <- swaprate <-
risk <- risk_threshold <- k_anonymity <- risk_variables <-
carry_along <- return_swapped_id <- seed <- log_file_name <- NULL
ellipsis <- list(...)
# get inputs from recordSwap.default
rsArgs <- formals(recordSwap.default)
rsArgs$data <- rsArgs$`...` <- NULL
# gett parameters from ..., sdcObject and default values
fun_params <- names(rsArgs)
fun_values <- lapply(fun_params,function(z,ell,sdcObj,default){
getVar(ell=ell,
sdcObj=sdcObj,
default=default,variable=z)
},ell=ellipsis,sdcObj=data,default=rsArgs)
names(fun_values) <- fun_params
expr_values <- paste(fun_params,"<-",paste0("fun_values[['",fun_params,"']]"))
eval(parse(text = paste(expr_values)))
# get data
data <- data@origData
# run record swaping default
data <- recordSwap.default(data=data, hid=hid, hierarchy = hierarchy,
similar = similar, swaprate = swaprate, risk = risk,
risk_threshold = risk_threshold, k_anonymity = k_anonymity,
risk_variables = risk_variables, carry_along = carry_along,
return_swapped_id = return_swapped_id,
log_file_name = log_file_name,
seed = seed)
return(data)
}
#' @rdname recordSwap
#' @export
recordSwap.default <- function(data, hid, hierarchy, similar,
swaprate=0.05, risk=NULL, risk_threshold=0,
k_anonymity=3, risk_variables=NULL,
carry_along = NULL,
return_swapped_id = FALSE,
log_file_name = "TRS_logfile.txt",
seed = NULL, ...){
helpVariableforMergingAfterTRS <- . <- hid_help <- NULL
# check data
if(missing(data)){
stop("data is missing, data must be either a data.table, data.frame")
}
if(all(!class(data)%in%c("data.table","data.frame"))){
stop("data must be either a data.table, data.frame")
}
data <- as.data.table(data)
cnames <- copy(colnames(data))
##########################
# # check inputs
# check mandatory parameters
if(any(missing("hid"), missing("hierarchy"),missing("similar"))){
stop("One of mandatory parameters (hid, hierarchy, similar) is missing.")
}
# check hid
hid <- checkIndexString(hid,cnames,matchLength = 1)
# check hierarchy
hierarchy <- checkIndexString(hierarchy,cnames,minLength = 1)
# check similar
if(!is.list(similar)){
similar <- list(similar)
}
similar <- lapply(similar,checkIndexString,cnames=cnames,minLength = 1)
# check risk_variables
if(is.null(risk) & is.null(risk_variables)){
stop("risk_variables are missing for calculation of k-anonymity rule.")
}
if(is.null(risk)){
risk_variables <- checkIndexString(risk_variables,cnames,minLength = 1)
} else {
risk_variables <- 0
}
# check carry_along
carry_along <- checkIndexString(carry_along,cnames,minLength = 0)
carry_along <- carry_along[!carry_along%in%hierarchy] # otherwise they are swapped twice
# check return_swapped_id and use with carry_along if TRUE
if(!is.logical(return_swapped_id) | length(return_swapped_id)!=1){
stop("return_swapped_id must be logical of length 1")
}
if(return_swapped_id==TRUE){
orig_id <- cnames[hid+1]
swapped_id <- paste0(orig_id,"_swapped")
data[,c(swapped_id):=get(orig_id)]
cnames <- copy(colnames(data))
swapped_id <- checkIndexString(swapped_id,cnames,
matchLength = 1)
carry_along <- c(carry_along,swapped_id)
}
# check that carry_along-variables are at household level
.chk_hhlevel <- function(data, hid, idx, action = "warning") {
# hid and idx are c-level indices (starting at 0!)
N <- NULL
vhid <- names(data)[hid + 1]
vidx <- names(data)[idx + 1]
agg <- data[, .(N = length(unique(get(vidx)))), by = vhid]
agg <- agg[N != 1]
if (nrow(agg) > 0) {
msg <- paste(
"Variable", shQuote(vidx),
"(used in `carry_along`) is not at household-level",
"which might lead to unexpected results."
)
if (action == "message") {
message(msg)
} else if (action == "warning") {
warning(msg, call. = FALSE)
} else {
stop(msg, call. = FALSE)
}
}
invisible(TRUE)
}
if (length(carry_along) > 0) {
for (i in carry_along) {
.chk_hhlevel(data = data, hid = hid, idx = i, action = "warning")
}
}
# check k_anonymity
if(!all((!is.null(risk_variables))&checkInteger(k_anonymity)&length(k_anonymity)==1&k_anonymity>=0)){
stop("k_anonymity must be a positiv single integer!")
}
# check risk_threshold
if(!(is.numeric(risk_threshold)&&length(risk_threshold)==1&&risk_threshold>=0)){
stop("risk_threshold must be a positiv numeric value!")
}
# check swaprate
if(!all(is.numeric(swaprate)&&length(swaprate)==1&&swaprate%between%c(0,1))){
stop("swaprate must be a single number between 0 and 1!")
}
# check risk
if(is.null(risk)){
risk <- data.table()
risk_threshold <- 0
}
if(is.vector(risk)){
if(any(length(risk) == nrow(data),is.list(risk))){
stop("If risk is not a vector containing column indices or column names in data then risk must be either a data.table, data.frame or matrix!")
}
if(length(risk)!=length(hierarchy)){
stop("risk and hierarchy need to address the same number of columns!")
}
risk <- checkIndexString(risk,cnames,minLength = 1)
risk <- data[,c(risk+1)]
}else{
if(all(!class(risk)%in%c("data.table","data.frame","matrix"))){
stop("If risk is not a vector containing column indices or column names in data then risk must be either a data.table, data.frame or matrix!")
}
}
if(nrow(risk)>0){
if(ncol(risk)!=length(hierarchy)){
stop("number of columns in risk does not match number of hierarchies!")
}
}
cnamesrisk <- copy(colnames(risk))
risk <- data.table(risk)
if(nrow(risk)>0){
if(is.null(cnamesrisk)){
message("risk does not contain column names; the first column in risk will be used for the first hierarchy level, e.g ",cnames[hierarchy[1]+1]," and so on.")
}else{
if(!any(cnamesrisk%in%cnames[hierarchy+1])){
stop("the columnnames of risk do not appear in data")
}
}
if(any(risk<0)||any(!unlist(lapply(risk, is.numeric)))
){
stop("risk must contain positive real values only!")
}
# Check the values of risk variable if they need adjustment for recordSwap_cpp()
# This check if risk values in each household are unique and if not then assign
# every member of the household the highest value in the household.
risk_variables_names <- copy(colnames(risk))
risk[,hid_help:=data[[hid+1]]]
tryCatch(
expr = risk[,lapply(.SD,
function(z){
if( (length(unique(z)) > 1)) {stop()} else {0}
}), # error when value not equal 0
.SDcols=c(risk_variables_names),
by=.(hid_help)], # calculate if each household have unique values
error = function(e){
message("risk was adjusted in order to give each household member the maximum household risk value")
risk[,c(risk_variables_names):=lapply(.SD,max),
.SDcols=c(risk_variables_names),
by=.(hid_help)] # assign to each household its max value
risk[,hid_help:=NULL]
}
)
}
# check seed
# if(is.character(seed)){
# stop("seed must be a single positive integer!")
# }
if(is.null(seed) | any(is.na(seed))){
seed <- sample(1e5,1)
}
if(!(is.numeric(seed)&&length(seed)==1&&seed%%1==0&&seed>0)){
stop("seed must be a single positive integer!")
}
##########################
# setup data and inputs for c++ function
# order data
setkeyv(data,cnames[hid+1])
# take sub data
data[,helpVariableforMergingAfterTRS:=.I]
sim_vars <- sort(unique(unlist(similar)))
original_cols <- unique(c(hid,hierarchy,risk_variables,sim_vars,carry_along))
select_cols <- unique(c(original_cols+1,ncol(data)))
data_sw <- copy(data[,.SD,.SDcols=c(select_cols)])
cnames_sw <- colnames(data_sw) # save column names for later use
# remove columns from original data except help variable for merging
drop_cols <- cnames_sw[-length(cnames_sw)]
data[,c(drop_cols):=NULL]
# remap column indices
hid <- which(hid %in% original_cols)-1
hierarchy <- sapply(hierarchy,function(z){
which(original_cols %in% z) -1
})
if(length(similar)>0){
# remap all similarity variables
similar <- lapply(similar,function(z){
sapply(z,function(z.s){
which(original_cols %in% z) -1
})
})
}
if(length(risk_variables)>0){
risk_variables <- sapply(risk_variables,function(z){
which(original_cols %in% z) -1
})
}
if(length(carry_along)>0){
carry_along <- sapply(carry_along,function(z){
which(original_cols %in% z) -1
})
}
# check if any non numeric values are present in data
if(any(!unlist(apply(data_sw,2,is.numeric)))){
stop("Columns specified in hid, hierarchy, similar and carry_along must contain only integer values at this point")
}
# check if any values with NA values are present in data
NAOccured <- apply(data_sw,2,function(z){any(is.na(z))})
if(any(NAOccured)){
stop("data must contain only integer values. \nColumn(s)\n ",paste( names(which(NAOccured)),collapse=", "),"\ncontain(s) NA values")
}
# check if any values with decimal values are present in data
decOccured <- apply(data_sw,2,function(z){any((z%%1)!=0)})
if(any(decOccured)){
decOccured <- names(decOccured)[decOccured]
stop("data must contain only integer values.\nColumn(s)\n ",paste(decOccured,collapse=", "),"\ncontain(s) decimal numbers")
}
# transpose data for cpp function
data_sw <- transpose(data_sw)
# transpose risk
if(nrow(risk)>0){
risk <- transpose(risk)
}else{
risk <- numeric(0)
}
# risk <- numeric(0) # drop this if risk was tested enough
# take time before starting swapping
start_time <- Sys.time()
data_sw <- recordSwap_cpp(data=data_sw, similar_cpp=similar, hierarchy=hierarchy,
risk_variables=risk_variables, hid=hid, k_anonymity=k_anonymity,
swaprate=swaprate,
risk_threshold=risk_threshold, risk=risk,
carry_along = carry_along,
log_file_name = log_file_name,
seed=seed)
# check if swapping was successful
if(file.exists(log_file_name) && file.mtime(log_file_name)>start_time){
message("Donor household was not found in ",length(readLines(log_file_name))-2," case(s).\nSee ",log_file_name," for a detailed list")
}else{
message("Recordswapping was successful!\n")
}
setDT(data_sw)
data_sw <- transpose(data_sw)
setnames(data_sw,colnames(data_sw),cnames_sw)
data[data_sw,c(drop_cols):=mget(drop_cols),on=.(helpVariableforMergingAfterTRS)]
rm(data_sw)
setcolorder(data,cnames)
data[,helpVariableforMergingAfterTRS:=NULL]
return(data)
}
# helpfunction to check if inputs are correct
# check if integer or string and if length is appropriate
checkInteger <- function(x){
if(is.numeric(x)){
return(all(x%%1==0))
}else{
return(FALSE)
}
}
# check length and type and convert to integer position for c++ function
checkIndexString <- function(x=NULL,cnames,matchLength=NULL,minLength=NULL){
# return numeric(0) of input is null
if(is.null(x) & (is.null(minLength)||minLength==0)){
return(numeric(0))
}
varName <- deparse(substitute(x))
if(!is.null(matchLength)){
if(!((checkInteger(x)|is.character(x))&length(x)==matchLength)){
stop(varName," must be an integer (column index) or character (column name) of length ",matchLength)
}
}else{
if(!((checkInteger(x)|is.character(x))&length(x)>=minLength)){
stop(varName," must contain integers (column indices) or characters (~column names) of data")
}
}
# check when string that names are in cnames
if(all(is.character(x))){
if(any(!x%in%cnames)){
stop("Column name(s) in ",varName," are not found in data")
}
# initialize index
x <- match(x,cnames)
}
# check that index does not
# - exceed number of column of data
# x must be integer from this point onwards
if(any(x>length(cnames))){
stop("Column index in ",varName," exceeds number of columns in data")
}
if(any(x==0)){
stop("Index in ",varName," does contain zero.\nIndexing in R starts with 1!")
}
if(any(x<0)){
stop("Column indices cannot be negative")
}
# indices start with 0 for c++ routine
x <- x-1
return(x)
}
# helpfunctino to get paramete values from ..., sdcObject and default values
getVar <- function(ell,sdcObj,default,variable){
in_ell <- ell[[variable]]
if(variable=="hid"){
in_sdcObj <- sdcObj@hhId
}else if(variable=="risk_variables"){
in_sdcObj <- sdcObj@keyVars
}else{
in_sdcObj <- sdcObj@options[[variable]]
}
null_ell <- is.null(in_ell)
null_sdcObj <- is.null(in_sdcObj)
if(null_ell & null_sdcObj){
if(is.symbol(default[[variable]])){
if(variable=="risk_variables"){
stop("argument `",variable,"` is missing, with no default\n Alternatively one can specifcy `",variable,"` through the parameter `keyVars` in `createSdcObj()`")
}else if(variable=="hid"){
stop("argument `",variable,"` is missing, with no default\n Alternatively one can specifcy `",variable,"` through the parameter `hhId` in `createSdcObj()`")
}else{
stop("argument `",variable,"` is missing, with no default\n Alternatively one can specifcy `",variable,"` through the parameter `options` in `createSdcObj()`")
}
}else{
# set default value for variable
take_value <- default[[variable]]
}
}
if(!null_ell & !null_sdcObj){
warning("argument `",variable,"` defined in function call and in `data`: taking value from function call")
take_value <- in_ell
}
if(!null_ell){
take_value <- in_ell
}else if(!null_sdcObj){
take_value <- in_sdcObj
}
return(take_value)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.