library(ccdata)
library(ccanonym)
library(pander)

panderOptions('table.split.table', 115)

Introduction

The simplest way to create the anonymised dataset is to use anonymisation() function provided by the ccanonym package and supplied with a configuration YAML file following the guidance of SOP. We will show the basic steps the anonymisation process. The anonymisation() function creates the anonymised data extract by doing following steps,

  1. Remove the alive episode.

  2. Calculate the age based on the date of birth (DOB) and date of ICU admission (DAICU) and date of admission will be removed subsequently. The Removal of DAICU should be specified in the configuration file.

  3. All demographic time stamps will be converted based on their difference between the date of admission and date of admission will be converted to an arbitrary time 1970-01-01. i.e. admission date: 2014-01-01 -> 1970-01-01; discharge date: 2014-01-03 -> 1970-01-03. With this process, all the time information will be hidden from the users. However the cadence of such of length of stay will still be preserved.

  4. Remove episodes which stays longer than a certain period of time. One can specify it in the configuration file, e.g. maxStay: 30.

  5. Micro-aggregate the numerical/date variables specified in the configuration file.

  6. Special aggregation, such as we suppress the post code in such a way that NW1 1AA -> NW1. The function can be written in the configuration file.

  7. Suppress the key variables where the k-anonymity is violated.

  8. Suppress the sensitive variables where the l-diversity is violated.

  9. Adding noise to the selected data.

  10. Combine and create the new ccRecord object and convert all the 2d date time stamps to the hour difference to the admission time.

library(ccdata)
library(ccanonym)
conf <- yaml.load_file("test_demo.yaml")
vars <- parse.conf(conf)
all.var <- c(vars$dirv, vars$all.vars, "DIS") # all variables besides non-confidential data
all.var.origin <- all.var[-which(all.var=="AGE")]

Identifiable data set

The identifiable data set is usually stored in ccRecord format, which is the default for CCHIC data. ccRecord object is usually converted from the XML file. In the toy data set, it contains only 7 episodes.

ccd <- xml2Data("../tests/data/test_data_anonym.xml")
demg.table <- as.data.frame(sql.demographic.table(ccd))[, all.var.origin]
pander(demg.table,  style = 'rmarkdown')

Simple anonymisation procedure

Step 1: Prepare YAML configuration file

Create a YAML configuration file as such, where the following variables are set up. The name of the items are the standard short name of ccdata R package. Please see the 'ITEM_REF.YAML'.

As we can see, the YAML configuration file not only specifies the variables, but also indicates certain operations which should be performed during the anonymisation SOP. In principal this YAML file should be served as a replicate of the SOP.

directVars:
    - pasno   
    - ICNNO   
    - ADNO    
    - TUADNO
    - DOB

categoryVars:
    - GPCODE
    - SEX
    - PCODE #  Postcode

numVars:
    AGE:  # Height
        aggr: 2
        noise: 2
    HCM:  # Height
        aggr: 2

    DAH: # Date of admission to your hospital
        aggr: 2
        noise: 2

sensVars:
    - BPC # Biopsy proven cirrhosis
    - AIDS_V3 # HIC/AIDS
    - PH # Portal hypertension
    - RAICU1
    - RAICU2 
    - URAICU 

deltaTime:
    - DAH
    - DOAH  # Date of original admission to/attendance at acute hospital'
    - DOAICU  # Date of original admission to ICU/HDU'
    - DUDICU  # Date of ultimate discharge from ICU/HDU'
    - DOD  # Date of death on your unit'
    - DDBSD  # Date of declaration of brain stem death'
    - DWFRD  # Date fully ready for discharge'
    - DDH  # Date of discharge from your hospital'
    - DUDH  # Date of ultimate discharge from your hospital'
    - DAICU  # Date & Time of admission to your unit'
    - DLCCA 
    - DDICU  # Date & Time of discharge from your unit'

Operations:
    PCODE: 'function(x){sapply(strsplit(as.character(x), split=" "), function(y) y[1])}'

... ...

Step 2: Micro-aggregation

In this step, we try to reduce the granularity of numerical data by aggregating data using "rmd" which groups based on robust multivariate (Mahalanobis) distance measures. [1]

proximity using multivariate distances.

numVars:
    HCM:  # Height
        aggr: 2 
conf$numVars$HCM$aggr <- 2 # modify the parameter implicitly. 
sdc <- sdc.trial(ccd, conf, remove.alive=T, k=1)
pander(data.frame(HCM_ORIG=demg.table$HCM[1:5], HCM_ANON=sdc$data$HCM), style="rmarkdown")

Custom suppression

Operations:
    PCODE: 'function(x){sapply(strsplit(as.character(x), split=" "), function(y) y[1])}'
conf[["Operations"]] <- list()
conf$Operations[['PCODE']] <- 'function(x){sapply(strsplit(as.character(x), split=" "), function(y) y[1])}'
sdc <- sdc.trial(ccd, conf, k=1, l=1)
pander(data.frame(PCODE_ORIG=demg.table$PCODE[1:5], 
                  PCODE_ANON=sdc$data$PCODE), 
       style="rmarkdown")

Adjust $k-anonymity$

k-anonymity = 1

vars <- c('GPCODE', 'SEX', 'PCODE', 'AIDS_V3', 'HCM', 'DAH', 'AGE')
sdc <- sdc.trial(ccd, conf, k.anon=1, l.div=1, verbose=F)
pander(sdc$data[, vars], style="rmarkdown")

k-anonymity = 2

sdc <- sdc.trial(ccd, conf, k.anon=2, l.div=1, verbose=T)
pander(sdc$data[, vars], style="rmarkdown")

k-anonymity = 3

sdc <- sdc.trial(ccd, conf, k.anon=3, l.div=1, verbose=T)
pander(sdc$data[, vars], style="rmarkdown")

Adjust $k-anonymity$

l-diversity = 1

sens.vars <- c("BPC", "AIDS_V3", "PH", "RAICU1", "RAICU2", "URAICU")
sdc <- sdc.trial(ccd, conf, k.anon=1, l.div=1)
pander(sdc$data[, sens.vars], style="rmarkdown")

l-diversity = 2

sdc <- sdc.trial(ccd, conf, k.anon=1, l.div=2)
pander(sdc$data[, sens.vars], style="rmarkdown")

Create the anonymisation dataset

anonccd <- anonymisation(ccd, conf, remove.alive=T, k=2, l=1)
demg.new <- data.frame(sql.demographic.table(anonccd))
pander(demg.new[, all.var], style="rmarkdown")

Reference

[1]. Templ M, Meindl B, Kowarik A, Chen S. Introduction to statistical disclosure control (SDC). IHSN Working Paper No 007 2014, Oct 28:1-25.



CC-HIC/ccanonym documentation built on May 6, 2019, 9:23 a.m.