ccanonym: What the package does (short line)

Documented in addnoise.numvar anonymisation append.age deltaTime1d microaggregation.numvar remove.direct.vars sdc.trial

#' Anonymise the critical care dataset
#'
#' This is the main function of the ccanonym package. It creates the anonymised
#' data abstract from the identifiable dataset through the anonymisation
#' process. 
#'
#' @details
#'1. Remove the alive episode if \code{remove.alive} variable is TRUE.
#'
#'2. Calculate the age based on the date of birth (DOB) and date of ICU admission
#'(DAICU) and date of admission will be removed subsequently. The Removal of
#'DAICU should be specified in the configuration file. 
#'
#'3. All demographic time stamps will be converted based on their difference
#'between the date of admission and date of admission will be converted to an
#'arbitrary time 1970-01-01. i.e. admission date: 2014-01-01 -> 1970-01-01;
#'discharge date: 2014-01-03 -> 1970-01-03. With this process, all the time
#'information will be hidden from the users. However the cadence of such of length
#'of stay will still be preserved. 
#'
#'4. Remove VIPs from a list file which has the identifiers of VIPs. The
#' identifiers can be NHS number or PAS number or site episode id combination
#' (Q70:000001). 
#'
#'5. Remove episodes which stays longer than a certain period of time. One can
#'specify it in the configuration file, e.g. maxStay: 30. 
#'
#'6. Micro-aggregate the numerical/date variables specified in the configuration
#'file. 
#'
#'7. Special aggregation, such as we suppress the post code in such a way that
#'NW1 1AA -> NW1. The function can be written in the configuration file. 
#'
#'8. Suppress the key variables where the k-anonymity is violated. 
#'
#'9. Suppress the sensitive variables where the l-diversity is violated. 
#'
#'10. Adding noise to the selected data. 
#' 
#'11. Combine and create the new ccRecord object and convert all the 2d date
#'time stamps to the hour difference to the admission time. 
#' 
#' @param ccd identifiable data set in ccRecord format (see. cleanEHR R package)
#' @param conf YAML configuration which can be either path of the YAML
#'        file or a configuration list equivalent to the YAML configuration. 
#' @param remove.alive logical value determines whether to remove all alive
#'        episodes.
#' @param verbose logical 
#' @param k.anon integer the K-anonymity
#' @param l.div integer the minimum L-diversity of the data extract. For
#'        entries where the L-diversity is above this threshold will be suppressed. 
#' @return ccRecord 
#'
#' @examples
#' \dontrun{
#' # We assume the original dataset is called `ccd`
#' # Create a template configuration file, modify it if necessary. 
#' template.conf("test.yaml")
#' 
#' # Trial: adjust K-anonymity and L-diversity. 
#' sdc-trial(ccd, "test.yaml", k.anon=10, l.div=2)
#' 
#' # Create the data extract after the k.anon and l.div is decided. 
#' ccd.anon <- anonymisation(ccd, "test.yaml", k.anon=10)
#' }
#' @import sdcMicro
#' @import yaml
#' @import cleanEHR
#' @export
anonymisation <- function(ccd, conf, remove.alive=T, verbose=F, 
                          k.anon=5, l.div=NULL, ...) {
    vn <- parse.conf(conf)
    ccd <- deltaTime(ccd, ...)
    sdc <- sdc.trial(ccd, conf, remove.alive, verbose, k=k.anon, l=l.div)
    newccd <- create.anonym.ccd(ccd, sdc)
    security.check(newccd, vn$dirv)
    newccd
}

#' Performe statstical disclosure control process on demgraphic data.
#'
#' The sdc.trial is a function called by \code{anonymisation()}. However it is
#' also a very useful function for tuning the variables, such as k and l, when the
#' data is new. This function will return the anonymised demographic data and a
#' SDC object which tells more about the entire anonymisation process, such as
#' individual risks, information lost and so on. The user should balance the
#' security and usefulness based on data anonymisation SOP. 
#'
#' @details 
#' 
#'1. Remove the alive episode if \code{remove.alive} variable is TRUE.
#'
#'2. Calculate the age based on the date of birth (DOB) and date of ICU admission
#'(DAICU) and date of admission will be removed subsequently. The Removal of
#'DAICU should be specified in the configuration file. 
#'
#'3. All demographic time stamps will be converted based on their difference
#'between the date of admission and date of admission will be converted to an
#'arbitrary time 1970-01-01. i.e. admission date: 2014-01-01 -> 1970-01-01;
#'discharge date: 2014-01-03 -> 1970-01-03. With this process, all the time
#'information will be hidden from the users. However the cadence of such of length
#'of stay will still be preserved. 
#'
#'4. Remove episodes which stays longer than a certain period of time. One can
#'specify it in the configuration file, e.g. maxStay: 30. 
#'
#'5. Micro-aggregate the numerical/date variables specified in the configuration
#'file. 
#'
#'6. Special aggregation, such as we suppress the post code in such a way that
#'NW1 1AA -> NW1. The function can be written in the configuration file. 
#'
#'7. Suppress the key variables where the k-anonymity is violated. 
#'
#'8. Suppress the sensitive variables where the l-diversity is violated. 
#'
#'9. Adding noise to the selected data. 
#' @param ccd the identifiable ccRecord object
#' @param conf the YAML file location or a list equivalent to the YAML file. 
#' @param remove.alive logical whether remove all non-dead episodes. 
#' @param verbose logical showing more or less information. 
#' @param k.anon minimum K-anonymity threshold.
#' @param l.div minimum l-diversity threshold.
#' @return sdc a list contains the parsed data and the sdcMicro object where
#'         the detailed individual risks can be checked.
#' @examples
#' \dontrun{
#' # We assume the original dataset is called `ccd`
#'
#' # Create a template configuration file, modify it if necessary. 
#' template.conf("test.yaml")
#' 
#' # Trial: adjust K-anonymity and L-diversity. 
#' sdc <- sdc-trial(ccd, "test.yaml", k.anon=10, l.div=2)
#'
#' # To access SDC object
#' sdc$sdc
#' 
#' # Create the data extract after the k.anon and l.div is decided. 
#' ccd.anon <- anonymisation(ccd, "test.yaml", k.anon=10)
#' }
#' @import data.table
#' @export
sdc.trial <- function(ccd, conf, remove.alive=T, verbose=F, k.anon=5,
                      l.div=NULL) {

    if (is.character(conf))
        conf <- yaml.load_file(conf)

    vn <- parse.conf(conf)

    if (verbose)  cat("parsing the cleanEHR object ...\n")

    demg <- data.table(suppressWarnings(demographic.patient.spell(ccd)))
    demg <- remove.patients(demg, conf$removelist)
    demg <- append.age(demg)

    # Remove dead episodes 
    if (remove.alive)
        demg <- demg[DIS=="D"]

    demg <- deltaTime1d(demg, conf$deltaTime, conf$maxStay)
    
    demg <- custom.operation(demg, conf)
    # Remove direct identifiable variables 
    demg <- remove.direct.vars(demg, vn$dirv)
    demg <- microaggregation.numvar(demg, conf)

    sdc <- createSdcObj(demg, keyVars=c(vn$ctgrv, vn$numv))
    sdc <- localSuppression(sdc, k=k.anon)


    if (!is.null(l.div)) {
        manipSensVars <- suppress.ldiversity(sdc, vn$sensv, verbose, l.div)
        demg[, vn$sensv] <- manipSensVars[, vn$sensv]
    }

    demg[, sdc@keyVars] <- sdc@manipKeyVars

    demg <- addnoise.numvar(demg, conf)

    if (verbose)
        print(sdc)

    return(list(data=demg, sdc=sdc, conf=vn))
}


custom.operation <- function(demg, conf) {
    ops <- conf$Operations
    for (item in names(ops)) {
        func <- eval(parse(text=ops[item]))
        demg[[item]] <- func(demg[[item]])
    }
    return(demg)
}


suppress.ldiversity <- function(sdc, sensv, verbose, l.div) {
    manipSensVars <- list()
    for (i in sensv) {
        ld <- ldiversity(sdc, i)@risk$ldiversity
        manipVar <- sdc@origData[[i]]
        manipVar[ld[1:nrow(ld), 2] < l.div] <- NA
        manipSensVars[[i]] <- manipVar    
    }
    as.data.frame(manipSensVars)
}

#' adding noise to all numerical variables. 
#' 
#' The numerical variables includes datetime which will be
#' treated as numeric by converting them to numeric. It will be converted back
#' to datetime format at them end. 
addnoise.numvar <- function(demg, conf) {
    vn <- parse.conf(conf)
    demg <- data.frame(convert.numeric.datetime(demg))
    numconf <- conf$numVars
    numv <- non.unique.columns(demg, vn$numv)

    for (item in numv) {
        if (!is.null(numconf[[item]][['noise']])) {
            noise.level <- numconf[[item]][['noise']]
            demg[, item] <- addNoise(data.frame(demg[, item]), noise=noise.level)$xm
        }
    }
    demg <- convert.back.datetime(demg)
    demg
}

#' microaggregation on numeric variables
#' 
#' @param demg data.table/data.frame 
#' @param conf configuration list
#' @return demg data.frame
microaggregation.numvar <- function(demg, conf) {
    vn <- parse.conf(conf)
    # It is acceptable to have wrong formatted date time which generate warnings.
    demg <- suppressWarnings(data.frame(convert.numeric.datetime(demg)))

    numconf <- conf$numVars
    numv <- non.unique.columns(demg, vn$numv)

    for (item in numv) {
        if (!is.null(numconf[[item]][['aggr']])) {
            agg.level <- numconf[[item]][['aggr']]
            demg[, item] <- 
                microaggregation(data.frame(demg[, item]), 
                                 aggr=agg.level)$mx
        }
    }
    demg <- suppressWarnings(convert.back.datetime(demg))
    demg
}

#' Remove direct identifiers specified
#' 
#' @param data data.frame/data.table
#' @param dirvs direct variables name 
#' @return data data.frame/data/table 
remove.direct.vars <- function(data, dirvs) {
    for (i in dirvs) 
        data[[i]] <- NULL
    data 
}


#' Append AGE column to the demographic table 
append.age <- function(demg) {
    format.dob <- "%Y-%m-%d"
    format.dah <- "%Y-%m-%d"
    demg$AGE <- difftime(as.POSIXct(demg$DAH, format=format.dah),
                         as.POSIXct(demg$DOB, format=format.dob), 
                         units="days")/365
    demg$AGE <- as.numeric(floor(as.numeric(demg$AGE)))
    demg$index <- seq(nrow(demg))
    return(demg)
}


#' Convert all the date-time columns to delta time, but still convert them back
#' with an origin of 1970-01-01. 
deltaTime1d <- function(demg, items, maxstay) {
    if (!"DAICU" %in% items)
        stop("DAICU must in deltaTime. ")

    demg <- data.frame(demg)
    demg <- convert.numeric.datetime(demg, items)
    demg[, items] <- demg[, items] - demg$DAICU

    mxind <- apply(demg[, items], 1, function(x) max(x, na.rm=T)) > maxstay * 3600 * 24
    
    if (any(mxind))
        demg <- demg[-which(mxind), ]

    demg <- convert.back.datetime(demg, items)
    demg
}


clinic.data.list <- function(ccd, index, nidentify) {
    cdl <- list()
    for (i in stname2code(nidentify)) {
        cdl[[i]] <- ccd@episodes[[index]]@data[[i]]
    }

#
#
#    cdl  <- lapply(ccd@episodes[[index]]@data, 
#                   function(x) {
#                       for (i in stname2code(nidentify))
#
#                       if (length(x) > 1)
#                           return(x)
#                       else 
#                           return(NULL)
#                   })
#    for (i in names(cdl)) {
#        if (is.null(cdl[[i]]))
#            cdl[[i]] <- NULL
#    }
    cdl
}

create.anonym.ccd <- function(ccd, sdc) {
    sdc$data$ADNO <- seq(nrow(sdc$data))
    sdc$data$ICNNO <- "pseudo_site"

    sdc.row2list <- function(sdcrow) {
        lst <- as.list(sdcrow)
        lst <- lapply(lst, as.character) 
        lst <- lapply(lst, 
                      function(x) {
                          if (is.na(x)) 
                              return("NULL")
                          else
                              return(x)
                      })
        lst[['index']] <- NULL
        names(lst) <- stname2code(names(lst))
        for (i in names(lst)) {
            if (lst[[i]] == "NULL")
                lst[[i]] <- NULL
        }
        lst
    }


    new.record <- ccRecord()

    for (i in seq(nrow(sdc$data))) {
        cdl <- clinic.data.list(ccd, sdc$data$index[i], sdc$conf$nidentify)
        sdcl <- sdc.row2list(sdc$data[i, ])
        new.record <- new.record + new.episode(append(cdl, sdcl))
    }
    new.record@infotb[, pid:=sdc$data$pid]
    new.record@infotb[, spell:=sdc$data$spell]
    new.record
}



non.unique.columns <- function(data, numv) {
    # check the data type and avoid processing on the 
    # all NA columns.
    new.numv <- vector()
    for (i in seq(numv)) {
        unidata <- unique(data[[numv[i]]])
        if (NA %in% unidata)
            unidata <- unidata[-which(is.na(unidata))]
        if (length(unidata) > 1)
            new.numv <- c(new.numv, numv[i])
    }

    if (length(new.numv) == 0)
        return(NULL)
    else
        return(new.numv)
}

security.check <- function(ccd, dirv) {
    cleanEHR:::for_each_episode(ccd, 
                              function(x) {
                                  if (any(dirv %in% names(x@data)))
                                      stop("direct identifiable items appeared in the final data!!!")
                              })
}

remove.patients <- function(demg, rmfile) {
    if (is.null(rmfile))
        return(demg)
    rmpatients <- as.character(read.csv(rmfile, header=F)[[1]])
    demg <- demg[!NHSNO %in% rmpatients]
    demg <- demg[!pasno %in% rmpatients]
    demg <- demg[!paste(ICNNO, ADNO, sep=":") %in% rmpatients]
   
    demg
}

CC-HIC/ccanonym documentation built on May 6, 2019, 9:23 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

CC-HIC/ccanonym
What the package does (short line)

R/sdc-anonymisation.R
In CC-HIC/ccanonym: What the package does (short line)

Defines functions anonymisation sdc.trial custom.operation suppress.ldiversity addnoise.numvar microaggregation.numvar remove.direct.vars append.age deltaTime1d clinic.data.list create.anonym.ccd non.unique.columns security.check remove.patients

Documented in addnoise.numvar anonymisation append.age deltaTime1d microaggregation.numvar remove.direct.vars sdc.trial

R Package Documentation

Browse R Packages

We want your feedback!

CC-HIC/ccanonym What the package does (short line)

R/sdc-anonymisation.R In CC-HIC/ccanonym: What the package does (short line)

Defines functions anonymisation sdc.trial custom.operation suppress.ldiversity addnoise.numvar microaggregation.numvar remove.direct.vars append.age deltaTime1d clinic.data.list create.anonym.ccd non.unique.columns security.check remove.patients

Documented in addnoise.numvar anonymisation append.age deltaTime1d microaggregation.numvar remove.direct.vars sdc.trial

R Package Documentation

Browse R Packages

We want your feedback!

CC-HIC/ccanonym
What the package does (short line)

R/sdc-anonymisation.R
In CC-HIC/ccanonym: What the package does (short line)