sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation

Documented in microaggregation print.micro summary.micro

#' Microaggregation
#'
#' Function to perform various methods of microaggregation.
#'
#' On \url{https://research.cbs.nl/casc/glossary.htm} one can found the
#' \dQuote{official} definition of microaggregation:
#'
#' Records are grouped based on a proximity measure of variables of interest,
#' and the same small groups of records are used in calculating aggregates for
#' those variables. The aggregates are released instead of the individual
#' record values.
#'
#' The recommended method is \dQuote{rmd} which forms the proximity using
#' multivariate distances based on robust methods. It is an extension of the
#' well-known method \dQuote{mdav}.  However, when computational speed is
#' important, method \dQuote{mdav} is the preferable choice.
#'
#' While for the proximity measure very different concepts can be used, the
#' aggregation itself is naturally done with the arithmetic mean.
#' Nevertheless, other measures of location can be used for aggregation,
#' especially when the group size for aggregation has been taken higher than 3.
#' Since the median seems to be unsuitable for microaggregation because of
#' being highly robust, other mesures which are included can be chosen. If a
#' complex sample survey is microaggregated, the corresponding sampling weights
#' should be determined to either aggregate the values by the weighted
#' arithmetic mean or the weighted median.
#'
#' This function contains also a method with which the data can be clustered
#' with a variety of different clustering algorithms. Clustering observations
#' before applying microaggregation might be useful.  Note, that the data are
#' automatically standardised before clustering.
#'
#' The usage of clustering method \sQuote{Mclust} requires package mclust02,
#' which must be loaded first. The package is not loaded automatically, since
#' the package is not under GPL but comes with a different licence.
#'
#' The are also some projection methods for microaggregation included.  The
#' robust version \sQuote{pppca} or \sQuote{clustpppca} (clustering at first)
#' are fast implementations and provide almost everytime the best results.
#'
#' Univariate statistics are preserved best with the individual ranking method
#' (we called them \sQuote{onedims}, however, often this method is named
#' \sQuote{individual ranking}), but multivariate statistics are strong
#' affected.
#'
#' With method \sQuote{simple} one can apply microaggregation directly on the
#' (unsorted) data. It is useful for the comparison with other methods as a
#' benchmark, i.e. replies the question how much better is a sorting of the
#' data before aggregation.
#'
#' @name microaggregation
#' @docType methods
#' @param obj either an object of class \code{\link{sdcMicroObj-class}} or a \code{data.frame}
#' @param variables variables to microaggregate. For \code{NULL}: If obj is of class
#' sdcMicroObj, all numerical key variables are chosen per default. For
#' \code{data.frames}, all columns are chosen per default.
#' @param aggr aggregation level (default=3)
#' @param strata_variables for \code{data.frames}, by-variables for applying microaggregation only
#' within strata defined by the variables. For \code{\link{sdcMicroObj-class}}-objects, the
#' stratification-variable defined in slot \code{@strataVar} is used. This slot can be changed any
#' time using \code{strataVar<-}.
#' @param method pca, rmd, onedims, single, simple, clustpca, pppca,
#' clustpppca, mdav, clustmcdpca, influence, mcdpca
#' @param nc number of cluster, if the chosen method performs cluster analysis
#' @param weights sampling weights. If obj is of class sdcMicroObj the vector
#' of sampling weights is chosen automatically. If determined, a weighted
#' version of the aggregation measure is chosen automatically, e.g. weighted
#' median or weighted mean.
#' @param clustermethod clustermethod, if necessary
#' @param measure aggregation statistic, mean, median, trim, onestep (default=mean)
#' @param trim trimming percentage, if measure=trim
#' @param varsort variable for sorting, if method=single
#' @param transf transformation for data x
#' @return If \sQuote{obj} was of class \code{\link{sdcMicroObj-class}} the corresponding
#' slots are filled, like manipNumVars, risk and utility. If \sQuote{obj} was
#' of class \dQuote{data.frame}, an object of class \dQuote{micro} with following entities is returned:
#' \describe{
#' \item{\code{x}: }{original data}
#' \item{\code{mx}: }{the microaggregated dataset}
#' \item{\code{method}: }{method}
#' \item{\code{aggr}: }{aggregation level}
#' \item{\code{measure}: }{proximity measure for aggregation}}
#' @note if only one variable is specified, \code{\link{mafast}} is applied and argument \code{method} is ignored.
#' Parameters \code{measure} are ignored for methods \code{mdav} and \code{rmd}.
#' @author Matthias Templ, Bernhard Meindl
#'
#' For method \dQuote{mdav}: This work is being supported by the International
#' Household Survey Network (IHSN) and funded by a DGF Grant provided by the
#' World Bank to the PARIS21 Secretariat at the Organisation for Economic
#' Co-operation and Development (OECD).  This work builds on previous work
#' which is elsewhere acknowledged.
#'
#' Author for the integration of the code for mdav in R: Alexander Kowarik.
#' @seealso \code{\link{summary.micro}}, \code{\link{plotMicro}},
#' \code{\link{valTable}}
#' @references
#' Templ, M. and Meindl, B., \emph{Robust Statistics Meets SDC: New Disclosure
#' Risk Measures for Continuous Microdata Masking}, Lecture Notes in Computer
#' Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
#'
#' Templ, M. \emph{Statistical Disclosure Control for Microdata Using the
#' R-Package sdcMicro}, Transactions on Data Privacy, vol. 1, number 2, pp.
#' 67-85, 2008.  \url{http://www.tdp.cat/issues/abs.a004a08.php}
#'
#' Templ, M. \emph{New Developments in Statistical Disclosure Control and
#' Imputation: Robust Statistics Applied to Official Statistics},
#' Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280,
#' 264 pages.
#'
#' Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R.
#' \emph{Springer International Publishing}, 287 pages, 2017. ISBN 978-3-319-50272-4. \doi{10.1007/978-3-319-50272-4}
#' \doi{10.1007/978-3-319-50272-4}
#'
#' Templ, M. and Meindl, B. and Kowarik, A.: \emph{Statistical Disclosure Control for
#' Micro-Data Using the R Package sdcMicro}, Journal of Statistical Software,
#' 67 (4), 1--36, 2015.
#' @keywords manip
#' @rdname microaggregation
#' @export
#' @examples
#' data(testdata)
#' # donttest since Examples with CPU time larger 2.5 times elapsed time, because
#' # of using data.table and multicore computation.
#' \donttest{
#' m <- microaggregation(
#'   obj = testdata[1:100, c("expend", "income", "savings")],
#'   method = "mdav",
#'   aggr = 4
#' )
#' summary(m)
#'
#' ## for objects of class sdcMicro:
#' ## no stratification because `@strataVar` is `NULL`
#' data(testdata2)
#' sdc <- createSdcObj(
#'   dat = testdata2,
#'   keyVars = c("urbrur", "roof", "walls", "water", "electcon", "sex"),
#'   numVars = c("expend", "income", "savings"),
#'   w = "sampling_weight"
#' )
#' sdc <- microaggregation(
#'   obj = sdc,
#'   variables = c("expend", "income")
#' )
#'
#' ## with stratification using variable `"relat"`
#' strataVar(sdc) <- "relat"
#' sdc <- microaggregation(
#'   obj = sdc,
#'   variables = "savings"
#' )
#' }
microaggregation <- function(obj, variables=NULL, aggr=3, strata_variables=NULL,
  method="mdav", weights=NULL, nc=8, clustermethod="clara",
  measure="mean", trim=0, varsort=1, transf="log") {

  if (!is.data.frame(obj) & !is.null(strata_variables)) {
    message("Argument 'strata_variables' is ignored. Only variables specified in slot 'strataVar' (if any) of the input object are used!\n")
  }
  microaggregationX(obj=obj, variables=variables, aggr=aggr, strata_variables=strata_variables,
    method=method, weights=weights, nc=nc, clustermethod=clustermethod,
    measure=measure, trim=trim, varsort=varsort, transf=transf)
}

setGeneric("microaggregationX", function(obj, variables=NULL, aggr=3, strata_variables=NULL,
  method="mdav", weights=NULL, nc=8, clustermethod="clara",
  measure="mean", trim=0, varsort=1, transf="log") {
  standardGeneric("microaggregationX")
})

setMethod(f="microaggregationX", signature=c("sdcMicroObj"), definition=function(obj,
  variables=NULL, aggr=3, method="mdav", nc=8, clustermethod="clara",
  measure="mean", trim=0, varsort=1, transf="log") {

  x <- get.sdcMicroObj(obj, type="manipNumVars")
  if (is.null(variables)) {
    variables <- colnames(x)
  }
  strataVars <- get.sdcMicroObj(obj, type="strataVar")
  if (length(strataVars) > 0) {
    sx <- get.sdcMicroObj(obj, type="origData")[, strataVars, drop=FALSE]
    x <- cbind(x, sx)
    strataVars <- utils::tail(colnames(x), 1)
  }

  weights <- get.sdcMicroObj(obj, type="weightVar")
  if (!is.null(weights)) {
    weights <- get.sdcMicroObj(obj, type="origData")[, weights]
  }
  if (any(weights < 0)) {
    warnMsg <- "negative weights have been detected!\n"
    obj <- addWarning(obj, warnMsg=warnMsg, method="microaggregation", variable=NA)
  }
  res <- microaggregationWORK(x, variables=variables, aggr=aggr, strata_variables=strataVars,
    method=method, weights=weights, nc=nc, clustermethod=clustermethod,
    measure=measure, trim=trim, varsort=varsort, transf=transf)
  obj <- nextSdcObj(obj)
  x[, variables] <- res$mx[, variables]

  obj <- set.sdcMicroObj(obj, type="manipNumVars", input=list(as.data.frame(x[,
    colnames(obj@origData)[obj@numVars], drop=FALSE])))
  obj <- dRisk(obj)
  obj <- dUtility(obj)
  obj
})

setMethod(f="microaggregationX", signature=c("data.frame"), definition=function(obj,
  variables=NULL, aggr=3, strata_variables=NULL, method="mdav", weights=NULL,
  nc=8, clustermethod="clara", measure="mean", trim=0, varsort=1,
  transf="log") {

  if (is.null(variables)) {
    variables <- colnames(obj)
  }
  microaggregationWORK(x=obj, variables=variables, aggr=aggr, strata_variables=strata_variables,
    method=method, weights=weights, nc=nc, clustermethod=clustermethod,
    measure=measure, trim=trim, varsort=varsort, transf=transf)
})

microaggregationWORK <- function(x, variables=colnames(x), method="mdav", aggr=3,
  weights=NULL, nc=8, clustermethod="clara", measure="mean",
  trim=0, varsort=1, transf="log", strata_variables=NULL) {

  ## helper-functions used in several micro_xxx() functions
  factorOfTotals <- function(x, aggr) {
    n <- dim(x)[1]
    abgerundet <- floor(n/aggr)
    fot <- n/abgerundet
    return(fot)
  }
  weightedQuantile <- function(x, weights=NULL, probs=seq(0, 1, 0.25), sorted=FALSE, na.rm=FALSE) {
    if (!is.numeric(x)) {
      stop("'x' must be a numeric vector")
    }
    n <- length(x)
    if (n == 0 || (!isTRUE(na.rm) && any(is.na(x)))) {
      return(rep.int(NA, length(probs)))
    }
    if (!is.null(weights)) {
      if (!is.numeric(weights)) {
        stop("'weights' must be a numeric vector")
      } else if (length(weights) != n) {
        stop("'weights' must have the same length as 'x'")
      } else if (!all(is.finite(weights))) {
        stop("missing or infinite weights")
      }
      if (any(weights < 0)) {
        warnMsg <- "negative weights have been detected!\n"
        warning(warnMsg)
      }
      if (!is.numeric(probs) || all(is.na(probs)) || isTRUE(any(probs < 0 | probs > 1))) {
        stop("'probs' must be a numeric vector with values in [0,1]")
      }
      if (all(weights == 0)) {
        warnMsg <- "all weights equal 0!\n"
        obj <- addWarning(obj, warnMsg=warnMsg, method="microaggregation", variable=NA)
        warning(warnMsg)
        return(rep.int(0, length(probs)))
      }
    }
    if (isTRUE(na.rm)) {
      indices <- !is.na(x)
      x <- x[indices]
      if (!is.null(weights))
        weights <- weights[indices]
    }
    if (!isTRUE(sorted)) {
      order <- order(x)
      x <- x[order]
      weights <- weights[order]
    }
    if (is.null(weights))
      rw <- (1:n)/n else rw <- cumsum(weights)/sum(weights)
    q <- sapply(probs, function(p) {
      if (p == 0)
        return(x[1]) else if (p == 1)
        return(x[n])
      select <- min(which(rw >= p))
      if (rw[select] == p)
        mean(x[select:(select + 1)]) else x[select]
    })
    return(unname(q))
  }
  weightedMedian <- function(x, weights=NULL, sorted=FALSE, na.rm=FALSE) {
    weightedQuantile(x, weights, probs=0.5, sorted=sorted, na.rm=na.rm)
  }
  indexMicro <- function(x, aggr) {
    n <- dim(x)[1]
    if (n < 2 * aggr) {
      stop(paste0("Too less observations (", n, ") for aggregate =", aggr,"\n"))
    }
    aa <- seq(1, n, aggr)
    j <- 1
    teiler <- n/aggr
    d1 <- 1:n
    index <- list()
    if (teiler %in% 1:n) {
      for (i in 1:length(aa)) {
        index[[i]] <- d1[j:(j + aggr - 1)]
        j <- j + aggr
      }
    } else {
      for (i in 1:(length(aa) - 2)) {
        index[[i]] <- d1[j:(j + aggr - 1)]
        j <- j + aggr
      }
      index[[i + 1]] <- d1[(j):n]
    }
    index
  }
  means <- function(x, index, measure, trim=0) {
    m <- matrix(ncol=ncol(x), nrow=length(index))
    if (measure == "mean" & is.null(weights)) {
      for (i in 1:length(index)) {
        m[i, ] <- colMeans(x[index[[i]], ])
      }
    }
    if (measure == "median" & is.null(weights)) {
      for (i in 1:length(index)) {
        m[i, ] <- apply(x[index[[i]], ], 2, median)
      }
    }
    if (measure == "mean" & !is.null(weights)) {
      for (i in 1:length(index)) {
        m[i, ] <- apply(x[index[[i]], ], 2, function(x) stats::weighted.mean(x,
          w=weights[index[[i]]]))
      }
    }
    if (measure == "median" & !is.null(weights)) {
      for (i in 1:length(index)) {
        m[i, ] <- apply(x[index[[i]], ], 2, function(x) weightedMedian(x,
          weights=weights[index[[i]]]))
      }
    }
    if (measure == "trim") {
      for (i in 1:length(index)) {
        for (j in 1:length(index[[i]])) {
          m[i, ] <- apply(x[index[[i]], ], 2, mean, trim=trim)
        }
      }
    }
    if (measure == "onestep") {
      y <- x
      constant <- 3/1.486
      for (i in 1:length(index)) {
        m1 <- apply(x[index[[i]], ], 2, median)
        m2 <- apply(x[index[[i]], ], 2, mad)
        limit1 <- m1 + constant * m2
        limit2 <- m1 - constant * m2
        for (ii in 1:length(index[[i]])) {
          if (any(x[index[[i]][ii], ] > limit1)) {
          w <- which(x[index[[i]][ii], ] > limit1)
          le <- length(w)
          y[index[[i]][ii], w] <- limit1[w]
          }
          if (any(x[index[[i]][ii], ] < limit2)) {
          w <- which(x[index[[i]][ii], ] < limit2)
          le <- length(w)
          y[index[[i]][ii], w] <- limit2[w]
          }
          m[i, ] <- colMeans(y[index[[i]], ])
        }
      }
    }
    colnames(m) <- colnames(x)
    return(m)
  }
  blowup <- function(x, mr, aggr) {
    n <- dim(x)[1]
    aa <- seq(1, n, aggr)
    j <- 1
    teiler <- n/aggr
    d1 <- 1:n
    xx <- matrix(0, ncol=ncol(x), nrow=nrow(x))
    if (teiler %in% 1:n) {
      for (i in 1:length(aa)) {
        for (s in j:(j + aggr - 1)) {
          xx[s, ] <- as.matrix(mr[i, , drop=FALSE])
        }
        j <- j + aggr
      }
    } else {
      for (i in 1:(length(aa) - 2)) {
        for (s in j:(j + aggr - 1)) {
          xx[s, ] <- as.matrix(mr[i, , drop=FALSE])
        }
        j <- j + aggr
      }
      for (s in j:n) {
        xx[s, ] <- mr[i + 1, ]
      }
    }
    rownames(xx) <- rownames(x)
    xx
  }
  clust <- function(x, nc, clustermethod="clara", transf="log") {
    if (transf == "none") {
      y <- x
    }
    if (transf == "log") {
      y <- scale(log(x))
    }
    if (transf == "boxcox") {
      lambda <- car::powerTransform(x)$lambda
      y <- scale(car::bcPower(x, lambda))
    }
    if (clustermethod == "clara") {
      a <- clara(x, nc)
      clustresult <- a$clust
      centers <- a$med
      size <- a$clusinfo[, 1]
    }
    if (clustermethod == "pam") {
      a <- pam(x, nc)
      clustresult <- a$clust
      centers <- a$med
      size <- a$clusinfo[, 1]
    }
    if (clustermethod == "kmeans") {
      a <- stats::kmeans(x, nc)
      centers <- a$centers
      clustresult <- a$cluster
      size <- a$size
    }
    if (clustermethod == "cmeans") {
      a <- e1071::cmeans(x, nc)
      centers <- a$centers
      clustresult <- a$cluster
      size <- a$size
      res@mem <- a$mem
    }
    if (clustermethod == "bclust") {
      a <- e1071::bclust(x, nc)
      centers <- a$centers
      groesse <- rep(0, nc)
      for (i in seq(nc)) {
        groesse[i] <- length(which(a$cluster == i))
      }
      size <- groesse
      clustresult <- a$cluster
    }
    list(centers=centers, clustresult=clustresult, nc=nc)
  }
  prcompRob <- function(X, k=0, sca="mad", scores=TRUE) {
    n <- nrow(X)
    p <- ncol(X)
    if (k == 0) {
      p1 <- min(n, p)
    } else {
      p1 <- k
    }
    S <- rep(1, p1)
    V <- matrix(1:(p * p1), ncol=p1, nrow=p)
    P <- diag(p)
    m <- apply(X, 2, median)
    Xcentr <- scale(X, center=m, scale=FALSE)
    for (k in 1:p1) {
      B <- Xcentr %*% P
      Bnorm <- sqrt(apply(B^2, 1, sum))
      A <- diag(1/Bnorm) %*% B
      Y <- A %*% P %*% t(X)
      if (sca == "mad")
        s <- apply(Y, 1, mad)
      # if (sca == 'tau') s <- apply(Y, 1, scale.tau) if (sca == 'A') s <- apply(Y, 1,
      # scale.a)
      j <- order(s)[n]
      S[k] <- s[j]
      V[, k] <- A[j, ]
      if (V[1, k] < 0)
        V[, k] <- (-1) * V[, k]
      P <- P - (V[, k] %*% t(V[, k]))
    }
    if (scores) {
      list(scale=S, loadings=V, scores=Xcentr %*% V)
    } else list(scale=S, loadings=V)
  }

  # implementations of microaggregation methods
  micro_simple <- function(x, aggr, measure, trim) {
    index <- indexMicro(x, aggr)
    m <- means(x=x, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    return(list(x=x, method="simple", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=m, roundxm=mr, clustermethod=NULL, measure=measure, trim=trim,
      varsort=NULL, transf=NULL, blowup=TRUE, blowxm=blowxm, fot=0))
  }
  micro_single <- function(x, aggr, measure, trim, varsort) {
    sortvec <- sort(x[, varsort], index.return=TRUE)$ix
    xx <- x[sortvec, ]
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="single", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=m, roundxm=mr, clustermethod=NULL, measure=measure, trim=trim,
      varsort=varsort, transf=NULL, blowup=TRUE, blowxm=blowxm, fot=0))
  }
  micro_onedims <- function(x, aggr, measure, trim) {
    i <- dim(x)[2]
    xx <- sapply(1:i, function(i) {
      x[order(x[, i]), i]
    })
    xxx <- sapply(1:i, function(i) {
      rank(x[, i], ties.method="first")
    })
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    b <- blowup(x, m, aggr)
    y <- x
    for (i in 1:dim(x)[2]) {
      y[, i] <- b[xxx[, i], i]
    }
    return(list(x=x, method="onedims", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=m, roundxm=mr, clustermethod=NULL, measure=measure, trim=trim,
      varsort=NULL, transf=NULL, blowup=TRUE, blowxm=y, fot=0))
  }
  micro_pca <- function(x, aggr, measure, trim) {
    p <- stats::princomp(scale(x))
    s1 <- sort(p$scores[, 1], index.return=TRUE)$ix
    xx <- x[s1, ]
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="pca", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=m, roundxm=mr, clustermethod=NULL, measure=measure, trim=trim,
      varsort=NULL, transf=NULL, blowup=TRUE, blowxm=blowxm, fot=0))
  }
  micro_mcdpca <- function(x, aggr, measure, trim) {
    x.mcd <- cov.mcd(x, cor=TRUE)
    x.scale <- scale(x, x.mcd$center, sqrt(diag(x.mcd$cor)))
    p <- stats::princomp(x.scale, covmat=x.mcd)
    s1 <- sort(p$scores[, 1], index.return=TRUE)$ix
    xx <- x[s1, ]
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="mcdpca", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=m, roundxm=mr, clustermethod=NULL, measure=measure, trim=trim,
      varsort=NULL, transf=NULL, blowup=TRUE, blowxm=blowxm, fot=0))
  }
  micro_pppca <- function(x, aggr, measure, trim) {
    p <- prcompRob(x)
    s1 <- sort(p$scores[, 1], index.return=TRUE)$ix
    xx <- x[s1, ]
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="pppca", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=m, roundxm=mr, clustermethod=NULL, measure=measure, trim=trim,
      varsort=NULL, transf=NULL, blowup=TRUE, blowxm=blowxm, fot=0))
  }
  micro_influence <- function(x, aggr, measure, trim, clustermethod, transf, nc) {
    ac.scale <- clust(x=x, nc=nc, clustermethod=clustermethod, transf=transf)
    cent <- matrix(ac.scale$centers, ncol=nc, byrow=TRUE)
    j <- matrix(ncol=1, nrow=nc)
    vmax <- matrix(ncol=1, nrow=nc)
    for (i in 1:nc) {
      j[i, ] <- max(cent[, i])
      vmax[i, ] <- which(cent[, i] == j[i, ])
    }
    ncols <- c(1:ncol(x))
    xx <- list()
    for (i in 1:nc) {
      w <- which(ac.scale$clustresult == i)
      s <- x[w, , drop=FALSE]
      xx[[i]] <- s[order(s[, vmax[i]]), ]
    }
    yy <- NULL
    for (i in 1:nc) {
      yy <- rbind(yy, matrix(unlist(xx[[i]]), ncol=ncol(x), dimnames=list(rownames(xx[[i]]),
        colnames(xx[[i]]))))
    }
    xx <- yy
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(yy)
    return(list(x=x, method="influence", clustering=TRUE, aggr=aggr,
      nc=ac.scale$nc, xm=m, roundxm=mr, clustermethod=clustermethod,
      measure=measure, trim=trim, varsort=NULL, transf=transf, blowup=TRUE,
      blowxm=blowxm, fot=0))
  }
  micro_clustpca <- function(x, aggr, measure, trim, clustermethod, transf, nc) {
    ac.scale <- clust(x=x, nc=nc, clustermethod=clustermethod, transf=transf)
    cent <- matrix(ac.scale$centers, ncol=nc, byrow=TRUE)
    xx <- list()
    for (i in 1:nc) {
      w <- which(ac.scale$clustresult == i)
      if (length(w) < dim(x)[2]) {
        y <- x[w, , drop=FALSE]
        xx[[i]] <- y[order(y[, varsort]), ]
      } else {
        p <- stats::princomp(scale(x[w, , drop=FALSE]))$scores[, 1]
        psortind <- sort(p, index.return=TRUE)$ix
        y <- x[w, , drop=FALSE]
        xx[[i]] <- y[psortind, ]
      }
    }
    yy <- NULL
    for (i in 1:nc) {
      yy <- rbind(yy, matrix(unlist(xx[[i]]), ncol=ncol(x), dimnames=list(rownames(xx[[i]]),
        colnames(xx[[i]]))))
    }
    xx <- yy
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="clustpca", clustering=TRUE, aggr=aggr, nc=ac.scale$nc,
      xm=m, roundxm=mr, clustermethod=clustermethod, measure=measure,
      trim=trim, varsort=NULL, transf=transf, blowup=TRUE, blowxm=blowxm,
      fot=0))
  }
  micro_clustmcdpca <- function(x, aggr, measure, trim, clustermethod, transf, nc) {
    ac.scale <- clust(x=x, nc=nc, clustermethod=clustermethod, transf=transf)
    cent <- matrix(ac.scale$centers, ncol=nc, byrow=TRUE)
    xx <- list()
    for (i in 1:nc) {
      w <- which(ac.scale$clustresult == i)
      if (length(w) < dim(x)[2]) {
        y <- x[w, , drop=FALSE]
        xx[[i]] <- y[order(y[, varsort]), ]
      } else {
        message("length(w):", length(w), "\n")
        x.mcd <- cov.mcd(x[w, ], cor=TRUE)
        x.scale <- scale(x[w, ], x.mcd$center, sqrt(diag(x.mcd$cor)))
        p <- stats::princomp(x.scale, covmat=x.mcd)$scores[, 1]
        psortind <- sort(p, index.return=TRUE)$ix
        y <- x[w, , drop=FALSE]
        xx[[i]] <- y[psortind, ]
      }
    }
    yy <- NULL
    for (i in 1:nc) {
      yy <- rbind(yy, matrix(unlist(xx[[i]]), ncol=ncol(x), dimnames=list(rownames(xx[[i]]),
        colnames(xx[[i]]))))
    }
    xx <- yy
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="clustmcdpca", clustering=TRUE, aggr=aggr,
      nc=ac.scale$nc, xm=m, roundxm=mr, clustermethod=clustermethod,
      measure=measure, trim=trim, varsort=NULL, transf=transf, blowup=TRUE,
      blowxm=blowxm, fot=0))
  }
  micro_clustpppca <- function(x, aggr, measure, trim, clustermethod, transf, nc) {
    ac.scale <- clust(x=x, nc=nc, clustermethod=clustermethod, transf=transf)
    cent <- matrix(ac.scale$centers, ncol=nc, byrow=TRUE)
    xx <- list()
    for (i in 1:nc) {
      w <- which(ac.scale$clustresult == i)
      if (length(w) < dim(x)[2]) {
        y <- x[w, , drop=FALSE]
        xx[[i]] <- y[order(y[, varsort]), ]
      } else {
        p <- prcompRob(x[w, , drop=FALSE], 1)$scores
        psortind <- sort(p, index.return=TRUE)$ix
        y <- x[w, , drop=FALSE]
        xx[[i]] <- y[psortind, ]
      }
    }
    yy <- NULL
    for (i in 1:nc) {
      yy <- rbind(yy, matrix(unlist(xx[[i]]), ncol=ncol(x), dimnames=list(rownames(xx[[i]]),
        colnames(xx[[i]]))))
    }
    xx <- yy
    index <- indexMicro(xx, aggr)
    m <- means(x=xx, index=index, measure=measure, trim=trim)
    mr <- round(m)
    blowxm <- blowup(x, m, aggr)
    rownames(blowxm) <- rownames(xx)
    return(list(x=x, method="clustpppca", clustering=TRUE, aggr=aggr,
      nc=ac.scale$nc, xm=m, roundxm=mr, clustermethod=clustermethod,
      measure=measure, trim=trim, varsort=NULL, transf=transf, blowup=TRUE,
      blowxm=blowxm, fot=0))
  }
  micro_rmd <- function(x, aggr) {
    kn <- function(ds, aggr) {
      w <- rep(0, aggr)
      for (i in 1:aggr) {
        w[i] <- which.min(ds)
        ds[w[i]] <- NA
      }
      return(w)
    }
    y <- x
    cm <- colMeans(x, na.rm=TRUE)
    csd <- apply(x, 2, sd, na.rm=TRUE)
    len <- nrow(y)
    y <- apply(y, 2, function(x) (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE))

    d <- as.matrix(stats::dist(y))
    rr <- covMcd(y)
    md <- stats::mahalanobis(y, center=rr$center, cov=rr$cov)
    diag(d) <- 0

    for (i in 1:(floor(dim(x)[1]/aggr) - 1)) {
      s <- which.max(md)
      w <- kn(d[, s], aggr)
      d[w, ] <- NA
      md[w] <- NA
      y[w, ] <- rep(colMeans(y[w, ]), each=aggr)
    }
    w <- which(!is.na(d[, 1]))
    y[w, ] <- rep(colMeans(y[w, ]), each=length(w))
    for (i in 1:dim(x)[2]) {
      y[, i] <- as.numeric((y[, i] * csd[i]) + cm[i])
    }
    return(list(x=x, method="rmd", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=y, roundxm=round(y), clustermethod=NULL, measure=NULL, trim=NULL,
      varsort=NULL, transf=NULL, blowup=TRUE, blowxm=y, fot=0))
  }
  micro_mdav <- function(x, aggr) {
    resX <- mdav(x, variables=NULL, weights=NULL, K=aggr, missing=-999)
    return(list(x=x, method="mdav", clustering=FALSE, aggr=aggr, nc=NULL,
      xm=NULL, roundxm=NULL, clustermethod=NULL, measure="mean", trim=NULL,
      varsort=NULL, transf=NULL, blowup=FALSE, blowxm=resX, fot=0))
  }

  stopifnot(method %in% c("simple", "single", "onedims", "pca", "mcdpca", "pppca",
    "clustmcdpca", "clustpppca", "clustpca", "rmd", "mdav", "influence"))
  rownames(x) <- 1:nrow(x)

  if (length(variables) == 1) {
    res <- list()
    res$mx <- mafast(x, variables=variables, by=strata_variables, aggr=aggr,
      measure=eval(parse(text=measure)))
    res$x <- x
    res$method <- "mafast"
    res$aggr <- aggr
    res$measure <- measure
    res$fot <- factorOfTotals(x, aggr)
    class(res) <- "micro"
    return(res)
  }

  xall <- x
  if (!is.null(strata_variables)) {
    if (!all(strata_variables %in% colnames(x))) {
      stop("strata_variables are not found in the data set!")
    }
    byvar <- rep("", nrow(x))
    for (i in 1:length(strata_variables)) {
      byvar <- paste(byvar, x[, strata_variables[i]], sep="-")
    }
    xsp <- split(x, as.factor(byvar))
  } else {
    xsp <- list(dataset=x)
  }
  reslist <- list()
  for (spind in 1:length(xsp)) {
    x <- xsp[[spind]][, variables, drop=FALSE]
    if (method == "simple") {
      res <- micro_simple(x=x, aggr=aggr, measure=measure, trim=trim)
    }
    if (method == "single") {
      res <- micro_single(x=x, aggr=aggr, measure=measure, trim=trim, varsort=varsort)
    }
    if (method == "onedims") {
      res <- micro_onedims(x=x, aggr=aggr, measure=measure, trim=trim)
    }
    if (method == "pca") {
      res <- micro_pca(x=x, aggr=aggr, measure=measure, trim=trim)
    }
    if (method == "mcdpca") {
      res <- micro_mcdpca(x=x, aggr=aggr, measure=measure, trim=trim)
    }
    if (method == "pppca") {
      res <- micro_pppca(x=x, aggr=aggr, measure=measure, trim=trim)
    }
    if (method == "influence") {
      res <- micro_influence(x=x, aggr=aggr, measure=measure, trim=trim,
        clustermethod=clustermethod, transf=transf, nc=nc)
    }
    if (method == "clustpca") {
      res <- micro_clustpca(x=x, aggr=aggr, measure=measure, trim=trim,
        clustermethod=clustermethod, transf=transf, nc=nc)
    }
    if (method == "clustmcdpca") {
      res <- micro_clustmcdpca(x=x, aggr=aggr, measure=measure, trim=trim,
        clustermethod=clustermethod, transf=transf, nc=nc)
    }
    if (method == "clustpppca") {
      res <- micro_clustpppca(x=x, aggr=aggr, measure=measure, trim=trim,
        clustermethod=clustermethod, transf=transf, nc=nc)
    }
    if (method == "rmd") {
      res <- micro_rmd(x=x, aggr=aggr)
    }
    if (method == "mdav") {
      res <- micro_mdav(x, aggr)
    }
    res$fot <- factorOfTotals(x, aggr)
    reslist[[spind]] <- res
  }
  res <- reslist[[1]]
  if (length(reslist) > 1) {
    blowxm <- vector()
    fot <- vector()
    for (i in 1:length(reslist)) {
      blowxm <- rbind(blowxm, reslist[[i]]$blowxm)
      fot <- c(fot, reslist[[i]]$fot)
    }
    res$x <- xall
    res$blowxm <- blowxm
    names(fot) <- substring(names(xsp), 2)
    res$fot <- fot
  }

  res$x <- res$x[order(as.numeric(rownames(res$x))), ]
  res$blowxm <- res$blowxm[order(as.numeric(rownames(res$blowxm))), ]
  res$blowxm <- res$blowxm[1:nrow(xall), ]
  class(res) <- "micro"
  res$mx <- as.data.frame(res$blowxm)
  colnames(res$mx) <- variables
  resv <- c("x", "mx", "method", "aggr", "measure")
  res1 <- list()
  for (v in resv) {
    res1[[v]] <- res[[v]]
  }
  class(res1) <- "micro"
  return(res1)
}

#' Print method for objects from class micro
#'
#' printing an object of class \code{micro}
#'
#' @param x object from class micro
#' @param \dots Additional arguments passed through.
#' @return information about method and aggregation level from objects of class
#' micro.
#' @author Matthias Templ
#' @seealso \code{\link{microaggregation}}
#' @keywords print
#' @method print micro
#' @export
#' @examples
#'
#' data(free1)
#' free1 <- as.data.frame(free1)
#' m1 <- microaggregation(free1[, 31:34], method='onedims', aggr=3)
#' m1
#'
print.micro <- function(x, ...) {
  message(paste("\n Object created with method", x$method, "and aggregation level",
    x$aggr))
  message("\n -------------------------\n")
  message("x ... original values \n")
  print(summary(x$x))
  message("\n -------------------------\n")
  message("mx ... microaggregated values\n")
  print(summary(x$mx))
  message("\n -------------------------\n")

  message("Try names(your object from class micro) for more details")
  message("\n")
}

#' Summary method for objects from class micro
#'
#' Summary method for objects from class \sQuote{micro}.
#'
#' This function computes several measures of information loss, such as
#'
#' @param object objects from class micro
#' @param \dots Additional arguments passed through.
#' @return
#' \item{meanx }{A conventional summary of the original data}
#' \item{meanxm }{A conventional summary of the microaggregated data}
#' \item{amean }{average relative absolute deviation of means}
#' \item{amedian}{average relative absolute deviation of medians}
#' \item{aonestep }{average relative absolute deviation of onestep from median}
#' \item{devvar }{average relative absolute deviation of variances}
#' \item{amad }{average relative absolute deviation of the mad}
#' \item{acov }{average relative absolute deviation of covariances}
#' \item{arcov }{average relative absolute deviation of robust (with mcd) covariances}
#' \item{acor }{average relative absolute deviation of correlations}
#' \item{arcor }{average relative absolute deviation of robust (with mcd) correlations}
#' \item{acors }{average relative absolute deviation of rank-correlations}
#' \item{adlm }{average absolute deviation of lm regression coefficients (without intercept)}
#' \item{adlts}{average absolute deviation of lts regression coefficients (without intercept)}
#' \item{apcaload }{average absolute deviation of pca loadings}
#' \item{apppacaload }{average absolute deviation of robust (with projection pursuit approach) pca loadings}
#' \item{atotals }{average relative absolute deviation of totals}
#' \item{pmtotals }{average relative deviation of totals}
#' @author Matthias Templ
#' @seealso \code{\link{microaggregation}}, \code{\link{valTable}}
#' @references Templ, M. \emph{Statistical Disclosure Control for Microdata
#' Using the R-Package sdcMicro}, Transactions on Data Privacy, vol. 1, number
#' 2, pp. 67-85, 2008. \url{http://www.tdp.cat/issues/abs.a004a08.php}
#' @keywords print
#' @method summary micro
#' @export
#' @examples
#' data(Tarragona)
#' m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
#' \donttest{
#' summary(m1)
#' }
summary.micro <- function(object, ...) {
  prcompRob <- function(X, k=0, sca="mad", scores=TRUE) {
    ## Copyright: Croux and Filzmoser
    n <- nrow(X)
    p <- ncol(X)
    if (k == 0) {
      p1 <- min(n, p)
    } else {
      p1 <- k
    }
    S <- rep(1, p1)
    V <- matrix(1:(p * p1), ncol=p1, nrow=p)
    P <- diag(p)
    m <- apply(X, 2, median)
    Xcentr <- scale(X, center=m, scale=FALSE)
    for (k in 1:p1) {
      B <- Xcentr %*% P
      Bnorm <- sqrt(apply(B^2, 1, sum))
      A <- diag(1/Bnorm) %*% B
      Y <- A %*% P %*% t(X)
      if (sca == "mad")
        s <- apply(Y, 1, mad)
      # if (sca == 'tau') s <- apply(Y, 1, scale.tau) if (sca == 'A') s <- apply(Y, 1,
      # scale.a)
      j <- order(s)[n]
      S[k] <- s[j]
      V[, k] <- A[j, ]
      if (V[1, k] < 0)
        V[, k] <- (-1) * V[, k]
      P <- P - (V[, k] %*% t(V[, k]))
    }
    if (scores) {
      list(scale=S, loadings=V, scores=Xcentr %*% V)
    } else list(scale=S, loadings=V)
  }

  x1 <- as.data.frame(object$x)
  x2 <- as.data.frame(object$mx)
  colnames(x2) <- colnames(x1)
  amx <- mapply(mean, x1)
  amxn <- mapply(mean, x2)
  amean <- sum(abs(amx - amxn)/(abs(amx)))
  meds1 <- mapply(median, x1)
  meds2 <- mapply(median, x2)
  amedian <- sum(abs(meds1 - meds2) / abs(meds1), na.rm = TRUE)
  onestep <- function(x) {
    y <- x
    constant <- 3/1.486
    m1 <- mapply(median, x)
    m2 <- mapply(mad, x)
    limit1 <- m1 + constant * m2
    limit2 <- m1 - constant * m2
    for (i in 1:dim(x)[2]) {
      if (any(x[, i] > limit1[i])) {
        w <- which(x[, i] > limit1[i])
        le <- length(w)
        y[w, i] <- limit1[i]
      }
      if (any(x[, i] < limit2[i])) {
        w <- which(x[, i] < limit2[i])
        le <- length(w)
        y[w, i] <- limit2[i]
      }
    }
    y
  }
  aox <- onestep(x1)
  aox <- mapply(mean, aox)
  aoxm <- onestep(x2)
  aoxm <- mapply(mean, aoxm)
  aonestep <- sum(abs(aox - aoxm) / abs(aox), na.rm = TRUE)
  devvar <- sum(abs(var(x1) - var(x2))/abs(var(x1)))/length(x1)
  amx <- mapply(mad, x1)
  amxn <- mapply(mad, x2)
  amad <- sum(abs(amx - amxn) / (abs(amx)), na.rm = TRUE)
  acov <- sum(abs(cov(x1) - cov(x2))/abs(cov(x1)))/(2 * length(x1))
  arcov <- NA
  acor <- sum(abs(cor(x1) - cor(x2))/abs(cor(x1)))/(2 * length(x2))
  arcor <- NA
  acors <- sum(abs(cor(x1, method = "spearman") - cor(x2, method = "spearman")) /
          abs(cor(x1, method = "spearman"))) / (2 * length(x1))
  l1 <- lm(as.matrix(x1[, 1]) ~ as.matrix(x1[, -1]))$coeff
  l2 <- lm(as.matrix(x2[, 1]) ~ as.matrix(x2[, -1]))$coeff
  adlm <- sum(abs(l1[2:length(l1)] - l2[2:length(l2)]), na.rm = TRUE)
  adlts <- NA
  if (dim(x1)[1] > dim(x1)[2] && dim(x2)[1] > dim(x2)[2]) {
    p1 <- stats::princomp(x1)
    p2 <- stats::princomp(x2)
    cp1 <- colMeans(p1$load)
    cp2 <- colMeans(p2$load)
    apcaload <- sum(abs(cp1 - cp2)/abs(cp1))
  } else {
    apcaload <- "too less observations"
  }
  if (dim(x1)[1] > dim(x1)[2] && dim(x2)[1] > dim(x2)[2]) {
    p1 <- prcompRob(x1)
    p2 <- prcompRob(x2)
    cp1 <- colMeans(p1$load)
    cp2 <- colMeans(p2$load)
    apppcaload <- sum(abs(cp1 - cp2)/abs(cp1))
  } else {
    apppcaload <- "too less observations"
  }
  cmx1 <- apply(x1, 2, sum)
  cmx2 <- apply(x2, 2, sum) * object$fot
  atotals <- sum(abs((cmx1 - cmx2)/cmx1))
  pmtotals <- sum((cmx2 - cmx1)/cmx1)
  util1 <- dUtility(x1, x2)
  deigenvalues <- dUtility(x1, x2, method = "eigen")
  risk0 <- dRisk(x1, x2)
  r <- dRiskRMD(x1, x2, k = 0.7)
  risk1 <- r$risk1
  risk2 <- r$risk2
  wrisk1 <- r$wrisk1
  wrisk2 <- r$wrisk2
  list(
    meansx = summary(x1),
    meansxm = summary(x2),
    amean = amean,
    amedian = amedian,
    aonestep = aonestep,
    devvar = devvar,
    amad = amad,
    acov = acov,
    arcov = arcov,
    acor = acor,
    arcor = arcor,
    acors = acors,
    adlm = adlm,
    adlts = adlts,
    apcaload = apcaload,
    apppcaload = apppcaload,
    totalsOrig = cmx1,
    totalsMicro = cmx2,
    atotals = atotals,
    pmtotals = pmtotals,
    util1 = util1,
    deigenvalues = deigenvalues,
    risk0 = risk0,
    risk1 = risk1,
    risk2 = risk2,
    wrisk1 = wrisk1,
    wrisk2 = wrisk2)
}
sdcTools/sdcMicro documentation built on March 15, 2024, 12:32 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
sdcTools/sdcMicro
Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation

R/microaggregation.R
In sdcTools/sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation

Defines functions summary.micro print.micro microaggregationWORK microaggregation

Documented in microaggregation print.micro summary.micro

R Package Documentation

Browse R Packages

We want your feedback!

sdcTools/sdcMicro Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation

R/microaggregation.R In sdcTools/sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation

Defines functions summary.micro print.micro microaggregationWORK microaggregation

Documented in microaggregation print.micro summary.micro

R Package Documentation

Browse R Packages

We want your feedback!

sdcTools/sdcMicro
Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation

R/microaggregation.R
In sdcTools/sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation