Nothing
#' Create multi-resolution grids based on confidentiality or reliability restrictions
#'
#' Function that creates a multi-resolution grid with larger grid cells in
#' regions with lower resolution of data, or where data needs to
#' be anonymized for disclosure control reasons. The function can also be used
#' to create a grid of new variables, using an existing multi-resolution grid
#' as template.
#'
#' The possible restrictions that will lead to aggregation of a grid cell are:
#' \enumerate{
#' \item{ Frequency rule (Aggregate to reach a minimum number of counts)}
#' \item{ Dominance rule (Aggregate because of dominance by one or more units)}
#' \item{ Reliability rule (Aggregate because the uncertainty is too high)}
#' \item{ User defined rule (Aggregate because a grid cell does not respect a user defined criteria)}
#' }
#'
#'
#' @eval MRGparam("MRGinp")
#' @eval MRGparam("mincount")
#' @eval MRGparam("nlarge")
#' @eval MRGparam("plim")
#' @eval MRGparam("ifg")
#' @eval MRGparam("vars")
#' @eval MRGparam("weights")
#' @eval MRGparam("countFeatureOrTotal")
#' # @eval MRGparam("minpos")
#' @eval MRGparam("verbose")
#' @eval MRGparam("plim")
#' @eval MRGparam("domEstat")
#' @eval MRGparam("outfile")
#' @eval MRGparam("checkDominance")
#' @eval MRGparam("checkReliability")
#' @eval MRGparam("pseudoreg")
#' @eval MRGparam("userfun")
#' @eval MRGparam("strat")
#' @eval MRGparam("confrules")
#' @eval MRGparam("suppresslim")
#' @eval MRGparam("sumsmall")
#' @eval MRGparam("suppresslimSum")
#' @eval MRGparam("reliabilitySplit")
#' @eval MRGparam("plotIntermediate")
#' @eval MRGparam("addIntermediate")
#' @eval MRGparam("postProcess")
#' @eval MRGparam("rounding")
#' @eval MRGparam("remCols")
#' @eval MRGparam("ellipsis")
#'
#' @details This function will find the highest resolution data set that fulfills the
#' confidentiality rules and potential reliability rules for variable(s) of interest.
#' Starting with the second highest resolution (5 km in the default settings),
#' the function will check if any of the 1 km sub pixels will have values not fulfilling
#' any of the confidentiality rules (number of farms, values of the 2 largest compared to values of the
#' entire grid cell).
#' If all values are above the confidentiality limits, the grid cells will be kept at a 1 km resolution,
#' otherwise only the 5 km grid cell will be kept. This will again be tested against the confidentiality
#' rules in the next iteration, when grid cells will possibly be merged to 10 km grid cells.
#'
#' The function can also be called if it is necessary to create a grid of a new variable for the same
#' grid as an already existing variable. The confidentiality rules will then be applied to the new
#' variables for the existing grid cells, and mask the ones that do not respect the rules.
#' The function will not do any further merging
#' of grid cells, for this it is necessary to grid the variables together.
#' This feature is useful when the new data set has a similar resolution
#' as the original data set. It will give a high number of missing values if
#' the resolution of the new data is more sparse than the original. In the examples below,
#' this means that it is possible to copy the grid of organic organic
#' agricultural area to a grid of all agricultural area, whereas the opposite
#' will not work well.
#'
#' The standard threshold rule for spatial data is at least 10 units (mincount).
#' The parameters nlarge and plim are used for determining the dominance treatment for the variable of interest,
#' with default values of \code{nlarge = 2} and \code{plim = 0.85}.
#' If more than plim of the values of the grid cell (e.g. UAA, arable land, number of livestock)
#' is explained by 1-nlarge weighted holdings, the grid cell will not pass the confidentiality rule.
#'
#' The concept of reliability is explained in details in section 4.6 in the integrated farm survey handbook for 2023:
#' https://wikis.ec.europa.eu/display/IFS/Integrated+Farm+Statistics+Manual+%7C+2023+edition
#' In short, it is an estimate of the coefficient of variation for an estimate (a grid cell in this case),
#' based on the number in the sample relative to the number in the population, and taking into account
#' possible stratified sampling approaches. The number is zero if all holdings in the population in
#' a grid cell has been sampled, and the default requirement is that the CV is less than 35%.
#'
#' The computation can be time and memory intensive, particularly for the first iteration.
#' The method involves creation (and inversion) of a matrix of
#' size \code{nr*ng}, where \code{nr} is the number of records and \code{ng} is the number of grid cells.
#' it is therefore sometimes necessary to split the data set into smaller parts, to reduce
#' the computational challenges. The parameter \code{reliabilitySplit} is used for this.
#' It will split the area of interest into several subsets. This will have some impact
#' on the reliability calculations. The \code{reliabilitySplit} value might be set temporarily
#' higher for the first iterations, as it will also depend on the number of grid cells.
#'
#' Reliability cannot be calculated for records belonging to strata with only one record.
#' The function will therefore attempt to merge these into pseudostrata, if there is more than one
#' of these strata. The \code{pseudoreg}-parameter can be used
#' to define the regions within which the pseudostrata are created (for example NUTS2-region).
#' If there are still strata with only one record, these will cause a printed warning.
#'
#'
#' There are some cases where aggregation might not be desired. In the situation where a
#' relatively large single grid cell does not respect the confidentiality rules, it is fine to
#' aggregate it if the neighbouring grid cells are also relatively large. However, it can be seen
#' as unfortunate if the single cell was aggregated with many smaller grid cells that could otherwise
#' be disseminated at a high resolution. The added value of being able to present a value for a
#' region with very few farms is perhaps lower than what is lost by having to aggregate to a
#' lower resolution. The parameter \code{suppresslim} indicates the minimum value in a grid cell
#' relative to the possible lower resolution grid cell
#' before it is necessary to aggregate. If the limit is 0.05, a grid cell would only cause an aggregation
#' to lower resolution if the value in the grid cell is more than 5\% of the value in the lower resolution
#' grid cell. Instead, it would be left as it is, and will be suppressed in the post-processing step.
#'
#' There are cases when the built-in confidentiality checks are not what the user needs.
#' That is why it is possible to submit a user defined function. This function needs to follow
#' certain rules.
#' \enumerate{
#' \item The first argument must be a data.frame with name \code{df}.
#' This is a data.frame with the individual records
#' for a particular grid cell. It has three columns:
#' \enumerate{
#' \item himgid - the ID of the current grid cell. This is the grouping variable
#' and is constant for the data.frame
#' \item gridvar - a new common name for the current variable to be gridded
#' \item weight - the weight of the variable to be gridded
#' }
#' \item The function can include additional parameters for calculation of confidentiality
#' (or reliability, or suitability, if the meaning of the function refers to something else).
#' This can be new parameters to this particular function (through the
#' ellipsis argument (...) of \code{multiResGrid}), existing parameters to \code{multiResGrid},
#' or potentially internal variables of \code{multiResGrid.})
#' \item The result of the function must be a logical, either the rule was passed
#' for the records of this grid cell, or not (TRUE/FALSE)
#' \item The function can potentially use internal variables of \code{multiResGrid}, however,
#' the meaning of these will have to be understood from the code
#' }
#' A simple example of a \code{userfun} is given in the example section below (the one producing \code{himg6})
#'
#'
#'
#' @returns The function will return a multi-resolution grid with observations
#' gridded to different grid cell sizes according to the confidentiality rules
#' to be applied. It can also include some additional columns that indicates
#' which of the different confidentiality rules that have been applied.
#'
#' Note that the function might (if \code{postProcess = FALSE})
#' return values also for the confidential grid-cells.
#'
#' @examples
#' \donttest{
#' library(sf)
#' if (!require(ggplot2)) print("Plotting of results will not work
#' without installation of ggplot2")
#' if (!require(viridis)) print("Some of the plots will not work
#' without installation of viridis package")
#' if (!require(patchwork)) print("Some of the plots will not work
#' without installation of patchwork")
#'
#' if (require(giscoR)) {
#' useBorder = TRUE
#' } else {
#' useBorder = FALSE
#' print("You need to install giscoR for plotting borders and clipping the gridded maps")
#' }
#' # These are SYNTHETIC agricultural FSS data
#' data(ifs_dk) # Census data
#' ifs_weight = ifs_dk %>% dplyr::filter(Sample == 1) # Extract weighted subsample
#'
#' # Create spatial data
#' ifg = fssgeo(ifs_dk, locAdj = "LL")
#' fsg = fssgeo(ifs_weight, locAdj = "LL")
#'
#' if (useBorder) {
#' # Read country borders, only used for plotting
#' borders = gisco_get_nuts(nuts_level = 0)
#' dkb = borders[borders$CNTR_CODE == "DK",] %>% st_transform(crs = 3035)
#' }
#'
#' ress = c(1,5,10,20,40, 80, 160)*1000
#' # Gridding Utilized agricultural area (UAA)
#' ifl = gridData(ifg, "UAA",res = ress)
#' # Gridding organic utilized agricultural area
#' ifl2 = gridData(ifg, vars = "UAAXK0000_ORG", res = ress)
#'
#' # Gridding UAA and organic UAA together
#' ifl3 = gridData(ifg, vars = c("UAA", "UAAXK0000_ORG"), res = ress)
#'
#' # Gridding the UAA from the survey - the survey weights are in the column EXT_MODULE
#' fsl = gridData(fsg, vars = c("UAA"), weights = "EXT_MODULE", res = ress)
#'
#' # Create a multi-resolution grid only with farm number as confidentiality rule, then plot results
#' himg0 = multiResGrid(ifl, checkReliability = FALSE, suppresslim = 0)
#' ggplot(himg0) + geom_sf(aes(fill = count))
#'
#' # Create a multi-resolution grid of UAA, also based on the dominance rule (default)
#' himg1 = multiResGrid(ifl, vars = "UAA", ifg = ifg)
#' p1 = ggplot(himg1) + geom_sf(aes(fill = UAA))
#' p1
#' # Create multi-resolution grid of organic UAA
#' himg2 = multiResGrid(ifl2, vars = "UAAXK0000_ORG", ifg = ifg)
#' himg21 = multiResGrid(ifl2, vars = "UAAXK0000_ORG", ifg = ifg, postProcess = FALSE)
#'
#' ggplot(himg2) + geom_sf(aes(fill = UAAXK0000_ORG))
#'
#' # Create joint multi-resolution grid of organic UAA and total UAA
#' himg3 = multiResGrid(ifl3, vars = c("UAA", "UAAXK0000_ORG"), ifg = ifg,
#' checkReliability = FALSE, suppresslim = 0)
#' # Create multi-resolution grid of organic UAA, based on the UAA grid
#' # The large number of missing values indicates that this feature should
#' # mainly be used for data that have similar or higher resolution as the
#' # original data set.
#' himg33 = multiResGrid(himg1, vars = c("UAAXK0000_ORG"), ifg = ifg,
#' checkReliability = FALSE, suppresslim = 0)
#' p31 = ggplot(himg3) + geom_sf(aes(fill = UAA))
#' p32 = ggplot(himg3) + geom_sf(aes(fill = UAAXK0000_ORG))
#' p33 = ggplot(himg33) + geom_sf(aes(fill = UAAXK0000_ORG))
#' p31 + p32 + p33
#'
#' # Create multi-resolution grid of UAA, based on survey data,
#' # with and without applying reliability check
#' # This is a relatively slow functionality
#' # rounding is set to FALSE, to be better able to visualize the few records
#' # (Not recommended for data to be published)
#' himg4 = multiResGrid(fsl, vars = c("UAA"), weights = "EXT_MODULE", ifg = fsg,
#' strat = "STRA_ID_CORE", checkReliability = FALSE, rounding = FALSE)
#'# The parameter reliabilitySplit = 15 will divide the data set in 15 groups for the
#'# reliabilityCheck.
#'# A lower value would be recommended, but a high value speeds up the computation for this example
#' himg5 = multiResGrid(fsl, vars = c("UAA"), weights = "EXT_MODULE", ifg = fsg,
#' strat = "STRA_ID_CORE", checkReliability = TRUE,
#' reliabilitySplit = TRUE, rounding = FALSE, pseudoreg = "REGIONS")
#'
#'# Apply suppreslim to suppress insignificant grid cells
#'# Show intermediate maps of confidential cells (wait 5 seconds)
#' pint = ifelse(interactive(), 5, FALSE)
#' #himg11 = multiResGrid(ifl, vars = "UAA", ifg = ifg,
#' # suppresslim = 0, plotIntermediate = pint)
#' himg11 = himg1
#' himg12 = multiResGrid(ifl, vars = "UAA", ifg = ifg,
#' suppresslim = 0.02, plotIntermediate = pint)
#' himg13 = multiResGrid(ifl, vars = "UAA", ifg = ifg,
#' suppresslim = 0.05, plotIntermediate = pint)
#' himg14 = multiResGrid(ifl, vars = "UAA", ifg = ifg,
#' suppresslim = 0.1, plotIntermediate = pint)
#'
#'
#' # This is an example of a userfun that can be used for alternative restrictions
#' # for a grid cell. This particular toy example assures that there are at least
#' # \\code{nabove} records with a value (UAA in this case) above a certain "limit".
#' ufun = function(df, nabove, limit) {
#' sum(df$gridvar > limit) < nabove
#' }
#'
#' himg6 = multiResGrid(ifl, vars = "UAA", ifg = ifg,
#' suppresslim = 0.2, plotIntermediate = pint, userfun = ufun, nabove = 5, limit = 10)
#'
#'
#'
#' if (useBorder) himg00 = st_intersection(dkb, himg0) else himg00 = himg0
#' p00 = ggplot() + geom_sf(data = himg00, aes(fill = count, color = count)) +
#' scale_fill_viridis( name = "number of farms", trans = "log10") +
#' scale_color_viridis( name = "number of farms", trans = "log10") +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("Number of farms for variable grid cell size, only frequency confidentiality") +
#' theme_bw()
#' if (useBorder) p00 = p00 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#' p00
#'
#' if (useBorder) himg01 = st_intersection(dkb, himg1) else himg01 = himg1
#' p01 = ggplot() + geom_sf(data = himg01, aes(fill = count, color = count)) +
#' scale_fill_viridis( name = "number of farms", trans = "log10") +
#' scale_color_viridis( name = "number of farms", trans = "log10") +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("Number of farms for variable grid cell size, frequency and dominance confidentiality") +
#' theme_bw()
#' if (useBorder) p01 = p01 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#' p01
#'
#' # Plot the density of organic agriculture, as hectares per square km
#' if (useBorder)himg02 = st_intersection(dkb, himg2) else himg02 = himg2
#' himg02$orgarea = himg02$UAAXK0000_ORG/units::set_units(st_area(himg02), "km^2")
#' units(himg02$orgarea) = NULL
#' p02 = ggplot() + geom_sf(data = himg02, aes(fill = orgarea), lwd = 0) +
#' scale_fill_viridis( name = "ha / km2") +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("Organic UAA density") +
#' theme_bw()
#' if (useBorder) p02 = p02 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#' p02
#'
#' # Plot the relative abundance of organic UAA relative to total UAA
#' if (useBorder) himg03 = st_intersection(dkb, himg3) else himg03 = himg3
#' himg03$ouaashare = himg03$UAAXK0000_ORG/himg03$UAA*100
#' p03 = ggplot() + geom_sf(data = himg03, aes(fill = ouaashare), lwd = 0) +
#' scale_fill_viridis( name = "% Organic") +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("Organic share") +
#' theme_bw()
#' if (useBorder) p03 = p03 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#' p03
#'
#'
#' # Plot maps from survey data before and after adding the reliability constraint
#' # The percentage of UAA can be above 100% due to farm area being registered at the location
#' # of the administration building, but the map without reliability check has too high values
#' # for too many cells
#'
#' if (useBorder) himg04 = st_intersection(dkb, himg4) else himg04 = himg4
#' himg04$area = st_area(himg04)/1e6
#' units(himg04$area) = NULL
#' himg04$uaashare = himg04$UAA/himg04$area
#' himg04$uaashare[himg04$uaashare > 1000] = 1000
#' p04 = ggplot() + geom_sf(data = himg04, aes(fill = uaashare), lwd = 0) +
#' scale_fill_viridis( name = "% UAA", trans = "log10", limits = c(1,1000)) +
#' geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("UAA share (sample without reliability check)") +
#' theme_bw()
#' if (useBorder) p04 = p04 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#' p04
#'
#' if (useBorder) himg05 = st_intersection(dkb, himg5) else himg05 = himg5
#' himg05$area = st_area(himg05)/1e6
#' units(himg05$area) = NULL
#' himg05$uaashare = himg05$UAA/himg05$area
#' himg05$uaashare[himg05$uaashare > 1000] = 1000
#' p05 = ggplot() + geom_sf(data = himg05, aes(fill = uaashare), lwd = 0) +
#' scale_fill_viridis( name = "% UAA", trans = "log10", limits = c(1,1000)) +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("UAA share (sample with reliability check)") +
#' theme_bw()
#' if (useBorder) p05 = p05 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#'
#' if (require(patchwork)) p04 + p05 + plot_layout(guides = "collect")
#'
#' if (useBorder) himg06 = st_intersection(dkb, himg6) else himg06 = himg6
#' p06 = ggplot() + geom_sf(data = himg06, aes(fill = UAA), lwd = 0) +
#' scale_fill_viridis( name = "ha") +
#' coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
#' ggtitle("UAA, with additional user defined function") +
#' theme_bw()
#' if (useBorder) p06 = p06 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1)
#' p06
#'
#'
#' # Plot the different maps from using different suppreslim values
#' himgs = list(himg11, himg12, himg13, himg14)
#' slims = c(0, 0.02, 0.05, 0.1, 0.2)
#' plots = list()
#' uaas = c(himg11$UAA, himg12$UAA, himg13$UAA, himg14$UAA)
#' lims = range(uaas[uaas > 0], na.rm = TRUE)
#' for (ii in 1:4) {
#' if (useBorder) himg = st_intersection(dkb, himgs[[ii]]) else himg = himgs[[ii]]
#' plots[[ii]] =
#' ggplot() + geom_sf(data = himg, aes(fill = UAA), lwd = 0) +
#' scale_fill_viridis( name = "UAA (ha)", trans = "log10", limits = lims, na.value="red") +
#' ggtitle(paste("Suppresslim = ", slims[[ii]])) +
#' xlab("") + ylab("") +
#' theme_bw()
#' if (useBorder) plots[[ii]] = plots[[ii]] +
#' geom_sf(data = dkb, fill = NA, colour='black', lwd = 0.5)
#' }
#'
#' if (require(patchwork)) plots[[1]] + plots[[2]] + plots[[3]] + plots[[4]] +
#' plot_layout(guides = "collect")
#'
#'
#' }
#'
#' #' @rdname multiResGrid
#' @export
multiResGrid <- function(MRGinp, ...) UseMethod("multiResGrid")
#'
#'
#' @rdname multiResGrid
#' @export
multiResGrid.MRG <- function(MRGinp, ...) {
dots = list(...)
#' @importFrom utils modifyList
if (length(dots) > 0) MRGinp = modifyList(MRGinp, dots)
do.call(multiResGrid, MRGinp)
}
#'
#' @rdname multiResGrid
#' @export
multiResGrid.sf <- function(MRGinp, ..., ifg, vars) {
if (missing(ifg) | missing(vars)) stop("Both ifg and vars (as named variables) are necessary for the data.frame method of multiResGrid")
dots = c(list(...), list(ifg = ifg, vars = vars))
#' @importFrom utils modifyList
MRGobject = list(list(MRGinp = MRGinp))
MRGobject = modifyList(MRGobject, dots)
do.call(multiResGrid, MRGobject)
}
#'
#'
#'
#' @rdname multiResGrid
#' @export
multiResGrid.list <- function(MRGinp, ifg, vars, weights, countFeatureOrTotal = "feature", mincount = 10, #minpos = 4,
nlarge = 2,
plim = 0.85, verbose = FALSE, domEstat = TRUE,
outfile = NULL, checkDominance = TRUE,
checkReliability = FALSE, userfun, strat = NULL, confrules = "individual",
suppresslim = 0, sumsmall = FALSE, suppresslimSum = NULL,
reliabilitySplit = TRUE, pseudoreg = NULL,
plotIntermediate = FALSE, addIntermediate = FALSE,
postProcess = TRUE, rounding = -1, remCols = TRUE, ...) {
# To avoid R CMD check notes
hsum = wsum = www = small = weight = data = himgid = dominance = . = NULL
if (!missing(ifg) && !inherits(ifg, "sf")) stop("ifg is not an sf-object ")
if (length(MRGinp) > 1) {
if (!inherits(MRGinp[[1]], "sf")) cat("MRGinp is not a list of sf-objects \n")
if (!"res" %in% names(MRGinp[[1]])) cat("MRGinp does not have a column with resolutions \n")
ress = unlist(lapply(MRGinp, FUN = function(MRGinpl) MRGinpl$res[1]))
} else {
if (!inherits(MRGinp, "sf")) cat("MRGinp is not an sf-object \n")
ress = 0
}
if (checkReliability) {
if (missing(strat) | is.null(strat) ) {
if (!"strat" %in% names(ifg)) ifg$strat = 1
#' @importFrom dplyr mutate group_by
} else ifg = ifg %>% mutate(strat = .data[[strat]])
if (missing(pseudoreg) | is.null(pseudoreg)) {
if (!"pseudoreg" %in% names(ifg)) ifg$pseudoreg = 1
} else ifg = ifg %>% mutate(pseudoreg = .data[[pseudoreg]])
}
if (!missing(vars)) {
if (missing(ifg))
stop(paste("Cannot create values for variable(s) ",
vars, " without ifg"))
# Some kind of test needed to check if this step has already been done, and also if it is necessary
# for (iw in 1:length(vars)) ifg[, paste0(vars, iw)] = data.frame(ifg)[, vars[iw]]
if (!"ID" %in% names(ifg)) ifg$ID = 1:dim(ifg)[1]
if (!length(grep(paste0("weight_", vars, collapse = "|"), names(ifg))) == length(vars)) ifg = addweights(ifg, vars, weights)
for (iw in 1:length(vars)) {
ifg[, paste0("gridvar", iw)] = st_drop_geometry(ifg[, vars[iw]])
ifg[, paste0("weight", iw)] = st_drop_geometry(ifg[, paste0("weight_", vars[iw])])
}
if (checkReliability) {
ifg = ifg[, c("ID", paste0("gridvar", 1:length(vars)), paste0("weight", 1:length(vars)), "strat", "pseudoreg")]
} else {
ifg = ifg[, c("ID", paste0("gridvar", 1:length(vars)), paste0("weight", 1:length(vars)))]
}
}
if (length(MRGinp) == 1) {
loh = NULL
MRGinp = MRGinp[[1]]
if (!"ID" %in% names(MRGinp)) MRGinp$ID = 1:dim(MRGinp)[1]
if (!"ID" %in% names(ifg)) ifg$ID = 1:dim(ifg)[1]
ifg = ifg %>% mutate(himgid = st_join(., MRGinp, join = st_within)$ID.y)
himg = MRGinp[, c("ID", "res")]
#' @importFrom dplyr summarize count
himg = himg %>% mutate(count = st_drop_geometry(ifg) %>% group_by(himgid) %>% summarize(count = n()) %>% select(count) %>% pull)
if (missing(weights) || is.null(weights)) weights = 1
if (weights == 1) himg$countw = himg$count
if (!is.null(vars)) {
if (length(vars) > 0 & length(weights) == 1) weights = rep(weights, length(vars))
for (ivar in 1:length(vars)) {
ww = paste0("weight", ivar)
vv = paste0("gridvar", ivar)
vvv = vars[ivar]
#' @importFrom rlang :=
if (!is.null(ww) & ww != 1) ifg$wsum = (st_drop_geometry(ifg[,vv]) %>% pull)*(st_drop_geometry(ifg[,ww]) %>% pull) else ifg$gridvar = st_drop_geometry(ifg[,vv]) %>% pull
himg = himg %>% mutate(!!vvv := st_drop_geometry(ifg) %>% group_by(himgid) %>% summarize(vvv = sum(wsum)) %>% select(vvv) %>% pull) %>%
mutate(!!ww := st_drop_geometry(ifg) %>% group_by(himgid) %>% summarize(www = sum(.data[[ww]])) %>% select(www) %>% pull)
}
}
} else {
himg = MRGinp[[1]]
if (missing(vars)) {
vvars = NULL
wweights = NULL
} else {
vvars = vars
wweights = paste0("weight_", vars)
}
hcols = which(names(himg) %in% c("ID", "res", "count", "countw", "geometry", vvars, wweights))
himg = himg[,hcols]
}
himg = himg %>% mutate(confidential = FALSE, reliability = FALSE, small = FALSE,
freq = FALSE, dom = FALSE, ufun = FALSE)
himgs = list()
lohs = list()
if (!missing(vars)) for (ivar in 1:length(vars)) himg[,paste0("vres", ivar)] = 0
for (ires in 2:(length(ress) + 1)) {
if (verbose) cat("Creating multi-resolution grid, iteration ", ires, "\n")
lres = ress[ires]
if (ires <= length(ress)) {
limg = MRGinp[[ires]]
lcols = which(names(limg) %in% c("ID", "res", "count", "countw", "geometry", vvars, wweights))
limg = limg[,lcols]
limg = limg %>% mutate(confidential = FALSE, reliability = FALSE, small = FALSE,
freq = FALSE, dom = FALSE, ufun = FALSE)
if (!missing(vars)) for (ivar in 1:length(vars)) limg[,paste0("vres", ivar)] = 0
}
## #' @importFrom utils txtProgressBar setTxtProgressBar
#' @importFrom sf st_join st_within st_drop_geometry st_make_grid st_coordinates st_crs st_area
if (!missing(ifg) && !is.null(ifg) && ires <= length(ress)) {
ifg$himgid = st_join(ifg, himg, join = st_within)$ID.y
ifg$limgid = st_join(ifg, limg, join = st_within)$ID.y
ifgl = data.frame(ifg)[, grep("ID|himgid|gridvar|weight|strat", names(ifg))]
} else {
ifgl = NULL
if (!missing(ifg)) {
ifg$himgid = st_join(ifg, himg, join = st_within)$ID.y
ifgl = data.frame(ifg)[, grep("ID|himgid|gridvar|weight|strat", names(ifg))]
}
}
if (ires <= length(ress)) loh = st_drop_geometry(st_join(himg, limg, join = st_within))
if (!missing(vars) & ires <= length(ress)) {
for (ivar in 1:length(vars)){
if (tolower(countFeatureOrTotal) == "feature") {
# sel = (loh[[paste0(vars[ivar], "_w", ivar, ".x")]] <
# suppresslim*loh[[paste0(vars[ivar], "_w", ivar, ".y")]]) & loh[[paste0("weight",ivar, ".x")]] < mincount
sel = (loh[[paste0(vars[ivar], ".x")]] <
suppresslim*loh[[paste0(vars[ivar], ".y")]]) & loh[[paste0("weight_",vars[ivar], ".x")]] < mincount
# } else sel = (loh[[paste0(vars[ivar], "_w", ivar, ".x")]] <
# suppresslim*loh[[paste0(vars[ivar], "_w", ivar, ".y")]]) & loh[["countw.x"]] < mincount
} else sel = (loh[[paste0(vars[ivar], ".x")]] <
suppresslim*loh[[paste0(vars[ivar], ".y")]]) & loh[["countw.x"]] < mincount
if (sumsmall & sum(sel) == 0) {
sshare = data.frame(Group.1 = 999999, x = 9999999)
} else if (sumsmall) {
#' @importFrom stats aggregate
# sshare = aggregate(loh[[paste0(vars[ivar], "_w", ivar, ".x")]][sel], by = list(loh$ID.y[sel]), sum)
sshare = aggregate(loh[[paste0(vars[ivar], ".x")]][sel], by = list(loh$ID.y[sel]), sum)
loh$sshare = sshare$x[match(loh$ID.y, sshare$Group.1)]
# loh$sshare[is.na(loh$sshare)] = max(loh[[paste0(vars[ivar], "_w", ivar, ".y")]])
loh$sshare[is.na(loh$sshare)] = max(loh[[paste0(vars[ivar], ".y")]])
# loh[[paste("hsmall", ivar)]] = loh$sshare < suppresslimSum*loh[[paste0(vars[ivar], "_w", ivar, ".y")]]
loh[[paste("hsmall", ivar)]] = loh$sshare < suppresslimSum*loh[[paste0(vars[ivar], ".y")]]
} else {
loh[[paste("hsmall", ivar)]] = sel
}
}
#' @importFrom tidyselect contains
himg$small = loh %>% select(contains("hsmall")) %>% mutate(hsum = rowSums(.)) %>%
mutate(small = (hsum == length(vars))) %>% select(small) %>% pull
}
himgdat = st_drop_geometry(himg)
ifgdat = st_drop_geometry(ifgl)
if (tolower(countFeatureOrTotal) == "feature" & !missing(vars)) {
ww = himgdat[,names(himgdat) %in% paste0("weight_", vars), drop = FALSE]
ww = apply(ww, MARGIN = 1, FUN = function(x) if (sum(x > 0)) min(x[x>0]) else 0)
} else ww = himgdat[, "countw"]
wf = which(ww > 0 & ww < mincount)
if (length(wf) > 0) himg$freq[which(himg$ID %in% himgdat$ID[wf])] = TRUE
len = ifelse(missing(vars), 1, length(vars))
ehimgid = ufres = NULL
for (ivar in 1:len) {
ifgdatl = NULL
if (checkDominance & !missing(vars)) {
if (verbose) cat("Checking dominance \n")
ifgdatl <- ifgdat[,c("himgid", paste0("gridvar", ivar), paste0("weight",ivar))]
names(ifgdatl) = c("himgid", "gridvar", "weight")
ifgdatl$ehimgid = ifgdatl$himgid
ifgdatl = ifgdatl[order(ifgdatl$ehimgid),]
#' @importFrom tidyr unnest nest
#' @importFrom purrr map
dom = ifgdatl %>% filter(weight != 0) %>%
group_by(ehimgid) %>% nest() %>%
mutate(dominance = map(data, ~dominanceRule(., nlarge = nlarge, plim = plim,
domEstat = domEstat))) %>%
unnest(dominance) %>% ungroup %>% select(dominance) %>% pull
domid = which(dom)
if (length(domid) > 0) himg$dom[domid] = TRUE
}
if (!missing(userfun) && is.function(userfun)) {
if (verbose) cat("Checking userfun \n")
if (is.null(ifgdatl)) ifgdatl <- ifgdat[,c("himgid", paste0("gridvar", ivar),
paste0("weight",ivar))]
ifgdatl$ehimgid = ifgdatl$himgid
ifgdatl = ifgdatl[order(ifgdatl$ehimgid),]
dots = list(...)
# The next 14 lines are based on an answer on StackOverflow:
# https://stackoverflow.com/questions/78647845/using-purrrmap-with-a-user-defined-function-how-to-pass-arguments
# by the user Nir Graham (userid: 11726436)
# licensed by StackOverlow under CC BY-SA 4.0
fargs = names(formals(userfun))
if ("hareas" %in% fargs) hareas = st_area(himg)
passed_in <- setdiff(intersect(ls(),fargs),"df")
names(passed_in) <- passed_in
args_to_pass <- map(passed_in, dynGet)
ddots = dots[names(dots) %in% fargs]
args_to_pass <- c(args_to_pass,ddots)
localUserFun = function(subdata, args_to_pass) {
local_args <- c(list("df" = subdata), args_to_pass)
do.call(userfun, args=local_args)}
ufRes = ifgdatl %>%
group_by(ehimgid) %>% nest() %>%
mutate(ufres = map(data, ~localUserFun(., args_to_pass))) %>%
unnest(ufres) %>% ungroup %>% select(ufres) %>% pull
ufid = which(ufRes)
if (length(ufid) > 0) himg$ufun[ufid] = TRUE
}
}
if (checkReliability) {
if (verbose) cat("Checking reliability \n")
rsplit = reliabilitySplit
if (reliabilitySplit & (dim(ifg)[1] > 50000 | dim(himg)[1] > 1000)) {
if (is.logical(reliabilitySplit)) rsplit = dim(ifg)[1] %/% 30000
if (dim(himg)[1] > rsplit*1000) rsplit = dim(himg)[1] %/% 1000
}
if (!missing(vars) && !is.null(vars)){
for (ivar in 1:length(vars)){
nhimg = dim(himg)[1]
vestres = mrg_varestim(ifg, var = paste0("gridvar", ivar), strat = "strat", PSU = "ID",
weight = paste0("weight", ivar), split = rsplit, pseudoreg = "pseudoreg",
verbose = verbose, nhimg = nhimg)
himg[,paste0("vres",ivar)] = vestres
}
}
nonvalids = suppressWarnings(which(apply(st_drop_geometry(himg[, grep("vres", names(himg))]), 1, max, na.rm = TRUE) > 0.35))
himg$reliability[nonvalids] = TRUE
}
himg$confidential = rowSums(st_drop_geometry(himg[,c("freq", "dom", "ufun", "reliability")])) > 0
if (verbose) cat("Finished checks, updating multi-resoluion grid \n")
if (ires <= length(ress)) {
loh$confidential = himg$confidential
loh$small = himg$small
idRem = which(himg$confidential & !himg$small)
# Compute how many non-confidential himg-cells there are per limg-cell
nclimg = aggregate(!loh$confidential, by = list(loh$ID.y), sum)
loh$nclimg = nclimg$x[loh$ID.y]
# Add cells to the aggregation list if none of the hing-cells in a limg-cell
# are non-confidential, even if tthey are are small
idAdd = NULL
idRem = unique(c(idRem, which(loh$small & loh$nclimg == 0)))
if (length(idRem) > 0) {
idAdd = which(limg$ID %in% unique(loh$ID.y[loh$ID.x %in% idRem]))
# Check how many himg-cells per limg-cell. Do not aggregate if it is only 1
iac = aggregate(rep(1, length(loh$ID.x)), by = list(loh$ID.y), FUN = sum)
singlimg = iac$Group.1[iac$x == 1]
if (length(singlimg) > 0) {
idAdd = idAdd[!(idAdd %in% singlimg)]
himg$singlimg = FALSE
himg$singlimg[loh$ID.y %in% singlimg] = TRUE
}
if ("singlimg" %in% names(himg)) limg$singlimg = FALSE
# Find all himg-cells in the limg-cells to be added
idRem = which(himg$ID %in% unique(loh$ID.x[loh$ID.y %in% idAdd]))
if (length(idRem) == 0) break
remIdRem = loh$ID.x[loh$ID.y %in% singlimg]
if (length(remIdRem) > 0) {
if (sum(idRem %in% remIdRem) > 0) {
stop(paste("There are still single himg-cells to be deleted from limg-cells
This should not happen", sum(idRem %in% remIdRem)))
}
idRem = idRem[!(idRem %in% remIdRem)]
}
himg = himg[-idRem,]
if (verbose > 1) {
cat("names(himg):", names(himg), "\n")
cat("names(limg):", names(limg), "\n")
cat("length(idAdd)", length(idAdd), "\n")
cat("sum(himg$singlimg", sum(himg$singlimg), "\n")
}
himg = rbind(himg, limg[idAdd,])
}
} else {idRem = NULL; idAdd = NULL}
# Reindex himg, and add himg and loh to the lists
himg$ID = 1:dim(himg)[1]
himgs[[ires]] = himg
lohs[[ires]] = loh
if (ires <= length(ress)) print(paste("ires", ires, ress[ires], "#himg-cells:", dim(himg)[1], "; removed:",
length(idRem), "; added:", length(idAdd), "; confidential:", sum(himg$confidential) ))
if (plotIntermediate) {
plot(himg[, "confidential"], main = "Confidential cells")
Sys.sleep(ifelse(is.logical(plotIntermediate), 5, plotIntermediate))
}
}
if ("gridvar" %in% names(himg)) himg = himg[, -grep("gridvar", names(himg))]
if (!missing(vars)) attr(himg, "vars") = vars
if (postProcess) {
himg = MRGpostProcess(himg, vars, remCols, rounding)
} else {
attr(himg, "remCols" ) = remCols
attr(himg, "rounding") = rounding
}
if (addIntermediate) {
attr(himg, "himgs") = himgs
attr(himg, "lohs") = lohs
}
himg
}
mrg_varestim <- function(x, var, strat, PSU, weight, split, verbose, pseudoreg, nhimg){
t0 = proc.time()[3]
ID = n = hld = w_sum = wdiff = NULL
nx = dim(x)[1]
himgids = unique(x$himgid)
if (inherits(x, "sf")) x = st_drop_geometry(x)
if (!missing(PSU)) x$ID = x[[PSU]]
icor = 0
if (missing(strat) || is.null(strat)) {
x$strat = 1
} else {
if (!strat %in% names(x)) stop(paste(strat, "is missing from from the data.frame"))
x$strat = x[[strat]]
}
# Check if any grid cells only have unit value weights
tt = st_drop_geometry(x) %>% group_by(himgid) %>%
summarise(wdiff = sum(abs(.data[[weight]] - 1), na.rm = T)) %>% filter(wdiff < 0.1)
if (dim(tt)[1] > 0) {
x = x[!(x$himgid %in% tt$himgid), ]
}
t1 = proc.time()[3]
if (verbose > 1) cat("Varestim - finished preprocessing - t= ", round(t1-t0,2), "secs \n")
if (dim(x)[1] == 0) {
out_var = data.frame(himgid = himgids, rse = 0)
} else {
if (split == 1) {
df = x
out_var = vardom(dataset = df, Y= var, H = "strat",
PSU = "ID",
w_final = weight, Dom = "himgid")$all_result
t21 = proc.time()[3]
if (verbose > 1) cat("varestim - finished single split vardom estimation in ", round(t21-t1, 2), "secs \n")
} else {
himgid <- unique(x$himgid)
#' @importFrom dplyr left_join mutate ungroup group_by distinct case_when n
#' @importFrom sjmisc split_var
df_cl <- left_join(x, data.frame(himgid = himgid, cluster = split_var(himgid, n = split)), by = "himgid")
out_var <- NULL
t20 = proc.time()[3]
if (verbose > 1) cat("varestim - will split reliability calcs in ", split, "cases\n")
for (isp in 1:split){
t21 = proc.time()[3]
df = x[which(df_cl$cluster == isp),]
if (verbose) print(paste("reliabilitySplit: ", isp,
"- Number of records: ", paste(dim(df)[1],
" - Number of unique IDs: ", length(unique(df$himgid)))))
icor = icor + 1
t <- df %>% group_by(strat) %>%
summarise(hld=n(), w_sum = sum(.data[[weight]], na.rm = T)) %>%
filter(hld == 1 & w_sum > 1) %>% ungroup
if (!is.null(pseudoreg)) pcor = as.numeric(as.factor(df[[pseudoreg]])) else pcor = 0
if (dim(t)[1] > 0){
h_st <- t %>% distinct(strat) %>% pull()
df <- df %>% mutate(strat = case_when(strat %in% c(h_st)~(99999-pcor),
T ~ strat))}
#' @importFrom vardpoor vardom
t22 = proc.time()[3]
if (verbose > 1) cat("varestim - ready to call vardom for split ", isp, "of", split, "time:", round(t22-t21,2), "secs\n")
est <- vardom(dataset = df, Y= var, H = "strat",
PSU = "ID",
w_final = weight, Dom = "himgid")
t23 = proc.time()[3]
if (verbose > 1) cat("varestim - ready to call vardom for split ", isp, "of", split, "time:", round(t23-t22,2), "secs\n")
out_var<-rbind(out_var, est$all_result)
}
}
if (verbose) cat("varestim - finished looping in totally ", round(proc.time()[3]-t1,2), "secs \n")
if (icor > 0) {
cat(icor, "of the subsets included strata with only one record. \n",
"You might want to check the strata or consider a lower value for reliabilitySplit.")
}
out_var$himgid = as.numeric(out_var$himgid)
if (dim(tt)[1] > 0) out_var = rbind(out_var[,c("himgid", "rse")], data.frame(himgid = tt$himgid, rse = 0))
out_var = out_var[order(out_var$himgid),]
}
if (!missing(nhimg) && nhimg != dim(out_var)[1]) stop("The dimension of out_var does not match the dimension of himg")
if (sum(duplicated(out_var$himgid)) > 0) stop("There are duplicated himgids in out_var")
if (verbose > 1) cat("Finished mrg_varestim")
out_var$rse
}
dominanceRule = function(ifglldat, nlarge, plim, domEstat = TRUE) {
#' @importFrom dplyr summarise
Y <- sum(ifglldat$gridvar*ifglldat$weight)
ifglldat <- ifglldat[order(ifglldat$gridvar, ifglldat$weight, decreasing = TRUE),]
if (domEstat) {
# Extrapolated aggregated value of the cell Y
# Need to loop to account for different nlarge values, even though nlarge=2 is the standard rule
dominance = FALSE
nlarge = min(dim(ifglldat)[1], nlarge)
for (nc in 1:nlarge){
#' @importFrom magrittr "%>%"
#' @importFrom dplyr slice select pull arrange ungroup
#' @importFrom rlang .data
# wmax= nc largest contributor
# wmax<-ifglldat %>% slice(1:nc) %>% select(.data$weight) %>% pull()
wmax<-ifglldat$weight[1:nc]
# This should only round the weights that are above 0.5, to avoid creation of zero weights
# This should avoid any issues due to low weights, such as
# wmax = c(0.1, 0.4, 0.7, 1.2, 1.4)
wmaxr<- ifelse(wmax > 0.5, round(wmax), wmax)
# wmaxr<-wmax %>% ifelse(.data > 0.5, round(.data), .data) #JON: Separated wmax from wmaxr, as only the second should be rounded
# xmax nc largest values of the variable
# xmax<-ifglldat %>% slice(1:nc) %>% select(.data$gridvar) %>% pull()
xmax<-ifglldat$gridvar[1:nc]
# sum of n largest contributor is <= nlarge and aggregated extrapolated value of n largest contributor is
# greater than 85% of the extrapolated aggregated value of that cell (Y)
if(sum(wmaxr)<=nlarge*1.01 && sum(wmax*xmax)>plim*Y){
# No need to continue the loop if dominance is already TRUE
# print(paste(ifglldat$ehimgid[1], Y, ifglldat$gridvar[1]))
return(dominance = TRUE)
} # JON: Not including "else dominance = FALSE" here, it is not necessary.
# Also, there might be a case where dominance = TRUE for nc = 1, but not for nc = 2
}
} else {
weight = ifglldat$weight
dat = ifglldat$gridvar
wmax = 0
xmax = 0
for (ii in 1:length(dat)) {
lw = min(weight[ii], nlarge-wmax)
xmax = xmax + dat[ii]*lw
wmax = wmax + lw
if (wmax > 0.999*nlarge) break # just to avoid potential numerical issues (FAQ 7.31) from sumWeight >= nlarge
}
if(sum(wmax) * 0.999 <=nlarge && xmax > plim*Y$total){
dominance = TRUE
} else dominance = FALSE
}
dominance
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.