multiResGrid: Create multi-resolution grids based on confidentiality or...

View source: R/multiResGrid.R

multiResGridR Documentation

Create multi-resolution grids based on confidentiality or reliability restrictions

Description

Function that creates a multi-resolution grid with larger grid cells in regions with lower resolution of data, or where data needs to be anonymized for disclosure control reasons. The function can also be used to create a grid of new variables, using an existing multi-resolution grid as template.

Usage

multiResGrid(MRGinp, ...)

## S3 method for class 'MRG'
multiResGrid(MRGinp, ...)

## S3 method for class 'sf'
multiResGrid(MRGinp, ..., ifg, vars)

## S3 method for class 'list'
multiResGrid(
  MRGinp,
  ifg,
  vars,
  weights,
  countFeatureOrTotal = "feature",
  mincount = 10,
  nlarge = 2,
  plim = 0.85,
  verbose = FALSE,
  domEstat = TRUE,
  outfile = NULL,
  checkDominance = TRUE,
  checkReliability = FALSE,
  userfun,
  strat = NULL,
  confrules = "individual",
  suppresslim = 0,
  sumsmall = FALSE,
  suppresslimSum = NULL,
  reliabilitySplit = TRUE,
  pseudoreg = NULL,
  plotIntermediate = FALSE,
  addIntermediate = FALSE,
  postProcess = TRUE,
  rounding = -1,
  remCols = TRUE,
  ...
)

Arguments

MRGinp

Either an MRGobject (from a call to createMRGobject) or a list of gridded data with different resolutions (from a call to gridData or a gridded sf-object (typically from an earlier call to multiResGrid)

...

Possible arguments to underlying functions

ifg

Either a data.frame or tibble or sf-object with the locations and the data of the survey or census data, or a list of such objects.

vars

Variable(s) of interest that should be aggregated (necessary when ifg is used for individual farm specific anonymization rules)

weights

Extrapolation factor(s) (weights) wi of unit i in the sample of units nc falling into a specific cell c. Weights are used for disclosure control measures. A weight of 1 will be used if missing. If only one weight is given, it will be used for all variables. If the length is more than one, the length has to be equal to the number of variables. If the same weight is used for several variables, it must be repeated in the weights-vector

countFeatureOrTotal

Should the frequency limit be applied on records with a positive value for a certain feature, or on all records, independent of value of feature

mincount

The minimum number of farms for a grid cell (threshold rule)

nlarge

Parameter to be used if the nlarge(st) farms should count for maximum plim percent of the total value for the variable in the grid cell (see details of gridData)

plim

See nlarge

verbose

Indicates if some extra output should be printed. Usually TRUE/FALSE, but can also have a value of 2 for multiResGrid for even more output.

domEstat

Should the dominance rule be applied as in the IFS handbook (TRUE), where the weights are rounded before finding the first nlarge contributors, or should it be the first nlarge contributors*weight, where also fractions are considered (FALSE)?

outfile

File to direct the output in case of parallel processing, see makeCluster for more details.

checkDominance

Logical - should the dominance rule be applied?

checkReliability

Logical - should the prediction variance be checked, and used for the aggregation? This considerably increases computation time

userfun

This gives the possibility to add a user defined function with additional confidentiality rules which the grid cell has to pass

strat

Column name defining the strata for stratified sampling, used if checkReliability is TRUE

confrules

Should the frequency rule (number of holdings) refer to the number of holdings with a value of the individual vars above zero ("individual") or the total number of holdings in the data set ("total")?

suppresslim

Parameter that can be used to avoid that almost empty grid cells are merged with cells with considerably higher number of observations. The value is a minimum share of the total potential new cell for a grid cell to be aggregated. See below for more details.

sumsmall

Logical; should the suppresslimSum value be applied on the sum of small grid cells within the lower resolution grid cell? Note that different combinations of suppreslim and suppreslimSum values might not give completely intuitive results.For instance, if both are equal, then a higher value can lead to more grid cells being left unaggregated for smaller grid sizes, leading to aggregation for a large grid cell

suppresslimSum

Parameter similar to suppreslim, but affecting the total of grid cells to be suppressed

reliabilitySplit

Logical or number - parameter to be used in calculation of the reliability (if checkReliability = TRUE). It can either give the number of groups, or if TRUE, it will create groups of approdcimately 50,000 records per group. If FALSE, the data set will not be split, independent on the size.

pseudoreg

A column with regions to be used to define pseudostrata if checkReliability is TRUE. This is used for the cases when one or more strata only has a single record (and the weight is different from one). This makes variance calculation impossible, so such strata are merged into a pseudostrata. If pseudoreg is given (for example a column with the country name, or NUTS2 region), the pseudostrata will be created separately for each pseudoreg region.

plotIntermediate

Logical or number - make a simple plot showing which grid cells have already passed the frequency rule. plotintermediate = TRUE, the function will wait 5 seconds after plotting before continuing, otherwise it will wait plotintermediate seconds.

addIntermediate

Logical; will add a list of all intermediate himgs and lohs (overlay of himg and the lower resolution grid) as an attribute to the object to be returned

postProcess

Logical; should the postprocessing be done as part of creation of the multiresolution grid (TRUE), or be done in a separate step afterwards (FALSE). The second option is useful when wanting to check the confidential grid cells of the final map

rounding

either logical (FALSE) or an integer indicating the number of decimal places to be used. Negative values are allowed (such as the default value rounding to the closest 10). See also the details for digits in round.

remCols

Logical; Should intermediate columns be removed? Can be set to FALSE for further analyses. Temporary columns will not be removed if their names partly match the variable names of vars

Details

The possible restrictions that will lead to aggregation of a grid cell are:

  1. Frequency rule (Aggregate to reach a minimum number of counts)

  2. Dominance rule (Aggregate because of dominance by one or more units)

  3. Reliability rule (Aggregate because the uncertainty is too high)

  4. User defined rule (Aggregate because a grid cell does not respect a user defined criteria)

This function will find the highest resolution data set that fulfills the confidentiality rules and potential reliability rules for variable(s) of interest. Starting with the second highest resolution (5 km in the default settings), the function will check if any of the 1 km sub pixels will have values not fulfilling any of the confidentiality rules (number of farms, values of the 2 largest compared to values of the entire grid cell). If all values are above the confidentiality limits, the grid cells will be kept at a 1 km resolution, otherwise only the 5 km grid cell will be kept. This will again be tested against the confidentiality rules in the next iteration, when grid cells will possibly be merged to 10 km grid cells.

The function can also be called if it is necessary to create a grid of a new variable for the same grid as an already existing variable. The confidentiality rules will then be applied to the new variables for the existing grid cells, and mask the ones that do not respect the rules. The function will not do any further merging of grid cells, for this it is necessary to grid the variables together. This feature is useful when the new data set has a similar resolution as the original data set. It will give a high number of missing values if the resolution of the new data is more sparse than the original. In the examples below, this means that it is possible to copy the grid of organic organic agricultural area to a grid of all agricultural area, whereas the opposite will not work well.

The standard threshold rule for spatial data is at least 10 units (mincount). The parameters nlarge and plim are used for determining the dominance treatment for the variable of interest, with default values of nlarge = 2 and plim = 0.85. If more than plim of the values of the grid cell (e.g. UAA, arable land, number of livestock) is explained by 1-nlarge weighted holdings, the grid cell will not pass the confidentiality rule.

The concept of reliability is explained in details in section 4.6 in the integrated farm survey handbook for 2023: https://wikis.ec.europa.eu/display/IFS/Integrated+Farm+Statistics+Manual+ In short, it is an estimate of the coefficient of variation for an estimate (a grid cell in this case), based on the number in the sample relative to the number in the population, and taking into account possible stratified sampling approaches. The number is zero if all holdings in the population in a grid cell has been sampled, and the default requirement is that the CV is less than 35

The computation can be time and memory intensive, particularly for the first iteration. The method involves creation (and inversion) of a matrix of size nr*ng, where nr is the number of records and ng is the number of grid cells. it is therefore sometimes necessary to split the data set into smaller parts, to reduce the computational challenges. The parameter reliabilitySplit is used for this. It will split the area of interest into several subsets. This will have some impact on the reliability calculations. The reliabilitySplit value might be set temporarily higher for the first iterations, as it will also depend on the number of grid cells.

Reliability cannot be calculated for records belonging to strata with only one record. The function will therefore attempt to merge these into pseudostrata, if there is more than one of these strata. The pseudoreg-parameter can be used to define the regions within which the pseudostrata are created (for example NUTS2-region). If there are still strata with only one record, these will cause a printed warning.

There are some cases where aggregation might not be desired. In the situation where a relatively large single grid cell does not respect the confidentiality rules, it is fine to aggregate it if the neighbouring grid cells are also relatively large. However, it can be seen as unfortunate if the single cell was aggregated with many smaller grid cells that could otherwise be disseminated at a high resolution. The added value of being able to present a value for a region with very few farms is perhaps lower than what is lost by having to aggregate to a lower resolution. The parameter suppresslim indicates the minimum value in a grid cell relative to the possible lower resolution grid cell before it is necessary to aggregate. If the limit is 0.05, a grid cell would only cause an aggregation to lower resolution if the value in the grid cell is more than 5% of the value in the lower resolution grid cell. Instead, it would be left as it is, and will be suppressed in the post-processing step.

There are cases when the built-in confidentiality checks are not what the user needs. That is why it is possible to submit a user defined function. This function needs to follow certain rules.

  1. The first argument must be a data.frame with name df. This is a data.frame with the individual records for a particular grid cell. It has three columns:

    1. himgid - the ID of the current grid cell. This is the grouping variable and is constant for the data.frame

    2. gridvar - a new common name for the current variable to be gridded

    3. weight - the weight of the variable to be gridded

  2. The function can include additional parameters for calculation of confidentiality (or reliability, or suitability, if the meaning of the function refers to something else). This can be new parameters to this particular function (through the ellipsis argument (...) of multiResGrid), existing parameters to multiResGrid, or potentially internal variables of multiResGrid.)

  3. The result of the function must be a logical, either the rule was passed for the records of this grid cell, or not (TRUE/FALSE)

  4. The function can potentially use internal variables of multiResGrid, however, the meaning of these will have to be understood from the code

A simple example of a userfun is given in the example section below (the one producing himg6)

Value

The function will return a multi-resolution grid with observations gridded to different grid cell sizes according to the confidentiality rules to be applied. It can also include some additional columns that indicates which of the different confidentiality rules that have been applied.

Note that the function might (if postProcess = FALSE) return values also for the confidential grid-cells.

Examples


library(sf)
if (!require(ggplot2)) print("Plotting of results will not work 
                     without installation of ggplot2")
if (!require(viridis)) print("Some of the plots will not work 
                     without installation of viridis package")
if (!require(patchwork)) print("Some of the plots will not work 
                     without installation of patchwork")

if (require(giscoR)) {
  useBorder = TRUE 
} else {
  useBorder = FALSE
  print("You need to install giscoR for plotting borders and clipping the gridded maps")
}
# These are SYNTHETIC agricultural FSS data 
data(ifs_dk) # Census data
ifs_weight = ifs_dk %>% dplyr::filter(Sample == 1) # Extract weighted subsample

# Create spatial data
ifg = fssgeo(ifs_dk, locAdj = "LL")
fsg = fssgeo(ifs_weight, locAdj = "LL")

if (useBorder) {
# Read country borders, only used for plotting
  borders = gisco_get_nuts(nuts_level = 0)
  dkb = borders[borders$CNTR_CODE == "DK",] %>% st_transform(crs = 3035)
}

ress = c(1,5,10,20,40, 80, 160)*1000
# Gridding Utilized agricultural area (UAA)
ifl = gridData(ifg, "UAA",res = ress)
# Gridding organic utilized agricultural area
ifl2 = gridData(ifg, vars = "UAAXK0000_ORG", res = ress)

# Gridding UAA and organic UAA together
ifl3 = gridData(ifg, vars = c("UAA", "UAAXK0000_ORG"), res = ress)

# Gridding the UAA from the survey - the survey weights are in the column EXT_MODULE
fsl = gridData(fsg,  vars = c("UAA"), weights = "EXT_MODULE",  res = ress)

# Create a multi-resolution grid only with farm number as confidentiality rule, then plot results
himg0 = multiResGrid(ifl, checkReliability = FALSE, suppresslim = 0)
ggplot(himg0) + geom_sf(aes(fill = count))

# Create a multi-resolution grid of UAA, also based on the dominance rule (default)
himg1 = multiResGrid(ifl, vars = "UAA", ifg = ifg)
  p1 = ggplot(himg1) + geom_sf(aes(fill = UAA))
  p1
# Create multi-resolution grid of organic UAA
himg2 = multiResGrid(ifl2, vars = "UAAXK0000_ORG", ifg = ifg)
himg21 = multiResGrid(ifl2, vars = "UAAXK0000_ORG", ifg = ifg, postProcess = FALSE)

ggplot(himg2) + geom_sf(aes(fill = UAAXK0000_ORG))

# Create joint multi-resolution grid of organic UAA and total UAA
himg3 = multiResGrid(ifl3, vars = c("UAA", "UAAXK0000_ORG"), ifg = ifg, 
                  checkReliability = FALSE, suppresslim = 0)
# Create multi-resolution grid of organic UAA, based on the UAA grid
# The large number of missing values indicates that this feature should
# mainly be used for data that have similar or higher resolution as the
# original data set.
himg33 = multiResGrid(himg1, vars = c("UAAXK0000_ORG"), ifg = ifg, 
                  checkReliability = FALSE, suppresslim = 0)
p31 = ggplot(himg3) + geom_sf(aes(fill = UAA))
p32 = ggplot(himg3) + geom_sf(aes(fill = UAAXK0000_ORG))
p33 = ggplot(himg33) + geom_sf(aes(fill = UAAXK0000_ORG))
p31 + p32 + p33

# Create multi-resolution grid of UAA, based on survey data,
# with and without applying reliability check
# This is a relatively slow functionality
# rounding is set to FALSE, to be better able to visualize the few records
# (Not recommended for data to be published)
himg4 = multiResGrid(fsl,  vars = c("UAA"), weights = "EXT_MODULE", ifg = fsg, 
                      strat = "STRA_ID_CORE", checkReliability = FALSE, rounding = FALSE)
# The parameter reliabilitySplit = 15 will divide the data set in 15 groups for the 
# reliabilityCheck.
# A lower value would be recommended, but a high value speeds up the computation for this example
himg5 = multiResGrid(fsl,  vars = c("UAA"), weights = "EXT_MODULE", ifg = fsg, 
                      strat = "STRA_ID_CORE", checkReliability = TRUE, 
                      reliabilitySplit = TRUE, rounding = FALSE, pseudoreg = "REGIONS")
                      
# Apply suppreslim to suppress insignificant grid cells
# Show intermediate maps of confidential cells (wait 5 seconds)
pint = ifelse(interactive(), 5, FALSE)
#himg11 = multiResGrid(ifl, vars = "UAA", ifg = ifg, 
#                  suppresslim = 0, plotIntermediate = pint)
himg11 = himg1
himg12 = multiResGrid(ifl, vars = "UAA", ifg = ifg, 
                 suppresslim = 0.02, plotIntermediate = pint)
himg13 = multiResGrid(ifl, vars = "UAA", ifg = ifg, 
                 suppresslim = 0.05, plotIntermediate = pint)
himg14 = multiResGrid(ifl, vars = "UAA", ifg = ifg, 
                 suppresslim = 0.1, plotIntermediate = pint)
 
 
 # This is an example of a userfun that can be used for alternative restrictions
 # for a grid cell. This particular toy example assures that there are at least
 # \code{nabove} records with a value (UAA in this case) above a certain "limit". 
 ufun = function(df, nabove, limit) {
   sum(df$gridvar > limit) < nabove
 }
 
himg6 = multiResGrid(ifl, vars = "UAA", ifg = ifg, 
                 suppresslim = 0.2, plotIntermediate = pint, userfun = ufun, nabove = 5, limit = 10)
 
 
 
if (useBorder) himg00 = st_intersection(dkb, himg0) else himg00 = himg0
p00 = ggplot() + geom_sf(data = himg00, aes(fill = count, color = count)) +
  scale_fill_viridis( name = "number of farms", trans = "log10") +
  scale_color_viridis( name = "number of farms", trans = "log10") +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("Number of farms for variable grid cell size, only frequency confidentiality") +
  theme_bw()
if (useBorder) p00 = p00 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
p00
  
if (useBorder) himg01 = st_intersection(dkb, himg1) else himg01 = himg1
p01 = ggplot() + geom_sf(data = himg01, aes(fill = count, color = count)) +
  scale_fill_viridis( name = "number of farms", trans = "log10") +
  scale_color_viridis( name = "number of farms", trans = "log10") +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("Number of farms for variable grid cell size, frequency and dominance confidentiality") +
  theme_bw()
if (useBorder) p01 = p01 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
p01
  
# Plot the density of organic agriculture, as hectares per square km
if (useBorder)himg02 = st_intersection(dkb, himg2) else himg02 = himg2
himg02$orgarea = himg02$UAAXK0000_ORG/units::set_units(st_area(himg02), "km^2")
units(himg02$orgarea) = NULL
p02 = ggplot() + geom_sf(data = himg02, aes(fill = orgarea), lwd = 0) +
  scale_fill_viridis( name = "ha / km2") +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("Organic UAA density")  +
  theme_bw()
if (useBorder) p02 = p02 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
p02

# Plot the relative abundance of organic UAA relative to total UAA
if (useBorder) himg03 = st_intersection(dkb, himg3) else himg03 = himg3
himg03$ouaashare = himg03$UAAXK0000_ORG/himg03$UAA*100
p03 = ggplot() + geom_sf(data = himg03, aes(fill = ouaashare), lwd = 0) +
  scale_fill_viridis( name = "% Organic") +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("Organic share")  +
  theme_bw()
if (useBorder) p03 = p03 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
p03
  
  
# Plot maps from survey data before and after adding the reliability constraint 
# The percentage of UAA can be above 100% due to farm area being registered at the location
# of the administration building, but the map without reliability check has too high values 
# for too many cells

if (useBorder) himg04 = st_intersection(dkb, himg4) else himg04 = himg4
himg04$area = st_area(himg04)/1e6
units(himg04$area) = NULL
himg04$uaashare = himg04$UAA/himg04$area
himg04$uaashare[himg04$uaashare > 1000] = 1000
p04 = ggplot() + geom_sf(data = himg04, aes(fill = uaashare), lwd = 0) +
  scale_fill_viridis( name = "% UAA",  trans = "log10", limits = c(1,1000)) +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("UAA share (sample without reliability check)")  +
  theme_bw()
if (useBorder) p04 = p04 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
p04
  
if (useBorder) himg05 = st_intersection(dkb, himg5) else himg05 = himg5
himg05$area = st_area(himg05)/1e6
units(himg05$area) = NULL
himg05$uaashare = himg05$UAA/himg05$area
himg05$uaashare[himg05$uaashare > 1000] = 1000
p05 = ggplot() + geom_sf(data = himg05, aes(fill = uaashare), lwd = 0) +
  scale_fill_viridis( name = "% UAA",  trans = "log10", limits = c(1,1000)) +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("UAA share (sample with reliability check)")  +
  theme_bw()
if (useBorder) p05 = p05 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
  
if (require(patchwork)) p04 + p05 + plot_layout(guides = "collect")
  
if (useBorder) himg06 = st_intersection(dkb, himg6) else himg06 = himg6
p06 = ggplot() + geom_sf(data = himg06, aes(fill = UAA), lwd = 0) +
  scale_fill_viridis( name = "ha") +
  coord_sf(crs = 3035) +#, xlim = c(2377294, 6400000), ylim = c(1313597, 5628510)) +
  ggtitle("UAA, with additional user defined function")  +
  theme_bw()
if (useBorder) p06 = p06 + geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) 
p06
  
     
# Plot the different maps from using different suppreslim values
himgs = list(himg11, himg12, himg13, himg14)
slims = c(0, 0.02, 0.05, 0.1, 0.2)
plots = list()
uaas = c(himg11$UAA, himg12$UAA, himg13$UAA, himg14$UAA)
lims = range(uaas[uaas > 0], na.rm = TRUE)
for (ii in 1:4) {
  if (useBorder) himg = st_intersection(dkb, himgs[[ii]]) else himg = himgs[[ii]]
  plots[[ii]] = 
   ggplot() + geom_sf(data = himg, aes(fill = UAA), lwd = 0) +
    scale_fill_viridis( name = "UAA (ha)", trans = "log10", limits = lims, na.value="red") +
    ggtitle(paste("Suppresslim = ", slims[[ii]])) +
    xlab("") + ylab("") +
    theme_bw()
   if (useBorder) plots[[ii]] = plots[[ii]] + 
                     geom_sf(data = dkb, fill = NA, colour='black', lwd = 0.5)
}

if (require(patchwork)) plots[[1]]  + plots[[2]] + plots[[3]]  + plots[[4]] + 
                              plot_layout(guides = "collect")
 



#' @rdname multiResGrid

MRG documentation built on Oct. 28, 2024, 5:07 p.m.

Related to multiResGrid in MRG...