Nothing
## Copyright (c) 2016, Flowminder Foundation, Forrest R. Stevens, Dana R.
## Thomson
##
## Permission is hereby granted, free of charge, to any person obtaining
## a copy of this software and associated documentation files (the
## "Software"), to deal in the Software without restriction, including
## without limitation the rights to use, copy, modify, merge, publish,
## distribute, sublicense, and/or sell copies of the Software, and to
## permit persons to whom the Software is furnished to do so, subject to
## the following conditions:
##
## The above copyright notice and this permission notice shall be
## included in all copies or substantial portions of the Software.
##
## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
## EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
## MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
## NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
## LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
## OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
## WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
globalVariables(names = c("selected", "id", "grid_id", "stratum", "V1",
"sample_ids", "psu_id", "valid", "pop", "urban"))
#' @title GridSample sampling algorithm
#'
#' @description
#' The \code{gs_sample} algorithm creates primary sampling units (PSUs) for
#' multi-stage cluster household surveys based on gridded population data.
#' Typical complex survey design is supported with input of a raster of
#' population counts, a raster of urbanized areas, and a raster of study
#' strata. Each of these rasters need to be in an identical projection and have
#' an identical grid resolution. The algorithm first selects PSU seed cells
#' with a probability proportionate to population size according to strata,
#' rural-urban, and spatial parameters specified, then it optionally grows PSUs
#' around the seed cells until a minimum population threshold is met in each
#' PSU.
#'
#' @details
#' A number of sampling features are optional. Oversampling in urban/rural
#' areas, oversampling to be spatially representative, and stratification are
#' not required. At a minimum, the user generates a simple random sample of
#' PSUs in a study area by inputting a \code{population_raster}, defining the
#' study area boundary as one stratum with \code{strata_raster}, defining the
#' output shapefile parameters \code{output_path} and \code{sample_name}, and
#' configuring the parameters \code{cfg_hh_per_stratum},
#' \code{cfg_hh_per_urban}, \code{cfg_hh_per_rural}, and
#' \code{cfg_pop_per_psu}. See the "Stratification", "Urban/rural domains",
#' "Spatial sampling", and "PSU size and framework" sections for additional
#' information. Note that all datasets are re-projected into WGS84 before the
#' sampling process begins. A real-world example can be seen using the code
#' \code{vignette("Rwanda")}, a vignette that replicates the sample design of
#' the 2010 Rwanda DHS survey.
#'
#' @section Stratification:
#' To stratify the sample, define geographic strata boundaries with
#' \code{strata_raster}, and specify the sample size per strata with
#' \code{cfg_hh_per_stratum}. For example, if a national survey will sample
#' 10,000 households from 5 provinces, then \code{cfg_hh_per_stratum = 2000}.
#' The parameter \code{cfg_hh_per_stratum} is the minimum sample size to
#' generate representative population statistics. In some surveys, strata
#' follow urban/rural boundaries within administrative units. If this is the
#' case, then \code{strata_raster} should include the boundaries of urban and
#' rural sampling areas within each administrative area, and
#' \code{cfg_hh_per_stratum} should reflect the correct sample size per stratum
#' - for example, a national sample of 10,000 households from each urban and
#' rural areas in 5 provinces would have \code{cfg_hh_per_stratum = 1000}.
#'
#' @section Urban/rural domains:
#' If urban/rural populations are not part of the stratification scheme, then
#' they are often treated as a sub-domain. Sub-domains represent important
#' sub-populations for which representative statistics are generated from the
#' survey data, and thus each sub-domain (at the national-level) should meet
#' the minimum sample size specified for each stratum. If either the
#' urban/rural sub-domain does not include enough households to generate
#' population statistics with the desired precision, then extra PSUs are
#' oversampled in the smaller sub-domain. To implement this step with
#' \code{gs_sample}, set \code{cfg_sample_rururb = 1}. In practice, rural areas
#' are often more difficult and expensive to visit, and thus a greater number
#' of households might be sampled from rural PSUs than urban PSUs. This is why
#' the user may specify different numbers of households to be sampled from each
#' urban PSUs (\code{cfg_hh_per_urban}) and rural PSUs
#' (\code{cfg_hh_per_rural}); if the same number of households will be sampled
#' from all PSUs, then configure both of these parameters with the same value.
#' Note, the number of PSUs that will be generated in each stratum is
#' \code{cfg_hh_per_stratum} divided by some number between
#' \code{cfg_hh_per_urban} and \code{cfg_hh_per_rural}.
#'
#' @section Spatial sampling:
#' To select a sample that is both representative of the population
#' and of space, set \code{cfg_sample_spatial = 1} and specify
#' \code{cfg_sample_spatial_scale}, the spatial scale at which the sample
#' should be representative. The spatial scale should be meaningful; for
#' example, it will facilitate small area estimates with limited statistical
#' error for administrative units that are smaller than the stratification
#' units. Determining an appropriate spatial scale might take trial and error.
#' If the study area has large regions of sparse population, a typical
#' non-spatially representative sample will follow the population distribution
#' and have large areas without a PSU. In this case, the user might need to
#' increase the spatial resolution \code{cfg_sample_spatial_scale} of the
#' sample, or force the algorithm to generate more PSUs in each stratum by
#' increasing \code{cfg_hh_per_stratum} and/or reducing \code{cfg_hh_per_urban}
#' and \code{cfg_hh_per_rural}.
#'
#' @section PSU size and fieldwork:
#' Four additional parameters can be configured to deal with idiosyncrasies of
#' gridded population data and improve feasibility of fieldwork. The user can
#' set a maximum geographic size of PSU in kilometres squared,
#' \code{cfg_max_psu_size}. We recommend choosing a size that can feasibly be
#' visited by a field team on foot during one day. The user might also specify
#' which cells are included in the sample frame with
#' \code{cfg_min_pop_per_cell}. Selection of a sensible value is highly
#' dependent on the gridded population dataset being used, and the scale of the
#' input data (e.g. 200m X 200m grid cells). The cell size of the output raster
#' can be specified with \code{cfg_desired_cell_size}. Gridded population
#' datasets generated from old population figures or old covariates may be
#' inaccurate at a very local scale (e.g. 100m X 100m cells), but will
#' generally increase in accuracy as cells are aggregated (e.g. 300m X 300m
#' cells). Finally, the PSU growth portion of the algorithm can be switched off
#' by setting \code{cfg_psu_growth = FALSE} resulting in a sample of single
#' grid cells (and their centroids).
#'
#' @import raster rgdal geosphere rgeos maptools spatstat.utils
#' @importFrom utils flush.console
#' @importFrom data.table data.table setkey
#' @importFrom spatstat dirichlet
#' @importFrom methods as
#'
#' @param population_raster Raster* layer. Input gridded population dataset to
#' use as sample frame. Values should be number of people in each pixel as a
#' whole number or decimal value.
#' @param strata_raster Raster* layer. Raster that defines the stratum numeric
#' ID of each pixel. Generally created by rasterizing a shapefile of polygons
#' that define strata.
#' @param urban_raster Raster* layer. Raster of urbanized areas where a cell
#' value of 1 indicates urban cells and 0 indicates rural cells.
#' @param cfg_hh_per_stratum numeric. Target household sample size per stratum.
#' In a non-stratified sample, this is the total sample size of households.
#' In a stratified sample, this is the household sample size per stratum.
#' @param cfg_hh_per_urban numeric. Number of households expected to be
#' selected per urban PSU during survey fieldwork.
#' @param cfg_hh_per_rural numeric. Number of households expected to be
#' selected per rural PSU during survey fieldwork.
#' @param cfg_pop_per_psu numeric. Minimum population per PSU (e.g. 500
#' persons).
#' @param cfg_sample_rururb logical. A flag to oversample rural/urban areas if
#' one domain does not meet the target sample size per stratum. Default is
#' \code{FALSE}.
#' @param cfg_sample_spatial logical. A flag to oversample in space ensuring
#' that at least one PSU is selected within each "coarse grid" cell with cell
#' size defined by the user. Default is \code{FALSE}.
#' @param cfg_sample_spatial_scale If \code{cfg_sample_spatial == TRUE}, this
#' defines the length in kilometres of the side of the cell (e.g. 20 for
#' 20km X 20km) of each coarse grid cell where the algorithm will ensure at
#' least one PSU is located in each coarse grid cell.
#' @param cfg_desired_cell_size numeric. Desired length of the side of the cell
#' in 100m (e.g. 4 for 400m X 400m) for output raster of PSUs. Defaults to
#' NA, which yields an output raster at the same resolution as
#' population_raster.
#' @param cfg_max_psu_size numeric. Maximum allowed geographic size of a given
#' PSU in kilometres squared (e.g. 5 for PSUs smaller than 5km X 5km).
#' Defaults to infinity.
#' @param cfg_min_pop_per_cell numeric. Minimum population in a raster cell
#' required for it to be considered for sampling. Cells with less than this
#' value will be excluded from the sample. Defaults to 0, therefore including
#' all cells.
#' @param cfg_psu_growth logical. Determines whether to grow PSUs until either
#' there are no available cells or each PSU covers a population defined by
#' \code{cfg_pop_per_psu}.
#' @param cfg_random_number numeric. The random number seed to reproduce a
#' previous gridded population sample.
#' @param output_path character. Output path and folder name.
#' @param sample_name character. Name of output PSU shapefile.
#'
#' @return Shapefile of household survey primary sampling unit (PSU) boundaries
#'
#' @examples
#' require(raster)
#'
#' poprast <- raster(ncols = 100, nrows = 100, xmx = 10, xmn = 9, ymn = 9, ymx = 10,
#' crs = CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"),
#' vals = runif(10000, 0, 100))
#' stratarast <- raster(ncols = 100, nrows = 100, xmx = 10, xmn = 9, ymn = 9, ymx = 10,
#' crs = CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"),
#' vals = c(rep(1, times = 5000), rep(2, times = 5000)))
#' urbanrast <- poprast > 25
#'
#' example_1 <- gs_sample(
#' population_raster = poprast,
#' strata_raster = stratarast,
#' urban_raster = urbanrast,
#' cfg_hh_per_stratum = 800,
#' cfg_hh_per_urban = 20,
#' cfg_hh_per_rural = 20,
#' cfg_pop_per_psu = 500,
#' cfg_sample_rururb = TRUE,
#' cfg_sample_spatial = FALSE,
#' cfg_sample_spatial_scale = 100,
#' cfg_desired_cell_size = NA,
#' cfg_max_psu_size = 5,
#' cfg_min_pop_per_cell = 0.01,
#' output_path = tempdir(),
#' sample_name="Example"
#' )
#' plot(example_1)
#'
#' #### Example two is the identical, except PSUs aren't grown,
#' #### so the shapefile returned includes a single grid cell for each PSU.
#'
#' example_2 <- gs_sample(
#' population_raster = poprast,
#' strata_raster = stratarast,
#' urban_raster = urbanrast,
#' cfg_hh_per_stratum = 800,
#' cfg_hh_per_urban = 20,
#' cfg_hh_per_rural = 20,
#' cfg_pop_per_psu = 500,
#' cfg_sample_rururb = TRUE,
#' cfg_sample_spatial = FALSE,
#' cfg_sample_spatial_scale = 100,
#' cfg_desired_cell_size = NA,
#' cfg_max_psu_size = 5,
#' cfg_min_pop_per_cell = 0.01,
#' cfg_psu_growth = FALSE,
#' output_path = tempdir(),
#' sample_name="Example_without_growth"
#' )
#' plot(example_2)
#'
#' @export
gs_sample <- function(population_raster,
strata_raster,
urban_raster,
cfg_hh_per_stratum,
cfg_hh_per_urban,
cfg_hh_per_rural,
cfg_pop_per_psu,
cfg_sample_rururb = FALSE,
cfg_sample_spatial = FALSE,
cfg_sample_spatial_scale = NA,
cfg_desired_cell_size = NA,
cfg_max_psu_size = Inf,
cfg_min_pop_per_cell = 0,
cfg_psu_growth = TRUE,
cfg_random_number = NA,
output_path,
sample_name) {
if (!is.na(cfg_random_number)) {
if (exists('.Random.seed')) {
old.seed <- .Random.seed
on.exit({
print("Restoring random seed to its original value.")
assign('.Random.seed', old.seed, envir = .GlobalEnv)
})
} else {
on.exit({
print("Restoring random seed to its original value.")
rm('.Random.seed', envir = .GlobalEnv)
})
}
print(paste0("Setting random seed to ", cfg_random_number, "."))
set.seed(cfg_random_number)
}
## Project rasters in WGS84 if not
if (sp::proj4string(population_raster) != "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0") {
population_raster <- projectRaster(population_raster,
crs = sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
}
if (sp::proj4string(urban_raster) != "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0") {
urban_raster <- projectRaster(urban_raster,
crs = sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
}
if (sp::proj4string(strata_raster) != "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0") {
strata_raster <- projectRaster(strata_raster,
crs = sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
}
## Aggregate to meet desired resolution if lower than current
if (!is.na(cfg_desired_cell_size)) {
midcell <- c(
xFromCol(population_raster, col = floor(ncol(population_raster) / 2)),
yFromRow(population_raster, row = nrow(population_raster) / 2)
)
nextcellx <- c(
xFromCol(population_raster, col = floor(ncol(population_raster) / 2) + 1),
yFromRow(population_raster, row = nrow(population_raster) / 2)
)
# calculate cell area and convert to hectares
cellsize <- distGeo(midcell, nextcellx) / 100
if (cellsize < cfg_desired_cell_size) {
population_raster <- aggregate(population_raster,
fact = round(cfg_desired_cell_size / cellsize),
fun = sum, na.rm = T)
strata_raster <- resample(strata_raster,
population_raster, method = "ngb")
urban_raster <- resample(urban_raster,
population_raster, method = "ngb")
}
if (cellsize >= cfg_desired_cell_size) {
print("Resolution already as low or lower than desired.")
}
}
midcell <- c(
xFromCol(population_raster, col = floor(ncol(population_raster) / 2)),
yFromRow(population_raster, row = nrow(population_raster) / 2)
)
nextcellx <- c(
xFromCol(population_raster, col = floor(ncol(population_raster) / 2) + 1),
yFromRow(population_raster, row = nrow(population_raster) / 2)
)
nextcelly <- c(
xFromCol(population_raster, col = floor(ncol(population_raster) / 2)),
yFromRow(population_raster, row = nrow(population_raster) / 2 + 1)
)
# calculate cell area and convert to kilometers
cellsizex <- distGeo(midcell, nextcellx) / 1000
cellsizey <- distGeo(midcell, nextcelly) / 1000
cellkmsq <- cellsizex * cellsizey
if (!is.na(cfg_sample_spatial_scale)) {
spatialfac <- (cfg_sample_spatial_scale) / mean(c(cellsizex, cellsizey))
}
## Convert maximum PSU size (in square kilometers) into a maximum possible
## number of cells
maxcells <- cfg_max_psu_size / cellkmsq
print("Beginning gs_sample() Process:")
flush.console()
## Compile vectors of values:
urban_vec <- urban_raster[]
pop_vec <- population_raster[]
strata_vec <- strata_raster[]
## Create an index of cells, and vectors with no NAs from our data:
raster_index <- 1:ncell(strata_raster)
raster_index_valid_boo <- (
!is.na(strata_vec) &
!is.na(urban_vec) &
!is.na(pop_vec) &
pop_vec > cfg_min_pop_per_cell
)
raster_index_valid <- raster_index[ raster_index_valid_boo ]
## Create sampled_index that will be values for our output, 0 for
## unsampled, 1 for sampled:
sampled <- 0 * raster_index_valid_boo
## Compile unique lists of valid Strata IDs:
strata_ids <- unique(strata_vec[raster_index_valid])
print("Randomly Stratifying PSU Cell IDs:")
flush.console()
for (stratum_id in strata_ids) {
print(paste("Processing stratum ID", stratum_id, "..."))
flush.console()
## Create index for rasters for current strata, with valid values:
index <- raster_index[ raster_index_valid_boo & (strata_vec == stratum_id) ]
initsampledpsu <- cfg_hh_per_stratum / min(
c(cfg_hh_per_urban, cfg_hh_per_rural), na.rm = TRUE)
psu_data <- data.table(
"id" = index,
"pop" = pop_vec[index],
"urban" = urban_vec[index],
"selected" = 0
)
## Calculate number of cells to skip for serpentine sampling
psu_interval <- floor(sum(psu_data$pop) / initsampledpsu)
psu_data$pop_cum <- cumsum(psu_data$pop)
psu_data$urbanpop <- (psu_data$urban == 0) * cfg_hh_per_rural +
(psu_data$urban == 1) * cfg_hh_per_urban
psu_data$ruralpop <- (psu_data$urban == 1) * cfg_hh_per_rural +
(psu_data$urban == 0) * cfg_hh_per_urban
## Choose first cell to sample
initpop <- sample(psu_interval, 1)
psu_data$selected[which.min(abs(psu_data$pop_cum - initpop))] <- 1
## Skip based on psu_interval then select cells
nextpop <- initpop + psu_interval
while (nextpop < max(psu_data$pop_cum)) {
psu_data$selected[which.min(abs(psu_data$pop_cum - nextpop))] <- 1
nextpop <- nextpop + psu_interval
}
## Calculate maximum number of households sampled given the cells that
## have been chosen
psu_data$maxsample <- max(psu_data$urbanpop, psu_data$ruralpop)
total_stratum <- sum(psu_data[selected == 1]$maxsample)
## Remove and add cells as necessary based on how many will be sampled
## per stratum, compared to the desired cfg_hh_per_stratum
if (total_stratum > cfg_hh_per_stratum) {
popremove <- floor(total_stratum - cfg_hh_per_stratum)
psu_seeds <- psu_data[selected == 1]
psu_seeds <- psu_seeds[sample(1:dim(psu_seeds)[1]), ]
psu_seeds$popcum <- cumsum(psu_seeds$maxsample)
psu_data$selected[
is.element(
psu_data$id,
psu_seeds$id[which(psu_seeds$urbanpopcum < popremove)]
)] <- 0
}
if (total_stratum < cfg_hh_per_stratum) {
while (total_stratum < cfg_hh_per_stratum) {
newid <- sample(psu_data[selected == 0]$id, 1)
total_stratum <- total_stratum +
(urban_vec[newid] == 0) * cfg_hh_per_rural +
(urban_vec[newid] == 1) * cfg_hh_per_urban
psu_data[id == newid]$selected <- 1
}
}
## set all selected cells to 1 in the "sampled" vector
sampled[ psu_data[selected == 1]$id] <- 1
}
## Next we need to revisit where we have 1 in our sampled values, and
## confirm that we match our rural/urban ratio if
## cfg_sample_rururb == TRUE:
if (cfg_sample_rururb == TRUE) {
print("Checking urban and rural domains meet stratum HH sample size")
flush.console()
## Extract the rural and urban values that are currently in our sample:
urban_vec_sampled <- urban_vec[sampled == 1]
rural_count <- length(which(urban_vec_sampled == 0))
urban_count <- length(which(urban_vec_sampled == 1))
## Double check to see if we are missing PSUs of a certain type:
reassign <- NA
choice <- NA
num_switch <- 0
## Find out which class we need to increase, and which we need to
## decrease:
if (rural_count * cfg_hh_per_rural < cfg_hh_per_stratum) {
## We need more rural in our sample:
choice <- 1
reassign <- 0
## Number to reassign:
num_switch <- round(cfg_hh_per_stratum / cfg_hh_per_rural) -
rural_count
}
if (urban_count * cfg_hh_per_urban < cfg_hh_per_stratum) {
## We need more urban in our sample:
choice <- 0
reassign <- 1
## Number to reassign:
num_switch <- round(cfg_hh_per_stratum / cfg_hh_per_urban) -
urban_count
}
if (
rural_count * cfg_hh_per_rural < cfg_hh_per_stratum &
urban_count * cfg_hh_per_urban < cfg_hh_per_stratum
) {
print("Both are below threshold! Cannot reassign to meet desired sample size")
reassign <- NA
choice <- NA
}
if (is.na(reassign)) {
print("Not reassigning urban/rural")
}
if (!is.na(reassign)) {
reassign_index <- raster_index[
(sampled == 0) &
(urban_vec == reassign) &
raster_index_valid_boo
]
choice_index <- raster_index[
(sampled == 1) &
(urban_vec == choice) &
raster_index_valid_boo
]
if (length(reassign_index) == 0 | length(choice_index) == 0) {
print("Unable to reassign urban/rural; no available cells")
}
## Choose cells to reassign
if (length(reassign_index) > 0 & length(choice_index) > 0) {
reassign_index <- reassign_index[
is.element(strata_vec[reassign_index],
unique(strata_vec[choice_index])
)
]
reassign_ids <- sample(reassign_index, num_switch)
reassign_strata <- strata_vec[reassign_ids]
strata_numbers <- table(reassign_strata)
strata_numbers_ids <- as.numeric(names(strata_numbers))
if (length(reassign_strata) > 0) {
for (i in 1:length(strata_numbers_ids)) {
stratum_id <- strata_numbers_ids[i]
stratum_number <- strata_numbers[[i]]
if (choice == 1) {
stratum_number <- floor(
(cfg_hh_per_urban / cfg_hh_per_rural) * stratum_number
)
} else {
stratum_number <- floor(
(cfg_hh_per_rural / cfg_hh_per_urban) * stratum_number
)
}
print(paste("Reassignment for stratum",
strata_ids[i], "of", stratum_number,
"cells processed"))
choice_index <- raster_index[
(sampled == 1) &
(urban_vec == choice) &
strata_vec == stratum_id
]
## If the number of samples to convert is more than
## 90% of the total number then only convert up to
## that amount:
if (stratum_number > round(0.90 * length(choice_index))) {
stratum_number <- round(0.90 * length(choice_index))
}
choice_ids <- sample(choice_index, stratum_number)
sampled[ choice_ids ] <- 0
}
## Finally, convert our reassigned cells to 1 in the
## sampled vector:
sampled[ reassign_ids ] <- 1
}
}
}
}
if (cfg_sample_spatial == TRUE) {
print("Stratifying by Spatial Grid:")
flush.console()
## First create a raster with cell indices from our valid cell
## locations:
valid_cells <- raster_index
valid_cells[-1 * raster_index_valid] <- 0
valid_raster <- strata_raster
valid_raster <- setValues(valid_raster, valid_cells)
## Change the aggregated cell parameter from square kilometers to numbers
## of pixels
valid_raster <- aggregate(valid_raster, fac = spatialfac, fun = max)
valid_raster <- resample(valid_raster, strata_raster, method = "ngb")
valid_vec <- valid_raster[]
## Compile unique lists of valid aggregated grid IDs:
grid_ids <- unique(valid_vec[raster_index_valid])
## Compile counts of sampled points by grid ids:
raster_data <- data.table(
"grid_id" = valid_vec,
"sampled" = as.numeric(sampled),
"stratum" = strata_vec,
"raster_index" = raster_index
)
data.table::setkey(raster_data, grid_id, stratum, sampled)
grid_table <- raster_data[, sum(sampled, na.rm = TRUE), by = grid_id]
## Compile majority value of the stratum id for each cell:
strata_table <- raster_data[, gs_mode(stratum), by = grid_id]
## For each empty cell, randomly select a new cell within it to
## sample and remove one randomly from the stratum that makes up the
## majority of the cell:
for (i in 1:length(strata_table[, grid_id])) {
## Check to see if the grid is empty and the dominant stratum is not
## zero:
if (grid_table[, V1][i] == 0 & !is.na(strata_table[, V1][i])) {
print(paste("Reassignment for grid", grid_table[, grid_id][i],
"and stratum", strata_table[, V1][i]))
flush.console()
## Select unsampled raster indices for this grid cell:
cell_raster_indices <- raster_data[list(strata_table[, grid_id][i],
strata_table[, V1][i], 0)]
## Select sampled raster indices for this stratum:
stratum_raster_indices <- raster_data[
stratum == strata_table[, V1][i] &
sampled == 1
]
## Randomly choose a valid cell from this grid location to sample
## and one from the sampled locations in the majority stratum to
## unsample:
compiled_cell_indices <- cell_raster_indices[, raster_index]
compiled_cell_indices <- compiled_cell_indices[ compiled_cell_indices %in%
raster_index[ (sampled == 0) & raster_index_valid_boo ]]
if (
length(compiled_cell_indices) > 0 &
length(stratum_raster_indices[, raster_index]) > 0
) {
cell_raster_index <- sample(compiled_cell_indices, 1)
## Make sure we only sample from indices matching the rural/urban
## designation we just chose randomly:
stratum_raster_indices <- raster_data[
stratum == strata_table[, V1][i] &
sampled == 1 &
urban_vec == urban_vec[cell_raster_index]
]
stratum_raster_index <- sample(
stratum_raster_indices[, raster_index], 1
)
sampled[ cell_raster_index ] <- 1
sampled[ stratum_raster_index ] <- 0
} else {
print("No valid cells for this stratum left for assignment or reassignment! Skipping...")
flush.console()
}
}
}
}
## Convert sampled vector into a raster
sampled_image <- strata_raster
sampled_image <- setValues(sampled_image, sampled)
## Second Stage Sampling:
print("Begin Second Stage Sampling, Growing PSUs if requested:")
flush.console()
## Assign an ID to each sampled cell randomly:
sampled_ids <- sample(1:sum(sampled))
sampled_image[ sampled_image[] != 0 ] <- sampled_ids
sampled_image[ sampled_image[] == 0 ] <- NA
cellids <- which(is.na(sampled_image[]) == F)
xydat <- data.frame(
sampleid = sampled_image[cellids],
x = xFromCell(sampled_image, cell = cellids),
y = yFromCell(sampled_image, cell = cellids)
)
## Create data.table of our data for sorting and quick accesss:
raster_data <- data.table(
"sampled" = sampled,
"sample_ids" = sampled_image[],
"stratum" = strata_vec,
"raster_index" = raster_index
)
setkey(raster_data, sample_ids)
index_raster <- sampled_image
index_raster[] <- raster_index
## Create vectors of x/y coordinates from our raster object for use
## in creating a distance from focal PSU point within the loop:
xCoords <- xFromCell(index_raster, index_raster[])
yCoords <- yFromCell(index_raster, index_raster[])
## Initialize a new image based on our sampled image for storing our
## expanded PSU areas:
psu_image <- sampled_image
psu_image[ is.na(psu_image[]) ] <- 0
if (cfg_psu_growth == TRUE) {
psu_ids <- sampled_ids
psu_sampled_ids <- sampled_ids
psu_voronoi_done <- sampled_ids
## Calculate statistics for each PSU (number of households, number of
## cells, etc, and a vector indicating if the Voronoi polygons are
## filled)
psu_sampled_sum <- population_raster[][
raster_data[list(psu_sampled_ids), raster_index]]
psu_sampled_cell_count <- psu_sampled_sum * 0 + 1
psu_voronoi_done <- sampled_ids
psu_index <- sp::SpatialPointsDataFrame(
coords = xyFromCell(index_raster, raster_data[sampled == 1]$raster_index),
data = as.data.frame(raster_data[sampled == 1]),
proj4string = sp::CRS(sp::proj4string(population_raster)))
## Create Voronoi polygons around each seed PSU cell
psuvoronoi <- as(dirichlet(spatstat::as.ppp(psu_index@coords,
W = extent(population_raster)[1:4])), "SpatialPolygons")
psuvoronoi$sample_ids <- psu_index$sample_ids
psuvoronoi$stratum <- psu_index$stratum
voronoirast <- rasterize(psuvoronoi, population_raster,
field = "sample_ids")
passes <- 0
while (sum(psu_sampled_ids != 0) > 0) {
print(paste("PSUs to process:", sum(psu_sampled_ids != 0)))
flush.console()
## Randomly shuffle the remaining IDs so we don't process them in
## the same order:
shuffle_index <- sample(1:length(psu_sampled_ids))
psu_ids <- psu_ids[shuffle_index]
psu_sampled_ids <- psu_sampled_ids[shuffle_index]
psu_sampled_sum <- psu_sampled_sum[shuffle_index]
psu_sampled_cell_count <- psu_sampled_cell_count[shuffle_index]
psu_voronoi_done <- psu_voronoi_done[shuffle_index]
## Print our current active PSUs:
print("Active PSUs:")
print(psu_sampled_ids[])
print("Active PSU Population Sums:")
print(psu_sampled_sum[])
print("Active PSU Cell Count:")
print(psu_sampled_cell_count)
## Remove any PSUs that already exceed our population sum threshold:
psu_sampled_ids[psu_sampled_sum[] > cfg_pop_per_psu] <- 0
psu_sampled_ids[psu_sampled_cell_count > maxcells] <- 0
psu_voronoi_done[psu_sampled_sum[] > cfg_pop_per_psu] <- 0
psu_voronoi_done[psu_sampled_cell_count > maxcells] <- 0
if (length(which(psu_voronoi_done > 0) > 0)) {
for (sample_id in psu_voronoi_done[psu_voronoi_done > 0]) {
print(paste("Processing", sample_id, "..."))
flush.console()
## Calculate distances from the current cell for entire image:
x_index <- mean(
xFromCell(index_raster, which(psu_image[] == sample_id)))
y_index <- mean(
yFromCell(index_raster, which(psu_image[] == sample_id)))
cell_dist <- ((x_index - xCoords)^2 + (y_index - yCoords)^2 )^0.5
## Create an index that only pulls values for those cells
## within our current stratum, and for which we have not
## already selected for use in a PSU:
stratum_id <- raster_data[list(sample_id), stratum]
stratum_indices <- raster_data[stratum == stratum_id, raster_index]
indices <- stratum_indices[
psu_image[stratum_indices] == 0 &
raster_index_valid_boo[stratum_indices] == T &
voronoirast[stratum_indices] == sample_id
]
if (length(indices) == 0) {
## If there are no more valid indices left in the
## strata for assignment then we drop the ID from the
## psu_sampled_ids list:
psu_sampled_ids[psu_sampled_ids == sample_id] <- 0
print("No Voronoi-bounded cells left!")
} else {
ordered_data <- data.frame(indices,
"pop" = pop_vec[indices], "dist" = cell_dist[indices])
ordered_data <- ordered_data[order(ordered_data$dist,
-(ordered_data$pop), decreasing = F), ]
ordered_data$pop_cum <- cumsum(ordered_data$pop) +
psu_sampled_sum[psu_sampled_ids == sample_id]
selection <- ordered_data$pop_cum <= cfg_pop_per_psu
selection[which(selection == FALSE)[1]] <- TRUE
indices_to_add <- ordered_data[ selection, "indices" ]
max_index <- length(indices_to_add)
if (length(indices_to_add) == 0) {
max_index <- 1
indices_to_add[max_index] <- ordered_data[max_index, 1]
}
if (sum(is.na(indices_to_add) > 0)) {
indices_to_add <- indices_to_add[!is.na(indices_to_add)]
}
## Update our psu_image based on new sample indices:
psu_image[ indices_to_add ] <- sample_id
## Adjust the total sample_id to match the total population:
sum_population_added <- sum(pop_vec[indices_to_add])
sum_population_for_sample <- sum(
pop_vec[raster_index[psu_image[] == sample_id]])
sum_cells_added <- length(indices_to_add)
sum_cells <- sum(psu_image[] == sample_id)
psu_sampled_sum[
psu_sampled_ids == sample_id
] <- sum_population_for_sample
psu_sampled_cell_count[psu_sampled_ids == sample_id] <- sum_cells
print(paste("... cells added:", sum_cells_added))
print(paste("... population added:", sum_population_added))
print(paste("... population total:", sum_population_for_sample))
## Check to see if we have reached our maximum
## configured population or if we could not add any
## more cells, then drop the current sample_id from the
## list to process:
if (
sum_cells_added == 0 |
sum_population_for_sample > cfg_pop_per_psu
) {
psu_sampled_ids[psu_sampled_ids == sample_id] <- 0
psu_voronoi_done[psu_voronoi_done == sample_id] <- 0
}
}
}
passes <- passes + 1
}
}
}
## Last let's convert our 0's to NA and save the image off:
psu_image[ psu_image[] == 0 ] <- NA
print("Converting PSU Cells to Polygons and Summarizing Metrics:")
## Convert our PSUs to polygons:
psu_polygons <- rasterToPolygons(psu_image, na.rm = T, dissolve = T)
names(psu_polygons) <- "psu_id"
# Create data.table of our data for sorting and quick accesss of
# per-stratum information:
raster_data <- data.table(
"valid" = raster_index_valid_boo,
"sampled" = sampled,
"sample_ids" = sampled_image[],
"psu_id" = psu_image[],
"stratum" = strata_vec,
"pop" = as.numeric(population_raster[]),
"urban" = urban_vec,
"raster_index" = raster_index
)
setkey(raster_data, psu_id)
# Calculate, total, rural (urban==0) and urban (urban==1) populations
# per stratum, as well as total number of cells:
stratum_pop <- raster_data[
valid == TRUE,
list(
"stratum" = max(stratum),
"stratum_pop" = sum(pop, na.rm = TRUE),
"stratum_rural_pop" = sum((urban == 0) * pop, na.rm = TRUE),
"stratum_urban_pop" = sum((urban == 1) * pop, na.rm = TRUE),
"stratum_cells" = length(raster_index)
),
by = stratum
]
setkey(stratum_pop, stratum)
psu_pop <- raster_data[
valid == TRUE & !is.na(psu_id),
list(
"stratum" = max(stratum),
"psu_pop" = sum(pop, na.rm = TRUE),
"psu_rural_pop" = sum((urban == 0) * pop, na.rm = TRUE),
"psu_urban_pop" = sum((urban == 1) * pop, na.rm = TRUE),
"psu_cells" = length(raster_index)
),
by = psu_id
]
setkey(psu_pop, stratum)
merge_pop <- psu_pop[list(stratum_pop)]
setkey(merge_pop, psu_id)
fields <- c(
"stratum",
"psu_id",
"psu_pop",
"psu_rural_pop",
"psu_urban_pop",
"psu_cells",
"stratum_pop",
"stratum_rural_pop",
"stratum_urban_pop",
"stratum_cells"
)
merge_pop <- as.data.frame(merge_pop)[, -7]
psu_polygons@data <- merge_pop[match(psu_polygons@data[, "psu_id"],
merge_pop[, "psu_id"]), ]
psu_polygons <- merge(psu_polygons, xydat, by.x = "psu_id", by.y = "sampleid")
names(psu_polygons) <- c(
"psu_id",
"stratum",
"psu_pop",
"psu_r_pop",
"psu_u_pop",
"psu_cells",
"str_pop",
"str_r_pop",
"str_u_pop",
"str_cells",
"xCent",
"yCent"
)
writeOGR(psu_polygons, dsn = output_path, layer = sample_name,
driver = "ESRI Shapefile", check_exists = TRUE, overwrite_layer = TRUE)
return(psu_polygons)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.