irs: Select an independent random sample (IRS)
In spsurvey: Spatial Sampling Design and Analysis

irs	R Documentation

Description

Select a sample that is not spatially balanced from a point (finite), linear / linestring (infinite), or areal / polygon (infinite) sampling frame using the Independent Random Sampling (IRS) algorithm. The IRS algorithm accommodates unstratified and stratified sampling designs and allows for equal inclusion probabilities, unequal inclusion probabilities according to a categorical variable, and inclusion probabilities proportional to a positive auxiliary variable. Several additional sampling options are included, such as including legacy (historical) sites, requiring a minimum distance between sites, and selecting replacement sites.

Usage

irs(
  sframe,
  n_base,
  stratum_var = NULL,
  seltype = NULL,
  caty_var = NULL,
  caty_n = NULL,
  aux_var = NULL,
  legacy_var = NULL,
  legacy_sites = NULL,
  legacy_stratum_var = NULL,
  legacy_caty_var = NULL,
  legacy_aux_var = NULL,
  mindis = NULL,
  maxtry = 10,
  n_over = NULL,
  n_near = NULL,
  wgt_units = NULL,
  pt_density = NULL,
  DesignID = "Site",
  SiteBegin = 1,
  sep = "-",
  projcrs_check = TRUE
)

Arguments

`sframe`	A sampling frame as an `sf` object. The coordinate system for `sframe` must be projected (not geographic). If m or z values are in `sframe`'s geometry, they are silently dropped (i.e., only x-coordinates and y-coordinates are preserved).
`n_base`	The base sample size required. If the sampling design is unstratified, this is a single numeric value. If the sampling design is stratified, this is a named vector or list whose names represent each stratum and whose values represent each stratum's sample size. These names must match the values of the stratification variable represented by `stratum_var`. Legacy sites are considered part of the base sample, so the value for `n_base` should be equal to the number of legacy sites plus the number of desired non-legacy sites.
`stratum_var`	A character string containing the name of the column from `sframe` that identifies stratum membership for each element in `sframe`. If `stratum_var` equals `NULL`, the sampling design is unstratified and all elements in `sframe` are eligible to be selected in the sample. The default is `NULL`.
`seltype`	A character string or vector indicating the inclusion probability type, which must be one of the following: `"equal"` for equal inclusion probabilities; `"unequal"` for unequal inclusion probabilities according to a categorical variable specified by `caty_var`; and `"proportional"` for inclusion probabilities proportional to a positive auxiliary variable specified by `aux_var`. If the sampling design is unstratified, `seltype` is a single character string. If the sampling design is stratified, `seltype` is a named vector whose names represent each stratum and whose values represent each stratum's inclusion probability type. `seltype`'s default value tries to match the intended inclusion probability type: if `caty_var` and `aux_var` are not specified, `seltype` is `"equal"`; if `caty_var` is specified, `seltype` is `"unequal"`; and if `aux_var` is specified, `seltype` is `"proportional"`.
`caty_var`	A character string containing the name of the column from `sframe` that represents the unequal probability variable.
`caty_n`	A named numeric vector (or a list of such vectors) indicating the expected sample size for each level of `caty_var`, the unequal probability variable. If the sampling design is unstratified, `caty_n` is a named vector whose names represent each level of `caty_var` and whose values represent each level's expected sample size. The sum of `caty_n` must equal `n_base`. If the sampling design is stratified and the expected sample sizes are the same among strata, `caty_n` is a named vector whose names represent each level of `caty_var` and whose values represent each level's expected sample size; these expected sample sizes are applied to all strata. The sum of `caty_n` must equal each stratum's value in `n_base`. If the sampling design is stratified and the expected sample sizes differ among strata, `caty_n` is a list where each element is named as a stratum in `n_base`. Each stratum's list element is a named vector whose names represent each level of `caty_var` and whose values represent each level's expected sample size (within the stratum). The sum of the values in each stratum's list element must equal that stratum's value in `n_base`.
`aux_var`	A character string containing the name of the column from `sframe` that represents the proportional (to size) inclusion probability variable (auxiliary variable). This auxiliary variable must be positive, and the resulting inclusion probabilities are proportional to the values of the auxiliary variable. Larger values of the auxiliary variable result in higher inclusion probabilities.
`legacy_var`	This argument can be used instead of `legacy_sites` when `sframe` is a `POINT` or `MULTIPOINT` geometry (i.e., a finite sampling frame). When `legacy_var` is used, it is a character string containing the name of the column from `sframe` that represents whether each site is a legacy site. For legacy sites, the values of the `legacy_var` column must be character strings that act as a legacy site identifier. For non-legacy sites, the values of the `legacy_var` column must be `NA`. Using this approach, `legacy_stratum_var`, `legacy_caty_var`, and `legacy_aux_var` are not required and should not be used (because `legacy_var` represents a column in `sframe`). `spsurvey` assumes that the legacy sites were selected from a previous sampling design that incorporated randomness into site selection and that the legacy sites are elements of the current sampling frame.
`legacy_sites`	An sf object with a `POINT` or `MULTIPOINT` geometry representing the legacy sites. spsurvey assumes that the legacy sites were selected from a previous sampling design that incorporated randomness into site selection and that the legacy sites are elements of the current sampling frame. If `sframe` has a `POINT` or `MULTIPOINT` geometry, the observations in `legacy_sites` should not also be in `sframe` (i.e., duplicates are not removed). Thus, `sframe` and `legacy_sites` together compose the current sampling frame. If m or z values are in `legacy_sites`' geometry, they are silently dropped (i.e., only x-coordinates and y-coordinates are preserved).
`legacy_stratum_var`	A character string containing the name of the column from `legacy_sites` that identifies stratum membership for each element of `legacy_sites`. This argument is required when the sampling design is stratified and its levels must be contained in the levels of the `stratum_var` variable. The default value of `legacy_stratum_var` is `stratum_var`, so `legacy_stratum_var` need only be specified explicitly when the name of the stratification variable in `legacy_sites` differs from `stratum_var`.
`legacy_caty_var`	A character string containing the name of the column from `legacy_sites` that identifies the unequal probability variable for each element of `legacy_sites`. This argument is required when the sampling design uses unequal selection probabilities and its categories must be contained in the levels of the `caty_var` variable. The default value of `legacy_caty_var` is `caty_var`, so `legacy_caty_var` need only be specified explicitly when the name of the unequal probability variable in `legacy_sites` differs from `caty_var`.
`legacy_aux_var`	A character string containing the name of the column from `legacy_sites` that identifies the proportional probability variable for each element of `legacy_sites`. This argument is required when the sampling design uses proportional selection probabilities and the values of the `legacy_aux_var` variable must be positive. The default value of `legacy_aux_var` is `aux_var`, so `legacy_aux_var` need only be specified explicitly when the name of the proportional probability variable in `legacy_sites` differs from `aux_var`.
`mindis`	A numeric value indicating the desired minimum distance between sampled sites. If the sampling design is stratified and `mindis` is a numeric value, the minimum distance is applied to all strata. If the sampling design is stratified and different minimum distances are desired among strata, then `mindis` is a list whose names match the names of `n_base` and whose values are the minimum distance for the corresponding stratum. If a minimum distance is not desired for a particular stratum, then the corresponding value in `mindis` should be `0` or `NULL` (which is equivalent to `0`). The units of `mindis` must represent the units in `sframe`. A warning is returned if the minimum distance could not be reached after `maxtry` attempts. If legacy sites are used, the minimum distance requirement (and subsequent warning if `maxtry` attempts are reached) is enforced only for base sites that are not legacy sites; for these sites, distances are compared against all base sites (legacy and non-legacy).
`maxtry`	The maximum number of attempts to apply the minimum distance algorithm to obtain the desired minimum distance between sites. Each iteration takes roughly as long as the standard IRS algorithm. Successive iterations will always contain at least as many sites satisfying the minimum distance requirement as the previous iteration. The algorithm stops when the minimum distance requirement is met or after `maxtry` iterations. The default maximum number of iterations is `10`.
`n_over`	The number of reverse hierarchically ordered (rho) replacement sites. If the sampling design is unstratified, then `n_over` is an integer specifying the number of rho replacement sites desired. If the sampling design is stratified, then `n_over` is a vector (or list) whose names match the names of `n_base` and whose values indicate the number of rho replacement sites for each stratum. If replacement sites are not desired for a particular stratum, then the corresponding value in `n_over` should be `0` or `NULL` (which is equivalent to `0`). If the sampling design is stratified but the number of `n_over` sites is the same in each stratum, `n_over` can be an unnamed, length-one vector, in which case its value is recycled and used for each stratum. Note that if the sampling design has unequal selection probabilities (`seltype = "unequal"`), then `n_over` sites are given the same proportion of `caty_n` values as `n_base`.
`n_near`	The number of nearest neighbor (nn) replacement sites. If the sampling design is unstratified, `n_near` is an integer from `1` to `10` specifying the number of nn replacement sites to be selected for each base site. If the sampling design is stratified but the same number of nn replacement sites is desired for each stratum, `n_near` is an integer from `1` to `10` specifying the number of nn replacement sites to be selected for each base site. If the sampling design is stratified and a different number of nn replacement sites is desired for each stratum, `n_near` is a vector (or list) whose names represent strata and whose values are integers from `1` to `10` specifying the number of nn replacement sites to be selected for each base site in the stratum. If replacement sites are not desired for a particular stratum, then the corresponding value in `n_near` should be `0` or `NULL` (which is equivalent to `0`). For infinite sampling frames, the distance between a site and its nn depends on `pt_density`. The larger `pt_density`, the closer the nearest neighbors.
`wgt_units`	The units used to compute the design weights. These units must be standard units as defined by the `set_units()` function in the units package. The default units match the units of the sf object.
`pt_density`	A positive integer controlling the density of the GRTS approximation for infinite sampling frames. The GRTS approximation for infinite sampling frames vastly improves computational efficiency by generating many finite points and selecting a sample from the points. `pt_density` represents the density of finite points per unit to use in the approximation. More specifically, for each stratum, the number of points used in the approximation equals `pt_density * (n_base + n_over)`. A larger value of `pt_density` means a closer approximation to the infinite sampling frame but less computational efficiency. The default value of `pt_density` is `10`. Note that when used with `caty_n`, the unequal inclusion probabilities generated from this approach are also approximations.
`DesignID`	A character string indicating the naming structure for each site's identifier selected in the sample, which is matched with `SiteBegin` and included as a variable in the sf object in the function's output. The default is `"Site"`.
`SiteBegin`	A number indicating the first number to match with `DesignID` while creating each site's identifier selected in the sample. Successive sites are given successive integers. The default starting number is `1` and the number of digits is equal to the number of digits in `n_base + n_over`. For example, if `n_base` is 50 and `n_over` is 0, then the default site identifiers are `Site-01` to `Site-50`.
`sep`	A character string that acts as a separator between `DesignID` and `SiteBegin`. The default is `"-"`.
`projcrs_check`	A check for whether the coordinates are projected. If `TRUE`, an error is returned if coordinates are not projected (i.e., they are geographic or NA). If `FALSE`, the check is not performed, which means that the crs in `sframe` (and `legacy_sites` if provided) can be projected, geographic, or NA.
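
The interplay between `seltype`, `caty_var`, `caty_n`, and `aux_var` can be sketched as follows. This is a minimal illustration, not a recommended design: it assumes the `NE_Lakes` example data shipped with spsurvey, including its `AREA_CAT` column (levels `small` and `large`) and its positive `AREA` column, and the sample sizes are arbitrary.

```r
library(spsurvey)

set.seed(51)  # make the random selection reproducible

# Unequal inclusion probabilities: expected sample size per level of AREA_CAT.
# The values of caty_n must sum to n_base.
caty_n <- c(small = 10, large = 40)
samp_unequal <- irs(
  NE_Lakes,
  n_base = 50,
  caty_var = "AREA_CAT",
  caty_n = caty_n
)

# Proportional inclusion probabilities: lakes with larger AREA values are
# more likely to be selected. seltype is left at its default, which is
# inferred from aux_var being specified.
samp_prop <- irs(NE_Lakes, n_base = 50, aux_var = "AREA")

# The design element records the selection type that was actually used.
samp_unequal$design$seltype
samp_prop$design$seltype
```

Leaving `seltype` unspecified in both calls relies on the default matching described for the `seltype` argument; it can also be passed explicitly.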

Details

n_base is the number of sites used to calculate the design weights, which is typically the number of sites used in an analysis. When a panel sampling design is implemented, n_base is typically the number of sites in all panels that will be sampled in the same temporal period – n_base is not the total number of sites in all panels. The sum of n_base and n_over is equal to the total number of sites to be visited for all panels plus any replacement sites that may be required.
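
Because legacy sites count toward n_base (see the n_base argument), the base sample is split between the sites_legacy and sites_base elements of the output. A minimal sketch, assuming the `NE_Lakes` example data shipped with spsurvey and its `LEGACY` column (a site identifier for legacy lakes, `NA` otherwise):

```r
library(spsurvey)

set.seed(51)  # make the random selection reproducible

# n_base = 50 is the total of legacy sites plus newly selected sites.
samp_legacy <- irs(NE_Lakes, n_base = 50, legacy_var = "LEGACY")

n_legacy <- nrow(samp_legacy$sites_legacy)  # legacy sites carried forward
n_new <- nrow(samp_legacy$sites_base)       # newly selected base sites

# Together the two pieces account for the requested base sample size.
n_legacy + n_new
```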

Value

The sampling design sites and additional information about the sampling design. More specifically, it is a list with five elements:

sites_legacy An sf object containing legacy sites. This is NULL if legacy sites were not included in the sample.
sites_base An sf object containing the base sites. This is NULL if n_base equals the number of legacy sites.
sites_over An sf object containing the reverse hierarchically ordered replacement sites. This is NULL if no reverse hierarchically ordered replacement sites were included in the sample.
sites_near An sf object containing the nearest neighbor replacement sites. This is NULL if no nearest neighbor replacement sites were included in the sample.
design A list documenting the specifications of this sampling design. This can be checked to verify your sampling design ran as intended.
- call The original function call.
- stratum_var The name of the stratification variable in sframe. This equals NULL if no stratification is used.
- stratum The unique strata. This equals "None" if the sampling design is unstratified.
- n_base The base sample size per stratum.
- seltype The selection type per stratum.
- caty_var The name of the unequal probability variable in sframe. This equals NULL if no unequal probability variable is used.
- caty_n The expected sample sizes for each level of the unequal probability grouping variable per stratum. This equals NULL when seltype is not "unequal".
- aux_var The name of the proportional probability (auxiliary) variable in sframe. This equals NULL if no proportional probability variable is used.
- legacy A logical variable indicating whether legacy sites were included in the sample.
- legacy_stratum_var The name of the stratification variable in legacy_sites. Omitted if legacy sites are not used. This equals NULL if legacy sites were used but no stratification variable is used.
- legacy_caty_var The name of the unequal probability variable in legacy_sites. Omitted if legacy sites are not used. This equals NULL if legacy sites were used but no unequal probability variable is used.
- legacy_aux_var The name of the proportional probability (auxiliary) variable in legacy_sites. Omitted if legacy sites are not used. This equals NULL if legacy sites were used but no proportional probability variable is used.
- mindis The minimum distance requirement desired. This is NULL when no minimum distance requirement was applied.
- n_over The reverse hierarchically ordered replacement site sample sizes per stratum. If seltype is unequal, this represents the expected sample sizes. This is NULL when no reverse hierarchically ordered replacement sites were selected.
- n_near The number of nearest neighbor replacement sites desired. This is NULL when no nearest neighbor replacement sites were selected.

When non-NULL, the sites_legacy, sites_base, sites_over, and sites_near objects contain the original columns in sframe and include a few additional columns. These additional columns are

siteID A site identifier (as named using the DesignID and SiteBegin arguments to irs()).
siteuse Whether the site is a legacy site (Legacy), base site (Base), reverse hierarchically ordered replacement site (Over), or nearest neighbor replacement site (Near).
replsite The replacement site ordering. replsite is None if the site is not a replacement site, Next if it is the next reverse hierarchically ordered replacement site to use, or Near_, where the word following _ indicates the ordering of sites closest to the originally sampled site.
lon_WGS84 Longitude coordinates using the WGS84 coordinate system (EPSG:4326). Only given if coordinates are projected.
lat_WGS84 Latitude coordinates using the WGS84 coordinate system (EPSG:4326). Only given if coordinates are projected.
X Longitude coordinates using the provided coordinate system. Only given if coordinates are not projected (i.e., they are geographic or NA).
Y Latitude coordinates using the provided coordinate system. Only given if coordinates are not projected (i.e., they are geographic or NA).
stratum A stratum indicator. stratum is None if the sampling design was unstratified. If the sampling design was stratified, stratum indicates the stratum.
wgt The design weight.
ip The site's original inclusion probability (the reciprocal of wgt).
caty An unequal probability grouping indicator. caty is None if the sampling design did not use unequal inclusion probabilities. If the sampling design did use unequal inclusion probabilities, caty indicates the unequal probability level.
aux The auxiliary proportional probability variable. This column is only returned if seltype was proportional in the original sampling design.

If any columns in sframe contain these names, those columns from sframe will be automatically prefixed with sframe_ in the sites object. When output is printed, a summary of site counts by the levels in stratum_var and caty_var is shown.
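
One way to inspect the returned object is sketched below. It assumes the `NE_Lakes` example data shipped with spsurvey and uses `sf::st_drop_geometry()` only to make the site-level columns easier to read; the sample sizes are arbitrary.

```r
library(spsurvey)

set.seed(51)  # make the random selection reproducible
samp <- irs(NE_Lakes, n_base = 25, n_over = 5)

# Drop the geometry to view the added columns as a plain data frame.
base_sites <- sf::st_drop_geometry(samp$sites_base)
head(base_sites[, c("siteID", "siteuse", "replsite", "stratum", "wgt", "ip", "caty")])

# Compare the inclusion probabilities with the reciprocal of the weights.
all.equal(base_sites$ip, 1 / base_sites$wgt)

# Replacement sites are stored in their own element of the list.
over_sites <- sf::st_drop_geometry(samp$sites_over)
table(over_sites$siteuse)

# The design element documents the call and the settings that were used.
samp$design$n_base
samp$design$n_over
```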

Author(s)

Tony Olsen olsen.tony@epa.gov

Examples

## Not run: 
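library(spsurvey)
# Unstratified, equal probability sample of 100 lakes
samp <- irs(NE_Lakes, n_base = 100)
print(samp)
# Stratified sample: the names of n_base must match the values of ELEV_CAT
strata_n <- c(low = 25, high = 30)
samp_strat <- irs(NE_Lakes, n_base = strata_n, stratum_var = "ELEV_CAT")
print(samp_strat)
# Base sample of 30 sites plus 5 replacement sites
samp_over <- irs(NE_Lakes, n_base = 30, n_over = 5)
print(samp_over)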

## End(Not run)
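
The minimum distance and nearest neighbor options described under Arguments can be exercised in the same way. This is a hedged sketch: it assumes the `NE_Lakes` example data shipped with spsurvey, and the value `1600` is an arbitrary distance expressed in the units of that data set's projected coordinate system (assumed here to be meters).

```r
library(spsurvey)

set.seed(51)  # make the random selection reproducible

# Request at least 1600 units between sampled sites. A warning is issued
# if the requirement cannot be met within maxtry attempts.
samp_mindis <- irs(NE_Lakes, n_base = 30, mindis = 1600, maxtry = 10)

# Request one nearest neighbor replacement site for each base site.
samp_near <- irs(NE_Lakes, n_base = 30, n_near = 1)

# Nearest neighbor replacement sites are returned in sites_near.
head(sf::st_drop_geometry(samp_near$sites_near)[, c("siteID", "siteuse", "replsite")])
```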

spsurvey documentation built on June 22, 2024, 7:36 p.m.