sampcont: Unmatched Control Sampling
In MapGAM: Mapping Smoothed Effect Estimates from Individual-Level Data

Description

Take all cases and a random sample of controls from a data frame. Simple random sampling and spatially stratified random sampling are available. For spatially stratified random sampling, strata can be defined by region, or by region and additional stratification variables (see Tang et al., 2023 for examples and simulation comparisons). If no regions are specified for stratified sampling, the function creates a regular grid and uses its cells as the regions.
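
To make the idea of grid-based regions concrete, the following base R sketch assigns points to cells of a regular grid. It is only an illustration under stated assumptions: `grid_cells` is a hypothetical helper, not a MapGAM function, and the grid that `sampcont` builds internally may be constructed differently.

```r
# Hypothetical helper (not part of MapGAM): label each point with the cell
# of a regular nrow x ncol grid spanning the range of the coordinates.
grid_cells <- function(x, y, nrow = 100, ncol = 100) {
  # cut() with a single number for `breaks` makes that many equal-width bins
  col_id <- as.integer(cut(x, breaks = ncol, include.lowest = TRUE))
  row_id <- as.integer(cut(y, breaks = nrow, include.lowest = TRUE))
  (row_id - 1L) * ncol + col_id   # one integer label per grid cell
}

set.seed(1)
x <- runif(500, -120, -70)   # simulated longitudes
y <- runif(500, 25, 50)      # simulated latitudes
regions <- grid_cells(x, y, nrow = 4, ncol = 5)
table(regions)               # number of points falling in each cell
```

A vector built this way has one entry per row of the data, which is the shape the `regions` argument expects.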

Usage

sampcont(rdata, type = "stratified", casecol=1, Xcol=2, Ycol=3, regions = NULL, 
		addstrat = NULL, times = NULL, n = 1, nrow = 100, ncol = 100)

Arguments

`rdata`	a data frame with case status in the `casecol` column (by default, the 1st column), and the geocoordinates (e.g., X and Y) in the `Xcol` and `Ycol` columns (by default, the 2nd and 3rd columns). Additional columns are not used in the sampling scheme but are retained in the sampled data frame.
`casecol`	the column number in `rdata` for the case status (coded with 0 for controls, 1 for cases).
`Xcol`	the column number in `rdata` for the X geocoordinate.
`Ycol`	the column number in `rdata` for the Y geocoordinate.
`type`	`"stratified"` (default) or `"simple"`. If `"simple"` then a simple random sample of `n` controls (rows of `rdata` with case status 0) is obtained. If `"stratified"` then a stratified random sample of controls is obtained, with up to `n` controls per stratum. Sampling strata are defined by the `regions` and `addstrat` arguments (or by `times`, which is retained for backward compatibility). All cases (rows with case status 1) are taken for the sample regardless of the value supplied for `type`.
`regions`	a vector of length equal to the number of rows in `rdata`, used to construct sampling strata. Only used if `type = "stratified"`. If `regions = NULL` then the function will define `regions` as a vector of grid cells on a regular grid with `nrow` rows and `ncol` columns. If `addstrat = NULL` then the nonempty regions are used as the sampling strata. If `addstrat` is a vector, then the sampling strata are all nonempty combinations of `regions` and `addstrat`.
`addstrat`	a vector of length equal to the number of rows in `rdata`, used along with `regions` to construct sampling strata. If `addstrat = NULL` then the sampling strata are defined only by the `regions` argument. If `addstrat` is a vector, then the sampling strata are all nonempty combinations of `regions` and `addstrat`. Continuous variables should generally be binned before being passed through this argument, as there are no efficiency gains if each value in `addstrat` is unique.
`times`	included for backward compatibility; now replaced by the `addstrat` argument which serves the same purpose.
`n`	the number of controls to sample from the eligible controls in each stratum. All available controls will be taken for strata with fewer than `n` eligible controls.
`nrow`	the number of rows used to create a regular grid for sampling regions. Only used when `regions = NULL`.
`ncol`	the number of columns used to create a regular grid for sampling regions. Only used when `regions = NULL`.
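
As a rough illustration of binning a continuous variable before using it as `addstrat`, the sketch below cuts a simulated covariate into quartiles with base R. The variable `age` and the names `mydata` and `myregions` in the final comment are hypothetical and do not come from the package.

```r
set.seed(2)
age <- runif(200, 18, 90)    # hypothetical continuous covariate

# Quartile bins: four strata instead of 200 unique values
age_bin <- cut(age,
               breaks = quantile(age, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE, labels = FALSE)
table(age_bin)

# The binned vector could then be passed along the lines of
# sampcont(mydata, regions = myregions, addstrat = age_bin)
```

Quartiles are only one possible choice; any binning that leaves several eligible controls per stratum serves the same purpose.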

Value

`rdata`	a data frame with all cases and a random sample of controls.
`w`	the inverse probability weights for the rows in `rdata`. These should be supplied as weights in subsequent analyses.
`ncont`	the total number of controls in the sample.
`type`	stratified or simple sampling, as specified by the same argument described above.
`gridsize`	a vector with the numbers of rows and columns for the stratified sampling grid.
`grid`	the stratified sampling grid in PolySet format.
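
The role of the weights can be sketched in base R. This is a minimal illustration of conventional inverse probability weighting, assuming each case gets weight 1 and each sampled control gets the number of eligible controls in its stratum divided by the number sampled; it is not taken from the `sampcont` source, and the simulated data frame `d` is hypothetical.

```r
set.seed(3)
# Hypothetical source data: case status (1 = case, 0 = control) and a stratum
d <- data.frame(
  case    = rbinom(3000, 1, 0.1),
  stratum = sample(c("A", "B", "C"), 3000, replace = TRUE)
)
n <- 50   # controls to sample per stratum

cases    <- transform(d[d$case == 1, ], w = 1)   # all cases kept, weight 1
controls <- d[d$case == 0, ]

# Sample up to n controls within each stratum and attach the weights
idx_by_stratum <- split(seq_len(nrow(controls)), controls$stratum)
picked <- lapply(idx_by_stratum, function(idx) {
  take <- idx[sample.int(length(idx), min(n, length(idx)))]
  data.frame(controls[take, ], w = length(idx) / length(take))
})
samp <- rbind(cases, do.call(rbind, picked))

mean(samp$case)                         # prevalence in the sample
weighted.mean(samp$case, w = samp$w)    # weighted estimate for the source
mean(d$case)                            # prevalence in the source data
```

Under this scheme the control weights within each stratum sum to the number of eligible controls there, so the weighted mean matches the source prevalence while the unweighted mean overstates it.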

Author(s)

Scott M. Bartell (sbartell@uci.edu) and Ian W. Tang

References

Tang IW, Bartell SM, Vieira VM. Unmatched Spatially Stratified Controls: A simulation study examining efficiency and precision using spatially-diverse controls and generalized additive models. Spatial and Spatio-temporal Epidemiology 2023, 45:100584. doi:10.1016/j.sste.2023.100584

Examples

#### load beertweets data, which has 719 cases and 9281 controls
data(beertweets)
# take a simple random sample of 1000 controls
samp1 <- sampcont(beertweets, type="simple", n=1000)

# take a stratified random sample of controls on an 80x50 grid
samp2 <- NULL

samp2 <- sampcont(beertweets, nrow=80, ncol=50)

# Compare locations for the two sampling designs (cases in red)
par(mfrow=c(2,1), mar=c(0,3,4,3))
plot(samp1$rdata$longitude, samp1$rdata$latitude, col=3-samp1$rdata$beer,
	cex=0.5, type="p", axes=FALSE, ann=FALSE)
# Show US base map if maps package is available
mapUS <- require(maps)
if (mapUS) map("state", add=TRUE)
title("Simple Random Sample, 1000 Controls")

if (!is.null(samp2)) {
	plot(samp2$rdata$longitude, samp2$rdata$latitude, 
		col=3-samp2$rdata$beer, cex=0.5, type="p", axes=FALSE, 
		ann=FALSE)
	if (mapUS) map("state", add=TRUE)
	title(paste("Spatially Stratified Sample,",samp2$ncont,"Controls"))
	}

par(mfrow=c(1,1))

## Note that weights are needed in statistical analyses
# Prevalence of cases in sample--not in source data
mean(samp1$rdata$beer)		 
# Estimated prevalence of cases in source data 
weighted.mean(samp1$rdata$beer, w=samp1$w)	
## Do beer tweet odds differ below the 36.5 degree parallel?
# Using full data
glm(beer~I(latitude<36.5), family=binomial, data=beertweets) 
# Stratified sample requires sampling weights 
if (!is.null(samp2)) glm(beer~I(latitude<36.5), family=binomial, 
	data=samp2$rdata, weights=samp2$w)

MapGAM documentation built on July 26, 2023, 5:12 p.m.