sampcont: Unmatched Control Sampling

View source: R/sampcont.R

sampcontR Documentation

Unmatched Control Sampling

Description

Take all cases and a random sample of controls from a data frame. Simple random sampling and spatially stratified random sampling are available. For spatially statified random sampling, strata can be defined by region, or by region and additional stratification variables (see Tang et al., 2023 for examples and simulation comparisons). If no specific regions are specified with stratified sampling, the function will create a regular grid for spactially stratified sampling.

Usage

sampcont(rdata, type = "stratified", casecol=1, Xcol=2, Ycol=3, regions = NULL, 
		addstrat = NULL, times = NULL, n = 1, nrow = 100, ncol = 100)

Arguments

rdata

a data frame with case status in the casecol column (by default, the 1st column), and the geocoordinates (e.g., X and Y) in the Xcol and Ycol columns (by default, the 2nd and 3rd columns). Additional columns are not used in the sampling scheme but are retained in the sampled data frame.

casecol

the column number in rdata for the case status (coded with 0 for controls, 1 for cases).

Xcol

the column number in rdata for the X geocoordinate.

Ycol

the column number in rdata for the Y geocoordinate.

type

"stratified" (default) or "simple". If "simple" then a simple random sample of n controls (rows of rdata with outcome=0) is obtained. If "stratified" then a stratified random sample of controls is obtained, with up to n controls per stratum. Sampling strata are defined by the regions and times arguments. All cases (rows with outcome=1) are taken for the sample regardless of the value supplied for type.

regions

a vector of length equal to the number of rows in rdata, used to construct sampling strata. Only used if type = "stratified". If regions = NULL then the function will define regions as a vector of specific grid cells on a regular grid with nrow rows and ncol columns. If times = NULL then the nonempty regions are used as the sampling strata. If times is a vector, then the sampling strata are all nonempty combinations of regions and times.

addstrat

a vector of length equal to the number of rows in rdata, used along with regions to construct sampling strata. If addstrat = NULL then the sampling strata are defined only by the regions argument. If addstrat is a vector, then the sampling strata are all nonempty combinations of regions and addstrat. Continuous variables should generally be binned before being passed through this argument, as there are no efficiency gains if each value in addstrat is unique.

times

included for backward compatibility; now replaced by the addstrat argument which serves the same purpose.

n

the number of controls to sample from the eligible controls in each stratum. All available controls will be taken for strata with fewer than n eligible controls.

nrow

the number of rows used to create a regular grid for sampling regions. Only used when regions = NULL.

ncol

the number of columns used to create a regular grid for sampling regions. Only used when regions = NULL.

Value

rdata

a data frame with all cases and a random sample of controls.

w

the inverse probability weights for the rows in rdata. Important to include as weights in subsequent analyses.

ncont

the total number of controls in the sample.

type

statified or simple sampling, as specified by the same argument described above.

gridsize

a vector with the numbers of rows and columns for the stratified sampling grid.

grid

the stratified sampling grid in PolySet format.

Author(s)

Scott M. Bartell and Ian W. Tang sbartell@uci.edu.

References

Tang IW, Bartell SM, Vieira VM. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.sste.2023.100584")}Unmatched Spatially Stratified Controls: A simulation study examining efficiency and precision using spatially-diverse controls and generalized additive models. Spatial and Spatio-temporal Epidemiology 2023, 45:100584.

See Also

modgam

Examples

#### load beertweets data, which has 719 cases and 9281 controls
data(beertweets)
# take a simple random sample of 1000 controls
samp1 <- sampcont(beertweets, type="simple", n=1000)

# take a stratified random sample of controls on a 80x50 grid
samp2 <- NULL

samp2 <- sampcont(beertweets, nrow=80, ncol=50)

# Compare locations for the two sampling designs (cases in red)
par(mfrow=c(2,1), mar=c(0,3,4,3))
plot(samp1$rdata$longitude, samp1$rdata$latitude, col=3-samp1$rdata$beer,
	cex=0.5, type="p", axes=FALSE, ann=FALSE)
# Show US base map if maps package is available
mapUS <- require(maps)
if (mapUS) map("state", add=TRUE)
title("Simple Random Sample, 1000 Controls")

if (!is.null(samp2)) {
	plot(samp2$rdata$longitude, samp2$rdata$latitude, 
		col=3-samp2$rdata$beer, cex=0.5, type="p", axes=FALSE, 
		ann=FALSE)
	if (mapUS) map("state", add=TRUE)
	title(paste("Spatially Stratified Sample,",samp2$ncont,"Controls"))
	}

par(mfrow=c(1,1))

## Note that weights are needed in statistical analyses
# Prevalence of cases in sample--not in source data
mean(samp1$rdata$beer)		 
# Estimated prevalence of cases in source data 
weighted.mean(samp1$rdata$beer, w=samp1$w)	
## Do beer tweet odds differ below the 36.5 degree parallel?
# Using full data
glm(beer~I(latitude<36.5), family=binomial, data=beertweets) 
# Stratified sample requires sampling weights 
if (!is.null(samp2)) glm(beer~I(latitude<36.5), family=binomial, 
	data=samp2$rdata, weights=samp2$w)

MapGAM documentation built on July 26, 2023, 5:12 p.m.