grid_sample  R Documentation 
Sample observation data on a spacetime grid to reduce spatiotemporal bias.
grid_sample(
x,
coords = c("longitude", "latitude", "day_of_year"),
is_lonlat = TRUE,
res = c(3000, 3000, 7),
jitter_grid = TRUE,
sample_size_per_cell = 1,
cell_sample_prop = 0.75,
keep_cell_id = FALSE,
grid_definition = NULL
)
grid_sample_stratified(
x,
coords = c("longitude", "latitude", "day_of_year"),
is_lonlat = TRUE,
unified_grid = FALSE,
keep_cell_id = FALSE,
by_year = TRUE,
case_control = TRUE,
obs_column = "obs",
sample_by = NULL,
min_detection_probability = 0,
maximum_ss = NULL,
jitter_columns = NULL,
jitter_sd = 0.1,
...
)
x 
data frame; observations to sample, including at least the columns defining the location in space and time. Additional columns can be included such as features that will later be used in model training. 
coords 
character; names of the spatial and temporal coordinates. By
default the spatial coordinates should be 
is_lonlat 
logical; whether the points are in unprojected, lon-lat coordinates. If so, the points will be projected to an equal area sinusoidal CRS prior to grid assignment. 
res 
numeric; resolution of the spatiotemporal grid in the x, y, and time dimensions. Unprojected locations are projected to an equal area coordinate system prior to sampling, so the spatial resolution should be provided in meters. The temporal resolution should be in the native units of the time coordinate in the input data frame; typically this will be a number of days. 
jitter_grid 
logical; whether to jitter the location of the origin of the grid to introduce some randomness. 
sample_size_per_cell 
integer; number of observations to sample from each grid cell. 
cell_sample_prop 
proportion 
keep_cell_id 
logical; whether to retain a unique cell identifier,
stored in column named 
grid_definition 
list defining the spatiotemporal sampling grid as
returned by 
unified_grid 
logical; whether a single, unified spatiotemporal
sampling grid should be defined and used for all observations in 
by_year 
logical; whether the sampling should be done by year, i.e.
sampling N observations per grid cell per year, rather than across years,
i.e. N observations per grid cell regardless of year. If using sampling by
year, the input data frame 
case_control 
logical; whether to apply case control sampling whereby presence and absence are sampled independently. 
obs_column 
character; if 
sample_by 
character; additional columns in 
min_detection_probability 
proportion 
maximum_ss 
integer; the maximum sample size in the final dataset. If
the grid sampling yields more than this number of observations,

jitter_columns 
character; if detections are oversampled to achieve the
minimum detection probability, some observations will be duplicated, and it
can be desirable to slightly "jitter" the values of model training features
for these duplicated observations. This argument defines the column names
in 
jitter_sd 
numeric; strength of the jittering in units of standard
deviations, see 
... 
additional arguments defining the spatiotemporal grid; passed to

grid_sample_stratified() performs stratified case control sampling, independently sampling from strata defined by, for example, year and detection/non-detection. Within each stratum, grid_sample() is used to sample the observations on a spatiotemporal grid. In addition, if case control sampling is turned on, the detections are oversampled to increase the frequency of detections in the dataset.
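The stratify-then-oversample idea can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper sample_stratum(), the per-stratum sample size of 20, the target prevalence of 0.5, and the jitter rule (noise with standard deviation 0.1 times the column's standard deviation) are all assumptions made for illustration, and plain random sampling stands in for the grid sampling that grid_sample() would perform within each stratum.

```r
set.seed(1)
# toy data: a year, a count column "obs" (0 = non-detection), and one feature
obs_data <- data.frame(
  year = sample(2016:2020, 500, replace = TRUE),
  obs = rpois(500, lambda = 0.1),
  forest_cover = runif(500)
)
obs_data$detected <- obs_data$obs > 0

# hypothetical helper: sample up to `n` rows from a single stratum
sample_stratum <- function(df, n) {
  df[sample.int(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

# split into year x detection strata and sample each one independently
strata <- split(obs_data, list(obs_data$year, obs_data$detected), drop = TRUE)
sampled <- do.call(rbind, lapply(strata, sample_stratum, n = 20))

# oversample detections (with replacement) to reach an assumed target prevalence
target_prevalence <- 0.5
det <- sampled[sampled$detected, , drop = FALSE]
nondet <- sampled[!sampled$detected, , drop = FALSE]
# detections required so that n_det / (n_det + n_nondet) >= target_prevalence
n_needed <- ceiling(target_prevalence * nrow(nondet) / (1 - target_prevalence))

balanced <- sampled
if (nrow(det) > 0 && nrow(det) < n_needed) {
  dup_rows <- sample.int(nrow(det), n_needed - nrow(det), replace = TRUE)
  extra <- det[dup_rows, , drop = FALSE]
  # slightly jitter the feature values of the duplicated rows (assumed rule)
  extra$forest_cover <- extra$forest_cover +
    rnorm(nrow(extra), sd = 0.1 * sd(sampled$forest_cover))
  balanced <- rbind(nondet, det, extra)
}

# prevalence of detections before and after oversampling
mean(sampled$detected)
mean(balanced$detected)
```

Sampling each stratum separately keeps rare strata from being swamped by common ones, and duplicating detections afterwards raises their share without discarding any non-detections.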
The sampling grid is defined, and assignment of locations to cells occurs, in assign_to_grid(). Consult the help for that function for further details on how the grid is generated and locations are assigned. Note that by providing 2-element vectors to both coords and res the time component of the grid can be ignored and spatial-only subsampling is performed.
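To make the grid idea concrete, the following base R sketch projects lon-lat points with a spherical sinusoidal projection, labels each point with a cell index, and keeps one random point per occupied cell. It is only an illustration under stated assumptions: sample_one_per_cell() is a hypothetical helper, the Earth radius is an assumed value, and the grid origin is fixed at zero rather than jittered. The actual logic lives in assign_to_grid() and may differ.

```r
set.seed(1)
# toy observations in unprojected lon-lat coordinates plus a day of year
pts <- data.frame(
  longitude = rnorm(1000, sd = 0.1),
  latitude = rnorm(1000, sd = 0.1),
  day_of_year = sample.int(28, 1000, replace = TRUE)
)

# sinusoidal (equal area) projection on a sphere; the radius is an assumption
earth_radius <- 6371007 # meters
pts$x <- earth_radius * (pts$longitude * pi / 180) * cos(pts$latitude * pi / 180)
pts$y <- earth_radius * (pts$latitude * pi / 180)

# hypothetical helper: label rows with a grid cell, keep one random row per cell
sample_one_per_cell <- function(df, coords, res) {
  stopifnot(length(coords) == length(res))
  # integer cell index along each dimension (grid origin fixed at 0)
  idx <- Map(function(col, r) floor(df[[col]] / r), coords, res)
  df$cell_id <- do.call(paste, c(unname(idx), sep = "_"))
  # choose one row at random from every occupied cell
  rows_by_cell <- split(seq_len(nrow(df)), df$cell_id)
  keep <- vapply(rows_by_cell, function(i) i[sample.int(length(i), 1)], integer(1))
  df[sort(unname(keep)), , drop = FALSE]
}

# spatiotemporal grid: 3 km x 3 km x 7 days
st <- sample_one_per_cell(pts, c("x", "y", "day_of_year"), c(3000, 3000, 7))
# spatial-only grid: 2-element coords and res drop the time dimension
sp <- sample_one_per_cell(pts, c("x", "y"), c(3000, 3000))

nrow(pts)
nrow(st)
nrow(sp)
```

Dropping the time dimension merges cells that differ only in time, so the spatial-only sample can never contain more rows than the spatiotemporal one.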
A data frame of the spatiotemporally sampled data.
set.seed(1)

# generate some example observations
n_obs <- 10000
checklists <- data.frame(longitude = rnorm(n_obs, sd = 0.1),
                         latitude = rnorm(n_obs, sd = 0.1),
                         day_of_year = sample.int(28, n_obs, replace = TRUE),
                         year = NA_integer_,
                         obs = rpois(n_obs, lambda = 0.1),
                         forest_cover = runif(n_obs),
                         island = as.integer(runif(n_obs) > 0.95))
# add a year column, giving more data to recent years
checklists$year <- sample(seq(2016, 2020), size = n_obs, replace = TRUE,
                          prob = seq(0.3, 0.7, length.out = 5))
# create several rare islands
checklists$island[sample.int(nrow(checklists), 9)] <- 2:10

# basic spatiotemporal grid sampling
sampled <- grid_sample(checklists)

# plot original data and grid sampled data
par(mar = c(0, 0, 0, 0))
plot(checklists[, c("longitude", "latitude")],
     pch = 19, cex = 0.3, col = "#00000033",
     axes = FALSE)
points(sampled[, c("longitude", "latitude")],
       pch = 19, cex = 0.3, col = "red")

# case control sampling stratified by year and island
# return a maximum of 1000 checklists
sampled_cc <- grid_sample_stratified(checklists, sample_by = "island",
                                     maximum_ss = 1000)

# case control sampling increases the prevalence of detections
mean(checklists$obs > 0)
mean(sampled$obs > 0)
mean(sampled_cc$obs > 0)

# stratifying by island ensures all levels are retained, even rare ones
table(checklists$island)
# normal grid sampling loses rare island levels
table(sampled$island)
# stratified grid sampling retains at least one observation from each level
table(sampled_cc$island)