create_resamples: Create samples for fitting, calibrating, and validating...

create_resamples R Documentation

Create samples for fitting, calibrating, and validating models

Description

The function creates (sub)samples of the data for three model blocks: train (fitting), test (calibration/tuning), and validate. By default, samples are created through bootstrapping, i.e. with replacement. This means observations can be repeated within a given block, but observations included in one block are necessarily excluded from the others (e.g. observations selected for validation are absent from the train and test blocks). Samples can be created at random (if sp_strat = NULL, the default) or with spatial stratification (spatial strata can be created with the function spat_strat()). In the latter case, train and test sets are spatially split, to allow for a more thorough cross-validation when defining the penalty parameter in the penalized regressions. Samples may also include a specific variable H0 (with classes or groups) to be used for (block cross-)validation if colH0 is provided, but this is not a requirement.
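
The blocking logic described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper name sketch_resample() and its internals (including the use of floor() for block sizes and the element name validate) are assumptions made for illustration only. It first partitions the row indices into three disjoint pools and then bootstraps within each pool, so that an observation may repeat inside a block but never appears in two blocks.

```r
# Hypothetical helper, for illustration only (not part of oneimpact).
sketch_resample <- function(n, p = c(0.4, 0.2, 0.2), replace = TRUE) {
  stopifnot(length(p) == 3, all(p >= 0), sum(p) <= 1)
  size <- floor(n * p)                      # observations per block (assumed rounding)
  shuffled <- sample.int(n)                 # random permutation of row indices
  # Disjoint pools: consecutive chunks of the permutation
  pool <- list(
    train    = shuffled[seq_len(size[1])],
    test     = shuffled[size[1] + seq_len(size[2])],
    validate = shuffled[size[1] + size[2] + seq_len(size[3])]
  )
  # Resample within each pool, keeping the block size
  lapply(pool, function(idx) idx[sample.int(length(idx), replace = replace)])
}

set.seed(1)
s <- sketch_resample(200)
lengths(s)
# Number of indices shared between the train and test blocks
length(intersect(s$train, s$test))
```

Because resampling happens only after the pools are separated, setting replace = FALSE in this sketch simply returns each pool in a new random order.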

Usage

create_resamples(
  y,
  times = 10,
  p = c(0.4, 0.2, 0.2),
  max_size_blockH0_validation = 1000,
  max_size_blockH0_train = 1000,
  max_size_blockH0_test = 1000,
  max_number_blocksH1_train = 40,
  sp_strat = NULL,
  colH0 = NULL,
  H0setup = c("LAO", "LOO")[1],
  replace = TRUE
)

Arguments

y

⁠[vector]⁠
A vector of outcomes. It can be the response variable for the data set of interest, or only the used locations (case = 1) for conditional logistic (step-selection) analyses.

times

⁠[numeric(1)=10]⁠
The number of partitions or samples to be sampled.

p

⁠[numeric(3)=c(0.4,0.2,0.2)]⁠
A 3-element numeric vector with the proportion of the data that goes to fitting/training (H1), testing (H2), and validation (H0). Values should be between 0 and 1 and should not sum to more than 1.

max_size_blockH0_validation

⁠[numeric(1)=1000]⁠
Maximum size of each block H0 (e.g. population, area, year) in the validation set. Used to limit the number of observations in the validation set, so that block H0 levels with many observations are not oversampled in imbalanced data sets. To find meaningful values for this parameter, use explore_blocks_pre() and explore_blocks().

max_size_blockH0_train

⁠[numeric(1)=1000]⁠
Maximum size of each block H0 (e.g. population, area, year) in the set used for training/fitting the model. Used to limit the number of observations in the train set, so that block H0 levels with many observations are not oversampled in imbalanced data sets. To find meaningful values for this parameter, use explore_blocks().

max_size_blockH0_test

⁠[numeric(1)=1000]⁠
Not implemented yet.

max_number_blocksH1_train

⁠[numeric(1)=40]⁠
Maximum number of levels or blocks H1 to be used for model fitting/training. This is only meaningful if there is spatial stratification (i.e. if sp_strat is not NULL). To find meaningful values for this parameter, use explore_blocks().

sp_strat

⁠[data.frame]⁠
Default is NULL. If not NULL, the data.frame resulting from spat_strat() should be provided here.

colH0

⁠[numeric,character,vector]⁠
Column number or name to define the IDs of the H0 level, i.e. the one with ecological meaning (e.g. individual, population, or study area), used for validating the predictions of the fitted model. If sp_strat is provided, colH0 is a string with the column name (or the column number) in the sp_strat table. If sp_strat = NULL, colH0 is a vector of H0 values with the same length as y. If colH0 = NULL (default), no H0 level is defined and there is no block cross-validation in the bootstrapped sets.

H0setup

Not implemented yet.

replace

⁠[logical(1)=TRUE]⁠
Whether to perform the bootstrap sampling with or without replacement (Default is TRUE).
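
The effect of the max_size_blockH0_* caps described above can be sketched in base R. The code below is an illustration under assumptions, not the package's internal code: cap_per_block() is a hypothetical helper that keeps at most max_size indices for each level of a grouping vector, so that levels with many observations do not dominate the sample.

```r
# Hypothetical helper, for illustration only (not part of oneimpact).
cap_per_block <- function(block, max_size) {
  idx_by_block <- split(seq_along(block), block)
  capped <- lapply(idx_by_block, function(idx) {
    # Keep every index if the block is small, else a random subset
    idx[sample.int(length(idx), min(length(idx), max_size))]
  })
  sort(unlist(capped, use.names = FALSE))
}

set.seed(1)
# Imbalanced toy data: block "a" has far more observations than "b" and "c"
block <- rep(c("a", "b", "c"), times = c(500, 30, 20))
kept <- cap_per_block(block, max_size = 100)
table(block[kept])
```

In this toy case only block "a" exceeds the cap, so it is the only one that gets thinned; the smaller blocks are kept in full.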

Value

A list containing lists for the train, test, and validation sets, each holding the indices of the observations to be kept in each resample. If colH0 is not NULL, a vector with the block H0 to which each observation belongs is also appended to the output. If sp_strat is provided, a list of blocks H0 and possibly a list of strata might also be returned.
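
To make this structure concrete, the sketch below builds a small mock object by hand and uses it to subset a toy data set. The mock only mirrors the train and test elements accessed in the Examples; it is an assumption for illustration, not actual output of create_resamples(), and the validation element is left out because its name is not shown on this page.

```r
set.seed(1)
dat <- data.frame(y = runif(20), x = rnorm(20))

# Hand-made stand-in for two resamples (indices into the rows of `dat`)
mock <- list(
  train = list(c(1, 1, 4, 7, 9), c(2, 3, 3, 8, 10)),
  test  = list(c(12, 15, 15),    c(11, 14, 16))
)

# Data for the first resample: repeated indices yield repeated rows
train_1 <- dat[mock$train[[1]], ]
test_1  <- dat[mock$test[[1]], ]
nrow(train_1)

# Fit on the training rows and predict on the test rows of each resample
preds <- lapply(seq_along(mock$train), function(i) {
  fit <- lm(y ~ x, data = dat[mock$train[[i]], ])
  predict(fit, newdata = dat[mock$test[[i]], ])
})
lengths(preds)
```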

Examples

# random sampling, no validation block H0
y <- runif(200)
samples <- create_resamples(y, p = c(0.4, 0.2, 0.2), times = 5)
samples

# with validation block H0
data(reindeer)
library(terra)
library(amt)

# random sampling, with validation block H0
samples <- create_resamples(1:nrow(reindeer), times = 5,
                            p = c(0.2, 0.2, 0.2),
                            max_size_blockH0_validation = 1000,
                            colH0 = reindeer$original_animal_id)
samples

# spatially stratified sampling, with validation block H0
spst <- spat_strat(reindeer, coords = c("x", "y"), colH0 = "original_animal_id",
                   all_cols = FALSE)
samples <- create_resamples(1:nrow(reindeer), times = 5,
                            p = c(0.2, 0.2, 0.2),
                            max_number_blocksH1_train = 20,
                            sp_strat = spst,
                            colH0 = "blockH0")
samples
sum(is.na(samples$test[[1]]))
sapply(samples$train, function(x) sum(is.na(x)))
sapply(samples$test, function(x) sum(is.na(x)))

# a small number of blocks or too high a p[1] might cause errors
samples <- create_resamples(1:nrow(reindeer), times = 10,
                            max_number_blocksH1_train = 3,
                            sp_strat = spst,
                            colH0 = "blockH0")


NINAnor/oneimpact documentation built on June 14, 2025, 12:27 a.m.