Species Distribution Models
In vectra: Columnar Query Engine for Larger-than-RAM Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(1)
has_biglm <- requireNamespace("biglm", quietly = TRUE)
library(vectra)

What vectra does for a distribution model

A species distribution model relates occurrence records to environmental predictors. The predictors usually arrive as raster layers (climate, terrain, land cover), and the records as point coordinates. The first step of almost every workflow is the same: sample each raster at the occurrence coordinates to build a table of predictor values, then fit a model to that table.

vectra covers two parts of this pipeline directly. It reads GeoTIFF rasters (including tiled and BigTIFF layouts) and samples them at point coordinates with tiff_extract_points(), reading only the strips that contain query points, so a sparse point set over a continental raster touches a small fraction of the file. And it consumes a query one batch at a time, so a model matrix that exceeds memory can still be fitted: chunk_feeder() streams the data through biglm::bigglm() one chunk per iteration.

Two operations sit outside vectra's scope, and the rest of this article assumes they are handled before extraction:

Vector geometry. vectra has no polygons, lines, or spatial joins. Buffer a study region, intersect with a range map, or rasterize a shapefile using sf or terra first.
Reprojection. vectra reads the coordinate reference system of a GeoTIFF with tiff_crs() but does not transform coordinates. Occurrence points must already be in the raster's CRS when they reach tiff_extract_points().

Align coordinates before extracting

tiff_extract_points() maps each coordinate to a pixel through the raster's affine geotransform. It does this in the raster's own CRS, so points recorded in a different CRS must be reprojected first. Read the raster CRS with tiff_crs(), then transform the points to match:

# Occurrences recorded in lon/lat (EPSG:4326), raster in a projected CRS.
target <- tiff_crs("predictors.tif")$epsg          # e.g. 3035 (LAEA Europe)

pts <- sf::st_as_sf(occ, coords = c("lon", "lat"), crs = 4326)
pts <- sf::st_transform(pts, target)
xy  <- sf::st_coordinates(pts)

env <- tiff_extract_points("predictors.tif", x = xy[, 1], y = xy[, 2])

When the points and the raster already share a CRS this step is unnecessary. The examples below work in lon/lat throughout.

Build a predictor table

We synthesize a small two-layer climate raster over Central Europe and write it to a GeoTIFF with an EPSG:4326 GeoKey. Layer band1 stands in for mean annual temperature and band2 for annual precipitation.

nx <- 80; ny <- 50
xmin <- 5; xmax <- 17; ymin <- 45; ymax <- 55
xres <- (xmax - xmin) / nx
yres <- (ymax - ymin) / ny

g <- expand.grid(col = 0:(nx - 1), row = 0:(ny - 1))
g$x <- xmin + (g$col + 0.5) * xres          # pixel centres
g$y <- ymax - (g$row + 0.5) * yres          # row 1 is the northern edge
g$band1 <- 14 - 0.6 * (g$y - 50) + 0.2 * (g$x - 11) + rnorm(nrow(g), 0, 0.3)
g$band2 <- 750 + 25 * (g$y - 50) - 8 * (g$x - 11) + rnorm(nrow(g), 0, 10)

tif <- tempfile(fileext = ".tif")
write_tiff(g[, c("x", "y", "band1", "band2")], tif, crs = 4326L)

tiff_crs(tif)

Now we sample the raster at a set of survey sites. tiff_extract_points() returns one row per point with the input coordinates and one column per band. Points outside the raster come back as NA.

N <- 1500
sites <- data.frame(
  x = runif(N, xmin + 0.3, xmax - 0.3),
  y = runif(N, ymin + 0.3, ymax - 0.3)
)

env <- tiff_extract_points(tif, sites)
head(env)

We standardize the two predictors once and store the prepared columns, then draw a presence/absence outcome from them. Standardizing before the data reaches the model lets every streaming batch share the same transform, which the out-of-core fit below relies on. With the generating coefficients known, each fit can be checked against them.

occ <- data.frame(
  bio1  = as.numeric(scale(env$band1)),   # standardized once, then stored
  bio12 = as.numeric(scale(env$band2))
)
eta <- -0.2 + 1.3 * occ$bio1 - 0.9 * occ$bio12
occ$presence <- rbinom(N, 1, plogis(eta))
table(occ$presence)

Fit in memory when the table fits

For most studies the extracted table has thousands to a few million rows and a handful of predictors, which fits in memory comfortably even when the source rasters are enormous. The ordinary path is to extract, then fit with glm():

fit <- glm(presence ~ bio1 + bio12, data = occ, family = binomial())
round(coef(fit), 3)

The recovered coefficients sit close to the values used to generate the data (intercept -0.2, bio1 1.3, bio12 -0.9).

Stream when the table is larger than memory

Dense background sampling, many predictors, or a model fitted across a whole continent can push the predictor table past available RAM. Here vectra streams the table through the model in bounded pieces. We write the modelling frame to a .vtr file to stand in for a table too large to hold in memory.

vtr <- tempfile(fileext = ".vtr")
write_vtr(occ, vtr)

A single pass with collect_chunked()

collect_chunked() folds a function over the query one batch at a time. It is the right tool for any summary that accumulates in bounded space across the data. Here we compute the class balance and per-class predictor means in a single streaming pass, never holding more than one batch:

acc <- collect_chunked(
  tbl(vtr),
  function(acc, chunk) {
    p <- chunk$presence == 1
    acc$n      <- acc$n      + nrow(chunk)
    acc$n_pres <- acc$n_pres + sum(p)
    acc$sum1   <- acc$sum1   + tapply(chunk$bio1, p, sum)[c("FALSE", "TRUE")]
    acc
  },
  .init = list(n = 0L, n_pres = 0L, sum1 = c("FALSE" = 0, "TRUE" = 0))
)
acc$n_pres / acc$n                      # prevalence

The same pattern accumulates the cross-products X'X and X'y behind a linear model, giving an exact least-squares fit in one pass. A generalized linear model needs more: each iteratively reweighted step reweights the rows by the current coefficient estimate, so the data must be read once per iteration.

An out-of-core GLM with chunk_feeder()

chunk_feeder() exposes the query as a resettable generator that biglm::bigglm() drives itself, re-reading the stream on every iteration. Because a vectra node is consumed as it streams, the feeder takes a factory: a function that returns a fresh node each time the stream is rewound.

src <- function() tbl(vtr) |> select(presence, bio1, bio12)

fit_stream <- biglm::bigglm(
  presence ~ bio1 + bio12,
  data = chunk_feeder(src),
  family = binomial()
)
round(coef(fit_stream), 3)

The streamed fit matches the in-memory glm() to within its convergence tolerance, while the engine never holds more than one batch of the predictor table at a time. The same call scales to a .vtr or CSV far larger than RAM: only the factory and the formula change.

bigglm() rebuilds the model matrix from each batch as it arrives. A formula term whose definition is estimated from the data, such as scale(), poly(), ns(), or bs(), is recomputed on every batch and takes a different value on each one. Prepare these terms before writing the table: standardize predictors as we did above, set explicit Boundary.knots on splines, and make sure every factor level appears in the stream. The formula then names prepared columns and each batch yields the same transform.

cat("> The out-of-core GLM example needs the **biglm** package, which is not",
    "installed, so it was not run here. Install it with",
    "`install.packages(\"biglm\")` to reproduce the streamed fit.\n")

unlink(c(tif, vtr))

Choosing a path

Extract with tiff_extract_points() in every case. After that:

If the predictor table fits in memory, collect() it and fit with glm(), mgcv::gam(), or any modelling function. This is the common case.
If a single-pass summary is all you need (class balance, sufficient statistics, an online mean), use collect_chunked().
If the table itself exceeds memory and you need a full GLM, feed it to biglm::bigglm() through chunk_feeder(). ```