knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(1) has_biglm <- requireNamespace("biglm", quietly = TRUE) library(vectra)
A species distribution model relates occurrence records to environmental predictors. The predictors usually arrive as raster layers (climate, terrain, land cover), and the records as point coordinates. The first step of almost every workflow is the same: sample each raster at the occurrence coordinates to build a table of predictor values, then fit a model to that table.
vectra covers two parts of this pipeline directly. It reads GeoTIFF rasters
(including tiled and BigTIFF layouts) and samples them at point coordinates
with tiff_extract_points(), reading only the strips that contain query
points, so a sparse point set over a continental raster touches a small
fraction of the file. And it consumes a query one batch at a time, so a model
matrix that exceeds memory can still be fitted: chunk_feeder() streams the
data through biglm::bigglm() one chunk per iteration.
Two operations sit outside vectra's scope, and the rest of this article assumes they are handled before extraction:
sf or terra first.GeoTIFF
with tiff_crs() but does not transform coordinates. Occurrence points must
already be in the raster's CRS when they reach tiff_extract_points().tiff_extract_points() maps each coordinate to a pixel through the raster's
affine geotransform. It does this in the raster's own CRS, so points recorded
in a different CRS must be reprojected first. Read the raster CRS with
tiff_crs(), then transform the points to match:
# Occurrences recorded in lon/lat (EPSG:4326), raster in a projected CRS. target <- tiff_crs("predictors.tif")$epsg # e.g. 3035 (LAEA Europe) pts <- sf::st_as_sf(occ, coords = c("lon", "lat"), crs = 4326) pts <- sf::st_transform(pts, target) xy <- sf::st_coordinates(pts) env <- tiff_extract_points("predictors.tif", x = xy[, 1], y = xy[, 2])
When the points and the raster already share a CRS this step is unnecessary. The examples below work in lon/lat throughout.
We synthesize a small two-layer climate raster over Central Europe and write it
to a GeoTIFF with an EPSG:4326 GeoKey. Layer band1 stands in for mean annual
temperature and band2 for annual precipitation.
nx <- 80; ny <- 50 xmin <- 5; xmax <- 17; ymin <- 45; ymax <- 55 xres <- (xmax - xmin) / nx yres <- (ymax - ymin) / ny g <- expand.grid(col = 0:(nx - 1), row = 0:(ny - 1)) g$x <- xmin + (g$col + 0.5) * xres # pixel centres g$y <- ymax - (g$row + 0.5) * yres # row 1 is the northern edge g$band1 <- 14 - 0.6 * (g$y - 50) + 0.2 * (g$x - 11) + rnorm(nrow(g), 0, 0.3) g$band2 <- 750 + 25 * (g$y - 50) - 8 * (g$x - 11) + rnorm(nrow(g), 0, 10) tif <- tempfile(fileext = ".tif") write_tiff(g[, c("x", "y", "band1", "band2")], tif, crs = 4326L) tiff_crs(tif)
Now we sample the raster at a set of survey sites. tiff_extract_points()
returns one row per point with the input coordinates and one column per band.
Points outside the raster come back as NA.
N <- 1500 sites <- data.frame( x = runif(N, xmin + 0.3, xmax - 0.3), y = runif(N, ymin + 0.3, ymax - 0.3) ) env <- tiff_extract_points(tif, sites) head(env)
We standardize the two predictors once and store the prepared columns, then draw a presence/absence outcome from them. Standardizing before the data reaches the model lets every streaming batch share the same transform, which the out-of-core fit below relies on. With the generating coefficients known, each fit can be checked against them.
occ <- data.frame( bio1 = as.numeric(scale(env$band1)), # standardized once, then stored bio12 = as.numeric(scale(env$band2)) ) eta <- -0.2 + 1.3 * occ$bio1 - 0.9 * occ$bio12 occ$presence <- rbinom(N, 1, plogis(eta)) table(occ$presence)
For most studies the extracted table has thousands to a few million rows and a
handful of predictors, which fits in memory comfortably even when the source
rasters are enormous. The ordinary path is to extract, then fit with glm():
fit <- glm(presence ~ bio1 + bio12, data = occ, family = binomial()) round(coef(fit), 3)
The recovered coefficients sit close to the values used to generate the data
(intercept -0.2, bio1 1.3, bio12 -0.9).
Dense background sampling, many predictors, or a model fitted across a whole
continent can push the predictor table past available RAM. Here vectra streams
the table through the model in bounded pieces. We write the modelling frame to a
.vtr file to stand in for a table too large to hold in memory.
vtr <- tempfile(fileext = ".vtr") write_vtr(occ, vtr)
collect_chunked() folds a function over the query one batch at a time. It is
the right tool for any summary that accumulates in bounded space across the
data. Here we compute the class balance and per-class predictor means in a
single streaming pass, never holding more than one batch:
acc <- collect_chunked( tbl(vtr), function(acc, chunk) { p <- chunk$presence == 1 acc$n <- acc$n + nrow(chunk) acc$n_pres <- acc$n_pres + sum(p) acc$sum1 <- acc$sum1 + tapply(chunk$bio1, p, sum)[c("FALSE", "TRUE")] acc }, .init = list(n = 0L, n_pres = 0L, sum1 = c("FALSE" = 0, "TRUE" = 0)) ) acc$n_pres / acc$n # prevalence
The same pattern accumulates the cross-products X'X and X'y behind a linear
model, giving an exact least-squares fit in one pass. A generalized linear
model needs more: each iteratively reweighted step reweights the rows by the
current coefficient estimate, so the data must be read once per iteration.
chunk_feeder() exposes the query as a resettable generator that
biglm::bigglm() drives itself, re-reading the stream on every iteration.
Because a vectra node is consumed as it streams, the feeder takes a factory: a
function that returns a fresh node each time the stream is rewound.
src <- function() tbl(vtr) |> select(presence, bio1, bio12) fit_stream <- biglm::bigglm( presence ~ bio1 + bio12, data = chunk_feeder(src), family = binomial() ) round(coef(fit_stream), 3)
The streamed fit matches the in-memory glm() to within its convergence
tolerance, while the engine never holds more than one batch of the predictor
table at a time. The same call scales to a .vtr or CSV far larger than RAM:
only the factory and the formula change.
bigglm() rebuilds the model matrix from each batch as it arrives. A formula
term whose definition is estimated from the data, such as scale(), poly(),
ns(), or bs(), is recomputed on every batch and takes a different value on
each one. Prepare these terms before writing the table: standardize predictors
as we did above, set explicit Boundary.knots on splines, and make sure every
factor level appears in the stream. The formula then names prepared columns and
each batch yields the same transform.
cat("> The out-of-core GLM example needs the **biglm** package, which is not", "installed, so it was not run here. Install it with", "`install.packages(\"biglm\")` to reproduce the streamed fit.\n")
unlink(c(tif, vtr))
Extract with tiff_extract_points() in every case. After that:
collect() it and fit with glm(),
mgcv::gam(), or any modelling function. This is the common case.collect_chunked().biglm::bigglm() through chunk_feeder().
```Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.