1. Processing MODIS data with gdalcubes

ev = Sys.getenv("USER") == "marius"
  collapse = TRUE,
  eval = ev


The gdalcubes R package aims at making the work with large collections of Earth observation (EO) imagery (e.g. from Sentinel 2) easier and faster. Typical challenges with these data such as overlapping images, different spatial resolutions of spectral bands, irregular temporal sampling, and different spatial reference systems become invisible to users by reading the data as a raster data cube and letting users define the shape of the cube (spatiotemporal extent, resolution, and spatial reference system). Working with EO imagery then becomes more interactive: going from "try my method on low resolution and get the result asap" to "apply my final method to the full resolution dataset over night" becomes straightforward.

This brief vignette illustrates basic ideas of the package. We will use satellite imagary from the Moderate Resolution Imaging Spectroradiometer (MODIS) that is small enough to process even on older machines. The imagery comes as a set of HDF4 files. We assume that you have successfully installed the gdalcubes R package. Please also make sure that your GDAL installation supports the HDF4 driver (e.g. with gdalcubes_gdalformats()).

In the following, we will follow a simple workflow by

  1. indexing image files in an image collection,
  2. creating data cubes at various spatiotemporal resolutions and extents, and
  3. applying simple operations on data cubes including band selection, application of pixel-wise functions, reduction over time, and time series filtering.

Downloading the sample data

We will use 8-daily land surface temperature from the MODIS product MOD11A2, covering western Europe (tiles v=13,14, h=03,04) from January to September 2018. The zip archive has approximately 600 megabytes. The code below downloads and unzips the data to the current working directory.

dest_dir = tempdir()
download.file("https://uni-muenster.sciebo.de/s/eP9E6OIkQbXrmsY/download", destfile=file.path(dest_dir, "MOD11A2.zip"),mode = "wb")
unzip(file.path(dest_dir, "MOD11A2.zip"), exdir = file.path(dest_dir,"MOD11A2"))
unlink(file.path(dest_dir, "MOD11A2.zip"))

Creating an image collection

As a first step, we must combine the set of files for the first MODIS product (MOD11A2) in a single image collection. The image collection is a simple index, pointing to files and storing the spatial extent, spatial reference system, aquisition date/time of images and how files relate to image bands. The package comes with a set of predefined rules (called image collection formats), how this information can be extracted from filenames for selected EO products. A list of available collection formats including a short description can be printed with:


In this case, MxD11A2 is the correct format for our datasets. Internally, collection formats are defined in relatively simple JSON files, presets for other products will be added continuously.

To create the image collection, we must pass a list of our files and the collection format to the create_image_collection() function. The code below finds all files as character vector of GDAL datasets, which then can be passed as first argument to create_image_collection(). The second argument here referes to the image collection format and the third argument provides the name of the output image collection file (which is simply an SQLite database). The collection format also knows that bands are stored as subdatasets in the HDF files.

img_col_file = tempfile(fileext=".db")
files = list.files(file.path(dest_dir,"MOD11A2"), pattern=".hdf$", full.names = TRUE)
create_image_collection(files, "MxD11A2", img_col_file)

Creating a data cube

The created image collection has references to original images on disk and knows about the datetime, spatial extent, and coordinate reference system of images. The raster_cube() function creates a data cube from an image collection. This function expects up to three arguments:

  1. an image collection object as created above,
  2. a data cube view defining the spatiotemporal resolution and extent and the spatial reference system of the target cube, and
  3. the internal size of chunks how the cube is partitioned in memory

If no data cube view is provided, a default cube that covers the whole extent of the collection at low resolution is created. The chunk size defaults to (16 x 256 x 256) pixels in time, y, and x directions. Below, we create a data cube with a default cube view and print some basic information about its dimensions and bands.

x = raster_cube(image_collection(img_col_file))


Notice that raster_cube() will not run any computations besides deriving the shape of the output cube. Instead, it will return a proxy object that will not be evaluated until data must be actually read (e.g. when calling plot). This not only applies to data cubes from image collections but also for derived cubes (see further below). In most cases, however, users want to specify the extent and resolution manually. Above, the temporal resolution of the cube was 3 months whereas below, we define a custom data cube view with temporal resolution of one month.

srs="+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs"
MOD11A2.col = image_collection(img_col_file)
v = cube_view(srs=srs, extent=MOD11A2.col, nx = 400, dt="P1M", aggregation = "mean")
MOD11A2.cube = raster_cube(MOD11A2.col, v)

Aggregation and resampling

Besides the spatiotemporal extent, the resolution and the spatial reference system, the data cube view contains the two important parameters aggregation and resampling. Resampling here refers to how images are resampled in space during the warp operation. The temporal aggregation method defines how values for the same cell from different images are combined in the target cube. For example, a data cube with monthly temporal resolution will contain values from multiple images. Currently, possible values are first, last, min, max, mean, and median. These functions are evaluated per data cube pixel.

Data cube operations

The package comes with a few operations on data cubes to (i) select bands (select_bands()), (ii) apply pixel-wise arithmetic expressions (apply_pixel()), (iii) reduce data cubes over space and time (reduce_time(), reduce_space()), apply a moving-window function over time (window_time()), (iv) apply an R function over chunks of a data cube (chunk_apply()), and (v) join bands of identically shaped data cubes (join_bands()). All of these functions produce a proxy data cube, storing the shape of the result cube but not any data. They all take a (proxy) data cube as first argument and can be chained. The following code demonstrates some of the operations and how data cubes can be plotted.

MOD11A2.bandselect = select_bands(MOD11A2.cube, c("LST_DAY","LST_NIGHT"))
MOD11A2.daynight_difference = apply_pixel(MOD11A2.bandselect, "0.02*(LST_DAY-LST_NIGHT)",names = "LST_difference")
MOD11A2.reduce = reduce_time(MOD11A2.daynight_difference, "median(LST_difference)")
plot(MOD11A2.reduce, col=heat.colors, key.pos=1)

The result is the median day night surface temperature difference for all pixels between Jan and December 2018. Notice that no data is actually read until we call plot(), i.e. all operations again return proxy objects. Replacing the plot function with write_ncdf(MOD11A2.reduce, "test.nc") would write the result as a NetCDF file to disk.

Operations on data cubes are designed such that they can be used with the pipe operator. The following code for example derives monthly differences of land surface temperature, from June to September 2018. The window_time function here applies the simple time series kernel filter $T_t - T_{t-1}$.

library(magrittr) # get the pipe
v1 = cube_view(view=v, extent=list(t0="2018-06", t1="2018-09"))
raster_cube(image_collection(img_col_file), v1)  %>%
  select_bands(c("LST_DAY")) %>%
  apply_pixel("LST_DAY * 0.02") %>% # apply scale
  window_time(kernel=c(-1,1), window=c(1,0)) %>%
  plot(col=terrain.colors, key.pos=1, zlim=c(-25,25))


gdalcubes supports multithreaded evaluation of data cube operations. Calling gdalcubes_set_threads(n) will tell all following data cube operations to use up to n threads. Since worker threads are assigned to chunks of the data cube, the actual number of threads used is lower for cubes with less than n chunks.

Future work

This vignette presents a very simple example with a small dataset. Future vignettes will focus on (i) how user-defined R functions can be applied on data cubes, (ii) how to process larger datasets (Sentinel 2) with overlapping images from different spatial reference systems, and (iii) how to run gdalcubes in the cloud.

Try the gdalcubes package in your browser

Any scripts or data that you put into this service are public.

gdalcubes documentation built on July 29, 2021, 5:07 p.m.