
Scalable, spatiotemporal Tidy Arrays for R

Edzer Pebesma, Michael Sumner, Etienne Racine

Summary

A lot of spatiotemporal data takes the form of dense, multidimensional arrays. Examples are satellite imagery (e.g. energy measured by spectral band, pixel location, and time of collection), regular time series of raster maps such as climate or weather model output, and time series of attributes recorded for fixed features such as sensor locations.

Although such data can be represented in long tables, for larger datasets the array form is beneficial: it does not replicate dimension indexes, and it provides faster access because arrays are implicitly indexed. R's native arrays have a number of limitations: they are held entirely in memory, they only store homogeneous records of a single basic type, and they carry no spatial or temporal reference on their dimensions.
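
The difference is easy to see with a minimal base-R sketch; all names and values below are made up for illustration:

```r
# A minimal base-R sketch: the same 4 x 3 x 2 data cube as a dense array
# and as a long table; the long form repeats every dimension index on
# every row, while the array keeps indexes implicit.
a <- array(rnorm(24), dim = c(4, 3, 2),
           dimnames = list(x = paste0("x", 1:4), y = paste0("y", 1:3),
                           time = c("2001", "2002")))
a["x2", "y3", "2002"]                         # implicitly indexed access
long <- as.data.frame(as.table(a), responseName = "value")
head(long)                                    # x, y and time repeated per row
c(array = object.size(a), long = object.size(long))   # long form is larger
```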

This project will (i) implement a flexible and generic multidimensional array model for heterogeneous records that (ii) handles strong spatial and temporal referencing of array indexes, and (iii) scales from moderately sized in-memory data, to large on-disk data, and to massive data held on remote servers, while using a unified user interface that follows the tidy tools manifesto.

The Problem

How do we handle and analyze large amounts of spatially referenced time series data in R? How do we handle satellite imagery that does not fit on a local disc with R, or for which we need a small cluster to finish computation within acceptable time? How can we quickly and easily develop an analysis by testing it on a small portion of a spatiotemporal dataset before deploying it on the massive dataset? How can we use pipe-based workflows or dplyr verbs on such datasets? How can we visually explore high-dimensional raster data?
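
To make the last two questions concrete, the following sketch illustrates the kind of interface we have in mind; the package and function names (stars, read_stars, st_apply) and the demo file reflect the planned design and are assumptions, not a finished API:

```r
# A sketch of the envisioned tidy, pipe-based interface; package and
# function names are illustrative of the planned design.
library(stars)
library(dplyr)
f <- system.file("tif/L7_ETMs.tif", package = "stars")  # small demo scene
x <- read_stars(f)                   # six-band Landsat-7 subset as an array
x %>% slice(band, 6)                 # dplyr verb: select the sixth band
x %>% st_apply(c("x", "y"), mean)    # reduce over bands: per-pixel mean
```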

Today, people use R for large spatiotemporal data, but hit limits related to usability, user interface, and scalability. The r-sig-geo mailing list documents many of these cases.

Now that the simple features for R project has largely modernized the handling and analysis of vector data (points, lines, polygons) in R in a tidyverse-friendly fashion, it is time for raster data to catch up. This proposal targets spatiotemporal raster data, as well as time series of feature data.

Existing work

Base R supports in-memory, n-dimensional homogeneous arrays of basic types (double, integer, logical, character). Package ff supports out-of-memory data structures held on local disc, but offers no spatial or temporal references on dimensions.
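
A minimal sketch of the two; array sizes are arbitrary:

```r
# Base R: a dense, homogeneous, in-memory array.
a <- array(0, dim = c(1000, 1000, 10))
# ff: the same shape held out-of-memory, backed by a file on local disc.
# Neither structure records a spatial or temporal reference for its dimensions.
library(ff)
b <- ff(vmode = "double", dim = c(1000, 1000, 10))
filename(b)                          # path of the file backing the ff array
```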

Spatial packages include rgdal, which lets you read and write raster data, possibly piece-by-piece, in one of 142 different file formats. Package raster allows users to work with raster maps or stacks of them, where a stack can refer either to different bands (colors) or to different time steps, but not both. Package raster can iterate functions over large files on disc and takes care of the caching, using either rgdal or ncdf4. Another package for reading and writing NetCDF is RNetCDF. Packages that are more dedicated to a single data source or type include RStoolbox, MODIS, landsat, and hsdar; each of these relies on raster or rgdal for file-based I/O.
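
A short example of this current practice, with a hypothetical file name and an NDVI computation chosen only for illustration:

```r
# Current practice with the raster package: a multi-band scene is handled
# as a RasterBrick kept on disc, and band arithmetic (here an NDVI, assuming
# band 4 = NIR and band 3 = red) is processed in chunks when the data do
# not fit in memory.
library(raster)
b <- brick("landsat_scene.tif")                  # hypothetical file
ndvi <- (b[[4]] - b[[3]]) / (b[[4]] + b[[3]])    # per-pixel index
writeRaster(ndvi, "ndvi.tif", overwrite = TRUE)  # result written back to disc
```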

CRAN package spacetime provides heterogeneous records, using a data.frame for attributes. It keeps an index for each record to a spatial geometry (grid cell/point/polygon) and a time instant or period; it keeps all data in memory, and builds on xts for temporal and sp for spatial referencing.
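
A minimal sketch of that model, with made-up coordinates, times and attribute values:

```r
# spacetime's in-memory model: a full space-time layout (STFDF) combining an
# sp geometry, a time index and a data.frame holding one (possibly mixed-type)
# record per space-time combination.
library(sp)
library(spacetime)
pts <- SpatialPoints(cbind(lon = c(7.1, 7.6), lat = c(51.9, 52.0)))
tms <- as.POSIXct("2020-01-01", tz = "UTC") + 0:2 * 86400
obs <- data.frame(NO2 = rnorm(6), flagged = rep(c(TRUE, FALSE), 3))
stf <- STFDF(pts, tms, obs)          # 2 locations x 3 times = 6 records
```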

With support from the R Consortium, the Distributed Computing Working Group has started to develop an API for distributed computing; an initial version is available in package ddR. It aims at generic R data structures and works towards relieving users from having to worry about how data is distributed.

Relevant work outside R includes

Since downloading complete Earth observation archives is increasingly infeasible, we will have to work towards solutions where the data are accessed over a web service interface. Cloud services such as AWS are starting to give access to the large remote sensing imagery archives of, for example, the Landsat, MODIS and Sentinel satellites.
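
As an illustration of what such access can already look like, GDAL's /vsicurl virtual file system lets rgdal-based packages read parts of a remote image over HTTP; the URL below is a placeholder, not a real dataset location:

```r
# Reading a remote GeoTIFF over HTTP through GDAL's /vsicurl driver: only
# the header and the requested pixel blocks are transferred, so no full
# download is needed.
library(raster)
r <- raster("/vsicurl/https://example.com/imagery/scene_B04.tif")
crop(r, extent(500000, 510000, 5700000, 5710000))    # fetch a small window
```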

The Plan:

We will develop an R package and container infrastructure that implements the array model summarized above, with strong spatial and temporal referencing of array dimensions, a tidyverse-friendly user interface, and back-ends that scale from in-memory data, to data on local disc, to data held on remote servers.

We will document the software and provide tutorials and reproducible data analysis examples using locally downloaded imagery, as well as scalable examples accessing larger (> 1 TB) datasets using Docker container images.

We will document the RESTful API that connects the R client with the web service holding (and processing) the big Earth observation data.
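
A sketch of what such an exchange could look like from the R side; the endpoint, request fields and response structure are hypothetical, not a specified API:

```r
# Hypothetical RESTful exchange: the client posts a small processing request
# to the service holding the imagery and retrieves only the (small) result.
library(httr)
req <- list(collection = "sentinel-2",
            bbox       = c(7.0, 51.8, 7.7, 52.1),
            reducer    = "median")
res <- POST("https://example.org/eo-api/v1/jobs", body = req, encode = "json")
job <- content(res)       # e.g. a job id to poll, or the aggregated result
```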

We will also develop and discuss a migration path for the raster package (which has 43K lines of R, C and C++ code) and its functionality into the new infrastructure.

We will publish the resulting products in open access form in The R Journal, and also in a journal (or at a conference) directed more towards the Earth observation community.

Timeline:

Failure modes:

How Can The ISC Help:

We will use most of the funding to develop the R package and web service API. Total costs will be 10,000 USD, which break down into:

Dissemination:

We will regularly post blog entries about the project on r-spatial.org, use Twitter, post to r-sig-geo and Stack Overflow, and communicate through GitHub issues or Gitter discussions. The project will live on GitHub, in the r-spatial organisation. We will work under a permissive open source license, probably LGPL-2.1. Pull requests will be encouraged. R Consortium blog posts will be provided at the start and end of the project. Publications in the R Journal and other scientific outlets are foreseen.


