knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this document you can find an outline of a standardised design for obtain operators and other "best practices". In case you want to write a new operator that is readily compatible with rasterTools, it is recommended that you stick to the design suggested here. I do my best to keep this document updated, but as the development of rasterTools and other packages progresses, one or the other thing here may be outdated. So don't be too frustrated if something does not work ad hoc; let's get in contact if you encounter major difficulties or need additional pointers.

Many of the code chunks of this documentation contain so-called pseudocode, i.e. placeholders or variables that would be replaced by concrete values in an actual function. Pseudocode elements are written in CAPITAL LETTERS and typically have a rather self-explanatory name.

rasterTools builds largely on the framework of the checkmate package so that arguments are properly tested before they are used. This makes the source code rather verbose but helps to maintain consistency.

Build the operator

tl;dr

First things first

This package is an attempt to standardise, unify and simplify spatial operations. Hence, the operators themselves should also be written in a standardised and modular way, which increases transparency by making the code easier to follow. All obtain operators follow the same basic design principle, sketched below.
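The following skeleton summarises that principle; it is merely a schematic outline (with pseudocode names in CAPITAL LETTERS), and each step is discussed in the sections below.

oDATASET <- function(mask, ARG){
  # 1. check the arguments (checkmate)
  # 2. build the mask in the dataset crs (geometr)
  # 3. determine the target tiles
  # 4. iterate through the tiles or arguments: load and crop the raster
  # 5. merge multiple tiles
  # 6. reproject to the crs of 'mask'
  # 7. postprocessing (e.g. a colourtable)
}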

Check arguments (checkmate)

Each obtain operator has at least the argument mask, but most likely also other arguments to handle distinct subsets of the dataseries, such as different years or products. We must first make sure that the arguments follow the intended input patterns. rasterTools accepts various spatial objects as mask, so we need to assert that we are dealing with a valid mask.

oDATASET <- function(mask, ARG){

  # mask may be a geometr 'geom', a sp 'Spatial*' or an sf object;
  # assert() passes when at least one of the tests evaluates to TRUE
  maskIsGeom <- testClass(mask, classes = "geom")
  maskIsSp <- testClass(mask, classes = "Spatial")
  maskIsSf <- testClass(mask, classes = "sf")
  assert(maskIsGeom, maskIsSp, maskIsSf)
  assertXYZ(ARG)  # pseudocode: the checkmate assertion suitable for ARG

  # the coordinate reference system in which this dataseries is distributed
  datasetCRS <- projs$longlat

The arguments (ARG) would typically reflect characteristics that are distinctive for the subsets of this dataseries. This is, for instance, the year in case of the Corine land-cover dataset, or the product, layer and period in case of a MODIS dataseries. All arguments must be properly tested before any further steps are carried out.
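For instance, an operator for the Corine land-cover dataset could test a years argument along the following lines; this is merely a sketch, and the argument name as well as the assumed value range (1990 to 2018) are hypothetical:

library(checkmate)

# a hypothetical 'years' argument of an oCLC()-like operator
years <- c(2000, 2006)
assertIntegerish(x = years, lower = 1990, upper = 2018, any.missing = FALSE)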

Use a standardised interface (geometr)

When a function is meant to accept as many input classes as possible, the code may become verbose quite quickly. geometr is an attempt to standardise and unify operations across all spatial classes. It does so by providing methods for generic tasks - such as extracting an extent (getExtent()) or setting the coordinate reference system (setCRS()) - and by strictly producing the same output for the task at hand across all classes. A code snippet that recurs in all obtain operators puts together the mask that is needed for a specific dataseries, described in the next section.

Since we have already assured that we have a valid mask, we can simply use functions of the geometr package to extract the relevant information for this operation, without having to transform spatial objects from one class to another. This makes intentions clear more quickly and hopefully helps to avoid bugs.
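As a minimal sketch of this interface, the following builds a small rectangular geom and queries it with the standardised getters; the same calls would work unchanged on a Spatial* or sf object:

library(geometr)
library(tibble)

# a small rectangular geom serving as a stand-in for 'mask'
aGeom <- gs_rectangle(anchor = tibble(x = c(0, 10), y = c(0, 5)))

getExtent(x = aGeom)  # a tibble of the minimum/maximum values in x and y
getTable(x = aGeom)   # the table of the object, in the same layout for all classes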

Handle projection

We must handle the projection of the involved files to avoid accidentally building on spatial objects with the wrong coordinate reference system. Controlling the projection is crucial, as a problem with the projection of one out of many files may already lead to errors that are hard to trace back.

The object given in mask provides the target projection the user wants to work with, so we take targetCRS from this mask. Moreover, the mask object with which we will extract from the dataseries (maskGeom) needs to be built in the dataseries CRS, together with the extent valid in that CRS (targetExtent).

  thisCRS <- ...

  # transform the crs of the mask into the dataset crs
  targetCRS <- getCRS(x = mask)
  maskExtent <- getExtent(x = mask)
  if(is.na(targetCRS) || targetCRS != thisCRS){
    targetMask <- setCRS(x = mask, crs = thisCRS)
  } else{
    targetMask <- mask
  }

  # a rectangular geom around the (possibly transformed) mask, in the dataset crs
  maskGeom <- gs_rectangle(anchor = getExtent(x = targetMask))
  maskGeom <- setCRS(x = maskGeom, crs = thisCRS)
  targetExtent <- getExtent(x = maskGeom)

Determine target tiles

Tiles are a set of regularly arranged neighbouring rectangles that represent different "windows" by which a large spatial dataset is subdivided. This approach is chosen by many providers of gridded dataseries to make files available in chunks that are small enough to be handled easily. geometr comes with the function gs_tiles(), which lets you outline tiles that represent the tiling of the raster datasets. We first outline the overall extent of the tiles, i.e. their minimum and maximum values in the x and y dimensions. Then we specify the number of cells/rectangles and the projection:

  # 36 x 14 cells over this window, i.e. 10-degree tiles between
  # 180W-180E and 60S-80N
  aWindow <- tibble(x = c(-180, 180),
                    y = c(-60, 80))
  datasetTiles <- gs_tiles(window = aWindow, cells = c(36, 14), crs = datasetCRS)

From the object datasetTiles we need to determine the subset of rectangles in which the data we are interested in are to be found (this is not solved ideally at the moment and might change in the future):

  tabTiles <- getTable(x = datasetTiles)
  tabMask <- getTable(x = targetMask)  # the coordinates must be in the dataset crs
  ids <- unique(tabTiles$id)

  # test for each tile whether it overlaps the mask in the x and the y dimension
  xMatch <- yMatch <- NULL
  for(i in seq_along(ids)){
    temp <- tabTiles[tabTiles$id == ids[i],]
    xMatch <- c(xMatch, any(tabMask$x < max(temp$x)) & any(tabMask$x > min(temp$x)))
    yMatch <- c(yMatch, any(tabMask$y < max(temp$y)) & any(tabMask$y > min(temp$y)))
  }
  tiles <- xMatch & yMatch
  myTiles <- getSubset(x = datasetTiles, tabTiles$id %in% ids[tiles])

Iterate through the tiles or arguments

Once we have determined the spatial subset of the data we are interested in, we have to find the other subsets, for example according to the temporal extent or to other properties of the dataset. When a dataset is stored in a tiled way, we have to write code that derives the name under which the tile of our interest is stored. To assist in this, we could for instance create an id variable for the object datasetTiles that contains the systematic names, or components thereof. Other name components could perhaps be derived from the respective arguments based on which we want to subset. The oGFC() function handles this in the following way:

  tabTiles <- getCoords(x = myTiles)

  for (i in unique(tabTiles$fid)){
    # the top-left corner of the current tile ...
    min_x <- min(tabTiles$x[tabTiles$fid == i])
    max_y <- max(tabTiles$y[tabTiles$fid == i])

    # ... from which the systematic grid ID is derived, e.g. '50N_010E'
    if(min_x < 0){
      easting <- paste0(sprintf('%03i', abs(min_x)), 'W')
    } else{
      easting <- paste0(sprintf('%03i', min_x), 'E')
    }
    if(max_y < 0){
      northing <- paste0(sprintf('%02i', abs(max_y)), 'S')
    } else{
      northing <- paste0(sprintf('%02i', max_y), 'N')
    }
    gridId <- paste0(northing, '_', easting)
    # 'layerNames' would have been derived from the operator's other arguments
    fileNames <- paste0("Hansen_GFC2015_", layerNames, "_", gridId, '.tif')

    ...

  }

Load and crop raster

wip

As some of the spatial operations may take a long time, rasterTools is designed so that it always gives feedback about what it is doing at the moment. For that purpose the message() function is employed throughout, in combination with paste0() (a slightly more efficient wrapper of paste() with sep = '' by default). The feedback should of course not overload the user but be informative, so be brief and concise. Again, the oGFC() function handles this in the following way.

    message(paste0("I am handling the gfc datasets with the grid ID '", gridId, "':\n"))
    tempObject <- stack(loadData(files = fileNames, dataset = "gfc"))

Feedback for an action should always come directly before the action is carried out. This assures that no other, maybe time-costly, operation interferes with it. The function loadData() is comparable to the four core functions in that it is a wrapper that calls other functions depending on its input. In the case of the GFC dataset, the spatial data are stored in the tif format. loadData() calls the file-type specific loaders, for instance load_tif(), which contains all the code logic that is required to load the respective format. In case you want to handle a format for which such a function is not yet defined, see below.
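To illustrate the dispatch idea (and nothing more), loadData() could be sketched as follows; the actual function determines files and formats far more carefully than this:

# a simplified sketch; load_tif() and load_shp() are the format-specific
# loaders that are described in the section on loaders below
loadData <- function(files, dataset, layer = NULL){
  lapply(files, function(aFile){
    if(grepl("[.]tif$", aFile)){
      load_tif(path = aFile)
    } else if(grepl("[.]shp$", aFile)){
      load_shp(path = aFile, layer = layer)
    } else{
      stop("no loader has been defined for this file format yet")
    }
  })
}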

In most cases you want to crop the overall dataset to mask, or rather to the extent of mask (targetExtent). The smaller the spatial objects, the faster computations involving them can be carried out. It is no problem to overwrite an object with a cropped version of itself, since the original object usually has no purpose other than selecting a subset from it. Overwriting these potentially large spatial objects is thus a memory-friendly option of which you should take advantage.

    # targetExtent has been derived from maskGeom above
    message("  ... cropping to targeted study area.\n")
    tempObject <- crop(tempObject, targetExtent, snap = "out", datatype = 'INT1U', format = 'GTiff', options = "COMPRESS=LZW")

wip: to handle large datasets, loading them into the R environment is not wise, as it takes a lot of time. I am at the moment outlining a standardised workflow, mostly based on gdalUtils, to ease these problems, which I will include here as soon as it's done.

Merge multiple tiles

wip


Reproject

As a final step of the operations on the spatial data you should reproject the output. If a user wants to include various datasets in one analysis, all of them should have the same projection. rasterTools assumes that this is the projection of the initial mask. Hence, tempObject should be reprojected to targetCRS.

    # tempObject is still in the dataset crs; reproject if the crs of the
    # initial mask differs from it
    if(getCRS(x = tempObject) != targetCRS){
      crs_name <- strsplit(targetCRS, " ")[[1]][1]
      message(paste0("  ... reprojecting to '", crs_name, "'."))
      tempObject <- setCRS(x = tempObject, crs = targetCRS)
      # crop again, as reprojecting may have padded the object with new cells
      tempObject <- crop(tempObject, maskExtent, snap = "out", datatype = 'INT1U', format = 'GTiff', options = "COMPRESS=LZW")
      # record the operation in the provenance information of the output
      history <- c(history, list(paste0("object has been reprojected to ", crs_name)))
    }

Postprocessing

wip

Datasets often have a typical colour pattern that is worth maintaining. raster objects have a colourtable (see raster::colortable()), in which we can store a set of colour values that shall be used to visualise the raster.
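A minimal sketch, with hypothetical colour values, of how such a colourtable could be assigned:

library(raster)

# a small raster with three classes and made-up class colours
aRaster <- raster(matrix(sample(0:2, 25, replace = TRUE), nrow = 5))
colortable(aRaster) <- c("#0000FF", "#00A600", "#E6E600")
colortable(aRaster)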


Loading the data into R {#loaders}

loadData() is designed in a modular way to provide flexibility, in case other dataset formats become important in the (near) future. The function determines the files that should be loaded based on what is specified in its arguments, and it contains all the logic that is needed to handle a file without the user having to determine the actual format. All the logic that depends on the structure of files of a given format, or on other requirements that come with the format, is outsourced to the respective load* and download* functions. These functions can, however, be very slim wrappers around the function that actually loads the format. In the end this means that virtually any file format that can be loaded into R can be handled by loadData(), given the respective wrapper has been defined.

load*

Each load* function should take an argument path, the exact location of the file to load. Optionally, for instance when the file is saved in the shp format, the argument layer declares further details.

One of the simplest functions for loadData() is the one for loading shape-files:

load_shp <- function(path, layer){
  rgdal::readOGR(dsn = path,
                 layer = layer,
                 verbose = FALSE)
}

The simplest 'template' of such a function would hence be:

load_FORMAT <- function(path, layer){
  rgdal::readOGR(dsn = path,
                 layer = layer,
                 verbose = FALSE)
}

Building on rgdal::readOGR() makes defining new loaders for loadData() a breeze. But you can of course also use other functions where this is more efficient, or required because readOGR() doesn't support the format.

It is important to note that neither the functions called by loadData() nor loadData() itself handle the coordinate reference system; this is carried out by the obtain operators. The load* function simply loads what is found in path (with layer) and makes it available to loadData(), which turns the information into the respective output format.
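A minimal sketch of this division of labour, with pseudocode placeholders for the file and dataset names: the loader merely reads the file, the obtain operator takes care of the coordinate reference system afterwards.

# inside an obtain operator: load first, handle the crs afterwards
tempObject <- loadData(files = "SOME_FILE.shp", dataset = "SOME_DATASET")
tempObject <- setCRS(x = tempObject, crs = targetCRS)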

download*

wip

Be informative

rasterTools intends to be as transparent as possible, and this entails that error messages or relevant warnings are easy to understand and appear whenever a problem occurs. This is where the checkmate package comes in, which rasterTools utilises heavily. Thus, many of the recommendations with regard to error management derive from checkmate's directives.

Also, since several of the functions here can be quite time-consuming, a message that informs the user about the currently processed step can be helpful. This includes a progress bar, which indicates the progress of an iterative computation. For example, the catalog() function builds an index based on all the files that are found in a certain directory. This index can be useful when a project deals with a large number of files in some directory (for instance in "./myProject/myDataset/aLargeNumberOfFiles.tif") that need to be accessed arbitrarily, i.e. not all of them at a time, or only a subset according to the current conditions. Hence, it might not be required to load all the files into the global environment of R; instead it would be useful to provide a simple key according to which the files can be loaded. Think, for example, of all the different products, tiles, temporal extents and layers of the MODIS dataset. The file names of the respective files are a long and cumbersome combination of all this information and very time-consuming to type; sure, tab-completion is your friend, but with these files you still need to tab through several "levels" of file specification, so to speak.

Anyway, while catalog() is one example of a function that makes rasterTools more informative, it utilises the txtProgressBar() function in the following way:

    FILES <- list.files(PATH)
    pb <- txtProgressBar(min = 0, max = length(FILES), style = 3, char = ">", width = getOption("width") - 14)
    for(i in seq_along(FILES)){

      # store the name and an abbreviation of each file to a data-frame, then...
      setTxtProgressBar(pb, i)

    }
    close(pb)

This informs the user about the progress of the function and should be helpful especially when many files need to be accessed and the whole procedure naturally takes a while. If your new function also iterates through many files, it is recommended that you employ the same specifications for any progress bar you may want to use.

Moreover, if computations are employed that take a certain while, for instance loading a large raster into the global environment or reprojecting a raster, it may be useful to inform the user about these steps too. This is frequently done in the obtain operators. Here, the message() function is used. This function has the advantage over cat() that it can be integrated with a translation framework, which would provide the messages in the language in which rasterTools is expected to communicate with the user (not supported yet). Hence, it is recommended that you also use this function, so that your function seamlessly blends in with this framework.
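As a small sketch of the difference: messages go to a separate connection and can be suppressed or diverted as a whole, which cat() output cannot.

message("  ... loading the data.")                 # shown to the user
suppressMessages(message("  ... loading data."))   # silently discarded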

Create a bibliography entry

rasterTools has the reference() function, which helps the user to put together the correct bibliography when reporting results in a publication. If your function is based on work that has been defined by somebody else, it should report the respective reference. R comes with the bibentry() function, which lets you define the reference(s) of your function.

bib <- bibentry(bibtype = "BIBTYPE",
                title = "TITLE",
                author = person("AUTHOR"),
                year = YEAR,
                ...
)

Then, your function should first check whether a bibliography already exists in the options of the current session and, in case it is not found, create it. If a bibliography object has already been created, it needs to be checked whether the recent reference is already included; if this is not the case, the reference is concatenated to it.

if(is.null(getOption("bibliography"))){
  options(bibliography = bib)
} else{
  currentBib <- getOption("bibliography")
  if(!bib %in% currentBib){
    options(bibliography = c(currentBib, bib))
  }
}

This will ensure that the dataset your function is based upon is properly included when it is used in a computation that should be published.
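At the end of a session, the collected bibliography can then be retrieved from the options, for example like this:

# print the references that were collected during the session
myBibliography <- getOption("bibliography")
print(myBibliography, style = "text")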


