LAScatalog processing engine

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 2.5,
  fig.height = 2.5,
  dev.args = list(pointsize = 9)
)
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now)
      # return a character string to show the time
      #if (res > 0.1)
      #paste("<br/>========================<br/>Time for this code chunk ", options$label, " to run:", round(res,2), "<br/>========================<br/>")
    }
  }
}))
knitr::opts_chunk$set(time_it = TRUE)
#rgl::setupKnitr()
options(rmarkdown.html_vignette.check_title = FALSE)

library(lidR)
data = structure(list(Max.X = c(332099.99, 333600, 335099.99, 336217.52, 
332099.99, 333599.99, 335099.99, 336368.67, 332099.99, 333599.99, 
335100, 336217.52), Min.X = c(331016.91, 332100.01, 333600.01, 
335100, 331016.91, 332100, 333600, 335100, 331016.92, 332100.01, 
333600.01, 335100.01), Max.Y = c(5529993.99, 5529993.99, 5529993.99, 
5529993.99, 5528399.99, 5528399.99, 5528399.99, 5528399.99, 5526399.98, 
5526399.96, 5526399.99, 5526399.99), Min.Y = c(5528400, 5528400, 
5528400, 5528400, 5526400, 5526400, 5526400, 5526400, 5524793.5, 
5524793.5, 5524800.38, 5524793.5), Max.Z = c(53.53, 47.59, 48.66, 
49.36, 46.13, 48.16, 50.51, 50.86, 45, 74.18, 52.56, 49.33), 
    Min.Z = c(-15.95, -7.87, -3.55, -14.96, -5.94, -11.15, -5.11, 
    -4.12, -9.63, -8.27, -35.88, -20.59), filename = c("folder/file1.las", 
    "folder/file2.las", "folder/file3.las", "folder/file4.las", 
    "folder/file5.las", "folder/file6.las", "folder/file7.las", 
    "folder/file8.las", "folder/file9.las", "folder/file10.las", 
    "folder/file11.las", "folder/file12.las")), row.names = c(NA, 
-12L), class = "data.frame")

geom <- lapply(1:nrow(data), function(i)
{
  mtx <- matrix(c(data$Min.X[i], data$Max.X[i], data$Min.Y[i], data$Max.Y[i])[c(1, 1, 2, 2, 1, 3, 4, 4, 3, 3)], ncol = 2)
  sf::st_polygon(list(mtx))
})

geom <-sf::st_sfc(geom)
sf::st_crs(geom) <- 26917
data <- sf::st_set_geometry(data, geom)

ctg       <- new("LAScatalog")
ctg@data  <- data

In lidR the LAScatalog processing engine refers to the function catalog_apply(). This function is the core of the package and drives, internally, every single other function that is capable of processing a LAScatalog, including clip_roi(), locate_trees(), rasterize_terrain(), decimate_points() and many others as well as some experimental functions in third party packages such as lidRplugins. This engine is powerful and versatile but also relatively hard to understand for new users and especially for R beginners. This vignette documents how it works by going deeper and deeper inside the engine. It is highly recommended to read the vignette named LAScatalog formal class before this one even if one may find some overlap between these two vignettes.

Overview

When processing a LAScatalog the area covered is split into chunks. A chunk can be seen as a square region of interest (ROI) and altogether the chunks cover the collection of files. The collection of files is not physically split, rather the chunks correspond to a virtual splitting of the coverage. Then the chunk are processed sequentially (one after the other) or in parallel but always independently. To process each chunk, the corresponding point cloud is extracted from the collection of files and loaded into memory. Any function can be applied on these independent point clouds and the independent outputs are stored in a list. The list contains one output per chunk and once each chunk is processed the list is collapsed into a single continuous valid object such as a raster or a spatial vector. The roles of the catalog engine are to:

When using the LAScatalog engine, users only need to think about which function they want to apply over their coverage, all the above-mentioned features being managed internally. There are many possible processing tuning tools and this is why one may feel lost by all the options to consider. To simplify, we can make two categories of tools:

  1. High-level API are lidR functions that perform a given operation either on a LAS or LAScatalog object transparently in a straightforward way. For example, pixel_metrics(), rasterize_terrain(), and locate_trees() are high-level API. Rule of thumb: if the function catalog_apply() is not directly used it is a high-level API. Processing options can be tuned with functions that start with opt_ (for option).
  2. Low-level API is the function catalog_apply() itself. This function is designed to build new high-level applications and is used internally by all the lidR functions but can also be used to build a user's own tools. Options can be tuned with the parameter .options of catalog_apply().

In the following vignette we will discuss first the high-level API then the low-level API. The variables named ctg* will refer to a LAScatalog object in subsequent code.

High-level API

Control of the chunk size

The catalog takes care of making chunks and users can define the size of the chunks. By default this size is set to 0 meaning that a chunk is a file and thus each file will be processed sequentially. The chunk size is not the most important option. It is mainly intended to be used with small configuration computers to avoid loading too many points at once. Reducing or increasing the chunk size does not modify the output but it reduces the memory used by reducing the quantity of points loaded. It is recommended to set this option to 0, but sometimes it is a good idea to set it to a small size to process particularly large files.

opt_chunk_buffer(ctg) <- 0
opt_chunk_size(ctg) <- 0 # Processing by files
plot(ctg, chunk = TRUE)

opt_chunk_size(ctg) <- 1000 # Processing chunks of 1000 x 1000
plot(ctg, chunk = TRUE)

Control of the chunk buffer

Each chunk is loaded with a buffer around it to ensure that independent clouds will not be affected by edge effects. For example, when computing a digital terrain model if a buffer is not loaded, the terrain is weakly computed at the edge of the point cloud because of the absence of context in the neighbourhood. The chunk buffer size is the most important option. A buffer that is too small can create incorrect outputs. The default is 30 m, which is likely to be appropriate for most use cases.

opt_chunk_size(ctg) <- 0
opt_chunk_buffer(ctg) <- 0 # No buffer
plot(ctg, chunk = TRUE)

opt_chunk_buffer(ctg) <- 200 # 200 m buffer
plot(ctg, chunk = TRUE)

lidR functions always check if an appropriate buffer is set. For example, it is impossible to apply rasterize_terrain() with no buffer.

opt_chunk_buffer(ctg) <- 0
rasterize_terrain(ctg, 1, tin())

Control of the chunk alignment

In some cases it might be suitable to control the alignment of the chunks to force a given pattern. This is a rare use case but the engine supports such a possibility. This option is obviously meaningless when processing by file. The chunk alignment is not the most important option and does not modify the output, but it may generate more, or fewer, chunks depending on the alignment. However, it might be very useful in the particular case of the function catalog_retile(), for example, to accurately control the new tiling pattern.

opt_chunk_size(ctg) <- 2000
opt_chunk_buffer(ctg) <- 0
plot(ctg, chunk = TRUE)

opt_chunk_size(ctg) <- 2000
opt_chunk_buffer(ctg) <- 0
opt_chunk_alignment(ctg) <- c(1000, 1000)
plot(ctg, chunk = TRUE)

clip_roi() is the single case where the control of the chunk is not respected. clip_roi() aims to extract a shape (rectangle, disc, polygon) as a single entity . The chunk pattern does not make sense here.

Filter points

When reading a single file with readLAS() the option filter allows for discarding some points based on criteria. For example -keep_first keeps only first returns and discards all non-first returns. It is important to understand that the discarded points are discarded while reading and they will never be loaded in memory. This is useful for processing only some points of interest, for example when computing metrics only on first returns above 2 m.

# Read all the points
readLAS("file.las")

# Only first returns above 2 m are loaded
readLAS("file.las", filter = "-drop_first -drop_z_below 2")

This works only when the readLAS() function is explicitly called. When using the catalog processing engine, readLAS() is called internally under the hood and users cannot explicitly call the filter argument. filter is propagated with the opt_filter() function.

opt_filter(ctg) <- "-keep_first -drop_z_below 2"

Internally, for each chunk, readLAS() will be called with filter = "-drop_first -drop_z_below 2 in every function that is processing a LAScatalog unless the documentation explicitly mentions something else. In the following examples clip_roi() is used to extract a plot but only the points classified as ground or water will be read and pixel_metrics() is used to compute metrics only on first returns above 2 meters. It does not mean other points do not exist, they are simply not read.

opt_filter(ctg) <- "-keep_class 2 9"
las <- clip_circle(ctg, 273400, 5274500, 20)

opt_filter(ctg) <- "-keep_first -drop_z_below 2"
m <- pixel_metrics(ctg, .stdmetrics, 20)
#LASfile <- system.file("extdata", "Topography.laz", package="lidR")
#ctg = readLAScatalog(LASfile)
#opt_progress(ctg) <- FALSE
#opt_filter(ctg) <- "-keep_class 2 9"
#las = clip_circle(ctg, 273500, 5274500, 40)
#m = structure(c(0.921, -0.146, 0.362, 0, 0.386, 0.482, -0.787, 0, 
#-0.06, 0.864, 0.5, 0, 0, 0, 0, 1), .Dim = c(4L, 4L))
#plot(las)
#rgl::view3d(fov = 50, userMatrix = m)

All filters are not necessarily appropriate everywhere. For example, the following is meaningless because it discards all the points that are used to compute a DTM.

opt_filter(ctg) <- "-drop_class 2 9"
rasterize_terrain(ctg, 1, tin())
#> Error: No ground points found. Impossible to compute a DTM.

Select attributes

When reading a single file with readLAS() the option select allows for discarding some attributes of the points. For example, select = "ir" keeps only the intensity and the return number of each point and discards all the other attributes, such as the scan angle, the flags or the classification. Its role is only to save processing memory by not loading data that are not needed. Similarly to opt_filter(), the function opt_select() allows propagation of the argument select to readLAS().

# Load Intensity only
opt_select(ctg) <- "i"
las <- clip_circle(ctg, 273500, 5274500, 10)

# Load Intensity + Classification + ReturnNumber + NumberOfReturns
opt_select(ctg) <- "icrn"
las <- clip_circle(ctg, 273500, 5274500, 10)

However, this option is not always respected because many lidR functions already know which optimization they should apply. In the following example the 'classification' attribute is explicitly discarded, yet the creation of a DTM works because the function overwrites the user settings for something better (in this specific case xyzc).

opt_select(ctg) <- "* -c"
dtm <- rasterize_terrain(ctg, 1, tin())

Write independent chunks on disk

By default every function returns a continuous output within a single R object stored in memory so it is immediately usable in the working environment in a straightforward way. For example, a sf/sfc_POINT for locate_trees().

LASfile <- system.file("extdata", "MixedConifer.laz", package="lidR")
ctg2 <- readLAScatalog(LASfile)
opt_progress(ctg2) <- FALSE
opt_chunk_size(ctg2) <- 100
# Find the trees
trees <- locate_trees(ctg2, lmf(3))

# The trees are immediately usable in subsequent analyses
trees
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension:     XYZ
#> Bounding box:  xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N

However in some case this behaviour might not be suitable especially for a large collection of files that cover a broad area. In this case the computer might not be able to handle so much data and/or will run into trouble when merging the different independent chunks into a single object. This is why the engine has the capability to write on disk the output of each independent chunk. The function opt_output_files() can be used to set a path on disk where the output of each chunk should be saved on disk. This path is templated so the engine is able to create a different file name for each chunk. The general form is the following:

opt_output_files(ctg) <- "/folder/to/save/filename_{TEMPLATE1}_{TEMPLATE2}...{TEMPLATEN}"

The templates can be XCENTER, YCENTER, XLEFT, YBOTTOM, XRIGHT, YTOP, ID or ORIGINALFILENAME. The templated string does not contain the file extension. The engine guesses the extension and it works no matter the output. For example:

# Force the results to be written on disk
opt_output_files(ctg2) <- paste0(tempdir(), "/tree_coordinate_{XLEFT}_{YBOTTOM}")
trees <- locate_trees(ctg2, lmf(3))

# The output has been modified by these options and it now gives
# the paths to the written files (here shapefiles)
trees
#> "/tmp/RtmpJQHPNz/tree_coordinate_481200_3812900.shp" "/tmp/RtmpJQHPNz/tree_coordinate_481300_3812900.shp" "/tmp/RtmpJQHPNz/tree_coordinate_481200_3813000.shp"
#> [4] "/tmp/RtmpJQHPNz/tree_coordinate_481300_3813000.shp"

There is one file per chunk and thus processing by file with the template {ORIGINALFILENAME} or its shortcut {*} may be suitable to further match the output files with the point cloud files. However, depending on the size of the file and the capacity of the computer it is not always possible (see section "Control of the chunk size").

In the previous example, several shapefiles were written on disk and there is no way to combine them into a single light R object. But sometimes it is possible to return a light object that aggregates all the written files. For rasters, for example, it is possible to build a virtual raster mosaic. The engine automatically does this when it is possible.

# Force the results to be written on disk
opt_output_files(ctg2) <- paste0(tempdir(), "/tree_coordinate_{XLEFT}_{YBOTTOM}")
chm <- rasterize_canopy(ctg2, 1, p2r())

# Many rasters have been written on disk
# but a light raster has been returned anyway
chm
#> class      : RasterLayer 
#> dimensions : 90, 90, 8100  (nrow, ncol, ncell)
#> resolution : 1, 1  (x, y)
#> extent     : 481260, 481350, 3812921, 3813011  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=12 +datum=NAD83 +units=m +no_defs 
#> source     : /tmp/RtmpZVJ2hy/rasterize_canopy.vrt 
#> names      : tree_coordinate_481260_3812921 
#> values     : 0, 32.07  (min, max)

There are two special cases. The first one is when raster files are written. They are merged into a valid RasterLayer using a virtual mosaic. The second one is when LAS or LAZ files are written on disk. They are merged into a LAScatalog.

opt_output_files(ctg2) <- "{tempdir()}/plot_{ID}"
new_ctg <- clip_circle(ctg2, x, y, 20)
new_ctg
#> class       : LAScatalog (v1.2 format 0)
#> extent      : 32.372, 163.136, 38.494, 198.636 (xmin, xmax, ymin, ymax)
#> coord. ref. : NAD83 / UTM zone 17N 
#> area        : 3895.031 m²
#> points      : 44 points
#> density     : 8 points/m²
#> num. files  : 4 

Modification of the file format

By default, rasters are written in GeoTiff files, spatial points, spatial polygons and spatial lines either in sp or sf formats are written in ESRI shapefiles, point-clouds are written in las files, and table are written in csv files. This can be modified at any time by users but it corresponds to advanced settings and thus this section is discussed in a later section about advanced usages. However, the function opt_laz_compression() is a simple shortcut to switch from las to laz when writing a point cloud.

opt_laz_compression(ctg) <- TRUE

Progress estimation

The engine provides a real time display of the progression that serves two purposes: (a) see the progression and (b) monitor troubleshooting. It is enabled by default and when a LAScatalog is processing a chart is displayed. The chunks are progressively coloured. No colour: the chunk is pending. Blue: the chunk is processing. Green: the chunk has been processed. Orange: the chunk has been processed with a warning. Red: the chunk processing failed.

opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:5){
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

bbox <- cl[[6]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "cornflowerblue")

This can be disabled with:

opt_progress(ctg) <- FALSE

Error handling

If a chunks produced a warning this is rendered in real time with an orange colouring. However, the message(s) of the warning(s) are delayed and printed only at the end of the computation.

opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:6){
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

bbox <- cl[[7]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "orange")

bbox <- cl[[8]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "cornflowerblue")

If a chunk produced an error this is also rendered in real time. The computation stops as usual but the error is handled in such a way that the code does not actually fail. The functions returned a partial output i.e. the output of the tiles that were computed successfully. A message is printed telling the user what the error was and where it occurred, and suggests to load this specific section of the catalog to test what was wrong with it. In the following case one tile was not classified so it failed.

#> An error occurred when processing the chunk 9. Try to load this chunk with:
#>  chunk <- readRDS("/tmp/RtmpTozKLu/chunk9.rds")
#>  las <- readLAS(chunk)
#> No ground point found.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:8){
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

bbox <- cl[[7]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "orange")

bbox <- cl[[9]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "red")

The engine logged the chunk and it is easy to load this specific processing region for further investigation by copy pasting the mentioned code.

chunk <- readRDS("/tmp/RtmpTozKLu/chunk9.rds")
las <- readLAS(chunk)

It is possible to restart the computation at a given chunk to produce the missing parts. In the previous example the chunk number 9 failed so we can restart at 9

opt_restart(ctg) <- 9
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
opt_restart(ctg) <- 9
cl <- engine_chunks(ctg)
for (i in 1:4){
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

The engine is also able to bypass the error. This can be activated with opt_stop_early() and the computation will run until the end in any case. In this case chunks that failed will be missing and the output will contain holes with missing data. We do not recommend the use of this option because other errors will be bypassed as well without triggering any informative message. This option exists and can be useful if used carefully. Users should always try to fix any problems.

opt_stop_early(ctg) <- FALSE
dtm <- rasterize_terrain(ctg, 1, tin())
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
opt_restart(ctg) <- 1
cl <- engine_chunks(ctg)
for (i in 1:length(cl)){
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

bbox <- cl[[7]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "orange")

bbox <- cl[[9]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "red")

Empty chunks

Sometimes some chunks are empty and we discover this only when loading the region of interest. This may happen when the chunk pattern is different from the tiling pattern because some chunks may have been created but they do not actually encompass any points. This is the case when the file is only partially populated because the bounding box of the file/tile is bigger than its actual contents. This often happens on the edge of the collection. In this case the engine displays the chunks in gray. This may also happen if filter is set in such a way that a lot of points are discarded and in some chunks all the points appear to be discarded. Those chunk are displayed in gray.

opt_chunk_size(ctg) <- 400
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:50){
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

bbox <- cl[[1]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

bbox <- cl[[2]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

bbox <- cl[[3]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

bbox <- cl[[14]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

bbox <- cl[[15]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

bbox <- cl[[16]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

bbox <- cl[[29]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")

Parallel processing

The engine takes care of processing the chunks either sequentially or in parallel. The parallelization can be done with a single machine or on multiple machines by sending the chunks to different workers. The parallelization is performed using the framework given by the package future. Understanding the basics of this package is thus required. To activate a paralellization strategy users need to register a strategy with future. For example:

library(future)
plan(multissesion, workers = 4L)

Now 4 chunks will be processed at a time and the engine monitoring displays the processing chunks in real time:

opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:6) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}

for (i in 7:11) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "cornflowerblue")
}

However, parallel processing is not magic. First of all it loads several point clouds at one time because several chunks are read at one time. Users must ensure they have enough RAM to support that. Second, there are strong overheads incurred when parallelizing tasks. Putting all cores on a task does not always make it faster!

Overlapping tiles

In lidR if some files are overlapping each other, a message is printed to alert users about potential problems.

ctg <- readLAScatalog("path/to/folder/")
#> Be careful, some tiles seem to overlap each other. lidR may return incorrect outputs with edge artifacts when processing this catalog.

Overlapping is not an issue by itself. The actual problem is duplicated points. Because lidR makes arbitrary chunks, users are likely to load the same points twice if some areas are duplicated in the original dataset. Below are a few classical cases of overlapping files:

r opt_filter(ctg) <- "-drop_withhelp"

If it is not the case one can try to retile the LAScatalog with a negative buffer to remove the buffer.

r opt_chunk_size(ctg) <- 0 opt_output_files(ctg) <- "/data/unbufferd/{ORIGINALFILENAME}" opt_chunk_buffer(ctg) <- -40 new_ctg <- catalog_retile(ctg)

It is also possible to process by file without a buffer by disabling all the internal controls using opt_wall_to_wall(). Using this disables the guarantee of having a correct, valid, continuous result. This is a free-wheel mode.

r opt_chunk_size(ctg) <- 0 opt_output_files(ctg) <- "/data/unbufferd/{ORIGINALFILENAME}" opt_chunk_buffer(ctg) <- 0 opt_wall_to_wall(ctg) <- FALSE new_ctg <- catalog_retile(ctg)

Partial processing

It is possible to process only a subregion of the collection by flagging some files. In this case only the flagged files will be processed but the neighbouring tiles will be used to load a buffer so the local context beyond the processed tiles is not lost. In the figure below 2 files are flagged processed and the display plots the other tiles almost white.

ctg$processed <- FALSE
ctg$processed[6:7] <- TRUE
plot(ctg)

Partial output

As mentioned in a previous section, when an error stops the computation the output is never NULL, but it contains a valid output computed from the part of the catalog that has been processed. It is a partial but valid output. Computation can be restarted from the failling chunk.

Low-level API

The low-level API refers to the use of catalog_apply(). This function drives every single other one and is intended to be used by developers to create new processing routines that do not exist in lidR. When catalog_apply() is running it respects all the processing options we have seen so far plus some developer's constraints that are not intended to be modified by users. catalog_apply() maps any R function but such functions must respect some rules.

Making a valid function for catalog_apply()

A function mapped by catalog_apply() must take a chunk as input and must read the chunk using readLAS(). So a valid function starts like this:

routine <- function(chunk){ 
   las <- readLAS(chunk)
}

The size of the chunks, the size of the buffer and the positioning of the chunks depend on the options carried by the LAScatalog and set with opt_chunk_*() functions. In addition, the select and filter arguments cannot be specified in readLAS(). They are controlled by opt_filter() and opt_select(), as seen in previous sections. The problem with this is that if any size of chunk can be defined it is possible to create chunks that encompass a tile but that do not contain any points. So sometimes readLAS(chunk) may return a point cloud with 0 points (see section 'empty chunks'). In that case subsequent code is likely to fail. As a consequence the function mapped by catalog_apply() must check for empty chunks and return NULL. When returning NULL the engine understands that the chunk was empty.

The following code does not work because catalog_apply() checks internally if the function returns NULL for empty chunks

opt_wall_to_wall(ctg) <- TRUE
opt_progress(ctg) <- FALSE
routine <- function(chunk){ 
  las <- readLAS(chunk)
}

catalog_apply(ctg, routine)

The following code is valid

routine <- function(chunk){ 
   las <- readLAS(chunk)
   if (is.empty(las)) return(NULL)
}

catalog_apply(ctg, routine)

Buffer management

When reading a chunk with readLAS() the LAS object read is a point cloud that encompass the chunk plus the buffer. This LAS object has an extra attribute named buffer that records whether the points are in the buffered region or not. 0 refers to no buffer, 1,2,3 and 4 refer, respectively, to bottom, left, top and right buffers. If plotted, this is what could be seen.

LASfile <- system.file("extdata", "Megaplot.laz", package="lidR")
test = readLAScatalog(LASfile)

opt_chunk_size(test) <- 150
opt_chunk_alignment(test) <- c(50,10)
opt_progress(ctg) <- FALSE
chunks = engine_chunks(test)
chunk = chunks[[5]]
las <- readLAS(chunk)
plot(las, color = "buffer")

The chunk is formally a LAScluster object. This is a lightweight object that roughly contains the names of the files that need to be read, the extent of the chunk and some metadata useful internally.

print(chunk)
#> class       : LAScluster 
#> features    : 1 
#> extent      : 684800, 684950, 5017810, 5017960  (xmin, xmax, ymin, ymax)
#> crs         : +proj=utm +zone=17 +datum=NAD83 +units=m +no_defs

raster::extent(), terra::ext() and sf::st_bbox() are valid functions that return the bounding box of the chunk. The bounding box of the chunk is the bounding box without the buffer.

raster::extent(chunk)
#> class      : Extent 
#> xmin       : 684800 
#> xmax       : 684950 
#> ymin       : 5017810 
#> ymax       : 5017960
sf::st_bbox(chunk)
#>    xmin    ymin    xmax    ymax 
#>  684800 5017810  684950 5017960

Being able to retrieve the true bounding box allows for removal of the buffer. Indeed, the point cloud is loaded with a buffer and all subsequent computation will provide an output with the buffer included in the results. At the end the engine will merge everything and the buffered region will appear twice or more. So the final output must be unbuffered by any means. One can find examples in the documentation of catalog_apply(), in this vignette or in the source code of lidR.

Control the options provided by the catalog

Sometime we need to control processing options carried by the LAScatalog to ensure that users do not use invalid options. For example rasterize_terrain() ensures that the buffer is not 0.

opt_chunk_buffer(ctg) <- 0
rasterize_terrain(ctg, 1, tin())

This can be reproduced with the option need_buffer = TRUE

routine <- function(chunk){ 
   las <- readLAS(chunk)
   if (is.empty(las)) return(NULL)
}

options = list(need_buffer = TRUE)
catalog_apply(ctg, routine, .options = options)

Other options are documented in the manual. They serve to make safe new high-level functions. The main idea being that when developers are programming new tools using catalog_apply() they are expected to know what they are doing. But when providing new functions to third-party users or to collaborators in a high-level way we are never safe. This is why the catalog engine provides options to control the inputs so that other users won't use your tools incorrectly.

Merge the outputs

By default catalog_apply() returns a list with one output per chunk.

LASfile <- system.file("extdata", "MixedConifer.laz", package="lidR")
ctg = readLAScatalog(LASfile)

opt_chunk_buffer(ctg) <- 10
opt_chunk_size(ctg) <- 100
opt_chunk_alignment(ctg) <- c(50,50)
opt_progress(ctg) <- FALSE
routine <- function(chunk){ 
   las <- readLAS(chunk)               # read the chunk
   if (is.empty(las)) return(NULL)     # exit if empty
   ttop <- locate_trees(las, lmf(3))     # make any computation
   ttop <- sf::st_crop(ttop, st_bbox(chunk))   # remove the buffer
   return(ttop)
}

out <- catalog_apply(ctg, routine)
class(out)
#> [1] "list"
print(out[[1]])
#> Simple feature collection with 178 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension:     XYZ
#> Bounding box:  xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N

This list must be reduced after catalog_apply(). The way to merge depends on the contents of the list. Here the list contains sf so we can rbind the list.

out <- do.call(rbind, out)
print(out)
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension:     XYZ
#> Bounding box:  xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N

But in practice a safe merging is not always trivial and it is annoying to do it manually. The engine supports automerging of Spatial*, sf, Raster*, stars, SpatRaster, data.frame and LAS objects.

options <- list(automerge = TRUE)
out <- catalog_apply(ctg, routine, .options = options)
print(out)
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension:     XYZ
#> Bounding box:  xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N

In the worst case the type is not supported and a list is returned anyway without failure.

Make a high-level function

catalog_apply() is not really intended to be used directly. It is intended to be used and hidden within more user-friendly functions referred to as high-level API. High-level API are functions intended to be used by third-party users. Let's assume we have designed a new application to identify dead trees. Let's call this function find_deadtrees(). This function takes a point cloud as input and returns a sf with the positioning of the dead trees + some attributes.

find_deadtree <- function(las, param1, param2)
{
   # make a complex computation
   return(dead_trees)
}

We now want to make this function work with a LAScatalog in the sane way as all the other lidR functions. One option is the following. First, the function checks the input, if it is a LAScatalog it makes use of catalog_apply() to apply itself. Then the function will be fed with LASclusters. We test if the input is a LAScluster and if it is we read the chunk and apply the function on the loaded point cloud. To finish, if the input is a LAS we perform the actual computation.

find_deadtrees <- function(las, param1, param2)
{
   if (is(las, "LAScatalog"))  {
      options <- list(automerge = TRUE, need_buffer = TRUE)
      dead_trees <- catalog_apply(las, find_deadtrees, param1 = param1, param2 = param2, .options = options)
      return(dead_trees)
   }
   else if (is(las, "LAScluster")) {
      bbox <- st_bbox(las)
      las <- readLAS(las)
      if (is.empty(las)) return(NULL) 
      dead_trees <- find_deadtrees(las, param1, param2)
      dead_trees <- sf::st_crop(dead_trees, bbox)
      return(dead_trees)
   }
   else if (is(las, "LAS")) {
      # make an advanced computation
      return(dead_trees)
   }
   else {
      stop("This type is not supported.")
   }
}

The function now works seamlessly both on a LAS and a LAScatalog. We created a function find_deadtrees() that can be applied on a whole collection of files in parallel or not, on multiple machines or not, that returns a valid continuous shapefile, that handles errors nicely, that can optionally write the output of each chunk, and so on...

# This will works
las <- readLAS(...)
deads <- find_deadtrees(las, 0.5, 3)

# And this too
library(future)
ctg <- readLAScatalog(...)
opt_chunk_size(ctg) <- 800
opt_chunk_buffer(ctg) <- 30
plan(multissesion, workers = 4L)
deads <- find_deadtrees(ctg, 0.5, 3)

Advanced usages of the engine

The following sections concern both high-level and low-level APIs and present some advanced use cases.

Modify the drivers

Using the processing option opt_output_files() users can tell the engine to sequentially write the outputs on a drive with custom filenames. In the default settings raster are written in GeoTiff files, vectors, either in sp or sf formats, are written in ESRI shapefiles, point clouds are written in las files and data.frame are written in csv files. This can be modified at any time. This is documented in help("lidR-LAScatalog-drivers"). If somehow the output of the function mapped by catalog_apply() is not a raster, a vector, a data.frame, the function will fail when writing the output.

routine <- function(chunk){ 
   las <- readLAS(chunk)
   if (is.empty(las)) return(NULL)
   # Do some computation
   # output is a list
   return(output)
}

opt_output_files(ctg) <- "{tempdir()}/{ID}"
catalog_apply(ctg, routine)
#> Erreur : Trying to write an object of class 'list' but this type is not supported.

It is possible to create a new driver. We could for example write the list in a .rds file. This can be done by creating a new entry named after the name of the class of the object that needs to be written. Here, a list.

ctg@output_options$drivers$list = list(
 write = base::saveRDS,
 object = "object",
 path = "file",
 extension = ".rds",
 param = list(compress = TRUE))

More details in help("lidR-LAScatalog-drivers") or in this wiki page.

Multi-machines paralellisation

This vignette does not cover the set-up required to make two or more machines able to communicate. In this section we assume that the user is able to connect a remote machine via SSH. There is a wiki page that covers some of this subject. Assuming a user is able to get access to multiple machines remotely and such machines are all able to read files from the same storage, the multi-machine parallelization is straightforward because it is driven by the future package. The only thing users need to do is to register a remote strategy

library(future)
plan(remote, workers = c("localhost", "john@132.203.41.25", "bob@132.203.41.152", "alice@132.203.41.125"))

Internally each chunk is sent to a worker. Remember, a chunk is not a subset of the point cloud. What is sent to each worker instead is a tiny internal object named LAScluster that roughly contains only the bounding box of the region of interest and the files that need to be read to load the corresponding point cloud. Consequently, the bandwidth needed to send a workload to each worker is virtually null. This is also why the function mapped by catalog_apply() should start with readLAS(). Indeed, each worker reads its own data using the paths provided by the LAScatalog.

On cheap multi-machine parallelization made using several regular computers on a local network, the machines won't necessarily have access to shared data. So a copy of the data is mandatory on each machine. If the data are accessible via the exact same path for each machine it will work smoothly. However, if the data are not available via the same paths it is possible to provide alternative directories where to find the files.

ctg@input_options$alt_dir = c("/home/Alice/data/", "/home/Bob/remote/project1/data/")

This is covered by this wiki page.



Try the lidR package in your browser

Any scripts or data that you put into this service are public.

lidR documentation built on Sept. 11, 2024, 5:21 p.m.