knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 2.5,
  fig.height = 2.5,
  dev.args = list(pointsize = 9)
)
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now)
    }
  }
}))
knitr::opts_chunk$set(time_it = TRUE)
options(rmarkdown.html_vignette.check_title = FALSE)
library(lidR)
data <- structure(list(
  Max.X = c(332099.99, 333600, 335099.99, 336217.52, 332099.99, 333599.99, 335099.99, 336368.67, 332099.99, 333599.99, 335100, 336217.52),
  Min.X = c(331016.91, 332100.01, 333600.01, 335100, 331016.91, 332100, 333600, 335100, 331016.92, 332100.01, 333600.01, 335100.01),
  Max.Y = c(5529993.99, 5529993.99, 5529993.99, 5529993.99, 5528399.99, 5528399.99, 5528399.99, 5528399.99, 5526399.98, 5526399.96, 5526399.99, 5526399.99),
  Min.Y = c(5528400, 5528400, 5528400, 5528400, 5526400, 5526400, 5526400, 5526400, 5524793.5, 5524793.5, 5524800.38, 5524793.5),
  Max.Z = c(53.53, 47.59, 48.66, 49.36, 46.13, 48.16, 50.51, 50.86, 45, 74.18, 52.56, 49.33),
  Min.Z = c(-15.95, -7.87, -3.55, -14.96, -5.94, -11.15, -5.11, -4.12, -9.63, -8.27, -35.88, -20.59),
  filename = c("folder/file1.las", "folder/file2.las", "folder/file3.las", "folder/file4.las", "folder/file5.las", "folder/file6.las", "folder/file7.las", "folder/file8.las", "folder/file9.las", "folder/file10.las", "folder/file11.las", "folder/file12.las")),
  row.names = c(NA, -12L), class = "data.frame")
geom <- lapply(1:nrow(data), function(i) {
  mtx <- matrix(c(data$Min.X[i], data$Max.X[i], data$Min.Y[i], data$Max.Y[i])[c(1, 1, 2, 2, 1, 3, 4, 4, 3, 3)], ncol = 2)
  sf::st_polygon(list(mtx))
})
geom <- sf::st_sfc(geom)
sf::st_crs(geom) <- 26917
data <- sf::st_set_geometry(data, geom)
ctg <- new("LAScatalog")
ctg@data <- data
In `lidR`, the `LAScatalog` processing engine refers to the function `catalog_apply()`. This function is the core of the package and internally drives every single other function capable of processing a `LAScatalog`, including `clip_roi()`, `locate_trees()`, `rasterize_terrain()`, `decimate_points()` and many others, as well as some experimental functions in third-party packages such as lidRplugins. This engine is powerful and versatile, but also relatively hard to understand for new users, especially R beginners. This vignette documents how it works by going deeper and deeper inside the engine. It is highly recommended to read the vignette named *LAScatalog formal class* before this one, even though the two vignettes overlap in places.
When processing a `LAScatalog`, the area covered is split into chunks. A chunk can be seen as a square region of interest (ROI), and together the chunks cover the collection of files. The collection of files is not physically split; the chunks correspond to a virtual splitting of the coverage. The chunks are then processed sequentially (one after the other) or in parallel, but always independently. To process each chunk, the corresponding point cloud is extracted from the collection of files and loaded into memory. Any function can be applied on these independent point clouds, and the independent outputs are stored in a `list`. The `list` contains one output per chunk, and once every chunk is processed the `list` is collapsed into a single continuous valid object such as a raster or a spatial vector. In short, the roles of the catalog engine are to split the coverage into chunks, load each chunk, apply the computation, and merge the chunk outputs back into a single valid object.
When using the `LAScatalog` engine, users only need to think about which function they want to apply over their coverage; all the above-mentioned features are managed internally. There are many possible processing tuning tools, and this is why one may feel lost among all the options to consider. To simplify, we can split these tools into two categories:

- The high-level API: `lidR` functions that perform a given operation either on a `LAS` or a `LAScatalog` object transparently, in a straightforward way. For example, `pixel_metrics()`, `rasterize_terrain()`, and `locate_trees()` are high-level API. Rule of thumb: if the function `catalog_apply()` is not directly used, it is a high-level API. Processing options can be tuned with functions that start with `opt_` (for option).
- The low-level API: `catalog_apply()` itself. This function is designed to build new high-level applications and is used internally by all the `lidR` functions, but it can also be used to build a user's own tools. Options can be tuned with the parameter `.options` of `catalog_apply()`.

In the following vignette we will discuss first the high-level API, then the low-level API. The variables named `ctg*` will refer to a `LAScatalog` object in subsequent code.
The catalog engine takes care of making chunks, and users can define the size of the chunks. By default this size is set to 0, meaning that a chunk is a file: each file will be processed one after the other. The chunk size is not the most important option. It is mainly intended for computers with limited memory, to avoid loading too many points at once. Changing the chunk size does not modify the output; a smaller chunk size only reduces the memory used by loading fewer points at once. It is recommended to keep this option at 0, but sometimes it is a good idea to set a small size to process particularly large files.
opt_chunk_buffer(ctg) <- 0
opt_chunk_size(ctg) <- 0    # Processing by files
plot(ctg, chunk = TRUE)
opt_chunk_size(ctg) <- 1000 # Processing chunks of 1000 x 1000
plot(ctg, chunk = TRUE)
Each chunk is loaded with a buffer around it to ensure that the independent point clouds are not affected by edge effects. For example, when computing a digital terrain model without a buffer, the terrain is poorly estimated at the edges of the point cloud because of the lack of neighbourhood context. The chunk buffer size is the most important option: a buffer that is too small can produce incorrect outputs. The default is 30 m, which is likely to be appropriate for most use cases.
opt_chunk_size(ctg) <- 0
opt_chunk_buffer(ctg) <- 0   # No buffer
plot(ctg, chunk = TRUE)
opt_chunk_buffer(ctg) <- 200 # 200 m buffer
plot(ctg, chunk = TRUE)
`lidR` functions always check that an appropriate buffer is set. For example, it is impossible to apply `rasterize_terrain()` with no buffer.
opt_chunk_buffer(ctg) <- 0
rasterize_terrain(ctg, 1, tin())
In some cases it might be suitable to control the alignment of the chunks to force a given pattern. This is a rare use case, but the engine supports it. This option is obviously meaningless when processing by file. The chunk alignment is not the most important option and does not modify the output, but it may generate more, or fewer, chunks depending on the alignment. It can, however, be very useful with the function `catalog_retile()`, for example, to accurately control the new tiling pattern.
opt_chunk_size(ctg) <- 2000
opt_chunk_buffer(ctg) <- 0
plot(ctg, chunk = TRUE)
opt_chunk_size(ctg) <- 2000
opt_chunk_buffer(ctg) <- 0
opt_chunk_alignment(ctg) <- c(1000, 1000)
plot(ctg, chunk = TRUE)
`clip_roi()` is the single case where the control of the chunks is not respected. `clip_roi()` aims to extract a shape (rectangle, disc, polygon) as a single entity, so the chunk pattern does not make sense here.
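For example, extracting a circular plot returns a single `LAS` object whatever the chunk options (a minimal sketch; the coordinates below are hypothetical):

```r
# Extract a 20 m radius plot as one LAS object; the chunk pattern
# set with opt_chunk_size() is ignored by the clip_* functions
# (the coordinates are hypothetical)
plot_las <- clip_circle(ctg, x = 332500, y = 5526500, radius = 20)
```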
When reading a single file with `readLAS()`, the argument `filter` allows some points to be discarded based on criteria. For example, `-keep_first` keeps only first returns and discards all non-first returns. It is important to understand that discarded points are discarded while reading: they are never loaded into memory. This is useful for processing only points of interest, for example when computing metrics only on first returns above 2 m.
# Read all the points
readLAS("file.las")

# Only first returns above 2 m are loaded
readLAS("file.las", filter = "-keep_first -drop_z_below 2")
This works only when `readLAS()` is explicitly called. When using the catalog processing engine, `readLAS()` is called internally under the hood and users cannot pass the `filter` argument explicitly. Instead, `filter` is propagated with the `opt_filter()` function.
opt_filter(ctg) <- "-keep_first -drop_z_below 2"
Internally, for each chunk, `readLAS()` will be called with `filter = "-keep_first -drop_z_below 2"` in every function that processes a `LAScatalog`, unless the documentation explicitly mentions otherwise. In the following examples, `clip_roi()` is used to extract a plot but only the points classified as ground or water will be read, and `pixel_metrics()` is used to compute metrics only on first returns above 2 meters. This does not mean the other points do not exist; they are simply not read.
opt_filter(ctg) <- "-keep_class 2 9"
las <- clip_circle(ctg, 273400, 5274500, 20)

opt_filter(ctg) <- "-keep_first -drop_z_below 2"
m <- pixel_metrics(ctg, .stdmetrics, 20)
Not all filters are appropriate everywhere. For example, the following is meaningless because it discards all the points needed to compute a DTM.
opt_filter(ctg) <- "-drop_class 2 9"
rasterize_terrain(ctg, 1, tin())
#> Error: No ground points found. Impossible to compute a DTM.
When reading a single file with `readLAS()`, the argument `select` allows some attributes of the points to be discarded. For example, `select = "ir"` keeps only the intensity and the return number of each point and discards all the other attributes, such as the scan angle, the flags or the classification. Its role is only to save processing memory by not loading data that are not needed. Similarly to `opt_filter()`, the function `opt_select()` propagates the `select` argument to `readLAS()`.
# Load Intensity only
opt_select(ctg) <- "i"
las <- clip_circle(ctg, 273500, 5274500, 10)

# Load Intensity + Classification + ReturnNumber + NumberOfReturns
opt_select(ctg) <- "icrn"
las <- clip_circle(ctg, 273500, 5274500, 10)
However, this option is not always respected because many `lidR` functions already know which optimization to apply. In the following example the classification attribute is explicitly discarded, yet the creation of a DTM works because the function overrides the user settings with something better (in this specific case `xyzc`).
opt_select(ctg) <- "* -c"
dtm <- rasterize_terrain(ctg, 1, tin())
By default every function returns a continuous output within a single R object stored in memory, so it is immediately usable in the working environment. For example, `locate_trees()` returns an `sf/sfc_POINT` object.
LASfile <- system.file("extdata", "MixedConifer.laz", package = "lidR")
ctg2 <- readLAScatalog(LASfile)
opt_progress(ctg2) <- FALSE
opt_chunk_size(ctg2) <- 100
# Find the trees
trees <- locate_trees(ctg2, lmf(3))

# The trees are immediately usable in subsequent analyses
trees
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
However, in some cases this behaviour might not be suitable, especially for a large collection of files covering a broad area. The computer might not be able to handle so much data, and/or may run into trouble when merging the independent chunks into a single object. This is why the engine can write the output of each independent chunk to disk. The function `opt_output_files()` sets a path where the output of each chunk will be saved. This path is templated, so the engine can create a different file name for each chunk. The general form is the following:
opt_output_files(ctg) <- "/folder/to/save/filename_{TEMPLATE1}_{TEMPLATE2}...{TEMPLATEN}"
The templates can be `XCENTER`, `YCENTER`, `XLEFT`, `YBOTTOM`, `XRIGHT`, `YTOP`, `ID` or `ORIGINALFILENAME`. The templated string does not contain the file extension: the engine guesses the extension, and it works whatever the output. For example:
# Force the results to be written on disk
opt_output_files(ctg2) <- paste0(tempdir(), "/tree_coordinate_{XLEFT}_{YBOTTOM}")
trees <- locate_trees(ctg2, lmf(3))

# The output has been modified by these options and now gives
# the paths to the written files (here shapefiles)
trees
#> [1] "/tmp/RtmpJQHPNz/tree_coordinate_481200_3812900.shp" "/tmp/RtmpJQHPNz/tree_coordinate_481300_3812900.shp" "/tmp/RtmpJQHPNz/tree_coordinate_481200_3813000.shp"
#> [4] "/tmp/RtmpJQHPNz/tree_coordinate_481300_3813000.shp"
There is one file per chunk, thus, when processing by file, the template `{ORIGINALFILENAME}` or its shortcut `{*}` may be suitable to match the output files with the point cloud files. However, depending on the size of the files and the capacity of the computer, this is not always possible (see section "Control of the chunk size").
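Combining file-based processing with the `{*}` shortcut could look like this (a sketch; the destination folder is hypothetical):

```r
# Process by file (chunk size 0) and name each output
# after its source file (the folder below is hypothetical)
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- "/folder/to/save/dtm_{*}"
```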
In the previous example, several shapefiles were written to disk and there is no way to combine them into a single lightweight R object. But sometimes it is possible to return a lightweight object that aggregates all the written files. For rasters, for example, it is possible to build a virtual raster mosaic. The engine does this automatically whenever possible.
# Force the results to be written on disk
opt_output_files(ctg2) <- paste0(tempdir(), "/tree_coordinate_{XLEFT}_{YBOTTOM}")
chm <- rasterize_canopy(ctg2, 1, p2r())

# Many rasters have been written on disk
# but a light raster has been returned anyway
chm
#> class      : RasterLayer
#> dimensions : 90, 90, 8100 (nrow, ncol, ncell)
#> resolution : 1, 1 (x, y)
#> extent     : 481260, 481350, 3812921, 3813011 (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=12 +datum=NAD83 +units=m +no_defs
#> source     : /tmp/RtmpZVJ2hy/rasterize_canopy.vrt
#> names      : tree_coordinate_481260_3812921
#> values     : 0, 32.07 (min, max)
There are two special cases. The first is when raster files are written: they are merged into a valid `RasterLayer` using a virtual mosaic. The second is when LAS or LAZ files are written to disk: they are merged into a `LAScatalog`.
opt_output_files(ctg2) <- paste0(tempdir(), "/plot_{ID}")
new_ctg <- clip_circle(ctg2, x, y, 20)
new_ctg
#> class       : LAScatalog (v1.2 format 0)
#> extent      : 32.372, 163.136, 38.494, 198.636 (xmin, xmax, ymin, ymax)
#> coord. ref. : NAD83 / UTM zone 17N
#> area        : 3895.031 m²
#> points      : 44 points
#> density     : 8 points/m²
#> num. files  : 4
By default, rasters are written as GeoTIFF files; spatial points, polygons and lines (in either `sp` or `sf` format) are written as ESRI shapefiles; point clouds are written as las files; and tables are written as csv files. This can be modified at any time by users, but it corresponds to advanced settings and is thus discussed later, in the section about advanced usages. However, the function `opt_laz_compression()` is a simple shortcut to switch from las to laz when writing a point cloud.
opt_laz_compression(ctg) <- TRUE
The engine provides a real-time display of the progress that serves two purposes: (a) seeing the progress and (b) monitoring troubleshooting. It is enabled by default: when a `LAScatalog` is processed, a chart is displayed and the chunks are progressively coloured. No colour: the chunk is pending. Blue: the chunk is being processed. Green: the chunk has been processed. Orange: the chunk has been processed with a warning. Red: the chunk processing failed.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:5) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
bbox <- cl[[6]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "cornflowerblue")
This can be disabled with:
opt_progress(ctg) <- FALSE
If a chunk produces a warning, this is rendered in real time with an orange colouring. However, the warning message(s) are delayed and printed only at the end of the computation.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:6) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
bbox <- cl[[7]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "orange")
bbox <- cl[[8]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "cornflowerblue")
If a chunk produces an error, this is also rendered in real time. The computation stops, but the error is handled in such a way that the code does not actually fail. The function returns a partial output, i.e. the output of the chunks that were computed successfully. A message tells the user what the error was and where it occurred, and suggests loading this specific section of the catalog to investigate what went wrong. In the following case one tile was not classified, so it failed.
#> An error occurred when processing the chunk 9. Try to load this chunk with: #> chunk <- readRDS("/tmp/RtmpTozKLu/chunk9.rds") #> las <- readLAS(chunk) #> No ground point found.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:8) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
bbox <- cl[[7]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "orange")
bbox <- cl[[9]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "red")
The engine logged the chunk, and it is easy to load this specific processing region for further investigation by copy-pasting the suggested code.
chunk <- readRDS("/tmp/RtmpTozKLu/chunk9.rds")
las <- readLAS(chunk)
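Once loaded, the chunk can be inspected like any point cloud. A sketch, assuming `las` was read as above and the failing function was `rasterize_terrain()` (which needs ground points, i.e. `Classification == 2`):

```r
# Inspect the problematic region
print(las)

# Count points per classification: an absence of class 2
# would explain the "No ground point found" error
table(las$Classification)
```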
It is possible to restart the computation at a given chunk to produce the missing parts. In the previous example chunk number 9 failed, so we can restart at 9:
opt_restart(ctg) <- 9
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
opt_restart(ctg) <- 9
cl <- engine_chunks(ctg)
for (i in 1:4) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
The engine is also able to bypass errors. This can be activated with `opt_stop_early()`, and the computation will then run to the end in any case. Chunks that failed will be missing, and the output will contain holes with missing data. We do not recommend this option, because other errors will be bypassed as well without triggering any informative message. The option exists and can be useful if used carefully, but users should always try to fix problems first.
opt_stop_early(ctg) <- FALSE
dtm <- rasterize_terrain(ctg, 1, tin())
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
opt_restart(ctg) <- 1
cl <- engine_chunks(ctg)
for (i in 1:length(cl)) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
bbox <- cl[[7]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "orange")
bbox <- cl[[9]]@bbox
graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "red")
Sometimes some chunks are empty, which is only discovered when loading the region of interest. This may happen when the chunk pattern differs from the tiling pattern: some chunks may have been created that do not actually encompass any points. This is the case when a file is only partially populated, because the bounding box of the file/tile is bigger than its actual content; it often happens at the edges of the collection. It may also happen if `filter` is set in such a way that many points are discarded and, in some chunks, every point is discarded. Such chunks are displayed in gray.
opt_chunk_size(ctg) <- 400
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:50) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
for (i in c(1, 2, 3, 14, 15, 16, 29)) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "gray")
}
The engine takes care of processing the chunks either sequentially or in parallel. Parallelization can be done on a single machine, or on multiple machines by sending chunks to different workers. It is performed using the framework provided by the `future` package, so understanding the basics of this package is required. To activate a parallelization strategy, users need to register a strategy with `future`. For example:
library(future)
plan(multisession, workers = 4L)
Now 4 chunks will be processed at a time and the engine monitoring displays the processing chunks in real time:
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- ""
opt_wall_to_wall(ctg) <- FALSE
opt_progress(ctg) <- TRUE
cl <- engine_chunks(ctg)
for (i in 1:6) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "green3")
}
for (i in 7:11) {
  bbox <- cl[[i]]@bbox
  graphics::rect(bbox[1], bbox[2], bbox[3], bbox[4], border = "black", col = "cornflowerblue")
}
However, parallel processing is not magic. First, several point clouds are loaded at once because several chunks are read at a time; users must ensure they have enough RAM to support that. Second, parallelizing tasks incurs significant overhead. Throwing all cores at a task does not always make it faster!
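To go back to sequential processing, the sequential strategy can be registered again (standard `future` usage):

```r
library(future)

# Register the sequential strategy: subsequent catalog
# processing will run one chunk at a time again
plan(sequential)
```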
In `lidR`, if some files overlap each other, a message is printed to alert users to potential problems.
ctg <- readLAScatalog("path/to/folder/")
#> Be careful, some tiles seem to overlap each other. lidR may return incorrect outputs with edge artifacts when processing this catalog.
Overlap is not an issue by itself; the actual problem is duplicated points. Because `lidR` makes arbitrary chunks, users are likely to load the same points twice if some areas are duplicated in the original dataset. Below are a few classical cases of overlapping files:
opt_filter(ctg) <- "-drop_withheld"
If this is not the case, one can try to retile the `LAScatalog` with a negative buffer to remove the buffer.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- "/data/unbuffered/{ORIGINALFILENAME}"
opt_chunk_buffer(ctg) <- -40
new_ctg <- catalog_retile(ctg)
It is also possible to process by file without a buffer by disabling all the internal controls with `opt_wall_to_wall()`. Doing so removes the guarantee of a correct, valid, continuous output. This is a free-wheel mode.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- "/data/unbuffered/{ORIGINALFILENAME}"
opt_chunk_buffer(ctg) <- 0
opt_wall_to_wall(ctg) <- FALSE
new_ctg <- catalog_retile(ctg)
When plotting the `LAScatalog`, the light blue being transparent, the overlapping regions should appear darker. You must remove these duplicated files.

It is possible to process only a subregion of the collection by flagging some files. In this case only the flagged files will be processed, but the neighbouring tiles will still be used to load a buffer, so the local context beyond the processed tiles is not lost. In the figure below, 2 files are flagged `processed` and the display plots the other tiles in almost white.
ctg$processed <- FALSE
ctg$processed[6:7] <- TRUE
plot(ctg)
As mentioned in a previous section, when an error stops the computation the output is never `NULL`: it contains a valid output computed from the part of the catalog that was processed. It is a partial but valid output, and the computation can be restarted from the failing chunk.
The low-level API refers to the use of `catalog_apply()`. This function drives every single other one and is intended to be used by developers to create new processing routines that do not exist in `lidR`. When running, `catalog_apply()` respects all the processing options we have seen so far, plus some developer's constraints that are not intended to be modified by users. `catalog_apply()` maps any R function, but such functions must respect some rules.
A function mapped by `catalog_apply()` must take a chunk as input and must read the chunk with `readLAS()`. A valid function therefore starts like this:
routine <- function(chunk) {
  las <- readLAS(chunk)
}
The size of the chunks, the size of the buffer and the positioning of the chunks depend on the options carried by the `LAScatalog` and set with the `opt_chunk_*()` functions. In addition, the `select` and `filter` arguments cannot be specified in `readLAS()`: they are controlled by `opt_filter()` and `opt_select()`, as seen in previous sections. Because any chunk size can be defined, it is possible to create chunks that intersect a tile but do not contain any points. Thus `readLAS(chunk)` may sometimes return a point cloud with 0 points (see the section 'Empty chunks'), in which case subsequent code is likely to fail. As a consequence, the function mapped by `catalog_apply()` must check for empty chunks and return `NULL` in that case. When `NULL` is returned, the engine understands that the chunk was empty.
The following code does not work, because `catalog_apply()` checks internally that the function returns `NULL` for empty chunks.
opt_wall_to_wall(ctg) <- TRUE
opt_progress(ctg) <- FALSE
routine <- function(chunk) {
  las <- readLAS(chunk)
}
catalog_apply(ctg, routine)
The following code is valid
routine <- function(chunk) {
  las <- readLAS(chunk)
  if (is.empty(las)) return(NULL)
}
catalog_apply(ctg, routine)
When reading a chunk with `readLAS()`, the `LAS` object read is a point cloud that encompasses the chunk plus its buffer. This `LAS` object has an extra attribute named `buffer` that records whether each point is in the buffered region or not: 0 means no buffer, while 1, 2, 3 and 4 refer respectively to the bottom, left, top and right buffers. If plotted, this is what could be seen.
LASfile <- system.file("extdata", "Megaplot.laz", package = "lidR")
test <- readLAScatalog(LASfile)
opt_chunk_size(test) <- 150
opt_chunk_alignment(test) <- c(50, 10)
opt_progress(test) <- FALSE
chunks <- engine_chunks(test)
chunk <- chunks[[5]]
las <- readLAS(chunk)
plot(las, color = "buffer")
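The `buffer` attribute can also be used to drop the buffered points after a computation; a minimal sketch, assuming `las` was read as above:

```r
# Keep only the points that are not in the buffer (buffer == 0);
# filter_poi() subsets a LAS object by a logical expression
las_core <- filter_poi(las, buffer == 0)
```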
The chunk is formally a `LAScluster` object. This is a lightweight object that roughly contains the names of the files that need to be read, the extent of the chunk, and some metadata used internally.
print(chunk)
#> class   : LAScluster
#> features: 1
#> extent  : 684800, 684950, 5017810, 5017960 (xmin, xmax, ymin, ymax)
#> crs     : +proj=utm +zone=17 +datum=NAD83 +units=m +no_defs
`raster::extent()`, `terra::ext()` and `sf::st_bbox()` are valid functions that return the bounding box of the chunk, i.e. the bounding box without the buffer.
raster::extent(chunk)
#> class: Extent
#> xmin : 684800
#> xmax : 684950
#> ymin : 5017810
#> ymax : 5017960

sf::st_bbox(chunk)
#>    xmin    ymin    xmax    ymax
#>  684800 5017810  684950 5017960
Being able to retrieve the true bounding box allows the buffer to be removed. Indeed, the point cloud is loaded with a buffer, so any computation will produce an output that includes the buffer. At the end the engine merges everything, and the buffered regions would appear twice or more. The final output must therefore be unbuffered by some means. Examples can be found in the documentation of `catalog_apply()`, in this vignette, or in the source code of `lidR`.
Sometimes we need to control the processing options carried by the `LAScatalog` to ensure that users do not use invalid options. For example, `rasterize_terrain()` ensures that the buffer is not 0.
opt_chunk_buffer(ctg) <- 0
rasterize_terrain(ctg, 1, tin())
This can be reproduced with the option `need_buffer = TRUE`:
routine <- function(chunk) {
  las <- readLAS(chunk)
  if (is.empty(las)) return(NULL)
}
options <- list(need_buffer = TRUE)
catalog_apply(ctg, routine, .options = options)
Other options are documented in the manual. They serve to make new high-level functions safe. The main idea is that developers programming new tools with `catalog_apply()` are expected to know what they are doing, but when providing new functions to third-party users or collaborators in a high-level way, we are never safe. This is why the catalog engine provides options to control the inputs, so that other users cannot use your tools incorrectly.
By default, `catalog_apply()` returns a `list` with one output per chunk.
LASfile <- system.file("extdata", "MixedConifer.laz", package = "lidR")
ctg <- readLAScatalog(LASfile)
opt_chunk_buffer(ctg) <- 10
opt_chunk_size(ctg) <- 100
opt_chunk_alignment(ctg) <- c(50, 50)
opt_progress(ctg) <- FALSE
routine <- function(chunk) {
  las <- readLAS(chunk)                         # read the chunk
  if (is.empty(las)) return(NULL)               # exit if empty
  ttop <- locate_trees(las, lmf(3))             # make any computation
  ttop <- sf::st_crop(ttop, sf::st_bbox(chunk)) # remove the buffer
  return(ttop)
}

out <- catalog_apply(ctg, routine)
class(out)
#> [1] "list"

print(out[[1]])
#> Simple feature collection with 178 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
This `list` must be reduced after `catalog_apply()`. The way to merge depends on the content of the `list`. Here the `list` contains `sf` objects, so we can `rbind` them.
out <- do.call(rbind, out)
print(out)
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
In practice, however, a safe merge is not always trivial and is annoying to do manually. The engine supports automerging of `Spatial*`, `sf`, `Raster*`, `stars`, `SpatRaster`, `data.frame` and `LAS` objects.
options <- list(automerge = TRUE)
out <- catalog_apply(ctg, routine, .options = options)
print(out)
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
In the worst case, if the type is not supported, a `list` is returned anyway, without failure.
`catalog_apply()` is not really intended to be used directly. It is intended to be used, and hidden, within more user-friendly functions referred to as the high-level API: functions intended for third-party users. Let's assume we have designed a new application to identify dead trees, called `find_deadtrees()`. This function takes a point cloud as input and returns an `sf` object with the positions of the dead trees plus some attributes.
find_deadtrees <- function(las, param1, param2) {
  # make a complex computation
  return(dead_trees)
}
We now want to make this function work with a LAScatalog in the same way as all the other lidR functions. One option is the following. First, the function checks its input: if it is a LAScatalog, it uses catalog_apply() to apply itself to each chunk. The function is then fed with LAScluster objects: if the input is a LAScluster, we read the chunk and apply the function to the loaded point cloud. Finally, if the input is a LAS, we perform the actual computation.
find_deadtrees <- function(las, param1, param2) {
  if (is(las, "LAScatalog")) {
    options <- list(automerge = TRUE, need_buffer = TRUE)
    dead_trees <- catalog_apply(las, find_deadtrees, param1 = param1, param2 = param2, .options = options)
    return(dead_trees)
  } else if (is(las, "LAScluster")) {
    bbox <- st_bbox(las)
    las <- readLAS(las)
    if (is.empty(las)) return(NULL)
    dead_trees <- find_deadtrees(las, param1, param2)
    dead_trees <- sf::st_crop(dead_trees, bbox)
    return(dead_trees)
  } else if (is(las, "LAS")) {
    # make an advanced computation
    return(dead_trees)
  } else {
    stop("This type is not supported.")
  }
}
The function now works seamlessly on both a LAS and a LAScatalog. We have created a function find_deadtrees() that can be applied to a whole collection of files, in parallel or not, on one machine or several, that returns a valid spatially continuous output, that handles errors nicely, that can optionally write the output of each chunk, and so on.
# This works
las <- readLAS(...)
deads <- find_deadtrees(las, 0.5, 3)

# And this too
library(future)
ctg <- readLAScatalog(...)
opt_chunk_size(ctg) <- 800
opt_chunk_buffer(ctg) <- 30
plan(multisession, workers = 4L)
deads <- find_deadtrees(ctg, 0.5, 3)
The following sections concern both high-level and low-level APIs and present some advanced use cases.
Using the processing option opt_output_files(), users can tell the engine to sequentially write the outputs to a drive with custom filenames. With the default settings, rasters are written to GeoTIFF files; vectors, in either sp or sf format, are written to ESRI shapefiles; point clouds are written to las files; and data.frame objects are written to csv files. This can be modified at any time, as documented in help("lidR-LAScatalog-drivers"). If the output of the function mapped by catalog_apply() is not a raster, a vector or a data.frame, the function will fail when writing the output.
routine <- function(chunk) {
  las <- readLAS(chunk)
  if (is.empty(las)) return(NULL)
  # Do some computation
  # output is a list
  return(output)
}

opt_output_files(ctg) <- "{tempdir()}/{ID}"
catalog_apply(ctg, routine)
#> Error: Trying to write an object of class 'list' but this type is not supported.
It is possible to create a new driver. We could, for example, write the list to a .rds file. This is done by creating a new entry named after the class of the object that needs to be written, here list.
ctg@output_options$drivers$list <- list(
  write = base::saveRDS,
  object = "object",
  path = "file",
  extension = ".rds",
  param = list(compress = TRUE))
More details can be found in help("lidR-LAScatalog-drivers") or in this wiki page.
This vignette does not cover the set-up required to make two or more machines able to communicate. In this section we assume that the user is able to connect to a remote machine via SSH. There is a wiki page that covers some of this subject. Assuming a user can access multiple machines remotely, and that such machines are all able to read files from the same storage, multi-machine parallelization is straightforward because it is driven by the future package. The only thing users need to do is register a remote strategy.
library(future)
plan(remote, workers = c("localhost",
                         "john@132.203.41.25",
                         "bob@132.203.41.152",
                         "alice@132.203.41.125"))
Internally, each chunk is sent to a worker. Remember, a chunk is not a subset of the point cloud: what is sent to each worker is a tiny internal object named LAScluster that contains roughly only the bounding box of the region of interest and the files that need to be read to load the corresponding point cloud. Consequently, the bandwidth needed to send a workload to each worker is virtually zero. This is also why the function mapped by catalog_apply() should start with readLAS(): each worker reads its own data using the paths provided by the LAScatalog.
In a cheap multi-machine setup made of several regular computers on a local network, the machines won't necessarily have access to shared storage, so a copy of the data is mandatory on each machine. If the data are accessible via the exact same path on each machine, it will work smoothly. If not, it is possible to provide alternative directories in which to find the files.
ctg@input_options$alt_dir <- c("/home/Alice/data/", "/home/Bob/remote/project1/data/")
This is covered by this wiki page.