LAScatalog processing engine

In lidR, the LAScatalog processing engine refers to the function catalog_apply(). This function is the core of the package and internally drives every other function capable of processing a LAScatalog, including clip_roi(), find_trees(), grid_terrain(), decimate_points() and many others, as well as some experimental functions in third-party packages such as lidRplugins. This engine is powerful and versatile but also relatively hard to understand for new users, especially R beginners. This vignette documents how it works, going deeper and deeper inside the engine. It is highly recommended to read the vignette named LAScatalog formal class before this one, even though the two vignettes overlap somewhat.

Overview

When processing a LAScatalog, the covered area is split into chunks. A chunk can be seen as a square region of interest (ROI), and together the chunks cover the collection of files. The collection of files is not physically split: the chunks correspond to a virtual splitting of the coverage. The chunks are then processed sequentially (one after the other) or in parallel, but always independently. To process each chunk, the corresponding point cloud is extracted from the collection of files and loaded into memory. Any function can be applied to these independent point clouds and the independent outputs are stored in a list. The list contains one output per chunk, and once every chunk is processed the list is collapsed into a single continuous valid object such as a RasterLayer or a SpatialPolygonsDataFrame. The roles of the catalog engine are to:

  1. split the coverage into chunks,
  2. load each chunk (with a buffer) into memory,
  3. apply the user's function to each chunk,
  4. monitor progress and handle errors,
  5. merge the independent outputs into a single valid object.

When using the LAScatalog engine, users only need to think about what function they want to apply over their coverage; all the features mentioned above are managed internally. There are many possible processing options, which is why one may feel lost among all the settings to consider. To simplify, we can distinguish two categories of tools:

  1. The high-level API consists of lidR functions that perform a given operation on either a LAS or a LAScatalog object transparently, in a straightforward way. For example, grid_metrics(), grid_terrain() and find_trees() belong to the high-level API. As a rule of thumb, if catalog_apply() is not used directly, it is the high-level API. Processing options can be tuned with the functions that start with opt_ (for option).
  2. The low-level API is the function catalog_apply() itself. This function is designed to build new high-level applications. It is used internally by all the lidR functions but can also be used by users to build their own tools. Options can be tuned with the parameter .options of catalog_apply().

In this vignette we first discuss the high-level API and then the low-level API. Variables named ctg* will refer to a LAScatalog object in the code examples below.
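A minimal sketch of how such an object can be created with readLAScatalog(); the folder path is a placeholder:

```r
library(lidR)

# Placeholder path: replace with a folder containing your own .las/.laz files
ctg <- readLAScatalog("path/to/las/files/")
plot(ctg)
```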

High level API

Control of the chunk size

The catalog engine takes care of making chunks, and users can define the chunk size. By default this size is set to 0, meaning that a chunk is a file and thus each file is processed sequentially. The chunk size is not the most important option. It is mainly intended for computers with limited memory, so as not to load too many points at once. Reducing or increasing the chunk size does not modify the output; it only reduces the memory used by reducing the number of points loaded at once. It is recommended to keep this option at 0, but sometimes it is a good idea to set a small size to process particularly big files.
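For example, a sketch of both settings with opt_chunk_size() (the 500 m value is only illustrative):

```r
opt_chunk_size(ctg) <- 0    # the default: one chunk = one file
opt_chunk_size(ctg) <- 500  # 500 x 500 m chunks to reduce memory usage
```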

Control of the chunk buffer

Each chunk is loaded with a buffer around it to ensure that the independent point clouds are not affected by edge effects. For example, when computing a digital terrain model without a buffer, the terrain is poorly estimated at the edges of the point cloud because of the absence of neighbourhood context. The chunk buffer size is the most important option: a buffer that is too small can produce incorrect outputs. The default is 30 m and is likely to be appropriate for most use cases.
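For example, using opt_chunk_buffer():

```r
opt_chunk_buffer(ctg) <- 30  # 30 m buffer around each chunk (the default)
```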

lidR functions always check whether an appropriate buffer is set. For example, it is impossible to apply grid_terrain() with no buffer.
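A sketch of what happens when the buffer is removed (the exact error message may differ):

```r
opt_chunk_buffer(ctg) <- 0
dtm <- grid_terrain(ctg, res = 1, algorithm = tin())
# Expected to stop with an error stating that a buffer is required
```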

Control of the chunk alignment

In some cases it might be desirable to control the alignment of the chunks to force a given pattern. This is a rare use case but the engine supports it. This option is obviously meaningless when processing by file. The chunk alignment is not the most important option and does not modify the output, but it may generate more or fewer chunks depending on the alignment. It can, however, be very useful in the particular case of catalog_retile(), for example to accurately control the new tiling pattern, as sketched below.
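A sketch of such a use; the chunk size, alignment and output path are placeholders:

```r
opt_chunk_size(ctg)      <- 1000      # 1000 x 1000 m chunks
opt_chunk_alignment(ctg) <- c(0, 0)   # snap the chunk grid to the origin (0, 0)
opt_output_files(ctg)    <- "/path/to/retiled/tile_{XLEFT}_{YBOTTOM}"  # placeholder path
newctg <- catalog_retile(ctg)         # writes the new 1000 x 1000 m tiles
```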

clip_roi() is the only case where the chunk options are not respected. clip_roi() aims to extract a shape (rectangle, disc, polygon) as a single entity, so the chunk pattern does not make sense here.

Filter points

When reading a single file with readLAS(), the option filter allows some points to be discarded based on criteria. For example, -keep_first keeps only the first returns and discards all other returns. It is important to understand that the discarded points are discarded while reading: they are never loaded into memory. This is useful to process only the points of interest, for example when computing metrics only on first returns above 2 m.

This works only when readLAS() is called explicitly. When using the catalog processing engine, readLAS() is called internally under the hood and users cannot pass the filter argument directly. Instead, filter is propagated with the opt_filter() function.
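For example:

```r
opt_filter(ctg) <- "-keep_first"  # readLAS() will receive this filter for every chunk
```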

Internally, for each chunk, readLAS() will be called with filter = "-keep_first -drop_z_below 2" (or whatever string was set with opt_filter()) in every function that processes a LAScatalog, unless the documentation explicitly mentions otherwise. In the following examples, clip_roi() is used to extract a plot but only the points classified as ground or water are read, and grid_metrics() is used to compute metrics only on first returns above 2 meters. This does not mean the other points do not exist; they are simply not read.
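A sketch of these two examples; the coordinates, radius and resolution are hypothetical:

```r
# Extract a circular plot, reading only points classified as ground (2) or water (9)
opt_filter(ctg) <- "-keep_class 2 9"
plot_gw <- clip_circle(ctg, 338000, 5238000, 15)

# Compute metrics using only first returns above 2 m
opt_filter(ctg) <- "-keep_first -drop_z_below 2"
m <- grid_metrics(ctg, ~mean(Z), res = 20)
```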

Not all filters are appropriate everywhere. For example, the following is meaningless because it discards all the points that are needed to compute a DTM.
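A sketch of such a counter-productive setting:

```r
# Meaningless: ground points (class 2) are exactly what grid_terrain() needs
opt_filter(ctg) <- "-drop_class 2"
dtm <- grid_terrain(ctg, res = 1, algorithm = tin())
```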

Select attributes

When reading a single file with readLAS(), the option select allows some attributes of the points to be discarded. For example, select = "ir" keeps only the intensity and the return number of each point and discards all the other attributes, such as the scan angle, the flags or the classification. Its only role is to save memory by not loading data that are not actually needed. Similarly to opt_filter(), the function opt_select() propagates the argument select to readLAS().
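For example:

```r
opt_select(ctg) <- "xyzi"  # load only the coordinates and the intensity
```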

However, this option is not always respected because many lidR functions already know which optimization to apply. In the following example the ‘classification’ attribute is explicitly discarded, and yet the creation of a DTM still works because the function overrides the user settings with something better (in this specific case xyzc).
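A sketch of that behaviour:

```r
opt_select(ctg) <- "xyz"   # classification explicitly discarded by the user
dtm <- grid_terrain(ctg, res = 1, algorithm = tin())
# Still works: grid_terrain() internally enforces the attributes it needs ("xyzc")
```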

Write independent chunks on disk

By default every function returns a continuous output as a single R object stored in memory, so it is immediately usable in the working environment in a straightforward way, for example a SpatialPointsDataFrame for find_trees().

However, in some cases this behaviour might not be suitable, especially for big collections of files that cover a broad area. In this case the computer may not be able to handle so much data and/or will run into trouble when merging the independent chunks into a single object. This is why the engine is able to write the output of each independent chunk to disk. The function opt_output_files() sets a path where the output of each chunk should be saved. This path is templated so the engine can create a different file name for each chunk. The general form is the following:
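A sketch with a placeholder path; only the templated part between braces is interpreted by the engine:

```r
opt_output_files(ctg) <- "/path/to/output/folder/name_{XLEFT}_{YBOTTOM}"
```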

The templates can be XCENTER, YCENTER, XLEFT, YBOTTOM, XRIGHT, YTOP, ID or ORIGINALFILENAME. The templated string does not contain the file extension: the engine guesses the extension and this works regardless of the output type. For example:
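A sketch, again with a placeholder path; the engine appends the appropriate extension itself:

```r
opt_output_files(ctg) <- "/path/to/output/chunk_{ID}"
ttops <- find_trees(ctg, lmf(ws = 5))  # writes one shapefile per chunk, e.g. chunk_1.shp
```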

There is one file per chunk, so processing by file with the template {ORIGINALFILENAME}, or its shortcut {*}, may be convenient for matching the output files with the point-cloud files. However, depending on the size of the files and the capacity of the computer, this is not always possible (see section “Control of the chunk size”).

In the previous example, several shapefiles were written to disk and there is no way to combine them into a single lightweight R object. But sometimes it is possible to return a lightweight object that aggregates all the written files. For rasters, for example, it is possible to build a virtual raster mosaic. The engine automatically does this when possible.

There are two special cases. The first one is when raster files are written. They are merged into a valid RasterLayer using a virtual mosaic. The second one is when LAS or LAZ files are written on disk. They are merged into a LAScatalog.

Modification of the file format

By default, rasters are written as GeoTIFF files; spatial points, spatial polygons and spatial lines, whether in sp or sf formats, are written as ESRI shapefiles; point clouds are written as las files; and tables are written as csv files. This can be modified at any time by users, but it corresponds to advanced settings and is therefore deferred to a later section about advanced usages. However, the function opt_laz_compression() is a simple shortcut to switch from las to laz when writing a point cloud.
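For example:

```r
opt_laz_compression(ctg) <- TRUE  # write compressed .laz files instead of .las
```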

Progress estimation

The engine provides a real-time display of the progress that serves two purposes: (a) seeing the progress and (b) monitoring troubleshooting. It is enabled by default: while a LAScatalog is being processed, a chart is displayed and the chunks are progressively coloured. No colour: the chunk is pending. Blue: the chunk is being processed. Green: the chunk has been processed. Orange: the chunk has been processed with a warning. Red: the chunk processing failed.

This can be disabled with:
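```r
opt_progress(ctg) <- FALSE  # disable the real-time progress display
```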

Error handling

If a chunk produces a warning, this is rendered in real time with an orange colouring. However, the warning message(s) are delayed and printed only at the end of the computation.

If a chunk produces an error, this is rendered in real time as well. The computation stops as usual, but the error is handled in such a way that the code does not actually fail. The function returns a partial output, i.e. the output of the chunks that were computed successfully. A message is printed telling the user where and what the error was, and suggests loading this specific section of the catalog to investigate what went wrong. In the following case one tile was not classified, so it failed.

The engine logs the chunk, so it is easy to load this specific processing region for further investigation by copy-pasting the mentioned code.

The engine is also able to bypass errors. This behaviour can be activated with opt_stop_early(), in which case the computation runs to the end no matter what. Chunks that failed will then be missing and the output will contain holes with missing data. We do not recommend this option because other errors will be bypassed as well, without triggering any informative message. The option exists and can be useful if used carefully, but users should always try to fix the problem first.
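A sketch; note that bypassing is activated by setting the option to FALSE:

```r
opt_stop_early(ctg) <- FALSE  # do not stop at the first error; failed chunks are skipped
```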

Empty chunks

Sometimes chunks are empty, which is only discovered when loading the region of interest. This may happen when the chunk pattern differs from the tiling pattern: some chunks may have been created but do not actually contain any point. This is the case when a file is only partially populated because the bounding box of the file/tile is bigger than its actual content, which often happens on the edge of the collection. In this case the engine displays the chunk in gray. It may also happen when filter is set in such a way that many points are discarded and, in some chunks, all the points end up being discarded. Those chunks are also displayed in gray.

Parallel processing

The engine takes care of processing the chunks sequentially, one after the other, or in parallel. The parallelization can be done on a single machine or on multiple machines by sending the chunks to different workers. The parallelization is performed using the framework provided by the future package, so understanding the basics of this package is required. To activate a parallelization strategy, users need to register a strategy with future. For example:
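A sketch using four local workers:

```r
library(future)
plan(multisession, workers = 4)

dtm <- grid_terrain(ctg, res = 1, algorithm = tin())  # chunks are now processed 4 at a time
```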

Now 4 chunks will be processed at a time and the engine's monitoring displays the chunks being processed in real time:

However, parallel processing is not magic. First of all, it loads several point clouds at once because several chunks are read at the same time; users must ensure they have enough RAM to support that. Second, there is significant overhead in parallelizing tasks: putting all cores on a task does not always make it faster!

Overlapping tiles

In lidR, if some files overlap each other, a message is printed to warn about potential trouble.

Overlap is not an issue by itself. The actual problem is duplicated points: because lidR makes arbitrary chunks, users are likely to load the same points twice if some areas are duplicated in the original dataset. Below are a few classic cases of overlapping files:

Partial processing

It is possible to process only a subregion of the collection by flagging some files. In this case only the flagged files are processed, but the neighbouring tiles are still used to load a buffer so the local context is not lost. Below, only 4 files are flagged and the display plots the other tiles almost white.
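A sketch, assuming the documented processed flag and illustrative file indices:

```r
ctg$processed <- FALSE
ctg$processed[c(5, 6, 9, 10)] <- TRUE  # only these 4 files will be processed
plot(ctg)
```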

Partial output

As mentioned in a previous section, when an error stops the computation the output is never NULL: it contains a valid output computed from the part of the catalog that was processed. It is a partial but valid output. Hopefully, future versions of the package will allow a failed computation to be restarted from the failing point.

Low level API

The low-level API refers to the use of catalog_apply(). This function drives every other one and is intended to be used by developers to create new processing routines that do not exist in lidR. When catalog_apply() runs, it respects all the processing options we have seen so far, plus some developer constraints that are not intended to be modified by users. catalog_apply() maps any R function, but that function must respect a few rules.

Making a valid function for catalog_apply()

A function mapped by catalog_apply() must take a chunk as input and must read the chunk using readLAS(). So a valid function starts like this:
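A minimal skeleton (a sketch; the actual computation is left as a placeholder):

```r
my_routine <- function(chunk)
{
  las <- readLAS(chunk)   # read the chunk, i.e. the point cloud plus its buffer
  # ... computation on 'las' goes here ...
}
```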

The size of the chunks, the size of the buffer and the positioning of the chunks depend on the options carried by the LAScatalog and set with the opt_chunk_*() functions. In addition, the select and filter arguments cannot be specified in readLAS(); they are controlled by opt_filter() and opt_select() as seen in previous sections. The catch is that, because any chunk size can be defined, it is possible to create chunks that lie within a file's bounding box but do not contain any points. So readLAS(chunk) may sometimes return a point cloud with 0 points (see section ‘empty chunks’), in which case subsequent code is likely to fail. As a consequence, the function mapped by catalog_apply() must check for empty chunks and return NULL. When NULL is returned, the engine understands that the chunk was empty.

The following code does not work because catalog_apply() checks internally whether the function returns NULL for empty chunks:
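An invalid sketch (find_trees() stands in for any computation):

```r
routine_invalid <- function(chunk)
{
  las <- readLAS(chunk)
  # No check for empty chunks: never returns NULL
  find_trees(las, lmf(ws = 5))
}

out <- catalog_apply(ctg, routine_invalid)  # rejected by the internal checks
```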

The following code is valid:
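The valid counterpart of the previous sketch:

```r
routine_valid <- function(chunk)
{
  las <- readLAS(chunk)
  if (is.empty(las)) return(NULL)  # mandatory: tells the engine the chunk was empty
  find_trees(las, lmf(ws = 5))
}

out <- catalog_apply(ctg, routine_valid)
```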

Buffer management

When reading a chunk with readLAS(), the LAS object read is a point cloud that encompasses the chunk plus its buffer. This LAS object has an extra attribute named buffer that records whether each point is in the buffered region or not: 0 means no buffer, and 1, 2, 3 and 4 refer respectively to the bottom, left, top and right buffers. Plotting the point cloud coloured by this attribute shows the buffer regions around the chunk.

The chunk is formally a LAScluster object. This is a lightweight object that roughly contains the names of the files that need to be read, the extent of the chunk and some metadata used internally.

It inherits from sp::Spatial, so raster::extent() and sp::bbox() are valid functions that return the bounding box of the chunk, i.e. the bounding box without the buffer.

Being able to retrieve the true bounding box allows the buffer to be removed. Indeed, the point cloud is loaded with a buffer and all subsequent computations will produce an output with the buffer included. At the end the engine merges everything, and the buffered regions would appear twice or more. So the final output must be unbuffered by some means. Examples can be found in the documentation of catalog_apply(), in this vignette or in the source code of lidR.
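A sketch of a complete routine that removes the buffer from its output before returning it; cropping on the chunk extent with raster::crop() is one possible way:

```r
routine_unbuffered <- function(chunk)
{
  las <- readLAS(chunk)
  if (is.empty(las)) return(NULL)

  ttops <- find_trees(las, lmf(ws = 5))

  # raster::extent(chunk) is the bounding box of the chunk *without* the buffer,
  # so cropping removes detections that belong to the buffered region
  ttops <- raster::crop(ttops, raster::extent(chunk))
  return(ttops)
}
```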

Control the option provided by the catalog

Sometimes we need to control the processing options carried by the LAScatalog to ensure that users did not set invalid options. For example, grid_terrain() checks that the buffer is not 0.

This can be reproduced with the option need_buffer = TRUE:
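A sketch, reusing the routine_unbuffered function defined above:

```r
opts <- list(need_buffer = TRUE)  # catalog_apply() will refuse a 0 buffer
out  <- catalog_apply(ctg, routine_unbuffered, .options = opts)
```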

Other options are documented in the manual. They serve to make new high-level functions safe. The main idea is that developers programming new tools with catalog_apply() are expected to know what they are doing, but when providing new functions to third-party users or collaborators through a high-level interface we are never safe. This is why the catalog engine provides options to control the inputs, so that users cannot misuse your tools.

Merge the outputs

By default catalog_apply() returns a list with one output per chunk.

This list must be reduced after catalog_apply(). The way to merge it depends on its content. Here the list contains SpatialPointsDataFrame objects, so we can rbind the list.
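A sketch of a manual merge, reusing routine_unbuffered from above:

```r
out   <- catalog_apply(ctg, routine_unbuffered)
ttops <- do.call(rbind, out)  # bind the per-chunk SpatialPointsDataFrames together
```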

But in practice a safe merge is not always trivial and it is tedious to do manually. The engine supports automerging of Spatial*, sf, Raster*, data.frame and LAS objects.
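A sketch using the automerge option:

```r
opts  <- list(automerge = TRUE)
ttops <- catalog_apply(ctg, routine_unbuffered, .options = opts)  # a single SpatialPointsDataFrame
```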

In the worst case, if the type is not supported, a list is returned anyway without failure.

Make an high level function

catalog_apply() is not really intended to be used directly. It is intended to be used and hidden within more user-friendly functions referred to as the high-level API. High-level API functions are intended for third-party users. Let us assume we have designed a new application to identify dead trees, and let us call this function find_deadtrees(). This function takes a point cloud as input and returns a SpatialPointsDataFrame with the positions of the dead trees plus some attributes.

We now want to make this function work with a LAScatalog in the same fashion as all the lidR functions. One option is the following. First the function checks its input: if it is a LAScatalog, it uses catalog_apply() to apply itself. The function is then fed with LASclusters, so we test whether the input is a LAScluster and, if so, we read the chunk and apply the function on the loaded point cloud. Finally, if the input is a LAS, we perform the actual computation.
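A sketch of such a wrapper; find_deadtrees() is hypothetical and the actual dead-tree detection is not shown, find_trees() merely stands in for it:

```r
find_deadtrees <- function(las, ws = 5)
{
  if (is(las, "LAScatalog"))
  {
    opts   <- list(automerge = TRUE, need_buffer = TRUE)
    output <- catalog_apply(las, find_deadtrees, ws = ws, .options = opts)
    return(output)
  }
  else if (is(las, "LAScluster"))
  {
    x <- readLAS(las)
    if (is.empty(x)) return(NULL)
    deadtrees <- find_deadtrees(x, ws = ws)
    deadtrees <- raster::crop(deadtrees, raster::extent(las))  # remove the buffer
    return(deadtrees)
  }
  else if (is(las, "LAS"))
  {
    # Actual computation: a placeholder algorithm stands in for the real dead-tree logic
    deadtrees <- find_trees(las, lmf(ws = ws))
    return(deadtrees)
  }
  else stop("Invalid input: expected a LAS, LAScluster or LAScatalog object")
}

deadtrees <- find_deadtrees(ctg, ws = 5)
```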

The function now works seamlessly on both a LAS and a LAScatalog. We have created a function find_deadtrees() that can be applied to a whole collection of files, in parallel or not, on multiple machines or not, that returns a valid continuous output, that handles errors nicely, that can optionally write the output of each chunk, and so on…

In practice there is, in our opinion, a more elegant way to achieve the same task. This approach is presented in this vignette and relies on S3 dispatch.

Advanced usages of the engine

The following sections concern both the high-level and low-level APIs and present some advanced use cases.

Modify the drivers

Using the processing option opt_output_files(), users can tell the engine to sequentially write the outputs to disk with custom filenames. With the default settings, Raster* objects are written as GeoTIFF files; Spatial* and sf objects are written as ESRI shapefiles; point clouds are written as las files; and data.frames are written as csv files. This can be modified at any time, as documented in help("lidR-LAScatalog-drivers"). If the output of the function mapped by catalog_apply() is not a Raster*, a Spatial*, an sf or a data.frame, the function will fail when writing the output.

It is possible to create a new driver. We could, for example, write lists to .rds files. This is done by creating a new entry named after the class of the object that needs to be written, here list.
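A sketch following the driver structure described in help("lidR-LAScatalog-drivers"); the field values below are assumptions and should be checked against that documentation:

```r
ctg@output_options$drivers$list <- list(
  write     = base::saveRDS,          # function used to write the object
  object    = "object",               # name of saveRDS()'s argument receiving the object
  path      = "file",                 # name of saveRDS()'s argument receiving the path
  extension = ".rds",                 # file extension appended by the engine
  param     = list(compress = TRUE))  # extra parameters passed to saveRDS()
```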

More details in help("lidR-LAScatalog-drivers") or in this wiki page.

Multi-machines paralellisation

This vignette does not cover the set-up required to make two or more machines able to communicate; in this section we assume that the user is able to connect to a remote machine via SSH. There is a wiki page that briefly covers this subject. Assuming the user has remote access to several machines and that these machines can all read the files from the same storage, multi-machine parallelization is straightforward because it is driven by the future package. The only thing to do is to register a remote strategy:
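A sketch assuming passwordless SSH access to hypothetical machines:

```r
library(future)

# 'localhost' plus two hypothetical remote workers reachable through SSH
plan(cluster, workers = c("localhost", "user@10.0.0.2", "user@10.0.0.3"))
```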

Internally, each chunk is sent to a worker. Remember, a chunk is not a subset of the point cloud. What is sent to each worker is a tiny internal object named LAScluster, which roughly contains only the bounding box of the region of interest and the files that need to be read to load the corresponding point cloud. Consequently, the bandwidth needed to send a workload to each worker is virtually null. This is also why the function mapped by catalog_apply() should start with readLAS(): each worker reads its own data using the paths provided by the LAScatalog.

With cheap multi-machine parallelization using several regular computers on a local network, the machines won't necessarily have access to shared storage, so a copy of the data is needed on each machine. If the data are accessible under the exact same path on each machine, it will work smoothly. If the data are not available under the same paths, it is possible to provide alternative directories where the files can be found.

This is covered by this wiki page.