Imputing forest attributes with FOSTER"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=12, 
  fig.height=8,
  out.width="100%")
library(foster)
library(ggplot2)
library(raster)
library(knitr)

Introduction

The goal of this vignette is to illustrate how FOSTER can be used to impute ALS-derived forest variables (response variables Y) to a larger area covered by multispectral satellite imagery and topographic data (predictor variables X). We can usually describe an imputation problem by defining two sets of observations: the reference and the target observations. At reference observations, both Y and X variables are defined while only X variables are available at targets. Ultimately, targets are the area where we want to impute response variables.

FOSTER has been designed around the following workflow:

knitr::include_graphics("https://raw.githubusercontent.com/mqueinnec/storage/main/FOSTER/FOSTER_workflow.png")

Using FOSTER

Load packages

In order to use the functions of FOSTER directly we need to attach the package using library(). Otherwise the functions need to be called explicitly from foster namespace using foster::function_name(). The raster package is regularly used throughout the workflow to read data using the functions raster, stack or brick (see below). It is therefore recommended attach the raster package as well.

library(foster)
library(raster)

Main functions implemented in FOSTER

Data preparation

Stratified random sampling:

Calculate spectral indices and time series-based metrics

These functions supports both raster and point features (ESRI Shapefiles) to calculate wall-to-wall predictor variables or only at sampled observations

Train a k-NN model and assess its accuracy

Impute response variables at targets

Data types

Two types of data are encountered when using FOSTER: raster data for wall-to-wall variables and vectors for variables extracted at sample points (point features).

Reading data from disk

Raster data

If the raster contains a single layer it can be read with raster() to create a RasterLayer object . However, if it is a raster with multiple layers, stack() (RasterStack) or brick() (RasterBrick) should be used. RasterBrick are usually more efficient to process but they can only point to a single file while RasterStack can be assembled of layers from different file sources. RasterStack objects can also be created from multiple RasterLayer objects. The functions raster, stack and brick take the filename (full path to file or relative path from your current working directory) of the raster as an argument. Please refer to the documentation of these functions to learn more about the various options and file types supported by the raster package.

#Read single layer raster
raster_layer <- raster("path/to/raster_layer.tif")
#Read multi-layer raster
raster_stack <- stack("path/to/raster_stack.tif")

Vector data

In order to read a shapefile from a file on disk we use readOGR from rgdal package, providing the directory where the file is located in dsn and the name of the layer without the .shp extension. Please refer to the documentation of rgdal for supported file type and other options. The data will be stored in a SpatialPointsDataFrame object.

dem_samples <- rgdal::readOGR(dsn = "path/to/directory", 
                              layer = "layer_name")

Write data to disk

Raster data

We can choose to process data in memory or write the output to disk. When dealing with small raster object, it is safe to process and save everything in memory. However, it is advised to write data to disk when processing large datasets. Writing raster data to disk can be done with the function raster::writeRaster taking as arguments at least the name of the raster object and its full output filename (including path and optionally extension type). Whenever calling a function of FOSTER returning a raster object, the filename can be provided directly in the function call and writeRaster will be automatically used to save the output to disk. The functions also usually supports ... arguments where additional parameters controlling writeRaster can be provided (e.g. overwrite to overwrite existing files, format to specify the output file type if not given in filename, bylayer to write each layer of the raster object individually).

#Example to write output of calcIndices to disk using different options
ind <- calcIndices(x, indices = c("NDVI","TCG","TCW"),red = 3,nir=4,filename = "full/path/to/filename.tif")
ind <- calcIndices(x, indices = c("NDVI","TCG","TCW"),red = 3,nir=4,filename = "full/path/to/filename", format="GTiff", overwrite=TRUE, byLayer=TRUE)

Whenever filename is kept to its default value "" (empty character), the output raster will be processed and saved in memory if possible. If the file is too large to be processed in memory and no filename is provided, the raster package will automatically write the output to a temporary folder. The location of the temporary folder and the maximum size that can be processed in memory can be found (among other options) using rasterOptions(). It is possible to change the global options of the raster package by using rasterOptions(optionName) <- optionValue. It is for example recommended to change the default temp directory to easily access it and clear it if necessary.

rasterOptions(tmpdir) <- "path/to/tempdir"

Vector data

The function getSample, getSampleValues, calcIndices and temporalMetrics can return SpatialPointsDataFrame objects. These objects are usually relatively small and can be easily processed in memory. However, as for raster data, it is possible to provide the name out the output under the filename argument (full path to output, file extension not necessary). Only ESRI Shapefile objects are saved by FOSTER hence any other file extension provided in filename would be overwritten by .shp

Optimize computing performance

The functions calcIndices, temporalMetrics and predictTrgs support parallel processing. To enable parallel processing you need to set par = TRUE and the number of parallel threads threads. When parallel processing is performed, the input data is divided in chunks and each cluster processes a chunk at a time. For calcIndices and temporalMetrics you can specify the number of chunks each cluster will process with the argument m (the raster will be divided into m x threads blocks).

For predictTrgs, controlling how data is processed is slightly different. Memory issues can occur when processing too much data at the same time because large matrices need to be stored in memory. The argument nrows specifies the number of rows that will be processed at a time (or per cluster when using parallel processing). By default nrows = 200 which may not be the optimum value depending on the size of the dataset and available memory on the computer. In general, increasing nrows will speed up computing but also increase risks of running into memory issues. In order to choose the best value of nrows, it is suggested to make some test runs and monitor the memory usage of the computer to see if should be increased or decreased.

Description of the example

We illustrate how FOSTER can be used by imputing two ALS-derived variables: the 95th percentile of first returns height (elev_p95) and canopy cover above mean height (cover). elev_p95 and cover have been calculated on a 20 m x 20 m grid from an ALS point cloud. For this example we assume that only three 500 m x 4 km ALS stripes are available. The goal is to impute these two ALS metrics on a 4km x 4 km area only partially covered by the ALS stripes.

We will use the following predictor variables:

The imputation in FOSTER is based on a Random Forest k-NN model (measure of nearness based on the Random Forest proximity matrix). The imputation is based on the yaImpute package.

Input data

ALS-derived forest attributes maps

We load elev_p95 and cover from .tif files and stack them in a RasterStack object Y_vars using stack() from raster package.

elev_p95 <- raster(system.file("extdata/vignette/ALS_metrics/ALS_metrics_p95.tif",package="foster"))
cover <- raster(system.file("extdata/vignette/ALS_metrics/ALS_metrics_cov_mean.tif",package="foster"))
Y_vars <- stack(elev_p95,cover)

#Set layers names
names(Y_vars) <- c("p95","cover")
Y_vars
plot(Y_vars)

Multispectral data - Annual Landsat composites

Multispectral data is derived from a 3 year time-series (2006 - 2008) of Landsat surface reflectance composite images (30 m x 30 m resolution). We load the data from 2006 as an example:

spectral_2006 <- stack(system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2006.tif",package="foster"))
names(spectral_2006) <- c("blue","green","red","nir","swir1","swir2")

spectral_2006
plot(spectral_2006, col = grey.colors(255))

Topographic data

Elevation (DEM) and terrain slope (DEM_slope) data are derived from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global Digital Elevation Model (GDEM, v.2).Both DEM and DEM_slope were resampled to a 30 m spatial resolution and aligned with the multispectral data grid.

dem <- raster(system.file("extdata/vignette/topo/DEM.tif",package="foster"))
dem_slope <- raster(system.file("extdata/vignette/topo/DEM_slope.tif",package="foster"))

plot(dem)
plot(dem_slope)

Mask of forested areas

A mask of forested areas was derived from a landcover dataset of 30 m spatial resolution and aligned with multispectral data grid. Cells have the value 1 if forested and NA otherwise.

mask_forest <- raster(system.file("extdata/vignette/landcover/VLCE_forest_2008.tif",package="foster"))
plot(mask_forest)

Data preparation

In this example, the ALS metrics (20 x 20 m) and predictor variables (30 x 30 m) have different spatial resolutions.

In order to integrate the ALS metrics and predictor variables, the first step consists in resampling Y_vars in order to match the spatial resolution, CRS and origin of the Landsat data

Y_vars_resampled <- matchResolution(x = Y_vars,
                                    ref = spectral_2006,
                                    method='bilinear')
Y_vars_resampled

The response variables have now a spatial resolution of 30 m x 30 m and are aligned on the Landsat images grid.

Smoothing response and predictor variables can help reducing noise and potential spatial errors between Landsat and ALS data and improve estimation accuracy. In this example, we smooth data by assigning to each cell the mean of its 3x3 neighbors. The function focalMultiBand requires a weight matrix that is used to define the size of the window and weights to apply to neighboring cells. Here we use a 3x3 weight matrix with weights set to 1. Next, we provide the function that is applied to the neighboring cells values (here mean). Using na.rm = TRUE allows mean to be calculated even if NA values occur in the neighborhood. Using pad=TRUE with padValues = NA creates additional rows and columns of NA values around the borders of Y_vars_resampled in order to keep the original raster extent. Finally, using keepNA = TRUE is useful to assign back NA values to cells that had a NA value in Y_vars_resampled but got assigned a non-NA value when applying the spatial filter. Multispectral data will be smoothed later on, after calculating spectral indices.

filt <- matrix(1,nrow=3,ncol=3)
Y_vars_smooth <- focalMultiBand(Y_vars_resampled,
                                w=filt,
                                fun=mean,
                                pad=TRUE,
                                padValue=NA, 
                                na.rm=TRUE, 
                                keepNA = TRUE)
plot(Y_vars_smooth)

From now, on we will focus only on cells that are classified as forests by the land cover map. In order to mask ALS metrics cells that are not forested we need to make sure that the ALS metrics and land cover maps have the same extent. This is taken care of internally by the function matchExtent

Y_vars_mask <- matchExtent(Y_vars_smooth,
                           mask_forest,
                           mask=TRUE,
                           maskValue = NA)
plot(Y_vars_mask)

# We do the same with the DEM and slope

dem_mask <- matchExtent(dem, 
                        mask_forest, 
                        mask = TRUE)

dem_slope_mask <- matchExtent(dem_slope, 
                        mask_forest, 
                        mask = TRUE)

Temporal summary metrics of spectral indices time series

Spectral indices time series

The function calcIndices is used to calculate spectral indices from multispectral data. The spectral indices calculation is based on the RStoolbox package. The list of supported spectral indices and their formula is described in the documentation of the spectralIndices function of the RStoolbox package (available here. The layer index corresponding to the blue, green, red, nir, swir1, swir3 bands need to be provided depending on the indices calculated.

Tasseled Cap brightness (TCB), greenness (TCG) and wetness (TCW) can also be calculated. Calculation of tasseled cap indices is based on the function tasseledCap from the RSToolbox package. The name of the satellite that acquired the data has to be provided under sat and the bands have to be ordered in a specific order as explained in the documentation of RStoolbox::tasseledCap (bands 1, 2, 3, 4, 5, 7 for Landsat5TM).

For this example, we calculate the Tasseled Cap Brightness (TCB), Greenness (TCG), Wetness (TCW) and NDVI. NDVI requires red and nir bands only.

indices_list <- c("NDVI","TCB","TCW","TCG")
# Example for one year
VI_2006 <- calcIndices(spectral_2006,
                   indices = indices_list, 
                   sat="Landsat5TM", 
                   red=3, 
                   nir=4)
plot(VI_2006[["TCG"]])

To process multiple years at once, all RasterStack objects (one per year) can be placed in a list.

# List all filenames in time-series order: 2006 to 2008
spectral_ts_files <- list(system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2006.tif",package = "foster"), 
                    system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2007.tif",package = "foster"), 
                    system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2008.tif",package = "foster"))

# Open all Landsat images and place them in a list
spectral_ts <- lapply(spectral_ts_files, stack)
spectral_ts
VI_ts <- calcIndices(spectral_ts,
                     indices = indices_list, 
                     sat="Landsat5TM", 
                     red=3, 
                     nir=4)

# The output is a list with VI for each year
VI_ts

We can visualize the vegetation indices of the first year of the time series

plot(VI_ts[[1]])

Once indices have been calculated we can smooth them with a 3 x 3 mean filter, as it was done with the resampled ALS metrics maps.

VI_ts_smooth <- focalMultiBand(VI_ts,
                             w=filt,
                             fun=mean,
                             na.rm=TRUE,
                             pad = TRUE,
                             keepNA = TRUE)

We can visualize the smoothed vegetation indices of the first year of the time series

plot(VI_ts_smooth[[1]])

Spectral indices temporal summary metrics

We will now calculate temporal summary metrics of the vegetation indices time series.

The function temporalMetrics requires the name of a function that returns summary metrics of a numeric vector (e.g mean, standard deviation). The default function used by temporalMetrics returns the median, IQR and Theil-Sen slope. However, the user can also define another function that returns metrics specific to its needs. For illustration, we create the function funSummary that returns the same metrics as the default function. Ideally, gap-free composite multispectral images should be used but when creating your own function you can handle the cases where NA values might occur in the time series (by setting the na.rm argument to TRUE or FALSE).

funSummary <- function(x){

  c(
    median = median(x,na.rm=TRUE),
    IQR = IQR(x,na.rm=TRUE),
    slope = theilSen(x)
  )
}

We can calculate the temporal summary of the 3 year time series of smoothed vegetation indices created above (VI_ts_smooth)

VI_ts_metrics <- temporalMetrics(VI_ts,
                               metrics="funSummary")

The output is a RasterStack where each band is a temporal summary metric of a vegetation index. In this example, there are 12 bands (4 vegetation indices, 3 temporal summary metrics).

VI_ts_metrics

plot(VI_ts_metrics[["NDVI_median"]])

Stratified random sampling

Sample points need to be extracted from the ALS metrics maps in order to train and test the imputation model.To avoid selecting cells on forested edges or close to ALS extent boundaries we use edges to assign NA values to all cell located in a 3x3 neighborhood of any cells having a NA value.

Y_vars_edges <- edges(Y_vars_mask,
                      w=3)
plot(Y_vars_edges)

In FOSTER, stratification is performed using k-means algorithm which classifies the data in k-means clusters based on the proximity between observations. Random sampling is then performed in each strata, with the number of sampled points in each strata proportional to the occurrence of those strata across the classified raster. For this example, we want to extract nSamples = 230 sample points from nClasses = 5 strata having a minimum distance between each others of at least mindist = 75 meters. Due to the small study area of this example we have to select a relatively low number of samples and set a low mindist requirement. However, it is advised to increase the number of sample points and use a larger mindist to reduce spatial autocorrelation within the samples. We use norm = TRUE to normalize variables prior to k-means clustering. Since we use xy = TRUE the output is a SpatialPointsDataFrame with x and y coordinates added as fields.

The k-means clustering is performed internally with the RStoolbox package.

nSamples = 230
nClasses = 5
mindist = 75

set.seed(1234) #For example reproducibility
sample_strata <- getSample(Y_vars_edges, 
                     layers = c("p95","cover"), 
                     n = nSamples, 
                     strata = nClasses,
                     mindist = mindist, 
                     norm = TRUE,
                     xy = TRUE)

The returned sample_strata object is a list containing the sampled points (SpatialPointsDataFrame), the map of k-means strata (RasterLayer) and the k-means model.

# Sampled points
sampleLoc <- sample_strata$sample

# Map of strata
strata_map <- sample_strata$clusterMap

# k-NN model
kmeans_model <- sample_strata$model

We can plot the sampled points on top of the map of strata:

plot(strata_map)
plot(sampleLoc,add=TRUE)

Train a k-NN model

Once all predictor variables response variables have been pre-processed and calculated and reference observations sampled, we have all the data needed to train a k-NN model using trainNN.

We start by putting all the response variables and predictor variables in a RasterStack

# Predictor variables
X_vars <- stack(VI_ts_metrics, 
                dem_mask, 
                dem_slope_mask)

# Response variables

Y_vars <- Y_vars_mask

Then, we extract the values of the predictor and response variables at the sampled reference observations

# Extract values at sample
X_vars_sample <- getSampleValues(X_vars, sampleLoc)
Y_vars_sample <- getSampleValues(Y_vars, sampleLoc)

X_vars_sample
Y_vars_sample

In order to assess the accuracy of the kNN imputation, we will split the sampled points into training and validation sets.The function partition can be used for that purpose. Three methods are implemented: "random holdout" where a percentage of the data randomly selected is left out for testing, "group holdout where the data is first grouped by quantiles and a percentage of the data within each group is left out for testing and "kfold" where k folds containing the same percentage of data left out for testing are created (cross-validation).

In this example, we perform a k-fold cross validation with 5 folds. We also split the observations based on the k-means strata obtained from stratified random sampling.

#Create data partition
set.seed(1234) #for example reproducibility
train_idx <- partition(sampleLoc$cluster,
                       type="kfold", 
                       kfold = 5,
                       returnTrain = TRUE)

train_idx

We can then train a random forest (RF) k-NN model specifying the method randomForest and optionally setting up the number of trees and the number of parameters mtry evaluated at each nodes of the RF trees. We consider only the nearest neighbor by setting up k = 1. If we supply samples that should go to training, trainNN returns a data.frame with observed and predicted variables of observations that are not in training.

set.seed(1234) #for example reproducibility
kNN <- trainNN(x = X_vars_sample,
              y=Y_vars_sample,
              inTrain = train_idx, 
              k = 1, 
              method = "randomForest",
              ntree = 200)

The output of traiNN is a list with an element model, which is the trained kNN model, and an element preds which is a data frame with predictions and observations of the sets used for validation.

kNN_model <- kNN$model

kNN_preds <- kNN$preds
head(kNN_preds, 10)

Accuracy assessment

We can use the function accuracy to compute a few accuracy summary metrics

accuracy(obs = kNN_preds$obs, 
         preds = kNN_preds$preds, 
         vars = kNN_preds$variable, 
         folds = kNN_preds$Fold)

The function scatter can be used to create a scatterplot (ggplot2 objects) of the predicted against observed values of the testing set

plots_scatter <- scatter(obs = kNN_preds$obs, 
         preds = kNN_preds$preds, 
         vars = kNN_preds$variable)

plots_scatter$cov

plots_scatter$p95

We can also get the most important variables of the RF-based kNN model and plot them in a boxplot or in a heatmap

imp <- varImp(kNN_model,scaled=FALSE,plot=TRUE,plotType="boxplot")
imp$plot
imp <- varImp(kNN$model,scaled=TRUE,plot=TRUE,plotType="grid")
imp$plot

Impute response variables at targets

Once the k-NN model has been trained and all predictor variables have been calculated at targets, we can impute response variables at targets using predictTrgs. This function requires the trained model and a raster where each layer is one of the model predictor variable. Name of layers should exactly match the name of the predictor variables used during training. It returns one raster layer per response variable and optionally the nearest neighbor ID and distance of each target cell (identified by the rownames of Y_vars_sample observations used for training the kNN model).

Y_imputed <- predictTrgs(model=kNN_model,
                         x = X_vars, 
                         nnID = TRUE, 
                         nnDist = TRUE)
Y_imputed

A raster object with 4 layers is returned: the first layer corresponds to the imputed p95, the second layer to cover and the third and fourth ones to the ID and NN distance of each reference observation the response variables have been imputed from.

plot(Y_imputed$p95)
plot(Y_imputed$cover)
plot(Y_imputed$nnID1,col=rainbow(length(unique(Y_imputed$nnID1))))
plot(Y_imputed$nnDist1)


Try the foster package in your browser

Any scripts or data that you put into this service are public.

foster documentation built on March 30, 2021, 5:11 p.m.