knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 12,
  fig.height = 8,
  out.width = "100%"
)
library(foster)
library(ggplot2)
library(raster)
library(knitr)
The goal of this vignette is to illustrate how FOSTER can be used to impute ALS-derived forest variables (response variables Y) to a larger area covered by multispectral satellite imagery and topographic data (predictor variables X). An imputation problem can usually be described by defining two sets of observations: the reference and the target observations. At reference observations, both Y and X variables are available, while only X variables are available at targets. Targets are ultimately the area where we want to impute the response variables.
FOSTER has been designed around the following workflow:
knitr::include_graphics("https://raw.githubusercontent.com/mqueinnec/storage/main/FOSTER/FOSTER_workflow.png")
In order to use the functions of FOSTER directly we need to attach the package using library(). Otherwise the functions need to be called explicitly from the foster namespace using foster::function_name(). The raster package is used regularly throughout the workflow to read data with the functions raster, stack or brick (see below). It is therefore recommended to attach the raster package as well.
library(foster)
library(raster)
- matchExtent: match the extent of a raster to the extent of a reference. Cells of the reference having a specific value can be masked in the output raster object.
- matchResolution: successively project and resample a raster's coordinate system and spatial resolution to those of a reference raster. The input layer keeps its original extent instead of inheriting it from the reference.
- focalMultiBand: apply a spatial filter (function) in the neighborhood of each cell.
- edges: assign NA values to cells located in the neighborhood of other cells having NA values. This can be used, for example, to avoid sampling cells located close to borders.
- tile: split a raster into smaller tiles. Can be used to reduce the size of data processed at a time and avoid memory issues.
- getSample: perform a k-means clustering of a raster and randomly sample cells within each cluster, proportionally to the presence of those clusters across the entire raster.
- getSampleValues: extract the values of a raster at sample points.

These functions support both raster and point feature (ESRI Shapefile) inputs, to calculate predictor variables wall-to-wall or at sampled observations only:

- calcIndices: calculate a set of spectral indices from multispectral data.
- temporalMetrics: summarize variable time series into a few metrics (e.g. mean, median, slope, IQR).

- partition: split samples into training and testing sets.
- trainNN: train a k-NN model from the training sets and use the trained model to impute the Y variables on the testing set. Also returns the model accuracy by comparing observed and imputed response variables of the testing set.
- accuracy: compute accuracy metrics from observed and predicted variables.
- scatter: create a scatterplot between observed and predicted variables.
- varImp: return the importance of each predictor variable when random forest is used to calculate the nearest neighbors.
- predictTrgs: impute response variables from a trained k-NN model and predictor variables at targets.

Two types of data are encountered when using FOSTER: raster data for wall-to-wall variables and vectors for variables extracted at sample points (point features).
If the raster contains a single layer it can be read with raster() to create a RasterLayer object. However, if it is a raster with multiple layers, stack() (RasterStack) or brick() (RasterBrick) should be used. RasterBrick objects are usually more efficient to process, but they can only point to a single file, while a RasterStack can be assembled from layers coming from different file sources. RasterStack objects can also be created from multiple RasterLayer objects.
The functions raster, stack and brick take the filename of the raster (full path to the file, or relative path from your current working directory) as an argument. Please refer to the documentation of these functions to learn more about the various options and file types supported by the raster package.
# Read single layer raster
raster_layer <- raster("path/to/raster_layer.tif")

# Read multi-layer raster
raster_stack <- stack("path/to/raster_stack.tif")
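A multi-layer file can also be opened as a RasterBrick when all layers come from a single source file; a minimal sketch using the same hypothetical path as above:

# Read multi-layer raster as a RasterBrick (single source file)
raster_brick <- brick("path/to/raster_stack.tif")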
In order to read a shapefile from a file on disk we use readOGR from the rgdal package, providing the directory where the file is located in dsn and the name of the layer without the .shp extension in layer. Please refer to the documentation of rgdal for supported file types and other options. The data will be stored in a SpatialPointsDataFrame object.
dem_samples <- rgdal::readOGR(dsn = "path/to/directory", layer = "layer_name")
We can choose to process data in memory or write the output to disk. When dealing with small raster objects, it is safe to process and save everything in memory. However, it is advised to write data to disk when processing large datasets. Writing raster data to disk can be done with the function raster::writeRaster, which takes as arguments at least the name of the raster object and its full output filename (including the path and optionally the file extension). Whenever a FOSTER function returns a raster object, a filename can be provided directly in the function call and writeRaster will automatically be used to save the output to disk. These functions also usually support ... arguments where additional parameters controlling writeRaster can be provided (e.g. overwrite to overwrite existing files, format to specify the output file type if not given in the filename, bylayer to write each layer of the raster object individually).
# Example to write output of calcIndices to disk using different options
ind <- calcIndices(x, indices = c("NDVI", "TCG", "TCW"), red = 3, nir = 4,
                   filename = "full/path/to/filename.tif")
ind <- calcIndices(x, indices = c("NDVI", "TCG", "TCW"), red = 3, nir = 4,
                   filename = "full/path/to/filename", format = "GTiff",
                   overwrite = TRUE, bylayer = TRUE)
Whenever filename is kept at its default value "" (empty character), the output raster will be processed and saved in memory if possible. If the file is too large to be processed in memory and no filename is provided, the raster package will automatically write the output to a temporary folder. The location of the temporary folder and the maximum size that can be processed in memory can be found (among other options) using rasterOptions(). It is possible to change the global options of the raster package using rasterOptions(optionName = optionValue). It is for example recommended to change the default temp directory so that it is easy to access and clear if necessary.
rasterOptions(tmpdir = "path/to/tempdir")
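The current settings can be inspected before changing them; a short sketch (the maxmemory value shown is only an illustrative assumption):

# Print the current raster package options (tmpdir, maxmemory, ...)
rasterOptions()

# Example: raise the maximum amount of memory (in bytes) the raster
# package uses before switching to processing on disk
rasterOptions(maxmemory = 1e+09)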
The functions getSample, getSampleValues, calcIndices and temporalMetrics can return SpatialPointsDataFrame objects. These objects are usually relatively small and can easily be processed in memory. However, as for raster data, it is possible to provide the name of the output under the filename argument (full path to the output; the file extension is not necessary). Only ESRI Shapefiles are saved by FOSTER, hence any other file extension provided in filename will be overwritten by .shp.
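As a sketch of this behaviour, using the placeholder objects from the examples above and a hypothetical output path:

# Extract raster values at sample points and write the resulting
# SpatialPointsDataFrame to disk as an ESRI Shapefile (.shp is appended)
sample_values <- getSampleValues(raster_stack, dem_samples,
                                 filename = "path/to/sample_values")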
The functions calcIndices, temporalMetrics and predictTrgs support parallel processing. To enable it, set par = TRUE and provide the number of parallel threads with threads. When parallel processing is performed, the input data is divided into chunks and each cluster processes one chunk at a time. For calcIndices and temporalMetrics you can specify the number of chunks each cluster will process with the argument m (the raster will be divided into m x threads blocks).
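For illustration, a minimal sketch of running calcIndices in parallel (the input x, the number of threads and the number of chunks are placeholder assumptions):

# Calculate NDVI using 4 parallel threads, each thread processing the
# raster in 2 chunks (the raster is divided into m x threads blocks)
ind_par <- calcIndices(x, indices = "NDVI", red = 3, nir = 4,
                       par = TRUE, threads = 4, m = 2)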
For predictTrgs, controlling how data is processed is slightly different. Memory issues can occur when too much data is processed at once, because large matrices need to be stored in memory. The argument nrows specifies the number of rows that will be processed at a time (or per cluster when using parallel processing). By default nrows = 200, which may not be the optimal value depending on the size of the dataset and the memory available on the computer. In general, increasing nrows speeds up processing but also increases the risk of running into memory issues. In order to choose the best value of nrows, it is suggested to make a few test runs and monitor the memory usage of the computer to see whether nrows should be increased or decreased.
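As an illustrative sketch (the trained model, predictor raster and nrows value are placeholders; the objects are created later in this vignette):

# Impute response variables while processing 100 rows of the predictor
# raster at a time to limit memory use
Y_imputed <- predictTrgs(model = kNN_model, x = X_vars, nrows = 100)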
We illustrate how FOSTER can be used by imputing two ALS-derived variables: the 95th percentile of first returns height (elev_p95) and canopy cover above mean height (cover). elev_p95 and cover have been calculated on a 20 m x 20 m grid from an ALS point cloud. For this example we assume that only three 500 m x 4 km ALS stripes are available. The goal is to impute these two ALS metrics on a 4 km x 4 km area only partially covered by the ALS stripes.
We will use the following predictor variables:

- temporal summary metrics of spectral indices calculated from a Landsat surface reflectance time series (2006-2008)
- elevation and terrain slope derived from the ASTER GDEM

The imputation in FOSTER is based on a Random Forest k-NN model (measure of nearness based on the Random Forest proximity matrix). The implementation relies on the yaImpute package.
We load elev_p95 and cover from .tif files and stack them in a RasterStack object Y_vars using stack() from the raster package.
elev_p95 <- raster(system.file("extdata/vignette/ALS_metrics/ALS_metrics_p95.tif", package = "foster"))
cover <- raster(system.file("extdata/vignette/ALS_metrics/ALS_metrics_cov_mean.tif", package = "foster"))
Y_vars <- stack(elev_p95, cover)

# Set layers names
names(Y_vars) <- c("p95", "cover")

Y_vars
plot(Y_vars)
Multispectral data is derived from a 3-year time series (2006-2008) of Landsat surface reflectance composite images (30 m x 30 m resolution). We load the data from 2006 as an example:
spectral_2006 <- stack(system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2006.tif", package = "foster"))
names(spectral_2006) <- c("blue", "green", "red", "nir", "swir1", "swir2")

spectral_2006
plot(spectral_2006, col = grey.colors(255))
Elevation (DEM) and terrain slope (DEM_slope) data are derived from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global Digital Elevation Model (GDEM, v.2). Both DEM and DEM_slope were resampled to a 30 m spatial resolution and aligned with the multispectral data grid.
dem <- raster(system.file("extdata/vignette/topo/DEM.tif", package = "foster"))
dem_slope <- raster(system.file("extdata/vignette/topo/DEM_slope.tif", package = "foster"))

plot(dem)
plot(dem_slope)
A mask of forested areas was derived from a landcover dataset of 30 m spatial resolution and aligned with the multispectral data grid. Cells have the value 1 if forested and NA otherwise.
mask_forest <- raster(system.file("extdata/vignette/landcover/VLCE_forest_2008.tif", package = "foster"))
plot(mask_forest)
In this example, the ALS metrics (20 x 20 m) and predictor variables (30 x 30 m) have different spatial resolutions.
In order to integrate the ALS metrics and predictor variables, the first step consists in resampling Y_vars to match the spatial resolution, CRS and origin of the Landsat data.
Y_vars_resampled <- matchResolution(x = Y_vars, ref = spectral_2006, method = "bilinear")
Y_vars_resampled
The response variables now have a spatial resolution of 30 m x 30 m and are aligned with the Landsat image grid.
Smoothing response and predictor variables can help reduce noise and potential spatial errors between Landsat and ALS data and improve estimation accuracy. In this example, we smooth the data by assigning to each cell the mean of its 3x3 neighborhood. The function focalMultiBand requires a weight matrix that defines the size of the window and the weights applied to neighboring cells. Here we use a 3x3 weight matrix with all weights set to 1. Next, we provide the function that is applied to the neighboring cell values (here mean). Using na.rm = TRUE allows the mean to be calculated even if NA values occur in the neighborhood. Using pad = TRUE with padValue = NA creates additional rows and columns of NA values around the borders of Y_vars_resampled in order to keep the original raster extent. Finally, keepNA = TRUE is useful to assign NA values back to cells that had a NA value in Y_vars_resampled but were assigned a non-NA value when applying the spatial filter. Multispectral data will be smoothed later on, after calculating spectral indices.
filt <- matrix(1, nrow = 3, ncol = 3)

Y_vars_smooth <- focalMultiBand(Y_vars_resampled, w = filt, fun = mean,
                                pad = TRUE, padValue = NA, na.rm = TRUE,
                                keepNA = TRUE)
plot(Y_vars_smooth)
From now on, we will focus only on cells that are classified as forest by the land cover map. In order to mask ALS metrics cells that are not forested we need to make sure that the ALS metrics and land cover maps have the same extent. This is taken care of internally by the function matchExtent.
Y_vars_mask <- matchExtent(Y_vars_smooth, mask_forest, mask = TRUE, maskValue = NA)
plot(Y_vars_mask)

# We do the same with the DEM and slope
dem_mask <- matchExtent(dem, mask_forest, mask = TRUE)
dem_slope_mask <- matchExtent(dem_slope, mask_forest, mask = TRUE)
The function calcIndices is used to calculate spectral indices from multispectral data. The spectral indices calculation is based on the RStoolbox package. The list of supported spectral indices and their formulas is described in the documentation of the spectralIndices function of the RStoolbox package. The layer index corresponding to each of the blue, green, red, nir, swir1, swir2 or swir3 bands needs to be provided, depending on the indices calculated.
Tasseled Cap brightness (TCB), greenness (TCG) and wetness (TCW) can also be calculated. The calculation of tasseled cap indices is based on the function tasseledCap from the RStoolbox package. The name of the satellite that acquired the data has to be provided under sat, and the bands have to be provided in a specific order, as explained in the documentation of RStoolbox::tasseledCap (bands 1, 2, 3, 4, 5, 7 for Landsat5TM).
For this example, we calculate the Tasseled Cap Brightness (TCB), Greenness (TCG), Wetness (TCW) and NDVI. NDVI requires the red and nir bands only.
indices_list <- c("NDVI", "TCB", "TCW", "TCG")

# Example for one year
VI_2006 <- calcIndices(spectral_2006, indices = indices_list, sat = "Landsat5TM",
                       red = 3, nir = 4)
plot(VI_2006[["TCG"]])
To process multiple years at once, all RasterStack objects (one per year) can be placed in a list.
# List all filenames in time-series order: 2006 to 2008
spectral_ts_files <- list(system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2006.tif", package = "foster"),
                          system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2007.tif", package = "foster"),
                          system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2008.tif", package = "foster"))

# Open all Landsat images and place them in a list
spectral_ts <- lapply(spectral_ts_files, stack)
spectral_ts
VI_ts <- calcIndices(spectral_ts, indices = indices_list, sat = "Landsat5TM",
                     red = 3, nir = 4)

# The output is a list with VI for each year
VI_ts
We can visualize the vegetation indices of the first year of the time series:
plot(VI_ts[[1]])
Once indices have been calculated we can smooth them with a 3 x 3 mean filter, as it was done with the resampled ALS metrics maps.
VI_ts_smooth <- focalMultiBand(VI_ts, w=filt, fun=mean, na.rm=TRUE, pad = TRUE, keepNA = TRUE)
We can visualize the smoothed vegetation indices of the first year of the time series:
plot(VI_ts_smooth[[1]])
We will now calculate temporal summary metrics of the vegetation indices time series.
The function temporalMetrics requires the name of a function that returns summary metrics of a numeric vector (e.g. mean, standard deviation). The default function used by temporalMetrics returns the median, IQR and Theil-Sen slope. However, the user can also define another function returning metrics specific to their needs. For illustration, we create the function funSummary, which returns the same metrics as the default function. Ideally, gap-free composite multispectral images should be used, but when creating your own function you can handle cases where NA values might occur in the time series (by setting the na.rm argument to TRUE or FALSE).
funSummary <- function(x) {
  c(
    median = median(x, na.rm = TRUE),
    IQR = IQR(x, na.rm = TRUE),
    slope = theilSen(x)
  )
}
We can calculate the temporal summary of the 3-year time series of smoothed vegetation indices created above (VI_ts_smooth).
VI_ts_metrics <- temporalMetrics(VI_ts_smooth, metrics = "funSummary")
The output is a RasterStack where each band is a temporal summary metric of a vegetation index. In this example, there are 12 bands (4 vegetation indices, 3 temporal summary metrics).
VI_ts_metrics
plot(VI_ts_metrics[["NDVI_median"]])
Sample points need to be extracted from the ALS metrics maps in order to train and test the imputation model. To avoid selecting cells on forest edges or close to the ALS extent boundaries, we use edges to assign NA values to all cells located in a 3x3 neighborhood of any cell having a NA value.
Y_vars_edges <- edges(Y_vars_mask, w = 3)
plot(Y_vars_edges)
In FOSTER, stratification is performed using the k-means algorithm, which classifies the data into k clusters based on the proximity between observations. Random sampling is then performed in each stratum, with the number of sampled points per stratum proportional to the occurrence of that stratum across the classified raster. For this example, we want to extract nSamples = 230 sample points from nClasses = 5 strata, with a minimum distance between samples of at least mindist = 75 meters. Due to the small study area of this example we have to select a relatively low number of samples and set a low mindist requirement. However, it is advised to increase the number of sample points and use a larger mindist to reduce spatial autocorrelation within the samples. We use norm = TRUE to normalize the variables prior to k-means clustering. Since we use xy = TRUE, the output is a SpatialPointsDataFrame with x and y coordinates added as fields.

The k-means clustering is performed internally with the RStoolbox package.
nSamples <- 230
nClasses <- 5
mindist <- 75

set.seed(1234) # For example reproducibility

sample_strata <- getSample(Y_vars_edges, layers = c("p95", "cover"), n = nSamples,
                           strata = nClasses, mindist = mindist, norm = TRUE,
                           xy = TRUE)
The returned sample_strata object is a list containing the sampled points (SpatialPointsDataFrame), the map of k-means strata (RasterLayer) and the k-means model.
# Sampled points
sampleLoc <- sample_strata$sample

# Map of strata
strata_map <- sample_strata$clusterMap

# k-means model
kmeans_model <- sample_strata$model
We can plot the sampled points on top of the map of strata:
plot(strata_map)
plot(sampleLoc, add = TRUE)
Once all predictor and response variables have been pre-processed and calculated, and the reference observations have been sampled, we have all the data needed to train a k-NN model using trainNN.
We start by putting all the response variables and predictor variables in a RasterStack
# Predictor variables
X_vars <- stack(VI_ts_metrics, dem_mask, dem_slope_mask)

# Response variables
Y_vars <- Y_vars_mask
Then, we extract the values of the predictor and response variables at the sampled reference observations
# Extract values at sample points
X_vars_sample <- getSampleValues(X_vars, sampleLoc)
Y_vars_sample <- getSampleValues(Y_vars, sampleLoc)

X_vars_sample
Y_vars_sample
In order to assess the accuracy of the kNN imputation, we will split the sampled points into training and validation sets. The function partition can be used for that purpose. Three methods are implemented: "random holdout", where a randomly selected percentage of the data is left out for testing; "group holdout", where the data is first grouped by quantiles and a percentage of the data within each group is left out for testing; and "kfold", where k folds each containing the same percentage of data left out for testing are created (cross-validation).
In this example, we perform a k-fold cross validation with 5 folds. We also split the observations based on the k-means strata obtained from stratified random sampling.
# Create data partition
set.seed(1234) # for example reproducibility

train_idx <- partition(sampleLoc$cluster, type = "kfold", kfold = 5, returnTrain = TRUE)
train_idx
We can then train a random forest (RF) based k-NN model by specifying the method randomForest and optionally setting the number of trees and the number of variables mtry evaluated at each node of the RF trees. We consider only the nearest neighbor by setting k = 1. If we supply the samples that should go to training, trainNN returns a data.frame with the observed and predicted variables of the observations that were not used for training.
set.seed(1234) # for example reproducibility

kNN <- trainNN(x = X_vars_sample, y = Y_vars_sample, inTrain = train_idx, k = 1,
               method = "randomForest", ntree = 200)
The output of trainNN is a list with an element model, which is the trained kNN model, and an element preds, which is a data frame with the predictions and observations of the sets used for validation.
kNN_model <- kNN$model
kNN_preds <- kNN$preds

head(kNN_preds, 10)
We can use the function accuracy to compute a few accuracy summary metrics.
accuracy(obs = kNN_preds$obs, preds = kNN_preds$preds, vars = kNN_preds$variable, folds = kNN_preds$Fold)
The function scatter can be used to create scatterplots (ggplot2 objects) of the predicted against observed values of the testing set.
plots_scatter <- scatter(obs = kNN_preds$obs, preds = kNN_preds$preds,
                         vars = kNN_preds$variable)

plots_scatter$cov
plots_scatter$p95
We can also get the most important variables of the RF-based kNN model and plot them in a boxplot or in a heatmap
imp <- varImp(kNN_model, scaled = FALSE, plot = TRUE, plotType = "boxplot")
imp$plot

imp <- varImp(kNN$model, scaled = TRUE, plot = TRUE, plotType = "grid")
imp$plot
Once the k-NN model has been trained and all predictor variables have been calculated at targets, we can impute the response variables at targets using predictTrgs. This function requires the trained model and a raster where each layer is one of the model's predictor variables. Layer names should exactly match the names of the predictor variables used during training. It returns one raster layer per response variable and, optionally, the nearest neighbor ID and distance for each target cell (the ID corresponds to the rownames of the Y_vars_sample observations used to train the kNN model).
Y_imputed <- predictTrgs(model = kNN_model, x = X_vars, nnID = TRUE, nnDist = TRUE)
Y_imputed
A raster object with 4 layers is returned: the first layer corresponds to the imputed p95, the second layer to cover, and the third and fourth layers to the ID of, and nearest neighbor distance to, the reference observation from which each cell's response variables have been imputed.
plot(Y_imputed$p95)
plot(Y_imputed$cover)
plot(Y_imputed$nnID1, col = rainbow(length(unique(Y_imputed$nnID1))))
plot(Y_imputed$nnDist1)