knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Introduction

This vignette describes a workflow that covers site selection, matching, and interpolation for a case where the kpoints function is used to select sites. This workflow is appropriate when the goals are 1) to identify sites for simulation (subset cells) from a larger study area (target cells) and 2) to interpolate simulation results from the subset cells to create high-resolution, continuous maps of the study area. Here, we illustrate the workflow using an example for the state of Wyoming. The code chunks in this document are meant to be run sequentially. Data relevant to this example are contained within the rMultivariateMatching package. The site-selection and interpolation methods are described in:

Renne, R. R., Schlaepfer, D. R., Palmquist, K. A., Lauenroth, W. K., & Bradford, J. B. In preparation. Estimating complex ecological variables at high spatial resolution using cost-effective multivariate matching algorithms.

Examples of the site selection and interpolation methods in print

In the site selection example in Renne et al., our goal was to project the impacts of climate change, wildfire, and livestock grazing on big sagebrush (Artemisia tridentata) plant communities in the western United States using STEPWAT2, an individual-based, gap dynamics plant simulation model (Palmquist et al., 2018a; Palmquist et al., 2018b). We chose a set of six climate variables that capture the major drivers of plant community structure in drylands as matching variables. We used the site selection interpolation methods described in Renne et al. to generate the maps of simulation output for a recent publication investigating the projected impacts of climate change on big sagebrush plant communities:

Palmquist, K. A., Schlaepfer, D. R., Renne, R. R., Torbit, S. C., Doherty, K. E., Remington, T. E., Watson, G., Bradford, J. B., & Lauenroth, W. K. (2021). Divergent climate change effects on widespread dryland plant communities driven by climatic and ecohydrological gradients. Glob. Change Bio., 27(20), 5169-5185. https://doi.org/10.1111/gcb.15776

Interpolated datasets from the study are available from the US Geological Survey Sciencebase:

Renne, R. R., Palmquist, K. A., Schlaepfer, D. R., Lauenroth, W. K., and Bradford, J. B. (2021). High-resolution maps of big sagebrush plant community biomass using multivariate matching algorithms: U.S. Geological Survey data release. https://doi.org/10.5066/P9MNKWS4.

Overview and definitions

In this vignette, we assume that we have already determined appropriate matching variables and matching criteria and have decided on the optimal number of points (k) needed to represent the study area. To see a worked example of how to use the rMultivariateMatching package to determine matching criteria and k, please refer to the Selecting points with rMultivariateMatching vignette.

In the example presented here, we start by selecting subset cells from the target cells using kpoints. Note that this workflow is distinct from a workflow where the simulated sites (subset cells) are not selected using kpoints. A workflow for the latter scenario can be found in the "Matching and interpolation without kpoints" vignette.

The functions in the rMultivariateMatching package measure similarity between sites as the Euclidean distance of standardized (i.e., weighted) matching variables. Each variable is standardized by dividing it by the maximum amount of difference that would still allow two sites to be considered analogous for that variable. These maximum differences are called the matching criteria for each variable. For example, if two sites could be reasonably considered analogous if their mean annual precipitations (MAP) were within 50 mm of each other, the matching criteria for MAP could be set to 50 mm. Distances are calculated within the rMultivariateMatching function using the distances package (Savje 2019).

Definitions

The figure below illustrates an overview of the site selection and interpolation methods and key terminology. First, a set of matching variables are selected and standardized using user-defined matching criteria. Next, the kpoints algorithm is used to determine an optimal set of k subset cells for use in field sampling or simulation. All target cells are matched to one of the subset cells (matches are denoted by color in the central panel). Finally, point results (e.g., simulation output) for the subset cells are interpolated across the study area using multivariate matching to produce continuous, high-resolution maps of output, such as plant functional type biomass or ecohydrological variables.

# Load the rMultivariateMatching package
library(rMultivariateMatching)

Prepare inputs and select subset cells using kpoints

The first step (which you will have most likely already completed when determining matching criteria and the optimal number of subset cells), is to use the makeInputdata function to convert a rasterStack (see the raster package (Hijmans 2021) for details) of matching (and other) variables into the correct format to be used in subsequent functions.

# Load targetcells data for Target Cells
data(targetcells)
# Create data frame of potential matching variables for Target Cells
allvars <- makeInputdata(targetcells)

Next, we run the kpoints function to find an optimal set of k subset cells from the target cells. The kpoints function initializes its search for an optimal solution for k using a random selection of target cells and iteratively searches for a solution.

Importantly, the solution is likely to vary depending on the initial random selection of cells, and similar to the kmeans clustering algorithm (MacQueen 1967), kpoints can get "stuck" in a locally optimal solution that falls short of the globally optimal solution. In our case, this means that running the kpoints function with a single random selection of k cells (n_starts = 1) is likely to result in a selection of k points that does not represent as much of the study area as is possible for that number of points. Thus, to increase the probability that kpoints will find a solution that represents as much of the study areas as possible for that value of k, we need to try multiple initiations of the algorithm with different random selections of target cells. This can be achieved by setting n_starts argument to a value > 1. The default value is 10 and values <10 will throw a warning (but still run). A reasonable number of starts for testing matching criteria and determining the optimal value of k would be 10. When you are ready to find the final selection of subset cells using kpoints, you should set n_starts to a relatively large number. n_starts = 100 should be adequate, and you may choose a higher number if you wish.

Note: this function may take quite a while to run. Running kpoints with n_starts = 100 for this example could take 2-3 hours.

Explanation of kpoints arguments

Please also refer to the kpoints documentation.

# Restrict data to matching variables of interest
matchingvars <- allvars[,c("cellnumbers","x","y","bioclim_01","bioclim_04",
                           "bioclim_09","bioclim_12","bioclim_15","bioclim_18")]

# Create vector of matching criteria
criteria <- c(0.7,42,3.3,66,5.4,18.4)

# Find final solution for k = 200
results1 <- kpoints(matchingvars,criteria = criteria,klist = 200,
                    n_starts = 100,min_area = 50,iter = 50,
                    raster_template = targetcells[[1]], verify_stop = TRUE,
                    savebest = FALSE)

# Calculate coverage of best solution
results1$solution_areas/results1$totalarea
k_200   
0.9082578

Explanation of kpoints outputs

The output from the kpoints function includes:

Perform matching

Use the multivarmatch function to perform multivariate matching. This function will match every target cell to one of the subset cells identified using the kpoints function in the last step.

Explanation of multivarmatch arguments

Please also refer to the multivarmatch documentation.

Note: the function below does not specify several additional arguments that are left as default values. These include:

# Get Subset cells from solution of the kpoints algorithm
subsetcells <- results1$solutions[[1]]

# Match all Target cells to Subset cells using multivariate matching
quals <- multivarmatch(matchingvars, subsetcells, criteria = criteria,
                        matchingvars_id = "cellnumbers", addpoints = FALSE,
                        raster_template = targetcells[[1]],
                        subset_in_target = TRUE)

Explanation of multivarmatch outputs

The output from the multivarmatch function is a data frame of target cells with coordinates ('x','y'), cellnumber of each target cell ('target_cell'), unique id of matched the subset cell ('subset_cell'), and matching quality ('matching_quality'). The function will save a raster of matching quality if saveraster is TRUE and plot a map of matching quality if plotraster is TRUE.

Evaluate matching

Now that we have our final selection of k subset cells, we should evaluate the success of matching. In Renne et al., we suggest four methods of evaluating matching.

  1. Calculate matching quality (weighted Euclidean distance between target and matched subset cells)

  2. Calculate the standard deviation of differences between target and matched subset cells for a set of variables relevant to the project

  3. Calculate geographic distances between matched cells

  4. Leave-one-out cross-validation

1. Calculate matching quality

In addition to identifying the matching subset cell for each target cell, the multivarmatch function also calculates the weighted Euclidean distance between matched cells, which can be interpreted as a continuous variable measuring matching quality. Distances less than or equal to one indicate high-quality matching (matching criteria have been met for all variables). Distances greater than one indicate that the difference between these target cells and their matched subset cell exceeds the matching criterion for one or more variables.

Matching quality can be used to exclude target cells for which there is insufficient matching, i.e., no simulated site is analogous, and to determine the spatial extent of the represented area. In Renne et al., we used a cut-off of 1.5 as a threshold between high/moderate quality and low quality matching, and excluded sites with matching quality >1.5 from the final interpolated datasets.

2. Calculate standard deviation of differences between target and matched subset cells

Matching can also be evaluated by using the evaluateMatching function to calculate the standard deviation of differences between target and subset cells for a set of variables relevant to the project. This provides an estimate of how much variability there is among the target cells matched to each subset cell and is particularly informative if the variables were not used as part of the matching process.

If the rownames of allvars and matches (see below for details) do not match, the evaluateMatching function will fail. Importantly, the function will remove cases with missing values from the allvars and matches data frames and if missing values occur for different cases, the function will fail because the rownames of the two data frames will no longer match.

Here we use 19 bioclim variables (six of which were used as matching variables) to evaluate matching in the current example.

Explanation of evaluateMatching arguments

Please also refer to the evaluateMatching documentation.

# Run evaluateMatching
sddiffs <- evaluateMatching(allvars = allvars, matches = quals,
                            quality_name = "matching_quality",
                            matchingvars_id = "cellnumbers",
                            subset_in_target = TRUE, 
                            matching_distance = 1.5,
                            plot_diffs = TRUE, 
                            secondarymatch = FALSE)

Explanation of evaluateMatching outputs

A data frame of the standard deviation of differences between target and matched subset cells for all variables supplied in allvars data frame. The first row corresponds to the standard deviation of differences between target and subset cells for all cells and the second row corresponds to the standard deviation of differences between target and subset cells for only those target cells with matching quality $\le$ matching_distance. Units are the same as the units for each variable in allvars.

3. Calculate geographic distances between matched cells

Importantly, the quality of matching will also depend on the extent to which spatio-temporal patterns are maintained between matched cells. Our methods do not explicitly incorporate geographic proximity. Thus, target cells may be matched to geographically distant subset cells and simulation output from subset cells that are geographically distant may be assigned to adjacent target cells.

Although these cells may be well matched using the matching variables, they may exhibit daily patterns (e.g., precipitation, soil moisture) that are much less similar than would be expected for adjacent cells. To estimate the extent to which spatio-temporal patterns may have been maintained during matching, two measures of distance can be calculated using the evaluateGeoDist function:

  1. The distance between target cells and their matched subset cells

  2. The average distance between the subset cell that is matched to a given target cell and the subset cells that are matched to the eight adjacent neighbors of that target cell.

Explanation of evaluateGeoDist arguments

Please also refer to the evaluateGeoDist documentation.

# Look at geographic distances
geodist <- evaluateGeoDist(matches = quals, subsetcells = subsetcells,
                           subset_in_target = TRUE, 
                           quality_name = "matching_quality",
                           exclude_poor_matches = TRUE, matching_distance = 1.5,
                           longlat = TRUE, raster_template = targetcells[[1]],
                           map_distances = TRUE, map_neighbor_distances = TRUE,
                           which_distance = "both",saverasters = FALSE)

Distance (km) between target and matched subset cells.

Average distance (km) between the subset cell that is matched to a given target cell and the subset cells that are matched to the eight adjacent neighbors of that target cell.

Explanation of evaluateGeoDist outputs

A data frame with the distance between target and matched subset cells ('target_to_subset_distance') and the average distance between the subset cell matched to each target cell and the subset cells matched to the eight adjacent target cells ('avgdistance_to_neighbors'). The first column and the rownames correspond to the unique identifiers for the target cells, and columns 2 and 3 correspond to the 'x' and 'y' coordinates of the target cells.

4. Leave-one-out cross-validation

Once simulations are complete for the subset cells, interpolation errors can be estimated using leave-one-out cross validation (LOOCV) with the loocv function. In LOOCV, each subset cell is matched to its nearest neighbor from among the remaining subset cells (using Euclidean distance of the weighted matching variables). Then, an estimate of the interpolation error can be calculated using the following equation:

$$CVerror=\sqrt{\frac{1}k∑_{i=1}^{k}(y_i-\hat y_i)^2}$$

Where $k$ is the number of subset cells, $y$ is the value of the simulated output variable, and $\hat y$ is the value of the matched output variable.

For our example here, we do not have simulation output corresponding to the subset cells we selected above using the kpoints function, so we will create a mock set of output results to illustrate the loocv function here.

# Create a mock dataset of output results
output_results <- allvars[rownames(subsetcells),c("cellnumbers","bioclim_02",
                                            "bioclim_03","bioclim_16",
                                            "bioclim_17")]

Explanation of loocv arguments

Please also refer to the loocv documentation.

# Create dataset of matchingvars for subsetcells
subset_matchingvars <- matchingvars[rownames(subsetcells),-1]

# Run leave-one-out cross validation of mock output results
loocv_results <- loocv(matchingvars = subset_matchingvars, 
                       output_results = output_results,
                       criteria1 = criteria,
                       secondarymatch = FALSE, n_neighbors = 2,
                       subset_in_target = TRUE)

Explanation of loocv outputs

The output from the loocv function is a data frame of cross-validated matching, with coordinates ('x','y') and cellnumber of each subset cell ('target_cell'), unique id of matched non-self subset cell ('subset_cell') and matching quality ('matching_quality'). Subsequent columns give the difference between simulated output values and the matched output values for each cell.

The loocv function output can be used to visualize the distribution of the matching errors.

# Visualize distribution of matching errors
par(mfrow = c(2,2), mar = c(2,2,2,1), mgp = c(1,0.2,0), tcl = 0.2)
for (i in 6:9){
  hist(loocv_results[,i], breaks = 30,
       main = paste0("Matching errors: ", 
                     gsub("_diff","",names(loocv_results)[i])),
       xlab = "Simulated - Matched")
  abline(h = 0)
  legend("topright", 
         legend = c(paste0("cv-error = ", 
                           round(sqrt(mean(loocv_results[,i]^2)),2)),
                  paste0("mean = ", round(mean(loocv_results[,i]),2))),
         cex = 0.7, bty= "n", y.intersp = 0.7, x.intersp = 0.7)

  box()
}

Calculate matching errors

Finally, matching errors for each output variable can be estimated from the output of loocv using the cverrors function.

Explanation of cverrors arguments

# Estimate matching errors for each output variable
est_errors <- cverrors(loocv_output = loocv_results, 
                       first_output_column = 6)
bioclim_02 bioclim_03 bioclim_16 bioclim_17 
 0.7747073  1.2122000 16.9545430 10.3164451

Interpolate output variables

Finally, the iterpolatePoints function can be used to interpolate results from simulated subset cells to all target cells across the study area by assigning simulation results from the subset cells to their matched target cells.

For our example here, we do not have simulation output corresponding to the subset cells we selected above using the kpoints function, so we will again use our mock set of output results to illustrate the interpolatePoints when subset_in_target is TRUE.

Explanation of interpolatePoints arguments

Please also refer to the interpolatePoints documentation.

# Interpolate simulation output to rasters
interpolatePoints(matches = quals, output_results = output_results,
                  exclude_poor_matches = TRUE,
                  subset_cell_names = "subset_cell",
                  quality_name = "matching_quality",
                  matching_distance = 1.5, raster_template = targetcells[[1]],
                  plotraster = TRUE, overwrite = FALSE)

Interpolated output of bioclim_02.

Interpolated output of bioclim_03.

Interpolated output of bioclim_16.

Interpolated output of bioclim_17.

References

Hijmans, R. J. (2021). raster: Geographic Data Analysis and Modeling. R package version 3.4-13. https://CRAN.R-project.org/package=raster

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Le Cam, L. M., & Neyman, J. (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. (Vol. 1, No. 14, pp. 281-297). University of California Press.

Palmquist, K. A., Bradford, J. B., Martyn, T. E., Schlaepfer, D. R., & Lauenroth, W. K. (2018a). STEPWAT2: an individual-based model for exploring the impact of climate and disturbance on dryland plant communities. Ecosphere, 9(8). https://doi.org/10.1002/ecs2.2394

Palmquist, K. A, Schlaepfer, D. R., Martyn, T. E., Bradford, J. B., & Lauenroth, W. K. (2018b). DrylandEcology/STEPWAT2: STEPWAT2 Model Description (Palmquist, et al., 2018 Ecosphere). Zenodo. https://doi.org/10.5281/zenodo.1306924

Savje, F. (2019). distances: Tools for Distance Metrics. R package version 0.1.8. https://CRAN.R-project.org/package=distances



DrylandEcology/rMultivariateMatching documentation built on Dec. 17, 2021, 5:30 p.m.