In DrylandEcology/rMultivariateMatching: Multivariate matching for site selection and interpolation using multi-dimensional datasets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Introduction

This vignette describes a workflow for selecting an optimal subset of sites (k) from within a study area using the kpoints function, such that the k sites maximize the area represented by those sites (subset cells). The method was designed to select optimal sites for simulation, for the purpose of applying the interpolatePoints function to interpolate simulation output from the subset cells to create high-resolution maps of output variables across the study area (target cells). The site selection method is meant to balance the cost of adding more sites (in terms of computational or other resources) with the benefit of representing a larger proportion of the study area. Importantly, the site selection method could also be used to select an optimal set of sites for field sampling. Here, we illustrate the workflow for site selection using an example for the state of Wyoming. The code chunks in this document are meant to be run sequentially. Data relevant to this example are contained within the rMultivariateMatching package. The site-selection method is described in:

Renne, R. R., Schlaepfer, D. R., Palmquist, K. A., Lauenroth, W. K., & Bradford, J. B. In preparation. Estimating complex ecological variables at high spatial resolution using cost-effective multivariate matching algorithms.

Examples of the site selection and interpolation methods in print

In the site selection example in Renne et al., our goal was to project the impacts of climate change, wildfire, and livestock grazing on big sagebrush (Artemisia tridentata) plant communities in the western United States using STEPWAT2, an individual-based, gap dynamics plant simulation model (Palmquist et al., 2018a; Palmquist et al., 2018b). We chose a set of six climate variables that capture the major drivers of plant community structure in drylands as matching variables. We used the site selection and interpolation methods described in Renne et al. to generate the maps of simulation output for a recent publication investigating the projected impacts of climate change on big sagebrush plant communities:

Palmquist, K. A., Schlaepfer, D. R., Renne, R. R., Torbit, S. C., Doherty, K. E., Remington, T. E., Watson, G., Bradford, J. B., & Lauenroth, W. K. (2021). Divergent climate change effects on widespread dryland plant communities driven by climatic and ecohydrological gradients. Glob. Change Bio., 27(20), 5169-5185. https://doi.org/10.1111/gcb.15776

Interpolated datasets from the study are available from the US Geological Survey Sciencebase:

Renne, R. R., Palmquist, K. A., Schlaepfer, D. R., Lauenroth, W. K., and Bradford, J. B. (2021). High-resolution maps of big sagebrush plant community biomass using multivariate matching algorithms: U.S. Geological Survey data release. https://doi.org/10.5066/P9MNKWS4.

Overview and definitions

The functions in the rMultivariateMatching package measure similarity between sites as the Euclidean distance of standardized (i.e., weighted) matching variables. Each variable is standardized by dividing it by the maximum amount of difference that would still allow two sites to be considered analogous for that variable. These maximum differences are called the matching criteria. For example, if two sites could be reasonably considered analogous if their mean annual precipitations (MAP) were within 50 mm of each other, the matching criteria for MAP could be set to 50 mm. Distances are calculated within the rMultivariateMatching function using the distances package (Savje 2019).

Definitions

subset cells A subset of sites (selected using kpoints or determined from some other method) for simulation. Sites are defined geographically as the centroid of the raster cell that they represent.
target cells All cells across a study area.
matching variables A set of variables that are relevant to the ecosystem and simulation model and are used to determine similarity between sites. Matching variables must be available as rasters covering the entire study area.
matching criteria Maximum allowable differences between target and matched subset cells for each matching variable such that sites can be considered analogous.
kpoints Site selection algorithm that uses multivariate clustering based on the unsupervised classification scheme, kmeans (MacQueen 1967), to find groups of analogous sites across a study area.

The figure (Fig. 1 from Renne et al.) below illustrates an overview of the site selection and interpolation methods and key terminology. First, a set of matching variables are selected and standardized using user-defined matching criteria. Next, the kpoints algorithm is used to determine an optimal set of k subset cells for use in field sampling or simulation. All target cells are matched to one of the subset cells (matches are denoted by color in the central panel). Finally, point results (e.g., simulation output) for the subset cells are interpolated across the study area using multivariate matching to produce continuous, high-resolution maps of output, such as plant functional type biomass or ecohydrological variables.

# Load the rMultivariateMatching package
library(rMultivariateMatching)

Overview of the kpoints function

In the kpoints function, k sites (subset cells) are randomly selected from all cells in the study area (target cells). Then, each target cell is assigned to a group according to its nearest neighbor among the subset cells using Euclidean distance of the weighted (standardized) matching variables. As a result, all target cells, whether part of the random selection of k cells or not, are assigned to a group. Next, the centroids (multivariate mean) of the groups are calculated and the target cell that is the nearest neighbor to the centroid of each group is selected as one of an updated set of subset cells. In the next iteration, each target cell is assigned to a group according to its nearest neighbor among the new selection of k subset cells and the process continues until a stopping criterion is met: either 1) a predefined number of iterations are completed (iter parameter in the kpoints function) or the change in area represented between iterations is at or below a designated threshold (min_area parameter in the kpoints function) for five consecutive iterations.

Importantly, because the proportion of the area represented by the k subset cells may increase or decrease between consecutive iterations, the last iteration before stopping may not represent the best solution. The kpoints function saves the subset cells and areal coverage for each iteration and once a stopping criterion is met, the subset cells for the iteration that represented the largest proportion of the study area are saved as the best solution.

Stopping criteria are designed to stop the algorithm when an optimal solution has been found (little to no gain in represented area for additional iterations) and to prevent the algorithm from running indefinitely, i.e., it will stop after the designated number of iterations (iter), whether or not an optimal solution has been found. Because the algorithm may stop prior to finding an optimal solution, suitable values for iter and min_area should be determined to ensure that the change in area represented by each iteration levels out before stopping (set verify_stop to TRUE). See Verify stopping criteria below.

In addition to the stopping criteria, the n_starts argument in the kpoints function can have a strong influence on the quality of the results. The solution to the kpoints function is likely to vary depending on the initial random selection of cells. Similar to the kmeans clustering algorithm (MacQueen 1967), kpoints can get "stuck" in a locally optimal solution that falls short of the globally optimal solution. In our case, this means that running the kpoints function with a single random selection of k cells (n_starts = 1) is likely to result in a selection of k points that does not represent as much of the study area as is possible for that number of points. Thus, to increase the probability that kpoints will find a solution that represents as much of the study areas as possible for that value of k, we need to try multiple initiations of the algorithm with different initial random selections of target cells. This can be achieved by setting n_starts argument to a value > 1. The default value is 10 and values <10 will throw a warning (but still run). A reasonable number of starts for testing matching criteria and determining the optimal value of k would be 10. When you are ready to find the final selection of subset cells using kpoints, you should set n_starts to a relatively large number. n_starts = 100 should be adequate, and you may choose a higher number if you wish.

Importantly, the kpoints function can take some time to run, particularly if n_starts is large, when solving over a large range of possible values of k, and when working with a large study area or with many matching variables. Because of possibly long run times, kpoints generates text output at the start of each iteration that reports the time, i of iter, k, and n of n_starts, followed by text output at the end of each iteration that reports i of iter, n of n_starts, k, and coverage (both in $km^{2}$ and as a percentage of total area).

Identify variables for matching, selecting matching criteria, and determining the optimal number of points

The first step is to identify the matching variables. This will depend on the data that are available and the ecosystem and processes in question. Matching variables must be provided as a stack of rasters (rasterStack, see the raster package (Hijmans 2021) for details).

Once you have identified your matching variables, you will need to convert the rasterStack to a data frame that can be used in subsequent functions in the rMultivariateMatching package. Importantly, it can be useful to include other variables that are not for matching but are still relevant to the ecosystem and processes in question so that you can use these to evaluate matching (using the evaluateMatching function) later on.

Prepare input data

The makeInputdata function takes the rasterStack of matching (and other) variables and transforms it into the correct format to be used in subsequent functions.

# Load targetcells data for Target Cells
data(targetcells)

# Create data frame of potential matching variables for Target Cells
allvars <- makeInputdata(targetcells)

Choose matching criteria to standardize matching variables

The choose_criteria function runs through a list of vectors of matching criteria and calculates the proportion of areal coverage of each set. You can compare the areal coverage across the sets to determine an acceptable balance between how strict the matching criteria are and the proportion of the study area that can be represented by a given number of k points. Importantly, you will need to choose a value for k (number of subset cells) to run this function, which may need to be somewhat arbitrary if you have not yet run kpoints over a range of possible values of k to determine the optimal number of points.

Typically, determining matching criteria and k will involve some experimentation by running kpoints over a range of possible values for k with different sets of matching criteria. In the example below, we have set k to 200.

# Restrict data to matching variables of interest
matchingvars <- allvars[,c("cellnumbers","x","y","bioclim_01","bioclim_04",
                           "bioclim_09","bioclim_12","bioclim_15","bioclim_18")]

# Create list of matching criteria to choose:
# Look at 2.5%, 5%, & 10% of range and standard deviation for each variable
range2.5pct <- apply(matchingvars[,4:ncol(matchingvars)],2,
                     function(x){(max(x)-min(x))*0.025})
range5pct <- apply(matchingvars[,4:ncol(matchingvars)],2,
                     function(x){(max(x)-min(x))*0.05})
range10pct <- apply(matchingvars[,4:ncol(matchingvars)],2,
                     function(x){(max(x)-min(x))*0.1})
stddev <- apply(matchingvars[,4:ncol(matchingvars)],2,sd)

# Create criteria_list
criteria_list <- list(range2.5pct, range5pct, range10pct, stddev)

Table displaying `criteria_list`

| bioclim_01 | bioclim_04 | bioclim_09 | bioclim_12 | bioclim_15 | bioclim_18 | |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| | 0.36 | 10.62 | 0.82 | 32.97 | 1.36 | 4.61 | | 0.73 | 21.23 | 1.64 | 65.93 | 2.72 | 9.21 | | 1.45 | 42.47 | 3.29 | 131.86 | 5.44 | 18.42 | | 2.49 | 60.37 | 7.41 | 170.74 | 13.53 | 34.27 |

Explanation of `choose_criteria` arguments

Please also refer to the choose_criteria documentation.

matchingvars A data frame representing all target cells from which a subset of k points will be selected. Ideally, this data frame should be created using the makeInputdata function.
criteria_list A list of vectors of matching criteria to be tested.
n_starts The number of random selections of target cells to use for initializing different runs of the kpoints algorithm (as discussed above). Default is 10.
k The number of subset cells to select using the kpoints function.
verify_stop This is a boolean that indicates whether or not a figure displaying the proportion of the study area represented for each iteration of each n of n_starts should be displayed. This argument is used to verify that the stopping criteria are appropriate. See above for details.
raster_template This is one of the rasters used to define the target cells and it is used as a template to determine the area represented by each iteration and by the solution for each of the n_starts. This argument is fundamental to the kpoints function because kpoints is designed to optimize based on represented area.
subset_in_target A boolean indicating if the subset cells are taken from within the target cells. When the kpoints function is used to select the subset cells, this will be TRUE (as is the case in this example).
plot_coverage A boolean indicating if the proportion of the study area represented by each vector of matching criteria in the criteria_list should be displayed in a bar plot. (Set to FALSE in this example).

Note: the function below does not specify additional arguments that are passed to the kpoints function (meaning default values will be used):

min_area Default is 50.
iter Default is 50.

# Compare coverage with various criteria
results2 <- choose_criteria(matchingvars, criteria_list = criteria_list, 
                            n_starts = 10, k = 200, verify_stop = FALSE,
                            raster_template = targetcells[[1]], 
                            subset_in_target = TRUE,
                            plot_coverage = FALSE)

Explanation of `choose_criteria` outputs

The output from the choose_criteria function is a named list where the first item ('totalarea') reports the total area in $km^2$ of the target cells and the second item ('solution_areas') reports the area represented (Euclidean distance of weighted matching variables is $\le 1$ between target and matched subset cells) for each set of matching criteria.

In the above example, plot_coverage is set to FALSE. If TRUE, the choose_criteia function will generate a barplot that shows the proportion of the study area that is represented by each set of matching criteria in criteria_list. You can also set plot_coverage to FALSE and use the criteriaplot function to visualize the results of the choose_criteria function, as shown below.

# Plot coverage
criteriaplot(results2, criteria_list)

kpoints

The kpoints function uses a data frame created using makeInputdata to determine clusters of analogous sites (target cells) and select one site to represent each cluster (subset cells). The number of clusters is designated as k. The optimal value of k can be determined by running the kpoints function for a series of possible k values and evaluating the change in the proportion of the study area that is represented as k increases. The proportion of the study area that is represented by the k points will increase with k, and k can be chosen by identifying where along the range of k the benefit of increasing represented area is outweighed by the cost associated with adding additional sites.

Determine k: the optimal number of points

To illustrate the process of choosing a value for k below, we test a range of k values from 25 to 100 in steps of 25. First, we run the kpoints function for this range of values saved as a vector called klist. Then, we plot the proportion of the study area covered by each k value as a function of k using the plotcoverage function.

Importantly, you will need to select a set of matching criteria to use the kpoints function to choose a value for k, which may need to be somewhat arbitrary if you have not yet run the choose_criteria function to determine the best set of criteria for your project.

Typically, determining matching criteria and k will involve some experimentation by running kpoints over a list of possible values for k with different sets of matching criteria. In the example below, we have set the matching criteria as either 5 or 10% of the range of each of the matching variables. Importantly, before we run kpoints to select our final set of subset cells, we run kpoints for a single value for k and set verify_stop to TRUE (without saving the output to an object) so that we can verify that the stopping criteria are appropriate. More information on verifying stopping criteria can be found below.

# Create vector of matching criteria
criteria <- c(0.7,42,3.3,66,5.4,18.4)
# Verify stopping criteria
kpoints(matchingvars, criteria = criteria, klist = 200,
                    n_starts = 10, min_area = 50, iter = 50,
                    raster_template = targetcells[[1]], verify_stop = TRUE)

Explanation of `kpoints` arguments

Please also refer to the kpoints documentation.

matchingvars A data frame representing all target cells from which a subset of k points will be selected. Ideally, this data frame should be created using the makeInputdata function.
criteria A vector of matching criteria that correspond to each matching variable in the matchingvars data frame.
n_starts The number of random selections of target cells to use for initializing different runs of the kpoints algorithm (as discussed above).
min_area and iter Stopping criteria. Detailed explanation of these arguments provided above.
klist A vector of a range of values of subset cells to select using the kpoints function.
verify_stop This is a boolean that indicates whether or not a figure displaying the proportion of the study area represented for each iteration of each n of n_starts should be displayed. This argument is used to verify that the stopping criteria are appropriate. See above for details.
raster_template This is one of the rasters used to define the target cells and it is used as a template to determine the area represented by each iteration and by the solution for each of the n_starts. This argument is fundamental to the kpoints function because kpoints is designed to optimize based on represented area.
subset_in_target A boolean indicating if the subset cells are taken from within the target cells. When the kpoints function is used to select the subset cells, this will be TRUE (as is the case in this example).

# Create sequence of values for k
klist = seq(25,100, by = 25)
# Run kpoints algorithm for klist
results3 <- kpoints(matchingvars, criteria = criteria, klist = klist,
                    n_starts = 10, min_area = 50, iter = 50,
                    raster_template = targetcells[[1]], verify_stop = FALSE)

Explanation of `kpoints` outputs

The output from the kpoints function includes:

solutions A data frame of the final solution of k subset cells. Rownames correspond to the cellnumbers, and the 'x' and 'y' coordinates of the cells are included. (This is the data frame that is saved to file if the savebest argument is set to TRUE).
solution_areas The area (in $km^2$) represented by the subset cells, where "represented" means the weighted Euclidean distance between target and matched *subset cells is $\le$ 1.
totalarea The area of the entire study area (in $km^2$).
klist The number of subset cells (k) selected by the kpoints function.
iter This records the value used in the kpoints function for finding the solution.
n_starts This records the value used in the kpoints function for finding the solution.
criteria This records the values used in the kpoints function for finding the solution.
min_area This records the value used in the kpoints function for finding the solution.

plotcoverage

The plotcoverage function takes output from the kpoints function that has been run over an ordered vector of possible values for k and plots the proportion of the study area that was represented by each value of k. This plot will be useful for determining an optimal value for k as described at the first part of this section.

# Find optimal number of points (k)
plotcoverage(results3)

Verify stopping criteria

When the kpoints function is called (directly or indirectly, as in the choose_criteria function), it is important to verify that stopping criteria iter and min_area are appropriate. See the Overview of the kpoints function for a more detailed description of these criteria.

You can verify stopping criteria by running the kpoints function with the verify_stop argument set to TRUE. This will produce a plot that allows you to visually verify that stopping criteria are appropriate (see next paragraph).

The stopping criteria can be considered appropriate if none of the n initiations (n_starts, represented by different lines in the plot) of the kpoints algorithm reaches iter (maximum iterations) and the proportion of the study area represented levels out before stopping (without running through excessive iterations for relatively little gain in represented area). The lines in the plot below level off and then stop before reaching iter, indicating that the kpoints algorithm is able to find an optimal solution without running indefinitely. If the lines fail to level off, try increasing iter. If the lines level off many iterations before iter, try increasing min_area.

# Create vector of matching criteria
criteria <- c(0.7,42,3.3,66,5.4,18.4)

# Verify stopping criteria for 200 points
results1 <- kpoints(matchingvars, criteria = criteria, klist = 200,
                    n_starts = 10, min_area = 50, iter = 50,
                    raster_template = targetcells[[1]], verify_stop = TRUE,
                    savebest = FALSE)

Finding final selection of subset cells using `kpoints`

Once you have determined the matching variables, matching criteria, and an optimal and feasible number of points (k), you should run the kpoints function again. This time, you should set n_starts to a relatively large number to have a better chance of finding a solution that will represent as large a proportion of the study area as possible (see the Overview of the kpoints function section above for details). Setting n_starts to 100 should be adequate, and you may choose a higher number if you wish.

Note: this function may take quite a while to run. Running kpoints with n_starts = 100 for this example could take 2-3 hours.

# Find final solution for k = 200
results1 <- kpoints(matchingvars, criteria = criteria, klist = 200,
                    n_starts = 100, min_area = 50, iter = 50,
                    raster_template = targetcells[[1]], verify_stop = FALSE,
                    savebest = FALSE)

References

Hijmans, R. J. (2021). raster: Geographic Data Analysis and Modeling. R package version 3.4-13. https://CRAN.R-project.org/package=raster

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Le Cam, L. M., & Neyman, J. (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. (Vol. 1, No. 14, pp. 281-297). University of California Press.

Palmquist, K. A., Bradford, J. B., Martyn, T. E., Schlaepfer, D. R., & Lauenroth, W. K. (2018a). STEPWAT2: an individual-based model for exploring the impact of climate and disturbance on dryland plant communities. Ecosphere, 9(8). https://doi.org/10.1002/ecs2.2394

Palmquist, K. A, Schlaepfer, D. R., Martyn, T. E., Bradford, J. B., & Lauenroth, W. K. (2018b). DrylandEcology/STEPWAT2: STEPWAT2 Model Description (Palmquist, et al., 2018 Ecosphere). Zenodo. https://doi.org/10.5281/zenodo.1306924

Savje, F. (2019). distances: Tools for Distance Metrics. R package version 0.1.8. https://CRAN.R-project.org/package=distances

DrylandEcology/rMultivariateMatching documentation built on Dec. 17, 2021, 5:30 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DrylandEcology/rMultivariateMatching
Multivariate matching for site selection and interpolation using multi-dimensional datasets

In DrylandEcology/rMultivariateMatching: Multivariate matching for site selection and interpolation using multi-dimensional datasets

Introduction

Examples of the site selection and interpolation methods in print

Overview and definitions

Definitions

Overview of the kpoints function

Identify variables for matching, selecting matching criteria, and determining the optimal number of points

Prepare input data

Choose matching criteria to standardize matching variables

Table displaying `criteria_list`

Explanation of `choose_criteria` arguments

Explanation of `choose_criteria` outputs

kpoints

Determine k: the optimal number of points

Explanation of `kpoints` arguments

Explanation of `kpoints` outputs

plotcoverage

Verify stopping criteria

Finding final selection of subset cells using `kpoints`

References

R Package Documentation

Browse R Packages

We want your feedback!

DrylandEcology/rMultivariateMatching Multivariate matching for site selection and interpolation using multi-dimensional datasets

In DrylandEcology/rMultivariateMatching: Multivariate matching for site selection and interpolation using multi-dimensional datasets

Introduction

Examples of the site selection and interpolation methods in print

Overview and definitions

Definitions

Overview of the kpoints function

Identify variables for matching, selecting matching criteria, and determining the optimal number of points

Prepare input data

Choose matching criteria to standardize matching variables

Table displaying criteria_list

Explanation of choose_criteria arguments

Explanation of choose_criteria outputs

kpoints

Determine k: the optimal number of points

Explanation of kpoints arguments

Explanation of kpoints outputs

plotcoverage

Verify stopping criteria

Finding final selection of subset cells using kpoints

References

R Package Documentation

Browse R Packages

We want your feedback!

DrylandEcology/rMultivariateMatching
Multivariate matching for site selection and interpolation using multi-dimensional datasets

Table displaying `criteria_list`

Explanation of `choose_criteria` arguments

Explanation of `choose_criteria` outputs

Explanation of `kpoints` arguments

Explanation of `kpoints` outputs

Finding final selection of subset cells using `kpoints`