knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Introduction

This vignette describes a workflow that covers matching and interpolation for a case where subset cells were not determined using the kpoints function. This workflow is appropriate when the goals is to interpolate simulation results from subset cells to create high-resolution, continuous maps of the study area. Here, we illustrate the workflow using an example for the state of Wyoming, where subset cells were generated using a regular 10 x 10 km grid (see Bradford et al. (2019) for details). The code chunks in this document are meant to be run sequentially. Data relevant to this example are contained within the rMultivariateMatching package. The site-selection and interpolation methods are described in:

Renne, R. R., Schlaepfer, D. R., Palmquist, K. A., Lauenroth, W. K., & Bradford, J. B. In preparation. Estimating complex ecological variables at high spatial resolution using cost-effective multivariate matching algorithms.

Examples of the interpolation methods in print

In the interpolation example in Renne et al., our goal was to generate high-resolution, 30-arcsecond maps of SOILWAT2 simulation output with low spatial resolution (10 x 10 km). SOILWAT2 is a process-based ecohydrological simulation model (Schlaepfer et al., 2012; Schlaepfer & Andrews, 2018; Schlaepfer & Murphy, 2018). The two-step matching method described in Renne et al. was devised to interpolate simulation output for a series of studies investigating the potential impacts of climate change on ecohydological variables in western North American drylands. Interpolated datasets from these studies will soon be available from the US Geological Survey Sciencebase:

Citations to follow when available.

Overview and definitions

In this vignette, we assume that we have already determined appropriate matching variables and matching criteria and run simulations for the subset cells. Note that this workflow is distinct from a workflow where the simulated sites (subset cells) are selected using kpoints. A workflow for that scenario can be found in the Matching and interpolation with kpoints vignette. Furthermore, although the example used here is derived from the interpolation example in Renne et al. that used a two-step matching process to fit the experimental design described in Bradford et al., (2019), we limit matching in this example to a single step. A workflow that illustrates two-step matching and interpolation can be found in the Two step matching and interpolation without kpoints vignette.

The functions in the rMultivariateMatching package measure similarity between sites as the Euclidean distance of standardized (i.e., weighted) matching variables. Each variable is standardized by dividing it by the maximum amount of difference that would still allow two sites to be considered analogous for that variable. These maximum differences are called the matching criteria. For example, if two sites could be reasonably considered analogous if their mean annual precipitations (MAP) were within 50 mm of each other, the matching criteria for MAP could be set to 50 mm. Distances are calculated within the rMultivariateMatching function using the distances package (Savje 2019).

Definitions

The figure below illustrates an overview of the site selection and interpolation methods and key terminology. In this vignette, we focus on interpolation. First, a set of matching variables are selected and standardized using user-defined matching criteria. Next, a set of subset cells are defined (not using the kpoints function) and all target cells are matched to one of the subset cells (matches are denoted by color in the central panel). Finally, point results (e.g., simulation output) for the subset cells are interpolated across the study area using multivariate matching to produce continuous, high-resolution maps of output, such as plant functional type biomass or ecohydrological variables.

# Load the rMultivariateMatching package
library(rMultivariateMatching)

Prepare inputs

The first step is to use the makeInputdata function to convert a rasterStack (see the raster package (Hijmans 2021) for details) of matching (and other) variables into the correct format to be used in subsequent functions.

# Load targetcells data for Target Cells
data(targetcells)
# Create data frame of potential matching variables for Target Cells
allvars <- makeInputdata(targetcells)

Determine matching criteria

The next step is to determine matching criteria, that is, the maximum allowable difference for each matching variable between target and matched subset cells such that the cells can be considered environmentally analogous. You may have some idea of what matching criteria should be based on your knowledge of the system and processes in question, but it can be useful to evaluate how the proportion of the study area that is represented by the subset cells changes with different matching criteria using the choose_criteria function.

# Restrict data to matching variables of interest
matchingvars <- allvars[,c("cellnumbers","x","y","bioclim_01","bioclim_04",
                           "bioclim_09","bioclim_12","bioclim_15","bioclim_18")]

# Create list of matching criteria to choose:
# Look at 2.5%, 5%, & 10% of range and standard deviation for each variable
range2.5pct <- apply(matchingvars[,4:ncol(matchingvars)],2,
                     function(x){(max(x)-min(x))*0.025})
range5pct <- apply(matchingvars[,4:ncol(matchingvars)],2,
                   function(x){(max(x)-min(x))*0.05})
range10pct <- apply(matchingvars[,4:ncol(matchingvars)],2,
                    function(x){(max(x)-min(x))*0.1})
stddev <- apply(matchingvars[,4:ncol(matchingvars)],2,sd)

# Create criteria_list
criteria_list <- list(range2.5pct, range5pct, range10pct, stddev)

Table displaying criteria_list

| bioclim_01 | bioclim_04 | bioclim_09 | bioclim_12 | bioclim_15 | bioclim_18 | |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| | 0.36 | 10.62 | 0.82 | 32.97 | 1.36 | 4.61 | | 0.73 | 21.23 | 1.64 | 65.93 | 2.72 | 9.21 | | 1.45 | 42.47 | 3.29 | 131.86 | 5.44 | 18.42 | | 2.49 | 60.37 | 7.41 | 170.74 | 13.53 | 34.27 |

The subsetcells dataset that is included in the rMultivariateMatching package is a subset of sites that were simulated in Bradford et al. (2019). The experimental design of that study selected sites using a 10 x 10 km grid across western North American drylands. For each site, they simulated ecohydrological conditions for five different soil types. Thus, we devised a two-step matching method (see the Two step matching and interpolation without kpoints vignette) to fit this experimental design. However, for the purposes of the current example in this vignette, we wanted to demonstrate matching and interpolation based on simple, one-step matching. The following code will simplify the subsetcells dataset to include only sites that were simulated with a silt loam soil with 30% sand and 18% clay (this corresponds to "site_ids" that end in ".2").

# Bring in Subset cell data
data(subsetcells)

# Remove duplicates (representing cells with same climate but different soils--
# we want to match on climate only)
subsetcells <- subsetcells[seq(2,10630, by = 5),]

# Pull out matching variables only, with site_id that identifies unique climate
subsetcells <- subsetcells[,c("site_id","X_WGS84","Y_WGS84","bioclim_01",
                              "bioclim_04","bioclim_09","bioclim_12",
                              "bioclim_15","bioclim_18")]

# Ensure that site_id will be values unique to subsetcells
subsetcells$site_id <- paste0("00",subsetcells$site_id)

Explanation of choose_criteria arguments

Please also refer to the choose_criteria documentation.

# Run choose_criteria function to evaluate different matching criteria
coverage <- choose_criteria(matchingvars = matchingvars, 
                            criteria_list = criteria_list,
                            plot_coverage = TRUE,
                            raster_template = targetcells[[1]],
                            subset_in_target = FALSE,
                            subsetcells = subsetcells,
                            matchingvars_id = "cellnumbers",
                            subsetcells_id = "site_id",
                            matching_distance = 1)

Explanation of choose_criteria outputs

The output from the choose_criteria function is a named list where the first item ('totalarea') reports the total area in $km^2$ of the target cells and the second item ('solution_areas') reports the area represented (i.e., Euclidean distance of weighted matching variables is $\le 1$ between target and matched subset cells) for each set of matching criteria.

Perform matching

Use the multivarmatch function to perform multivariate matching. This function will match every target cell to one of the subset cells.

Explanation of multivarmatch arguments

Please also refer to the multivarmatch documentation.

Note: the function below does not specify several additional arguments that are left as default values. These include:

# Create vector of matching criteria
criteria <- c(0.7,42,3.3,66,5.4,18.4)

# Find matches and calculate matching quality
quals <- multivarmatch(matchingvars, subsetcells, criteria = criteria,
                       matchingvars_id = "cellnumbers",
                       subsetcells_id = "site_id",
                       raster_template = targetcells[[1]],
                       subset_in_target = FALSE, addpoints = FALSE)

Explanation of multivarmatch outputs

The output from the multivarmatch function is a data frame of target cells with coordinates ('x','y'), cellnumber of each target cell ('target_cell'), unique id of the matched subset cell ('subset_cell'), and matching quality ('matching_quality'). The function will save a raster of matching quality if saveraster is TRUE and plot a map of matching quality if plotraster is TRUE.

Evaluate matching

Now that we have completed matching, we should evaluate the success of matching. In Renne et al., we suggest four methods of evaluating matching.

  1. Calculate matching quality (weighted Euclidean distance between target and matched subset cells)

  2. Calculate the standard deviation of differences between target and matched subset cells for a set of variables relevant to the project

  3. Calculate geographic distances between matched cells

  4. Leave-one-out cross-validation

1. Calculate matching quality

In addition to identifying the matching subset cell for each target cell, the multivarmatch function also calculates the weighted Euclidean distance between matched cells, which can be interpreted as a continuous variable measuring matching quality. Distances less than or equal to one indicate high-quality matching (matching criteria have been met for all variables). Distances greater than one indicate that the difference between these target cells and their matched subset cell exceeds the matching criterion for one or more variables.

Matching quality can be used to exclude target cells for which there is insufficient matching, i.e., no simulated site is analogous, and to determine the spatial extent of the represented area. In Renne et al., we used a cut-off of 1.5 as a threshold between high/moderate quality and low quality matching, and excluded sites with matching quality >1.5 from the final interpolated datasets.

2. Calculate standard deviation of differences between target and matched subset cells

Matching can also be evaluated by using the evaluateMatching function to calculate the standard deviation of differences between target and subset cells for a set of variables relevant to the project. This provides an estimate of how much variability there is among the target cells matched to each subset cell and is particularly informative if the variables were not used as part of the matching process.

If the rownames of allvars and matches (see below for details) do not match, the evaluateMatching function will fail. Importantly, the function will remove cases with missing values from the allvars and matches data frames and if missing values occur for different cases, the function will fail because the rownames of the two data frames will no longer match.

Here we use 19 bioclim variables (six of which were used as matching variables) and two soil variables to evaluate matching in the current example.

Explanation of evaluateMatching arguments

Please also refer to the evaluateMatching documentation.

# Bring the full subsetcells data in again to get all variables
data(subsetcells)

# Remove duplicates (representing cells with same climate but different soils--
# we want to match on climate only)
subsetcells <- subsetcells[seq(2,10630, by = 5),]

# Get all variables for Subset cells now:
subsetcells <- subsetcells[,c("site_id","X_WGS84","Y_WGS84",
                              names(subsetcells)[8:28])]

# Convert sand and clay fraction to percentage in subsetcells
subsetcells$sand <- subsetcells$sand*100
subsetcells$clay <- subsetcells$clay*100 

# Run evaluateMatching
sddiffs <- evaluateMatching(allvars = allvars, 
                            subsetcells = subsetcells,
                            matches = quals,
                            quality_name = "matching_quality",
                            matchingvars_id = "cellnumbers",
                            subsetcells_id = "site_id",
                            subset_in_target = FALSE, matching_distance = 1.5,
                            plot_diffs = TRUE, secondarymatch = FALSE)

Explanation of evaluateMatching outputs

A data frame of the standard deviation of differences between target and matched subset cells for all variables supplied in allvars data frame. The first row corresponds to the standard deviation of differences between target and subset cells for all cells and the second row corresponds to the standard deviation of differences between target and subset cells for only those target cells with matching quality $\le$ matching_distance. Units are the same as the units for each variable in allvars.

3. Calculate geographic distances between matched cells

Importantly, the quality of matching will also depend on the extent to which spatio-temporal patterns are maintained between matched cells. Our methods do not explicitly incorporate geographic proximity. Thus, target cells may be matched to geographically distant subset cells and simulation output from subset cells that are geographically distant may be assigned to adjacent target cells.

Although these cells may be well matched using the matching variables, they may exhibit daily patterns (e.g., precipitation, soil moisture) that are much less similar than would be expected for adjacent cells. To estimate the extent to which spatio-temporal patterns may have been maintained during matching, two measures of distance can be calculated using the evaluateGeoDist function:

  1. The distance between target cells and their matched subset cells

  2. The average distance between the subset cell that is matched to a given target cell and the subset cells that are matched to the eight adjacent neighbors of that target cell.

Explanation of evaluateGeoDist arguments

Please also refer to the evaluateGeoDist documentation.

# Look at geographic distances
geodist <- evaluateGeoDist(matches = quals, subsetcells = subsetcells,
                          subsetcells_id = 'site_id', 
                          subset_in_target = FALSE, 
                          quality_name = "matching_quality",
                          exclude_poor_matches = TRUE, matching_distance = 1.5,
                          longlat = TRUE, raster_template = targetcells[[1]],
                          map_distances = TRUE, map_neighbor_distances = TRUE,
                          which_distance = "both",saverasters = FALSE)

Distance (km) between target and matched subset cells.

Average distance (km) between the subset cell that is matched to a given target cell and the subset cells that are matched to the eight adjacent neighbors of that target cell.

Explanation of evaluateGeoDist outputs

A data frame with the distance between target and matched subset cells ('target_to_subset_distance') and the average distance between the subset cell matched to each target cell and the subset cells matched to the eight adjacent target cells ('avgdistance_to_neighbors'). The first column and the rownames correspond to the unique identifiers for the target cells, and columns 2 and 3 correspond to the 'x' and 'y' coordinates of the target cells.

4. Leave-one-out cross-validation

Once simulations are complete for the subset cells, interpolation errors can be estimated using leave-one-out cross validation (LOOCV) with the loocv function. In LOOCV, each subset cell is matched to its nearest neighbor from among the remaining subset cells (using Euclidean distance of the weighted matching variables). Then, an estimate of the interpolation error can be calculated using the following equation:

$$CVerror=\sqrt{\frac{1}k∑_{i=1}^{k}(y_i-\hat y_i)^2}$$

Where $k$ is the number of subset cells, $y$ is the value of the simulated output variable, and $\hat y$ is the value of the matched output variable.

For our example here, we have simulation output generated by simulations used in Bradford et al. (2019).

# Bring in subsetcells dataset again to get output variables
data(subsetcells)

# Remove duplicates (representing cells with same climate but different soils--
# we want to match on climate only)
subsetcells <- subsetcells[seq(2,10630, by = 5),]

# Create a mock dataset of output results
output_results <- subsetcells[,c("site_id","Dryprop","CwetWinter","CdrySummer",
                                 "Cwet8","Dryall","Dryany")]
rownames(output_results) <- output_results$site_id

Explanation of loocv arguments

Please also refer to the loocv documentation.

# Bring in Subset cell data
data(subsetcells)

# Remove duplicates (representing cells with same climate but different soils--
# we want to match on climate only)
subsetcells <- subsetcells[seq(2,10630, by = 5),]

# Pull out matching variables only, with site_id that identifies unique climate
subsetcells_matchingvars <- subsetcells[,c("site_id","X_WGS84","Y_WGS84",
                                           "bioclim_01","bioclim_04","bioclim_09",
                                           "bioclim_12","bioclim_15","bioclim_18")]
names(subsetcells_matchingvars)[2:3] <- c("x","y")

# Use site_id as rownames and remove
rownames(subsetcells_matchingvars) <- subsetcells_matchingvars$site_id
subsetcells_matchingvars <- subsetcells_matchingvars[,-1]

# Run leave-one-out cross validation of the output result
loocv_results <- loocv(matchingvars = subsetcells_matchingvars,
                       output_results = output_results,
                       criteria1 = criteria, 
                       secondarymatch = FALSE, n_neighbors = 2)

Explanation of loocv outputs

The output from the loocv function is a data frame of cross-validated matching, with coordinates ('x','y') and cellnumber of each subset cell ('target_cell'), unique id of matched non-self subset cell ('subset_cell') and matching quality ('matching_quality'). Subsequent columns give the difference between simulated output values and the matched output values for each cell.

The loocv function output can be used to visualize the distribution of the matching errors.

# Visualize distribution of matching errors
par(mfrow = c(3,2), mar = c(2,2,2,1), mgp = c(1,0.2,0), tcl = 0.2)
for (i in 6:11){
  hist(loocv_results[,i], breaks = 30,
       main = paste0("Matching errors: ", 
                     gsub("_diff","",names(loocv_results)[i])),
       xlab = "Simulated - Matched")
  abline(h = 0)
  legend("topright", 
         legend = c(paste0("cv-error = ", 
                           round(sqrt(mean(loocv_results[,i]^2)),2)),
                  paste0("mean = ", round(mean(loocv_results[,i]),2))),
         cex = 0.9, bty= "n", y.intersp = 0.7, x.intersp = 0.7)

  box()
}

Calculate matching errors

Finally, matching errors for each output variable can be estimated from the output of loocv using the cverrors function.

Explanation of cverrors arguments

# Estimate matching errors for each output variable
est_errors <- cverrors(loocv_output = loocv_results, 
                       first_output_column = 6)
   Dryprop  CwetWinter  CdrySummer       Cwet8      Dryall      Dryany 
0.06451686 14.26300399  8.56043378  6.01712207 32.49556186 32.21232009

Interpolate output variables

Finally, the iterpolatePoints function can be used to interpolate results from simulated subset cells to all target cells across the study area by assigning simulation results from the subset cells to their matched target cells.

For our example here, we will again use our set of output results from Bradford et al. (2019).

Explanation of interpolatePoints arguments

Please also refer to the interpolatePoints documentation.

# Interpolate simulation output to rasters
interpolatePoints(matches = quals, output_results = output_results,
                  exclude_poor_matches = TRUE,
                  subset_cell_names = "subset_cell",
                  quality_name = "matching_quality",
                  matching_distance = 1.5, raster_template = targetcells[[1]],
                  plotraster = TRUE, overwrite = FALSE)

Interpolated output of DryPROP.

Interpolated output of CwetWinter.

Interpolated output of CdrySummer.

Interpolated output of Cwet8.

Interpolated output of Dryall.

Interpolated output of Dryany.

References

Bradford, J. B., Schlaepfer, D. R., Lauenroth, W. K., Palmquist, K. A., Chambers, J. C., Maestas, J. D., & Campbell, S. B. (2019). Climate-driven shifts in soil temperature and moisture regimes suggest opportunities to enhance assessments of dryland resilience and resistance. Front. Ecol. Evol. 7:358. https://doi.org/10.3389/fevo.2019.00358

Hijmans, R. J. (2021). raster: Geographic Data Analysis and Modeling. R package version 3.4-13. https://CRAN.R-project.org/package=raster

Schlaepfer, D. R., Lauenroth, W. K., & Bradford, J. B. (2012). Ecohydrological niche of sagebrush ecosystems. Ecohydrology 5: 453-466. https://doi.org/10.1002/eco.238

Schlaepfer, D. R., & Andrews, C. A. (2018). rSFSW2: Simulation Framework for SOILWAT2. R package version 3.0.0.

Schlaepfer, D. R., & Murphy, R. (2018). rSOILWAT2: An Ecohydrological Ecosystem-Scale Water Balance Simulation Model. R package version 2.3.2.

Savje, F. (2019). distances: Tools for Distance Metrics. R package version 0.1.8. https://CRAN.R-project.org/package=distances



DrylandEcology/rMultivariateMatching documentation built on Dec. 17, 2021, 5:30 p.m.