loocv: Leave-one-out cross-validation

Description Usage Arguments Details Value Author(s) Examples

View source: R/loocv.R

Description

Use leave-one-out cross-validation of simulated sites to evaluate matching errors. This function takes each Subset cell and finds its nearest neighbor from among the remaining Subset cells using weighted (standardized) Euclidean distance of the matching variables, then calculates the differences between the simulated value of output variables for each Subset cell and the simulated value of output variables from its nearest neighbor.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
loocv(
  matchingvars,
  secondaryvars,
  output_results = NULL,
  criteria1 = 1,
  criteria2 = 1,
  secondarymatch = TRUE,
  secondaryvars_id = "cellnumbers",
  reference_treatment = "1",
  n_neighbors = 2,
  other_treatments = NULL,
  ...
)

Arguments

matchingvars

a data frame that includes all matching variables for the Subset cells. Rownames should correspond to the unique identifiers for each Subset cell. The first two columns correspond to 'x' and 'y' coordinates of the Subset cells (if none exist, use "NA"). The rest of the columns correspond to the matching variables.

secondaryvars

a data frame that includes the secondary matching variables for the Subset cells. The first column should correspond to the unique identifier for each Subset cell, and the next two columns should correspond to the 'x' and 'y' coordinates of the Subset cells (if non exist, use "NA"). The rest of the columns correspond to the secondary matching variables. Only needed if secondarymatch is TRUE.

output_results

data frame. Simulation output results for all simulated sites (Subset cells). The first column and the rownames should correspond to the unique identifiers for the Subsetcells. If secondarymatch is TRUE, the unique identifiers should be a combination of the unique identifiers used for 'matchingvars' and 'secondaryvars', in the form 'matchingvars_id.treatment'. 'treatment' is a number that designates the treatments, where '1' corresponds to the unique site-specific treatment (i.e., site-specific soils) and subsequent numbers designate other treatments that were applied to all Subset cells (these will correspond to the rownames in other_treatments, see below).

criteria1

single value or vector of length equal to the number of matching variables, where values correspond to the matching criterion for each matching variable in 'matchingvars'. If a single value, this will be used as matching criteria for all variables. Default value is 1, corresponding to using raw data for matching.

criteria2

single value or vector of length equal to the number of secondary variables, where values correspond to the matching criterion for each secondary variable secondaryvars. If a single value, this will be used as matching criteria for all variables. Default value is 1, corresponding to using raw data for matching. Only needed if secondarymatch is TRUE.

secondarymatch

boolean. Indicates whether the function should run secondary matching on the Subset cells. Defaults to TRUE

secondaryvars_id

character. Provides the column name for the unique identifiers in the 'secondaryvars' data frame (should be the first column). Defaults to "cellnumbers". Only needed if secondarymatch is TRUE.

reference_treatment

character. Designates a number to represent the reference treatment. Default value is '1'. Only needed if secondarymatch is TRUE.

n_neighbors

numeric. The number of nearest neighbors to search for among the Subset cells. To achieve leave-one-out cross-validation, this number must be set to 2. The nearest neighbor of each Subset cells is itself, so the second nearest neighbor will correspond the closest non-self neighbor. Default value is 2.

other_treatments

a data frame that gives the secondary variables for the set treatments. The rownames should correspond to unique identifiers for each treatment (e.g., 2-total number of treatments if the reference_treatment is '1'). Only needed if secondarymatch is TRUE.

...

additional parameters to be passed to multivarmatch and/or secondaryMatching.

Details

This function can be used for matching achieved with the multivarmatch function or matching that uses two-step matching, first with multivarmatch, followed by secondaryMatching. In case of the latter, loocv assumes a very specific experimental design. For each Subsetcell cell, there are five different soil types. There is one site-specific soil type that is unique to each Subset cell and four set soil soil types that were simulated for all Subset cells. In this case, the first step of matching uses only climate variables and the second step of matching identifies the best soil type from among the 5 available for the Subset cell with the best match based on the climate variables.

Value

Data frame of Target cells with coordinates ('x','y'), cellnumber of Target cell ('target_cell'), unique id of matched Subset cell ('subset_cell') and matching quality ('matching_quality'). (If secondarymatch is TRUE, it will also include the unique id of the Subset cell matched with secondary matching criteria ('subset_cell_secondary'), and matching quality of this secondary match ('matching_quality_secondary')). Additional columns correspond to the difference between the simulated values of output variables for each Subset cell and the simulated values of output variables from its nearest neighbor. These values can be squared and averaged to get the average squared cross-validated matching error for each output variable.

Author(s)

Rachel R. Renne

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
########################
# First, an example where secondarymatch = FALSE
# Load targetcells data for Target Cells
data(targetcells)

# Create data frame of potential matching variables for Target Cells
allvars <- makeInputdata(targetcells)

# Restrict data to matching variables of interest
matchingvars <- allvars[,c("cellnumbers","x","y","bioclim_01","bioclim_04",
                        "bioclim_09","bioclim_12","bioclim_15","bioclim_18")]

# Create vector of matching criteria
criteria <- c(0.7,42,3.3,66,5.4,18.4)

# Create raster template
raster_template = targetcells[[1]]

# Find solution for k = 200
# Note: n_starts should be >= 10, it is 1 here to reduce run time.
results1 <- kpoints(matchingvars,criteria = criteria,klist = 200,
                    n_starts = 1,min_area = 50,iter = 50,
                    raster_template = raster_template)

# Get points from solution to kpoints algorithm
subsetcells <- results1$solutions[[1]]

# Create a mock dataset of output results
output_results <- allvars[rownames(subsetcells),c("cellnumbers","bioclim_02",
                                                  "bioclim_03","bioclim_16",
                                                  "bioclim_17")]

# Create dataset of matchingvars for subsetcells
subset_matchingvars <- allvars[rownames(subsetcells),-1]
subset_matchingvars <- subset_matchingvars[,c("x","y","bioclim_01","bioclim_04",
                                              "bioclim_09","bioclim_12",
                                              "bioclim_15","bioclim_18" )]

# Run leave-one-out cross validation of mock output results
loocv_results <- loocv(matchingvars = subset_matchingvars,
                       output_results = output_results,
                       criteria1 = criteria,
                       secondarymatch = FALSE, n_neighbors = 2)


########################
# Next, an example where secondarymatch = TRUE

# Remove previous subsetcells
rm(subsetcells)

# Get subsetcells
data(subsetcells)

# Pull out only matching variables and remove duplicates
matchingvars <- subsetcells[,c("site_id","X_WGS84","Y_WGS84","bioclim_01",
                            "bioclim_04","bioclim_09","bioclim_12",
                            "bioclim_15","bioclim_18")]

# Fix names
names(matchingvars) <- c("cellnumbers","x","y",names(matchingvars)[4:9])

# Remove duplicates (we will first match on climate only)
matchingvars <- matchingvars[!duplicated(matchingvars$cellnumbers),]
rownames(matchingvars) <- matchingvars$cellnumbers

# Remove cellnumbers column
matchingvars <- matchingvars[,-1]

# Pull out secondary vars and keep both identifiers
secondaryvars <- subsetcells[,c("site_id","X_WGS84","Y_WGS84","sand","clay")]

# Fix names
names(secondaryvars) <- c("cellnumbers","x","y",names(secondaryvars)[4:5])

# Convert sand and clay to percentage from fraction
secondaryvars$sand <- secondaryvars$sand*100
secondaryvars$clay <- secondaryvars$clay*100

# Remove duplicates
secondaryvars <- secondaryvars[!duplicated(secondaryvars$cellnumbers),]

# Set rownames as cellnumbers
rownames(secondaryvars) <- secondaryvars$cellnumbers

# Bring in "other" treatments
data(setsoiltypes)
other_treatments = setsoiltypes

# Create raster template
raster_template = targetcells[[1]]

# Set original criteria (from first-step matching)
criteria1 = c(0.7,42,3.3,66,5.4,18.4)

# Calculate criteria for secondary matching
criteria2 = c((max(subsetcells$sand,na.rm = TRUE)-
             min(subsetcells$sand,na.rm = TRUE))/10*100,
             (max(subsetcells$clay,na.rm = TRUE)-
             min(subsetcells$clay,na.rm = TRUE))/10*100)

# Bring in simulation output results of interest
output_results = subsetcells[,c("site_ids","Dryprop","CwetWinter",
                             "CdrySummer","Cwet8","Dryall","Dryany")]
rownames(output_results) <- output_results$site_ids

# Run leave-one-out cross validation of output results
loocv_results <- loocv(matchingvars = matchingvars,
                       secondaryvars = secondaryvars,
                       output_results = output_results,
                       criteria1 = criteria1, criteria2 = criteria2,
                       secondarymatch = TRUE,
                       secondaryvars_id = "cellnumbers",
                       reference_treatment = "1", n_neighbors = 2,
                       other_treatments = other_treatments)

DrylandEcology/rMultivariateMatching documentation built on Dec. 17, 2021, 5:30 p.m.