sperrorest: Perform spatial error estimation and variable importance...

Description

sperrorest is a flexible interface for multiple types of parallelized spatial and non-spatial cross-validation and bootstrap error estimation and parallelized permutation-based assessment of spatial variable importance.

Usage

sperrorest(formula, data, coords = c("x", "y"), model_fun,
  model_args = list(), pred_fun = NULL, pred_args = list(),
  smp_fun = partition_cv, smp_args = list(), train_fun = NULL,
  train_param = NULL, test_fun = NULL, test_param = NULL,
  err_fun = err_default, imp_variables = NULL, imp_permutations = 1000,
  importance = !is.null(imp_variables), distance = FALSE,
  par_args = list(par_mode = "foreach", par_units = NULL, par_option = NULL),
  do_gc = 1, progress = "all", out_progress = "", benchmark = FALSE,
  ...)

Arguments

formula

A formula specifying the variables used by the model. Only simple formulas without interactions or nonlinear terms should be used, e.g. y~x1+x2+x3 but not y~x1*x2+log(x3). Formulas with interaction or nonlinear terms may work for error estimation, but not for variable importance assessment, and should be used with caution.
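
One way to stay within the simple-formula restriction is to store a transformed predictor as its own column of data before calling sperrorest. The sketch below uses a made-up data frame d with hypothetical columns y, x1, x2 and x3; none of these names come from the package.

```r
# Toy data, made up for illustration only.
set.seed(1)
d <- data.frame(y  = rnorm(20),
                x1 = rnorm(20),
                x2 = rnorm(20),
                x3 = runif(20, min = 1, max = 10))

# Instead of y ~ x1 + x2 + log(x3), store the transformed
# predictor as a separate column ...
d$log_x3 <- log(d$x3)

# ... and refer to it by name in a purely additive formula.
fo <- y ~ x1 + x2 + log_x3

# Every variable in the formula is now a plain column of d.
all(all.vars(fo) %in% names(d))
```

The log.carea term in the formula used in the Examples appears to follow the same pattern.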

data

a data.frame with predictor and response variables. Training and test samples will be drawn from this data set by train_fun and test_fun, respectively.

coords

vector of length 2 defining the variables in data that contain the x and y coordinates of sample locations.

model_fun

Function that fits a predictive model, such as glm or rpart. The function must accept at least two arguments, the first one being a formula and the second a data.frame with the learning sample.
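
As a minimal sketch of this interface, a fitting function can be wrapped so that the formula comes first and the learning sample second. The wrapper name my_glm_binomial and the toy data are hypothetical and not part of the package.

```r
# Hypothetical wrapper: formula first, learning sample second.
my_glm_binomial <- function(formula, data, ...) {
  glm(formula, data = data, family = binomial(), ...)
}

# Toy data, made up for illustration only.
set.seed(42)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- as.integer(toy$x1 + rnorm(50) > 0)

# The wrapper can be called positionally, as a model_fun would be.
fit <- my_glm_binomial(y ~ x1 + x2, toy)
class(fit)
```

Fixed settings such as family could instead be supplied through model_args, which keeps model_fun = glm usable without a wrapper.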

model_args

Arguments to be passed to model_fun (in addition to the formula and data arguments, which are provided by sperrorest).

pred_fun

Prediction function for a fitted model object created by model_fun. Must accept at least two arguments: the fitted object and a data.frame newdata with data on which to predict the outcome.
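
A prediction function of this shape might look as follows. This is only a sketch: the name mypred_glm and the toy data are hypothetical, and the function simply wraps predict for a glm fit.

```r
# Hypothetical prediction function: fitted object first, newdata second.
mypred_glm <- function(object, newdata, ...) {
  # type = "response" returns probabilities instead of the linear predictor.
  predict(object, newdata = newdata, type = "response", ...)
}

# Toy data, made up for illustration only.
set.seed(123)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y <- as.integer(toy$x1 + rnorm(60) > 0)

# Fit on the first 40 rows, predict on the remaining 20.
fit <- glm(y ~ x1 + x2, data = toy[1:40, ], family = binomial())
p <- mypred_glm(fit, newdata = toy[41:60, ])
range(p)
```

The same effect can be obtained with pred_fun = predict and pred_args = list(type = "response"), as in the logistic regression example in the Examples section.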

pred_args

(optional) Arguments to pred_fun (in addition to the fitted model object and the newdata argument, which are provided by sperrorest).

smp_fun

A function for sampling training and test sets from data, e.g. partition_kmeans for spatial cross-validation using spatial k-means clustering.

smp_args

(optional) Arguments to be passed to smp_fun.

train_fun

(optional) A function for resampling or subsampling the training sample, for example to achieve uniform sample sizes on all training sets or to maintain a certain ratio of positives and negatives in training sets, e.g. resample_uniform or resample_strat_uniform.

train_param

(optional) Arguments to be passed to train_fun.

test_fun

(optional) Like train_fun but for the test set.

test_param

(optional) Arguments to be passed to test_fun.

err_fun

A function that calculates selected error measures from the known responses in data and the model predictions delivered by pred_fun. E.g. err_default (the default).
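
A custom error function for a numeric response could be sketched as below. It assumes that err_fun is called with the observed responses and the predictions as its first two arguments and returns a list of named measures; this calling convention is an assumption to check against err_default. The name err_rmse_mae and the input vectors are hypothetical.

```r
# Hypothetical error function for a numeric response.
# Assumption: called as err_fun(obs, pred) and expected to
# return a list of named error measures.
err_rmse_mae <- function(obs, pred) {
  res <- obs - pred
  list(
    bias  = mean(res),          # mean error
    rmse  = sqrt(mean(res^2)),  # root mean squared error
    mae   = mean(abs(res)),     # mean absolute error
    count = length(obs)         # number of observations
  )
}

# Made-up observed and predicted values.
e <- err_rmse_mae(obs = c(1, 2, 3, 4), pred = c(1, 2, 3, 6))
str(e)
```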

imp_variables

(optional; used if importance = TRUE). Variables for which permutation-based variable importance assessment is performed. If importance = TRUE and imp_variables is NULL, all variables in formula will be used.

imp_permutations

(optional; used if importance = TRUE). Number of permutations used for variable importance assessment.

importance

logical (default: !is.null(imp_variables), i.e. FALSE unless imp_variables is specified): perform permutation-based variable importance assessment?

distance

logical (default: FALSE): if TRUE, calculate mean nearest-neighbour distances from test samples to training samples using add.distance.represampling.

par_args

list of parallelization parameters:

  • par_mode: the parallelization mode. See details.

  • par_units: the number of parallel processing units.

  • par_option: optional future settings for par_mode = "future" or par_mode = "foreach".

do_gc

numeric (default: 1): defines frequency of memory garbage collection by calling gc; if < 1, no garbage collection; if >= 1, run a gc after each repetition; if >= 2, after each fold.

progress

character (default: "all"): Whether to show progress information (if possible). The default shows repetition, fold and (if enabled) variable importance progress for par_mode = "foreach" or par_mode = "sequential". Set to "rep" for repetition information only or FALSE for no progress information.

out_progress

only used if par_mode = "foreach": Write progress output to a file instead of console output. The default ('') results in console output on Unix systems and file output ('sperrorest.progress.txt') in the current working directory on Windows systems. No console output is possible on Windows systems.

benchmark

(optional) logical (default: FALSE): if TRUE, perform benchmarking and return sperrorestbenchmark object.

...

Further options passed to makeCluster for par_mode = "foreach".

Details

By default sperrorest runs in parallel on all cores using foreach with the future backend. If this is not desired, specify par_units in par_args or set par_mode = "sequential".

Other available parallelization modes are par_mode = "apply" (calls pbmclapply on Unix, parApply on Windows) and par_mode = "future" (calls future_lapply). For par_mode = "future" and par_mode = "foreach", par_option (which defaults to multiprocess and cluster, respectively) can be specified. See plan for further details.
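
The settings described above are passed as a plain list. The sketch below only builds two such lists, using the element names from the Usage section; the object names par_seq and par_two and the choice of two units are illustrative, and the commented call assumes the argument set documented on this page.

```r
# Sequential execution, no parallel backend.
par_seq <- list(par_mode = "sequential")

# "foreach" mode restricted to two parallel units (illustrative value).
par_two <- list(par_mode = "foreach", par_units = 2, par_option = NULL)

str(par_seq)
str(par_two)

# Either list would then be supplied through par_args, e.g.
# (not run here, requires the sperrorest package and the ecuador data):
# out <- sperrorest(data = ecuador, formula = fo,
#                   model_fun = glm,
#                   model_args = list(family = "binomial"),
#                   pred_fun = predict,
#                   pred_args = list(type = "response"),
#                   smp_fun = partition_cv,
#                   smp_args = list(repetition = 1:2, nfold = 4),
#                   par_args = par_seq)
```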

Value

A list (object of class sperrorest) with (up to) six components:

error_rep

a sperrorestreperror object containing predictive performances at the repetition level

error_fold

a sperroresterror object containing predictive performances at the fold level

represampling

a represampling() object

importance

a sperrorestimportance object containing permutation-based variable importances at the fold level

benchmark

a sperrorestbenchmark object containing information on the system the code is running on, starting and finishing times, number of available CPU cores, parallelization mode, number of parallel units, and runtime performance

package_version

a sperrorestpackageversion object containing information about the sperrorest package version

Note

A custom prediction function passed to pred_fun that relies on several user-defined helper functions must define those helpers inside its own function body, so that everything it needs is contained in one function.

References

Brenning, A. 2012. Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: the R package 'sperrorest'. 2012 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 23-27 July 2012, p. 5372-5375.

Brenning, A. 2005. Spatial prediction models for landslide hazards: review, comparison and evaluation. Natural Hazards and Earth System Sciences, 5(6): 853-862.

Brenning, A., S. Long & P. Fieguth. Forthcoming. Detecting rock glacier flow structures using Gabor filters and IKONOS imagery. Submitted to Remote Sensing of Environment.

Russ, G. & A. Brenning. 2010a. Data mining in precision agriculture: Management of spatial information. In 13th International Conference on Information Processing and Management of Uncertainty, IPMU 2010; Dortmund; 28 June - 2 July 2010. Lecture Notes in Computer Science, 6178 LNAI: 350-359.

Russ, G. & A. Brenning. 2010b. Spatial variable importance assessment for yield prediction in Precision Agriculture. In Advances in Intelligent Data Analysis IX, Proceedings, 9th International Symposium, IDA 2010, Tucson, AZ, USA, 19-21 May 2010. Lecture Notes in Computer Science, 6065 LNCS: 184-195.

Examples

## Not run: 

##------------------------------------------------------------
## Classification tree example using non-spatial partitioning
## setup and default parallel mode ("foreach")
##------------------------------------------------------------

data(ecuador) # Muenchow et al. (2012), see ?ecuador
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope

library(rpart)
mypred_part <- function(object, newdata) predict(object, newdata)[, 2]
ctrl <- rpart.control(cp = 0.005) # show the effects of overfitting
fit <- rpart(fo, data = ecuador, control = ctrl)

### Non-spatial 5-repeated 10-fold cross-validation:
par_nsp_res <- sperrorest(data = ecuador, formula = fo,
                          model_fun = rpart,
                          model_args = list(control = ctrl),
                          pred_fun = mypred_part,
                          progress = TRUE,
                          smp_fun = partition_cv,
                          smp_args = list(repetition = 1:5, nfold = 10))
summary(par_nsp_res$error_rep)
summary(par_nsp_res$error_fold)
summary(par_nsp_res$represampling)
# plot(par_nsp_res$represampling, ecuador)

### Spatial 5-repeated 10-fold cross-validation:
par_sp_res <- sperrorest(data = ecuador, formula = fo,
                         model_fun = rpart,
                         model_args = list(control = ctrl),
                         pred_fun = mypred_part,
                         progress = TRUE,
                         smp_fun = partition_kmeans,
                         smp_args = list(repetition = 1:5, nfold = 10))
summary(par_sp_res$error_rep)
summary(par_sp_res$error_fold)
summary(par_sp_res$represampling)
# plot(par_sp_res$represampling, ecuador)

smry <- data.frame(
    nonspat_training = unlist(summary(par_nsp_res$error_rep,
                                      level = 1)$train_auroc),
    nonspat_test     = unlist(summary(par_nsp_res$error_rep,
                                      level = 1)$test_auroc),
    spatial_training = unlist(summary(par_sp_res$error_rep,
                                      level = 1)$train_auroc),
    spatial_test     = unlist(summary(par_sp_res$error_rep,
                                     level = 1)$test_auroc))
boxplot(smry, col = c('red','red','red','green'),
    main = 'Training vs. test, nonspatial vs. spatial',
    ylab = 'Area under the ROC curve')

##------------------------------------------------------------
## Logistic regression example (glm) using partition_kmeans
## and computation of permutation based variable importance
##------------------------------------------------------------

data(ecuador)
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope

out <- sperrorest(data = ecuador, formula = fo,
                  model_fun = glm,
                  model_args = list(family = "binomial"),
                  pred_fun = predict,
                  pred_args = list(type = "response"),
                  smp_fun = partition_kmeans,
                  smp_args = list(repetition = 1:2, nfold = 4),
                  par_args = list(par_mode = "future"),
                  importance = TRUE, imp_permutations = 10)
summary(out$error_rep)
summary(out$importance)

## End(Not run)

sperrorest documentation built on April 1, 2018, 12:27 p.m.