optimCLHS | R Documentation |
Optimize a sample configuration for spatial trend identification and estimation using the method proposed by Minasny and McBratney (2006), known as conditioned Latin hypercube sampling (CLHS). A utility function U is defined so that the sample reproduces the marginal distribution and correlation matrix of the numeric covariates, and the class proportions of the factor covariates. The utility function is obtained by aggregating three objective functions: O1, O2, and O3.
optimCLHS(
points,
candi,
covars,
use.coords = FALSE,
clhs.version = c("paper", "fortran", "update"),
schedule,
plotit = FALSE,
track = FALSE,
boundary,
progress = "txt",
verbose = FALSE,
weights
)
objCLHS(
points,
candi,
covars,
use.coords = FALSE,
clhs.version = c("paper", "fortran", "update"),
weights
)
points |
Integer value, integer vector, data frame (or matrix), or list. The number of sampling points (sample size) or the starting sample configuration. Four options are available:
Most users will want to set an integer value simply specifying the required sample size. Using an integer vector or data frame (or matrix) will generally be helpful to users willing to evaluate starting sample configurations, test strategies to speed up the optimization, and fine-tune or thin an existing sample configuration. Users interested in augmenting a possibly existing real-world sample configuration or fine-tuning only a subset of the existing sampling points will want to use a list. |
candi |
Data frame (or matrix). The Cartesian x- and y-coordinates (in this order) of the
cell centres of a spatially exhaustive, rectangular grid covering the entire spatial sampling
domain. The spatial sampling domain can be contiguous or composed of disjoint areas and contain
holes and islands. |
covars |
Data frame or matrix with the spatially exhaustive covariates in the columns. |
use.coords |
(Optional) Logical value. Should the projected spatial x- and y-coordinates
be used as spatially exhaustive covariates? Defaults to use.coords = FALSE. |
clhs.version |
(Optional) Character value setting the CLHS version that should be used. Available
options are: "paper", "fortran", and "update". |
schedule |
List with named sub-arguments setting the control parameters of the annealing
schedule. See |
plotit |
(Optional) Logical for plotting the evolution of the optimization. Plot updates
occur every ten (10) spatial jitters. Defaults to plotit = FALSE. |
track |
(Optional) Logical value. Should the evolution of the energy state be recorded and
returned along with the result? If |
boundary |
(Optional) An object of class SpatialPolygons (see sp::SpatialPolygons()) with
the outer and inner limits of the spatial sampling domain (see |
progress |
(Optional) Type of progress bar that should be used, with options |
verbose |
(Optional) Logical for printing messages about the progress of the optimization.
Defaults to verbose = FALSE. |
weights |
List with named sub-arguments. The weights assigned to each one of the objective functions that form the multi-objective combinatorial optimization problem. They must be named after the respective objective function to which they apply. The weights must be equal to or larger than 0 and sum to 1. |
There are multiple mechanisms to generate a new sample configuration out of an existing one. The main step consists of randomly perturbing the coordinates of a single sample, a process known as ‘jittering’. These mechanisms can be classified based on how the set of candidate locations for the samples is defined. For example, one could use an infinite set of candidate locations, that is, any location in the spatial domain can be selected as a new sample location after a sample is jittered. All that is needed is a polygon indicating the boundary of the spatial domain. This method is more computationally demanding because every time an existing sample is jittered, it is necessary to check whether the new sample location falls within the spatial domain.
Another approach consists of using a finite set of candidate locations for the samples. A finite set of candidate locations is created by discretising the spatial domain, that is, creating a fine (regular) grid of points that serve as candidate locations for the jittered sample. This is a less computationally demanding jittering method because, by definition, the new sample location will always fall in the spatial domain.
Using a finite set of candidate locations has two important inconveniences. First, not all locations in the spatial domain can be selected as the new location for a jittered sample. Second, when a sample is jittered, the new location may already be occupied by another sample. If this happens, another location has to be sought iteratively, say, as many times as the size of the sample configuration. In general, the larger the sample configuration, the more likely it is that the new location is already occupied by another sample. If a solution is not found in a reasonable time, the sample selected to be jittered is kept in its original location. Such a procedure is clearly suboptimal.
spsann uses a more elegant method, which is based on using a finite set of candidate locations
coupled with a form of two-stage random sampling as implemented in spcosa::spsample().
Because the candidate locations are placed on a finite regular grid, they can be taken as the
centre nodes of a finite set of grid cells (or pixels of a raster image). In the first stage, one
of the “grid cells” is selected with replacement, i.e. regardless of whether it is already
occupied by another sample. In the second stage, the new location for the sample chosen to be
jittered is selected within that “grid cell” by simple random sampling. This method guarantees that
virtually any location in the spatial domain can be selected. It also removes the need to check
whether the new location is already occupied by another sample, speeding up the computations when
compared to the first two approaches.
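The two-stage procedure described above can be sketched in base R. This is a minimal illustration, not the spsann implementation: the helper jitter_two_stage() and its arguments are hypothetical names, and the sketch assumes that candi holds the cell centres of a regular grid with square cells of known size.

```r
# Hypothetical helper (not part of spsann): two-stage jittering sketch.
# candi:    matrix or data frame with the x and y of the grid cell centres
# cellsize: side length of the square grid cells (assumed to be known)
jitter_two_stage <- function(candi, cellsize) {
  # Stage 1: pick one grid cell at random, regardless of whether it is
  # already occupied by another sample
  cell <- sample.int(nrow(candi), size = 1)
  # Stage 2: draw a location uniformly at random inside the selected cell
  half <- cellsize / 2
  c(
    x = candi[cell, 1] + runif(1, min = -half, max = half),
    y = candi[cell, 2] + runif(1, min = -half, max = half)
  )
}

set.seed(1)
# Toy 5 x 5 grid of cell centres with a cell size of 40
candi <- expand.grid(x = seq(20, 180, by = 40), y = seq(20, 180, by = 40))
new_location <- jitter_two_stage(candi, cellsize = 40)
```

Because the second stage samples inside the cell rather than returning its centre, no occupancy check is needed: two samples landing in the same cell will almost surely receive different coordinates.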
Reproducing the marginal distribution of the numeric covariates depends upon the definition of marginal
sampling strata. Equal-area marginal sampling strata are defined using the sample quantiles estimated
with quantile() using a continuous function (type = 7), that is, a function that
interpolates between existing covariate values to estimate the sample quantiles. This is the procedure
implemented in the original method of Minasny and McBratney (2006), which creates breakpoints that do not
occur in the population of existing covariate values. Depending on the level of discretization of the
covariate values, that is, how many significant digits they have, this can create repeated breakpoints,
resulting in empty marginal sampling strata. The number of empty marginal sampling strata will ultimately
depend on the frequency distribution of the covariate and on the number of sampling points. The effect of
these features on the spatial modelling outcome is still poorly understood.
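How repeated breakpoints arise can be illustrated with base R. The covariate values below are made up for illustration; the sketch only assumes that the strata are delimited by quantile() with type = 7, as described above, and is not taken from the spsann source code.

```r
# Made-up covariate with few distinct values (coarse discretization)
z <- c(rep(1, 6), rep(2, 2), 3, 4)
n_points <- 5  # number of sampling points = number of marginal strata

# Breakpoints of the marginal strata from interpolated sample quantiles
probs <- seq(0, 1, length.out = n_points + 1)
breaks <- quantile(z, probs = probs, type = 7, names = FALSE)

# Repeated breakpoints delimit strata of zero width, which stay empty
n_empty <- sum(diff(breaks) == 0)
```

Inspecting breaks also shows interpolated values that do not occur among the covariate values themselves, which is the other feature mentioned above.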
The correlation between two numeric covariates is measured using the sample Pearson's r, a descriptive statistic that ranges from -1 to +1. This statistic is also known as the sample linear correlation coefficient. The effect of ignoring the correlation among factor covariates and between factor and numeric covariates on the spatial modelling outcome is still poorly understood.
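The sample Pearson's r can be computed with cor() from base R. The data below are simulated for illustration, and the comparison between the population and sample correlation matrices is only a sketch of the idea, not necessarily the O3 formulation used by any clhs.version.

```r
set.seed(42)
# Simulated population of two correlated numeric covariates
n <- 500
a <- rnorm(n)
b <- 0.7 * a + rnorm(n, sd = 0.5)
population <- data.frame(a = a, b = b)

# A candidate sample of 20 rows drawn from the population
idx <- sample.int(n, size = 20)

# Pearson correlation matrices of the population and of the sample
r_pop <- cor(population, method = "pearson")
r_sam <- cor(population[idx, ], method = "pearson")

# Illustrative discrepancy: sum of absolute differences between matrices
discrepancy <- sum(abs(r_pop - r_sam))
```

A sample that reproduces the correlation structure of the population well would yield a discrepancy close to zero.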
A method of solving a multi-objective combinatorial optimization problem (MOCOP) is to aggregate the objective functions into a single utility function U. In the spsann package, as in the original implementation of the CLHS by Minasny and McBratney (2006), the aggregation is performed using the weighted sum method, which uses weights to incorporate the a priori preferences of the user about the relative importance of each objective function. When the user has no preference, the objective functions receive equal weights.
The weighted sum method is affected by the relative magnitude of the different objective function values.
The objective functions implemented in optimCLHS have different units and orders of magnitude. The
consequence is that the objective function with the largest values, generally O1, may have a numerical
dominance during the optimization. In other words, the weights may not express the true preferences of the
user, and the meaning of the utility function becomes unclear because the optimization will
likely favour the objective function that is numerically dominant.
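The numerical dominance can be seen with a toy calculation. The objective function values below are invented for illustration and do not come from optimCLHS; utility() is a hypothetical helper that implements a plain weighted sum.

```r
# Hypothetical helper (not part of spsann): plain weighted sum
utility <- function(obj, weights) {
  w <- unlist(weights)[names(obj)]
  # Weights must be non-negative and sum to 1
  stopifnot(all(w >= 0), isTRUE(all.equal(sum(w), 1)))
  sum(w * obj)
}

# Invented objective function values for two sample configurations
config_a <- c(O1 = 120, O2 = 0.40, O3 = 0.90)
config_b <- c(O1 = 125, O2 = 0.10, O3 = 0.20)
weights <- list(O1 = 1 / 3, O2 = 1 / 3, O3 = 1 / 3)

u_a <- utility(config_a, weights)
u_b <- utility(config_b, weights)
# config_b is much better on O2 and O3, yet config_a has the lower
# utility because the difference in O1 dominates the weighted sum
```

With equal weights, a minimizer would prefer config_a even though the user stated no preference for O1 over O2 and O3.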
An efficient solution to avoid numerical dominance is to scale the objective functions so that they are
constrained to the same approximate range of values, at least at the end of the optimization. In the
original implementation of the CLHS by Minasny and McBratney (2006), clhs.version = "paper", optimCLHS
uses the naive aggregation method, which ignores that the three objective functions have different units
and orders of magnitude. In a 2015 Fortran implementation of the CLHS, clhs.version = "fortran", scaling
factors were included to make the values of the three objective functions more comparable. The effect of
ignoring the need to scale the objective functions, or of using arbitrary scaling factors, on the spatial
modelling outcome is still poorly understood. Thus, an updated version of O1, O2, and O3 has
been implemented in the spsann package. The new formulations aim at making the values returned by the
objective functions more comparable among themselves without having to resort to arbitrary scaling factors.
The effect of using these new formulations has not been tested yet.
optimCLHS returns an object of class OptimizedSampleConfiguration: the optimized sample configuration
with details about the optimization.
objCLHS returns a numeric value: the energy state of the sample configuration, that is, the objective
function value.
spsann always computes the distance between two locations (points) as the Euclidean distance between them. This computation requires the optimization to operate in the two-dimensional Euclidean space, i.e. the coordinates of the sample, candidate and evaluation locations must be Cartesian coordinates, generally in metres or kilometres. spsann has no mechanism to check whether the coordinates are Cartesian: you are solely responsible for making sure that this requirement is met.
The (only?) difference between optimCLHS and the original Fortran implementation of Minasny and McBratney
(2006), and the clhs function implemented in the former clhs package by Pierre Roudier, is
the annealing schedule.
Alessandro Samuel-Rosa alessandrosamuelrosa@gmail.com
Minasny, B.; McBratney, A. B. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, v. 32, p. 1378-1388, 2006.
Minasny, B.; McBratney, A. B. Conditioned Latin Hypercube Sampling for calibrating soil sensor data to soil properties. Chapter 9. Viscarra Rossel, R. A.; McBratney, A. B.; Minasny, B. (Eds.) Proximal Soil Sensing. Amsterdam: Springer, p. 111-119, 2010.
Roudier, P.; Beaudette, D.; Hewitt, A. A conditioned Latin hypercube sampling algorithm incorporating operational constraints. 5th Global Workshop on Digital Soil Mapping. Sydney, p. 227-231, 2012.
optimACDC
#####################################################################
# NOTE: The settings below are unlikely to meet your needs. #
#####################################################################
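library(spsann)
# Candidate locations (x, y) and one covariate (column 5) taken from
# the first 1000 cells of the meuse.grid data set
data(meuse.grid, package = "sp")
candi <- meuse.grid[1:1000, 1:2]
covars <- meuse.grid[1:1000, 5]
# Control parameters of the annealing schedule
schedule <- scheduleSPSANN(
  chains = 1, initial.temperature = 20, x.max = 1540, y.max = 2060,
  x.min = 0, y.min = 0, cellsize = 40)
set.seed(2001)
# Optimize a sample of 10 points using O1 and O3 with equal weights
res <- optimCLHS(
  points = 10, candi = candi, covars = covars, use.coords = TRUE,
  clhs.version = "fortran", weights = list(O1 = 0.5, O3 = 0.5),
  schedule = schedule)
# The energy state stored in res should match the recomputed one
objSPSANN(res) - objCLHS(
  points = res, candi = candi, covars = covars, use.coords = TRUE,
  clhs.version = "fortran", weights = list(O1 = 0.5, O3 = 0.5))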