mbl | R Documentation |
This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy learning approach for predicting a given response variable from a set of predictor variables. For each observation in a prediction set, a specific local regression is carried out based on a subset of similar observations (nearest neighbors) selected from a reference set. The local model is then used to predict the response value of the target (prediction) observation. Therefore this function does not yield a global regression model.
mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)),
diss_method = "pca", diss_usage = "predictors", gh = TRUE,
pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
control = mbl_control(), group = NULL, center = TRUE, scale = FALSE,
verbose = TRUE, documentation = character(), seed = NULL, ...)
Xr |
a matrix of predictor variables of the reference data (observations in rows and variables in columns). |
Yr |
a numeric matrix of one column containing the values of the response variable corresponding to the reference data. |
Xu |
a matrix of predictor variables of the data to be predicted (observations in rows and variables in columns). |
Yu |
an optional matrix of one column containing the values of the
response variable corresponding to the data to be predicted. Default is
|
k |
a vector of integers specifying the sequence of k-nearest
neighbors to be tested. Either |
k_diss |
a numeric vector specifying the sequence of dissimilarity
thresholds to be tested for the selection of the nearest neighbors found in
|
k_range |
an integer vector of length 2 which specifies the minimum
(first value) and the maximum (second value) number of neighbors to be
retained when the |
spike |
an integer vector (with positive and/or negative values) indicating
the indices of observations in |
method |
an object of class |
diss_method |
a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. Options are:
Alternatively, a matrix of dissimilarities can also be passed to this
argument. This matrix is supposed to be a user-defined matrix
representing the dissimilarities between observations in |
diss_usage |
a character string specifying how the dissimilarity
information shall be used. The possible options are: |
gh |
a logical indicating if the global Mahalanobis distance (in the pls
score space) between each observation and the pls mean (centre) must be
computed. This metric is known as the GH distance in the literature. Note
that this computation is based on the number of pls components determined by
using the |
pc_selection |
a list of length 2 used for the computation of GH (if
The list
|
control |
a list created with the |
group |
an optional factor (or character vector vector
that can be coerced to |
center |
a logical if the predictor variables must be centred at each
local segment (before regression). In addition, if |
scale |
a logical indicating if the predictor variables must be scaled
to unit variance at each local segment (before regression). In addition, if
|
verbose |
a logical indicating whether or not to print a progress bar
for each observation to be predicted. Default is |
documentation |
an optional character string that can be used to
describe anything related to the |
seed |
an integer value containing the random number generator (RNG)
state for random number generation. This argument can be used for
reproducibility purposes (for random sampling) in the cross-validation
results. Default is |
... |
further arguments to be passed to the |
The argument spike
can be used to indicate what reference observations
in Xr
must be kept in the neighborhood of every single Xu
observation. If a vector of length \mjeqnmm is passed to this argument,
this means that the \mjeqnmm original neighbors with the largest
dissimilarities to the target observations will be forced out of the
neighborhood. Spiking might be useful in cases where
some reference observations are known to be somehow related to the ones in
Xu
and therefore might be relevant for fitting the local models. See
Guerrero et al. (2010) for an example on the benefits of spiking.
The mbl
function uses the dissimilarity
function to
compute the dissimilarities between Xr
and Xu
. The dissimilarity
method to be used is specified in the diss_method
argument.
Arguments to dissimilarity
as well as further arguments to the
functions used inside dissimilarity
(i.e. ortho_diss
cor_diss
f_diss
sid
) can be passed to those functions by using ...
.
The diss_usage
argument is used to specify whether the dissimilarity
information must be used within the local regressions and, if so, how.
When diss_usage = "predictors"
the local (square symmetric)
dissimilarity matrix corresponding the selected neighborhood is used as
source of additional predictors (i.e the columns of this local matrix are
treated as predictor variables). In some cases this results in an improvement
of the prediction performance (Ramirez-Lopez et al., 2013a).
If diss_usage = "weights"
, the neighbors of the query point
(\mjeqnxu_jxu_j) are weighted according to their dissimilarity to
\mjeqnxu_jxu_j before carrying out each local regression. The following
tricubic function (Cleveland and Delvin, 1988; Naes et al., 1990) is used for
computing the final weights based on the measured dissimilarities:
W_j = (1 - v^3)^3W_j = (1 - v^3)^3
where if \mjeqnxr_i \in xr_i in neighbors of \mjeqnxu_jxu_j:
\mjdeqnv_j(xu_j) = d(xr_i, xu_j)v_j(xu_j) = d(xr_i, xu_j)
otherwise:
\mjdeqnv_j(xu_j) = 0v_j(xu_j) = 0
In the above formulas \mjeqnd(xr_i, xu_j)d(xr_i, xu_j) represents the
dissimilarity between the query point and each object in \mjeqnXrXr.
When diss_usage = "none"
is chosen the dissimilarity information is
not used.
The global Mahalanobis distance (a.k.a GH) is computed based on the scores
of a pls projection. A pls projection model is built with for {Yr}, {Xr}
and this model is used to obtain the pls scores of the Xu
observations. The Mahalanobis distance between each Xu
observation in
(the pls space) and the centre of Xr
is then computed. The number of
pls components is optimized based on the parameters passed to the
pc_selection
argument. In addition, the mbl
function also
reports the GH distance for the observations in Xr
.
Some aspects of the mbl process, such as the type of internal validation,
parameter tuning, what extra objects to return, permission for parallel
execution, prediction limits, etc, can be specified by using the
mbl_control
function.
By using the group
argument one can specify groups of observations
that have something in common (e.g. observations with very similar origin).
The purpose of group
is to avoid biased cross-validation results due
to pseudo-replication. This argument allows to select calibration points
that are independent from the validation ones. In this regard, when
validation_type = "local_cv"
(used in mbl_control
function), then the p
argument refers to the percentage of groups of
observations (rather than single observations) to be retained in each
sampling iteration at each local segment.
a list
of class mbl
with the following components
(sorted either by k
or k_diss
):
call
: the call to mbl.
cntrl_param
: the list with the control parameters passed to
control.
Xu_neighbors
: a list containing two elements: a matrix of
Xr
indices corresponding to the neighbors of Xu
and a matrix
of dissimilarities between each Xu
observation and its corresponding
neighbor in Xr
.
dissimilarities
: a list with the method used to obtain the
dissimilarity matrices and the dissimilarity matrix corresponding to
\mjeqnD(Xr, Xu)D(Xr, Xu). This object is returned only if the
return_dissimilarity
argument in the control
list was set
to TRUE
.
n_predictions
: the total number of observations predicted.
gh
: if gh = TRUE
, a list containing the global
Mahalanobis distance values for the observations in Xr
and Xu
as well as the results of the global pls projection object used to obtain
the GH values.
validation_results
: a list of validation results for
"local cross validation" (returned if the validation_type
in
control
list was set to "local_cv"
),
"nearest neighbor validation" (returned if the validation_type
in control
list was set to "NNv"
) and
"Yu prediction statistics" (returned if Yu
was supplied).“
results
: a list of data tables containing the results of the
predictions for each either k
or k_diss
. Each data table
contains the following columns:
o_index
: The index of the predicted observation.
k_diss
: This column is only output if the k_diss
argument is used. It indicates the corresponding dissimilarity threshold
for selecting the neighbors.
k_original
: This column is only output if the k_diss
argument is used. It indicates the number of neighbors that were originally
found when the given dissimilarity threshold is used.
k
: This column indicates the final number of neighbors
used.
npls
: This column is only output if the pls
regression method was used. It indicates the final number of pls
components used.
min_pls
: This column is only output if wapls
regression method was used. It indicates the final number of minimum pls
components used. If no optimization was set, it retrieves the original
minimum pls components passed to the method
argument.
max_pls
: This column is only output if the wapls
regression method was used. It indicates the final number of maximum pls
components used. If no optimization was set, it retrieves the original
maximum pls components passed to the method
argument.
yu_obs
: The input values given in Yu
(the response
variable corresponding to the data to be predicted). If Yu = NULL
,
then NA
s are retrieved.
pred
: The predicted Yu values.
yr_min_obs
: The minimum reference value (of the response
variable) in the neighborhood.
yr_max_obs
: The maximum reference value (of the response
variable) in the neighborhood.
index_nearest_in_Xr
: The index of the nearest neighbor found
in Xr
.
index_farthest_in_Xr
: The index of the farthest neighbor
found in Xr
.
y_nearest
: The reference value (Yr
) corresponding to
the nearest neighbor found in Xr
.
y_nearest_pred
: This column is only output if the
validation method in the object passed to control
was set to
"NNv"
. It represents the predicted value of the nearest neighbor
observation found in Xr
. This prediction come from model fitted
with the remaining observations in the neighborhood of the target
observation in Xu
.
loc_rmse_cv
: This column is only output if the validation
method in the object passed to control
was set to
'local_cv'
. It represents the RMSE of the cross-validation
computed for the neighborhood of the target observation in Xu
.
loc_st_rmse_cv
: This column is only output if the
validation method in the object passed to control
was set to
'local_cv'
. It represents the standardized RMSE of the
cross-validation computed for the neighborhood of the target observation
in Xu
.
dist_nearest
: The distance to the nearest neighbor.
dist_farthest
: The distance to the farthest neighbor.
loc_n_components
: This column is only output if the
dissimilarity method used is one of "pca"
, "pca.nipals"
or
"pls"
and in addition the dissimilarities are requested to be
computed locally by passing .local = TRUE
to the mbl
function.
See .local
argument in the ortho_diss
function.
seed
: a value mirroring the one passed to seed.
documentation
: a character string mirroring the one provided
in the documentation
argument.
When the k_diss
argument is used, the printed results show a table
with a column named 'p_bounded
. It represents the percentage of
observations for which the neighbors selected by the given dissimilarity
threshold were outside the boundaries specified in the k_range
argument.
Leonardo Ramirez-Lopez and Antoine Stevens
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.
Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma, 158(1-2), 66-77.
Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
mbl_control
, f_diss
,
cor_diss
, sid
, ortho_diss
,
search_neighbors
, local_fit
library(prospectr)
data(NIRsoil)
# Proprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc,
wav = as.numeric(colnames(NIRsoil$spc))
),
m = 1,
p = 1,
w = 7
)
NIRsoil$spc_pr <- sg_det
# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
# Example 1
# A mbl implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)
# Example 1.1
# An exmaple where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
my_control <- mbl_control(validation_type = "NNv")
## The neighborhood sizes to test
ks <- seq(40, 140, by = 20)
sbl <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
k = ks,
method = local_fit_gpr(),
control = my_control,
scale = TRUE
)
sbl
plot(sbl)
get_predictions(sbl)
# Example 1.2
# If Yu is actually known...
sbl_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_gpr(),
control = my_control
)
sbl_2
plot(sbl_2)
# Example 2
# the LOCAL algorithm (Shenk et al., 1997)
local_algorithm <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "cor",
diss_usage = "none",
control = my_control
)
local_algorithm
plot(local_algorithm)
# Example 3
# A variation of the LOCAL algorithm (using the optimized pc
# dissmilarity matrix) and dissimilarity matrix as source of
# additional preditors
local_algorithm_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "pca",
diss_usage = "predictors",
control = my_control
)
local_algorithm_2
plot(local_algorithm_2)
# Example 4
# Running the mbl function in parallel with example 2
n_cores <- 2
if (parallel::detectCores() < 2) {
n_cores <- 1
}
# Alternatively:
# n_cores <- parallel::detectCores() - 1
# if (n_cores == 0) {
# n_cores <- 1
# }
library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
# Alernatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()
local_algorithm_par <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "cor",
diss_usage = "none",
control = my_control
)
local_algorithm_par
registerDoSEQ()
try(stopCluster(clust))
# Example 5
# Using local pls distances
with_local_diss <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "pls",
diss_usage = "predictors",
control = my_control,
.local = TRUE,
pre_k = 150,
)
with_local_diss
plot(with_local_diss)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.