search_neighbors: A function for searching in a given reference set the...

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/search_neighbors.R

Description

\loadmathjax

This function searches in a reference set the neighbors of the observations provided in another set.

Usage

1
2
3
4
5
6
7
8
search_neighbors(Xr, Xu, diss_method = c("pca", "pca.nipals", "pls", "mpls", 
                                         "cor", "euclid", "cosine", "sid"),
                 Yr = NULL, k, k_diss, k_range, spike = NULL,
                 pc_selection = list("var", 0.01),
                 return_projection = FALSE, return_dissimilarity = FALSE,
                 ws = NULL,
                 center = TRUE, scale = FALSE,
                 documentation = character(), ...)

Arguments

Xr

a matrix of reference (spectral) observations where the neighbors of the observations in Xu are to be searched.

Xu

a matrix of (spectral) observations for which its neighbors are to be searched in Xu.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation.

  • "pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the singlar value decomposition (SVD) algorithm. See ortho_diss function.

  • "pca.nipals" Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the non-linear iterative partial least squares (niapls) algorithm. See ortho_diss function.

  • "pls" Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr and Xu. In this case, Yr is always required. See ortho_diss function.

  • "mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.

  • "cor" correlation coefficient between observations. See cor_diss function.

  • "euclid" Euclidean distance between observations. See f_diss function.

  • "cosine" Cosine distance between observations. See f_diss function.

  • "sid" spectral information divergence between observations. See sid function.

Yr

a numeric matrix of 'n' observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

  • diss_method = "pls"

  • diss_method = "pca" with "opc" used as the method in the pc_selection argument. See [ortho_diss()].

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

an integer value indicating a dissimilarity treshold. For each observation in Xu, its nearest neighbors in Xr are selected as those for which their dissimilarity to Xu is below this k_diss threshold. This treshold depends on the corresponding dissimilarity metric specified in diss_method. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when the k_diss is given.

spike

a vector of integers indicating what observations in Xr (and Yr) must be 'forced' to always be part of all the neighborhoods.

pc_selection

a list of length 2 to be passed onto the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls". This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fix number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the minimum amount of variance that a component should explain in order to be retained.

The default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. method = "pca", method = "pca.nipals" or method = "pls").

return_dissimilarity

a logical indicating if the dissimilarity matrix used for neighbor search must be returned.

ws

an odd integer value which specifies the window size, when diss_method = cor (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), then the window size will be equal to the number of variables (columns), i.e. instead moving correlation, the normal correlation will be used. See cor_diss function.

center

a logical indicating if the Xr and Xu matrices must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices (\mjeqnXr \cup XuXr U Xu). For dissimilarity computations based on diss_method = pls, the data is always centered.

scale

a logical indicating if the Xr and Xu matrices must be scaled. If Xu is provided the data is scaled based on the standard deviation of the the pooled Xr and Xu matrices (\mjeqnXr \cup XuXr U Xu). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: his is an experimental argument.

...

further arguments to be passed to the dissimilarity fucntion. See details.

Details

This function may be specially useful when the reference set (Xr) is very large. In some cases the number of observations in the reference set can be reduced by removing irrelevant observations (i.e. observations that are not neighbors of a particular target set). For example, this fucntion can be used to reduce the size of the reference set before before running the mbl function.

This function uses the dissimilarity fucntion to compute the dissimilarities between Xr and Xu. Arguments to dissimilarity as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss cor_diss f_diss sid) can be passed to those functions as additional arguments (i.e. ...).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

dissimilarity ortho_diss cor_diss f_diss sid mbl

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

# Identify the neighbor observations using the correlation dissimilarity and
# default parameters
# (In this example all the observations in Xr belong at least to the
# first 100 neighbors of one observation in Xu)
ex1 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "cor",
  k = 40
)

# Identify the neighbor observations using principal component (PC)
# and partial least squares (PLS) dissimilarities, and using the "opc"
# approach for selecting the number of components
ex2 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pca",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex2$unique_neighbors]

ex3 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)
# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex3$unique_neighbors]

# Identify the neighbor observations using local PC dissimialrities
# Here, 150 neighbors are used to compute a local dissimilarity matrix
# and then this matrix is used to select 50 neighbors
ex4 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE,
  .local = TRUE,
  pre_k = 150
)

resemble documentation built on Jan. 18, 2022, 1:08 a.m.