sim_eval: A function for evaluating dissimilarity matrices (sim_eval)

View source: R/sim_eval.R

sim_evalR Documentation

A function for evaluating dissimilarity matrices (sim_eval)

Description

\loadmathjax Stable lifecycle

This function searches for the most similar observation (closest neighbor) of each observation in a given dataset based on a dissimilarity (e.g. distance matrix). The observations are compared against their corresponding closest observations in terms of their side information provided. The root mean square of differences and the correlation coefficient are used for continuous variables and for discrete variables the kappa index is used.

Usage

sim_eval(d, side_info)

Arguments

d

a symmetric matrix of dissimilarity scores between observations of a given dataset. Alternatively, a vector of with the dissimilarity scores of the lower triangle (without the diagonal values) can be used (see details).

side_info

a matrix containing the side information corresponding to the observations in the dataset from which the dissimilarity matrix was computed. It can be either a numeric matrix with one or multiple columns/variables or a matrix with one character variable (discrete variable). If it is numeric, the root mean square of differences is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. If it is a character variable, then the kappa index is used. See details.

Details

For the evaluation of dissimilarity matrices this function uses side information (information about one variable which is available for a group of observations, Ramirez-Lopez et al., 2013). It is assumed that there is a (direct or indirect) correlation between this side informative variable and the variables from which the dissimilarity was computed. If side_info is numeric, the root mean square of differences (RMSD) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. It is computed as follows:

\mjdeqn

j(i) = NN(xr_i, Xr^{-i})j(i) = NN(xr_i, Xr^{-i}) \mjdeqnRMSD = \sqrt\frac1m \sum_i=1^n (y_i - y_j(i))^2RMSD = \sqrt1/n sum_i=1^m (y_i - y_j(i))^2

where \mjeqnNN(xr_i, Xr^-i)NN(xr_i, Xr^-i) represents a function to obtain the index of the nearest neighbor observation found in \mjeqnXrXr (excluding the \mjeqniith observation) for \mjeqnxr_ixr_i, \mjeqny_iy_i is the value of the side variable of the \mjeqniith observation, \mjeqny_j(i)y_j(i) is the value of the side variable of the nearest neighbor of the \mjeqniith observation and \mjeqnmm is the total number of observations.

If side_info is a factor the kappa index (\mjeqn\kappakappa) is used instead the RMSD. It is computed as follows:

\mjdeqn\kappa

= \fracp_o-p_e1-p_ekappa = p_o-p_e/1-p_e

where both \mjeqnp_op_o and \mjeqnp_ep_e are two different agreement indices between the the side information of the observations and the side information of their corresponding nearest observations (i.e. most similar observations). While \mjeqnp_op_o is the relative agreement \mjeqnp_ep_e is the the agreement expected by chance.

This functions accepts vectors to be passed to argument d, in this case, the vector must represent the lower triangle of a dissimilarity matrix (e.g. as returned by the stats::dist() function of stats).

Value

sim_eval returns a list with the following components:

  • "eval either the RMSD (and the correlation coefficient) or the kappa index

  • first_nn a matrix containing the original side informative variable in the first half of the columns, and the side informative values of the corresponding nearest neighbors in the second half of the columns.

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples


library(prospectr)
data(NIRsoil)

sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Example 1
# Compute a principal components distance
pca_d <- ortho_diss(Xr, pc_selection = list("manual", 8))$dissimilarity

# Example 1.1
# Evaluate the distance matrix on the baisis of the
# side information (Yr) associated with Xr
se <- sim_eval(pca_d, side_info = as.matrix(Yr))

# The final evaluation results
se$eval

# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbors
# found by using the distance matrix
se$first_nn

# Example 1.2
# Evaluate the distance matrix on the basis of two side
# information (Yr and Yr2)
# variables associated with Xr
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se_2 <- sim_eval(d = pca_d, side_info = cbind(Yr, Yr_2))

# The final evaluation results
se_2$eval

# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbors found by using the distance matrix
se_2$first_nn

# Example 2
# Evaluate the distances produced by retaining different number of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))

# first project the data
pca_2 <- ortho_projection(Xr, pc_selection = list("manual", 30))

results <- matrix(NA, pca_2$n_components, 3)
colnames(results) <- c("pcs", "rmsd", "r")
results[, 1] <- 1:pca_2$n_components
for (i in 1:pca_2$n_components) {
  ith_d <- f_diss(pca_2$scores[, 1:i, drop = FALSE], scale = TRUE)
  ith_eval <- sim_eval(ith_d, side_info = as.matrix(Yr))
  results[i, 2:3] <- as.vector(ith_eval$eval)
}
plot(results)

# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- cor_diss(Xr)
eval_corr_diss <- sim_eval(cd, side_info = as.matrix(Yr))
eval_corr_diss$eval


l-ramirez-lopez/resemble documentation built on April 20, 2023, 10:44 p.m.