sampks: Kennard-Stone sampling


sampks    R Documentation

Kennard-Stone sampling

Description

The function divides the data X into two sets, "train" and "test", using the Kennard-Stone (KS) algorithm (Kennard & Stone, 1969).

The two sets returned by the KS algorithm are not generated by the same probability distribution: one set has a higher dispersion than the other. For consistency with the literature, the output train of sampks contains the set with the higher dispersion. (The train/test roles can be swapped depending on the objectives and usage.)
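To make the selection rule concrete, the following is a minimal base R sketch of the usual KS procedure with Euclidean distances: start from the two most distant observations, then repeatedly add the observation that is farthest from its nearest already-selected neighbour. The function name ks_sketch is hypothetical and this is not the sampks source code; it only illustrates the principle and may differ from sampks in details such as tie handling.

```r
# Minimal Kennard-Stone sketch (Euclidean distance), base R only.
# ks_sketch is a hypothetical helper, not the sampks implementation.
ks_sketch <- function(X, k) {
  X <- as.matrix(X)
  n <- nrow(X)
  stopifnot(k >= 2, k <= n)
  D <- as.matrix(dist(X))  # n x n matrix of Euclidean distances
  # Start with the two most distant observations
  first <- which(D == max(D), arr.ind = TRUE)[1, ]
  selected <- as.integer(first)
  while (length(selected) < k) {
    remaining <- setdiff(seq_len(n), selected)
    # For each remaining row: distance to its nearest selected row
    dmin <- apply(D[remaining, selected, drop = FALSE], 1, min)
    # Add the row that is farthest from the already selected set
    selected <- c(selected, remaining[which.max(dmin)])
  }
  list(train = selected, test = setdiff(seq_len(n), selected))
}

# Toy data: four points on a line, at 0, 1, 2 and 10
X <- matrix(c(0, 1, 2, 10), ncol = 1)
z <- ks_sketch(X, k = 3)
z
```

On this toy data the two extremes (rows 1 and 4) are taken first, then row 3, so the selected set spans the whole range while the remaining row lies in its interior. This is the higher dispersion of the train set mentioned above.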

Usage


sampks(X, k, diss = c("euclidean", "mahalanobis", "correlation"))

Arguments

X

An n x p matrix or data frame from which row observations are selected.

k

An integer defining the number of training observations to select.

diss

The type of dissimilarity used for selecting the observations in the algorithm. Possible values are "euclidean" (default; Euclidean distance), "mahalanobis" (Mahalanobis distance), or "correlation". Correlation dissimilarities are calculated as sqrt(.5 * (1 - rho)), where rho is the correlation between two observations.
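As an illustration of the three dissimilarity types, the base R sketch below computes each of them between the rows of a small simulated matrix. It rests on assumptions that are not stated in this page: rho is taken to be the Pearson correlation between two rows of X, and the Mahalanobis distance uses the sample covariance of X. The internal computations of sampks may differ.

```r
# Pairwise dissimilarities between the rows of X (base R sketch).
# Assumptions: Pearson correlation between rows; covariance estimated from X.
set.seed(1)
X <- matrix(rnorm(10 * 3, mean = 10), ncol = 3)

# Euclidean distance
D_euc <- as.matrix(dist(X))

# Mahalanobis distance: Euclidean distance after whitening X
S <- cov(X)
U <- chol(S)            # S = t(U) %*% U
Z <- X %*% solve(U)     # whitened observations
D_mah <- as.matrix(dist(Z))

# Correlation dissimilarity: sqrt(.5 * (1 - rho))
rho <- cor(t(X))        # correlations between rows (observations)
D_cor <- sqrt(.5 * pmax(1 - rho, 0))  # pmax guards against rounding below 0
```

The correlation dissimilarity lies between 0 (rho = 1) and 1 (rho = -1), so perfectly anti-correlated observations are the most dissimilar.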

Value

A list with components train and test, each a vector of indexes (i.e. row numbers in X) of the observations assigned to that set.
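Since the output holds row numbers rather than data, the two subsets are obtained by indexing X. The short sketch below uses a hand-written list z that only mimics the structure of the output (the index values are made up for illustration, they are not a sampks result).

```r
# Splitting X with index vectors (z is a hypothetical, hand-written list
# that mimics the train/test structure; it is not computed by sampks).
X <- matrix(seq_len(20), ncol = 2)
z <- list(train = c(1L, 10L, 5L), test = c(2L, 3L, 4L, 6L, 7L, 8L, 9L))

Xtrain <- X[z$train, , drop = FALSE]  # rows in the training set
Xtest <- X[z$test, , drop = FALSE]    # remaining rows

dim(Xtrain)
dim(Xtest)
```

Using drop = FALSE keeps the result a matrix even when a set contains a single row.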

References

Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.

Examples


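# Simulated data: 10 observations, 3 variables
set.seed(seed = 1)
n <- 10 ; p <- 3
X <- matrix(rnorm(n * p, mean = 10), ncol = p, byrow = TRUE)
set.seed(seed = NULL)

# Select k = 7 training observations (default: Euclidean distance)
sampks(X, k = 7)
# Same selection with the Mahalanobis distance
sampks(X, k = 7, diss = "mahalanobis")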

###################################

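# Example on the datcass data set
data(datcass)

X <- datcass$Xr

# PCA with 10 components
fm <- pca_eigenk(X, ncomp = 10)

# Select 140 training observations from fm$T (Mahalanobis distance)
z <- sampks(fm$T, k = 140, diss = "mahalanobis")
z

# Plot fm$T and highlight the test observations in red
plotxy(fm$T, zeroes = TRUE, pch = 16)
points(fm$T[z$test, 1:2], col = "red", pch = 16, cex = 1.3)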


mlesnoff/rnirs documentation built on April 24, 2023, 4:17 a.m.