sampcla: Within-class sampling
In rchemo: Dimension Reduction, Regression and Discrimination for Chemometrics

Description

The function divides a dataset into two sets, "train" and "test", using stratified sampling on defined classes.

If argument y = NULL (default), the sampling is random within each class. If not, the sampling is systematic (regular grid) within each class over the quantitative variable y.
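The two sampling modes described above can be sketched in base R. This is only an illustration under stated assumptions, not the rchemo implementation: the helper name `sketch_within_class` is hypothetical, classes are assumed to be ordered with `sort(unique(x))`, and the "regular grid" is assumed to mean regularly spaced positions along the observations of a class sorted by `y`. It also assumes that no class is asked for more test observations than it contains.

```r
# Hypothetical helper, NOT the rchemo code: base-R sketch of within-class
# sampling of m test observations per class.
sketch_within_class <- function(x, y = NULL, m) {
  lev <- sort(unique(x))                  # ordered classes (assumption)
  m <- rep(m, length.out = length(lev))   # recycle a single m over all classes
  test <- integer(0)
  for (i in seq_along(lev)) {
    idx <- which(x == lev[i])             # positions of class i in x
    if (is.null(y)) {
      # Random sampling within the class
      pick <- idx[sample.int(length(idx), m[i])]
    } else {
      # Systematic sampling: sort the class by y, then take m[i]
      # regularly spaced positions along the sorted values
      idx <- idx[order(y[idx])]
      pos <- round(seq(1, length(idx), length.out = m[i]))
      pick <- idx[pos]
    }
    test <- c(test, pick)
  }
  list(train = setdiff(seq_along(x), test), test = sort(test))
}

x <- c(1, 1, 1, 1, 3, 3, 3, 4, 4, 4, 4, 4)
y <- seq_along(x)
sketch_within_class(x, m = 2)$test      # random: varies from call to call
sketch_within_class(x, y, m = 2)$test   # systematic: depends only on y
```

In the systematic case the selected observations span the range of `y` within each class, which is the usual motivation for sampling over a quantitative variable.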

Usage


sampcla(x, y = NULL, m)

Arguments

`x`	A vector (length `n`, the number of observations) defining the class membership of the observations.
`y`	A vector (length `n`) defining the quantitative variable for the systematic sampling. If `NULL` (default), the sampling is random within each class.
`m`	Either an integer defining the equal number of test observation(s) to select per class, or a vector of integers defining the numbers to select for each class. In the latter case, vector `m` must have a length equal to the number of classes present in `x`, and be ordered in the same way as the ordered class membership.
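When `m` is a vector, it can help to lay it out against the class counts before sampling. The base-R sketch below assumes that "ordered class membership" corresponds to the order returned by `table(x)` (sorted class labels); the names `ni` and `alloc` are only illustrative.

```r
# Class membership with three classes: 1, 3 and 4
x <- c(4, 1, 3, 3, 1, 4, 4, 3, 3, 1)

# Counts per class, ordered by sorted class label (assumed ordering)
ni <- table(x)

# Requested number of test observations for classes 1, 3 and 4, in that order
m <- c(1, 2, 1)

# Basic sanity checks: one entry per class, no class over-sampled
stopifnot(length(m) == length(ni), all(m <= ni))

# Side-by-side view of the allocation
alloc <- data.frame(class = names(ni), n_obs = as.vector(ni), n_test = m)
alloc
```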

Value

`train`	Indexes (i.e. position in `x`) of the selected observations, for the training set.
`test`	Indexes (i.e. position in `x`) of the selected observations, for the test set.
`lev`	The classes present in `x`.
`ni`	The number of observations in each class.
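Since `train` and `test` are positions in `x`, they can be used directly to subset any object with one row (or element) per observation. The sketch below uses a hand-made stand-in for the returned list, with made-up values, so that it runs without rchemo; `X` is a hypothetical data matrix.

```r
# Class membership of 8 observations
x <- c(1, 1, 1, 3, 3, 3, 4, 4)

# Stand-in for the list returned by sampcla(); values are made up
res <- list(train = c(1, 3, 4, 6, 7), test = c(2, 5, 8))

# Hypothetical data matrix, one row per observation
X <- matrix(seq_len(16), nrow = 8)

# Split the data using the returned positions
Xtrain <- X[res$train, , drop = FALSE]
Xtest  <- X[res$test, , drop = FALSE]

# Class composition of the test set
table(x[res$test])
```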

Note

The second example illustrates a representative stratified sampling, where the classes come from an unsupervised clustering.

References

Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.

Examples


## EXAMPLE 1

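x <- sample(c(1, 3, 4), size = 20, replace = TRUE)
table(x)

# Random sampling: 2 test observations per class
res <- sampcla(x, m = 2)
res
s <- res$test
x[s]

# Random sampling: a different number of test observations per class
# (m follows the ordered classes 1, 3, 4)
res <- sampcla(x, m = c(1, 2, 1))
res
s <- res$test
x[s]

# Systematic sampling over y within each class
y <- rnorm(length(x))
res <- sampcla(x, y, m = 2)
res
s <- res$test
x[s]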

## EXAMPLE 2

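data(cassav)
X <- cassav$Xtrain
y <- cassav$ytrain
N <- nrow(X)

# Unsupervised clustering (k-means on the PCA scores) defines the classes
fm <- pcaeigenk(X, nlv = 10)
z <- stats::kmeans(x = fm$T, centers = 3, nstart = 25, iter.max = 50)
x <- z$cluster
z <- table(x)
z
# Proportion of observations in each class
p <- c(z) / N
p

# Number of test observations per class, proportional to class size
# (about 20% of the observations in total)
psamp <- .20
m <- round(psamp * N * p)
m

# Random sampling within each class
random_sampling <- sampcla(x, m = m)
s <- random_sampling$test
table(x[s])

# Systematic sampling over y within each class
systematic_sampling <- sampcla(x, y, m = m)
s <- systematic_sampling$test
table(x[s])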

rchemo documentation built on Sept. 11, 2024, 8:05 p.m.