sampcla | R Documentation |
The function divides a dataset into two sets, "train" and "test", using stratified sampling on defined classes.
If argument y = NULL (default), the sampling is random within each class. If not, the sampling is systematic (regular grid) within each class over the quantitative variable y.
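The two sampling modes described above can be illustrated with a minimal base-R sketch. The helper name strat_sample_sketch is hypothetical and is not part of the package. This page does not say how the regular grid is placed, so the sketch assumes evenly spaced ranks between the smallest and largest y of each class; the package's own rule may differ.

```r
# Hypothetical helper for illustration only; not the package's implementation.
strat_sample_sketch <- function(x, y = NULL, m) {
  classes <- sort(unique(x))
  # A single m is recycled to one value per class
  if (length(m) == 1) m <- rep(m, length(classes))
  stopifnot(length(m) == length(classes))
  test <- integer(0)
  for (i in seq_along(classes)) {
    idx <- which(x == classes[i])   # positions in x belonging to class i
    stopifnot(m[i] <= length(idx))
    if (is.null(y)) {
      # Random sampling within the class
      pick <- idx[sample.int(length(idx), m[i])]
    } else {
      # Systematic sampling (assumed rule): order the class by y,
      # then take m[i] evenly spaced ranks along that ordering
      ord <- idx[order(y[idx])]
      pos <- round(seq(1, length(ord), length.out = m[i]))
      pick <- ord[pos]
    }
    test <- c(test, pick)
  }
  sort(test)
}

# Toy data: three classes of sizes 5, 8 and 7
x <- rep(c("a", "b", "c"), times = c(5, 8, 7))
y <- seq_along(x)
s_random <- strat_sample_sketch(x, m = 2)
s_system <- strat_sample_sketch(x, y, m = c(1, 2, 3))
table(x[s_random])
table(x[s_system])
```

Random sampling changes from call to call unless a seed is set, whereas the systematic variant is deterministic for a given y.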
sampcla(x, y = NULL, m)
x |
A vector (length |
y |
A vector (length |
m |
Either an integer defining the equal number of test observation(s) to select per class, or a vector of integers defining the numbers to select for each class. In the last case, vector |
Indexes (i.e. positions in x) of the selected observations.
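Because the result is a set of positions in x, the complementary set is obtained by removing those positions from the full index. The short sketch below uses a made-up index vector s in place of a real sampling result; all values are illustrative.

```r
x <- c(1, 3, 4, 1, 3, 4, 1, 3)
s <- c(2L, 4L, 6L)   # made-up test indexes, standing in for a sampling result

test_x <- x[s]

# setdiff() stays correct even when s is empty, unlike x[-s]
train_idx <- setdiff(seq_along(x), s)
train_x <- x[train_idx]
```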
Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.
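# Class labels for 20 observations (numeric here; character or factor alternatives below)
x <- sample(c(1, 3, 4), size = 20, replace = TRUE)
#x <- sample(c("B", "3", "a"), size = 20, replace = TRUE)
#x <- as.factor(sample(c("B", "3", "a"), size = 20, replace = TRUE))
table(x)
# Random sampling: 2 test observations per class
sampcla(x, m = 2)
s <- sampcla(x, m = 2)$test
x[s]
# Random sampling: a different number of test observations for each class
sampcla(x, m = c(1, 2, 1))
s <- sampcla(x, m = c(1, 2, 1))$test
x[s]
# Systematic sampling within each class over the quantitative variable y
y <- rnorm(length(x))
sampcla(x, y, m = 2)
s <- sampcla(x, y, m = 2)$test
x[s]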
########## Representative stratified sampling from an unsupervised clustering
data(cassav)
X <- cassav$Xtrain
y <- cassav$ytrain
N <- nrow(X)
fm <- pcaeigenk(X, nlv = 10)
z <- stats::kmeans(x = fm$T, centers = 3, nstart = 25, iter.max = 50)
x <- z$cluster
z <- table(x)
z
p <- c(z) / N
p
psamp <- .20
m <- round(psamp * N * p)
m
## Random
res <- sampcla(x, m = m)
s <- res$test
table(x[s])
## Systematic for y
res <- sampcla(x, y, m = m)
s <- res$test
table(x[s])
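The proportional allocation m <- round(psamp * N * p) used in this example can be checked on toy numbers. The cluster sizes below are made up for illustration. Since p = z / N, the expression reduces to round(psamp * z), and because each class is rounded separately, sum(m) can differ slightly from round(psamp * N).

```r
# Made-up cluster sizes, for illustration only
z <- c("1" = 120, "2" = 50, "3" = 30)
N <- sum(z)
p <- z / N
psamp <- 0.20

# Number of test observations per cluster, proportional to cluster size
m <- round(psamp * N * p)
m
sum(m)

# A case where per-class rounding misses the overall target
z2 <- c("1" = 7, "2" = 7, "3" = 6)
m2 <- round(psamp * z2)
c(allocated = sum(m2), target = round(psamp * sum(z2)))
```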