In dual.spls: Dual Sparse Partial Least Squares Regression

d.spls.calval

R Documentation

Splits data into calibration and validation sets using the splitting method CalValXy that takes into account X and y

Description

The function d.spls.calval divides the data X into a calibration set and a validation set. It uses CalValXy, a variation on the Kennard and Stone strategy that divides the observations into groups (see Details for more explanation).

Usage

d.spls.calval(X,pcal=NULL,Datatype=NULL,y=NULL,ncells=10,Listecal=NULL,
center=TRUE,method="euclidean",pc=0.9)

Arguments

`X`	a numeric matrix of predictor values.
`pcal`	a positive integer between 0 and 100. `pcal` is the percentage of calibration samples to be selected. Default value is `NULL`; `pcal` is not needed when `Listecal` is specified.
`Datatype`	a vector of group indices giving the group to which each observation belongs. Default value is `NULL`, meaning the function will use the internal function `type` to compute the vector for `ncells` groups. If `NULL`, parameter `y` must be specified (see Details for more explanation).
`y`	a numeric vector of responses. Default value is `NULL`; `y` is not needed when `Datatype` is specified.
`ncells`	a positive integer giving the number of groups into which the observations are divided. It is used only when `Datatype` is not specified. Default value is `10`.
`Listecal`	a numeric vector specifying how many observations from each group should be selected for calibration. Default value is `NULL`, meaning the function selects `pcal` percent of each group for the calibration set. If `NULL`, parameter `pcal` must be specified.
`center`	a logical value indicating whether the matrix `X` should be centered. Default value is `TRUE`.
`method`	the method and norm used for the distance computation. Default value is "euclidean", meaning the Euclidean norm is applied to the original `X`. "svd-euclidean" means the Euclidean distance is computed after an SVD transformation with `pc` components. "pca-euclidean" means the Euclidean distance is computed on PCA scores with `pc` components.
`pc`	a positive real value indicating the number of components to consider when applying the SVD transformation or the PCA. If `pc < 1`, the number of components kept corresponds to the number of components explaining at least (`pc * 100`) percent of the total variance.
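
To make the two readings of `pc` concrete, the sketch below turns a `pc` value into a number of components using `prcomp` from base R. It is an illustration only: the helper name `n_components` is hypothetical, and the package's internal rule may differ in detail (for example in how it handles rounding or rank).

```r
# Sketch only: n_components is a hypothetical helper, not part of dual.spls.
n_components <- function(X, pc = 0.9, center = TRUE) {
  if (pc >= 1) {
    # pc is read as a component count, capped at the number of available components
    return(min(as.integer(pc), min(dim(X))))
  }
  pca <- prcomp(X, center = center, scale. = FALSE)
  # cumulative share of the total variance explained by the first k components
  explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  # first k whose cumulative share reaches pc
  which(explained >= pc)[1]
}

set.seed(1)
X_demo <- matrix(rnorm(200), nrow = 40, ncol = 5)
k_share <- n_components(X_demo, pc = 0.9)  # components needed for 90% of the variance
k_count <- n_components(X_demo, pc = 3)    # pc >= 1 read as a plain count
```

With the default `pc=0.9`, this reading keeps as many components as are needed to explain 90 percent of the total variance.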

Details

The algorithm selects samples by applying the classical Kennard and Stone procedure to each group of observations in turn. It starts by selecting the point that is furthest from the centroid. This point is assigned to the calibration set and removed from the list of candidates. The algorithm then identifies the group to which this first observation belongs and moves to the group g that comes after it. It computes the distance \delta_{P_{i,g}} between each remaining point P_{i,g} of group g and the calibration point already assigned. The point with the largest \delta_{P_{i,g}} is selected and removed from the candidates, and the procedure moves on to the next group.

When the calibration set contains more than one sample, the procedure computes the distance between each P_{i,g} of the current group and each P_{i,cal} of the calibration set. The minimal distance for each P_{i,g} is denoted distmin(P_{i,g}). The selected candidate satisfies the following equation:

P_{selected} = \arg\max_{P_{i,g}} distmin(P_{i,g})
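
The selection rule above can be sketched in base R as a single step: for every candidate of the current group, take its smallest Euclidean distance to the calibration points, then keep the candidate whose smallest distance is largest. The helper `select_maximin` and its arguments are hypothetical names used for illustration; this is not the package's code.

```r
# Sketch of one selection step (hypothetical helper, not part of dual.spls).
# X:    numeric matrix of observations (rows)
# cal:  row indices already in the calibration set
# cand: row indices of the remaining candidates in the current group
select_maximin <- function(X, cal, cand) {
  distmin <- sapply(cand, function(i) {
    # differences between every calibration point and candidate i
    diffs <- sweep(X[cal, , drop = FALSE], 2, X[i, ])
    # smallest Euclidean distance from candidate i to the calibration set
    min(sqrt(rowSums(diffs^2)))
  })
  # candidate with the largest minimal distance
  cand[which.max(distmin)]
}

# Four points on a line at 0, 1, 5 and 2
X_line <- rbind(c(0, 0), c(1, 0), c(5, 0), c(2, 0))
step1 <- select_maximin(X_line, cal = 1, cand = c(2, 3, 4))
step2 <- select_maximin(X_line, cal = c(1, 3), cand = c(2, 4))
```

In this toy case the first step picks row 3 (the point at 5), and the second step picks row 4, whose minimal distance to rows 1 and 3 is 2 against 1 for row 2.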

Once every element of the vector Listecal has reached zero, the procedure is done.

The algorithm for only one group corresponds to the classical Kennard and Stone algorithm.
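
As a rough illustration of the whole procedure, the sketch below cycles through the groups, picks one point per visit, and counts down a per-group quota that plays the role of Listecal. Everything here is an assumption made for illustration: the function name `calval_sketch`, the wrap-around order of the groups, and the requirement that each quota is at least 1 and no larger than its group. It is not the implementation used by d.spls.calval.

```r
# Sketch of the group-wise procedure (hypothetical function, not the package code).
# groups:      integer labels 1..G, one per row of X
# n_per_group: calibration count wanted in each group (assumed >= 1 and <= group size)
calval_sketch <- function(X, groups, n_per_group) {
  remaining <- n_per_group
  G <- length(n_per_group)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  # start with the point furthest from the centroid
  cal <- which.max(rowSums(Xc^2))
  g <- groups[cal]
  remaining[g] <- remaining[g] - 1
  while (any(remaining > 0)) {
    g <- g %% G + 1                       # next group, wrapping around (assumption)
    if (remaining[g] <= 0) next           # quota of this group already filled
    cand <- setdiff(which(groups == g), cal)
    distmin <- sapply(cand, function(i) {
      min(sqrt(rowSums(sweep(Xc[cal, , drop = FALSE], 2, Xc[i, ])^2)))
    })
    cal <- c(cal, cand[which.max(distmin)])  # candidate with the largest minimal distance
    remaining[g] <- remaining[g] - 1
  }
  list(indcal = sort(cal), indval = setdiff(seq_len(nrow(X)), cal))
}

set.seed(42)
X_demo <- matrix(rnorm(120), nrow = 30, ncol = 4)
groups_demo <- rep(1:3, each = 10)
res <- calval_sketch(X_demo, groups_demo, n_per_group = c(7, 8, 6))
```

With a single group (`groups` all equal to 1), the loop reduces to repeatedly picking the candidate with the largest minimal distance, which is the one-group case mentioned above.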

If Datatype is not specified, the function divides the observations into ncells groups. First, the observations are sorted according to the values of y. Second, the observations are divided into ncells cells of equal size according to the cumulative empirical probabilities. Finally, each observation whose value of y falls in a given sub-interval is assigned the number of the corresponding cell.
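
One way to picture this grouping is with the empirical quantiles of y, as in the sketch below. The helper `assign_cells` is a hypothetical stand-in for the internal function `type`, whose exact behaviour is not shown here; the sketch also assumes that the values of y are distinct enough for the quantile breaks to be unique.

```r
# Sketch only: assign_cells is a hypothetical stand-in for the internal grouping step.
assign_cells <- function(y, ncells = 10) {
  # cell boundaries at equally spaced cumulative probabilities
  breaks <- quantile(y, probs = seq(0, 1, length.out = ncells + 1), names = FALSE)
  # include.lowest keeps the smallest y in the first cell; labels = FALSE returns cell numbers
  as.integer(cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE))
}

set.seed(3)
y_demo <- rnorm(100)
cells <- assign_cells(y_demo, ncells = 4)
table(cells)  # number of observations per cell
```

The resulting vector has the same role as `Datatype`: one group number per observation, with low values of y in cell 1 and high values in the last cell.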

Value

A list with the following elements:

`indcal`	a numeric vector giving the row indices of the input data selected for calibration.
`indval`	a numeric vector giving the row indices of the remaining observations.
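
The short sketch below shows how the two index vectors are typically used to subset the data. The list `sp` is a hand-made stand-in for the value returned by d.spls.calval, with made-up indices, so the block runs without the package.

```r
set.seed(7)
X_demo <- matrix(rnorm(40), nrow = 10, ncol = 4)
y_demo <- rnorm(10)

# Stand-in for the returned list (indices are made up for illustration)
sp <- list(indcal = c(1, 2, 4, 5, 7, 8, 10), indval = c(3, 6, 9))

# Calibration and validation subsets of X and y
Xcal <- X_demo[sp$indcal, , drop = FALSE]
ycal <- y_demo[sp$indcal]
Xval <- X_demo[sp$indval, , drop = FALSE]
yval <- y_demo[sp$indval]

# The two index vectors should not overlap and should cover every row
stopifnot(length(intersect(sp$indcal, sp$indval)) == 0)
stopifnot(setequal(c(sp$indcal, sp$indval), seq_len(nrow(X_demo))))
```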

Author(s)

Louna Alsouki, François Wahl

References

Kennard, Ronald W, and Larry A Stone. 1969. “Computer Aided Design of Experiments.” Technometrics 11 (1): 137–48.

Examples

### load dual.spls library
library(dual.spls)
### parameters
n <- 100
p <- 50
nondes <- 20
sigmaondes <- 0.5
data <- d.spls.simulate(n = n, p = p, nondes = nondes, sigmaondes = sigmaondes)

X <- data$X
y <- data$y

### calibration parameters for split1
pcal <- 70
ncells <- 3

split1 <- d.spls.calval(X = X, pcal = pcal, y = y, ncells = ncells)

### plotting split1
plot(X[split1$indcal, 1], X[split1$indcal, 2], xlab = "Variable 1",
     ylab = "Variable 2", pch = 19, col = "red",
     main = "Calibration and validation split1")
points(X[split1$indval, 1], X[split1$indval, 2], pch = 19, col = "green")
legend("topright", legend = c("Calibration points", "Validation points"),
       cex = 0.8, col = c("red", "green"), pch = c(19, 19))

### calibration parameters for split2
ncells <- 3
dimtype <- floor(n / 3)
# type of observations
Datatype <- c(rep(1, dimtype), rep(2, dimtype), rep(3, n - dimtype * 2))
# how many observations of each type are to be selected in the calibration set
L1 <- floor(0.7 * sum(Datatype == 1))
L2 <- floor(0.8 * sum(Datatype == 2))
L3 <- floor(0.6 * sum(Datatype == 3))
Listecal <- c(L1, L2, L3)

split2 <- d.spls.calval(X = X, y = y, Datatype = Datatype, Listecal = Listecal)

### plotting split2
plot(X[split2$indcal, 1], X[split2$indcal, 2], xlab = "Variable 1",
     ylab = "Variable 2", pch = 19, col = "red",
     main = "Calibration and validation split2")
points(X[split2$indval, 1], X[split2$indval, 2], pch = 19, col = "green")
legend("topright", legend = c("Calibration points", "Validation points"),
       cex = 0.8, col = c("red", "green"), pch = c(19, 19))

dual.spls documentation built on April 19, 2023, 1:07 a.m.