subdivideDataset: Selects a subset of a multivariate (e.g., spectral) dataset.

View source: R/subdivideDataset.R

Description

This function accepts spectra in a spectra.list or spectra.matrix object and selects a subset of that dataset. The function can select either a calibration or a validation subset, and the two are fundamentally different. When you select a calibration dataset, the intention is to choose a representative subset of all spectral data on which to perform wet lab analysis. However, when you select a subset of samples that already have wet lab analysis in order to validate a model, it is important that both the validation (test) set and the calibration (training) set are representative; otherwise, the calibration model will be fit to well-sampled spectral space but validated on outlying points. Calibration selection uses the Kennard-Stone algorithm, whereas validation selection uses the Duplex algorithm, a modification proposed by the original authors. Finally, the function can perform calibration or validation selection using one of five methods (see the method parameter for details).
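To make the Kennard-Stone idea concrete, the following is a minimal base-R sketch of the classic max-min selection on a plain numeric matrix. It is not the plantspec implementation: ks_select() is a hypothetical helper, the data are simulated, and plain Euclidean distance is assumed.

```r
# Minimal Kennard-Stone sketch on a numeric matrix (rows = samples).
# ks_select() is a hypothetical helper, not part of plantspec.
ks_select <- function(X, k) {
  D <- as.matrix(dist(X))                 # pairwise Euclidean distances
  n <- nrow(D)
  stopifnot(k >= 2, k <= n)
  # Start with the two samples that are farthest apart.
  first <- which(D == max(D), arr.ind = TRUE)[1, ]
  selected <- as.integer(first)
  while (length(selected) < k) {
    remaining <- setdiff(seq_len(n), selected)
    # For each remaining sample, distance to its nearest selected sample.
    nearest <- apply(D[remaining, selected, drop = FALSE], 1, min)
    # Add the sample whose nearest selected neighbour is farthest away.
    selected <- c(selected, remaining[which.max(nearest)])
  }
  selected
}

set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50)      # simulated stand-in for spectra
idx <- ks_select(X, k = 10)
is_selected <- seq_len(nrow(X)) %in% idx   # logical vector, similar to output = "logical"
table(is_selected)
```

Duplex follows the same max-min rule but alternates between two sets, so that both the calibration and the validation set end up spread across the data.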

Usage

subdivideDataset(spectra, component = NULL, type = "validation",
  p = 0.2, method = "KS", seed.set = NULL, output = "logical")

Arguments

spectra

An object of class spectra.list or spectra.matrix containing the spectra to subdivide.

component

Methods "SPXY" and "MDKS" incorporate Y-value data in subset selection. If using one of these two methods, a vector of Y data should be provided here.

type

One of "calibration" or "validation" depending on the type of subset required.

p

The proportion of the dataset to select as the "calibration" or "validation" group.

method

The desired method. Selected from:
"KS" - Standard Kennard-Stone selection. When type = "validation", this performs Duplex selection.
"PCAKS" - Selection is performed on the principal components from a PCA of the spectra.
"SPXY" - Selection occurs on both X (spectra) and Y (component) data with equal weighting.
"MDKS" - Mahalanobis distance is used instead of Euclidean distance. Selection occurs on both X (spectra) and Y (component) data with equal weighting.
"random" - Simple random selection, regardless of multivariate distribution.
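The methods above differ mainly in the distance matrix that the selection works on. The base-R sketch below shows one plausible way to build each distance on simulated data. It is an illustration under stated assumptions, not the plantspec code: the helper spxy_distance(), the choice of three principal components, and the simulated X and y are all hypothetical.

```r
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)   # simulated stand-in for spectra
y <- rnorm(20)                          # simulated stand-in for component data

# "KS": Euclidean distance between samples in the original space.
D_ks <- as.matrix(dist(X))

# "PCAKS": Euclidean distance between PCA scores (3 components assumed here).
scores <- prcomp(X, center = TRUE)$x[, 1:3]
D_pca <- as.matrix(dist(scores))

# "SPXY": X and Y distances, each scaled by its maximum, summed with equal weight.
# spxy_distance() is a hypothetical helper, not part of plantspec.
spxy_distance <- function(X, y) {
  Dx <- as.matrix(dist(X))
  Dy <- as.matrix(dist(matrix(y, ncol = 1)))
  Dx / max(Dx) + Dy / max(Dy)
}
D_spxy <- spxy_distance(X, y)

# Mahalanobis distance between samples: whiten with the Cholesky factor of the
# covariance, then take Euclidean distances. Requires a non-singular covariance.
Z <- X %*% solve(chol(cov(X)))
D_mahal <- as.matrix(dist(Z))
```

Any of these matrices could then be passed to a max-min selection loop in place of the plain Euclidean distances.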

seed.set

A single numeric value. If method is "random" then you can set the seed so that the same selection is produced each time.

output

One of "logical" or "names". If "logical", the function returns a logical vector in which TRUE values mark the selected samples. If "names", the names of the selected spectra are returned.
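The two output forms carry the same information. The short base-R sketch below uses made-up sample names and a made-up selection (neither comes from plantspec) to show how a logical vector and a vector of names relate, and how the unselected samples can be recovered.

```r
# Hypothetical sample names and selection, for illustration only.
sample_names <- c("s1", "s2", "s3", "s4", "s5")
selected <- c(TRUE, FALSE, FALSE, TRUE, FALSE)   # similar to output = "logical"

selected_names <- sample_names[selected]         # similar to output = "names"
unselected_names <- sample_names[!selected]      # the complementary subset

# Converting names back to a logical vector.
back_to_logical <- sample_names %in% selected_names
```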

Value

A vector. Depending on output, either a logical vector or a vector of names indicating the selected spectra.

Author(s)

Daniel M Griffith

References

Kennard, R. W. and Stone, L. A. (1969) Computer aided design of experiments. Technometrics, 11, 137-148.

Galvao, R., Araujo, M., Jose, G., Pontes, M., Silva, E. & Saldanha, T. (2005). A method for calibration and validation subset partitioning. Talanta, 67, 736-740.

Saptoro, Agus; Tadé, Moses O.; and Vuthaluru, Hari (2012) "A Modified Kennard-Stone Algorithm for Optimal Division of Data for Developing Artificial Neural Network Models," Chemical Product and Process Modeling: Vol. 7: Iss. 1, Article 13. DOI: 10.1515/1934-2659.1645

Snee, R.D., 1977. Validation of regression models: methods and examples. Technometrics, 19, 415-428.

Examples

## Not run: 
data(shootout)
val_set <- subdivideDataset(spectra = shootout_scans, type = "validation", method = "KS")
table(val_set)

## End(Not run)

griffithdan/plantspec documentation built on May 17, 2019, 8:37 a.m.