preprocess: preprocess for microarray data
In plsgenomics: PLS Analyses for Genomics


The function preprocess performs a preprocessing of microarray data.

preprocess(Xtrain, Xtest = NULL, Threshold = c(100, 16000), Filtering = c(5, 500), log10.scale = TRUE, row.stand = TRUE)

`Xtrain`	a (ntrain x p) data matrix of predictors. `Xtrain` must be a matrix. Each row corresponds to an observation and each column to a predictor variable.
`Xtest`	a (ntest x p) matrix containing the predictors for the test data set. `Xtest` may also be a vector of length p (corresponding to only one test observation).
`Threshold`	a vector of length 2 containing the values (threshmin, threshmax) used to threshold the data. Values below threshmin are raised to threshmin and values above threshmax are lowered to threshmax. If `Threshold` is NULL, no thresholding is done. If the value given for `Threshold` is not valid, no thresholding is done either.
`Filtering`	a vector of length 2 containing the values (FiltMin, FiltMax) used to filter genes. Genes with max/min <= FiltMin and (max - min) <= FiltMax are excluded. If `Filtering` is NULL, no filtering is done. If the value given for `Filtering` is not valid, no filtering is done either.
`log10.scale`	a logical value equal to TRUE if a log10 transformation is to be applied.
`row.stand`	a logical value equal to TRUE if each row is to be standardised.

The preprocessing steps recommended by Dudoit et al. (2002) are performed. The default values are those suited to the Colon data.
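
The steps described by the arguments above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by plsgenomics: `sketch_preprocess` is a hypothetical helper, the gene filter is assumed to be computed on the training matrix and then applied to the test data, and row standardisation is assumed to mean centring each row to mean 0 and scaling it to unit sample variance.

```r
# Hypothetical helper (not part of plsgenomics): a base-R sketch of the
# thresholding, filtering, log10 and row-standardisation steps.
sketch_preprocess <- function(Xtrain, Xtest = NULL,
                              Threshold = c(100, 16000),
                              Filtering = c(5, 500),
                              log10.scale = TRUE, row.stand = TRUE) {
  # A single test observation given as a vector becomes a 1 x p matrix.
  if (!is.null(Xtest) && !is.matrix(Xtest)) Xtest <- matrix(Xtest, nrow = 1)

  # Thresholding: raise values to threshmin, cap them at threshmax.
  if (!is.null(Threshold)) {
    clamp <- function(X) pmin(pmax(X, Threshold[1]), Threshold[2])
    Xtrain <- clamp(Xtrain)
    if (!is.null(Xtest)) Xtest <- clamp(Xtest)
  }

  # Filtering (assumption: the mask is computed on the training data only).
  if (!is.null(Filtering)) {
    gmax <- apply(Xtrain, 2, max)
    gmin <- apply(Xtrain, 2, min)
    # Exclusion rule as worded in the description of `Filtering`.
    drop_gene <- (gmax / gmin <= Filtering[1]) & (gmax - gmin <= Filtering[2])
    Xtrain <- Xtrain[, !drop_gene, drop = FALSE]
    if (!is.null(Xtest)) Xtest <- Xtest[, !drop_gene, drop = FALSE]
  }

  # log10 transformation.
  if (log10.scale) {
    Xtrain <- log10(Xtrain)
    if (!is.null(Xtest)) Xtest <- log10(Xtest)
  }

  # Row standardisation (assumption: mean 0 and sample variance 1 per row).
  if (row.stand) {
    std_rows <- function(X) (X - rowMeans(X)) / apply(X, 1, sd)
    Xtrain <- std_rows(Xtrain)
    if (!is.null(Xtest)) Xtest <- std_rows(Xtest)
  }

  list(pXtrain = Xtrain, pXtest = Xtest)
}

# Toy data: 3 training observations, 4 genes; the first gene is nearly constant.
Xtr <- rbind(c(200,   50, 3000, 20000),
             c(210,  900,  150,  8000),
             c(220, 4000, 7000,   120))
Xte <- c(205, 300, 5000, 90)   # one test observation, given as a vector

res <- sketch_preprocess(Xtr, Xte)
ncol(res$pXtrain)   # 3: the near-constant first gene is filtered out
dim(res$pXtest)     # 1 3: the test data keep the same genes
```

In this sketch the default thresholds guarantee strictly positive values, so the max/min ratio and the log10 transformation are well defined; with `Threshold = NULL` and zero or negative intensities, both would need extra care.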

A list with the following components:

`pXtrain`	the (ntrain x p') matrix containing the preprocessed train data.
`pXtest`	the (ntest x p') matrix containing the preprocessed test data.

Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/) and Julie Peyre (http://www-lmc.imag.fr/lmc-sms/Julie.Peyre/).

Dudoit, S. and Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97, 77–87.

# load plsgenomics library
library(plsgenomics)

# load Colon data
data(Colon)
# draw a random learning set: 27 observations from class 2 and 14 from class 1
IndexLearn <- c(sample(which(Colon$Y == 2), 27), sample(which(Colon$Y == 1), 14))

Xtrain <- Colon$X[IndexLearn,]
Ytrain <- Colon$Y[IndexLearn]
Xtest <- Colon$X[-IndexLearn,]

# preprocess data
resP <- preprocess(Xtrain = Xtrain, Xtest = Xtest, Threshold = c(100, 16000),
                   Filtering = c(5, 500), log10.scale = TRUE, row.stand = TRUE)

# how many genes remain after preprocessing?
dim(resP$pXtrain)[2]

[1] 1157