predict: Predict Method for (mint).(block).(s)pls(da) methods

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

Predicted values based on PLS models. New responses and variates are predicted using a fitted model and a new matrix of observations.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
## S3 method for class 'mixo_pls'
predict(
  object,
  newdata,
  study.test,
  dist = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
  multilevel = NULL,
  ...
)

## S3 method for class 'mixo_spls'
predict(
  object,
  newdata,
  study.test,
  dist = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
  multilevel = NULL,
  ...
)

## S3 method for class 'mint.splsda'
predict(
  object,
  newdata,
  study.test,
  dist = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
  multilevel = NULL,
  ...
)

## S3 method for class 'block.pls'
predict(
  object,
  newdata,
  study.test,
  dist = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
  multilevel = NULL,
  ...
)

## S3 method for class 'block.spls'
predict(
  object,
  newdata,
  study.test,
  dist = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
  multilevel = NULL,
  ...
)

Arguments

object

object of class inheriting from "(mint).(block).(s)pls(da)".

newdata

data matrix in which to look for for explanatory variables to be used for prediction. Please note that this method does not perform multilevel decomposition or log ratio transformations, which need to be processed beforehand.

study.test

For MINT objects, grouping factor indicating which samples of newdata are from the same study. Overlap with object$study are allowed.

dist

distance to be applied for discriminant methods to predict the class of new data, should be a subset of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details). Defaults to "all".

multilevel

Design matrix for multilevel analysis (for repeated measurements). A numeric matrix or data frame. For a one level factor decomposition, the input is a vector indicating the repeated measures on each individual, i.e. the individuals ID. For a two level decomposition with splsda models, the two factors are included in Y. Finally for a two level decomposition with spls models, 2nd AND 3rd columns in design indicate those factors (see example in ?splsda and ?spls).

...

not used currently.

Details

predict produces predicted values, obtained by evaluating the PLS-derived methods, returned by (mint).(block).(s)pls(da) in the frame newdata. Variates for newdata are also returned. Please note that this method performs multilevel decomposition and/or log ratio transformations if needed (multilevel is an input parameter while logratio is extracted from object).

Different prediction distances are proposed for discriminant analysis. The reason is that our supervised models work with a dummy indicator matrix of Y to indicate the class membership of each sample. The prediction of a new observation results in either a predicted dummy variable (output object$predict), or a predicted variate (output object$variates). Therefore, an appropriate distance needs to be applied to those predicted values to assign the predicted class. We propose distances such as ‘maximum distance’ for the predicted dummy variables, ‘Mahalanobis distance’ and ‘Centroids distance’ for the predicted variates.

"max.dist" is the simplest method to predict the class of a test sample. For each new individual, the class with the largest predicted dummy variable is the predicted class. This distance performs well in single data set analysis with multiclass problems (PLS-DA).

"centroids.dist" allocates to the new observation the class that mimimises the distance between the predicted score and the centroids of the classes calculated on the latent components or variates of the trained model.

"mahalanobis.dist" allocates the new sample the class defined as the centroid distance, but using the Mahalanobis metric in the calculation of the distance.

In practice we found that the centroid-based distances ("centroids.dist" and "mahalanobis.dist"), and specifically the Mahalanobis distance led to more accurate predictions than the maximum distance for complex classification problems and N-integration problems (block.splsda). The centroid distances consider the prediction in dimensional space spanned by the predicted variates, while the maximum distance considers a single point estimate using the predicted scores on the last dimension of the model. The user can assess the different distances, and choose the prediction distance that leads to the best performance of the model, as highlighted from the tune and perf outputs

More (mathematical) details about the prediction distances are available in the supplemental of the mixOmics article (Rohart et al 2017).

For a visualisation of those prediction distances, see background.predict that overlays the prediction area in plotIndiv for a sPLS-DA object.

Allocates the individual x to the class of Y minimizing dist(\code{x-variate}, G_l), where G_l, l = 1,...,L are the centroids of the classes calculated on the X-variates of the model. "mahalanobis.dist" allocates the individual x to the class of Y as in "centroids.dist" but by using the Mahalanobis metric in the calculation of the distance.

For MINT objects, the study.test argument is required and provides the grouping factor of newdata.

For multi block analysis (thus block objects), newdata is a list of matrices whose names are a subset of names(object$X) and missing blocks are allowed. Several predictions are returned, either for each block or for all blocks. For non discriminant analysis, the predicted values (predict) are returned for each block and these values are combined by average (AveragedPredict) or weighted average (WeightedPredict), using the weights of the blocks that are calculated as the correlation between a block's components and the outcome's components.

For discriminant analysis, the predicted class is returned for each block (class) and each distance (dist) and these predictions are combined by majority vote (MajorityVote) or weighted majority vote (WeightedVote), using the weights of the blocks that are calculated as the correlation between a block's components and the outcome's components. NA means that there is no consensus among the block. For PLS-DA and sPLS-DA objects, the prediction area can be visualised in plotIndiv via the background.predict function.

Value

predict produces a list with the following components:

predict

predicted response values. The dimensions correspond to the observations, the response variables and the model dimension, respectively. For a supervised model, it corresponds to the predicted dummy variables.

variates

matrix of predicted variates.

B.hat

matrix of regression coefficients (without the intercept).

AveragedPredict

if more than one block, returns the average predicted values over the blocks (using the predict output)

WeightedPredict

if more than one block, returns the weighted average of the predicted values over the blocks (using the predict and weights outputs)

class

predicted class of newdata for each 1,...,ncomp components.

MajorityVote

if more than one block, returns the majority class over the blocks. NA for a sample means that there is no consensus on the predicted class for this particular sample over the blocks.

WeightedVote

if more than one block, returns the weighted majority class over the blocks. NA for a sample means that there is no consensus on the predicted class for this particular sample over the blocks.

weights

Returns the weights of each block used for the weighted predictions, for each nrepeat and each fold

centroids

matrix of coordinates for centroids.

dist

type of distance requested.

vote

majority vote result for multi block analysis (see details above).

Author(s)

Florian Rohart, Sébastien Déjean, Ignacio González, Kim-Anh Lê Cao, Al J Abadi

References

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.

See Also

pls, spls, plsda, splsda, mint.pls, mint.spls, mint.plsda, mint.splsda, block.pls, block.spls, block.plsda, block.splsda, mint.block.pls, mint.block.spls, mint.block.plsda, mint.block.splsda and visualisation with background.predict and http://www.mixOmics.org for more details.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
data(linnerud)
X <- linnerud$exercise
Y <- linnerud$physiological
linn.pls <- pls(X, Y, ncomp = 2, mode = "classic")

indiv1 <- c(200, 40, 60)
indiv2 <- c(190, 45, 45)
newdata <- rbind(indiv1, indiv2)
colnames(newdata) <- colnames(X)
newdata

pred <- predict(linn.pls, newdata)

plotIndiv(linn.pls, comp = 1:2, rep.space = "X-variate",style="graphics",ind.names=FALSE)
points(pred$variates[, 1], pred$variates[, 2], pch = 19, cex = 1.2)
text(pred$variates[, 1], pred$variates[, 2],
c("new ind.1", "new ind.2"), pos = 3)

## First example with plsda
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- as.factor(liver.toxicity$treatment[, 4])


## if training is perfomed on 4/5th of the original data
samp <- sample(1:5, nrow(X), replace = TRUE)
test <- which(samp == 1)   # testing on the first fold
train <- setdiff(1:nrow(X), test)

plsda.train <- plsda(X[train, ], Y[train], ncomp = 2)
test.predict <- predict(plsda.train, X[test, ], dist = "max.dist")
Prediction <- test.predict$class$max.dist[, 2]
cbind(Y = as.character(Y[test]), Prediction)

## Not run: 
## Second example with splsda
splsda.train <- splsda(X[train, ], Y[train], ncomp = 2, keepX = c(30, 30))
test.predict <- predict(splsda.train, X[test, ], dist = "max.dist")
Prediction <- test.predict$class$max.dist[, 2]
cbind(Y = as.character(Y[test]), Prediction)


## example with block.splsda=diablo=sgccda and a missing block
data(nutrimouse)
# need to unmap Y for an unsupervised analysis, where Y is included as a data block in data
Y.mat = unmap(nutrimouse$diet)
data = list(gene = nutrimouse$gene, lipid = nutrimouse$lipid, Y = Y.mat)
# with this design, all blocks are connected
design = matrix(c(0,1,1,1,0,1,1,1,0), ncol = 3, nrow = 3,
byrow = TRUE, dimnames = list(names(data), names(data)))

# train on 75% of the data
ind.train=NULL
for(i in 1:nlevels(nutrimouse$diet))
ind.train=c(ind.train,which(nutrimouse$diet==levels(nutrimouse$diet)[i])[1:6])

#training set
gene.train=nutrimouse$gene[ind.train,]
lipid.train=nutrimouse$lipid[ind.train,]
Y.mat.train=Y.mat[ind.train,]
Y.train=nutrimouse$diet[ind.train]
data.train=list(gene=gene.train,lipid=lipid.train,Y=Y.mat.train)

#test set
gene.test=nutrimouse$gene[-ind.train,]
lipid.test=nutrimouse$lipid[-ind.train,]
Y.mat.test=Y.mat[-ind.train,]
Y.test=nutrimouse$diet[-ind.train]
data.test=list(gene=gene.test,lipid=lipid.test)

# example with block.splsda=diablo=sgccda and a missing block
res.train = block.splsda(X=list(gene=gene.train,lipid=lipid.train),Y=Y.train,
ncomp=3,keepX=list(gene=c(10,10,10),lipid=c(5,5,5)))
test.predict = predict(res.train, newdata=data.test[2], method = "max.dist")


## example with mint.splsda
data(stemcells)

#training set
ind.test = which(stemcells$study == "3")
gene.train = stemcells$gene[-ind.test,]
Y.train = stemcells$celltype[-ind.test]
study.train = factor(stemcells$study[-ind.test])

#test set
gene.test = stemcells$gene[ind.test,]
Y.test = stemcells$celltype[ind.test]
study.test = factor(stemcells$study[ind.test])

res = mint.splsda(X = gene.train, Y = Y.train, ncomp = 3, keepX = c(10, 5, 15),
study = study.train)

pred = predict(res, newdata = gene.test, study.test = study.test)

data.frame(Truth = Y.test, prediction = pred$class$max.dist)


## End(Not run)

Example output

Loading required package: MASS
Loading required package: lattice
Loading required package: ggplot2

Loaded mixOmics 6.14.0
Thank you for using mixOmics!
Tutorials: http://mixomics.org
Bookdown vignette: https://mixomicsteam.github.io/Bookdown
Questions, issues: Follow the prompts at http://mixomics.org/contact-us
Cite us:  citation('mixOmics')

       Chins Situps Jumps
indiv1   200     40    60
indiv2   190     45    45
      Y    Prediction
ID204 "18" "48"      
ID220 "24" "6"       
ID312 "48" "48"      
ID314 "6"  "6"       
ID317 "18" "48"      
ID319 "24" "6"       
ID324 "48" "48"      
ID411 "48" "18"      
ID423 "48" "48"      
ID505 "18" "18"      
ID522 "48" "18"      
      Y    Prediction
ID204 "18" "18"      
ID220 "24" "6"       
ID312 "48" "18"      
ID314 "6"  "6"       
ID317 "18" "18"      
ID319 "24" "24"      
ID324 "48" "18"      
ID411 "48" "24"      
ID423 "48" "18"      
ID505 "18" "18"      
ID522 "48" "18"      
Warning message:
In predict.block.spls(res.train, newdata = data.test[2], method = "max.dist") :
  Some blocks are missing in 'newdata'; the prediction is based on the following blocks only: lipid
              Truth prediction.comp1 prediction.comp2 prediction.comp3
sample39       hESC       Fibroblast             hESC             hESC
sample40       hESC       Fibroblast       Fibroblast       Fibroblast
sample41       hESC       Fibroblast       Fibroblast       Fibroblast
sample42 Fibroblast       Fibroblast       Fibroblast       Fibroblast
sample43 Fibroblast       Fibroblast       Fibroblast       Fibroblast
sample44 Fibroblast       Fibroblast       Fibroblast       Fibroblast
sample45       hiPS       Fibroblast       Fibroblast       Fibroblast
sample46       hiPS       Fibroblast       Fibroblast       Fibroblast
sample47       hiPS       Fibroblast       Fibroblast       Fibroblast
sample48       hiPS             hiPS             hESC             hiPS
sample49       hiPS             hiPS             hESC             hESC
sample50       hiPS             hiPS             hESC             hESC
sample51       hESC             hiPS             hiPS             hiPS
sample52       hESC             hiPS             hESC             hiPS
sample53       hESC             hiPS             hiPS             hiPS
sample54       hESC             hiPS             hiPS             hiPS
sample55       hESC             hiPS             hiPS             hiPS
sample56       hiPS             hiPS             hESC             hESC
sample57       hiPS             hiPS             hiPS             hESC
sample58       hiPS             hiPS             hiPS             hiPS
sample59       hiPS             hiPS             hiPS             hiPS

mixOmics documentation built on April 15, 2021, 6:01 p.m.