
Integration of multiple data sets measured on the same samples or observations to classify a discrete outcome and select features from each data set, i.e. N-integration with sparse Discriminant Analysis. The method is partly based on Generalised Canonical Correlation Analysis.


`X` |
A list of data sets (called 'blocks') measured on the same samples. Data in the list should be arranged in matrices, samples x variables, with samples order matching in all data sets. |

`Y` |
A factor or a class vector indicating the discrete outcome of each sample. |

`indY` |
To be supplied if Y is missing; indicates the position of the factor / class vector outcome in the list X. |

`ncomp` |
The number of components to include in the model. Defaults to 2. Applies to all blocks. |

`keepX` |
A list of same length as X. Each entry is the number of variables to select in each of the blocks of X for each component. By default all variables are kept in the model. |

`design` |
numeric matrix of size (number of blocks in X) x (number of blocks in X) with values between 0 and 1. Each value indicates the strength of the relationship to be modelled between two blocks; a value of 0 indicates no relationship, 1 is the maximum value. If |

`scheme` |
Either "horst", "factorial" or "centroid". Default = |

`mode` |
character string. What type of algorithm to use, (partially) matching one of "regression", "canonical", "invariant" or "classic" (see Details). |

`scale` |
boolean. If scale = TRUE, each block is standardized to zero mean and unit variance. Default = |

`init` |
Mode of initialization used in the algorithm, either by Singular Value Decomposition of the product of each block of X with Y ("svd") or each block independently ("svd.single"). Default = |

`tol` |
Convergence stopping value. |

`max.iter` |
integer, the maximum number of iterations. |

`near.zero.var` |
boolean, see the internal |

`all.outputs` |
boolean. Computation can be faster when some specific (and non-essential) outputs are not calculated. Default = |
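As a minimal sketch of how the `design` and `keepX` arguments described above might be prepared, the base R snippet below builds a weak design (off-diagonal values of 0.1) and a matching `keepX` list. The block names and the numbers are illustrative assumptions, not recommended settings.

```r
# Hypothetical block names, used only for illustration
block_names <- c("mrna", "mirna", "protein")

# Square design matrix: one row and one column per block.
# 0.1 models a weak relationship between every pair of blocks;
# the diagonal is 0 because a block is not linked to itself.
design <- matrix(0.1,
                 nrow = length(block_names), ncol = length(block_names),
                 dimnames = list(block_names, block_names))
diag(design) <- 0

# keepX: one entry per block, one value per component (here 2 components)
keepX <- list(mrna = c(20, 20), mirna = c(10, 10), protein = c(10, 10))

design
```

Values closer to 1 ask the model to favour correlation between blocks, while values closer to 0 leave more room for discriminating the outcome; the right trade-off depends on the data.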

The `block.splsda` function fits a horizontal integration PLS-DA model with a specified number of components per block. A factor indicating the discrete outcome needs to be provided, either by `Y` or by its position `indY` in the list of blocks `X`.

`X` can contain missing values. Missing values are handled by being disregarded during the cross product computations in the algorithm `block.pls`, without having to delete rows with missing data. Alternatively, missing data can be imputed beforehand using the `nipals` function.
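To make the idea of "disregarding" missing values concrete, here is a small base R sketch of a cross product that skips `NA` cells instead of dropping whole rows. It is an illustration only: the helper `crossprod_skip_na` is hypothetical, and the package's internal computation may differ in its details.

```r
# Cross product t(X) %*% u in which missing cells of X contribute nothing.
# Hypothetical helper for illustration, not part of mixOmics.
crossprod_skip_na <- function(X, u) {
  X0 <- X
  X0[is.na(X0)] <- 0          # a missing cell adds 0 to its column sum
  drop(crossprod(X0, u))      # returns one value per column of X
}

# 4 samples x 2 variables, with one missing value in each column
X <- matrix(c(1, 2, NA, 4,
              5, NA, 7, 8), nrow = 4)
u <- c(1, 1, 1, 1)

crossprod_skip_na(X, u)
```

All four samples stay in the computation; only the individual missing cells are left out of the sums.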

The type of algorithm to use is specified with the `mode` argument. Four PLS algorithms are available: PLS regression (`"regression"`), PLS canonical analysis (`"canonical"`), redundancy analysis (`"invariant"`) and the classical PLS algorithm (`"classic"`) (see References and `?pls` for more details).

Note that our method is partly based on sparse Generalised Canonical Correlation Analysis and differs from the MB-PLS approaches proposed by Kowalski et al., 1989, J Chemom 3(1), Westerhuis et al., 1998, J Chemom, 12(5) and the sparse variants of Li et al., 2012, Bioinformatics 28(19); Karaman et al., 2014, Metabolomics, 11(2); Kawaguchi et al., 2017, Biostatistics.

Variable selection is performed on each component for each block of `X` if specified, via the input parameter `keepX`.
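One common way to obtain a loading vector with a fixed number of non-zero entries is soft-thresholding, as used in sparse PLS methods. The base R sketch below illustrates that idea under the assumption that the threshold is set from the ranked absolute loadings; the function `soft_threshold_keep` is hypothetical and is not the package's internal code.

```r
# Keep the `keep` largest (in absolute value) entries of a loading vector by
# soft-thresholding, then rescale to unit norm. Hypothetical helper.
soft_threshold_keep <- function(loading, keep) {
  p <- length(loading)
  if (keep >= p) {
    return(loading / sqrt(sum(loading^2)))   # nothing to remove
  }
  # Threshold = (keep + 1)-th largest absolute value, so that exactly
  # `keep` entries stay non-zero when there are no ties at the threshold
  thresh <- sort(abs(loading), decreasing = TRUE)[keep + 1]
  shrunk <- sign(loading) * pmax(abs(loading) - thresh, 0)
  shrunk / sqrt(sum(shrunk^2))
}

loading <- c(0.9, -0.5, 0.1, 0.05, -0.3)
sparse  <- soft_threshold_keep(loading, keep = 2)
sparse
which(sparse != 0)   # positions of the selected variables
```

With ties at the threshold, fewer than `keep` entries can remain non-zero, which is one reason a real implementation needs more care than this sketch.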

`block.splsda` returns an object of class `"block.splsda", "block.spls"`, a list that contains the following components:

`X` |
the centered and standardized original predictor matrix. |

`indY` |
the position of the outcome Y in the output list X. |

`ncomp` |
the number of components included in the model for each block. |

`mode` |
the algorithm used to fit the model. |

`keepX` |
Number of variables used to build each component of each block |

`variates` |
list containing the variates of each block of X. |

`loadings` |
list containing the estimated loadings for the variates. |

`names` |
list containing the names to be used for individuals and variables. |

`nzv` |
list containing the zero- or near-zero predictors information. |

`iter` |
Number of iterations of the algorithm for each component |

`weights` |
Correlation between the variate of each block and the variate of the outcome. Used to weight predictions. |

`explained_variance` |
Percentage of explained variance for each component and each block |
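The components listed above can be inspected directly on a fitted object. The sketch below assumes the mixOmics package and its `breast.TCGA` data are installed, fits a small two-block model with arbitrary `keepX` values, and counts the non-zero loadings per block and component; the object names `blocks`, `fit` and `n_selected` are illustrative.

```r
# Runs only if mixOmics is available; otherwise the block is skipped
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)
  data("breast.TCGA")

  blocks <- list(mrna = breast.TCGA$data.train$mrna,
                 protein = breast.TCGA$data.train$protein)

  # keepX values are arbitrary, chosen only to keep the example small
  fit <- block.splsda(X = blocks, Y = breast.TCGA$data.train$subtype,
                      ncomp = 2,
                      keepX = list(mrna = c(20, 20), protein = c(10, 10)))

  fit$keepX      # number of variables requested per block and component
  fit$weights    # block weights used when combining predictions

  # Number of non-zero loadings per component (rows) and block (columns)
  n_selected <- vapply(fit$loadings[names(blocks)],
                       function(l) colSums(l != 0), numeric(2))
  n_selected
}
```

The counts in `n_selected` should not exceed the corresponding `keepX` values, which gives a quick check that the sparsity settings were applied as intended.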

Florian Rohart, Benoit Gautier, Kim-Anh Lê Cao

On multiple integration with sPLS-DA and 4 data blocks:

Singh A., Gautier B., Shannon C., Vacher M., Rohart F., Tebbutt S. and Lê Cao K.A. (2016). DIABLO: multi omics integration for biomarker discovery. BioRxiv available here: http://biorxiv.org/content/early/2016/08/03/067611

On data integration:

Tenenhaus A., Philippe C., Guillemot V, Lê Cao K.A., Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. *Biostatistics*. kxu001

Gunther O., Shin H., Ng R. T. , McMaster W. R., McManus B. M. , Keown P. A. , Tebbutt S.J. , Lê Cao K-A. , (2014) Novel multivariate methods for integration of genomics and proteomics data: Applications in a kidney transplant rejection study, OMICS: A journal of integrative biology, 18(11), 682-95.

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

`plotIndiv`, `plotArrow`, `plotLoadings`, `plotVar`, `predict`, `perf`, `selectVar`, `block.plsda`, `block.spls` and http://www.mixOmics.org/mixDIABLO for more details and examples.

```
library(mixOmics)

# block.splsda
# -------------
data("breast.TCGA")

# X data as a named list of blocks: mRNA, miRNA and proteins
data = list(mrna = breast.TCGA$data.train$mrna,
            mirna = breast.TCGA$data.train$mirna,
            protein = breast.TCGA$data.train$protein)

# set up a full design where every block is connected to every other block
design = matrix(1, ncol = length(data), nrow = length(data),
                dimnames = list(names(data), names(data)))
diag(design) = 0
design

# set the number of components (applies to all blocks)
ncomp = 2

# set the number of variables to select, per component and per block
# (these values are set arbitrarily)
list.keepX = list(mrna = rep(20, 2), mirna = rep(10, 2), protein = rep(10, 2))

TCGA.block.splsda = block.splsda(X = data, Y = breast.TCGA$data.train$subtype,
                                 ncomp = ncomp, keepX = list.keepX, design = design)
TCGA.block.splsda

# sample plot, one panel per block
plotIndiv(TCGA.block.splsda, ind.names = FALSE)

# illustrates coefficient weights in each block
plotLoadings(TCGA.block.splsda, ncomp = 1, contrib = 'max')
plotVar(TCGA.block.splsda, style = 'graphics', legend = TRUE)
```
