Introduction

This R package is a collection of functions for performing multi-block data integration. Specifically, the functions in this package deal with concatenated data blocks that share the same observation units (i.e., rows; e.g., subjects). For example, genomic data, denoted $\mathbf{X}_{I \times J}$, and behavioral data, $\mathbf{X}_{I \times K}$, collected on the same persons can be concatenated into $\mathbf{X}_C = [\mathbf{X}_{I \times J}\; \mathbf{X}_{I \times K}]$ and thus be jointly analyzed.

Notice that the common and distinctive processes (also referred to as common and distinctive components) are defined with respect to the estimated component loading matrix $\mathbf{P}$. As for the technical details, we refer to:

Data pre-processing

Raw data must be standardized (i.e., pre-processed) before analysis, and for this purpose we provide the function mySTD(). The following paper provides a nice overview of how and why raw data should be pre-processed:
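
As a minimal sketch (the two blocks below are simulated purely for illustration, and we assume mySTD() takes a data matrix and returns the standardized matrix; see ?mySTD):

    library(RegularizedSCA)

    # Simulated example: two data blocks sharing the same 20 rows (subjects).
    set.seed(1)
    block1 <- matrix(rnorm(20 * 5), nrow = 20)  # e.g., a genomic block (I x J)
    block2 <- matrix(rnorm(20 * 3), nrow = 20)  # e.g., a behavioral block (I x K)

    # Standardize each block before analysis.
    X1 <- mySTD(block1)
    X2 <- mySTD(block2)

    # Concatenate the standardized blocks column-wise: X_C = [X1 X2].
    Xc <- cbind(X1, X2)
    Jk <- c(5, 3)  # number of variables per block, used by the functions below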

Identify common and distinctive components

Situation 1: No prior information on common and distinctive components

When no prior information is available, users may first try the function VAF() with various numbers of components. This function provides an overview of the proportion of variance accounted for (VAF) by each component in each block. The idea is to let VAF() do the analysis with an arbitrarily large number of components, say R*. The results of VAF() will then show that only a smaller number is needed (i.e., R << R*), and thus we have found R. summary() is available for summarizing the results of VAF(). For details on VAF(), see:
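
A sketch of this workflow, continuing from the pre-processing example above (the argument order is an assumption based on the package help page; see ?VAF):

    # Run VAF() with a deliberately large number of components (here R* = 6)
    # and inspect the proportion of VAF per component per block.
    result_vaf <- VAF(Xc, Jk, R = 6)
    summary(result_vaf)  # look for the point beyond which extra components add little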

DISCOsca() tries all possible combinations of common and distinctive patterns in the $\mathbf{P}$ matrix. Note that this algorithm uses a specific rule to determine common and distinctive processes: put simply, for each component across the entire concatenated data, the algorithm compares the sums of squares of the loadings per block (weighted by the total variance of the block) to decide whether the component is common or distinctive. summary() is available for summarizing the results of DISCOsca(). For technical details, see
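
For example, continuing with Xc and Jk from above (argument names as we recall them from the help page; see ?DISCOsca):

    # Exhaustively evaluate common/distinctive patterns for R = 3 components.
    result_disco <- DISCOsca(Xc, R = 3, Jk = Jk)
    summary(result_disco)  # reports the estimated common/distinctive structure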

pca_gca() identifies common and distinctive components in a two-stage procedure: first, perform a principal component analysis on each data block, and then perform a generalized canonical analysis on the component scores of all the data blocks. In the case of more than two data blocks, users may need to apply the function repeatedly to see, for example, whether some of the blocks (but not all) share common components. For technical details, see
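
A sketch (pca_gca() may prompt for the number of components to retain per block; see ?pca_gca for its full set of arguments):

    # Stage 1: PCA per block; Stage 2: generalized canonical analysis across blocks.
    result_pg <- pca_gca(Xc, Jk = Jk)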

sparseSCA() is an algorithm for SCA models with a Lasso penalty and a Group Lasso penalty. The algorithm can identify common and distinctive components, provided that proper tuning parameters for the Lasso and Group Lasso are chosen. Users may use cv_sparseSCA() to find suitable values for the Lasso and Group Lasso tuning parameters and use plot() to inspect the cross-validation plot. maxLGlasso() helps to identify the maximum values for the Lasso and Group Lasso tuning parameters (that is, the smallest Lasso and Group Lasso penalties that generate $\mathbf{P}=\mathbf{0}$). Also note that sparseSCA() incorporates a multi-start procedure to deal with the local minima problem. summary() is available for summarizing the results of cv_sparseSCA() and sparseSCA(). For technical details, see
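
A sketch of the full tuning workflow, continuing from above (argument names such as LASSO and GROUPLASSO are assumptions based on the help pages; see ?maxLGlasso, ?cv_sparseSCA, and ?sparseSCA):

    # Smallest penalty values that force P = 0; these bound the tuning grids.
    maxpen <- maxLGlasso(Xc, Jk, R = 3)

    # Cross-validate over Lasso and Group Lasso values.
    result_cv <- cv_sparseSCA(Xc, Jk, R = 3)
    plot(result_cv)      # cross-validation plot
    summary(result_cv)

    # Final model with chosen tuning parameters (the values below are
    # placeholders; take them from the cross-validation results).
    result_s <- sparseSCA(Xc, Jk, R = 3, LASSO = 0.2, GROUPLASSO = 0.2)
    summary(result_s)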

Finally, the shrinkage of the non-zero loadings of the estimated $\mathbf{P}$ matrix can be undone by undoShrinkage(), and its results can be summarized by summary().
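
For example (the name of the sparse loading matrix returned by sparseSCA(), Pmatrix, is an assumption here; see ?undoShrinkage):

    # Re-estimate the retained (non-zero) loadings without penalty shrinkage.
    result_final <- undoShrinkage(Xc, R = 3, Phat = result_s$Pmatrix)
    summary(result_final)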

Situation 2: Prior information on common and distinctive components is available

This situation can happen, for example, when previous research has already provided some information on the common/distinctive processes. In this case, researchers may want to specify a particular structure in the estimated $\mathbf{P}$ matrix; that is, some elements in $\mathbf{P}$ are fixed at zero.

To this end, users may want to use the function structuredSCA() or cv_structuredSCA(). structuredSCA() allows for flexibly estimating $\mathbf{P}$ given a pre-defined structure in $\mathbf{P}$. The algorithm incorporates a Lasso penalty to achieve sparseness (and thus the Group Lasso is dropped). To identify a suitable range of Lasso tuning parameters, users are advised to try cv_structuredSCA(). summary() is available for summarizing the results of cv_structuredSCA() and structuredSCA(). Furthermore, plot() is available for showing the cross-validation plot for cv_structuredSCA(). For technical details, see
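
A sketch, assuming the target structure is coded as a blocks-by-components matrix of 0/1 entries (see ?structuredSCA and ?cv_structuredSCA for the exact coding and arguments):

    # Prior knowledge for two blocks and R = 3 components: block 1 loads on
    # components 1 and 2, block 2 on components 1 and 3, so component 1 is
    # common and components 2 and 3 are distinctive.
    Target <- matrix(c(1, 1, 0,
                       1, 0, 1),
                     nrow = 2, byrow = TRUE)  # rows = blocks, columns = components

    # Find a suitable Lasso tuning parameter given the fixed structure.
    result_cvst <- cv_structuredSCA(Xc, Jk, R = 3, Target = Target)
    plot(result_cvst)
    summary(result_cvst)

    # Final fit with a chosen Lasso value (placeholder).
    result_st <- structuredSCA(Xc, Jk, R = 3, Target = Target, LASSO = 0.1)
    summary(result_st)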


