Introduction

This R package is a collection of functions for performing multi-block data integration. Specifically, the functions in this package deal with concatenated data blocks that share the same observation units (i.e., rows; e.g., subjects). For example, genomic data, denoted $\mathbf{X}_{I \times J}$, and behavioral data, $\mathbf{X}_{I \times K}$, collected on the same persons can be concatenated into $\mathbf{X}_C = [\mathbf{X}_{I \times J}\; \mathbf{X}_{I \times K}]$ and thus be jointly analyzed.

Notice that the common and distinctive processes (also referred to as common and distinctive components) are defined with respect to the estimated component loading matrix $\mathbf{P}$. As for the technical details, we refer to:

Data pre-processing

Raw data must be standardized (i.e., pre-processed) before analysis, and for this purpose we provide the function mySTD(). The following paper provides a nice overview of how and why raw data should be pre-processed:
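
As a minimal sketch (the two blocks below are simulated purely for illustration, and we assume mySTD() takes a data matrix and returns the standardized matrix; see ?mySTD):

    library(RegularizedSCA)

    # Simulated example: two data blocks sharing the same 20 rows (subjects).
    set.seed(1)
    block1 <- matrix(rnorm(20 * 5), nrow = 20)  # e.g., a genomic block (I x J)
    block2 <- matrix(rnorm(20 * 3), nrow = 20)  # e.g., a behavioral block (I x K)

    # Standardize each block before analysis.
    X1 <- mySTD(block1)
    X2 <- mySTD(block2)

    # Concatenate the standardized blocks column-wise: X_C = [X1 X2].
    Xc <- cbind(X1, X2)
    Jk <- c(5, 3)  # number of variables per block, used by the functions below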

Identify common and distinctive components

Situation 1: No prior information on common and distinctive components

When no prior information is available, users may first try the function VAF() with various numbers of components. This function provides an overview of the proportion of variance accounted for (VAF) by each component in each block. The idea is to let VAF() do the analysis with an arbitrarily large number of components, say R*. The results of VAF() will then show that only a smaller number is needed (i.e., R << R*), and thus we have found R. summary() is available for summarizing the results of VAF(). For details on VAF(), see:
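
A sketch of this workflow, continuing from the pre-processing example above (the argument order is an assumption based on the package help page; see ?VAF):

    # Run VAF() with a deliberately large number of components (here R* = 6)
    # and inspect the proportion of VAF per component per block.
    result_vaf <- VAF(Xc, Jk, R = 6)
    summary(result_vaf)  # look for the point beyond which extra components add little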

DISCOsca() tries all possible combinations of common and distinctive patterns in the $\mathbf{P}$ matrix. Note that this algorithm uses a specific rule to determine common and distinctive processes: put simply, for each component across the entire concatenated data, the algorithm compares the sums of squares of the loadings per block (weighted by the total variance of the block) to decide whether the component is common or distinctive. summary() is available for summarizing the results of DISCOsca(). For technical details, see
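
For example, continuing with Xc and Jk from above (argument names as we recall them from the help page; see ?DISCOsca):

    # Exhaustively evaluate common/distinctive patterns for R = 3 components.
    result_disco <- DISCOsca(Xc, R = 3, Jk = Jk)
    summary(result_disco)  # reports the estimated common/distinctive structure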

pca_gca() identifies common and distinctive components in a two-stage procedure: first, perform a principal component analysis on each data block, and then perform a generalized canonical analysis on the component scores of all the data blocks. In the case of more than two data blocks, users may need to apply the function repeatedly to see, for example, whether some of the blocks (but not all) share common components. For technical details, see
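
A sketch (pca_gca() may prompt for the number of components to retain per block; see ?pca_gca for its full set of arguments):

    # Stage 1: PCA per block; Stage 2: generalized canonical analysis across blocks.
    result_pg <- pca_gca(Xc, Jk = Jk)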

sparseSCA() is an algorithm for SCA models with a Lasso penalty and a Group Lasso penalty. The algorithm can identify common and distinctive components, provided that proper tuning parameters for the Lasso and Group Lasso are chosen. Users may use cv_sparseSCA() to find suitable values for the Lasso and Group Lasso tuning parameters and use plot() to inspect the cross-validation plot. maxLGlasso() helps to identify the maximum values for the Lasso and Group Lasso tuning parameters (that is, the smallest Lasso and Group Lasso penalties that generate $\mathbf{P}=\mathbf{0}$). Also note that sparseSCA() incorporates a multi-start procedure to deal with the local minima problem. summary() is available for summarizing the results of cv_sparseSCA() and sparseSCA(). For technical details, see
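
A sketch of the full tuning workflow, continuing from above (argument names such as LASSO and GROUPLASSO are assumptions based on the help pages; see ?maxLGlasso, ?cv_sparseSCA, and ?sparseSCA):

    # Smallest penalty values that force P = 0; these bound the tuning grids.
    maxpen <- maxLGlasso(Xc, Jk, R = 3)

    # Cross-validate over Lasso and Group Lasso values.
    result_cv <- cv_sparseSCA(Xc, Jk, R = 3)
    plot(result_cv)      # cross-validation plot
    summary(result_cv)

    # Final model with chosen tuning parameters (the values below are
    # placeholders; take them from the cross-validation results).
    result_s <- sparseSCA(Xc, Jk, R = 3, LASSO = 0.2, GROUPLASSO = 0.2)
    summary(result_s)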

Finally, the shrinkage of the non-zero loadings of the estimated $\mathbf{P}$ matrix can be undone by undoShrinkage(), and its results can be summarized by summary().
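
For example (the name of the sparse loading matrix returned by sparseSCA(), Pmatrix, is an assumption here; see ?undoShrinkage):

    # Re-estimate the retained (non-zero) loadings without penalty shrinkage.
    result_final <- undoShrinkage(Xc, R = 3, Phat = result_s$Pmatrix)
    summary(result_final)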

Situation 2: Prior information on common and distinctive components is available

This situation can happen, for example, when previous research has already provided some information on the common/distinctive processes. In this case, researchers may want to specify a particular structure in the estimated $\mathbf{P}$ matrix; that is, some elements in $\mathbf{P}$ are fixed at zero.

To this end, users may want to use the function structuredSCA() or cv_structuredSCA(). structuredSCA() allows for flexibly estimating $\mathbf{P}$ given a pre-defined structure in $\mathbf{P}$. The algorithm incorporates a Lasso penalty to achieve sparseness (and thus the Group Lasso is dropped). To identify a suitable range of Lasso tuning parameters, users are advised to try cv_structuredSCA(). summary() is available for summarizing the results of cv_structuredSCA() and structuredSCA(). Furthermore, plot() is available for showing the cross-validation plot for cv_structuredSCA(). For technical details, see
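
A sketch, assuming the target structure is coded as a blocks-by-components matrix of 0/1 entries (see ?structuredSCA and ?cv_structuredSCA for the exact coding and arguments):

    # Prior knowledge for two blocks and R = 3 components: block 1 loads on
    # components 1 and 2, block 2 on components 1 and 3, so component 1 is
    # common and components 2 and 3 are distinctive.
    Target <- matrix(c(1, 1, 0,
                       1, 0, 1),
                     nrow = 2, byrow = TRUE)  # rows = blocks, columns = components

    # Find a suitable Lasso tuning parameter given the fixed structure.
    result_cvst <- cv_structuredSCA(Xc, Jk, R = 3, Target = Target)
    plot(result_cvst)
    summary(result_cvst)

    # Final fit with a chosen Lasso value (placeholder).
    result_st <- structuredSCA(Xc, Jk, R = 3, Target = Target, LASSO = 0.1)
    summary(result_st)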


