FusionLearn-package: Fusion Learning

FusionLearn-package    R Documentation

Fusion Learning

Description

The FusionLearn package implements a learning algorithm that integrates information from different experimental platforms. The algorithm applies grouped penalization in a pseudolikelihood setting.

Details

In the context of fusion learning, there are k data sets from k different experimental platforms. The data from each platform can be modeled by a different generalized linear model. Assume the same set of predictors \{M_1,M_2,...,M_j,...,M_p\} is measured on all k platforms.

Platforms  Formula               M_1              M_2              ...    M_j              ...    M_p
1          y_1: g_1(μ_1) \sim    x_{11}β_{11}  +  x_{12}β_{12}  +  ...  + x_{1j}β_{1j}  +  ...  + x_{1p}β_{1p}
2          y_2: g_2(μ_2) \sim    x_{21}β_{21}  +  x_{22}β_{22}  +  ...  + x_{2j}β_{2j}  +  ...  + x_{2p}β_{2p}
...
k          y_k: g_k(μ_k) \sim    x_{k1}β_{k1}  +  x_{k2}β_{k2}  +  ...  + x_{kj}β_{kj}  +  ...  + x_{kp}β_{kp}

Here x_{kj} represents the observation of the predictor M_j on the kth platform, and β^{(j)} = (β_{1j}, β_{2j}, ..., β_{kj}) denotes the vector of regression coefficients for the predictor M_j across the k platforms.

Platforms  \bold{M_j}  \bold{β^{(j)}}
1          x_{1j}      β_{1j}
2          x_{2j}      β_{2j}
...        ...         ...
k          x_{kj}      β_{kj}
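The layout above can be pictured as a k x p coefficient matrix in which column j is β^{(j)}. The following base R sketch only illustrates that indexing. The sizes, the coefficient values and the object names (beta, beta_M3, selected) are hypothetical and are not part of the package.

```r
# Hypothetical sizes: k = 3 platforms, p = 4 predictors.
k <- 3
p <- 4

# Row i holds the coefficients of platform i; column j is beta^{(j)}.
# The numbers are made up for illustration only.
beta <- matrix(
  c(0.5, 0.0, -1.2, 0.0,
    0.7, 0.0, -0.9, 0.0,
    0.4, 0.0, -1.5, 0.0),
  nrow = k, ncol = p, byrow = TRUE,
  dimnames = list(paste0("platform_", 1:k), paste0("M_", 1:p))
)

# beta^{(j)} for predictor M_3: one coefficient per platform.
beta_M3 <- beta[, "M_3"]
print(beta_M3)

# Predictors whose coefficient vector is not entirely zero.
selected <- colnames(beta)[colSums(beta != 0) > 0]
print(selected)
```

Treating each column as one group is what lets a grouped penalty keep or drop a predictor on all platforms at once.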

Consider the following examples.

Example 1. Suppose k different types of experiments are conducted to study the genetic mechanism of a disease. The predictors in this research are different facets of individual genes, such as mRNA expression, protein expression, RNAseq expression and so on. The goal is to select the genes that affect the disease, where each gene is assessed in several ways through different measurement processes across the k experimental platforms.

Example 2. The predictive models for three different financial indices are simultaneously built from a panel of stock index predictors. In this case, the predictor values across different models are the same, but the regression coefficients are different.

In the conventional approach, the model for each of the k platforms is analyzed separately. The FusionLearn algorithm instead selects significant predictors by learning from all k models jointly. The overall objective is to maximize the penalized pseudo log-likelihood:

Q(β) = l_I(β) - n ∑_{j=1}^{p} Ω_{λ_n}(||β^{(j)}||),

with l_I(β) being the pseudo log-likelihood, p being the number of predictors, Ω_{λ_n} being the penalty function, and ||β^{(j)}|| = (∑_{i=1}^{k}β_{ij}^2)^{1/2} denoting the L_2-norm of the coefficients of the predictor M_j.
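As a minimal sketch of the penalty term, the base R code below computes the group norms ||β^{(j)}|| column by column and sums a penalty over them. The helper names (group_norms, lasso_penalty, penalty_term) are hypothetical, and the linear penalty Ω(t) = λt is an assumption chosen for simplicity. It is not necessarily the penalty the package applies.

```r
# Column-wise L2 norms ||beta^{(j)}|| of a k x p coefficient matrix.
group_norms <- function(beta) sqrt(colSums(beta^2))

# Assumed penalty for illustration only: Omega(t) = lambda * t.
lasso_penalty <- function(t, lambda) lambda * t

# Penalty term n * sum_j Omega(||beta^{(j)}||) that enters Q(beta).
penalty_term <- function(beta, n, lambda, omega = lasso_penalty) {
  n * sum(omega(group_norms(beta), lambda))
}

# Toy coefficients: k = 2 platforms (rows), p = 2 predictors (columns).
beta <- matrix(c(3, 0,
                 4, 0), nrow = 2, byrow = TRUE)

norms <- group_norms(beta)
pen <- penalty_term(beta, n = 10, lambda = 0.1)
print(norms)
print(pen)
```

Because the norm is taken over all platforms for one predictor, a coefficient vector that is zero on every platform contributes nothing to the penalty.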

The user can specify the penalty function Ω_{λ_n} and the penalty values λ_n. This package also contains functions to provide the pseudolikelihood Bayesian information criterion:

pseu-BIC(s) = -2l_I(\hat{β}_I;Y) + d_s^{*} γ_n

with l_I(\hat{β}_I; Y) denoting the pseudo log-likelihood evaluated at the estimate \hat{β}_I, d_s^{*} measuring the model complexity and γ_n being the penalty on the model complexity.
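The criterion can be evaluated directly once the three quantities are known. The sketch below is plain arithmetic on the formula above. The function name pseu_bic, the candidate values, and the choice γ_n = log(n) are all assumptions made for illustration, not output from the package.

```r
# pseu-BIC(s) = -2 * l_I + d_s^* * gamma_n
pseu_bic <- function(loglik, d_star, gamma_n) {
  -2 * loglik + d_star * gamma_n
}

# Hypothetical candidate models, one per penalty value lambda.
# loglik and d_star are invented numbers, not fitted results.
candidates <- data.frame(
  lambda = c(0.05, 0.10, 0.20),
  loglik = c(-120.4, -123.1, -140.8),
  d_star = c(9, 5, 2)
)

n <- 100
gamma_n <- log(n)  # assumption: the classical BIC weight

candidates$pseu_bic <- pseu_bic(candidates$loglik, candidates$d_star, gamma_n)

# The candidate with the smallest criterion value is preferred.
best <- candidates[which.min(candidates$pseu_bic), ]
print(best)
```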

The basic function fusionbase deals with continuous responses. The function fusionbinary is applied to binary responses, and the function fusionmixed is applied to a mix of continuous and binary responses.
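The choice among the three functions depends only on the response types. The base R sketch below returns the name of the matching function as a string and does not call the package. The helpers is_binary and pick_fusion_function are hypothetical, and treating a response as binary when it contains only 0 and 1 is an assumption.

```r
# Assumption: a response is binary if its non-missing values are all 0 or 1.
is_binary <- function(y) {
  all(stats::na.omit(y) %in% c(0, 1))
}

# y_list: one response vector per platform.
# Returns the name of the function described above, as a string.
pick_fusion_function <- function(y_list) {
  binary <- vapply(y_list, is_binary, logical(1))
  if (all(binary)) {
    "fusionbinary"
  } else if (!any(binary)) {
    "fusionbase"
  } else {
    "fusionmixed"
  }
}

# Toy responses for two platforms.
y_cont <- list(c(1.2, 0.4, -0.7), c(3.1, 2.2, 0.9))
y_bin  <- list(c(0, 1, 1), c(1, 0, 0))
y_mix  <- list(c(1.2, 0.4, -0.7), c(1, 0, 0))

print(pick_fusion_function(y_cont))
print(pick_fusion_function(y_bin))
print(pick_fusion_function(y_mix))
```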

Note

Here we provide two examples to illustrate the data structures. Assume X_I and X_{II} represent the predictor matrices from two experimental platforms.

Example 1. If the observations from X_I and X_{II} are independent, the number of observations can be different. The order of the predictors \{M_1, M_2, M_3, M_4\} in X_I must match the order of the predictors in X_{II}. If X_{II} does not include the predictor M_3, then the M_3 column in X_{II} needs to be filled with NA.

          M_1    M_2    M_3    M_4                 M_1    M_2    M_3    M_4
X_I =     0.1    0.3    0.5     20     X_{II} =    100      8     NA    100
          0.3    0.1    0.5      7                  30      1     NA      2
          0.1    0.9      1      0                  43     19     NA     -3
         -0.3    1.2      2     40
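The matrices of Example 1 can be written in base R as follows. This is a sketch of the layout only. Collecting the matrices in a list named x, and the helper objects n_obs and missing_pred, are assumptions made for illustration.

```r
# Predictor names shared by both platforms, in the same order.
pred <- paste0("M_", 1:4)

# Platform I: 4 observations, all predictors measured.
X_I <- matrix(c( 0.1, 0.3, 0.5, 20,
                 0.3, 0.1, 0.5,  7,
                 0.1, 0.9, 1.0,  0,
                -0.3, 1.2, 2.0, 40),
              ncol = 4, byrow = TRUE, dimnames = list(NULL, pred))

# Platform II: 3 observations, M_3 not measured, so its column is NA.
X_II <- matrix(c(100,  8, NA, 100,
                  30,  1, NA,   2,
                  43, 19, NA,  -3),
               ncol = 4, byrow = TRUE, dimnames = list(NULL, pred))

# Assumed container: one matrix per platform.
x <- list(X_I, X_II)

# Independent platforms may have different numbers of observations.
n_obs <- sapply(x, nrow)
print(n_obs)

# Predictors that are entirely NA on each platform.
missing_pred <- lapply(x, function(m) colnames(m)[colSums(is.na(m)) == nrow(m)])
print(missing_pred)
```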

Example 2. If the observations from X_I and X_{II} are correlated, the number of observations must be the same. The ith row in X_I is correlated with the ith row in X_{II}. The predictors of X_I and X_{II} must be matched in order. Predictors that are not measured need to be filled with NA.

          M_1    M_2    M_3    M_4                 M_1    M_2    M_3    M_4
X_I =     0.1    0.3    0.5     20     X_{II} =    0.3    0.8     NA    100
          0.3    0.1    0.5     70                 0.2      1     NA     20
         -0.1    0.9      1      0                0.43    1.9     NA    -30
         -0.3    1.2      2     40                -0.4     -2     NA     40
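For correlated platforms, the two requirements stated in Example 2 (equal numbers of rows and matching predictor order) can be checked before fitting. The base R helper check_paired_layout below is hypothetical and only sketches such a check. It is not a function of the package.

```r
pred <- paste0("M_", 1:4)

X_I <- matrix(c( 0.1, 0.3, 0.5, 20,
                 0.3, 0.1, 0.5, 70,
                -0.1, 0.9, 1.0,  0,
                -0.3, 1.2, 2.0, 40),
              ncol = 4, byrow = TRUE, dimnames = list(NULL, pred))

X_II <- matrix(c( 0.3,  0.8, NA, 100,
                  0.2,  1.0, NA,  20,
                  0.43, 1.9, NA, -30,
                 -0.4, -2.0, NA,  40),
               ncol = 4, byrow = TRUE, dimnames = list(NULL, pred))

# Hypothetical helper: TRUE when all matrices have the same number of
# rows and the same column names in the same order.
check_paired_layout <- function(x) {
  same_rows <- length(unique(sapply(x, nrow))) == 1
  same_cols <- all(sapply(x, function(m) identical(colnames(m), colnames(x[[1]]))))
  same_rows && same_cols
}

ok <- check_paired_layout(list(X_I, X_II))

# Dropping one row from X_II breaks the row-wise pairing.
bad <- check_paired_layout(list(X_I, X_II[-1, , drop = FALSE]))

print(ok)
print(bad)
```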

In the functions fusionbase.fit, fusionbinary.fit, and fusionmixed.fit, the option depen specifies whether observations from different platforms are correlated or independent.

Author(s)

Xin Gao, Yuan Zhong and Raymond J Carroll

Maintainer: Yuan Zhong <aqua.zhong@gmail.com>

References

Gao, X. and Carroll, R. J. (2017). Data integration with high dimensionality. Biometrika, 104(2), 251-272.


FusionLearn documentation built on April 25, 2022, 1:05 a.m.