@Atsushi Kawaguchi
The msma package provides functions for a matrix decomposition method incorporating sparse and supervised modeling for multiblock multivariable data analysis.
Install package (as necessary)
if(!require("msma")) install.packages("msma")
Load package
library(msma)
Simulated multiblock data (list data) are generated with the function simdata.
The sample size is 50 and the correlation coefficient is 0.8.
The numbers of columns for the response and the predictor are specified by the arguments Yps and Xps, respectively.
The length of each vector gives the number of blocks.
That is, the response has three blocks with 3, 4, and 5 columns, and the predictor has one block with 3 columns.
dataset0 = simdata(n = 50, rho = 0.8, Yps = c(3, 4, 5), Xps = 3, seed=1)
X0 = dataset0$X; Y0 = dataset0$Y
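To make the list structure concrete, the following minimal sketch builds a list of matrices with the same shape as the response described above. It uses only base R and independent normal noise, so it does not reproduce how simdata generates correlated data; the names `n`, `Yps`, and `Y_blocks` are chosen here purely for illustration.

```r
# Illustrative only: mimics the *shape* of the response list, not simdata itself.
set.seed(1)
n <- 50                 # sample size
Yps <- c(3, 4, 5)       # one entry per block; each entry is a column count

# One n x p matrix per block, collected in a list
Y_blocks <- lapply(Yps, function(p) matrix(rnorm(n * p), nrow = n, ncol = p))

length(Y_blocks)        # number of blocks
sapply(Y_blocks, dim)   # rows and columns of each block
```

The length of `Yps` determines the length of the list, and each element of `Yps` determines the number of columns of the corresponding matrix.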
The data generated here is applied to the msma function.
The argument comp can specify the number of components.
The arguments lambdaX and lambdaY can specify the regularization parameters for X and Y, respectively.
First, we set comp=1, which will perform an analysis with 1 component.
fit01 = msma(X0, Y0, comp=1, lambdaX=0.05, lambdaY=1:3)
fit01
The plot function is available.
By default, the block weights are displayed as a bar plot.
plot(fit01)
Next, we set comp=2, which will perform an analysis with 2 components.
fit02 = msma(X0, Y0, comp=2, lambdaX=0.03, lambdaY=0.01*(1:3))
fit02
Two matrices are prepared by specifying the arguments Yps and Xps.
dataset1 = simdata(n = 50, rho = 0.8, Yps = 5, Xps = 5, seed=1)
X1 = dataset1$X[[1]]; Y1 = dataset1$Y
If input is a matrix, a principal component analysis is implemented.
(fit111 = msma(X1, comp=5))
The weight (loading) vectors can be obtained as follows.
fit111$wbX
The bar plots of the weight vectors are provided by the function plot.
The component number is specified by the argument axes.
The plot type is selected by the argument plottype.
Furthermore, since this function uses the barplot function built into R, its arguments are also available.
In the following example, the magnification of the variable names on the horizontal axis is set to 0.7 by cex.names=0.7, and the variable names are drawn perpendicular to the axis by las=2.
par(mfrow=c(1,2))
plot(fit111, axes = 1, plottype="bar", cex.names=0.7, las=2)
plot(fit111, axes = 2, plottype="bar", cex.names=0.7, las=2)
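As a base R illustration of what these two barplot arguments do, independent of msma, the sketch below draws a hypothetical weight vector `w` with long variable names. The vector and its names are invented for this example.

```r
# Hypothetical weights with long names, drawn with base R's barplot()
w <- c(0.6, -0.2, 0.5, 0.1, -0.4)
names(w) <- paste0("variable_", 1:5)

# cex.names shrinks the bar labels; las = 2 draws labels perpendicular to the axis
mids <- barplot(w, cex.names = 0.7, las = 2)
```

barplot returns the bar midpoints, stored here in `mids`, which can be reused to add annotations at the bar positions.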
The score vectors for the first six subjects are as follows.
lapply(fit111$sbX, head)
Scatter plots of the score vectors are drawn by setting the argument v to "score".
The argument axes is a vector of length two that specifies which components are displayed.
par(mfrow=c(1,2))
plot(fit111, v="score", axes = 1:2, plottype="scatter")
plot(fit111, v="score", axes = 2:3, plottype="scatter")
When the argument v is specified as "cpev", the cumulative eigenvalues are plotted.
par(mfrow=c(1,2))
plot(fit111, v="cpev", ylim=c(0.7, 1))
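For comparison, a cumulative proportion of explained variance can be computed by hand from a base R PCA. This sketch uses prcomp on hypothetical random data; whether it matches the quantity plotted for "cpev" exactly is an assumption, not something stated in this document.

```r
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5)   # hypothetical data

pca <- prcomp(X, scale. = TRUE)
eigenvalues <- pca$sdev^2                          # variance of each component
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues) # cumulative share of variance

round(cum_prop, 3)
plot(cum_prop, type = "b", xlab = "Component", ylab = "Cumulative proportion")
```

The curve is increasing and reaches 1 at the last component, which is why a restricted vertical range such as ylim=c(0.7, 1) can make the differences between components easier to see.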
The R function prcomp also implements PCA.
(fit1112 = prcomp(X1, scale=TRUE))
summary(fit1112)
This Rotation is almost the same as the output of msma, but it can be made closer by setting the argument ceps as follows.
fit1113 = msma(X1, comp=5, ceps=0.0000001)
fit1113$wbX
Plotting the scores with the signs flipped shows that similar scores are calculated.
par(mfrow=c(1,2))
biplot(fit1112)
plot(-fit1113$sbX[[1]][,1:2],xlab="Component 1",ylab="Component 2")
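The sign flip is needed because principal components are only determined up to sign. The following base R sketch, on hypothetical random data, compares the loadings from prcomp with the eigenvectors of the correlation matrix and shows that flipping loadings and scores together leaves the fitted values unchanged. It does not involve msma.

```r
set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)   # hypothetical data

pca <- prcomp(X, scale. = TRUE)   # PCA via singular value decomposition
eig <- eigen(cor(X))              # PCA via eigen decomposition

# The two sets of loadings agree once signs are ignored
max_abs_diff <- max(abs(abs(pca$rotation) - abs(eig$vectors)))
max_abs_diff

# Flipping the sign of loadings and scores together gives the same fit
fit_original <- pca$x %*% t(pca$rotation)
fit_flipped  <- (-pca$x) %*% t(-pca$rotation)
all.equal(fit_original, fit_flipped)
```

Two implementations can therefore return components with opposite signs while describing the same decomposition.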
The ggfortify package is also available for the PCA plot.
If lambdaX (>0) is specified, a sparse principal component analysis is implemented.
(fit112 = msma(X1, comp=5, lambdaX=0.1))
par(mfrow=c(1,2))
plot(fit112, axes = 1, plottype="bar", las=2)
plot(fit112, axes = 2, plottype="bar", las=2)
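To give some intuition for how a regularization parameter can produce sparse weights, the sketch below applies soft-thresholding, a common device for L1-type penalties, to a hypothetical weight vector. The exact update used inside msma, and how lambdaX is scaled there, is not described in this document, so this is an assumption-laden illustration rather than the package's algorithm; `soft_threshold` is a helper defined here.

```r
# Soft-thresholding: shrink each weight toward zero by lambda, and set
# weights whose absolute value is below lambda exactly to zero.
soft_threshold <- function(w, lambda) sign(w) * pmax(abs(w) - lambda, 0)

w <- c(0.70, -0.05, 0.40, 0.02, -0.60)      # hypothetical dense weights
w_sparse <- soft_threshold(w, lambda = 0.1)  # small weights become zero

# Rescale to unit length, as is usual for weight (loading) vectors
w_unit <- w_sparse / sqrt(sum(w_sparse^2))

w_sparse
w_unit
```

A larger `lambda` zeroes out more entries, which corresponds to fewer variables contributing to the component.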
The outcome Z is generated.
set.seed(1); Z = rbinom(50, 1, 0.5)
If the outcome Z is specified, a supervised sparse principal component analysis is implemented.
(fit113 = msma(X1, Z=Z, comp=5, lambdaX=0.02))
par(mfrow=c(1,2))
plot(fit113, axes = 1, plottype="bar", las=2)
plot(fit113, axes = 2, plottype="bar", las=2)
If another input Y1 is specified, partial least squares (PLS) is implemented.
(fit121 = msma(X1, Y1, comp=2))
The component number is specified by the argument axes.
When the argument XY is specified as "XY", scatter plots of the Y score against the X score are drawn.
par(mfrow=c(1,2))
plot(fit121, axes = 1, XY="XY")
plot(fit121, axes = 2, XY="XY")
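For readers who want to see where X and Y scores come from, the following base R sketch computes a first pair of PLS-type weights from the singular value decomposition of the cross-product matrix of two centered data sets. This is a textbook construction on hypothetical data; it is not claimed to be the procedure that msma runs internally.

```r
set.seed(1)
n <- 50
# Hypothetical centered data; Y is built to be related to the first 3 columns of X
X <- scale(matrix(rnorm(n * 5), n, 5), scale = FALSE)
Y <- scale(X[, 1:3] + matrix(rnorm(n * 3, sd = 0.5), n, 3), scale = FALSE)

sv <- svd(crossprod(X, Y))   # SVD of t(X) %*% Y
wx <- sv$u[, 1]              # X weights (unit length)
wy <- sv$v[, 1]              # Y weights (unit length)

score_x <- X %*% wx          # X score
score_y <- Y %*% wy          # Y score

plot(score_x, score_y, xlab = "X score", ylab = "Y score")
```

With this choice of weights, the inner product of the two scores equals the largest singular value, which is the sense in which the first pair of scores is maximally associated.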
If lambdaX and lambdaY are specified, a sparse PLS is implemented.
(fit122 = msma(X1, Y1, comp=2, lambdaX=0.5, lambdaY=0.5))
par(mfrow=c(1,2))
plot(fit122, axes = 1, XY="XY")
plot(fit122, axes = 2, XY="XY")
If the outcome Z is specified, a supervised sparse PLS is implemented.
(fit123 = msma(X1, Y1, Z, comp=2, lambdaX=0.5, lambdaY=0.5))
par(mfrow=c(1,2))
plot(fit123, axes = 1, XY="XY")
plot(fit123, axes = 2, XY="XY")
Multiblock data are given as a list of data matrices.
dataset2 = simdata(n = 50, rho = 0.8, Yps = c(2, 3), Xps = c(3, 4), seed=1)
X2 = dataset2$X; Y2 = dataset2$Y
The input class is list.
class(X2)
The list length is 2 for 2 blocks.
length(X2)
The dimensions of each data matrix in the list are as follows.
lapply(X2, dim)
The function msma is applied to this list X2 as follows.
(fit211 = msma(X2, comp=1))
The bar plots for the block and super weights (loadings) are selected by the argument block.
par(mfrow=c(1,2))
plot(fit211, axes = 1, plottype="bar", block="block", las=2)
plot(fit211, axes = 1, plottype="bar", block="super")
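The distinction between block weights and super weights can be illustrated with a simple two-level construction in base R. The sketch assumes that each block is first summarized by its own score and that the super level then combines the block scores with one weight per block. This is a simplified hierarchical PCA on hypothetical data, not the msma estimation procedure, and all object names are invented for this example.

```r
set.seed(1)
n <- 50
# Hypothetical two-block data with 3 and 4 variables
blocks <- list(matrix(rnorm(n * 3), n, 3), matrix(rnorm(n * 4), n, 4))

# Block level: first principal component of each block
block_pca <- lapply(blocks, prcomp, scale. = TRUE)
block_weights <- lapply(block_pca, function(p) p$rotation[, 1])  # one weight per variable
block_scores <- sapply(block_pca, function(p) p$x[, 1])          # n x 2 matrix

# Super level: one weight per block, combining the block scores
super_pca <- prcomp(block_scores)
super_weights <- super_pca$rotation[, 1]
super_score <- scale(block_scores, scale = FALSE) %*% super_weights

lengths(block_weights)   # number of block weights in each block
length(super_weights)    # number of super weights
```

In this construction there are as many block weights as variables in each block, and as many super weights as blocks, which matches the layout of the two bar plots.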
If lambdaX is specified as a vector of length 2 (the same as the number of blocks), a multiblock sparse PCA is implemented.
(fit212 = msma(X2, comp=1, lambdaX=c(0.5, 0.5)))
The bar plots for the block and super weights (loadings) are as follows.
par(mfrow=c(1,2))
plot(fit212, axes = 1, plottype="bar", block="block", las=2)
plot(fit212, axes = 1, plottype="bar", block="super")
If the outcome Z is specified, a supervised analysis is implemented.
(fit213 = msma(X2, Z=Z, comp=1, lambdaX=c(0.5, 0.5)))
par(mfrow=c(1,2))
plot(fit213, axes = 1, plottype="bar", block="block", las=2)
plot(fit213, axes = 1, plottype="bar", block="super")
A vector of length 2 can be given to the comp argument to perform the nested component analysis,
which is a method to consider multiple components even in the super component.
The first element of the vector corresponds to the number of block components and
the second element corresponds to the number of (nested) super components.
(fit214 = msma(X2, comp=c(2,3)))
In this example, there are 2 block components and 3 super components.
fit214$wbX
For the block weights, the number of blocks is 2 because there are two data matrices, and the matrices have 3 and 4 rows, the number of variables in each block.
Each matrix has 2 columns, the number of block components given by the first element of the vector passed to the comp argument.
par(mfrow=c(1,2))
plot(fit214, axes = 1, axes2 = 1, plottype="bar", block="block", las=2)
plot(fit214, axes = 2, axes2 = 1, plottype="bar", block="block", las=2)
fit214$wsX
par(mfrow=c(2,3))
for(j in 1:2) for(i in 1:3) plot(fit214, axes = j, axes2 = i, plottype="bar", block="super")
If another input (list) Y2 is specified, partial least squares is implemented.
(fit221 = msma(X2, Y2, comp=1))
par(mfrow=c(1,2))
plot(fit221, axes = 1, plottype="bar", block="block", XY="X", las=2)
plot(fit221, axes = 1, plottype="bar", block="super", XY="X")
par(mfrow=c(1,2))
plot(fit221, axes = 1, plottype="bar", block="block", XY="Y", las=2)
plot(fit221, axes = 1, plottype="bar", block="super", XY="Y")
The regularized parameters lambdaX and lambdaY are specified as vectors whose lengths equal the lengths of the lists X2 and Y2, respectively.
(fit222 = msma(X2, Y2, comp=1, lambdaX=c(0.5, 0.5), lambdaY=c(0.5, 0.5)))
par(mfrow=c(1,2))
plot(fit222, axes = 1, plottype="bar", block="block", XY="X", las=2)
plot(fit222, axes = 1, plottype="bar", block="super", XY="X")
par(mfrow=c(1,2))
plot(fit222, axes = 1, plottype="bar", block="block", XY="Y", las=2)
plot(fit222, axes = 1, plottype="bar", block="super", XY="Y")
(fit223 = msma(X2, Y2, Z, comp=1, lambdaX=c(0.5, 0.5), lambdaY=c(0.5, 0.5)))
par(mfrow=c(1,2))
plot(fit223, axes = 1, plottype="bar", block="block", XY="X", las=2)
plot(fit223, axes = 1, plottype="bar", block="super", XY="X")
par(mfrow=c(1,2))
plot(fit223, axes = 1, plottype="bar", block="block", XY="Y", las=2)
plot(fit223, axes = 1, plottype="bar", block="super", XY="Y")
number of components search
(ncomp11 = ncompsearch(X1, comps = c(1, 5, 10*(1:2)), nfold=5))
plot(ncomp11)
(ncomp12 = ncompsearch(X1, comps = 20, criterion="BIC"))
plot(ncomp12)
ncomp21 = ncompsearch(X2, Y2, comps = c(1, 5, 10*(1:2)), nfold=5)
plot(ncomp21)
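The nfold argument suggests a K-fold cross-validation. As a generic illustration of that idea, the base R sketch below assigns 50 subjects to 5 folds at random and extracts the rows for one training and test split. How ncompsearch actually forms its folds is not described here, so the fold assignment shown is an assumption.

```r
set.seed(1)
n <- 50        # number of subjects
nfold <- 5     # number of folds

# Random, balanced fold labels: each subject belongs to exactly one fold
fold_id <- sample(rep(seq_len(nfold), length.out = n))
table(fold_id)

# Rows used for testing and training when fold 1 is held out
test_rows  <- which(fold_id == 1)
train_rows <- which(fold_id != 1)
```

Each candidate number of components would be fitted on the training rows and evaluated on the held-out rows, and the evaluation would be repeated with every fold held out in turn.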
The multiblock structure here has five blocks of four variables each, for both X and Y.
dataset3 = simdata(n = 50, rho = 0.8, Yps = rep(4, 5), Xps = rep(4, 5), seed=1)
X3 = dataset3$X; Y3 = dataset3$Y
ncomp31 = ncompsearch(X3, comps = 20, criterion="BIC")
plot(ncomp31)
(ncomp32 = ncompsearch(X3, comps = list(20, 20), criterion="BIC"))
par(mfrow=c(1,2))
plot(ncomp32,1)
plot(ncomp32,2)
ncomp41 = ncompsearch(X3, Y3, comps = c(1, 5, 10*(1:2)), criterion="BIC")
plot(ncomp41)
The number of components and regularized parameters can be selected by the function optparasearch.
The following options are available.
criteria = c("BIC", "CV")
search.methods = c("regparaonly", "regpara1st", "ncomp1st", "simultaneous")
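The two option vectors can be crossed to list every combination of criterion and search method. The sketch below only enumerates the settings with base R; it does not call optparasearch, and the data frame name `settings` is chosen here for illustration.

```r
criteria <- c("BIC", "CV")
search.methods <- c("regparaonly", "regpara1st", "ncomp1st", "simultaneous")

# One row per (criterion, search method) pair
settings <- expand.grid(criterion = criteria,
                        search.method = search.methods,
                        stringsAsFactors = FALSE)
settings
nrow(settings)
```

Such a table can serve as a checklist when comparing the results of the different search strategies on the same data.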
The regparaonly method searches for the regularized parameters with a fixed number of components.
(opt11 = optparasearch(X1, search.method = "regparaonly", criterion="BIC", comp=ncomp11$optncomp))
(fit311 = msma(X1, comp=opt11$optncomp, lambdaX=opt11$optlambdaX))
The ncomp1st method identifies the number of components with a regularized parameter of 0, then searches for the regularized parameters with the selected number of components.
(opt12 = optparasearch(X1, search.method = "ncomp1st", criterion="BIC"))
(fit312 = msma(X1, comp=opt12$optncomp, lambdaX=opt12$optlambdaX))
The regpara1st method identifies the regularized parameters with a fixed number of components, then searches for the number of components with the selected regularized parameters.
(opt13 = optparasearch(X1, search.method = "regpara1st", criterion="BIC"))
(fit313 = msma(X1, comp=opt13$optncomp, lambdaX=opt13$optlambdaX))
The simultaneous method identifies the number of components by searching the regularized parameters in each component.
(opt14 = optparasearch(X1, search.method = "simultaneous", criterion="BIC"))
(fit314 = msma(X1, comp=opt14$optncomp, lambdaX=opt14$optlambdaX))
The argument maxpct4ncomp=0.5 means that $0.5\lambda$ is used as the regularized parameter when the number of components is searched, where $\lambda$ is the maximum of the regularized parameters among the possible candidates.
(opt132 = optparasearch(X1, search.method = "ncomp1st", criterion="BIC", maxpct4ncomp=0.5))
(fit3132 = msma(X1, comp=opt132$optncomp, lambdaX=opt132$optlambdaX))
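As a small worked example of this rule, suppose the candidate regularized parameters were the hypothetical grid below. The grid values are invented for illustration; the candidates actually generated by optparasearch are not listed in this document.

```r
# Hypothetical candidate grid of regularized parameters
lambda_candidates <- c(0.2, 0.4, 0.6, 0.8)
maxpct4ncomp <- 0.5

lambda_max <- max(lambda_candidates)            # the lambda in the text
lambda_for_ncomp <- maxpct4ncomp * lambda_max   # value used while searching the number of components
lambda_for_ncomp
```

With this grid, the maximum candidate is 0.8, so the number of components would be searched with a regularized parameter of 0.4.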
The result of the regpara1st method depends on the number of components used in the search, whose default value is 10. The number of components is set as follows.
(opt133 = optparasearch(X1, search.method = "regpara1st", criterion="BIC", comp=5))
(fit3133 = msma(X1, comp=opt133$optncomp, lambdaX=opt133$optlambdaX))
For PLS, two parameters $\lambda_X$ and $\lambda_Y$ are used in arguments lambdaX and lambdaY to control sparseness for data X and Y, respectively.
(opt21 = optparasearch(X2, Y2, search.method = "regparaonly", criterion="BIC"))
(fit321 = msma(X2, Y2, comp=opt21$optncomp, lambdaX=opt21$optlambdaX, lambdaY=opt21$optlambdaY))
(opt31 = optparasearch(X3, search.method = "regparaonly", criterion="BIC"))
(fit331 = msma(X3, comp=opt31$optncomp, lambdaX=opt31$optlambdaX, lambdaXsup=opt31$optlambdaXsup))
(opt32 = optparasearch(X3, search.method = "regparaonly", criterion="BIC", whichselect="X"))
(fit332 = msma(X3, comp=opt32$optncomp, lambdaX=opt32$optlambdaX, lambdaXsup=opt32$optlambdaXsup))
(opt33 = optparasearch(X3, search.method = "regparaonly", criterion="BIC", whichselect="Xsup"))
(fit333 = msma(X3, comp=opt33$optncomp, lambdaX=opt33$optlambdaX, lambdaXsup=opt33$optlambdaXsup))
ncomp1st
(opt341 = optparasearch(X3, search.method = "ncomp1st", criterion="BIC", comp=c(8, 8)))
(fit341 = msma(X3, comp=opt341$optncomp, lambdaX=opt341$optlambdaX, lambdaXsup=opt341$optlambdaXsup))
regparaonly
(opt342 = optparasearch(X3, search.method = "regparaonly", criterion="BIC", comp=c(4, 5)))
(fit342 = msma(X3, comp=opt342$optncomp, lambdaX=opt342$optlambdaX, lambdaXsup=opt342$optlambdaXsup))
regpara1st
(opt344 = optparasearch(X3, search.method = "regpara1st", criterion="BIC", comp=c(8, 8)))
(fit344 = msma(X3, comp=opt344$optncomp, lambdaX=opt344$optlambdaX, lambdaXsup=opt344$optlambdaXsup))
This is computationally expensive and takes much longer to execute.
(opt345 = optparasearch(X3, search.method = "simultaneous", criterion="BIC", comp=c(8, 8)))
(fit345 = msma(X3, comp=opt345$optncomp, lambdaX=opt345$optlambdaX))
This is computationally expensive and takes much longer to execute due to the large number of blocks.
(opt41 = optparasearch(X3, Y3, search.method = "regparaonly", criterion="BIC"))
(fit341 = msma(X3, Y3, comp=opt41$optncomp, lambdaX=opt41$optlambdaX, lambdaY=opt41$optlambdaY, lambdaXsup=opt41$optlambdaXsup, lambdaYsup=opt41$optlambdaYsup))
In this example, it works by narrowing down the parameters as follows.
(opt42 = optparasearch(X3, Y3, search.method = "regparaonly", criterion="BIC", whichselect=c("Xsup","Ysup")))
(fit342 = msma(X3, Y3, comp=opt42$optncomp, lambdaX=opt42$optlambdaX, lambdaY=opt42$optlambdaY, lambdaXsup=opt42$optlambdaXsup, lambdaYsup=opt42$optlambdaYsup))
Another example dataset is generated.
dataset4 = simdata(n = 50, rho = 0.8, Yps = rep(4, 2), Xps = rep(4, 3), seed=1)
X4 = dataset4$X; Y4 = dataset4$Y
With this number of blocks, the calculation can be performed in a relatively short time.
(opt43 = optparasearch(X4, Y4, search.method = "regparaonly", criterion="BIC"))
(fit343 = msma(X4, Y4, comp=opt43$optncomp, lambdaX=opt43$optlambdaX, lambdaY=opt43$optlambdaY, lambdaXsup=opt43$optlambdaXsup, lambdaYsup=opt43$optlambdaYsup))