Multiblock data analysis with the RGCCA package

We consider J data matrices X1 ,..., XJ. Each n × pj data matrix Xj = [ xj1, ..., xjpj ] is called a block and represents a set of pj variables observed on n individuals. The number and the nature of the variables may differ from one block to another, but the individuals must be the same across blocks. We assume that all variables are centered. The objective of RGCCA is to find, for each block, a weighted composite of variables (called block component) yj = Xj . aj, j = 1 ,..., J (where aj is a column-vector with pj elements) summarizing the relevant information between and within the blocks. The block components are obtained such that (i) block components explain well their own block and/or (ii) block components that are assumed to be connected are highly correlated. In addition, RGCCA integrates a variable selection procedure, called SGCCA, allowing the identification of the most relevant features.

# Load functions
source("../R/parsing.R")
source("../R/select.type.R")
source("../R/plot.R")
source("../R/network.R")

loadLibraries(c("RGCCA", "ggplot2", "optparse", "scales", "igraph"))

# Load the data
data("Russett")

#Creates the blocks
agriculture = Russett[, 1:3]
industry = Russett[, 4:5]
politic = Russett[, 6:11]

# Creates optional files
response = factor( apply(Russett[, 9:11], 1, which.max),
                   labels = colnames(Russett)[9:11] )
connection = matrix(c(0, 0, 0, 1,
                      0, 0, 0, 1,
                      0, 0, 0, 1,
                      1, 1, 1, 0),
                    4, 4)

# Save the Russett files
files = c("agriculture", "industry", "politic", "response", "connection")

# Creates the files (in .tsv format for example)
if(!file.exists("data"))
  dir.create("data")

# Without row and column names for connection files
sapply(1:length(files), function (x) {

  bool = ( files[x] != "connection")

  write.table(
    x = get(files[x]),
    file = paste("data/", files[x], ".tsv", sep=""),
    row.names = bool,
    col.names = bool,
    sep = "\t")
})

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "images/"
)

Load the inputs

Load the blocks

The blocks are loaded with the function setBlocks. The first argument of this function (superblock) required a bolean giving the presence (TRUE) / absence (FALSE) of a superblock. The second one corresponds to a character giving the list of the file path separated by a comma (argument file). By default, the name of the blocks corresponds to those of the files (names argument) and could be set. By default, the tabulation is used as a column separator (sep argument) and the first row is considered as a header (header parameter).

# A boolean giving the presence (TRUE) / absence (FALSE) of a superblock
blocks = setBlocks(file = "data/agriculture.tsv,data/industry.tsv,data/politic.tsv")
blocks[["Superblock"]] = Reduce(cbind, blocks)

Load the groups of response and the connection between blocks

The connection between the blocks will be used by the RGCCA and must be set by setConnection function. A group of samples will be used to color them in the samples plot and must be set by setResponse function. For both functions, the blocks parameter, set at the previous step, is required. The other parameters are optional. The user could import a file containing either (file parameter) : (i) a symmetric matrix with 1 giving a connection between two blocs, or 0 otherwise; (ii) a univariate vector (qualitative or quantitative) or a disjunctive table for the response. By default, the column separator is the tabulation and could be set (sep argument). For the setResponse, a header could be specified (header parameter).

# Optional parameters
RESPONSE = "data/response.tsv"
CONNECTION = "data/connection.tsv"
SUPERBLOCK =  TRUE
# Uncomment the parameters below to try without default settings
# RESPONSE <- CONNECTION <- NULL

response = setResponse(blocks = blocks, 
                       file = RESPONSE)
connection = setConnection(blocks = blocks, 
                           file = CONNECTION)

View the inputs

for (x in 1:length(files)) {

  tab = get(files[x])
  if (x > 3) colnames(tab) = rep(NULL, NCOL(tab))

  pander::pandoc.table(head(tab),
                       caption = files[x])
}

Run S/RGCCA

SGCCA is run from the RGCCA package by using two components for a biplot visualization. The S/RGCCA function doesn't names the blocks in their outputs. This step is required to generate biplots.

# Use two components in RGCCA to plot in a bidimensionnal space
NB_COMP = c(2, 2, 3, 3)

sgcca.res = rgcca.analyze(blocks = blocks,
                  connection = connection,
                  ncomp = NB_COMP)

# Renames the elements of the list according to the block names
names(sgcca.res$a) = names(blocks)

Vizualise the analysis

Both the samples and the variables could be visualized by using biplots functions (respectively plotSamplesSpace and plotVariablesSpace). Histograms are used to visualized in decreasing order the variables with the higher weights and the blocks with the higher Average Variance Explained (AVE).

These functions take the results of a sgcca or a rgcca (rgcca parameter) and the components to visualize : either comp_x and comp_y for biplots or comp for histograms. By default, comp_x = comp = 1 and comp_y = 2. The presence or the absence of a superblock among the analysis could be specified for plotVariablesSpace and plotFingerprint to color the variables according to their blocks. By default, the last block is plotted, corresponding to the superblock if selected (i_block parameter). plotVariablesSpace, which is a corcircle plot, required the blocks for the correlation with the selected component. plotVariablesSpace could use the response variable to color the samples by groups. By default, the first 100th higher weights are used for the plotFingerprint and could be set by using the n_mark argument.

COMP1 = 2
COMP2 = 3
NB_MARK = 100

With the superblock, by default on the first and the second components

Samples plot

plotSamplesSpace(rgcca = sgcca.res, 
                 resp = response)

Corcircle plot

plotVariablesSpace(rgcca = sgcca.res,
                   block = blocks,
                   superblock = SUPERBLOCK)

Fingerprint plot

plotFingerprint(rgcca = sgcca.res,
                superblock = SUPERBLOCK,
                n_mark = NB_MARK,
                type = "weight")

Best explained blocks

plotAVE(rgcca = sgcca.res)

With the politic block, on the 2nd and the 3rd components

Samples plot

plotSamplesSpace(rgcca = sgcca.res, 
                 resp = response, 
                 comp_x = COMP1, 
                 comp_y = COMP2, 
                 i_block = 3)

Corcircle plot

plotVariablesSpace(rgcca = sgcca.res,
                   block = blocks,
                   comp_x = COMP1,
                   comp_y = COMP2,
                   i_block = 3)

Fingerprint plot

plotFingerprint(rgcca = sgcca.res,
                comp = COMP1,
                n_mark = NB_MARK, 
                i_block = 3,
                type = "weight")
# Remove the temp/ folder
unlink("data", recursive = TRUE)


ecamenen/rgcca_galaxy documentation built on Dec. 16, 2019, 12:40 a.m.