A. Data handling
In multiblock: Multiblock Data Fusion in Statistics and Machine Learning

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=6, 
  fig.height=4
)
# Legge denne i YAML på toppen for å skrive ut til tex
#output: 
#  pdf_document: 
#    keep_tex: true
# Original:
#  rmarkdown::html_vignette:
#    toc: true

# Start the multiblock R package
library(multiblock)

Read from file

Data are stored in many different file formats. The following three examples cover two types of CSV-files and generic flat files.

# Find directory extdata from the multiblock package
mbdir <- system.file('extdata/', package = "multiblock")

# Comma separated values, row names in first column
meta_data <- read.csv(paste0(mbdir, "/meta_data.csv"), row.names = 1)
# If working directory matches file location:
# meta_data <- read.csv('meta_data.csv', row.names = 1)
meta_data

# Semi-colon separated values (locales where the decimal point is comma),
# no row names
proteins <- read.csv2(paste0(mbdir, "/proteins.csv"))
proteins

# Blank space separated data without labels
genes <- read.table(paste0(mbdir, "/genes.dat"))
genes

Data pre-processing

Before analysis, various types of pre-processing may be needed. Centring and standardising/scaling may be considered the most basic. In R, these operations are performed column-wise by default, leading to autoscaling. If these operations are performed on the rows, we perform the standard normal variate (SNV) instead.

# Column-centring
genes_centred <- scale(genes, scale=FALSE)
colMeans(genes_centred) # Check mean values

# Autoscaling
genes_scaled <- scale(genes)
apply(genes_scaled, 2, sd) # Check standard deviations

# SNV (transpose, autoscale, re-transpose)
genes_snv <- t(scale(t(genes)))
apply(genes_snv, 1, sd) # Check standard deviations

Re-coding categorical data

Most analysis methods require continuous input data. The meta_data data.frame contains a character vector (a factor in older R versions) of categories. This package has a function dummycode for converting categorical data to various dummy formats.

# Default is sum coding
dummycode(meta_data$colour)

# Treatment coding
dummycode(meta_data$colour, "contr.treatment")

# Full dummy-coding (rank deficient)
dummycode(meta_data$colour, drop = FALSE)

# Replace categorical with dummy-coded, use I() to index by common name
meta_data2 <- meta_data
meta_data2$colour <- I(dummycode(meta_data$colour, drop = FALSE))
meta_data2
meta_data2$colour

Data structures for multiblock analysis

Create list of blocks

A simple list of blocks can be created using the list() function. Naming of the blocks can be done directly or after creation.

# Direct approach
blocks1 <- list(meta = meta_data2, proteins = proteins, genes = genes)

# Two-step approach
blocks2 <- list(meta_data2, proteins, genes)
names(blocks2) <- c('meta', 'proteins', 'genes')

# Same result
identical(blocks1, blocks2)

# Access by name or number
blocks1[['meta']]
blocks2[[1]]

Create data.frame of blocks

A data.frame is a convenient storage format for data in R and can handle many types of variables, e.g. numeric, logical, character, factor or matrices. The latter is useful for analyses of data with shared sample mode.

# Construct block data.frame from list
df1 <- block.data.frame(blocks1)

# Construct block data.frame from data.frame:
# First merge blocks into data.frame
my_data <- cbind(meta_data2, proteins, genes)
# Then construct block data.frame using named 
# list of indexes
df2 <- block.data.frame(my_data, block_inds = 
        list(meta = 1:2, proteins = 3:5, genes = 6:8))

# Same result
identical(df1,df2)

# Access by name or number
df1[[2]]
df2[['proteins']]
df1[c(1,3)]
df1[-2]
df2[c('proteins','genes')]

# Use with formula interface (see other vignettes)
# sopls(meta ~ proteins + genes, data = df1)

# Use with single list interface (see other vignettes)
# mfa(df1[c(1,3)], ncomp = 3)