knitr::opts_chunk$set(echo = TRUE, cache = TRUE)

Introduction

This manual describes how to perform gcca to integrate multiple tables, using the mgcca function. The mgcca function is designed for small tables, for omics data to work with omics dataset we have implemented the mgcca_bd function, which allows to work with large number of variables. We will first show the calculations with mgcca and later with mgcca_bd.

First, we have to install mgcca and BigDataStatMeth packages, for further information about installation, please see vignette for mgcca package.

# rm(list=ls())
#devtools::install_github("isglobal-brge/mgcca")
library(mgcca)

Load other necessary libraries.

# utility library
library(R.utils)

# libraries needed by TCGA data
library(curatedTCGAData)
library(TCGAutils)

# Libraries needed by mgcca package
library(MultiAssayExperiment)
library(RSpectra)
library(impute)

# Benchmark
library(microbenchmark)

Download TCGA data

Data used to perform analysis was data referred to assays RNASeq (Normalized) and Methilation from Adrenocortical carcinoma. We can check the available cancer codes and assays in TCGA repositories. Let's choose ACC cancer, and two assays: RNA-seq and Methylation.

# Available cancer codes and assays
curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
# MultiAssayExperiment created with two assays: RNA-seq and Methylation
multiassayexperiment <- curatedTCGAData("ACC",c("RNASeq2GeneNorm","Methylation" ), FALSE)

Now let's check how this MultiAssayExperiment looks like.

class(multiassayexperiment)
multiassayexperiment

# Methylation data
assay(multiassayexperiment[[1]])[1:5,1:5]
dim(assay(multiassayexperiment[[1]]))

# RNA-seq data
assay(multiassayexperiment[[2]])[1:5,1:5]
dim(assay(multiassayexperiment[[2]]))

# We can see there are some missing values (NA and/or 0)
table(is.na(assay(multiassayexperiment[[1]])))
table(is.na(assay(multiassayexperiment[[2]])))

Data imputation

As observed above, we can impute the missing values using K-Nearest Neighbors method (KNN) from impute.knn package implemented in our package with the function impute. Only NA's will be imputed by default. However, 0 values can (if desired) be imputed aswell (impute.zero = TRUE). Take into account that samples must be columns and features must be rows (this is expected by the impute.knn function by default).

# Arguments:
# subset: MultiAssayExperiment
# method: knn for K-Nearest Neighbors or 
# k: Number of neighbors to be used in the imputation
# rowmax: maximum percent missing data allowed in any row (default 50%). For any rows with more than rowmax% missing are imputed using the overall mean per sample.
# colmax: maximum percent missing data allowed in any column (default 80%). If any column has more than colmax% missing data, the program halts and reports an error.
# remove.col: To remove columns with more than colmax% missing data.
# impute.zero: To impute 0 values.

multiassayexperiment.imputed <- impute(multiassayexperiment, method = "knn", 
                                       k = 10, rowmax = 0.5, colmax = 0.8, 
                                       remove.col = TRUE, impute.zero = TRUE)

Convert MultiAssayExperiment into a list of assays

We need to obtain a list of tables/matrices, one for each assay, to be able to run mgcca. The resulting matrices will have the rows and columns transposed, i.e. samples in rows and features in columns (because this is expected by the mgcca function).

tables_list <- getTables(multiassayexperiment.imputed)

# We assess the class, length, data and dimensions of the list of tables.
class(tables_list)
length(tables_list)
tables_list[[1]][1:5,1:5]
tables_list[[2]][1:5,1:5]
dim(tables_list[[1]])
dim(tables_list[[2]])

Data preprocessing

Sample names

First, samples names must at least overlap between tables (there can be missing individuals, but not all of them). In TCGA data, samples names are barcodes that contain information about the type of data, batches, etc. Thus, samples names are different from among different assays. Let's obtain the real sample ID's from the same barcodes.

# The sample ID is the first 12 characters from the sample name
for (assay in 1:length(tables_list)) {
  rownames(tables_list[[assay]]) <- substr(rownames(tables_list[[assay]]),1,12)
}

#tums <- TCGAsampleSelect(colnames(accmae), "01")

Tables containing character numbers

In our case, methylation data seems to be of character, instead of numeric, and although the imputation can be done with that type of data this will give an error in mgcca, because it expects numeric matrices. Thus, let's change this matrix to numeric values.

# Check if it is character
tables_list[[1]][1,1]
class(tables_list[[1]][1,1])

# Use our in-house function
tables_list <- matrix.chr2num(tables_list, index = 1)

# Confirm it's numeric now
tables_list[[1]][1,1]
class(tables_list[[1]][1,1])

MGCCA

Now we can run mgcca! However, due to speed limitations, let's make a subset of data.

class(tables_list[[1]])
table_list.subset <- lapply(tables_list, function(x) x[,1:100])
lapply(table_list.subset, function(x) x[1:5,1:5])

mgcca for small data

Run mgcca, select the grouping variable for the plot, and finally plot the result.

# mgcca
mgcca.result <- mgcca(table_list.subset, method="penalized", lambda=c(0.75,1))

# grouping variable from the original MultiAssayExperiment
# In this case, as example, the vital status of the patients with ACC cancer.
group <- factor(multiassayexperiment$vital_status)

#plot
plotInds(mgcca.result, group=group, pos.leg = "topright")

mgcca_bd for Omics data

Run mgcca_bd, select the grouping variable for the plot, and finally plot the result.

# mgcca
mgcca.result.bd <- mgcca(table_list.subset, method="penalized", lambda=c(0.75,1))

# grouping variable from the original MultiAssayExperiment
group <- factor(multiassayexperiment$vital_status)

#plot
plotInds(mgcca.result.bd, group=group, pos.leg = "topright")

We can observe that plot obtained with mgcca and mgcca_bd are equal

Mgcca for Genomic Ranges

You can also make a mgcca for a specific genomic coordinate that you are interested in. To do so, we will first subset the features from each assay (RNA-seq and methylation) that belong to that particular genomic coordinate. AS an example, we will use the coordinates for two genes: IGF2 --> a prototypical tumor-progenitor gene, which is the most commonly over-expressed gene in human adrenocortical carcinoma. MEN1 --> A tumor suppressor gene located at 11q13, also related to this cancer. TP53 --> The most common tumor supressor gene in all cancers. We will use both gene coordinates + 2kb up and downstream (in order to obtain CpGs next to it).

NOTE 1: we need at least 2 genes, because (I think) imputation cannot be done if there is only 1 row (gene). NOTE 2: The first thing to do is to find the genes names (and cpgs) belonging to that particular genomic coordinates, but in this case we already know their names. That means that genomic coordinates can be given without knowing the actual genes, because if you know the actual genes you just create a list with its names (see below).

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

### 1) Save the genomic coordinates that we are interested in
# MEN1 --> chr11: 64570986-64578766 +-2kb --> 64568986-64580766

chr.coords <- GRanges("11:64550986-64578766") 

### 2) Annotate each assay independently to obtain the genomic coordinates for all features, in case it is not already done

### 2.1) Methylation annotation
# We already have the genomic coordinates for each CpGs in the rowData dataframe
rowData(multiassayexperiment[[1]])

### 2.2) RNA-seq annotation
# It will split the assay in two: ranged RangedSummarizedExperiment containing the features with existing genomic coordinates, and unranged SummarizedExperiment, not containing them
mae.annotated <- TCGAutils::symbolsToRanges(multiassayexperiment)

### 3) Remove the non-used assays (unranged SummarizedExperiment)

mae.annotated[[3]] <- NULL

### 4) Obtain a subset of features (CpGs and genes) that fall within the interested genomic coordinates

mae.subset <- genCoords(multiassayexperiment = mae.annotated , gen.coords = chr.coords,
          meth.index = 1, col.name = "Genomic_Coordinate")

# mae.subset <- genCoords(multiassayexperiment = mae.annotated , gen.coords = chr.coords,
#           type.omics = c("methylation","rnaseq"), col.name = "Genomic_Coordinate")

Now we can make the mgcca with the MultiAssayExperiment subset by genomic coordinates

### 1) Imputation

mae.subset.imputed <- impute(mae.subset, method = "knn", k = 10, 
                                       rowmax = 0.5, colmax = 0.8, remove.col = TRUE, impute.zero = TRUE)

### 2) List of Matrices

tables_list.subset <- getTables(mae.subset.imputed)

### 3) Reassign sample names

for (assay in 1:length(tables_list.subset)) {
  rownames(tables_list.subset[[assay]]) <- substr(rownames(tables_list.subset[[assay]]),1,12)
}

### 4) Convert Character matrix (methylation) to Numeric matrix

tables_list.subset <- matrix.chr2num(tables_list.subset, index = 1)

Mgcca for Genomic Ranges in small matrices

### 5) mgcca

# mgcca
mgcca.result <- mgcca(tables_list.subset, method="penalized", lambda=c(0.75,1))

# grouping variable from the original MultiAssayExperiment
# In this case, as example, the vital status of the patients with ACC cancer.
group <- factor(multiassayexperiment$vital_status)
# miRNA cluster
#group <- factor(multiassayexperiment$miRNA.cluster)

#plot
plotInds(mgcca.result, group=group, pos.leg = "bottomleft")

Mgcca for Genomic Ranges and Omics data

### 5) mgcca

# mgcca
mgcca.result_bd <- mgcca_bd(tables_list.subset, method="penalized", lambda=c(0.75,1))

# grouping variable from the original MultiAssayExperiment
# In this case, as example, the vital status of the patients with ACC cancer.
group <- factor(multiassayexperiment$vital_status)
# miRNA cluster
#group <- factor(multiassayexperiment$miRNA.cluster)

#plot
plotInds(mgcca.result_bd, group=group, pos.leg = "bottomleft")

Differences between mgcca for small matrix and omics data

The main difference between mgcca and mgcca_bd is that mgcca_bd allows you to work with more variables and the execution time is lower in mgcca_bd than mgcca, here, we show an execution time example

table_list.subset.100 <- lapply(tables_list, function(x) x[,1:100])
table_list.subset.500 <- lapply(tables_list, function(x) x[,1:500])

res <- microbenchmark( 
  mgcca(table_list.subset.100, method="penalized", lambda=c(0.75,1)),
  mgcca_bd(table_list.subset.100, method="penalized", lambda=c(0.75,1)),
  mgcca(table_list.subset.500, method="penalized", lambda=c(0.75,1)),
  mgcca_bd(table_list.subset.500, method="penalized", lambda=c(0.75,1)),
  times = 10, unit = "s")

resdata <- as.data.frame(summary(res)[, c(1:7)])

# we simplify the names to better understand the microbenchmark results
resdata[,1] <- c("mgcca_100", "mgcca_bd_100","mgcca_500", "mgcca_bd_500" )

print(resdata, digits = 3)

So note that execution time with mgcca_bd is smaller than execution time with mgcca

Session information

sessionInfo()


isglobal-brge/GCCA documentation built on Feb. 19, 2022, 9:21 a.m.