knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
This manual describes how to perform gcca to integrate multiple tables, using the mgcca
function. The mgcca
function is designed for small tables, for omics data to work with omics dataset we have implemented the mgcca_bd
function, which allows to work with large number of variables. We will first show the calculations with mgcca
and later with mgcca_bd.
First, we have to install mgcca
and BigDataStatMeth
packages, for further information about installation, please see vignette for mgcca
package.
# rm(list=ls()) #devtools::install_github("isglobal-brge/mgcca") library(mgcca)
Load other necessary libraries.
# utility library library(R.utils) # libraries needed by TCGA data library(curatedTCGAData) library(TCGAutils) # Libraries needed by mgcca package library(MultiAssayExperiment) library(RSpectra) library(impute) # Benchmark library(microbenchmark)
Data used to perform analysis was data referred to assays RNASeq (Normalized) and Methilation from Adrenocortical carcinoma. We can check the available cancer codes and assays in TCGA repositories. Let's choose ACC cancer, and two assays: RNA-seq and Methylation.
# Available cancer codes and assays curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE) # MultiAssayExperiment created with two assays: RNA-seq and Methylation multiassayexperiment <- curatedTCGAData("ACC",c("RNASeq2GeneNorm","Methylation" ), FALSE)
Now let's check how this MultiAssayExperiment looks like.
class(multiassayexperiment) multiassayexperiment # Methylation data assay(multiassayexperiment[[1]])[1:5,1:5] dim(assay(multiassayexperiment[[1]])) # RNA-seq data assay(multiassayexperiment[[2]])[1:5,1:5] dim(assay(multiassayexperiment[[2]])) # We can see there are some missing values (NA and/or 0) table(is.na(assay(multiassayexperiment[[1]]))) table(is.na(assay(multiassayexperiment[[2]])))
As observed above, we can impute the missing values using K-Nearest Neighbors method (KNN) from impute.knn
package implemented in our package with the function impute
. Only NA's will be imputed by default. However, 0 values can (if desired) be imputed aswell (impute.zero = TRUE
). Take into account that samples must be columns and features must be rows (this is expected by the impute.knn
function by default).
# Arguments: # subset: MultiAssayExperiment # method: knn for K-Nearest Neighbors or # k: Number of neighbors to be used in the imputation # rowmax: maximum percent missing data allowed in any row (default 50%). For any rows with more than rowmax% missing are imputed using the overall mean per sample. # colmax: maximum percent missing data allowed in any column (default 80%). If any column has more than colmax% missing data, the program halts and reports an error. # remove.col: To remove columns with more than colmax% missing data. # impute.zero: To impute 0 values. multiassayexperiment.imputed <- impute(multiassayexperiment, method = "knn", k = 10, rowmax = 0.5, colmax = 0.8, remove.col = TRUE, impute.zero = TRUE)
MultiAssayExperiment
into a list of assaysWe need to obtain a list of tables/matrices, one for each assay, to be able to run mgcca
. The resulting matrices will have the rows and columns transposed, i.e. samples in rows and features in columns (because this is expected by the mgcca
function).
tables_list <- getTables(multiassayexperiment.imputed) # We assess the class, length, data and dimensions of the list of tables. class(tables_list) length(tables_list) tables_list[[1]][1:5,1:5] tables_list[[2]][1:5,1:5] dim(tables_list[[1]]) dim(tables_list[[2]])
First, samples names must at least overlap between tables (there can be missing individuals, but not all of them). In TCGA data, samples names are barcodes that contain information about the type of data, batches, etc. Thus, samples names are different from among different assays. Let's obtain the real sample ID's from the same barcodes.
# The sample ID is the first 12 characters from the sample name for (assay in 1:length(tables_list)) { rownames(tables_list[[assay]]) <- substr(rownames(tables_list[[assay]]),1,12) } #tums <- TCGAsampleSelect(colnames(accmae), "01")
In our case, methylation data seems to be of character, instead of numeric, and although the imputation can be done with that type of data this will give an error in mgcca
, because it expects numeric matrices. Thus, let's change this matrix to numeric values.
# Check if it is character tables_list[[1]][1,1] class(tables_list[[1]][1,1]) # Use our in-house function tables_list <- matrix.chr2num(tables_list, index = 1) # Confirm it's numeric now tables_list[[1]][1,1] class(tables_list[[1]][1,1])
Now we can run mgcca! However, due to speed limitations, let's make a subset of data.
class(tables_list[[1]]) table_list.subset <- lapply(tables_list, function(x) x[,1:100]) lapply(table_list.subset, function(x) x[1:5,1:5])
Run mgcca, select the grouping variable for the plot, and finally plot the result.
# mgcca mgcca.result <- mgcca(table_list.subset, method="penalized", lambda=c(0.75,1)) # grouping variable from the original MultiAssayExperiment # In this case, as example, the vital status of the patients with ACC cancer. group <- factor(multiassayexperiment$vital_status) #plot plotInds(mgcca.result, group=group, pos.leg = "topright")
Run mgcca_bd
, select the grouping variable for the plot, and finally plot the result.
# mgcca mgcca.result.bd <- mgcca(table_list.subset, method="penalized", lambda=c(0.75,1)) # grouping variable from the original MultiAssayExperiment group <- factor(multiassayexperiment$vital_status) #plot plotInds(mgcca.result.bd, group=group, pos.leg = "topright")
We can observe that plot obtained with mgcca
and mgcca_bd
are equal
You can also make a mgcca
for a specific genomic coordinate that you are interested in. To do so, we will first subset the features from each assay (RNA-seq and methylation) that belong to that particular genomic coordinate.
AS an example, we will use the coordinates for two genes:
IGF2 --> a prototypical tumor-progenitor gene, which is the most commonly over-expressed gene in human adrenocortical carcinoma.
MEN1 --> A tumor suppressor gene located at 11q13, also related to this cancer.
TP53 --> The most common tumor supressor gene in all cancers.
We will use both gene coordinates + 2kb up and downstream (in order to obtain CpGs next to it).
NOTE 1: we need at least 2 genes, because (I think) imputation cannot be done if there is only 1 row (gene). NOTE 2: The first thing to do is to find the genes names (and cpgs) belonging to that particular genomic coordinates, but in this case we already know their names. That means that genomic coordinates can be given without knowing the actual genes, because if you know the actual genes you just create a list with its names (see below).
library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) ### 1) Save the genomic coordinates that we are interested in # MEN1 --> chr11: 64570986-64578766 +-2kb --> 64568986-64580766 chr.coords <- GRanges("11:64550986-64578766") ### 2) Annotate each assay independently to obtain the genomic coordinates for all features, in case it is not already done ### 2.1) Methylation annotation # We already have the genomic coordinates for each CpGs in the rowData dataframe rowData(multiassayexperiment[[1]]) ### 2.2) RNA-seq annotation # It will split the assay in two: ranged RangedSummarizedExperiment containing the features with existing genomic coordinates, and unranged SummarizedExperiment, not containing them mae.annotated <- TCGAutils::symbolsToRanges(multiassayexperiment) ### 3) Remove the non-used assays (unranged SummarizedExperiment) mae.annotated[[3]] <- NULL ### 4) Obtain a subset of features (CpGs and genes) that fall within the interested genomic coordinates mae.subset <- genCoords(multiassayexperiment = mae.annotated , gen.coords = chr.coords, meth.index = 1, col.name = "Genomic_Coordinate") # mae.subset <- genCoords(multiassayexperiment = mae.annotated , gen.coords = chr.coords, # type.omics = c("methylation","rnaseq"), col.name = "Genomic_Coordinate")
Now we can make the mgcca
with the MultiAssayExperiment subset by genomic coordinates
### 1) Imputation mae.subset.imputed <- impute(mae.subset, method = "knn", k = 10, rowmax = 0.5, colmax = 0.8, remove.col = TRUE, impute.zero = TRUE) ### 2) List of Matrices tables_list.subset <- getTables(mae.subset.imputed) ### 3) Reassign sample names for (assay in 1:length(tables_list.subset)) { rownames(tables_list.subset[[assay]]) <- substr(rownames(tables_list.subset[[assay]]),1,12) } ### 4) Convert Character matrix (methylation) to Numeric matrix tables_list.subset <- matrix.chr2num(tables_list.subset, index = 1)
### 5) mgcca # mgcca mgcca.result <- mgcca(tables_list.subset, method="penalized", lambda=c(0.75,1)) # grouping variable from the original MultiAssayExperiment # In this case, as example, the vital status of the patients with ACC cancer. group <- factor(multiassayexperiment$vital_status) # miRNA cluster #group <- factor(multiassayexperiment$miRNA.cluster) #plot plotInds(mgcca.result, group=group, pos.leg = "bottomleft")
### 5) mgcca # mgcca mgcca.result_bd <- mgcca_bd(tables_list.subset, method="penalized", lambda=c(0.75,1)) # grouping variable from the original MultiAssayExperiment # In this case, as example, the vital status of the patients with ACC cancer. group <- factor(multiassayexperiment$vital_status) # miRNA cluster #group <- factor(multiassayexperiment$miRNA.cluster) #plot plotInds(mgcca.result_bd, group=group, pos.leg = "bottomleft")
The main difference between mgcca
and mgcca_bd
is that mgcca_bd
allows you to work with more variables and the execution time is lower in mgcca_bd
than mgcca
, here, we show an execution time example
table_list.subset.100 <- lapply(tables_list, function(x) x[,1:100]) table_list.subset.500 <- lapply(tables_list, function(x) x[,1:500]) res <- microbenchmark( mgcca(table_list.subset.100, method="penalized", lambda=c(0.75,1)), mgcca_bd(table_list.subset.100, method="penalized", lambda=c(0.75,1)), mgcca(table_list.subset.500, method="penalized", lambda=c(0.75,1)), mgcca_bd(table_list.subset.500, method="penalized", lambda=c(0.75,1)), times = 10, unit = "s") resdata <- as.data.frame(summary(res)[, c(1:7)]) # we simplify the names to better understand the microbenchmark results resdata[,1] <- c("mgcca_100", "mgcca_bd_100","mgcca_500", "mgcca_bd_500" ) print(resdata, digits = 3)
So note that execution time with mgcca_bd
is smaller than execution time with mgcca
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.