cbaf: an automated, easy-to-use R package for comparing omic data across multiple cancers / a cancer's subgroups

Introduction

cbaf is a Bioconductor package that facilitates working with the high-throughput data stored on http://www.cbioportal.org/. The official CRAN package designed for obtaining data from cBioPortal in R is cgdsr. To obtain data with that package, users have to go through a multistep procedure. In addition, the index of cancers and their subgroups changes frequently, which in turn requires changing the R code. cbaf automates this procedure for RNA-Seq, microRNA-Seq, microarray and methylation data, and makes comparing genetic data across multiple cancer studies, or across subgroups of one cancer study, much faster and easier. The results are stored as Excel file(s) and multiple heatmaps.

Package Installation

Prerequisites

The package itself doesn't need anything outside of R, but one of its dependencies, rJava, has prerequisites of its own. Since preparing them can be complicated, they are briefly described in this section.

On 32-bit Windows, the 32-bit version of the Java Runtime Environment must be installed first. On 64-bit Windows, it is highly recommended that both the 32-bit and 64-bit versions of the Java Runtime Environment be installed.

On Ubuntu, run the following commands in a terminal, in the order given:

sudo apt-get install default-jdk

sudo R CMD javareconf

sudo apt-get install r-cran-rjava

sudo apt-get install libgdal1-dev libproj-dev

export LD_LIBRARY_PATH=/usr/lib/jvm/jre/lib/amd64:/usr/lib/jvm/jre/lib/amd64/default

sudo apt-get install libcurl4-openssl-dev libssl-dev
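Before installing cbaf, it may help to confirm from within R that the Java prerequisites are visible. The following is a minimal sketch using only base R; the helper name check_java_prereqs is hypothetical and not part of cbaf or rJava.

```r
# Hypothetical helper (not part of cbaf): report whether the Java
# prerequisites described above appear to be in place.
check_java_prereqs <- function() {
  list(
    # TRUE if a 'java' executable is found on the PATH
    java_on_path = unname(nzchar(Sys.which("java"))),
    # TRUE if the rJava package is installed and its namespace can be loaded
    rjava_installed = requireNamespace("rJava", quietly = TRUE),
    # 32 or 64, depending on the pointer size of the running R session
    r_bits = 8L * .Machine$sizeof.pointer
  )
}

prereqs <- check_java_prereqs()
str(prereqs)
```

If rjava_installed is FALSE, revisiting the steps above before installing cbaf is probably the quickest fix.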

Installation and Loading

The package can be installed via BiocManager::install:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("cbaf", dependencies = TRUE)

After that, the package can be loaded into the R workspace with

library(cbaf)

How to Use cbaf

The package contains seven low-level functions: availableData(), obtainOneStudy(), obtainMultipleStudies(), automatedStatistics(), heatmapOutput(), xlsxOutput() and cleanDatabase().

In addition, there are two high-level functions, processOneStudy() and processMultipleStudies(), that run some of these functions in sequence to speed up the overall process.

It is recommended that users work directly with only two of the low-level functions, availableData() and cleanDatabase(), since they are independent of the other low-level functions. For the rest, please use the high-level functions instead, which lets all functions work more efficiently.

Main Functions

availableData()

This function scans all the cancer studies to check for the presence of RNA-Seq, microRNA-Seq, microarray and methylation data. It requires a name to label the output Excel file. In the following example, the entered name is "list.2020-05-05".

availableData("list.2020-05-05")

Upon finishing, the output Excel file is available in the current working directory. It contains the following columns: cancer_study_id, cancer_study_name, RNA.Seq, microRNA.Seq, microarray of mRNA, microarray of miRNA, methylation and description.

If there is already an Excel file with the given name in the working directory, the function prints a message asking the user whether or not it should proceed. If the answer is no, the function prints a message to inform the user that it has stopped further processing. If the user types yes, availableData() will overwrite the Excel file after it has obtained the requested data.
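In scripts, where nobody is present to answer the overwrite prompt, one option is to check for the file before calling availableData(). The sketch below is only an illustration: the helper excel_exists is hypothetical, and the ".xlsx" extension of the output file is an assumption rather than something stated above.

```r
# Hypothetical helper (not part of cbaf): TRUE if an Excel file with the
# given name already exists in 'dir'. The ".xlsx" extension is assumed.
excel_exists <- function(name, dir = getwd()) {
  file.exists(file.path(dir, paste0(name, ".xlsx")))
}

output_name <- "list.2020-05-05"

if (excel_exists(output_name)) {
  message("'", output_name, ".xlsx' already exists; choose another name.")
} else if (interactive() && requireNamespace("cbaf", quietly = TRUE)) {
  # Runs only in an interactive session with cbaf installed
  cbaf::availableData(output_name)
}
```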

cleanDatabase()

This function removes the databases that cbaf has created in its package directory, which lets users obtain fresh data from cbioportal.org.

It contains one optional argument, databaseNames.

In the following example, databaseNames is Whole2.

cleanDatabase("Whole2")

If databaseNames is left unentered, the function will print the available databases and allow the user to choose the desired ones.

processOneStudy()

This function combines four other functions for ease of use. It is recommended that users rely on this parent function to obtain and process gene data across multiple subgroups of a cancer study, so that the child functions work with maximum efficiency. processOneStudy() uses the following functions: obtainOneStudy(), automatedStatistics(), heatmapOutput() and xlsxOutput().

It requires at least four arguments, all of which are the same as those of the low-level functions.

The function also has nineteen other options.

To get more information about the function options, please refer to the child function to which they correspond; for example, genesList belongs to obtainOneStudy(). The following example shows how this function can be used:

genes <- list(
    K.demethylases = c("KDM1A", "KDM1B", "KDM2A", "KDM2B", "KDM3A", "KDM3B", "JMJD1C", "KDM4A"),
    K.methyltransferases = c("SUV39H1", "SUV39H2", "EHMT1", "EHMT2", "SETDB1", "SETDB2", "KMT2A")
)

processOneStudy(genes, "test", "Breast Invasive Carcinoma (TCGA, Cell 2015)", "RNA-Seq", desiredCaseList = c(2,3,4,5), calculate = c("frequencyPercentage",  "frequencyRatio"), heatmapFileFormat = "TIFF")

The output Excel file and heatmaps are stored in separate folders for every gene group. All of these folders are located inside a parent folder whose name is the combination of submissionName and “output for multiple studies”, for example “test output for multiple studies”.
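Since the gene groups are passed as a named list of character vectors, a quick sanity check before submitting can catch unnamed groups or repeated gene symbols. The following is a minimal base-R sketch; tidy_gene_groups is a hypothetical helper, not a cbaf function, and the shortened gene list is only for illustration.

```r
# Hypothetical helper (not part of cbaf): check that 'genes' is a named list
# of character vectors and drop duplicated gene symbols within each group.
tidy_gene_groups <- function(genes) {
  stopifnot(is.list(genes), !is.null(names(genes)), all(nzchar(names(genes))))
  lapply(genes, function(group) {
    stopifnot(is.character(group))
    dup <- unique(group[duplicated(group)])
    if (length(dup) > 0) {
      message("Dropping duplicated symbols: ", paste(dup, collapse = ", "))
    }
    unique(group)
  })
}

# Illustrative input with one symbol deliberately repeated
genes <- list(
  K.demethylases = c("KDM1A", "KDM1B", "KDM2A"),
  K.methyltransferases = c("SETDB1", "KMT2A", "KMT2A")
)
genes <- tidy_gene_groups(genes)
str(genes)
```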

processMultipleStudies()

This function combines four other functions for ease of use. It is recommended that users rely on this parent function to obtain and process gene data across multiple cancer studies, for maximum efficiency. processMultipleStudies() uses the following functions: obtainMultipleStudies(), automatedStatistics(), heatmapOutput() and xlsxOutput().

It requires at least four arguments, all of which are the same as those of the low-level functions.

The function also has nineteen other options.

To get more information about the function options, please refer to the child function to which they correspond; for example, genesList belongs to obtainMultipleStudies(). The following example shows how this function can be used:

genes <- list(
    K.demethylases = c("KDM1A", "KDM1B", "KDM2A", "KDM2B", "KDM3A", "KDM3B", "JMJD1C", "KDM4A"),
    K.methyltransferases = c("SUV39H1", "SUV39H2", "EHMT1", "EHMT2", "SETDB1", "SETDB2", "KMT2A")
)

studies <- c("Acute Myeloid Leukemia (TCGA, Provisional)", "Adrenocortical Carcinoma (TCGA, Provisional)", "Bladder Urothelial Carcinoma (TCGA, Provisional)", "Brain Lower Grade Glioma (TCGA, Provisional)", "Breast Invasive Carcinoma (TCGA, Provisional)") 

processMultipleStudies(genes, "test2", studies, "RNA-Seq", calculate = c("frequencyPercentage", "frequencyRatio"), heatmapFileFormat = "TIFF")

The output Excel file and heatmaps are stored in separate folders for every gene group. All of these folders are located inside a parent folder whose name is the combination of submissionName and "output for multiple studies", for example "test output for multiple studies".
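To inspect what a run produced, the parent folder name can be rebuilt from submissionName as described above and its contents listed. This is a minimal sketch; list_outputs is a hypothetical helper, and it assumes the parent folder sits in the working directory.

```r
# Hypothetical helper (not part of cbaf): list every file under the output
# folder of a submission, with paths relative to that folder.
list_outputs <- function(submissionName,
                         suffix = "output for multiple studies",
                         dir = getwd()) {
  out_dir <- file.path(dir, paste(submissionName, suffix))
  if (!dir.exists(out_dir)) {
    return(character(0))
  }
  list.files(out_dir, recursive = TRUE)
}

# Returns an empty character vector if no such folder exists yet
list_outputs("test2")
```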

Five Dependent Functions

The following functions are used by processOneStudy() and processMultipleStudies(). It is highly recommended to use these two functions instead of running the following five functions independently.

obtainOneStudy()

This function obtains and stores the supported data for at least one group of genes across multiple subgroups of a cancer study. In addition, it can check whether or not all genes are included in the different subgroups of a cancer study and, if not, it looks for alternative gene names.

It requires at least four arguments.

The function also has two other options.

Consider the following example, where genes consists of two gene groups, K.demethylases and K.acetyltransferases, submissionName is test, cancername is Breast Invasive Carcinoma (TCGA, Cell 2015) and desiredTechnique is RNA-Seq. If desiredCaseList = "none", all subgroups of the requested cancer study are printed to the console and the function asks the user to choose the indices of the desired subgroups. Alternatively, the user can supply the indices directly, e.g. desiredCaseList = c(2,3,4,5). Once the subgroups have been chosen, the function downloads the data and reports its progress with a progress bar.

genes <- list(K.demethylases = c("KDM1A", "KDM1B", "KDM2A"), K.acetyltransferases = c("CLOCK", "CREBBP", "ELP3", "EP300"))

obtainOneStudy(genes, "test", "Breast Invasive Carcinoma (TCGA, Cell 2015)", "RNA-Seq", desiredCaseList = c(2,3,4,5))

obtainMultipleStudies()

This function obtains and stores the supported data for at least one group of genes across multiple cancer studies. It can check whether or not all genes are included in each cancer study and, if not, it looks for the alternative gene names.

It requires at least four arguments.

The function also has two other options.

In the following example, genes consists of two gene groups, K.demethylases and K.acetyltransferases, submissionName is test2, cancernames holds the complete names of five cancer studies and the desired high-throughput technique is RNA-Seq.

genes <- list(K.demethylases = c("KDM1A", "KDM1B", "KDM2A"), K.acetyltransferases = c("CLOCK", "CREBBP", "ELP3", "EP300"))

# Specifying names of cancer studies by standard study names
cancernames <- c("Acute Myeloid Leukemia (TCGA, Provisional)", "Adrenocortical Carcinoma (TCGA, Provisional)", "Bladder Urothelial Carcinoma (TCGA, Provisional)", "Brain Lower Grade Glioma (TCGA, Provisional)", "Breast Invasive Carcinoma (TCGA, Provisional)")

# Specifying names of cancer studies by creating a matrix that includes standard and desired study names
cancernames <- matrix(c("Acute Myeloid Leukemia (TCGA, Provisional)", "acute myeloid leukemia", "Adrenocortical Carcinoma (TCGA, Provisional)", "adrenocortical carcinoma", "Bladder Urothelial Carcinoma (TCGA, Provisional)", "bladder urothelial carcinoma", "Brain Lower Grade Glioma (TCGA, Provisional)", "brain lower grade glioma", "Breast Invasive Carcinoma (TCGA, Provisional)",  "breast invasive carcinoma"), nrow = 5, ncol=2 , byrow = TRUE)


obtainMultipleStudies(genes, "test2", cancernames, "RNA-Seq")
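When standard and desired study names are supplied as a two-column matrix, typing them as one long vector makes it easy to misalign a pair. One alternative, sketched below with base R only, is to write the pairs as a named character vector and build the matrix from it; the variable name study_labels is an assumption introduced for this example.

```r
# Names are the standard study names; values are the shorter labels
# wanted in the output.
study_labels <- c(
  "Acute Myeloid Leukemia (TCGA, Provisional)" = "acute myeloid leukemia",
  "Adrenocortical Carcinoma (TCGA, Provisional)" = "adrenocortical carcinoma",
  "Bladder Urothelial Carcinoma (TCGA, Provisional)" = "bladder urothelial carcinoma",
  "Brain Lower Grade Glioma (TCGA, Provisional)" = "brain lower grade glioma",
  "Breast Invasive Carcinoma (TCGA, Provisional)" = "breast invasive carcinoma"
)

# Column 1: standard names, column 2: desired names (one study per row)
cancernames <- cbind(names(study_labels), unname(study_labels))
print(cancernames)
```

The resulting matrix has the same five-by-two layout as the matrix() call shown above, with each row holding one study.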

automatedStatistics()

This function calculates statistics for the data obtained by obtainOneStudy() or obtainMultipleStudies(). Depending on the user's preference, these statistics can include the frequency percentage, frequency ratio, mean value and median value of samples greater than a specific value. Furthermore, it can look for the genes that have the highest values in each cancer and list the top 5 genes for frequency percentage, mean value and median value.

It requires at least two arguments, submissionName and obtainedDataType.

The function also has four other options.

In the following example, submissionName is test, and obtainedDataType is single study. We exclude mean value and median value from calculate. Note that top genes for these two statistics will also be skipped.

automatedStatistics("test", obtainedDataType = "single study", calculate = c("frequencyPercentage", "frequencyRatio"))
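To make the statistics described above concrete, here is a minimal base-R sketch on a toy matrix. It reflects one reading of the description (values above a cutoff are counted and summarised) and is not the package's implementation; the cutoff of 2, the helper name summarise_gene and the toy values are all assumptions, and the frequency ratio is left out.

```r
# Toy expression values (made up): rows are samples, columns are genes.
expr <- matrix(
  c(0.5, 2.5, 3.0, 1.0,   # KDM1A
    2.2, 0.1, 4.0, 2.6),  # KDM1B
  nrow = 4,
  dimnames = list(paste0("sample", 1:4), c("KDM1A", "KDM1B"))
)
cutoff <- 2  # assumed threshold, for illustration only

# Hypothetical helper: summarise the samples of one gene that exceed 'cutoff'.
summarise_gene <- function(x, cutoff) {
  x <- x[!is.na(x)]
  above <- x[x > cutoff]
  c(
    frequencyPercentage = 100 * length(above) / length(x),
    meanValue = if (length(above) > 0) mean(above) else NA_real_,
    medianValue = if (length(above) > 0) median(above) else NA_real_
  )
}

# One row per gene, one column per statistic
gene_stats <- t(apply(expr, 2, summarise_gene, cutoff = cutoff))

# Genes ranked by frequency percentage (at most five are kept)
top_genes <- head(sort(gene_stats[, "frequencyPercentage"], decreasing = TRUE), 5)

print(gene_stats)
print(top_genes)
```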

heatmapOutput()

This function prepares heatmaps for the frequency percentage, mean value and median value data provided by automatedStatistics(). Heatmaps for every gene group are stored in separate folders.

It requires at least one argument, submissionName.

The function also has thirteen other options.

In the following example, submissionName is test.

heatmapOutput("test", shortenStudyNames = TRUE, heatmapMargines = c(13,5), heatmapColor = "RdGr", genesToDrop = c("PVT1", "SNHG6"), reverseColor = FALSE, heatmapFileFormat = "JPG")

If the requested heatmaps already exist, the function does not rewrite them. The number of skipped heatmaps is then printed.
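As a rough picture of what dropping genes and drawing a heatmap involves, the sketch below uses base R's heatmap() on made-up frequency percentages. It is not the plotting code used by heatmapOutput(); the matrix values, the study labels, the margins and the temporary PDF file are all assumptions for illustration.

```r
# Toy frequency percentages (made up): rows are genes, columns are studies.
freq <- matrix(
  c(50, 75, 10,
    20, 90, 30),
  nrow = 3,
  dimnames = list(c("KDM1A", "PVT1", "KDM2A"), c("study A", "study B"))
)

# Remove unwanted genes before plotting; names absent from the matrix
# (here "SNHG6") are simply ignored.
genesToDrop <- c("PVT1", "SNHG6")
kept <- freq[!rownames(freq) %in% genesToDrop, , drop = FALSE]

# Draw an unclustered, unscaled heatmap into a temporary PDF file
out_file <- file.path(tempdir(), "example_heatmap.pdf")
pdf(out_file)
heatmap(kept, Rowv = NA, Colv = NA, scale = "none", margins = c(8, 8))
dev.off()
```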

xlsxOutput()

This function exports the output of automatedStatistics() and the gene validation result of obtainOneStudy() or obtainMultipleStudies() as an Excel file. For every gene group, an Excel file is generated and stored in the same folder as the heatmaps.

It requires one argument, submissionName.

There is one other, optional argument.

In the following example, submissionName is test.

xlsxOutput("test")

If the requested Excel files already exist, the function avoids rewriting them. The number of skipped Excel files is then printed.
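The skip-if-present behaviour can be pictured with a small base-R sketch. It is only an illustration of the pattern, not the code behind xlsxOutput(): the helper write_if_absent is hypothetical, and it writes CSV files rather than Excel files so that it needs no extra packages.

```r
# Hypothetical helper (not part of cbaf): write each data frame in 'tables'
# to '<name>.csv' inside 'dir' unless that file already exists, then report
# how many files were skipped.
write_if_absent <- function(tables, dir) {
  skipped <- 0L
  for (name in names(tables)) {
    path <- file.path(dir, paste0(name, ".csv"))
    if (file.exists(path)) {
      skipped <- skipped + 1L
    } else {
      write.csv(tables[[name]], path, row.names = FALSE)
    }
  }
  message(skipped, " file(s) skipped because they already exist.")
  invisible(skipped)
}

# Made-up tables, one per gene group
tables <- list(
  K.demethylases = data.frame(gene = c("KDM1A", "KDM1B"), value = c(50, 75)),
  K.acetyltransferases = data.frame(gene = c("CLOCK", "EP300"), value = c(10, 30))
)
out_dir <- tempfile("cbaf_sketch_")
dir.create(out_dir)

first_run <- write_if_absent(tables, out_dir)   # nothing exists yet
second_run <- write_if_absent(tables, out_dir)  # both files already exist
```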



cbaf documentation built on Dec. 9, 2020, 2:02 a.m.