Introduction

Managing data from large-scale projects such as The Cancer Genome Atlas (TCGA)[@ref1] is an important and time-consuming step in research projects. Several efforts, such as the Firehose project, make pre-processed TCGA data publicly available via web services and data portals, but researchers still have to download, manage, and prepare the data for downstream analysis. We developed an open-source, extensible, R-based data client for Firehose Level 3 and Level 4 data and demonstrated its use with sample case studies. RTCGAToolbox can simplify data management for researchers who are interested in TCGA data. In addition, it can be integrated with other pipelines for further data analysis.

RTCGAToolbox is open-source and licensed under the GNU General Public License Version 2.0. All documentation and source code for RTCGAToolbox are freely available. Please cite the paper at [@ref3].

Currently, the following functions are provided to access and process datasets.

Installation

To install RTCGAToolbox, you can use Bioconductor. The source code is also available on GitHub. First-time users can install the package with the following code snippet:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("RTCGAToolbox")

Data Client

Before getting data from the Firehose pipelines, users have to check the valid dataset aliases, stddata run dates, and analysis run dates. RTCGAToolbox provides three control functions for this purpose. Users can list datasets with the "getFirehoseDatasets" function. In addition, users have to provide a stddata run date and/or an analysis run date to the client function. Valid dates are accessible via the "getFirehoseRunningDates" and "getFirehoseAnalyzeDates" functions. The code chunk below shows how to list datasets and dates.

library(RTCGAToolbox)
# Valid aliases
getFirehoseDatasets()
# Valid stddata run dates (returns the 3 most recent dates)
getFirehoseRunningDates(last = 3)
# Valid analysis run dates (returns the 3 most recent dates)
getFirehoseAnalyzeDates(last = 3)
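
The check described above can be wrapped in a small helper so that a mistyped alias or date fails before any download starts. This is only a minimal sketch: `checkFirehoseChoice` is a hypothetical name and is not part of RTCGAToolbox, and the vectors passed in the usage lines are illustrative values, not a complete list of valid aliases or dates.

```r
# Hypothetical helper (not part of RTCGAToolbox): stop early when the chosen
# alias or run date is not among the valid values.
checkFirehoseChoice <- function(alias, runDate, validAliases, validDates) {
    if (!alias %in% validAliases)
        stop("Unknown dataset alias: ", alias)
    if (!runDate %in% validDates)
        stop("Unknown stddata run date: ", runDate)
    invisible(TRUE)
}

# Illustrative values only; in practice the last two arguments would come
# from getFirehoseDatasets() and getFirehoseRunningDates().
checkFirehoseChoice("READ", "20160128",
    validAliases = c("ACC", "READ"),
    validDates = c("20160128"))
```

The sketch assumes that both control functions return plain character vectors, so their results can be passed in directly as `validAliases` and `validDates`.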

Once the dates and datasets are determined, users can call the data client function ("getFirehoseData") to access the data. The current version can download multiple data types, except ISOFORM and exon-level data because of their large size. The code chunk below downloads the READ dataset with clinical and mutation data.

# READ mutation data and clinical data
readData <- getFirehoseData(dataset="READ", runDate="20160128",
    forceDownload=TRUE, clinical=TRUE, Mutation=TRUE)

Printing the object shows which datasets are in the FirehoseData object:

readData

Users have to set several parameters to get the data they need. The "getFirehoseData" options are explained below:

The following logical keys are provided for the different data types. By default, the client downloads only clinical data.

Users can also set the following parameters to control client behavior.

Example Dataset

We provide an abbreviated dataset from the 'ACC' (adrenocortical carcinoma) cohort that contains only the top 6 rows of each dataset, together with a full clinical dataset. It can be loaded with:

data(accmini)
accmini

Conversion to Bioconductor classes

The biocExtract function allows the user to take any downloaded dataset and convert it into a standard Bioconductor object. The result is a SummarizedExperiment, RangedSummarizedExperiment, or RaggedExperiment, depending on features of the data. The user must provide the desired data type as input to the function along with the actual FirehoseData object. This makes the data easy to use with other software in the Bioconductor ecosystem.

biocExtract(accmini, "RNASeq2Gene")

biocExtract(accmini, "CNASNP")
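
Once converted, the object can be inspected with the usual Bioconductor accessors. The following is a minimal sketch, assuming the SummarizedExperiment package is installed and that the "RNASeq2Gene" extraction of accmini yields a SummarizedExperiment (the exact class depends on the data, as noted above). The variable names are illustrative.

```r
library(RTCGAToolbox)
library(SummarizedExperiment)

data(accmini)
rnaseq <- biocExtract(accmini, "RNASeq2Gene")

# Dimensions of the converted object (features x samples)
dim(rnaseq)

# First few sample identifiers
head(colnames(rnaseq))

# A small corner of the assay matrix; indices are capped so the
# subsetting cannot run past the abbreviated data
nr <- seq_len(min(3, nrow(rnaseq)))
nc <- seq_len(min(3, ncol(rnaseq)))
assay(rnaseq)[nr, nc]
```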

Raw Data

You can obtain the downloaded data in tabular or list format from the FirehoseData object by using the 'getData()' function.

head(getData(accmini, "clinical"))

getData(accmini, "RNASeq2GeneNorm")

getData(accmini, "GISTIC", "AllByGene")
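
The extracted tables can then be handled with ordinary base R tools. As a minimal sketch, assuming that `getData(accmini, "clinical")` returns a data.frame, the following summarizes the size of the clinical table and counts missing values per variable. The variable names `clin` and `naPerColumn` are illustrative.

```r
library(RTCGAToolbox)

data(accmini)
clin <- getData(accmini, "clinical")

# Size of the clinical table (rows x columns)
dim(clin)

# First few variable names
head(colnames(clin))

# Number of missing values per variable, largest first
naPerColumn <- colSums(is.na(clin))
head(sort(naPerColumn, decreasing = TRUE))
```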

Session Info

sessionInfo()

References



mksamur/RTCGAToolbox documentation built on Oct. 29, 2023, 10:06 p.m.