FirebrowseR: An 'API' Client for Broad's 'Firehose' Pipeline

author: "Mario Deng"
date: "2016-04-14"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{FirebrowseR - A short introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}

FirebrowseR is an API client for the Firehose Pipeline of the Broad Institute, which processes TCGA data sets. The Broad Institute provides several tools to access the data produced by the pipeline. One of them is Firebrowse, which also serves a web API. This R package, FirebrowseR, queries that API, giving you easy access to genomic data sets.

This section gives a short outline of what is in the scope of this package and some brief ideas on how to use it.

All functions and features of the Firebrowse API are described in its online documentation. The API is divided into four categories:

Samples
- Gives access to data types where functional analysis was not performed.
Analyses
- The data sets included here are pre-processed, since the raw data would be too big.
Archives
- Allows one to download large compressed archives, including data sets that are still too large even after pre-processing.
Metadata
- Here one can access all information needed to design and build cohorts, etc.

This package is designed to give R programmers easy access to Firehose/TCGA data sets. To that end, it implements all functions provided by the Firebrowse API, allowing you to comfortably query and download data sets. The package does not provide any additional functions, methods or tools to (pre-)process, analyze or evaluate these data sets.

FirebrowseR provides all functions listed in the API documentation, with exactly the same names and arguments. Each function also has its own help page, accessible via ?function_name, with explanations and examples.

The FirebrowseR package is installed like any other R package hosted on GitHub: devtools::install_github("mariodeng/FirebrowseR"). The API has just left beta status, so we plan to submit the package to CRAN as soon as possible.

FirebrowseR is licensed under the MIT License. Please see the license file or Wikipedia.

Here we walk through some examples that introduce the package and discuss how it differs from the API.

In this first example we analyze mRNA expression data of breast cancer. We take a look at some genes that are well known to be differentially expressed in this cancer entity. First, we have to design our cohort. The method Metadata.Cohorts returns all cohort identifiers and their corresponding descriptions. Within the descriptions we search for "breast", which yields the identifier for breast cancer.

library(FirebrowseR)
cohorts = Metadata.Cohorts(format = "csv") # Download all available cohorts
# Keep the identifier (first column) of cohorts whose description mentions "breast"
cancer.Type = cohorts[grep("breast", cohorts$description, ignore.case = TRUE), 1]
print(cancer.Type)

## [1] "BRCA"

Now that we know that the breast cancer samples are identified by BRCA, we can retrieve a list of all patients associated with this identifier.

brca.Pats = Samples.Clinical(cohort = cancer.Type, format="tsv")
dim(brca.Pats)

## NULL

A single call returns only 150 patients, which does not match the number given on the Firebrowse website. This is because the Firebrowse API returns data page by page, with a default page size of 150 entries (this holds for all functions that take the page parameter). The global limit for the page size is 2000. We can resolve this by iterating over the pages until we receive a data frame with fewer entries than the page size (150). We also need to carry over the column names from the first frame, since the API does not return column names for page > 1.

all.Received = FALSE
page.Counter = 1
page.size = 150
brca.Pats = list()
while(!all.Received){
  brca.Pats[[page.Counter]] = Samples.Clinical(format = "csv",
                                               cohort = cancer.Type,
                                               page_size = page.size,
                                               page = page.Counter)
  # Pages after the first come without column names; reuse the previous page's
  if(page.Counter > 1)
    colnames(brca.Pats[[page.Counter]]) = colnames(brca.Pats[[page.Counter-1]])

  # A page with fewer rows than the page size is the last one
  if(nrow(brca.Pats[[page.Counter]]) < page.size){
    all.Received = TRUE
  } else{
    page.Counter = page.Counter + 1
  }
}
brca.Pats = do.call(rbind, brca.Pats)
dim(brca.Pats)

## [1] 1097  111
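
The paging pattern above can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: fetch_all_pages, fetch_page and mock_fetch are illustrative names that are not part of FirebrowseR, and a mock fetcher replaces the API call so that the example runs offline.

```r
# Generic pager: `fetch_page` is any function(page, page_size) that returns a
# data frame. The names are illustrative and not part of FirebrowseR.
fetch_all_pages <- function(fetch_page, page_size = 150, max_pages = 1000) {
  pages <- list()
  for (page in seq_len(max_pages)) {
    chunk <- fetch_page(page = page, page_size = page_size)
    # Treat NULL or an empty frame as "no more data".
    if (is.null(chunk) || nrow(chunk) == 0) break
    # Pages after the first may arrive without column names; reuse page 1's.
    if (page > 1) colnames(chunk) <- colnames(pages[[1]])
    pages[[page]] <- chunk
    # A short page means this was the last one.
    if (nrow(chunk) < page_size) break
  }
  do.call(rbind, pages)
}

# Mock fetcher standing in for an API call: serves 370 fake rows page by page.
mock_data <- data.frame(id = seq_len(370), value = seq_len(370) * 2)
mock_fetch <- function(page, page_size) {
  start <- (page - 1) * page_size + 1
  if (start > nrow(mock_data)) return(mock_data[0, ])
  end <- min(page * page_size, nrow(mock_data))
  out <- mock_data[start:end, ]
  # Mimic the lost headers on later pages.
  if (page > 1) colnames(out) <- paste0("V", seq_len(ncol(out)))
  out
}

all_rows <- fetch_all_pages(mock_fetch, page_size = 150)
dim(all_rows)
```

With the real API, the fetcher could presumably be a thin wrapper such as function(page, page_size) Samples.Clinical(format = "csv", cohort = "BRCA", page = page, page_size = page_size), using the argument names shown above. The max_pages guard is an added safety net against endless loops and is not something the API requires.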

We have now collected all samples. Next, we subset this data frame to deceased patients. We do this only to keep the run time short, since downloading mRNA expression data for a thousand patients later on would take a lot of time.

brca.Pats = brca.Pats[ which(brca.Pats$vital_status == "dead"), ]
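
The subset above wraps the comparison in which(), which matters when vital_status contains missing values. The toy table below is made up purely for illustration and sketches the difference from plain logical indexing.

```r
# Toy stand-in for the clinical table; the values are made up for illustration.
pats <- data.frame(
  tcga_participant_barcode = c("P1", "P2", "P3", "P4"),
  vital_status = c("dead", "alive", NA, "dead"),
  stringsAsFactors = FALSE
)

# Plain logical indexing keeps an all-NA row for every NA in the condition.
with_na <- pats[pats$vital_status == "dead", ]
nrow(with_na)     # 3 rows, one of them entirely NA

# which() returns only the positions that are TRUE, so NA rows are dropped.
without_na <- pats[which(pats$vital_status == "dead"), ]
nrow(without_na)  # 2 rows
```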

Here we define a vector of genes known to be differentially expressed in breast cancer and download the mRNA expression data for these genes and our patients. Since many BRCA samples are available, we restrict the query to this subset of genes and again collect the results page by page.

diff.Exp.Genes = c("ESR1", "GATA3", "XBP1", "FOXA1", "ERBB2", "GRB7", "EGFR",
                   "FOXC1", "MYC")
all.Found = FALSE
page.Counter = 1
mRNA.Exp = list()
page.Size = 2000 # using a bigger page size is faster
while(!all.Found){
  mRNA.Exp[[page.Counter]] = Samples.mRNASeq(format = "csv",
                                             gene = diff.Exp.Genes,
                                             cohort = "BRCA",
                                             tcga_participant_barcode =
                                               brca.Pats$tcga_participant_barcode,
                                             page_size = page.Size,
                                             page = page.Counter)
  # A page with fewer rows than the page size is the last one
  if(nrow(mRNA.Exp[[page.Counter]]) < page.Size)
    all.Found = TRUE
  else
    page.Counter = page.Counter + 1
}
mRNA.Exp = do.call(rbind, mRNA.Exp)
dim(mRNA.Exp)

## [1] 1791    8
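
If a query for many genes ever becomes too large for a single request, one option is to split the gene vector into smaller chunks and query each chunk separately. The sketch below assumes this approach; chunk_vector, query_in_chunks and mock_query are hypothetical helpers that are not part of FirebrowseR, and the mock query only stands in for an API call.

```r
# Split a vector into consecutive chunks of at most `chunk_size` elements.
chunk_vector <- function(x, chunk_size) {
  split(x, ceiling(seq_along(x) / chunk_size))
}

# Run `query_fun` (hypothetical; any function that takes a gene vector and
# returns a data frame) once per chunk and stack the results.
query_in_chunks <- function(genes, chunk_size, query_fun) {
  results <- lapply(chunk_vector(genes, chunk_size), query_fun)
  do.call(rbind, unname(results))
}

genes <- c("ESR1", "GATA3", "XBP1", "FOXA1", "ERBB2", "GRB7", "EGFR",
           "FOXC1", "MYC")

# Mock query standing in for an API call: one fake row per requested gene.
mock_query <- function(gene_chunk) {
  data.frame(gene = gene_chunk, z.score = 0, stringsAsFactors = FALSE)
}

chunks <- chunk_vector(genes, 4)
lengths(chunks)   # chunk sizes: 4, 4 and 1
expr <- query_in_chunks(genes, 4, mock_query)
nrow(expr)        # one row per gene
```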

We keep only the samples of patients for whom both a primary tumor and the corresponding normal tissue are available. Normal tissue is encoded by NT and tumor tissue by TP. These identifiers can be decoded using the Metadata.SampleTypes("csv") function.

# Patients with normal tissue
normal.Tissue.Pats = which(mRNA.Exp$sample_type == "NT")
# Get these patients' barcodes
patient.Barcodes = mRNA.Exp$tcga_participant_barcode[normal.Tissue.Pats]
# Subset the mRNA.Exp data frame, keeping only the pre-selected barcodes AND
# having a sample type of NT or TP
mRNA.Exp = mRNA.Exp[which(mRNA.Exp$tcga_participant_barcode %in% patient.Barcodes &
                            mRNA.Exp$sample_type %in% c("NT", "TP")), ]
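
To see what this filter keeps, it can be applied to a small made-up table. The barcodes and sample types below are invented for illustration; only the column names follow the code above.

```r
# Made-up expression table: P1 and P3 have both tissues, P2 has only a tumor
# sample, and P3 additionally has a sample of a third type (TM).
toy <- data.frame(
  tcga_participant_barcode = c("P1", "P1", "P2", "P3", "P3", "P3"),
  sample_type = c("NT", "TP", "TP", "NT", "TP", "TM"),
  stringsAsFactors = FALSE
)

# Barcodes of patients that have a normal tissue sample.
with_normal <- unique(toy$tcga_participant_barcode[toy$sample_type == "NT"])

# Keep the NT and TP rows of exactly those patients.
paired <- toy[toy$tcga_participant_barcode %in% with_normal &
                toy$sample_type %in% c("NT", "TP"), ]

# Cross-tabulate patients against sample types to inspect the result.
table(paired$tcga_participant_barcode, paired$sample_type)
```

Note that the filter checks only for the presence of a normal sample. A patient with an NT row but no TP row would still be kept, so a stricter pairing would need an additional check.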

Now we can use the famous ggplot2 package to plot the expression.

library(ggplot2)
p = ggplot(mRNA.Exp, aes(factor(gene), z.score))
p +
  geom_boxplot(aes(fill = factor(sample_type))) +
  # we drop some outliers so the plot looks nicer; this also causes the warning below
  scale_y_continuous(limits = c(-1, 5)) +
  scale_fill_discrete(name = "Tissue")

## Warning: Removed 62 rows containing non-finite values (stat_boxplot).

[Figure: box plots of z.score per gene, filled by tissue type (sample_type)]

Every method in this package returns data; if yours does not, please read below. There is no difference between using tsv and csv: both return a matrix/data frame, and both are implemented to match the API. It is also possible to receive a JSON object (which requires the jsonlite package). Which one you use depends on whether you prefer working with matrices/data frames or JSON objects; both have pros and cons when accessing the data.

If your query did not return any data, there are four possible reasons.

There is no data matching your query
- Some types of analyses might not be available for each cohort. This should be mentioned in the error message.
Your arguments are malformed
- Please visit the API documentation to verify and test the arguments of the function you are using.
The API is too busy to answer
- Please try again later.
There is a bug within this function
- Please send an e-mail directly or consider posting on Biostars or Stack Overflow. We follow every post tagged with FirebrowseR.
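
For the case of a busy API, "try again later" can be automated with a small retry wrapper. This is a minimal sketch, assuming that a failed call either raises an error or returns NULL; with_retry and flaky are illustrative names that are not part of FirebrowseR, and the mock function stands in for a real request.

```r
# Retry `fun` up to `max_tries` times, waiting `wait` seconds between attempts.
# An error or a NULL result counts as a failed attempt.
# All names here are illustrative and not part of FirebrowseR.
with_retry <- function(fun, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("No data received after ", max_tries, " attempts")
}

# Mock call that fails twice before succeeding, standing in for a busy API.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("API busy")
  data.frame(cohort = "BRCA", stringsAsFactors = FALSE)
}

res <- with_retry(flaky, max_tries = 5, wait = 0)
calls  # number of attempts that were needed
```

A real call would be wrapped the same way, for example with_retry(function() Metadata.Cohorts(format = "csv")). Because the wrapper swallows the error message, it is worth running the bare call once when a query keeps failing, so that malformed arguments are not mistaken for a busy server.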