FirebrowseR is an R client for the Web API of Firebrowse. The Broad Institute's Firehose Pipeline processes TCGA data sets, and the institute provides several tools to access the results. One of these tools is Firebrowse, which also serves a Web API. This R package, FirebrowseR, queries that API, giving you easy access to genomic data sets.
A short outline of what is in the scope of this package, along with some brief ideas on how to use it.
The Firebrowse API with all its functions, features and descriptions can be viewed here. The API divides into four categories:
This package is designed to provide easy access to Firehose/TCGA data sets for R programmers. It therefore implements all functions provided by the Firebrowse API, allowing you to comfortably query and download data sets. This package does not provide any additional functions, methods or tools to (pre-)process, analyze or evaluate the data sets named above.
FirebrowseR provides all functions displayed in the API documentation, with exactly the same names and arguments. Each function also has its own help page, accessible via ?function_name, giving explanations and examples for the function.
The FirebrowseR package is installed like any other R package hosted on GitHub: devtools::install_github("mariodeng/FirebrowseR"). The API has just left its beta status, so we plan to submit the package to CRAN as soon as possible.
FirebrowseR is licensed under the MIT License. Please see the license file or Wikipedia.
Here we walk through some examples that introduce the package and discuss how it differs from the API.
In this first example we are going to analyze mRNA expression data of breast cancer. We take a look at some genes which are well known to be differentially expressed within this entity of cancer.
First, we have to design our cohort. The method Metadata.Cohorts returns all cohort identifiers and their corresponding descriptions. Within the descriptions we search for "breast", yielding the identifier for breast cancer.
    library(FirebrowseR)

    # Download all available cohorts
    cohorts = Metadata.Cohorts(format = "csv")
    # Pick the identifier whose description mentions "breast"
    cancer.Type = cohorts[grep("breast", cohorts$description, ignore.case = TRUE), 1]
    print(cancer.Type)
Now that we know that the breast cancer samples are identified by BRCA, we can retrieve a list of all patients associated with this identifier.
    brca.Pats = Samples.Clinical(cohort = cancer.Type, format = "tsv")
    dim(brca.Pats)
The dimensions of the returned data frame indicate that there are only 150 patients, which does not match the number given on the Firebrowse website. This is because the Firebrowse API returns data page-wise, with a default page size of 150 entries (this holds for all functions that provide the page parameter). The global limit for the page size is 2000. We can resolve this by iterating over the pages until we receive a data frame with fewer entries than the page size (150). We also need to adopt the column names from the first frame, since the API does not return column names for page > 1.
    all.Received = FALSE
    page.Counter = 1
    page.size = 150
    brca.Pats = list()
    while(!all.Received){
      brca.Pats[[page.Counter]] = Samples.Clinical(format = "csv",
                                                   cohort = cancer.Type,
                                                   page_size = page.size,
                                                   page = page.Counter)
      # Pages after the first come without column names, so reuse the previous ones
      if(page.Counter > 1)
        colnames(brca.Pats[[page.Counter]]) = colnames(brca.Pats[[page.Counter-1]])
      # A page shorter than the page size is the last one
      if(nrow(brca.Pats[[page.Counter]]) < page.size){
        all.Received = TRUE
      } else{
        page.Counter = page.Counter + 1
      }
    }
    brca.Pats = do.call(rbind, brca.Pats)
    dim(brca.Pats)
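The paging pattern above recurs for every function that takes a page parameter, so it can be useful to factor it out. The following is a minimal sketch, not part of FirebrowseR: fetch_all_pages and mock_fetch are hypothetical names, and mock_fetch merely simulates an API that serves 320 rows so that the example runs without network access.

```r
# Generic paging helper (hypothetical, not part of FirebrowseR).
# fetch_page(page, page_size) must return a data frame for the given page.
fetch_all_pages <- function(fetch_page, page_size = 150) {
  pages <- list()
  page <- 1
  repeat {
    chunk <- fetch_page(page, page_size)
    # Stop if there is nothing (more) to read
    if (is.null(chunk) || nrow(chunk) == 0) break
    # Pages after the first may lack column names, so reuse the first page's
    if (page > 1) colnames(chunk) <- colnames(pages[[1]])
    pages[[page]] <- chunk
    # A page shorter than the page size is the last one
    if (nrow(chunk) < page_size) break
    page <- page + 1
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Stand-in for an API call: serves 320 fake rows page by page (assumption).
mock_fetch <- function(page, page_size) {
  ids <- seq_len(320)
  idx <- ids[ceiling(ids / page_size) == page]
  data.frame(id = idx, cohort = rep("BRCA", length(idx)))
}

all_rows <- fetch_all_pages(mock_fetch, page_size = 150)
dim(all_rows)
```

With the real package, fetch_page would presumably be a small wrapper such as function(page, page_size) Samples.Clinical(format = "csv", cohort = cancer.Type, page = page, page_size = page_size). Unlike the loop above, the helper also stops cleanly when the last page happens to be empty.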
We have now collected all samples. Next we subset this data frame to deceased patients. We do this only to keep the run time short, since downloading mRNA expression data for a thousand patients would take a long time later on.
    brca.Pats = brca.Pats[which(brca.Pats$vital_status == "dead"), ]
Here we define a vector containing genes known to be differentially expressed in breast cancer and download the mRNA expression data for these genes and our patients. Since there are a lot of BRCA samples available, we retrieve the result in chunks, page by page.
    diff.Exp.Genes = c("ESR1", "GATA3", "XBP1", "FOXA1", "ERBB2", "GRB7",
                       "EGFR", "FOXC1", "MYC")
    all.Found = FALSE
    page.Counter = 1
    mRNA.Exp = list()
    page.Size = 2000 # using a bigger page size is faster
    while(!all.Found){
      mRNA.Exp[[page.Counter]] = Samples.mRNASeq(format = "csv",
                                                 gene = diff.Exp.Genes,
                                                 cohort = "BRCA",
                                                 tcga_participant_barcode = brca.Pats$tcga_participant_barcode,
                                                 page_size = page.Size,
                                                 page = page.Counter)
      # A page shorter than the page size is the last one
      if(nrow(mRNA.Exp[[page.Counter]]) < page.Size)
        all.Found = TRUE
      else
        page.Counter = page.Counter + 1
    }
    mRNA.Exp = do.call(rbind, mRNA.Exp)
    dim(mRNA.Exp)
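Another way to keep individual requests small is to issue one query per gene and combine the results afterwards. The sketch below illustrates that idea under stated assumptions: fetch_by_gene and mock_fetch_gene are hypothetical names, and mock_fetch_gene returns made-up rows instead of calling the API.

```r
# Query one gene at a time and row-bind the results (hypothetical helper).
# fetch_gene is any function that takes a gene symbol and returns a data frame.
fetch_by_gene <- function(genes, fetch_gene) {
  per_gene <- lapply(genes, fetch_gene)
  do.call(rbind, per_gene)
}

# Stand-in for the API call: two fake rows per gene (assumption, no real data).
mock_fetch_gene <- function(gene) {
  data.frame(gene = gene,
             sample_type = c("NT", "TP"),
             z.score = c(0, 1),
             stringsAsFactors = FALSE)
}

genes <- c("ESR1", "GATA3", "XBP1")
expr <- fetch_by_gene(genes, mock_fetch_gene)
table(expr$gene)
```

In practice fetch_gene would probably wrap Samples.mRNASeq with a single gene and would still need the paging loop, since one gene can yield more rows than a single page holds.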
We keep only the samples from patients with both a primary tumor and corresponding normal tissue available. Normal tissue is encoded by NT and tumor tissue by TP. These identifiers can be decoded using the Metadata.SampleTypes("csv") function.
    # Patients with normal tissue
    normal.Tissue.Pats = which(mRNA.Exp$sample_type == "NT")
    # get the patients' barcodes
    patient.Barcodes = mRNA.Exp$tcga_participant_barcode[normal.Tissue.Pats]
    # Subset the mRNA.Exp data frame, keeping only the pre-selected barcodes AND
    # having a sample type of NT or TP
    mRNA.Exp = mRNA.Exp[which(mRNA.Exp$tcga_participant_barcode %in% patient.Barcodes &
                              mRNA.Exp$sample_type %in% c("NT", "TP")), ]
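To see what this filter keeps and drops, it can help to run the same logic on a tiny hand-made table. The data below are invented for illustration; only the column names follow the code above, and "XX" is a placeholder for any sample type other than NT or TP.

```r
# Toy data: P1 has NT and TP, P2 has only TP, P3 has NT, TP and another type.
toy <- data.frame(
  tcga_participant_barcode = c("P1", "P1", "P2", "P3", "P3", "P3"),
  sample_type = c("NT", "TP", "TP", "NT", "TP", "XX"),
  stringsAsFactors = FALSE
)

# Barcodes of participants that have a normal-tissue sample
with_normal <- unique(toy$tcga_participant_barcode[toy$sample_type == "NT"])

# Keep NT/TP rows for those participants only
paired <- toy[toy$tcga_participant_barcode %in% with_normal &
              toy$sample_type %in% c("NT", "TP"), ]
paired
```

P2 is dropped because it has no normal sample, and the "XX" row of P3 is dropped because of its sample type. Note that this logic only checks for the presence of a normal sample; a participant with NT but without TP would still pass the filter.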
Now we can use the famous ggplot2 package to plot the expression.
    library(ggplot2)
    p = ggplot(mRNA.Exp, aes(factor(gene), z.score))
    p +
      geom_boxplot(aes(fill = factor(sample_type))) +
      # we drop some outliers so the plot looks nicer; this also causes the warning
      scale_y_continuous(limits = c(-1, 5)) +
      scale_fill_discrete(name = "Tissue")
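A numeric summary can complement the box plot. As a rough sketch, the median z-score per gene and tissue type can be computed with base R's aggregate; the data frame below is made up and only mirrors the gene, sample_type and z.score columns used in the plot.

```r
# Toy expression table with invented values (not real TCGA data)
toy_expr <- data.frame(
  gene = rep(c("ESR1", "MYC"), each = 4),
  sample_type = rep(c("NT", "NT", "TP", "TP"), times = 2),
  z.score = c(0.1, 0.3, 2.0, 3.0, 1.0, 1.2, 0.8, 0.6)
)

# Median z-score for every gene / sample type combination
med <- aggregate(z.score ~ gene + sample_type, data = toy_expr, FUN = median)
med
```

Applied to the real data, the same call would take mRNA.Exp in place of toy_expr, assuming the columns are named as in the plotting code.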
Every method in this package returns data; if yours does not, please read below. Further, there is no difference between using tsv and csv: both return a matrix/data frame, and both are implemented to match the API. It is also possible to receive a json object (which requires the jsonlite package). Which one you use depends on whether you prefer working with matrices/data frames or json objects; both have pros and cons when accessing the data.
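To give a feel for the json route, the sketch below parses a small JSON text with jsonlite and ends up with a data frame again. The JSON content and its field names are invented for illustration, and the objects FirebrowseR actually returns for format = "json" may be structured differently (for example nested).

```r
library(jsonlite)

# Toy JSON text; content and field names are assumptions for illustration.
json_text <- '[{"cohort":"BRCA","description":"Breast invasive carcinoma"},
               {"cohort":"LUAD","description":"Lung adenocarcinoma"}]'

# An array of flat objects is simplified to a data frame by default
cohorts_df <- fromJSON(json_text)

# The same lookup as in the cohort example, now on the parsed JSON
cohorts_df[grep("breast", cohorts_df$description, ignore.case = TRUE), "cohort"]
```

Flat records like these convert to a data frame without loss, so the choice mostly matters when the response contains nested fields that a table cannot represent directly.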
If your query did not return any data, then there are potentially four reasons for that.