BigQuery is Google's managed SQL-like database system that is designed to run at PB scales. It allows storing, querying, and manipulating absolutely huge datasets with little more effort (perhaps even less) than operating on a standard relational database. Cost is a pay-as-you-go model, based on the amount of data processed.
The Institute for Systems Biology (ISB) Cancer Genomics Cloud have loaded all of the TCGA level 3 data into BigQuery, enabling very facile slicing of data to a very fine granularity. Bulk data in file form is still likely to be the most appropriate way to do large-scale data mining and inference, but for gene- or sample-based queries or to join datasets in novel ways, BigQuery offers a possible solution.
library(bigrquery)
bigrquery::list_tables("isb-cgc", "tcga_201510_alpha")
The first time you run this, you will be likely asked to login to a google billing account. This process will involve R kicking you over to a web browser (the first time) where you will need to login to your google cloud project. You will likely see this type of message....
Use a local file to cache OAuth access credentials between R sessions? 1: Yes 2: No Selection: Yes Adding ^\.httr-oauth$ to .Rbuildignore Waiting for authentication in browser... Press Esc/Ctrl + C to abort Authentication complete.
Assuming that you have set up an account with the ISB Cancer Genomics Cloud Pilot, you can authorize R to access the account, and from then on will be able to operate without browser interaction.
bigrquery::list_tables("isb-cgc", "tcga_201510_alpha")
The dplyr database tutorial discusses how to use dplyr
with Google BigQuery. We are going to use that approach here to simplify
the interaction with the database.
library(dplyr)
Next, we set up a connection between dplyr
and bigquery
.
my_billing = "isb-cgc-01-0006" # replace billing info here with your own bq = src_bigquery("isb-cgc", "tcga_201510_alpha", billing = my_billing)
The available tables in the tcga_201510_alpha
database are given by simply showing the bq
object.
bq
To work with a table from BigQuery using dplyr
, we need to set up a dplyr
table connection, one
for each table that we wish to use. Here, I do that with the Annotations
table.
tbl_annot = tbl(bq,'Annotations') # and show the columns colnames(tbl_annot)
Our dplyr
table object behaves essentially like a data.frame when using the dplyr
style of
coding.
head(tbl_annot)
tbl_mut = tbl(bq,'Somatic_Mutation_calls')
mut_counts = tbl_mut %>% select(Study, Variant_Classification) %>% group_by(Study,Variant_Classification) %>% summarize(n = n()) %>% arrange(Study,Variant_Classification)
```r
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.