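library(CancerCellLines)
# Attached explicitly in case CancerCellLines does not attach them itself;
# the examples below use dbGetQuery()/dbListTables() and the dplyr verbs
library(RSQLite)
library(dplyr)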
The aim of the CancerCellLines package is to provide standardised code to create, and extract data from, a SQLite database containing the published genomic data from the Cancer Cell Line Encyclopedia and similar projects. The reason for using a SQLite database is to allow data to be stored on disk rather than loaded into memory. This is useful when the user wishes to work with small subsets of the overall dataset, for example just 10-20 genes in 80 lung cell lines. This vignette covers the initial setup of the package along with some examples of its use.
Data files referred to here can be downloaded from the CCLE project website.
There are also toy examples included in the package:
list.files(system.file("extdata", package = "CancerCellLines"))
Either make a toy database from scratch using the convenience function makeToyDB:
test_db <- makeToyDB()
test_db
test_db@dbname
dbListTables(test_db)
Or connect to the one built into the package using the setupSQLite function:
test_db <- setupSQLite(system.file('extdata/toy.db', package="CancerCellLines"))
test_db
test_db@dbname
dbListTables(test_db)
The functions from RSQLite can be used to query data in the normal way:
dbGetQuery(test_db, "select * from ccle_affy limit 10")
dbGetQuery(test_db, "select * from ccle_sampleinfo limit 10")[,1:5]
dbGetQuery(test_db, "select Symbol, t1.CCLE_name, Signal, Site_primary, Hist_subtype1
                     from ccle_affy as t1
                     inner join ccle_sampleinfo t2 on t1.CCLE_name = t2.CCLE_name
                     where t2.Hist_subtype1 == 'ductal_carcinoma'
                     order by Symbol desc
                     limit 10")
Indexing the database allows fast retrieval even when the dataset gets large - more later.
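As a rough illustration of why indexing matters, the sketch below builds a tiny in-memory SQLite table, adds an index on the gene symbol column, and asks SQLite how it plans to run a query. The table name `demo_affy`, the index name `idx_demo_affy_symbol`, and the data are made up for this example; they are not the schema or index names that CancerCellLines itself creates.

```r
library(DBI)
library(RSQLite)

# In-memory database standing in for a large on-disk one
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical miniature expression table (values are invented)
expr <- data.frame(
  CCLE_name = rep(c("BT474_BREAST", "MDAMB468_BREAST"), each = 3),
  Symbol    = rep(c("PTEN", "TP53", "BRAF"), times = 2),
  Signal    = c(7.1, 8.3, 6.5, 4.2, 9.0, 6.8),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "demo_affy", expr)

# Index the column that appears in the WHERE clause
dbExecute(con, "CREATE INDEX idx_demo_affy_symbol ON demo_affy (Symbol)")

# Ask SQLite for its query plan; the 'detail' column describes each step
plan <- dbGetQuery(
  con,
  "EXPLAIN QUERY PLAN SELECT * FROM demo_affy WHERE Symbol = 'PTEN'"
)
print(plan$detail)

dbDisconnect(con)
```

If the plan mentions the index name, SQLite is looking rows up through the index rather than scanning the whole table, which is what keeps retrieval fast as the table grows.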
However, writing the SQL yourself can get inconvenient if you want to retrieve several genes or cell lines:
dbGetQuery(test_db, "select * from ccle_affy where symbol IN ('PTEN', 'TP53', 'BRAF' ) and CCLE_name IN ('BT474_BREAST', 'MDAMB468_BREAST') limit 10")

symbols <- c('PTEN', 'TP53', 'BRAF')
cell_lines <- c('BT474_BREAST', 'MDAMB468_BREAST')
symbols.sql <- paste(symbols, collapse="','")
cell_lines.sql <- paste(cell_lines, collapse="','")
dbGetQuery(test_db, sprintf("select * from ccle_affy where symbol IN ('%s' ) and CCLE_name IN ('%s') limit 10", symbols.sql, cell_lines.sql))
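Pasting quotes in by hand, as above, breaks if a value ever contains a single quote. One possible alternative is to let DBI do the quoting. The sketch below is only an illustration: `sql_in_list` is a hypothetical helper that is not part of CancerCellLines, and `demo_affy` is an invented in-memory table.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper (not part of CancerCellLines): quote each value with
# DBI::dbQuoteString() and join them for use inside an SQL IN (...) clause
sql_in_list <- function(con, values) {
  paste(as.character(DBI::dbQuoteString(con, values)), collapse = ", ")
}

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Invented miniature table for demonstration
dbWriteTable(con, "demo_affy", data.frame(
  CCLE_name = rep(c("BT474_BREAST", "MDAMB468_BREAST"), each = 3),
  Symbol    = rep(c("PTEN", "TP53", "BRAF"), times = 2),
  Signal    = c(7.1, 8.3, 6.5, 4.2, 9.0, 6.8),
  stringsAsFactors = FALSE
))

demo_symbols    <- c("PTEN", "TP53")
demo_cell_lines <- c("BT474_BREAST")

query <- sprintf(
  "SELECT * FROM demo_affy WHERE Symbol IN (%s) AND CCLE_name IN (%s)",
  sql_in_list(con, demo_symbols),
  sql_in_list(con, demo_cell_lines)
)
res <- dbGetQuery(con, query)
print(res)

# Embedded single quotes are escaped rather than breaking the query
escaped <- sql_in_list(con, "O'Neil")

dbDisconnect(con)
```

The query text is still assembled with sprintf, so this only removes the quoting problem; it is not a substitute for parameterised queries when the inputs are untrusted.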
Things become much nicer if you query with dplyr, since this writes the underlying SQL for you:
con <- src_sqlite(test_db@dbname)
ccle_affy <- con %>% tbl('ccle_affy')
ccle_affy
ccle_sampleinfo <- con %>% tbl('ccle_sampleinfo')
ccle_sampleinfo

ccle_sampleinfo %>%
  dplyr::select(CCLE_name, Site_primary, Hist_subtype1) %>%
  dplyr::filter(Hist_subtype1 == 'ductal_carcinoma') %>%
  dplyr::inner_join(ccle_affy, by='CCLE_name') %>%
  dplyr::arrange(desc(Symbol))

ccle_affy %>% filter(symbol %in% symbols & CCLE_name %in% cell_lines)
There are a number of convenience functions that assist in executing typical queries. For example, the getAffyData and getCopyNumberData functions can be used to simplify the queries above still further:
getAffyData(test_db, symbols, cell_lines)
getCopyNumberData(test_db, symbols, cell_lines)
The getHybcapData and getCosmicCLPData functions retrieve the CCLE hybrid capture and COSMIC Cell Line Project sequencing data respectively:
getHybcapData(test_db, symbols, cell_lines)
getCosmicCLPData(test_db, symbols, cell_lines)
Note that the CancerCellLines package includes functionality to convert cell line identifiers between different datasets using the cell_line_ids table. This happens transparently in the getCosmicCLPData function:
con %>% tbl('cell_line_ids') %>% filter(unified_id %in% cell_lines)
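To make the idea of identifier conversion concrete, the sketch below joins a dataset keyed by its own cell line names to a mapping table, so that results can be requested by unified identifier. Only the `unified_id` column name comes from this vignette; the table names `demo_ids` and `demo_mutations`, the columns `native_id` and `id_type`, and all values are assumptions made for illustration, and this is not how getCosmicCLPData is actually implemented.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical mapping table: 'native_id' and 'id_type' are assumed names
dbWriteTable(con, "demo_ids", data.frame(
  unified_id = c("BT474_BREAST", "MDAMB468_BREAST"),
  native_id  = c("BT-474", "MDA-MB-468"),
  id_type    = c("cosmic_clp", "cosmic_clp"),
  stringsAsFactors = FALSE
))

# Hypothetical dataset that uses its own cell line naming scheme
dbWriteTable(con, "demo_mutations", data.frame(
  native_id = c("BT-474", "MDA-MB-468", "MDA-MB-468"),
  Symbol    = c("TP53", "TP53", "PTEN"),
  stringsAsFactors = FALSE
))

# Query by unified identifier; the join translates to the native one
res <- dbGetQuery(con, "
  SELECT t2.unified_id, t1.Symbol
  FROM demo_mutations AS t1
  INNER JOIN demo_ids AS t2 ON t1.native_id = t2.native_id
  WHERE t2.unified_id = 'MDAMB468_BREAST'
  ORDER BY t1.Symbol
")
print(res)

dbDisconnect(con)
```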
Finally, the getDrugData_CCLE function retrieves the CCLE drug response data:
drugs <- c('Lapatinib', 'AZD6244', 'Nilotinib')
getDrugData_CCLE(test_db, drugs, cell_lines)
The getDrugData_custom function, by contrast, transforms an arbitrary data frame with the field names shown below into the standardised data frame:
data(dietlein_data)
head(dietlein_data)
getDrugData_custom(dietlein_data, drugs = 'KU60648_pGI50', cell_lines = c('DMS114_LUNG', 'A549_LUNG'))
These functions all have a standard output format which means that data from different assay types can be merged and plotted or analysed together.
The makeTallDataFrame function does this merging in a standard way and returns the data in a 'tidy' format that is useful for plotting in ggplot2 or further manipulation with tidyr.
makeTallDataFrame(test_db, symbols, cell_lines, drugs)
The makeWideFromTallDataFrame function can take the output from makeTallDataFrame and create a wide or matrix-like data frame, which is a convenient input for modelling packages such as caret.
my_df <- makeTallDataFrame(test_db, symbols, cell_lines, drugs)
makeWideFromTallDataFrame(my_df)
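For readers unfamiliar with the tall versus wide distinction, the sketch below shows the general reshaping idea using tidyr on an invented data frame. The column names `CCLE_name`, `ID` and `value` and the `_affy` suffix are assumptions for illustration only; they are not necessarily the columns that makeTallDataFrame returns, and makeWideFromTallDataFrame may work differently internally.

```r
library(tidyr)

# Invented tall data: one row per cell line / feature combination
tall <- data.frame(
  CCLE_name = rep(c("BT474_BREAST", "MDAMB468_BREAST"), each = 2),
  ID        = rep(c("PTEN_affy", "TP53_affy"), times = 2),
  value     = c(7.1, 8.3, 4.2, 9.0),
  stringsAsFactors = FALSE
)

# One row per cell line, one column per feature
wide <- pivot_wider(tall, names_from = "ID", values_from = "value")
print(wide)
```

The wide form has one row per cell line, which is the layout most modelling functions expect, while the tall form is easier to facet and colour by feature when plotting.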
Finally, there is the makeWideDataFrame function, which generates a wide data frame directly.
makeWideDataFrame(test_db, symbols, cell_lines, drugs)
The data_types parameter can be used to control which data types are returned, and the drug_df parameter is used to provide custom drug information as per the getDrugData_custom function description above:
makeWideDataFrame(test_db, symbols, cell_lines, drugs, data_types=c('hybcap', 'affy', 'resp'))
The full CCLE dataset is not included in this package because of its size and because permission to redistribute the data has not yet been sought. However, the instructions below demonstrate how to build the full database yourself:
Define where the data is to be stored/found. Files are downloaded from the CCLE project website and COSMIC Cell Line Project website.
dbpath <- '~/BigData/CellLineData/CancerCellLines.db'
infopath <- '~/BigData/CellLineData/RawData/CCLE_sample_info_file_2012-10-18.txt'
affypath <- '~/BigData/CellLineData/RawData/CCLE_Expression_Entrez_2012-09-29.gct'
cnpath <- '~/BigData/CellLineData/RawData/CCLE_copynumber_byGene_2012-09-29.txt'
hybcappath <- '~/BigData/CellLineData/RawData/CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf'
cosmicclppath <- '~/BigData/CellLineData/RawData/CosmicCLP_CompleteExport_v74.tsv'
drugpath <- '~/BigData/CellLineData/RawData/CCLE_NP24.2009_Drug_data_2012.02.20.csv'
idspath <- system.file("extdata", "CellLineIDNormalisationNov15.txt", package = "CancerCellLines")
Set up the SQLite database and run the import functions:
full_con <- setupSQLite(dbpath)
importCCLE_info(infopath, full_con)
importCCLE_hybcap(hybcappath, full_con)
importCosmicCLP_exome(cosmicclppath, full_con)
importCCLE_drugresponse(drugpath, full_con)
importCCLE_affy(affypath, full_con)
importCCLE_cn(cnpath, full_con)
importCellLineIDs(idspath, full_con)
This process should take 3-4 minutes, with most of the time spent importing the Affymetrix data.
Now use the database as per the toy example. Thanks to the speed of SQLite and the wonders of indexing, data retrieval should still be just as fast even though the ccle_affy table contains around 20 million data points.
To really put it through its paces try retrieving data from 2000 genes in 200 cell lines as below:
dplyr_con <- src_sqlite(full_con@dbname)

#get 2000 random genes
random_genes <- dplyr_con %>%
  tbl('ccle_affy') %>%
  group_by(Symbol) %>%
  summarise(N=n()) %>%
  ungroup() %>%
  collect %>%
  dplyr::filter(N < mean(N)) %>%
  sample_n(2000) %>%
  as.data.frame
random_genes <- random_genes$Symbol

#get 200 random cell lines
random_cell_lines <- dplyr_con %>%
  tbl('ccle_sampleinfo') %>%
  dplyr::select(CCLE_name) %>%
  distinct %>%
  collect %>%
  sample_n(200) %>%
  as.data.frame
random_cell_lines <- random_cell_lines$CCLE_name

#get 10 random compounds
random_drugs <- dplyr_con %>%
  tbl('ccle_drug_data') %>%
  dplyr::select(Compound) %>%
  distinct %>%
  collect %>%
  sample_n(10) %>%
  as.data.frame
random_drugs <- random_drugs$Compound

#retrieve the data
test_affy <- getAffyData(full_con, random_genes, random_cell_lines)
test_cn <- getCopyNumberData(full_con, random_genes, random_cell_lines)
test_hybcap <- getHybcapData(full_con, random_genes, random_cell_lines)
test_cosmicclp <- getCosmicCLPData(full_con, random_genes, random_cell_lines)

#make a big data frame
big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, random_drugs)

#without resp data
big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, drugs=NULL, data_types=c('affy', 'cn', 'hybcap', 'cosmicclp'))

#with custom resp data
big_df <- makeWideDataFrame(full_con, random_genes, cell_lines = c('DMS114_LUNG', 'A549_LUNG'), drugs = 'KU60648_pGI50', drug_df = dietlein_data)
This should take no more than 4-5 seconds for each constituent retrieval, and ~10 seconds to make the data frame, depending on your hardware (SSDs will be quicker than HDDs).
Future plans are to integrate the idea of using SQLite for fast on-disk subsetting and retrieval with the biocMultiAssay package. This will allow generic extension of the concept to other datasets without having to define import and retrieval functions and database schemas one dataset at a time.
sessionInfo()