In chapmandu2/CancerCellLines: Analysis of cancer cell line data

library(CancerCellLines)

Introduction

The aim of the CancerCellLines package is to provide standardised code to create and extract data from a SQLite database containing the published genomic data from the Cancer Cell Line Encyclopedia and similar projects. The reason for using a SQLite database is to allow data to be stored on disk, rather than be loaded into memory. This is useful when the user wishes to work with small subsets of the overall dataset, for example just 10-20 genes in 80 lung cell lines. This Vignette will cover the inital set up of the package along with some examples of its use.

Data Origin

Data files referred to here can be download from the CCLE project website.

There are also toy examples included in the package:

list.files(system.file("extdata", package = "CancerCellLines"))

Creating/Connecting to the toy dataset

Either make a toy database from scratch using the convenience function makeToyDB:

test_db <- makeToyDB()
test_db
test_db@dbname
dbListTables(test_db)

Or connect to the one built into the package using the setupSQLite function:

test_db <- setupSQLite(system.file('extdata/toy.db', package="CancerCellLines"))
test_db
test_db@dbname
dbListTables(test_db)

Querying the toy dataset with RSQLite

The functions from RSQLite can be used to query data in the normal way:

dbGetQuery(test_db, "select * from ccle_affy limit 10")
dbGetQuery(test_db, "select * from ccle_sampleinfo limit 10")[,1:5]
dbGetQuery(test_db, "select Symbol, t1.CCLE_name, Signal, Site_primary, Hist_subtype1 from ccle_affy as t1 
                      inner join ccle_sampleinfo t2 on t1.CCLE_name = t2.CCLE_name
                      where t2.Hist_subtype1 == 'ductal_carcinoma'
                      order by Symbol desc
                      limit 10")

Indexing the database allows fast retrieval even when the dataset gets large - more later.

However, writing the SQL yourself can get inconvenient if you want to retrieve several genes or cell lines:

dbGetQuery(test_db, "select * from ccle_affy 
                      where symbol IN ('PTEN', 'TP53', 'BRAF' ) and 
                            CCLE_name IN ('BT474_BREAST', 'MDAMB468_BREAST') 
                      limit 10")

symbols <- c('PTEN', 'TP53', 'BRAF')
cell_lines <- c('BT474_BREAST', 'MDAMB468_BREAST') 
symbols.sql <- paste(symbols, collapse="','")
cell_lines.sql <- paste(cell_lines, collapse="','")

dbGetQuery(test_db, sprintf("select * from ccle_affy 
                      where symbol IN ('%s' ) and 
                            CCLE_name IN ('%s') 
                      limit 10", symbols.sql, cell_lines.sql))

Querying the toy dataset with dplyr

Things become much nicer if you query with dplyr, since this writes the underlying SQL for you:

con <- src_sqlite(test_db@dbname) 
ccle_affy <- con %>% tbl('ccle_affy')
ccle_affy
ccle_sampleinfo <- con %>% tbl('ccle_sampleinfo')
ccle_sampleinfo

ccle_sampleinfo %>% dplyr::select(CCLE_name, Site_primary, Hist_subtype1) %>% 
  dplyr::filter(Hist_subtype1 == 'ductal_carcinoma') %>%
  dplyr::inner_join(ccle_affy, by='CCLE_name') %>%
  dplyr::arrange(desc(Symbol))

ccle_affy %>% filter(symbol %in% symbols & CCLE_name %in% cell_lines)

Convenience functions to export data

There are a number of convenience functions that assist in executing typical queries. For example, the getAffyData and getCopyNumberData functions can be used to simplify the queries above still further:

getAffyData(test_db, symbols, cell_lines)
getCopyNumberData(test_db, symbols, cell_lines)

Whilst the getHybcapData and getCosmicCLPData functions retrieve the CCLE hybrid capture and Cosmic Cell Line Project sequencing data respectively:

getHybcapData(test_db, symbols, cell_lines)
getCosmicCLPData(test_db, symbols, cell_lines)

Note that the CancerCellLines package includes functionality to convert cell line identifiers between different datasets using the cell_line_ids table. This happens transparently in the getCosmicCLPData function:

con %>% tbl('cell_line_ids') %>% filter(unified_id %in% cell_lines)

Finally, the getDrugData_CCLE function retrieves the CCLE drug response data:

drugs <- c('Lapatinib', 'AZD6244', 'Nilotinib' )
getDrugData_CCLE(test_db, drugs, cell_lines)

Whilst the getDrugData_custom function transforms an arbitrary data frame with the field names below into the standardised data frame:

data(dietlein_data)
head(dietlein_data)
getDrugData_custom(dietlein_data, drugs = 'KU60648_pGI50', cell_lines = c('DMS114_LUNG', 'A549_LUNG'))

Combining different data types

These functions all have a standard output format which means that data from different assay types can be merged and plotted or analysed together.

The makeTallDataFrame function does this merging in a standard way and returns the the data in a 'tidy' format that is useful for plotting in ggplot2 or further manipulation with tidyr.

makeTallDataFrame(test_db, symbols, cell_lines, drugs)

The makeWideFromTallDataFrame function can take the output from makeTallDataFrame and create a wide or matrix-like data frame which is a conveninent input for modelling packages such as caret.

my_df <- makeTallDataFrame(test_db, symbols, cell_lines, drugs)
makeWideFromTallDataFrame(my_df)

Finally, there is the makeWideDataFrame function which generates a wide data frame directly.

makeWideDataFrame(test_db, symbols, cell_lines, drugs)

The data_types parameter can be used to control which data types are returned, and the drug_df parameter is used to provide custom drug information as per the getDrugData_custom function description above

makeWideDataFrame(test_db, symbols, cell_lines, drugs, data_types=c('hybcap', 'affy', 'resp'))

Working with the full CCLE dataset

The full CCLE dataset is not included in this package due to reasons of data size and because permission for data re-distribution has not yet been sought. However, the instructions below will demonstrate how this is done:

Define where the data is to be stored/found. Files are downloaded from the CCLE project website and COSMIC Cell Line Project website.

dbpath <- '~/BigData/CellLineData/CancerCellLines.db'
infopath <- '~/BigData/CellLineData/RawData/CCLE_sample_info_file_2012-10-18.txt'
affypath <- '~/BigData/CellLineData/RawData/CCLE_Expression_Entrez_2012-09-29.gct'
cnpath <- '~/BigData/CellLineData/RawData/CCLE_copynumber_byGene_2012-09-29.txt'
hybcappath <- '~/BigData/CellLineData/RawData/CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf'
cosmicclppath <- '~/BigData/CellLineData/RawData/CosmicCLP_CompleteExport_v74.tsv'
drugpath <- '~/BigData/CellLineData/RawData/CCLE_NP24.2009_Drug_data_2012.02.20.csv'
idspath <- system.file("extdata", "CellLineIDNormalisationNov15.txt", package = "CancerCellLines")

Set up the SQLite database and run the import functions

  full_con <- setupSQLite(dbpath)
  importCCLE_info(infopath , full_con)
  importCCLE_hybcap(hybcappath , full_con)
  importCosmicCLP_exome(cosmicclppath, full_con)
  importCCLE_drugresponse(drugpath , full_con)
  importCCLE_affy(affypath , full_con)
  importCCLE_cn(cnpath, full_con)
  importCellLineIDs(idspath, full_con)

This process should take 3-4 minutes with most of the time spent importing the affymetrix data.

Now use the database as per the toy example. Thanks to the speed of SQLite and the wonders of indexing, data retrieval should still be just as fast even though the ccle_affy table contains around 20 million data points.

To really put it through its paces try retrieving data from 2000 genes in 200 cell lines as below:

    dplyr_con <- src_sqlite(full_con@dbname)

    #get 2000 random genes
    random_genes <- dplyr_con %>% tbl('ccle_affy') %>% group_by(Symbol) %>% summarise(N=n()) %>% 
      ungroup() %>% collect %>% 
      dplyr::filter(N < mean(N)) %>% sample_n(2000) %>% as.data.frame
    random_genes <- random_genes$Symbol

    #get 200 random cell lines
    random_cell_lines <- dplyr_con %>% tbl('ccle_sampleinfo') %>% dplyr::select(CCLE_name) %>%
      distinct %>% collect %>% sample_n(200) %>% as.data.frame
    random_cell_lines <- random_cell_lines$CCLE_name

    #get 10 random compounds
    random_drugs <- dplyr_con %>% tbl('ccle_drug_data') %>% dplyr::select(Compound) %>%
      distinct %>% collect %>% sample_n(10) %>% as.data.frame
    random_drugs <- random_drugs$Compound

    #retrieve the data
    test_affy <- getAffyData(full_con, random_genes, random_cell_lines)
    test_cn <- getCopyNumberData(full_con, random_genes, random_cell_lines)
    test_hybcap <- getHybcapData(full_con, random_genes, random_cell_lines)
    test_cosmicclp <- getCosmicCLPData(full_con, random_genes, random_cell_lines)

    #make a big data frame
    big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, random_drugs)

    #without resp data
    big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, drugs=NULL, data_types=c('affy', 'cn', 'hybcap', 'cosmicclp'))

    #with custom resp data
    big_df <- makeWideDataFrame(full_con, random_genes, cell_lines = c('DMS114_LUNG', 'A549_LUNG'), drugs = 'KU60648_pGI50', drug_df = dietlein_data)

This should take no more than 4-5 seconds for each constituent retrieval, and ~10 seconds to make the data frame depending on your hardware (SSD's will be quicker than HDD's).

Future directions

Future plans are to integrate the thinking of using SQLite for fast on disk subsetting and retrieval with the biocMultiAssay package. The will allow generic extension of the concept to other datasets without having to define import and retrieval functions and database schemas one dataset at a time.