library(CollateralVulnerability2016) library(dplyr) library(illuminaHumanv4.db) library(readr)
work_dir <- '~/BigData/CollateralVulnerability2016/paca-au/' download_dir <- '~/BigData/RTCGA_downloads/' mydb <- 'pacaau_output.db' mydb_path <- paste0(work_dir, mydb) my_con <- setupSQLite ( mydb_path )
In this case we want all human genes using the most recent ENSEMBL annotation:
all_genes <- getAllHumanGenes(my_con, EnsDb = EnsDb.Hsapiens.v79::EnsDb.Hsapiens.v79) head(all_genes)
Gene data also written to SQLite database and can be viewed with dplyr:
src_sqlite(my_con@dbname) %>% tbl('human_genes')
Want to import the ICGC RNAseq (and mutation data) so that it is ready to use for further analyses. Although there isn't a function to do this as long as we can create a table in the SQLite database with the correct format we can run the doBISEPAnalysis
function on that. See ?doBISEPAnalysis
for more details.
First import the Microarray and RNAseq datasets and explore them. Find that whilst there are 269 samples with array data, there are only 91 with RNAseq data. Therefore we use the array data.
paca_au_exp_array <- read_tsv('~/BigData/ICGC_Portal/release_23/exp_array.PACA-AU.tsv') #explore the data - how many genes and how many samples, any duplicates etc? paca_au_exp_array %>% group_by(analysis_id) %>% summarise(N=n()) length(unique(paca_au_exp_array$gene_id)) length(unique(paca_au_exp_array$icgc_specimen_id)) 269 * 47265 == nrow(paca_au_exp_array) # don't use RNAseq as only 92 samples paca_au_exp_seq <- read_tsv('~/BigData/ICGC_Portal/release_23/exp_seq.PACA-AU.tsv') length(unique(paca_au_exp_seq$icgc_donor_id))
Having explored the datasets we can import them in a leaner fashion to make them easier to work with - get rid of unneeded columns.
#import just what we need paca_au_exp_array <- read_tsv('~/BigData/ICGC_Portal/release_23/exp_array.PACA-AU.tsv', skip = 1, col_names = c('patient_id', 'probe_id', 'value'), col_types = 'c______cd________') dim(paca_au_exp_array)
However, we need ENSEMBL gene id's whereas we are provided with Illumina probe id's. We need to convert between the two using the AnnotationDbi package but there are multiple probes per gene. Ideally we would map probes to gene ids then find the median value per gene, but from a practical perspective it's easiest to just take the first probe id per gene and use that.
#now get the probe info - just take the first probe for each probe_info <- AnnotationDbi::select(illuminaHumanv4.db, keys = unique(paca_au_exp_array$probe_id), columns = 'ENSEMBL', keytype = 'PROBEID') %>% dplyr::rename(probe_id=PROBEID, gene_id=ENSEMBL) %>% dplyr::filter(!is.na(gene_id)) %>% dplyr::group_by(gene_id) %>% dplyr::summarise(probe_id=first(probe_id)) %>% dplyr::ungroup() %>% dplyr::tbl_df() #now merge the array and annotation data paca_au_exp_array_bygene <- paca_au_exp_array %>% dplyr::inner_join(probe_info, by='probe_id') #write to db DBI::dbWriteTable(my_con, "paca_au_exp_array_bygene", paca_au_exp_array_bygene, overwrite=TRUE) DBI::dbListTables(my_con)
This is a wrapper around the BISEP function from the BISEP package. For testing can set gene_n to a lower number, setting a high number ensures that all genes are included in the analysis. Some filtering is done to exclude non-variant or non-expressed genes: either by variance or by number of genes with an NA. Can optionally log2 transform the data or even use a non-standard input table. The analysis is parallelised for speed but this has only been tested on Mac OSX and Linux. See ?doBISEPAnalysis
for more details.
bisep_output <- doBISEPAnalysis(my_con, genes_n=100000, table_name = "paca_au_exp_array_bygene", log2_transform=FALSE, gene_var_th=0.2, gene_naprop_th=0.3, seed_val = 80)
At this step we can optionally do an initial filtering of the bisep results use the plotBISEPOutput
function to visualise this. We then use these genes for the next steps. Here we are visualising based on the pi_value (proportion of samples in the low expressors group) and p-value for the bimodal distribution test - see BISEP documentation for more info. However, we take all genes forward for the later steps so that we have data for all genes for all analyses.
plotBISEPOutput(my_con, pi_value_th=0.2, bisep_pval_th = 0.1) bisep_results <- src_sqlite(my_con@dbname) %>% dplyr::tbl('bisep_results') %>% # dplyr::filter(bisep_pval < 0.1 & pi_value <0.2) %>% dplyr::collect() %>% dplyr::arrange(gene_name) bisep_results
In Pancreatic cancer KDM6A (ENS...) is a known tumour suppressor with a low expressing population.
doRNAseqPlot(my_con, 'ENSG00000176024', table_name = "paca_au_exp_array_bygene") doRNAseqPlot(my_con, 'ENSG00000147050', table_name = "paca_au_exp_array_bygene")
human_paralog_res <- countHumanParalogs(my_con, bisep_results$gene_id)
``` {r eval = FALSE} flymine_res <- doFlyMineAnalysis(my_con, bisep_results$gene_id)
## Step 9: Do the WormMine analysis ``` {r eval = FALSE} wormmine_res <- doWormMineAnalysis(my_con, bisep_results$gene_id)
Since we don't have any mutation data we set up a dummy table ``` {r eval = FALSE}
dummy_mut_df <- src_sqlite(my_con@dbname) %>% dplyr::tbl('human_genes') %>% dplyr::transmute(gene_id, N_lof=0, N_protein_coding=0, pct_lof=0.0, pct_protein_coding=0.0, mutations="") %>% dplyr::collect()
DBI::dbWriteTable(my_con, 'mutation_analysis_results', dummy_mut_df, overwrite=TRUE)
## Step 11: Combine the results ``` {r eval=FALSE} combo_res <- combineResults(my_con, bisep_results$gene_id)
``` {r eval=FALSE} filtered_results <- src_sqlite(my_con@dbname) %>% dplyr::tbl('combined_results') %>% dplyr::filter(bisep_pval <= 0.1, pi_value <= 0.2, count_paralogs > 0, count_paralogs <= 5, (lethal_pct_fly >= 20 | lethal_pct_worm >= 20) ) %>% dplyr::collect()
## Step 13: Show results in shiny app ``` {r eval=FALSE} shinyVisApp(my_con, table_name = "paca_au_exp_array_bygene")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.