library(CollateralVulnerability2016)
library(dplyr)
library(illuminaHumanv4.db)
library(readr)

Introduction

Step 1: Set up SQLite database and directories

work_dir <- '~/BigData/CollateralVulnerability2016/paca-au/'
download_dir <- '~/BigData/RTCGA_downloads/'
mydb <- 'pacaau_output.db'
mydb_path <- paste0(work_dir, mydb)
my_con <- setupSQLite ( mydb_path ) 

Step 2: Get human geneset to work with

In this case we want all human genes using the most recent ENSEMBL annotation:

all_genes <- getAllHumanGenes(my_con, EnsDb = EnsDb.Hsapiens.v79::EnsDb.Hsapiens.v79)
head(all_genes)

Gene data also written to SQLite database and can be viewed with dplyr:

src_sqlite(my_con@dbname) %>% tbl('human_genes')

Step 3: Import ICGC RNAseq and mutation data

Want to import the ICGC RNAseq (and mutation data) so that it is ready to use for further analyses. Although there isn't a function to do this as long as we can create a table in the SQLite database with the correct format we can run the doBISEPAnalysis function on that. See ?doBISEPAnalysis for more details.

First import the Microarray and RNAseq datasets and explore them. Find that whilst there are 269 samples with array data, there are only 91 with RNAseq data. Therefore we use the array data.

paca_au_exp_array <- read_tsv('~/BigData/ICGC_Portal/release_23/exp_array.PACA-AU.tsv')

#explore the data - how many genes and how many samples, any duplicates etc?
paca_au_exp_array %>% group_by(analysis_id) %>% summarise(N=n())
length(unique(paca_au_exp_array$gene_id))
length(unique(paca_au_exp_array$icgc_specimen_id))
269 * 47265 == nrow(paca_au_exp_array)  

# don't use RNAseq as only 92 samples
paca_au_exp_seq <- read_tsv('~/BigData/ICGC_Portal/release_23/exp_seq.PACA-AU.tsv')
length(unique(paca_au_exp_seq$icgc_donor_id))

Having explored the datasets we can import them in a leaner fashion to make them easier to work with - get rid of unneeded columns.

#import just what we need
paca_au_exp_array <- read_tsv('~/BigData/ICGC_Portal/release_23/exp_array.PACA-AU.tsv',  skip = 1,
                               col_names = c('patient_id', 'probe_id', 'value'),
                               col_types = 'c______cd________')
dim(paca_au_exp_array)

However, we need ENSEMBL gene id's whereas we are provided with Illumina probe id's. We need to convert between the two using the AnnotationDbi package but there are multiple probes per gene. Ideally we would map probes to gene ids then find the median value per gene, but from a practical perspective it's easiest to just take the first probe id per gene and use that.

#now get the probe info - just take the first probe for each
probe_info <- AnnotationDbi::select(illuminaHumanv4.db, keys = unique(paca_au_exp_array$probe_id), columns = 'ENSEMBL', keytype = 'PROBEID') %>%
    dplyr::rename(probe_id=PROBEID, gene_id=ENSEMBL) %>%
    dplyr::filter(!is.na(gene_id)) %>%
    dplyr::group_by(gene_id) %>%
    dplyr::summarise(probe_id=first(probe_id)) %>%
    dplyr::ungroup() %>%
    dplyr::tbl_df()

#now merge the array and annotation data
paca_au_exp_array_bygene <- paca_au_exp_array %>%
    dplyr::inner_join(probe_info, by='probe_id')

#write to db
DBI::dbWriteTable(my_con,  "paca_au_exp_array_bygene", paca_au_exp_array_bygene, overwrite=TRUE)
DBI::dbListTables(my_con)

Step 4: Run the BISEP low expressors analysis on expression data

This is a wrapper around the BISEP function from the BISEP package. For testing can set gene_n to a lower number, setting a high number ensures that all genes are included in the analysis. Some filtering is done to exclude non-variant or non-expressed genes: either by variance or by number of genes with an NA. Can optionally log2 transform the data or even use a non-standard input table. The analysis is parallelised for speed but this has only been tested on Mac OSX and Linux. See ?doBISEPAnalysis for more details.

bisep_output <- doBISEPAnalysis(my_con, genes_n=100000, table_name = "paca_au_exp_array_bygene", 
                                log2_transform=FALSE, gene_var_th=0.2, gene_naprop_th=0.3, seed_val = 80)

Step 5: View the results of the low expressors analysis

At this step we can optionally do an initial filtering of the bisep results use the plotBISEPOutput function to visualise this. We then use these genes for the next steps. Here we are visualising based on the pi_value (proportion of samples in the low expressors group) and p-value for the bimodal distribution test - see BISEP documentation for more info. However, we take all genes forward for the later steps so that we have data for all genes for all analyses.

plotBISEPOutput(my_con, pi_value_th=0.2, bisep_pval_th = 0.1)

bisep_results <- src_sqlite(my_con@dbname) %>%
    dplyr::tbl('bisep_results') %>%
   # dplyr::filter(bisep_pval < 0.1 & pi_value <0.2) %>% 
    dplyr::collect()  %>%
    dplyr::arrange(gene_name)
bisep_results

Step 6: Generate a plot for a given gene

In Pancreatic cancer KDM6A (ENS...) is a known tumour suppressor with a low expressing population.

doRNAseqPlot(my_con, 'ENSG00000176024', table_name = "paca_au_exp_array_bygene")
doRNAseqPlot(my_con, 'ENSG00000147050', table_name = "paca_au_exp_array_bygene")

Step 7: Do the human paralogue analysis

human_paralog_res <- countHumanParalogs(my_con, bisep_results$gene_id)

Step 8: Do the FlyMine analysis

``` {r eval = FALSE} flymine_res <- doFlyMineAnalysis(my_con, bisep_results$gene_id)

## Step 9: Do the WormMine analysis
``` {r eval = FALSE}
wormmine_res <- doWormMineAnalysis(my_con, bisep_results$gene_id)

Step 10: Do the mutation analysis

Since we don't have any mutation data we set up a dummy table ``` {r eval = FALSE}

mut_res <- doMutationAnalysis(my_con, bisep_results$gene_id)

dummy_mut_df <- src_sqlite(my_con@dbname) %>% dplyr::tbl('human_genes') %>% dplyr::transmute(gene_id, N_lof=0, N_protein_coding=0, pct_lof=0.0, pct_protein_coding=0.0, mutations="") %>% dplyr::collect()

DBI::dbWriteTable(my_con, 'mutation_analysis_results', dummy_mut_df, overwrite=TRUE)

## Step 11: Combine the results
``` {r eval=FALSE}
combo_res <- combineResults(my_con, bisep_results$gene_id)

Step 12: Filter the results

``` {r eval=FALSE} filtered_results <- src_sqlite(my_con@dbname) %>% dplyr::tbl('combined_results') %>% dplyr::filter(bisep_pval <= 0.1, pi_value <= 0.2, count_paralogs > 0, count_paralogs <= 5, (lethal_pct_fly >= 20 | lethal_pct_worm >= 20) ) %>% dplyr::collect()

## Step 13: Show results in shiny app
``` {r eval=FALSE}
    shinyVisApp(my_con, table_name = "paca_au_exp_array_bygene")


chapmandu2/CollateralVulnerability2016 documentation built on May 13, 2019, 3:27 p.m.