knitr::opts_knit$set(global.par = TRUE)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

GOLiath

GOliath is a web-based tool for performing functional enrichment analysis on lists of genes. It uses existing sets of genes that have been annotated and grouped into functional categories.. what? am I just trying to explain GSEA?

The set of query genes are tested for overrepresentation in each of the functional categories. The statistical test used is a Fisher’s Exact test and this provides a p-value as a measure of significance. A multiple testing correction is carried out using the Benjamini-Hochberg method. An enrichment score is also calculated. This functional enrichment testing is a very useful method to find biological meaning from the data. It comes at the end of a pipeline/after many steps of production and processing of the data. At various points in the preparation of sequencing libraries and throughout the analysis of the data, biases can be introduced, which can ultimately affect which genes are identified as being overrepresented. For example, during library preparation, PCR is usually used to amplify the amount of starting material. During the rounds of PCR, sequences with high proportions of G and C nucleotides can be favoured as the GC bonds break down at a higher temperature than the AT bonds. This can lead to a GC bias in the library. Another eg – long genes/closest genes We wanted to produce lists of genes biased in different ways e.g. by GC content of the genes, length of the gen and run these through functional enrichment analysis to see whether any function categories were identified and may therefore be associated with a bias in the data. The output, a set of functional categories associated with a particular bias, is used in GOliath. Each significant result from the functional enrichment analysis of the query genes is checked against the list of potentially biased categories and flagged up in the results file.

Pre-processing of data on cluster

Generating a gene info file for the mouse genome

The script create_gene_info_file_from_gtf.pl takes a gtf file and parses it to create a gene info file that contains all the genes that are annotated in the gtf file. This should be all the genes in that version of the genome.

Downloaded the raw version of the file from github wget https://raw.githubusercontent.com/s-andrews/GOliath/master/processing/ gene_info_processing/create_gene_info_file_from_gtf.pl

Run this script to create the gene info file perl create_gene_info_file_from_gtf.pl --gtf Mus_musculus.GRCm38.94.gtf.gz --genome GRCm38

We'll import the gene info file so that we can plot the GC distribution for all genes in the Mus_musculus.GRCm38.94.gtf.gz genome.

There are import_GTF and parse_GTF_info functions within the GOcategoryStats package but to get genome information i.e. GC content, the parsing and lookups need to be done with access to genome information, so on the cluster. The import_GTF and parse_GTF_info functions just work with the gtf file itself which does not contain sequence content information.

Import the processed gene info file

genfo <- read.delim("M:/biased_gene_lists/Mus_musculus.GRCm38.94_gene_info.txt")
head(genfo)
colnames(genfo)

Using this gene info file and some extra processing, lists of genes were generated for the following categories:

  1. Length biased gene sets
  2. High transcript biased gene sets
  3. GC biased gene sets
  4. Chromosomal biased gene sets
  5. Closest genes to random positions
  6. Public data gene sets

The processing for categories 1-4 was carried out within an R session and is detailed in the Rmarkdown documents of the same names.

To generate the gene sets for Category 5 - Closest genes to random positions, a python script was written. This generated random locations in the genome and found the closest gene to each position.

Category 6 - the public data required a separate, more extensive workflow.

The processing of each of the 6 categories is detailed in the individual Rmarkdown documents.

See closest_gene.Rmd - the generation of the genfo file was the same. We can use the genfo file created from that processing to generate the biased gene lists.

Some other lists were generated on the cluster as the processing was fairly complicated. In particular, generating the sets of "closest genes" to random positions. I also have a script that quickly filters on various parameters, so have used that for generating some gene lists.

To produce the lists of genes per chromosome, I'll do that within R as it's fairly simple and means not having to use the cluster and move more files around.



laurabiggins/GOcategoryStats documentation built on Oct. 27, 2019, 11:36 a.m.