TODO: see whether any genes in particular come up in the lists. https://www.datacamp.com/community/tutorials/tidyverse-exploratory-analysis

library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
display_code <- FALSE
knitr::opts_knit$set(global.par = TRUE, width=75)
knitr::opts_chunk$set(echo=display_code,
                 cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)

library(rmdformats)
library(devtools)
load_all()
#library("devtools")
#load_all()

library(RColorBrewer)
palette(brewer.pal(6, "Set2"))

library(tidyverse)
library(skimr)
library(DT)

par(mgp = c(2, 0.5, 0))
par(mar = par()$mar * 0.7)

genome <- "Mus_musculus.GRCm38.94"

Closest gene

Some analyses performed on sequencing data use genes as the basis for analysis. In RNA-seq for example, a table of count data is often produced where a value for a gene is calculated by counting reads that fall within or overlap the gene. Some analyses are not directly based around genes. A common method used in ChIP-seq analysis is peak calling, where a test sample is generally compared to a background sample to identify regions that have a high read count in the test sample. These peaks may be exonic or intronic and can simply be annotated with the overlapping gene. The peaks may also be intergenic in which case they are often annotated with the closest gene.

[insert picture/diagram of exonic, intronic and intergenic peaks]

Genes vary greatly in size and the spread of genes is not uniform across chromosomes. It may be expected that longer genes, genes in the middle of gene deserts and those at the ends of clusters of genes are identified more frequently when annotating data with the closest gene to a region of interest.

The aim of this study was to see whether any biases were found when annotating random positions within the genome with the closest gene. This was carried out using the following steps:

Generating genes by random position

The python script generate_genelist_from_random_positions.py produces a list of 200 genes. It generates a number of random positions along each chromosome. The number of positions generated is proportional to the length of the chromosome. The chromosome lengths are taken from a separate file (chr_list.txt) A gtf file containing all the genes in the genome is required. Each line from the gtf file is parsed and all the genes are stored by chromosome. Each random position is looped through to find the closest (or overlapping) gene and the output is written to a new file. This script has been run to create lists of genes for the mouse genome using the gtf file for genome version GRCm38.95

for i in {1..200}; do qsub -cwd -V -l h_vmem=10G python ./generate_genelist_from_random_positions.py --output gene_lists_GRCm38.95/closest_genes_${i}.txt; done

Extract the gene names and remove header for i in closest*; do cut -f2 $i | tail -n 200 > ${i%".txt"}_just_genes.txt; done

# Import and process the sets of genes. This shouldn't need doing again unless the lists are re-generated

#files <- list.files(path = "M:/biased_gene_lists/gene_lists/closest_gene/", pattern = ".txt", full.names = TRUE)
#closest_gene_lists <- lapply(files, function(x) scan(x, what = "character"))
#save(closest_gene_lists, file = "../data/closest_gene_lists.rda")
genfo_file_name <- paste0("../data/", genome, "_gene_info.txt")
genfo <- read.delim(genfo_file_name)

bg_genes <- as.vector(unique(genfo$gene_name))

Checking for overrepresentation at the gene level

There are 200 sets of 200 genes. We can plot out which genes appear the most often in the 200 sets. The top hit, Gm20388, appears 69 times. (Does each gene only appear once per list? I don't think I filter on that.)

load("../data/closest_gene_lists.rda")
all_genes <- unlist(closest_gene_lists)

tabled_genes <- table(all_genes)
tabled_genes <- tabled_genes[order(tabled_genes, decreasing = TRUE)]
gene_counts  <- data.frame(gene_name = names(tabled_genes), n = as.vector(tabled_genes))

gene_counts %>%
  head(n = 30) %>%
  ggplot(aes(x = fct_reorder(gene_name, n), y = n)) + 
  geom_col() + 
  coord_flip() +
  labs(x = "gene")

Evaluating locations in SeqMonk

The output of the generate_genelist_from_random_positions.py script contains the locations of the genes in the genome, along with the distance from the randomly generated position. 200 of these files were generated. The files can be concatenated and loaded into SeqMonk to explore the data. Concatenate the files: info_files=$(ls closest_genes* | ls !(*just_genes*)) head closest_genes_1.txt -n 1 > all_closest_gene_info.txt for i in $info_files; do tail -n 200 $i >> all_closest_gene_info.txt; done

closest genes image

The first gene on the chromosome is highlighted on the SeqMonk plot.

x <- split(genfo, f = genfo$chromosome)
y <- lapply(x, head, n = 1)
first_genes <- sapply(y, "[", "gene_name")
first_genes_vec <- as.vector(unlist(first_genes))

gene_counts$first_genes <- "no"

gene_counts$first_genes[gene_counts$gene_name %in% first_genes_vec] <- "yes"

# add gene lengths or genes longer than 1mb
gene_counts <- genfo %>%
  filter(gene_name %in% gene_counts$gene_name) %>%
  dplyr::select(gene_name, length) %>%
  right_join(gene_counts) %>%
  mutate(longer_than_1mb = length > 1000000)
#, out.width=c('50%', '50%'), fig.show='hold'}
library(plotly)

p1 <- gene_counts %>%
  head(n = 30) %>%
  ggplot(aes(x = fct_reorder(gene_name, n), y = n, fill = first_genes)) + 
  geom_col() + 
  coord_flip() +
  labs(x = "gene")

p2 <- gene_counts %>%
  head(n = 30) %>%
  ggplot(aes(x = fct_reorder(gene_name, n), y = n, fill = longer_than_1mb)) + 
  geom_col() + 
  coord_flip() +
  labs(x = "gene")
#ggplotly(p1)
#ggplotly(p2)
gene_counts <- gene_counts %>%
  mutate(gene_characteristics = ifelse(first_genes == "yes", "first_gene", NA)) 

# cringy code but it wasn't working the other ways I tried
gene_counts$gene_characteristics[gene_counts$longer_than_1mb] <- "long_gene"
gene_counts$gene_characteristics[gene_counts$first_genes == "yes" && gene_counts$longer_than_1mb] <- "first and long"

datatable(gene_counts)

gene_counts %>%
  head(n = 30) %>%
  ggplot(aes(x = fct_reorder(gene_name, n), y = n, fill = gene_characteristics)) + 
  geom_col() + 
  coord_flip() +
  labs(x = "gene")

Plot number of times a gene appeared vs its length

genfo$occurrences <- as.vector(tabled_genes[match(genfo$gene_name, names(tabled_genes))])
genfo$occurrences[is.na(genfo$occurrences)] <- 0

genfo$first_genes <- "no"

genfo$first_genes[genfo$gene_name %in% first_genes_vec] <- "yes"

#genfo %>%
#  select_if(is.numeric) %>%
#  skim()

#plot(genfo$occurrences, log2(genfo$length), cex = 0.7, pch = 16)

ggplot(genfo, aes(x = occurrences, y = log2(length), color = first_genes)) +
  geom_point()#size = 0.9)
library(MASS)
library(ggplot2)
library(viridis)
get_density <- function(x, y, ...) {
  dens <- MASS::kde2d(x, y, ...)
  ix <- findInterval(x, dens$x)
  iy <- findInterval(y, dens$y)
  ii <- cbind(ix, iy)
  return(dens$z[ii])
}

density_values <- get_density(genfo$occurrences, log2(genfo$length), n = 75)
ggplot(genfo) +
  geom_point(aes(x = occurrences, y = log2(length), color = density_values)) +
  scale_colour_viridis()
p <- genfo %>%
  subset(occurrences > 5) %>%
  ggplot() + geom_point(aes(x = occurrences, y = log2(length), color = first_genes))

#ggplotly(p)
p  

Gene ontology analysis

load("../data/closest_gene_lists.rda")
closest <- closest_gene_lists
go_results1 <- overrep_test(all_go_categories, closest[[1]], bg_genes)
go_results1

closest_gene_results <- lapply(closest, function(query){
  overrep_test(all_go_categories, query, bg_genes)#, pval_threshold = 0.2)
})

save(closest_gene_results, file = "../data/closest_gene.rda")
load("../data/closest_gene.rda")

nulls_removed <- closest_gene_results[lapply(closest_gene_results,length) != 0]
number_of_results <- vapply(nulls_removed, nrow, FUN.VALUE = numeric(1))

Each of the 200 sets of genes was run through a gene ontology analysis

The number of gene sets that returned significant results from the GO overrepresentation analysis

all_sig_categories <- unlist(sapply(nulls_removed, rownames))
tabled_categories  <- table(all_sig_categories)
ordered_categories <- tabled_categories[order(tabled_categories, decreasing = TRUE)]

In this set, we had 200 sets of gene lists, while in the others we had 100 per category. We'll therefore double the cut-off value.

min_appearances <- 30
filtered_categories <- ordered_categories[ordered_categories >= min_appearances]
filtered_categories_vec <- as.vector(filtered_categories)
names(filtered_categories_vec) <- names(filtered_categories)
par(mfrow = c(1,2))

beanplot(
  number_of_results, 
  what   = c(0,1,0,1), 
  col    = c("#1B9E77","#06086d"), 
  ll     = 0.03, 
  method = "jitter", 
  border = "#06086d",
  las    = 1,
  main   = "no of results per gene list"
)

boxplot(filtered_categories_vec,
        main = paste0("GO category appearances (>", min_appearances, ")"),
        ylab = "no of times GO category appeared",
        pch  = 16,
        cex  = 0.5,
        col  = "#1B9E77",
        las = 1,
        cex.axis = 0.8
        )

Each line on the beanplot shows the number of significant (corrected p-value < 0.05) functional categories returned from gene ontology analysis.

suspects <- names(ordered_categories[ordered_categories >= min_appearances])
filename <- "../data/closest_genes.txt"

output_file <- file(filename, "wb")

write.table(
  file      = output_file,
  x         = suspects, 
  row.names = FALSE,
  col.names = FALSE, 
  quote     = FALSE
)

close(output_file)


laurabiggins/GOcategoryStats documentation built on Oct. 27, 2019, 11:36 a.m.