TODO: see whether any genes in particular come up in the lists. https://www.datacamp.com/community/tutorials/tidyverse-exploratory-analysis
library(knitr) library(rmdformats) ## Global options options(max.print="75") display_code <- FALSE knitr::opts_knit$set(global.par = TRUE, width=75) knitr::opts_chunk$set(echo=display_code, cache=TRUE, prompt=FALSE, tidy=TRUE, comment=NA, message=FALSE, warning=FALSE) library(rmdformats) library(devtools) load_all()
#library("devtools") #load_all() library(RColorBrewer) palette(brewer.pal(6, "Set2")) library(tidyverse) library(skimr) library(DT) par(mgp = c(2, 0.5, 0)) par(mar = par()$mar * 0.7) genome <- "Mus_musculus.GRCm38.94"
Some analyses performed on sequencing data use genes as the basis for analysis. In RNA-seq for example, a table of count data is often produced where a value for a gene is calculated by counting reads that fall within or overlap the gene. Some analyses are not directly based around genes. A common method used in ChIP-seq analysis is peak calling, where a test sample is generally compared to a background sample to identify regions that have a high read count in the test sample. These peaks may be exonic or intronic and can simply be annotated with the overlapping gene. The peaks may also be intergenic in which case they are often annotated with the closest gene.
[insert picture/diagram of exonic, intronic and intergenic peaks]
Genes vary greatly in size and the spread of genes is not uniform across chromosomes. It may be expected that longer genes, genes in the middle of gene deserts and those at the ends of clusters of genes are identified more frequently when annotating data with the closest gene to a region of interest.
The aim of this study was to see whether any biases were found when annotating random positions within the genome with the closest gene. This was carried out using the following steps:
The python script generate_genelist_from_random_positions.py
produces a list of 200 genes.
It generates a number of random positions along each chromosome.
The number of positions generated is proportional to the length of the chromosome.
The chromosome lengths are taken from a separate file (chr_list.txt
)
A gtf file containing all the genes in the genome is required. Each line from the
gtf file is parsed and all the genes are stored by chromosome.
Each random position is looped through to find the closest (or overlapping) gene
and the output is written to a new file.
This script has been run to create lists of genes for the mouse genome using the
gtf file for genome version GRCm38.95
for i in {1..200}; do qsub -cwd -V -l h_vmem=10G python ./generate_genelist_from_random_positions.py
--output gene_lists_GRCm38.95/closest_genes_${i}.txt; done
Extract the gene names and remove header
for i in closest*; do cut -f2 $i | tail -n 200 > ${i%".txt"}_just_genes.txt; done
# Import and process the sets of genes. This shouldn't need doing again unless the lists are re-generated #files <- list.files(path = "M:/biased_gene_lists/gene_lists/closest_gene/", pattern = ".txt", full.names = TRUE) #closest_gene_lists <- lapply(files, function(x) scan(x, what = "character")) #save(closest_gene_lists, file = "../data/closest_gene_lists.rda")
genfo_file_name <- paste0("../data/", genome, "_gene_info.txt") genfo <- read.delim(genfo_file_name) bg_genes <- as.vector(unique(genfo$gene_name))
There are 200 sets of 200 genes. We can plot out which genes appear the most often in the 200 sets. The top hit, Gm20388, appears 69 times. (Does each gene only appear once per list? I don't think I filter on that.)
load("../data/closest_gene_lists.rda") all_genes <- unlist(closest_gene_lists) tabled_genes <- table(all_genes) tabled_genes <- tabled_genes[order(tabled_genes, decreasing = TRUE)] gene_counts <- data.frame(gene_name = names(tabled_genes), n = as.vector(tabled_genes)) gene_counts %>% head(n = 30) %>% ggplot(aes(x = fct_reorder(gene_name, n), y = n)) + geom_col() + coord_flip() + labs(x = "gene")
The output of the generate_genelist_from_random_positions.py
script contains the
locations of the genes in the genome, along with the distance from the randomly generated
position. 200 of these files were generated. The files can be concatenated and loaded
into SeqMonk to explore the data.
Concatenate the files:
info_files=$(ls closest_genes* | ls !(*just_genes*))
head closest_genes_1.txt -n 1 > all_closest_gene_info.txt
for i in $info_files; do tail -n 200 $i >> all_closest_gene_info.txt; done
The first gene on the chromosome is highlighted on the SeqMonk plot.
x <- split(genfo, f = genfo$chromosome) y <- lapply(x, head, n = 1) first_genes <- sapply(y, "[", "gene_name") first_genes_vec <- as.vector(unlist(first_genes)) gene_counts$first_genes <- "no" gene_counts$first_genes[gene_counts$gene_name %in% first_genes_vec] <- "yes" # add gene lengths or genes longer than 1mb gene_counts <- genfo %>% filter(gene_name %in% gene_counts$gene_name) %>% dplyr::select(gene_name, length) %>% right_join(gene_counts) %>% mutate(longer_than_1mb = length > 1000000)
#, out.width=c('50%', '50%'), fig.show='hold'} library(plotly) p1 <- gene_counts %>% head(n = 30) %>% ggplot(aes(x = fct_reorder(gene_name, n), y = n, fill = first_genes)) + geom_col() + coord_flip() + labs(x = "gene") p2 <- gene_counts %>% head(n = 30) %>% ggplot(aes(x = fct_reorder(gene_name, n), y = n, fill = longer_than_1mb)) + geom_col() + coord_flip() + labs(x = "gene") #ggplotly(p1) #ggplotly(p2)
gene_counts <- gene_counts %>% mutate(gene_characteristics = ifelse(first_genes == "yes", "first_gene", NA)) # cringy code but it wasn't working the other ways I tried gene_counts$gene_characteristics[gene_counts$longer_than_1mb] <- "long_gene" gene_counts$gene_characteristics[gene_counts$first_genes == "yes" && gene_counts$longer_than_1mb] <- "first and long" datatable(gene_counts) gene_counts %>% head(n = 30) %>% ggplot(aes(x = fct_reorder(gene_name, n), y = n, fill = gene_characteristics)) + geom_col() + coord_flip() + labs(x = "gene")
genfo$occurrences <- as.vector(tabled_genes[match(genfo$gene_name, names(tabled_genes))]) genfo$occurrences[is.na(genfo$occurrences)] <- 0 genfo$first_genes <- "no" genfo$first_genes[genfo$gene_name %in% first_genes_vec] <- "yes" #genfo %>% # select_if(is.numeric) %>% # skim() #plot(genfo$occurrences, log2(genfo$length), cex = 0.7, pch = 16) ggplot(genfo, aes(x = occurrences, y = log2(length), color = first_genes)) + geom_point()#size = 0.9)
library(MASS) library(ggplot2) library(viridis) get_density <- function(x, y, ...) { dens <- MASS::kde2d(x, y, ...) ix <- findInterval(x, dens$x) iy <- findInterval(y, dens$y) ii <- cbind(ix, iy) return(dens$z[ii]) } density_values <- get_density(genfo$occurrences, log2(genfo$length), n = 75) ggplot(genfo) + geom_point(aes(x = occurrences, y = log2(length), color = density_values)) + scale_colour_viridis()
p <- genfo %>% subset(occurrences > 5) %>% ggplot() + geom_point(aes(x = occurrences, y = log2(length), color = first_genes)) #ggplotly(p) p
load("../data/closest_gene_lists.rda") closest <- closest_gene_lists go_results1 <- overrep_test(all_go_categories, closest[[1]], bg_genes) go_results1 closest_gene_results <- lapply(closest, function(query){ overrep_test(all_go_categories, query, bg_genes)#, pval_threshold = 0.2) }) save(closest_gene_results, file = "../data/closest_gene.rda")
load("../data/closest_gene.rda") nulls_removed <- closest_gene_results[lapply(closest_gene_results,length) != 0] number_of_results <- vapply(nulls_removed, nrow, FUN.VALUE = numeric(1))
Each of the 200 sets of genes was run through a gene ontology analysis
The number of gene sets that returned significant results from the GO overrepresentation analysis
all_sig_categories <- unlist(sapply(nulls_removed, rownames)) tabled_categories <- table(all_sig_categories) ordered_categories <- tabled_categories[order(tabled_categories, decreasing = TRUE)]
In this set, we had 200 sets of gene lists, while in the others we had 100 per category. We'll therefore double the cut-off value.
min_appearances <- 30
filtered_categories <- ordered_categories[ordered_categories >= min_appearances] filtered_categories_vec <- as.vector(filtered_categories) names(filtered_categories_vec) <- names(filtered_categories)
par(mfrow = c(1,2)) beanplot( number_of_results, what = c(0,1,0,1), col = c("#1B9E77","#06086d"), ll = 0.03, method = "jitter", border = "#06086d", las = 1, main = "no of results per gene list" ) boxplot(filtered_categories_vec, main = paste0("GO category appearances (>", min_appearances, ")"), ylab = "no of times GO category appeared", pch = 16, cex = 0.5, col = "#1B9E77", las = 1, cex.axis = 0.8 )
Each line on the beanplot shows the number of significant (corrected p-value < 0.05) functional categories returned from gene ontology analysis.
suspects <- names(ordered_categories[ordered_categories >= min_appearances])
filename <- "../data/closest_genes.txt" output_file <- file(filename, "wb") write.table( file = output_file, x = suspects, row.names = FALSE, col.names = FALSE, quote = FALSE ) close(output_file)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.