knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this notebook we will look for association between taxonomy and CAI values. The first thing we'll need to do is download the NCBI taxonomy database.

mkdir -p taxondb
cd taxondb
wget -cq ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxf taxdump.tar.gz

ls

OK! Now we can reformat the data into a simpler format that we can parse. We'll be using a great program called taxonkit from Shen Wei https://bioinf.shenwei.me/taxonkit.

export TAXONKIT_DB=$(pwd)/taxondb

echo -e "taxid\tlineage\tkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies\tsubspecies" > taxondb/lineage.tsv


# Make sure you don't add and whitespace after the backslashes!
# Sorry about the env workaround, knitr doesn't load the usual paths apparently
   taxonkit="${HOME}/.miniconda3/bin/taxonkit" \
&& $taxonkit list --ids 1 \
|  $taxonkit lineage \
|  tail -n+2 \
|  $taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{S}" \
>> taxondb/lineage.tsv

head taxondb/lineage.tsv

Cool, so now we can start looking at the lineages and CAI distributions. Let's load the packages we'll use.

library("readr")
library("dplyr")
library("ggplot2")

And we'll setup some paths that we'll reuse.

clustering_path <- "clustering_results"
metadata_path <- "/run/media/rdrive/CCDM_Prog_10_Share-HANE0J-SE00128/ProgrammingClub/process_fungal_genomes/genomes.tsv"
out_dir <- "taxonomy_results"
dir.create(out_dir, showWarnings = FALSE)

Great. Now we can load our data in and start plotting! We'll need to join the CAI data with the taxonomy data. We'll use two joins to do this. One to join our filenames to NCBI taxids, and another to get the full taxonomic lineage for each taxid.

summarised <- readr::read_tsv(file.path(clustering_path, "summarised_cais_classif.tsv")) %>%
  left_join(
    readr::read_tsv(metadata_path) %>%
      select(name, taxid, organism_name),
    by=c("file"="name")
  ) %>%
  left_join(
    readr::read_tsv("taxondb/lineage.tsv"),
    by="taxid"
  ) %>%
  readr::write_tsv(file.path(out_dir, "summarised.tsv"))

head(summarised)

Ok. So you'll see that we now have columns for different taxonomic ranks, alongside our CAI summary statistics.

Lets plot the data.

ggplot(summarised, aes(y=factor(phylum), x=median, color=factor(classif))) +
  geom_jitter(height = 0.4, alpha=0.4) +
  ggsave(file.path(out_dir, "classif_vs_phylum.pdf"), width = 7, height = 6)

So basidiomycota and ascomycota are clearly the abundant phyla in our dataset. Group 2 mostly consists of ascomycetes. There are no other trends visible in the data.

ggplot(summarised, aes(y=factor(class), x=median, color=factor(classif))) +
  geom_jitter(height = 0.3, alpha=0.4) +
  ggsave(file.path(out_dir, "classif_vs_class.pdf"), width = 10, height = 15)
ggplot(summarised, aes(y=factor(order), x=median, color=factor(classif))) +
  geom_jitter(height = 0.3, alpha=0.4) +
  ggsave(file.path(out_dir, "classif_vs_order.pdf"), width = 10, height = 20)
ggplot(summarised, aes(y=factor(family), x=median, color=factor(classif))) +
  geom_jitter(height = 0.3, alpha=0.5) +
  ggsave(file.path(out_dir, "classif_vs_family.pdf"), width = 10, height = 40)
ggplot(summarised, aes(y=factor(genus), x=median, color=factor(classif))) +
  geom_jitter(height = 0.3, alpha=0.5) +
  ggsave(file.path(out_dir, "classif_vs_genus.pdf"), width = 10, height = 70, limitsize = FALSE)



CCDM-Programming-Club/codonfriend documentation built on May 9, 2019, 3:20 a.m.