phylostratr divides a phylostratigraphy analysis into 6 steps:

  1. Acquire a species tree with species identifiers (e.g. NCBI taxon ids or species names) as leafs.

  2. Replace leafs with the data required for homology inference. phylostratr supports retrieval of proteomes from UniProt.

  3. Run the homology inference algorithm (e.g. BLAST), producing a tree with raw results as leaves.

  4. From the raw results, produce p-values for homology of each species against each target.

  5. Infer taxonomic specificity of each gene.

  6. Analyze the results and diagnose problems.

Currently phylostratr supports using either custom trees or the NCBI common tree. For homology inference, a simple protein BLAST is offered. However, phylostratr is designed to be extendable to additional tools.

In this vignette, we will show how to stratify the genes of Arabidopsis thaliana as follows:

  1. Trace the lineage of Arabidopsis thaliana as defined by the NCBI common tree. For each ancestral node, find species with proteomes available in UniProt. For over-represented strata, we select a diverse subset of species (for example, of the thousands of bacterial proteomes in uniprot, we select just one from each class).

  2. Retrieve the proteome for each species in the tree, replacing the leaves with proteome file names. This allows us to access all the protein data needed for a protein BLAST (next step) while also storing the phylogenetic relationship between the species.

  3. BLAST the focal proteome (Arabidopsis thaliana) against each species in the tree, replacing leaves with the filenames of BLAST results. We build a unique BLAST database for each species. Doing this makes the BLAST statistics against each individual species to be independent of the other species in the tree. Later we will calculate a cumulative p-value for existence of a homolog in a particular stratum.

  4. Next, for each BLAST result, find the best hit against each focal gene. The resulting tables (small enough to be easily stored in memory now) are recorded as the new leaves of the tree.

  5. Finally, the best hit tables are merged, and the genes are stratified. In this vignette, we use a simple classifier for homology: a subject species is inferred to have a homolog to a focal gene if BLAST reports an evalue of less than 1e-5. This is a fairly arbitrary criterion and does not attempt to adjust for multiple testing across a stratum.

Select relatives to search

library(phylostratr)
library(reshape2)
library(taxizedb)
library(dplyr)
library(readr)
library(magrittr)
library(onekp)
library(knitr)
weights=uniprot_weight_by_ref()
focal_taxid <- '3702'
strata <-
  # Get stratified relatives represented in UniProt
  uniprot_strata(focal_taxid, from=2) %>%
  # Select a diverse subset of 5 or fewer representatives from each stratum.
  strata_apply(f=diverse_subtree, n=5, weights=weights) %>%
  # Use a prebuilt set of prokaryotic species
  use_recommended_prokaryotes %>%
  # Add yeast and humans
  add_taxa(c('4932', '9606')) %>%
  # Download genomes, storing the filenames
  uniprot_fill_strata

## Replace the UniProt Arabidopsis thaliana proteome with the Araport11 proteome.
## The path below is specific to my pipeline, you can just use the UniProt genes,
## if you want. The problem with them is that they include all previous variants
## of the gene models, raising to 89135 genes. Since `A. thaliana` is the focal
## species, it is important to be very explicit about version of the proteome.
strata@data$faa[['3702']] <- '/db/araport11/Araport11_genes.201606.pep.fasta'

## For the vignette, use only a small subset (1/100) of the sequences:
strata@data$faa[['3702']] <- thin_fasta(strata@data$faa[['3702']], 100)
# Remove Klebsormidium nitens
strata <- prune(strata, '105231', type='name')

Next I add a proteome from a source outside UniProt. Coverage of the Spermatophyta clade is very sparse, so I retrieve a transcriptome from the Ginkgo biloba genome available from the 1KP project.

# Next add a supplementary proteome from 1KP
kp <- onekp::retrieve_onekp()
# get the ginkgo proteome from 1KP
ginkgo_taxid <- taxizedb::name2taxid("Ginkgo biloba")
ginkgo_proteome <- onekp::filter_by_species(kp, ginkgo_taxid) %>% onekp::download_peptides(dir="custom-seqs")
strata <- add_taxa(strata, ginkgo_taxid)
strata@data$faa[[as.character(ginkgo_taxid)]] <- as.vector(ginkgo_proteome)
# We can plot all selected species as follows
strata %>% strata_convert(target='all', to='name') %>% sort_strata %>% plot

Find similiar sequences

# BLAST against each target genome (this will take a few hours)
strata <- strata_blast(strata, blast_args=list(nthreads=8)) %>% strata_besthits
# Merge results into a single hittable
results <- merge_besthits(strata)

Gather additional data

Now we can merge additional data:

# Note, this will take awhile
strata <- strata %>%
    # add map of UniProt IDs to PFAM domains
    strata_uniprot_pfam_map %>%
    # add proteome summary data
    add_proteome_stats %>%
    # add list of proteins that are encoded in the organelle genomes 
    add_organelle_proteins
# prot <- proteome_stats_table(strata)
# strata2 <- strata_convert(strata, target='all', to='name')
# g1 <- plot_proteome_stats(strata2)
# g2 <- plot_proteome_lengths(strata2)
# # make interactive in browser
# plotly::ggplotly(g1)
# plotly::ggplotly(g2)
# We are particularly interested in plant organelles, since they are large and
# often forgotten in proteome datasets.
subtree(strata, name2taxid('Viridiplantae')) %>%
    # and we replace the taxon IDs with scientific names
    strata_convert(target='all', to='name') %>%
    # summarize the numbers of organelle genes present per species
    organelle_table %>%
    # sort by ascending phylostratum age
    dplyr::arrange(-.data$ps)

And get the number of genes in each phylostrata

ph <- stratify(results, classify_by_adjusted_pvalue(0.001))
ph$locus <- sub('\\.[0-9]+', '', ph$qseqid, perl=TRUE)
ph %>%
    dplyr::select(-qseqid) %>%
    dplyr::distinct() %>%
    dplyr::group_by(mrca_name, ps) %>%
    dplyr::summarize(n = length(ps))
table(ph$mrca_name)

Figures

if(file.exists('all.Rda')){
  load('all.Rda')
} else {
  default_classifier <- classify_by_adjusted_pvalue(threshold=0.001)
  results            <- readRDS('results.Rda')
  is_present         <- default_classifier(results)
  strata_table       <- stratify(results, classifier=default_classifier)
  revenants          <- find_revenants(results, classifier=default_classifier)
  strata_levels      <- levels(strata_table$mrca_name)
  strata             <- readRDS('strata.Rda')
  save.image('all.Rda')
}

ll <- list(
  bon2  = stratify(results, classifier=classify_by_adjusted_pvalue(threshold=1e-2)),
  bon5  = stratify(results, classifier=classify_by_adjusted_pvalue(threshold=1e-5))
)
sigresults <- Reduce(init=ll[[1]], x=ll[-1],
  function(a,b){
    merge(a, b, by='qseqid')
  }
)
sigresults <- sigresults[, c(4,7)]
names(sigresults) <- names(ll)
saveRDS(sigresults, 'sigresults.Rda')
require(phylostratr)
cache='figdata.Rda'
d <- if(!file.exists(cache)){
  d <- list()
  d$sigmats <- make_significance_matrices(sigresults)
  d$versus_TIPS <- make_TIPS_comparison_matrix(strata, results, default_classifier)
  d$jmpmat <- make_jump_matrix(results[is_present, c('qseqid', 'ps')], labels=strata_levels)
  saveRDS(d, cache)
  d
} else {
  readRDS(cache)
}

d$sigmats[[1]][[1]]@xlab = "Phylostratum (Adjusted p-value < 1e-5)"
d$sigmats[[1]][[1]]@ylab = "Phylostratum (Adjusted p-value < 0.01)"
fig1 <- plot(d$sigmats[[1]][[1]])
fig2 <- fig1 # TODO: replace with real figure
fig3 <- plot(t(d$versus_TIPS)) +
  ggplot2::ylab("Classification with phylostratr dataset") +
  ggplot2::xlab("Classification with 2014 dataset")
fig4 <- plot(t(d$jmpmat)) +
  ggplot2::ylab("Phylostratum with homolog (1e-5 bonf)") +
  ggplot2::xlab("Next youngest with homolog (1e-5 bonf)")
gridExtra::grid.arrange(fig1, fig2, fig3, fig4, nrow=2)


arendsee/phylostratr documentation built on Dec. 31, 2022, 10:22 a.m.