phylostratr
divides a phylostratigraphy analysis into 6 steps:
Acquire a species tree with species identifiers (e.g. NCBI taxon ids or species names) as leafs.
Replace leafs with the data required for homology inference. phylostratr
supports retrieval of proteomes from UniProt.
Run the homology inference algorithm (e.g. BLAST), producing a tree with raw results as leaves.
From the raw results, produce p-values for homology of each species against each target.
Infer taxonomic specificity of each gene.
Analyze the results and diagnose problems.
Currently phylostratr
supports using either custom trees or the NCBI common
tree. For homology inference, a simple protein BLAST is offered. However,
phylostratr
is designed to be extendable to additional tools.
In this vignette, we will show how to stratify the genes of Arabidopsis thaliana as follows:
Trace the lineage of Arabidopsis thaliana as defined by the NCBI common tree. For each ancestral node, find species with proteomes available in UniProt. For over-represented strata, we select a diverse subset of species (for example, of the thousands of bacterial proteomes in uniprot, we select just one from each class).
Retrieve the proteome for each species in the tree, replacing the leaves with proteome file names. This allows us to access all the protein data needed for a protein BLAST (next step) while also storing the phylogenetic relationship between the species.
BLAST the focal proteome (Arabidopsis thaliana) against each species in the tree, replacing leaves with the filenames of BLAST results. We build a unique BLAST database for each species. Doing this makes the BLAST statistics against each individual species to be independent of the other species in the tree. Later we will calculate a cumulative p-value for existence of a homolog in a particular stratum.
Next, for each BLAST result, find the best hit against each focal gene. The resulting tables (small enough to be easily stored in memory now) are recorded as the new leaves of the tree.
Finally, the best hit tables are merged, and the genes are stratified. In this vignette, we use a simple classifier for homology: a subject species is inferred to have a homolog to a focal gene if BLAST reports an evalue of less than 1e-5. This is a fairly arbitrary criterion and does not attempt to adjust for multiple testing across a stratum.
library(phylostratr) library(reshape2) library(taxizedb) library(dplyr) library(readr) library(magrittr) library(onekp) library(knitr)
weights=uniprot_weight_by_ref() focal_taxid <- '3702' strata <- # Get stratified relatives represented in UniProt uniprot_strata(focal_taxid, from=2) %>% # Select a diverse subset of 5 or fewer representatives from each stratum. strata_apply(f=diverse_subtree, n=5, weights=weights) %>% # Use a prebuilt set of prokaryotic species use_recommended_prokaryotes %>% # Add yeast and humans add_taxa(c('4932', '9606')) %>% # Download genomes, storing the filenames uniprot_fill_strata ## Replace the UniProt Arabidopsis thaliana proteome with the Araport11 proteome. ## The path below is specific to my pipeline, you can just use the UniProt genes, ## if you want. The problem with them is that they include all previous variants ## of the gene models, raising to 89135 genes. Since `A. thaliana` is the focal ## species, it is important to be very explicit about version of the proteome. strata@data$faa[['3702']] <- '/db/araport11/Araport11_genes.201606.pep.fasta' ## For the vignette, use only a small subset (1/100) of the sequences: strata@data$faa[['3702']] <- thin_fasta(strata@data$faa[['3702']], 100)
# Remove Klebsormidium nitens strata <- prune(strata, '105231', type='name')
Next I add a proteome from a source outside UniProt. Coverage of the Spermatophyta clade is very sparse, so I retrieve a transcriptome from the Ginkgo biloba genome available from the 1KP project.
# Next add a supplementary proteome from 1KP kp <- onekp::retrieve_onekp() # get the ginkgo proteome from 1KP ginkgo_taxid <- taxizedb::name2taxid("Ginkgo biloba") ginkgo_proteome <- onekp::filter_by_species(kp, ginkgo_taxid) %>% onekp::download_peptides(dir="custom-seqs") strata <- add_taxa(strata, ginkgo_taxid) strata@data$faa[[as.character(ginkgo_taxid)]] <- as.vector(ginkgo_proteome)
# We can plot all selected species as follows strata %>% strata_convert(target='all', to='name') %>% sort_strata %>% plot
# BLAST against each target genome (this will take a few hours) strata <- strata_blast(strata, blast_args=list(nthreads=8)) %>% strata_besthits # Merge results into a single hittable results <- merge_besthits(strata)
Now we can merge additional data:
# Note, this will take awhile strata <- strata %>% # add map of UniProt IDs to PFAM domains strata_uniprot_pfam_map %>% # add proteome summary data add_proteome_stats %>% # add list of proteins that are encoded in the organelle genomes add_organelle_proteins
# prot <- proteome_stats_table(strata) # strata2 <- strata_convert(strata, target='all', to='name') # g1 <- plot_proteome_stats(strata2) # g2 <- plot_proteome_lengths(strata2) # # make interactive in browser # plotly::ggplotly(g1) # plotly::ggplotly(g2)
# We are particularly interested in plant organelles, since they are large and # often forgotten in proteome datasets. subtree(strata, name2taxid('Viridiplantae')) %>% # and we replace the taxon IDs with scientific names strata_convert(target='all', to='name') %>% # summarize the numbers of organelle genes present per species organelle_table %>% # sort by ascending phylostratum age dplyr::arrange(-.data$ps)
And get the number of genes in each phylostrata
ph <- stratify(results, classify_by_adjusted_pvalue(0.001)) ph$locus <- sub('\\.[0-9]+', '', ph$qseqid, perl=TRUE) ph %>% dplyr::select(-qseqid) %>% dplyr::distinct() %>% dplyr::group_by(mrca_name, ps) %>% dplyr::summarize(n = length(ps))
table(ph$mrca_name)
if(file.exists('all.Rda')){ load('all.Rda') } else { default_classifier <- classify_by_adjusted_pvalue(threshold=0.001) results <- readRDS('results.Rda') is_present <- default_classifier(results) strata_table <- stratify(results, classifier=default_classifier) revenants <- find_revenants(results, classifier=default_classifier) strata_levels <- levels(strata_table$mrca_name) strata <- readRDS('strata.Rda') save.image('all.Rda') } ll <- list( bon2 = stratify(results, classifier=classify_by_adjusted_pvalue(threshold=1e-2)), bon5 = stratify(results, classifier=classify_by_adjusted_pvalue(threshold=1e-5)) ) sigresults <- Reduce(init=ll[[1]], x=ll[-1], function(a,b){ merge(a, b, by='qseqid') } ) sigresults <- sigresults[, c(4,7)] names(sigresults) <- names(ll) saveRDS(sigresults, 'sigresults.Rda')
require(phylostratr) cache='figdata.Rda' d <- if(!file.exists(cache)){ d <- list() d$sigmats <- make_significance_matrices(sigresults) d$versus_TIPS <- make_TIPS_comparison_matrix(strata, results, default_classifier) d$jmpmat <- make_jump_matrix(results[is_present, c('qseqid', 'ps')], labels=strata_levels) saveRDS(d, cache) d } else { readRDS(cache) } d$sigmats[[1]][[1]]@xlab = "Phylostratum (Adjusted p-value < 1e-5)" d$sigmats[[1]][[1]]@ylab = "Phylostratum (Adjusted p-value < 0.01)" fig1 <- plot(d$sigmats[[1]][[1]]) fig2 <- fig1 # TODO: replace with real figure fig3 <- plot(t(d$versus_TIPS)) + ggplot2::ylab("Classification with phylostratr dataset") + ggplot2::xlab("Classification with 2014 dataset") fig4 <- plot(t(d$jmpmat)) + ggplot2::ylab("Phylostratum with homolog (1e-5 bonf)") + ggplot2::xlab("Next youngest with homolog (1e-5 bonf)")
gridExtra::grid.arrange(fig1, fig2, fig3, fig4, nrow=2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.