The main practical difference between this vignette and the Arabidopsis vignette is that for S. cerevisiae the NCBI common tree is not high enough resolution. The Saccharomyces genera includes many species disant species, so we cannot just find the NCBI taxonomy backbone nodes and then download the needed sequences from uniprot (indeed, UniProt doesn't even have all the species we want).
The most general way to customize a sequence dataset is to simply write the FASTA filenames into a list and convert it to a tree:
library(phylostratr) library(magrittr) focal_taxid <- 4932 saccharomyces <- Strata( tree = ape::read.tree(system.file('extdata', 'yeast', 'tree', package='phylostratr')), data = list(faa=list( Saccharomyces_cerevisiae = 'yeast/cerevisiae.faa', Saccharomyces_paradoxus = 'yeast/paradoxus.faa', Saccharomyces_mikatae = 'yeast/mikatae.faa', Saccharomyces_kudriavzevii = 'yeast/kudriavzevii.faa', Saccharomyces_arboricola = 'yeast/arboricola.faa', Saccharomyces_eubayanus = 'yeast/eubayanus.faa', Saccharomyces_uvarum = 'yeast/uvarum.faa' )), focal_species = 'Saccharomyces_cerevisiae' ) %>% strata_convert(target='tip', to='id')
Where we assume the folder yeast
holds the respective protein sequences. It
is important to include the focal species, cerevisiae. This tree could also be
loaded from a YAML, JSON, or NEWICK file. For more information on loading
trees, see the documentation for the data.tree
R package.
To get deeper phylostrata, we will use the UniProt genomes and the NCBI taxonomy tree. As in the Arabidopsis vignette, we don't want to use all the UniProt sequences, so we will filter for diverse species.
# set weights on species selection weights <- c( '1355160' = 0, # these three are deprecated '245562' = 0, '1094981' = 0, '284813' = 1.1, # these are the reference replacements '766039' = 1.1, '237561' = 1.1 ) focal_taxid <- '4932' strata <- focal_taxid %>% # Get all UniProt proteomes uniprot_strata %>% # build a tree of all UniProt genomes # Select a diverse subset of 5 or fewer representatives from each stratum. # Only do this above the Saccharomyces genus, since we will later replace # Saccharomyces with out own tree. strata_apply(f=diverse_subtree, n=5, weights=weights) %>% # add prokaryote stratum use_recommended_prokaryotes %>% # download UniProt sequences (this may take 10+ minutes) uniprot_fill_strata
Now we will use the Saccharomyces tree we defined above to resolve phylogenetic relationships among the species in the Saccharomyces genus, this allows for finer phylostratigraphic analysis.
strata <- replace_branch(strata, y=saccharomyces, node='4930')
Next we run BLAST
strata <- strata %>% strata_blast(blast_args = list(nthreads=8)) %>% strata_besthits results <- merge_besthits(strata) plot_heatmaps(results, filename='all-yeast.pdf', tree=strata@tree, focal_id=focal_taxid)
We can perform a domain analysis
# Note, this will take awhile strata <- strata %>% # add map of UniProt IDs to PFAM domains strata_uniprot_pfam_map %>% # add proteome summary data add_proteome_stats %>% # add list of proteins that are encoded in the organelle genomes add_organelle_proteins prot <- proteome_stats_table(strata) strata2 <- strata_convert(strata, target='all', to='name') g1 <- plot_proteome_stats(strata2) g2 <- plot_proteome_lengths(strata2) # make interactive in browser plotly::ggplotly(g1) plotly::ggplotly(g2)
# TODO: fix this, currently this is an awful hack, stratify should take a # character string of levels and return factored mrca and mrca_names strata_names <- levels(get_phylostrata_map(strata)$mrca) strata_names <- factor(strata_names, levels=strata_names) phylostrata <- stratify(results, strata_names = strata_names)
This material will probably not stay in this vignette.
# just the order sc_order <- focal_taxid %>% uniprot_strata %>% subtree(name2taxid('Saccharomycetales')) # tree after filtering sc_filtered <- sc_order %>% strata_apply(f=diverse_subtree, n=5, weights=weights) %>% uniprot_fill_strata # replace Saccharomyces branch with the users tree sc_final <- replace_branch(sc_filtered, saccharomyces, '4930')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.