The goal of utilitybeltrefseq is to make it easy to find and download the best RefSeq assemblies available for any given species. All you need to know is the species' NCBI taxonomy ID.
You can install the development version of utilitybeltrefseq from GitHub with:
# install.packages("devtools")
devtools::install_github("selkamand/utilitybeltrefseq")
Say you want to find the best reference genome for Escherichia coli.
First, we load the library, then download a list of available assemblies from RefSeq.
library(utilitybeltrefseq)
update_refseq_data_cache()
For best results, run the above command every couple of months to stay up to date with what's currently in RefSeq.
Next, we need to find the NCBI taxonomy ID of our species of interest. We can look it up in the NCBI Taxonomy database (taxid: 562). We could also have used the taxize package if we wanted to stay within R.
Now that we have the species-level taxid, we can run choose_best_assembly:
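For readers who prefer to stay within R, the taxize lookup mentioned above might look roughly like the sketch below. It is not part of utilitybeltrefseq; it assumes the taxize package is installed and that NCBI is reachable over the network, and the call is guarded so nothing happens if taxize is missing.

```r
# Sketch only: look up an NCBI taxonomy ID by species name with taxize.
# Assumes taxize is installed and an internet connection is available.
if (requireNamespace("taxize", quietly = TRUE)) {
  uid <- taxize::get_uid("Escherichia coli")
  # Convert the returned uid object to a plain character taxid
  taxid <- as.character(uid)
  print(taxid)
}
```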
choose_best_assembly(taxid_of_interest = 562)
#> Multiple (2) best hits with score = 101400. We will just return the the most recently added assembly with this quality
#> Chosen Assembly:
#> assembly_accession GCF_000008865.2
#> bioproject PRJNA57781
#> biosample SAMN01911278
#> wgs_master
#> refseq_category reference genome
#> taxid 386585
#> species_taxid 562
#> organism_name Escherichia coli O157:H7 str. Sakai
#> infraspecific_name strain=Sakai substr. RIMD 0509952
#> isolate
#> version_status latest
#> assembly_level Complete Genome
#> release_type Major
#> genome_rep Full
#> seq_rel_date 2018/06/08
#> asm_name ASM886v2
#> submitter GIRC
#> gbrs_paired_asm GCA_000008865.2
#> paired_asm_comp identical
#> ftp_path https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/865/GCF_000008865.2_ASM886v2
#> excluded_from_refseq
#> relation_to_type_material
#> asm_not_live_date na
#> ref_score 101400
#> best_ref TRUE
#> [1] "GCF_000008865.2"
Note that we get a warning that there are multiple ‘best’ assemblies. If a species has multiple strains with complete assemblies, it is hard to know which to return. By default, we choose the most recently added assembly. To return all of the high-quality assemblies, you can run:
choose_best_assembly(
  taxid_of_interest = 562,
  break_ties_based_on_newest_sequence_added = FALSE,
  return_accession_only = TRUE # set this to false to get more info about each assembly
)
#> Multiple (2) best hits with score = 101400. Please choose one manually or if you haven't already - add an intraspecific filter and try again
#> [1] "GCF_000005845.2" "GCF_000008865.2"
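To make the tie-breaking behaviour concrete, here is a minimal base-R sketch of the general idea: keep the rows with the highest score, and optionally keep only the newest of those. It is an illustration, not the package's actual implementation. The helper pick_best, the toy accessions, scores, and dates are all invented, and using seq_rel_date as the "most recently added" criterion is an assumption.

```r
# Toy assembly table; accessions, scores and dates are invented for illustration.
assemblies <- data.frame(
  assembly_accession = c("GCF_TOY_A.1", "GCF_TOY_B.1", "GCF_TOY_C.1"),
  ref_score = c(101400, 101400, 1400),
  seq_rel_date = as.Date(c("2013-09-26", "2018-06-08", "2020-01-15")),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of utilitybeltrefseq.
pick_best <- function(df, break_ties = TRUE) {
  # Keep every row that shares the top score
  best <- df[df$ref_score == max(df$ref_score), , drop = FALSE]
  # Optionally resolve ties by keeping the most recent release date
  if (break_ties && nrow(best) > 1) {
    best <- best[which.max(as.numeric(best$seq_rel_date)), , drop = FALSE]
  }
  best$assembly_accession
}

pick_best(assemblies)                      # single accession after tie-breaking
pick_best(assemblies, break_ties = FALSE)  # all top-scoring accessions
```

Note that the low-scoring but newest toy assembly is never returned: recency only matters among assemblies that already share the top score.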
Keep an eye on the infraspecific_name column. You can often use it to choose which strain you’re after. See ?choose_best_assembly for details.
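As a rough illustration of picking a strain by name, the sketch below filters a small table on its infraspecific_name column using base R. The data frame is hand-built for the example and its layout is an assumption about what the detailed output might look like. The Sakai row matches the output shown earlier, while the K-12 strain label is supplied here for illustration and does not come from this README.

```r
# Hand-built candidate table (layout assumed; K-12 strain label is illustrative).
candidates <- data.frame(
  assembly_accession = c("GCF_000005845.2", "GCF_000008865.2"),
  infraspecific_name = c(
    "strain=K-12 substr. MG1655",
    "strain=Sakai substr. RIMD 0509952"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows whose infraspecific name mentions the strain of interest
is_sakai <- grepl("Sakai", candidates$infraspecific_name, fixed = TRUE)
chosen <- candidates$assembly_accession[is_sakai]
chosen
```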
Once we have the assembly accession we’re interested in, we can download it:
download_assembly(
  target_assembly_accession = "GCF_000005845.2" # the accession must be a quoted string
)
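If you ever need to locate the files by hand, the ftp_path field from the assembly record points to a directory on the NCBI server. The sketch below builds a genomic FASTA URL from that path, assuming NCBI's usual naming convention in which files are prefixed with the directory name. It only constructs the URL and is not necessarily how download_assembly works internally.

```r
# ftp_path as reported for GCF_000008865.2 in the output above
ftp_path <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/865/GCF_000008865.2_ASM886v2"

# Assumed convention: <ftp_path>/<directory name>_genomic.fna.gz
fasta_url <- paste0(ftp_path, "/", basename(ftp_path), "_genomic.fna.gz")
fasta_url
```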
If we just want to quickly download a reference genome for our species of interest, we can run:
download_best_assembly(taxid_of_interest = 562)