README.md

genomes2

genomes2 is a copy of the genomes version 2 package from Bioconductor that read metadata, gff, sequences and other files in the old FTP genomes directories (on Dec 2, 2015, these were moved to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq). The package also collected genome sequencing project data from NCBI using E-utility scripts (also removed from genomes version 3 due to the same functionality in the reutils or rentrez packages).

A short description of the five basic E-utility scripts (einfo, esearch, esummary, efetch and elink) is provided below.

The einfo function lists the 51 Entrez databases (accessed April 12, 2013).

einfo()
            DbName
1         assembly
2       bioproject
3        biosample
4       biosystems
5      blastdbinfo
6            books
7              cdd
...

You can also query a specific database to return the indexing fields available for searching

einfo("taxonomy")[,1:4]
   Name          FullName                                 Description TermCount
1   ALL        All Fields        All terms from all searchable fields   9979456
2  ALLN         All Names                    All aliases for organism   1494291
3  COMN       Common Name                     Common name of organism     37400
4  EDAT       Entrez Date Date record first accessible through Entrez      6088
5  FILT            Filter                          Limits the records       353
6    GC                GC                        Nuclear genetic code        16
7  LNGE           Lineage              Lineage in taxonomic hierarchy   1494291
8  MDAT Modification Date                         Date of last update      4969
9   MGC               MGC                  Mitochondrial genetic code        33
10 NXLV        Next Level     Immediate parent in taxonomic hierarchy    182957
...

or set the links option to TRUE to list the available links (needed for elinks).

einfo("genome", links=TRUE)[, 1:2]
                        Name                      Menu
1          genome_bioproject          BioProject Links
2                genome_gene                Gene Links
3             genome_nuccore                Components
4 genome_nuccore_samespecies Other genomes for species
5             genome_protein             Protein Links
6     genome_proteinclusters     Protein Cluster Links
7              genome_pubmed              PubMed Links
8            genome_taxonomy            Taxonomy Links

The esearch command runs the Entrez database searches and returns the History Server details (query_key and WebEnv for subsequent calls). For example, this command will find pubmed articles with bioconductor listed in the title (pubmed is the default database).

esearch("bioconductor[TITLE]")
[1] "92 results found"
      db results query_key                                                  WebEnv
1 pubmed      92         1 NCID_1_224456233_130.14.18.53_9001_1330229913_639962231

Set the usehistory option to "n" to return a vector of ids.

esearch("mouse", db="taxonomy", usehistory="n")
[1] "10090"

Also, any additional key-value pairs may be included in the options, for example, this query finds pubmed articles published in the last year (see the NCBI help pages for full details on the optional E-utility parameters).

x <- esearch("bioconductor[TITLE]", reldate=360)

The objects returned by esearch may be passed directly to the remaining three function (esummary, efetch or elink). I recommned using the history object since this also reads the Entrez database name and you are not limited to 200 IDs (which is the limit requested by NCBI in the URL strings). The esummary functions returns Entrez database summaries and includes a simple XML parser to return a data.frame (or set parse=FALSE to return the raw XML or optionally use the ncbiPubmed function to parse the raw XML and return a short citation with author, year, title, journal and published date). You may also select the old XML format (default) or the newer Entrez version 2.0 (for display here I have selected only a few of the 42 columns available).

esummary(x, version="2.0")[, c(1, 42, 6)]
      PubDate SortFirstAuthor                                                                                                                       Title
1  2013 Mar 2        Heider A                               virtualArray: a R/bioconductor package to merge raw data from different microarray platforms.
2  2013 Mar 1     Schröder MS                                                   RamiGO: an R/Bioconductor package providing an AmiGO visualize interface.
3 2012 Dec 24       Taminau J Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages.
...
13 2012 Apr 24      Castro MA   RedeR: R/Bioconductor package for representing modular structures, nested networks and multiple levels of hierarchical...

The efetch command may be used to return records in a variety of formats and you should check the online table for a long list of valid retrieval types and modes for each database. In addition, the genomes package includes a number of specific parsers to format efecth output, for example, to parse submission dates from GenBank files using ncbiSubmit or taxonomy lineage using ncbiTaxonomy. Many other parsers could be added such as reading feature tables into GRanges or FASTA files into Biostrings (some are included in the genomes2 package on Github). In this example, efetch is used to return abstracts from the previous esearch query.

efetch(x, rettype="abstract")
  [1] ""                                                                                    
  [2] "1. BMC Bioinformatics. 2013 Mar 2;14:75. doi: 10.1186/1471-2105-14-75."              
  [3] ""                                                                                    
  [4] "virtualArray: a R/bioconductor package to merge raw data from different"             
  [5] "microarray platforms."                                                               
  [6] ""                                                                                    
  [7] "Heider A, Alt R."                                                                    
  [8] ""                                                                                    
  [9] "Translational Centre for Regenerative Medicine Leipzig, University of Leipzig,"      
 [10] "Semmelweisstr, 14, Leipzig 04103, Germany. aheider@trm.uni-leipzig.de."              
 [11] ""                                                                                    
 [12] "BACKGROUND: Microarrays have become a routine tool to address diverse biological "   
 [13] "questions. Therefore, different types and generations of microarrays have been"      
 [14] "produced by several manufacturers over time. Likewise, the diversity of raw data "   
 [15] "deposited in public databases such as NCBI GEO or EBI ArrayExpress has grown"

The functions may be nested (esearch -> esummary) to summarize searches in a single step.

esummary(esearch( "Yersinia pestis CO92[ORGN] AND refseq[FILTER] AND nuccore genome[Filter]", db="nuccore"))[, c(2,3,5,10)]
[1] "4 results found"
    Caption                                                 Title       Gi  Length
1 NC_003143      Yersinia pestis CO92 chromosome, complete genome 16120353 4653728
2 NC_003131  Yersinia pestis CO92 plasmid pCD1, complete sequence 16082691   70305
3 NC_003134  Yersinia pestis CO92 plasmid pMT1, complete sequence 16082781   96210
4 NC_003132 Yersinia pestis CO92 plasmid pPCP1, complete sequence 16082679    9612

If you are fetching sequences, use caution when downloading large sequences. One option is to add a low seq_stop if you are not sure what will be returned in the search results.

efetch(esearch( "Yersinia pestis CO92[ORGN] AND refseq[FILTER] AND nuccore genome[Filter]", db="nuccore"), seq_stop=700, rettype="fasta")
[1] "4 results found"
 [1] ">gi|16120353:1-700 Yersinia pestis CO92 chromosome, complete genome"      "GATCTTTTTATTTAAACGATCTCTTTATTAGATCTCTTATTAGGATCATGATCCTCTGTGGATAAGTGAT"  
 [3] "TATTCACATGGCAGATCATATAATTAAGGAGGATCGTTTGTTGTGAGTGACCGGTGATCGTATTGCGTAT"   "AAGCTGGGATCTAAATGGCATGTTATGCACAGTCACTCGGCAGAATCAAGGTTGTTATGTGGATATCTAC"  
 [5] "TGGTTTTACCCTGCTTTTAAGCATAGTTATACACATTCGTTCGCGCGATCTTTGAGCTAATTAGAGTAAA"   "TTAATCCAATCTTTGACCCAAATCTCTGCTGGATCCTCTGGTATTTCATGTTGGATGACGTCAATTTCTA"  
 [7] "ATATTTCACCCAACCGTTGAGCACCTTGTGCGATCAATTGTTGATCCAGTTTTATGATTGCACCGCAGAA"   "AGTGTCATATTCTGAGCTGCCTAAACCAACCGCCCCAAAGCGTACTTGGGATAAATCAGGCTTTTGTTGT"  
 [9] "TCGATCTGTTCTAATAATGGCTGCAAGTTATCAGGTAGATCCCCGGCACCATGAGTGGATGTCACGATTA"   "ACCACAGGCCATTCAGCGTAAGTTCGTCCAACTCTGGGCCATGAAGTATTTCTGTAGAAAACCCAGCTTC"  
[11] "TTCTAATTTATCCGCTAAATGTTCAGCAACATATTCAGCACTACCAAGCGTACTGCCACTTATCAACGTT"   ""                                                                        
[13] ">gi|16082691:1-700 Yersinia pestis CO92 plasmid pCD1, complete sequence"  "GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTC"  
[15] "CTGATTCAGGAGAGTTTATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAAT"   "GAGTAGCCGGGCGATTGCCAGAGAACTGGGGATCTCCCGCAATACCGTTAAACGTTATTTGCAGGCAAAA"

Similar to esearch, the elink command also returns either the History Server or a vector of ids if the cmd option is switched to neighbor (default is neighbor_history).

elink("15718680,157427902", dbfrom="protein", db="gene")
    db         link query_key                                                  WebEnv
1 gene protein_gene         1 NCID_1_15399116_130.14.18.52_9001_1330231183_1240047351

elink("15718680,157427902", dbfrom="protein", db="gene", cmd="neighbor")
[1] 522311   3702

These commands can also be nested to return complicated queries in a single line (esearch-> elink -> esummary). In this case, viral genomes linked to a Reference sequence in Entrez genome are displayed. Many links are also available as filters, so this search in nuccore would also work Nipah virus[ORGN] AND nuccore genome samespecies[Filter]

esummary( elink( esearch("Nipah virus", "genome"), db="nuccore", linkname="genome_nuccore_samespecies"))[, c(2,3,6,10)]
[1] "1 result found"
    Caption                                                           Title CreateDate Length
1  JN808863          Nipah virus isolate NIVBGD2008RAJBARI, complete genome 2012/02/04  18252
2  JN808857        Nipah virus isolate NIVBGD2008MANIKGONJ, complete genome 2012/02/04  18252
3  FJ513078 Nipah virus isolate Ind-Nipah-07-FG from India, complete genome 2009/12/31  18252
4  AY988601                    Nipah virus from Bangladesh, complete genome 2005/06/01  18252
5  AJ627196          Nipah virus complete genome, isolate NV/MY/99/VRI-0626 2005/01/07  18246
6  AJ564623           Nipah virus complete genome, isolate NV/MY/99/UM-0128 2004/01/05  18246
7  AJ564622          Nipah virus complete genome, isolate NV/MY/99/VRI-1413 2004/01/05  18246
8  AJ564621          Nipah virus complete genome, isolate NV/MY/99/VRI-2794 2004/01/05  18246
9  AY029768                      Nipah virus isolate UMMC2, complete genome 2001/09/07  18246
10 AY029767                      Nipah virus isolate UMMC1, complete genome 2001/09/07  18246

Check the help pages for additional details and please post any comments or issues.



cstubben/genomes2 documentation built on May 14, 2019, 12:25 p.m.