EUtils: EUtils NCBI

Description

Utils to process data downloaded from NCBI.

Usage

parseTaxaSet(xml)

parseTSeqSet(xml, save.seq = T)

batchDownload(uid.vec, db = NULL, rettype = NULL, retmode = NULL,
  out.file = "res.txt", sleep = 10, ...)

seqSet2Fasta(seqset.uid, by = 5, folder = ".", file.prefix = "Refseq",
  file.extension = "fasta", seq.label = c("accver", ", ", "orgname",
  ", chloroplast"), sleep = 30)

Arguments

xml

The XML result of efetch.

save.seq

If TRUE (the default), the sequences are saved in the data.frame returned by parseTSeqSet.

uid.vec

A vector of UIDs passed to efetch to retrieve NCBI data. Each UID looks like either NC_016668.1 ("accver") or KM462867 ("INSDC").

db, rettype, retmode, ...

Arguments passed on to efetch. See https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly for all supported databases and the rettype and retmode values available for each.

out.file

The file to which all results are written directly, without parsing. Note: efetch itself requires its outfile argument if length(uid) > 500.

sleep

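The pause, in seconds, between consecutive requests. Please be considerate of the NCBI servers and leave enough time between requests. Defaults to 10 seconds in batchDownload and 30 seconds in seqSet2Fasta.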

seqset.uid

A vector of UIDs passed to efetch to download sequences, where db = "nuccore" and rettype = "fasta". Each UID looks like either NC_016668.1 ("accver") or KM462867 ("INSDC").

by

The number of UIDs per subgroup when splitting seqset.uid. Defaults to 5, as shown in Usage.
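To illustrate the grouping that by describes, the sketch below splits a UID vector into consecutive subgroups using base R only. The helper name split_uids is hypothetical, and this is an assumption about the grouping rather than the code seqSet2Fasta actually runs.

```r
# Hypothetical helper: split a vector of UIDs into consecutive groups of
# at most `by` elements. An assumption about the grouping, not the
# package's own implementation.
split_uids <- function(uid.vec, by = 5) {
  stopifnot(by >= 1)
  # Group index 1, 1, ..., 2, 2, ... for each position in the vector
  grp <- ceiling(seq_along(uid.vec) / by)
  unname(split(uid.vec, grp))
}

groups <- split_uids(c("NC_031445.1", "NC_026892.1", "NC_016668.1"), by = 2)
print(groups)
```

With three UIDs and by = 2, the first group holds two UIDs and the second holds the remaining one.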

folder, file.prefix, file.extension

The output folder, file name prefix, and file extension, which together determine the names of the files written.

seq.label

A character vector that determines how each sequence is labelled. If an element is one of the column names of the data.frame returned by parseTSeqSet, that position in the label takes the value of that column; any other element is used literally. The available columns are "seqtype", "gi", "accver", "taxid", "orgname", "defline", "length".
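The substitution rule for seq.label can be sketched in a few lines of base R. The helper make_label is hypothetical and only mirrors the behaviour described here; the record values are taken from the TinySeq example in Details.

```r
# Hypothetical helper: build the label for one record. Elements of
# `seq.label` that match a column name are replaced by that column's
# value; all other elements are used literally.
make_label <- function(row, seq.label) {
  parts <- vapply(seq.label, function(el) {
    if (el %in% names(row)) as.character(row[[el]]) else el
  }, character(1), USE.NAMES = FALSE)
  paste(parts, collapse = "")
}

# A single record with two of the available columns
rec <- data.frame(accver = "NC_031445.1",
                  orgname = "Abeliophyllum distichum",
                  stringsAsFactors = FALSE)

label <- make_label(rec, c("accver", ", ", "orgname", ", chloroplast"))
print(label)
# [1] "NC_031445.1, Abeliophyllum distichum, chloroplast"
```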

Details

parseTaxaSet parses the taxonomy XML (DOCTYPE TaxaSet) returned by efetch from the taxonomy database into a data.frame, which includes "TaxId", "ScientificName", "Rank", "Lineage", "Division", and, in the taxa.table format, the ranks from "kingdom" to "genus".

<TaxaSet>
  <Taxon>
    <TaxId>123685</TaxId>
    <ScientificName>Oryzias minutillus</ScientificName>
    <ParentTaxId>8089</ParentTaxId>
    <Rank>species</Rank>
    <Division>Vertebrates</Division>
    <Lineage>cellular organisms; Eukaryota; Opisthokonta; Metazoa; ...</Lineage>
    <LineageEx>
      <Taxon>
        <TaxId>131567</TaxId>
        <ScientificName>cellular organisms</ScientificName>
        <Rank>no rank</Rank>
      </Taxon>
      ...
    </LineageEx>
  </Taxon>
  ...
</TaxaSet>
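One detail of this structure is that Taxon elements appear both as top-level records and nested inside LineageEx. The sketch below shows one way to select only the top-level records and read their fields. It assumes the xml2 package is installed, which may not be the parser ComMA uses, and the XML string is a cut-down copy of the example above without the elided parts.

```r
library(xml2)  # assumption: xml2 is available; ComMA may use another parser

# Cut-down TaxaSet document built from the example above
xml.txt <- '<TaxaSet>
  <Taxon>
    <TaxId>123685</TaxId>
    <ScientificName>Oryzias minutillus</ScientificName>
    <ParentTaxId>8089</ParentTaxId>
    <Rank>species</Rank>
    <Division>Vertebrates</Division>
    <LineageEx>
      <Taxon>
        <TaxId>131567</TaxId>
        <ScientificName>cellular organisms</ScientificName>
        <Rank>no rank</Rank>
      </Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>'

doc <- read_xml(xml.txt)

# Select only the top-level records; "//Taxon" would also match the
# Taxon elements nested inside LineageEx.
taxa <- xml_find_all(doc, "/TaxaSet/Taxon")

# Text of a direct child element of each record
field <- function(nodes, tag) {
  xml_text(xml_find_first(nodes, paste0("./", tag)))
}

taxa.df <- data.frame(
  TaxId          = field(taxa, "TaxId"),
  ScientificName = field(taxa, "ScientificName"),
  Rank           = field(taxa, "Rank"),
  Division       = field(taxa, "Division"),
  stringsAsFactors = FALSE
)
print(taxa.df)
```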

parseTSeqSet parses the TinySeq XML (DOCTYPE TSeqSet) returned by efetch from the nuccore database into a data.frame, which includes "TaxId", "ScientificName", "ACCESSION", "Lineage", "sequence".

<TSeqSet>
  <TSeq>
    <TSeq_seqtype value="nucleotide"/>
    <TSeq_gi>1079489517</TSeq_gi>
    <TSeq_accver>NC_031445.1</TSeq_accver>
    <TSeq_sid>gnl|NCBI_GENOMES|60824</TSeq_sid>
    <TSeq_taxid>126358</TSeq_taxid>
    <TSeq_orgname>Abeliophyllum distichum</TSeq_orgname>
    <TSeq_defline>Abeliophyllum distichum chloroplast, complete genome</TSeq_defline>
    <TSeq_length>155982</TSeq_length>
    <TSeq_sequence>CATTTTAGTTATGGGC...GCTGT</TSeq_sequence>
  </TSeq>
</TSeqSet>
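In the TinySeq format most fields are element text, but the sequence type is stored in the value attribute of TSeq_seqtype. The sketch below reads both kinds of field into a data.frame. It assumes the xml2 package is installed, and it is an illustration rather than the code behind parseTSeqSet. The TSeq_sequence element is left out because the example above elides most of the sequence.

```r
library(xml2)  # assumption: xml2 is available; ComMA may use another parser

# TinySeq record from the example above, without the elided sequence
xml.txt <- '<TSeqSet>
  <TSeq>
    <TSeq_seqtype value="nucleotide"/>
    <TSeq_gi>1079489517</TSeq_gi>
    <TSeq_accver>NC_031445.1</TSeq_accver>
    <TSeq_taxid>126358</TSeq_taxid>
    <TSeq_orgname>Abeliophyllum distichum</TSeq_orgname>
    <TSeq_defline>Abeliophyllum distichum chloroplast, complete genome</TSeq_defline>
    <TSeq_length>155982</TSeq_length>
  </TSeq>
</TSeqSet>'

doc  <- read_xml(xml.txt)
recs <- xml_find_all(doc, "/TSeqSet/TSeq")

# First child element named TSeq_<tag> of every record
child <- function(tag) xml_find_first(recs, paste0("./TSeq_", tag))

seq.df <- data.frame(
  seqtype = xml_attr(child("seqtype"), "value"),  # stored as an attribute
  accver  = xml_text(child("accver")),
  taxid   = xml_text(child("taxid")),
  orgname = xml_text(child("orgname")),
  length  = as.integer(xml_text(child("length"))),
  stringsAsFactors = FALSE
)
print(seq.df)
```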

batchDownload downloads NCBI data for a vector of UIDs, calling efetch once per UID, and writes all results directly to a file without parsing them.
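The one-request-per-UID pattern with a pause between requests can be sketched as follows. The names batch_write and fetch.one are hypothetical; fetch.one stands in for the real request (for example a wrapper around efetch), so the demonstration runs offline.

```r
# Hypothetical sketch of a one-at-a-time download loop. `fetch.one` is a
# stand-in for the real request; it takes one UID and returns a
# character vector of text.
batch_write <- function(uid.vec, fetch.one, out.file = "res.txt", sleep = 10) {
  # Opening in "w" mode starts from an empty file
  con <- file(out.file, open = "w")
  on.exit(close(con))
  for (i in seq_along(uid.vec)) {
    res <- fetch.one(uid.vec[i])
    # Write the raw result without parsing it
    writeLines(res, con)
    # Pause between requests, but not after the last one
    if (i < length(uid.vec)) Sys.sleep(sleep)
  }
  invisible(out.file)
}

# Offline demonstration with a dummy fetcher and no pause
tmp <- tempfile(fileext = ".txt")
batch_write(c("NC_031445.1", "NC_026892.1"),
            fetch.one = function(uid) paste("record for", uid),
            out.file = tmp, sleep = 0)
res.lines <- readLines(tmp)
print(res.lines)
```

Passing the fetcher as an argument keeps the loop testable without network access; in real use the pause should stay at a value that respects NCBI's request limits.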

seqSet2Fasta downloads reference sequences for the given UIDs using efetch, parses each result with parseTSeqSet, and writes the sequences to FASTA files.
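The final step, writing labelled sequences as FASTA, might look like the sketch below. The helper write_fasta and the column names label and sequence are assumptions made for illustration, and the sequence is only the short fragment visible in the TinySeq example, not a real full-length record.

```r
# Hypothetical helper: write one FASTA record per row of a data.frame
# with `label` and `sequence` columns (both column names are assumptions).
write_fasta <- function(df, file) {
  # Interleave header lines (">label") with sequence lines
  fasta.lines <- as.vector(rbind(paste0(">", df$label), df$sequence))
  writeLines(fasta.lines, file)
  invisible(file)
}

df <- data.frame(
  label    = "NC_031445.1, Abeliophyllum distichum, chloroplast",
  sequence = "CATTTTAGTTATGGGC",  # truncated fragment, for illustration only
  stringsAsFactors = FALSE
)

out <- tempfile(fileext = ".fasta")
write_fasta(df, out)
fasta <- readLines(out)
print(fasta)
```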

Examples

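library("reutils")

# Fetch three taxonomy records and parse them into a data.frame
taxa <- efetch(c("123685", "8089", "8088"), "taxonomy")
taxa.df <- parseTaxaSet(taxa$content)

# Fetch one chloroplast genome from nuccore and parse the result
seqset <- efetch("NC_031445.1", "nuccore", "fasta")
seqset.df <- parseTSeqSet(seqset$content)

# Download two GenBank records one at a time and write them to res.gbff
batchDownload(c("NC_031445.1", "NC_026892.1"), "nuccore", "gb", out.file="res.gbff")

# Download two reference sequences and save them as FASTA with custom labels
seqSet2Fasta(c("NC_031445.1", "NC_026892.1"), seq.label=c("accver", ", ", "orgname", ", chloroplast"))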

walterxie/ComMA documentation built on May 3, 2019, 11:51 p.m.