README.md

Travis-CI Build Status AppVeyor Build Status CRAN RStudio mirror downloads CRAN_Status_Badge

biofiles - an interface to GenBank/GenPept files in R

biofiles provides interfacing to GenBank/GenPept or Embl flat file records. It includes utilities for reading and writing GenBank files, and methods for interacting with annotation and sequence data.

Installation

Install the latest stable release of the biofiles package from CRAN:

install.packages("biofiles")

Install the development version from github using the devtools package.

# install.packages("devtools")
devtools::install_github("gschofl/biofiles")

Basic functionality

Let's download a small bacterial genome, Chlamydophila psittaci DC15 (GenBank acc. CP002806; GI 334693792) from NCBI.

# install.packages("reutils")
gb_file <- reutils::efetch("CP002806", "nuccore", rettype = "gbwithparts", retmode = "text")
gb_file
#> Object of class 'efetch' 
#> LOCUS       CP002806             1172182 bp    DNA     circular BCT 31-AUG-2017
#> DEFINITION  Chlamydia psittaci 02DC15, complete genome.
#> ACCESSION   CP002806
#> VERSION     CP002806.1
#> DBLINK      BioProject: PRJNA66481
#>             BioSample: SAMN02603518
#> KEYWORDS    .
#> SOURCE      Chlamydia psittaci 02DC15
#>   ORGANISM  Chlamydia psittaci 02DC15
#>             Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae;
#>             Chlamydia/Chlamydophila group; Chlamydia.
#> REFERENCE   1  (bases 1 to 1172182)
#> ...
#> EFetch query using the 'nuccore' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi...'
#> Retrieval type: 'gbwithparts', retrieval mode: 'text'

Next, we parse the efetch object into a gbRecord instance.

rec <- biofiles::gbRecord(gb_file)
rec
#> An object of class 'gbRecord', with 2040 features
#> LOCUS       CP002806             1172182 bp    DNA     circular BCT 31 ...
#> DEFINITION  Chlamydia psittaci 02DC15, complete genome.
#> ACCESSION   CP002806
#> VERSION     CP002806.1  GI:NA
#> DBLINK      Project: PRJNA66481
#> KEYWORDS    .
#> SOURCE      Chlamydia psittaci 02DC15
#>   ORGANISM  Chlamydia psittaci 02DC15
#>             Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae;
#>             Chlamydia/Chlamydophila group; Chlamydia.
#> REFERENCE   1  (bases 1 to 1172182)
#>   AUTHORS   Schofl,G., Voigt,A., Litsche,K., Sachse,K. and Saluz,H.P.
#>   TITLE     Complete Genome Sequences of Four Mammalian Isolates of
#>             Chlamydophila psittaci
#>   JOURNAL   J. Bacteriol. 193 (16), 4258 (2011)
#>   PUBMED    21705611
#> REFERENCE   2  (bases 1 to 1172182)
#>   AUTHORS   Schofl,G.
#>   TITLE     Direct Submission
#>   JOURNAL   Submitted (20-MAY-2011) Cell and Molecular Biology,
#>             Leibniz Institute for Natural Product Research and
#>             Infection Biology, Beutenbergstrasse 11a, Jena D-07745,
#>             Germany
#> COMMENT     Source DNA available from Dr. Konrad Sachse
#>             (Konrad.Sachse@fli.bund.de) at the Institute for Molecular
#>             Pathogenesis, Jena, Germany.
#> ORIGIN      CAAAGTTTTAAACATGTTAACCACGTTGTTTTCCCTCTATAAACCGAGTTTTAAACAGT
#>             ...
#>             ATCTTGTGTTCACAAACATGCAAAATCTCGATTTATTTACAAACTGAACAATGTTTTAA
#> CONTIG

The summary function provides an overview over the object:

biofiles::summary(rec)
#> [[CP002806]]
#>   1172182 bp: Chlamydia psittaci 02DC15, complete genome.
#>    Id Feature Location                     GeneId Product           ...
#>     1 source  1..1172182                   NA     NA                ...
#>     2 gene    149..1144                    hemB   NA                ...
#>     3 CDS     149..1144                    hemB   delta-aminolevuli ...
#>     4 gene    complement(1160..2578)       NA     NA                ...
#>     5 CDS     complement(1160..2578)       NA     Na(+)-translocati ...
#>     6 gene    complement(2601..3035)       NA     NA                ...
#>     7 CDS     complement(2601..3035)       NA     conserved hypothe ...
#>   ... ...     ...                          ...    ...               ...
#>  2034 CDS     1169125..1170006             NA     geranylgeranyl py ...
#>  2035 gene    complement(1170021..1170094) NA     NA                ...
#>  2036 tRNA    complement(1170021..1170094) NA     tRNA-Pro          ...
#>  2037 gene    complement(1170106..1170360) NA     NA                ...
#>  2038 CDS     complement(1170106..1170360) NA     conserved hypothe ...
#>  2039 gene    complement(1170443..1172035) NA     NA                ...
#>  2040 CDS     complement(1170443..1172035) NA     conserved hypothe ...

Various getter methods provide access to the data contained in a GenBank record; for instance:

biofiles::getAccession(rec)
#> [1] "CP002806"
biofiles::getGeneID(rec)
#> [1] "NA"
biofiles::getDefinition(rec)
#> [1] "Chlamydia psittaci 02DC15, complete genome."
biofiles::getOrganism(rec)
#> [1] "Chlamydia psittaci 02DC15"
biofiles::getSequence(rec)
#>   A DNAStringSet instance of length 1
#>       width seq                                        names               
#> [1] 1172182 CAAAGTTTTAAACATGTTAA...AAACTGAACAATGTTTTAA CP002806
biofiles::getReference(rec)
#> A 'gbReferenceList' instance:
#> REFERENCE   1  (bases 1 to 1172182)
#>   AUTHORS   Schofl,G., Voigt,A., Litsche,K., Sachse,K. and Saluz,H.P.
#>   TITLE     Complete Genome Sequences of Four Mammalian Isolates of
#>             Chlamydophila psittaci
#>   JOURNAL   J. Bacteriol. 193 (16), 4258 (2011)
#>   PUBMED    21705611
#> REFERENCE   2  (bases 1 to 1172182)
#>   AUTHORS   Schofl,G.
#>   TITLE     Direct Submission
#>   JOURNAL   Submitted (20-MAY-2011) Cell and Molecular Biology,
#>             Leibniz Institute for Natural Product Research and
#>             Infection Biology, Beutenbergstrasse 11a, Jena D-07745,
#>             Germany

The function uniqueQualifs() provides an overview over the feature qualifiers used in a record:

biofiles::uniqueQualifs(rec)
#>  [1] "organism"            "mol_type"            "strain"             
#>  [4] "db_xref"             "gene"                "locus_tag"          
#>  [7] "note"                "codon_start"         "transl_table"       
#> [10] "product"             "protein_id"          "translation"        
#> [13] "EC_number"           "artificial_location"

Genbank Features

The important part of a GenBank record will generally be the list of annotions or features.

We can access the gbFeatureList of a gbRecord using getFeatures() or ft():

f <- biofiles::ft(rec)
f
#> 'gbFeatureTable' with 2040 features:
#> Feature:        Location/Qualifiers:
#> source          1..1172182
#>                 /organism = "Chlamydia psittaci 02DC15"
#>                 /mol_type = "genomic DNA"
#>                 /strain = "02DC15"
#>                 /db_xref = "taxon:1112254"
#> ...
#> Feature:        Location/Qualifiers:
#> CDS             complement(1170443..1172035)
#>                 /locus_tag = "CPS0B_1080"
#>                 /note = "[F] COG1351 Predicted alternative thymidylate
#>                 synthase"
#>                 /codon_start = "1"
#>                 /transl_table = "11"
#>                 /product = "conserved hypothetical protein"
#>                 /protein_id = "AEG87987.1"
#>                 /translation = "MLSRDDEFSSEQRKSLSHFVTNLETNIFALKNLPEVVKGA
#>                 LFSKYSRSTLGLRSLLLKEFLEGEGGDFLDSSGVDFEVGIHKAADFYRRVLDGFG
#>                 DDSIGELGGAHLAIESVSMLAAKILEDARIGGSPLEKSSRYVYFDQKVKGEYLYY
#>                 RDPILMTSAFKDVFLGTCDFLFDTYADLIPKVRTYFEKIYPKESEVSQSAYTISL
#>                 RAKVLDCLRGLLPAATLTNLGFFGNGRFWQTLLHKIQGHNLTEIRQIGESSLTEL
#>                 MKIIPSFVSRAESHHHHHQAMLSYRQTLREQLTSLAEKFSGGSHPSKQTGVRLVY
#>                 GDPEGIYKVAAGFLFPYSEHTYEELINICKSMPREDLIRVLEAGSSSRENRRHKS
#>                 PRGLECLEFGFDITADFGAYRDLQRHRILTQERQLLTTNLGYHIPEQLLDTPMEK
#>                 DFREAMEKAEEAYNQISLEFPEEAQYVVPLAYNIRWFFHINGRALQWLCELRSQV
#>                 QGHENYRKIAIDMVKEVIRFDPVYESFFKFVDYSECDLGRIKQESRKRSF"
#> Seqinfo:
#> CP002806  1172182 DNA  Chlamydia psittaci 02DC15, complete genome.

We can extract features either by numeric subsetting:

f[[1]]
#> Feature:        Location/Qualifiers:
#> source          1..1172182
#>                 /organism = "Chlamydia psittaci 02DC15"
#>                 /mol_type = "genomic DNA"
#>                 /strain = "02DC15"
#>                 /db_xref = "taxon:1112254"
#> Seqinfo:
#> CP002806  1172182 DNA  Chlamydia psittaci 02DC15, complete genome.

or we can subset by feature key:

f["CDS"]
#> 'gbFeatureTable' with 978 features:
#> Feature:        Location/Qualifiers:
#> CDS             149..1144
#>                 /gene = "hemB"
#>                 /locus_tag = "CPS0B_0001"
#>                 /note = "PRIAM ID: PRI000415; [H] COG0113
#>                 Delta-aminolevulinic acid dehydratase"
#>                 /codon_start = "1"
#>                 /transl_table = "11"
#>                 /product = "delta-aminolevulinic acid dehydratase"
#>                 /protein_id = "AEG87010.1"
#>                 /translation = "MSSLALSRRPRRNRRTEAIRDLVSETSLLPQDFICPFFVK
#>                 EGKNIREEIESLTGVYRWSIDLLLKEIERLCSLGLRAVILFPVIPSHLKDAYGSY
#>                 SSNPKNILCKSIYEVKKAFPNLCVISDIALDPYTTHGHDGIIDRGEVLNDESVRI
#>                 FGNIATLHAEMGADVVAPSDMMDGRVAHIRSKLDQAGWTQTLILSYSVKYASALY
#>                 NPFRDALGSHLQSGDKRNYQMNPKNVLEALLECSLDEQEGADMLMIKPAGLYLDV
#>                 LHRVKNSTTLPLAAYQVSGEYAMIAAASTMGWLDREKIVYESLIAIKRAGADMII
#>                 SYATPLILEMIASSRV"
#> ...
#> Feature:        Location/Qualifiers:
#> CDS             complement(1170443..1172035)
#>                 /locus_tag = "CPS0B_1080"
#>                 /note = "[F] COG1351 Predicted alternative thymidylate
#>                 synthase"
#>                 /codon_start = "1"
#>                 /transl_table = "11"
#>                 /product = "conserved hypothetical protein"
#>                 /protein_id = "AEG87987.1"
#>                 /translation = "MLSRDDEFSSEQRKSLSHFVTNLETNIFALKNLPEVVKGA
#>                 LFSKYSRSTLGLRSLLLKEFLEGEGGDFLDSSGVDFEVGIHKAADFYRRVLDGFG
#>                 DDSIGELGGAHLAIESVSMLAAKILEDARIGGSPLEKSSRYVYFDQKVKGEYLYY
#>                 RDPILMTSAFKDVFLGTCDFLFDTYADLIPKVRTYFEKIYPKESEVSQSAYTISL
#>                 RAKVLDCLRGLLPAATLTNLGFFGNGRFWQTLLHKIQGHNLTEIRQIGESSLTEL
#>                 MKIIPSFVSRAESHHHHHQAMLSYRQTLREQLTSLAEKFSGGSHPSKQTGVRLVY
#>                 GDPEGIYKVAAGFLFPYSEHTYEELINICKSMPREDLIRVLEAGSSSRENRRHKS
#>                 PRGLECLEFGFDITADFGAYRDLQRHRILTQERQLLTTNLGYHIPEQLLDTPMEK
#>                 DFREAMEKAEEAYNQISLEFPEEAQYVVPLAYNIRWFFHINGRALQWLCELRSQV
#>                 QGHENYRKIAIDMVKEVIRFDPVYESFFKFVDYSECDLGRIKQESRKRSF"
#> Seqinfo:
#> CP002806  1172182 DNA  Chlamydia psittaci 02DC15, complete genome.

A more versatile method to narrow down the list of features of interest is the function filter(). For instance, we can filter for all coding sequences (CDS) with the annotation "hypothetical" in the product qualifiers:

hypo <- biofiles::filter(rec, key = "CDS", product = "hypothetical")
biofiles::summary(hypo)
#> [[CP002806]]
#>   1172182 bp: Chlamydia psittaci 02DC15, complete genome.
#>    Id Feature Location                     GeneId Product           ...
#>     7 CDS     complement(2601..3035)       NA     conserved hypothe ...
#>    31 CDS     20427..21713                 NA     conserved hypothe ...
#>    33 CDS     21822..23741                 NA     conserved hypothe ...
#>    35 CDS     complement(23931..26531)     NA     conserved hypothe ...
#>    37 CDS     26689..29340                 NA     conserved hypothe ...
#>    39 CDS     29390..29620                 NA     conserved hypothe ...
#>    43 CDS     complement(30347..30832)     NA     conserved hypothe ...
#>   ... ...     ...                          ...    ...               ...
#>  2002 CDS     1152648..1153925             NA     conserved hypothe ...
#>  2008 CDS     complement(1155986..1156711) NA     conserved hypothe ...
#>  2010 CDS     complement(1156746..1157474) NA     conserved hypothe ...
#>  2024 CDS     1163308..1164144             NA     conserved hypothe ...
#>  2028 CDS     complement(1165683..1167527) NA     conserved hypothe ...
#>  2038 CDS     complement(1170106..1170360) NA     conserved hypothe ...
#>  2040 CDS     complement(1170443..1172035) NA     conserved hypothe ...

or we can filter for all elongation factors,

elong <- biofiles::filter(rec, key = "CDS", product = "elongation factor")
biofiles::summary(elong)
#> [[CP002806]]
#>   1172182 bp: Chlamydia psittaci 02DC15, complete genome.
#>    Id Feature Location                   GeneId Product             ...
#>     9 CDS     3148..5301                 greA   transcription elong ...
#>   103 CDS     complement(56119..56967)   tsf    translation elongat ...
#>   408 CDS     206213..208297             fusA   translation elongat ...
#>   950 CDS     complement(526454..527758) nusA   transcription elong ...
#>  1126 CDS     623470..624027             NA     elongation factor P ...
#>  1406 CDS     complement(805145..806329) tuf    translation elongat ...
#>  1750 CDS     complement(997270..997842) efp    translation elongat ...

now let's extract the sequence for all elongation factors, and using the tools from the Biostrings packages, translate them into protein sequences. Note, that in order to do so, we first get the gbFeatureTable from the gbRecord, as otherwise we'd just extract the complete sequence associated with the GenBank record.

dna <- biofiles::getSequence(biofiles::ft(elong))
dna
#>   A DNAStringSet instance of length 7
#>     width seq                                          names               
#> [1]  2154 GTGGACTATCTAGAAAAGTTG...AATCTATTTGGGATGCATAA lcl|CDS.9|gb|CP00...
#> [2]   849 TTAAGCTCCTATTTTCCATAA...TCCATAGAAAAGTTGCTCAT lcl|CDS.103|gb|CP...
#> [3]  2085 ATGAGTGATCAAGAATTCGAT...AAGAGATTGTTAAGAAGTAA lcl|CDS.408|gb|CP...
#> [4]  1305 TTAATCTTCAATTTTAGGTTT...GCTACAAGATCTTTATTCAT lcl|CDS.950|gb|CP...
#> [5]   558 ATGGTATTAAGTAGCCAGCTC...AATATATTCAACGCGTCTAA lcl|CDS.1126|gb|C...
#> [6]  1185 TTAGGCAATAATTTTTGAAAT...TGAAAAGTTTCTTTTGACAT lcl|CDS.1406|gb|C...
#> [7]   573 TTACTTGGAAACACGCGATTC...CTAGTGCTTACACGAACCAT lcl|CDS.1750|gb|C...
str <- biofiles::strand(elong)
dna_revcomp <- c(Biostrings::reverseComplement(dna[str == -1]), dna[str == 1])
aa <- Biostrings::translate(dna_revcomp)
names(aa) <- names(dna_revcomp)
aa
#>   A AAStringSet instance of length 7
#>     width seq                                          names               
#> [1]   283 MSNFSMETLKLLRQQTGVGLT...KTSGNSVEVKEFILWKIGA* lcl|CDS.103|gb|CP...
#> [2]   435 MNKDLVAIFDYMEKEKGIQRP...EQVSKYGEGKVDEKPKIED* lcl|CDS.950|gb|CP...
#> [3]   395 MSKETFQRTKPHINIGTIGHV...AIREGGRTIGAGTISKIIA* lcl|CDS.1406|gb|C...
#> [4]   191 MVRVSTSEFRVGLRIEIDGQP...GEVVKVDTRTGSYESRVSK* lcl|CDS.1750|gb|C...
#> [5]   718 VDYLEKLQVLIDEEQPSSFFN...LQFQGKKYKISRIQSIWDA* lcl|CDS.9|gb|CP00...
#> [6]   695 MSDQEFDLSKIRNIGIMAHID...EPAFFAKVPQKIQEEIVKK* lcl|CDS.408|gb|CP...
#> [7]   186 MVLSSQLSVGMFISTKDGLYK...EIGDIIKIDTRTCEYIQRV* lcl|CDS.1126|gb|C...

We can use the ranges() method to extract GRanges objects defined in the Bioconductor package GenomicRanges:

elong_ranges <- biofiles::ranges(elong, include = c("locus_tag", "protein_id", "product"))
elong_ranges
#> GRanges object with 7 ranges and 4 metadata columns:
#>              seqnames           ranges strand |         key   locus_tag
#>                 <Rle>        <IRanges>  <Rle> | <character> <character>
#>   CPS0B_0004 CP002806 [  3148,   5301]      + |         CDS  CPS0B_0004
#>   CPS0B_0056 CP002806 [ 56119,  56967]      - |         CDS  CPS0B_0056
#>   CPS0B_0219 CP002806 [206213, 208297]      + |         CDS  CPS0B_0219
#>   CPS0B_0509 CP002806 [526454, 527758]      - |         CDS  CPS0B_0509
#>   CPS0B_0600 CP002806 [623470, 624027]      + |         CDS  CPS0B_0600
#>   CPS0B_0751 CP002806 [805145, 806329]      - |         CDS  CPS0B_0751
#>   CPS0B_0928 CP002806 [997270, 997842]      - |         CDS  CPS0B_0928
#>               protein_id
#>              <character>
#>   CPS0B_0004  AEG87013.1
#>   CPS0B_0056  AEG87055.1
#>   CPS0B_0219  AEG87200.1
#>   CPS0B_0509  AEG87466.1
#>   CPS0B_0600  AEG87550.1
#>   CPS0B_0751  AEG87684.1
#>   CPS0B_0928  AEG87849.1
#>                                                          product
#>                                                      <character>
#>   CPS0B_0004 transcription elongation factor GreA domain protein
#>   CPS0B_0056                    translation elongation factor Ts
#>   CPS0B_0219                     translation elongation factor G
#>   CPS0B_0509                     transcription elongation factor
#>   CPS0B_0600                                 elongation factor P
#>   CPS0B_0751                    translation elongation factor Tu
#>   CPS0B_0928                     translation elongation factor P
#>   -------
#>   seqinfo: 1 sequence (1 circular) from Chlamydia psittaci 02DC15, complete genome. genome


gschofl/biofiles documentation built on Sept. 27, 2020, 12:08 a.m.