knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "README-" )
biofiles provides interfacing to GenBank/GenPept or Embl flat file records. It includes utilities for reading and writing GenBank files, and methods for interacting with annotation and sequence data.
Install the latest stable release of the biofiles
package from CRAN:
install.packages("biofiles")
Install the development version from github
using the devtools
package.
# install.packages("devtools") devtools::install_github("gschofl/biofiles")
Let's download a small bacterial genome, Chlamydophila psittaci DC15 (GenBank acc. CP002806; GI 334693792) from NCBI.
# install.packages("reutils") gb_file <- reutils::efetch("CP002806", "nuccore", rettype = "gbwithparts", retmode = "text") gb_file
Next, we parse the efetch
object into a gbRecord
instance.
rec <- biofiles::gbRecord(gb_file) rec
The summary
function provides an overview over the object:
biofiles::summary(rec)
Various getter methods provide access to the data contained in a GenBank record; for instance:
biofiles::getAccession(rec)
biofiles::getGeneID(rec)
biofiles::getDefinition(rec)
biofiles::getOrganism(rec)
biofiles::getSequence(rec)
biofiles::getReference(rec)
The function uniqueQualifs()
provides an overview over the feature qualifiers used
in a record:
biofiles::uniqueQualifs(rec)
The important part of a GenBank record will generally be the list of annotions or features.
We can access the gbFeatureList
of a gbRecord
using getFeatures()
or ft()
:
f <- biofiles::ft(rec) f
We can extract features either by numeric subsetting:
f[[1]]
or we can subset by feature key:
f["CDS"]
A more versatile method to narrow down the list of features of interest is the
function filter()
.
For instance, we can filter for all coding sequences (CDS) with the
annotation "hypothetical" in the product qualifiers:
hypo <- biofiles::filter(rec, key = "CDS", product = "hypothetical") biofiles::summary(hypo)
or we can filter for all elongation factors,
elong <- biofiles::filter(rec, key = "CDS", product = "elongation factor") biofiles::summary(elong)
now let's extract the sequence for all elongation factors, and using the tools
from the Biostrings
packages, translate them into protein sequences. Note, that
in order to do so, we first get the gbFeatureTable
from the gbRecord
, as otherwise
we'd just extract the complete sequence associated with the GenBank record.
dna <- biofiles::getSequence(biofiles::ft(elong)) dna
str <- biofiles::strand(elong) dna_revcomp <- c(Biostrings::reverseComplement(dna[str == -1]), dna[str == 1]) aa <- Biostrings::translate(dna_revcomp) names(aa) <- names(dna_revcomp) aa
We can use the ranges()
method to extract GRanges
objects defined in the
Bioconductor package GenomicRanges
:
elong_ranges <- biofiles::ranges(elong, include = c("locus_tag", "protein_id", "product")) elong_ranges
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.