knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

Travis-CI Build Status AppVeyor Build Status CRAN RStudio mirror downloads CRAN_Status_Badge

biofiles - an interface to GenBank/GenPept files in R

biofiles provides interfacing to GenBank/GenPept or Embl flat file records. It includes utilities for reading and writing GenBank files, and methods for interacting with annotation and sequence data.

Installation

Install the latest stable release of the biofiles package from CRAN:

install.packages("biofiles")

Install the development version from github using the devtools package.

# install.packages("devtools")
devtools::install_github("gschofl/biofiles")

Basic functionality

Let's download a small bacterial genome, Chlamydophila psittaci DC15 (GenBank acc. CP002806; GI 334693792) from NCBI.

# install.packages("reutils")
gb_file <- reutils::efetch("CP002806", "nuccore", rettype = "gbwithparts", retmode = "text")
gb_file

Next, we parse the efetch object into a gbRecord instance.

rec <- biofiles::gbRecord(gb_file)
rec

The summary function provides an overview over the object:

biofiles::summary(rec)

Various getter methods provide access to the data contained in a GenBank record; for instance:

biofiles::getAccession(rec)
biofiles::getGeneID(rec)
biofiles::getDefinition(rec)
biofiles::getOrganism(rec)
biofiles::getSequence(rec)
biofiles::getReference(rec)

The function uniqueQualifs() provides an overview over the feature qualifiers used in a record:

biofiles::uniqueQualifs(rec)

Genbank Features

The important part of a GenBank record will generally be the list of annotions or features.

We can access the gbFeatureList of a gbRecord using getFeatures() or ft():

f <- biofiles::ft(rec)
f

We can extract features either by numeric subsetting:

f[[1]]

or we can subset by feature key:

f["CDS"]

A more versatile method to narrow down the list of features of interest is the function filter(). For instance, we can filter for all coding sequences (CDS) with the annotation "hypothetical" in the product qualifiers:

hypo <- biofiles::filter(rec, key = "CDS", product = "hypothetical")
biofiles::summary(hypo)

or we can filter for all elongation factors,

elong <- biofiles::filter(rec, key = "CDS", product = "elongation factor")
biofiles::summary(elong)

now let's extract the sequence for all elongation factors, and using the tools from the Biostrings packages, translate them into protein sequences. Note, that in order to do so, we first get the gbFeatureTable from the gbRecord, as otherwise we'd just extract the complete sequence associated with the GenBank record.

dna <- biofiles::getSequence(biofiles::ft(elong))
dna
str <- biofiles::strand(elong)
dna_revcomp <- c(Biostrings::reverseComplement(dna[str == -1]), dna[str == 1])
aa <- Biostrings::translate(dna_revcomp)
names(aa) <- names(dna_revcomp)
aa

We can use the ranges() method to extract GRanges objects defined in the Bioconductor package GenomicRanges:

elong_ranges <- biofiles::ranges(elong, include = c("locus_tag", "protein_id", "product"))
elong_ranges


gschofl/biofiles documentation built on Sept. 27, 2020, 12:08 a.m.