knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The fasta file format is one of the simplest file formats for sequence data
and the files can contain either protein or nucleotide sequences. The fasta
format contains a minimal amount of additional information or metadata.
The fasta file contains the following elements:

- The > character is used as a field delimiter.
- The first word after the > delimiter is used to identify the sequence name
  or accession.
- The sequence is contained following the delimiter line (and until the EOF or
  subsequent delimiters) and may be either in the form of constant line width
  records or single monolithic sequence records.

The information within the fasta record is thus limited to facets of the
sequence itself - e.g. length.
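
The elements of a fasta record can be illustrated with a few lines of base R. This is only a sketch for illustration and is not how floundeR or Rsamtools parse fasta files; the helper `parse_fasta_lines` and the two example records are hypothetical.

```r
# Hypothetical helper: parse fasta-formatted lines into a named character vector.
parse_fasta_lines <- function(lines) {
  lines <- lines[nzchar(lines)]               # drop empty lines
  is_header <- startsWith(lines, ">")         # delimiter lines start with ">"
  # the first word after ">" is the sequence name or accession
  ids <- sub("^>\\s*(\\S+).*$", "\\1", lines[is_header])
  # each sequence line belongs to the most recent delimiter line above it
  record <- cumsum(is_header)
  seqs <- vapply(
    seq_along(ids),
    function(i) paste(lines[record == i & !is_header], collapse = ""),
    character(1)
  )
  stats::setNames(seqs, ids)
}

# Illustrative records only; one is split over two lines, one is monolithic.
example_lines <- c(
  ">seq1 an illustrative description",
  "ACGTAC",
  "GTAC",
  ">seq2",
  "GGGTTTAAACCC"
)
records <- parse_fasta_lines(example_lines)
nchar(records)  # sequence length per record
```

Both the line-wrapped and the monolithic record collapse to a single sequence string, and the only thing left to summarise is the length.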
There is not really any mystery or complication with fasta format sequence
files - the functionality is provided within floundeR to support other
applications and workflows. The fasta parsing functionality is provided by the
Rsamtools package but could equally have been provided by numerous other
packages such as ShortRead.
The aim of this vignette is to introduce the Fasta R6 object and to show how
this can be used within the broader floundeR environment for producing tabular
data and graphical visualisations.
library(Rsamtools)
library(ggplot2)
The first step in a floundeR based fasta analysis is to load the floundeR
package. We will also load a collection of other packages - please check the
vignette code to see what has been loaded silently.
library(floundeR)
A fasta format sequence file is provided within the accompanying floundeR
packaged data. Let's have a quick look at a fasta format sequence file.
canonical_fasta <- flnDr("cluster_cons.fasta.bgz")
print(canonical_fasta)
The extension of the file above shows that this is a bgzip compressed fasta
file. This can be read directly using the R readLines command - let's have
a look at the first 10 lines contained within the file.
readLines(canonical_fasta, n=10)
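
readLines can read the compressed file because bgzip output is gzip-compatible and R detects gzip compression when a file path is opened for reading. A minimal sketch of the same behaviour, assuming an ordinary gzip temporary file (not a true bgzip file, and not the package data):

```r
# Write two illustrative fasta records to a gzip-compressed temporary file.
tmp <- tempfile(fileext = ".fasta.gz")
con <- gzfile(tmp, open = "wt")
writeLines(c(">seq1", "ACGT", ">seq2", "GGCC"), con)
close(con)

# readLines() opens the path with file(), which detects the gzip compression.
first_lines <- readLines(tmp, n = 2)
first_lines

# Remove the temporary file once done.
unlink(tmp)
```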
The output above should reflect the description in the preamble. Since this is
an R package let's also have a quick look at the file contents using the
Rsamtools package instead.
fasta <- open(FaFile(canonical_fasta))
index <- scanFaIndex(fasta)
# how many fasta entries in file?
countFa(fasta)
# let's pull out the first two entries
scanFa(fasta, index[1:2])
# fasta is a connection and should thus be closed when done
close(fasta)
This really covers the basics of fasta sequence handling using R. The
objectives of floundeR are not to reproduce the capabilities of other packages
but to simplify analyses.
floundeR Fasta R6 object.

The floundeR package contains R6 objects to describe many bioinformatics
data types. There is a simple constructor for loading a fasta format sequence
file.
fasta <- Fasta$new(canonical_fasta)
print(fasta)
fasta$as_tibble()
That's pretty lean data - not much to show or present.
Fasta object?

As described in the previous section, there is not really very much information
in the fasta sequence format other than the sequence itself. The Fasta R6
object can be exported as a SequencingSet object.
fasta %>% to_sequencing_set()
The SequencingSet object can also be used to access simple summary statistics
such as the mean sequence length and the N50 length.
fasta$sequencingset$N50
fasta$sequencingset$mean
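
For readers unfamiliar with N50, the sketch below computes it in base R using the common definition: sort the lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The helper `n50` and the example lengths are hypothetical, and floundeR's own implementation may differ in detail.

```r
# Hypothetical helper: N50 of a vector of sequence lengths.
n50 <- function(lengths) {
  # sort longest first; as.numeric avoids integer overflow in the running total
  sorted <- sort(as.numeric(lengths), decreasing = TRUE)
  # first length at which the running total reaches half of all bases
  sorted[which(cumsum(sorted) >= sum(sorted) / 2)[1]]
}

lengths <- c(2, 3, 4, 5, 6)   # illustrative lengths, not package data
mean(lengths)                 # mean sequence length
n50(lengths)                  # N50 length
```

With these five lengths the total is 20 bases; the two longest sequences (6 and 5) already cover 11 of them, so the N50 is 5 while the mean is 4.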
The SequencingSet in turn has a collection of methods that can be used to
structure and visualise the data. The first that we'll have a look at is the
$enumerate method that returns an Angenieux object for data visualisation.
knitr::include_graphics(
  fasta$sequencingset$enumerate$to_file("figure_8.png")$plot()
)
The format for the plotting command is a little gnarly - please check the
vignettes on the Angenieux R6 object for further details and information on
the logic and control of the presentation.
The final plot that makes sense with just sequence data is a length distribution
plot; this can be prepared with the command below. In this command we transform
the Fasta object into a SequencingSet and we request that a distribution
of binned sequence lengths be prepared.
knitr::include_graphics(
  fasta$sequencingset$read_length_bins(bins=35, outliers=0.001)$to_file("figure_9.png")$plot(style="stacked")
)
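
The idea of binning sequence lengths can be sketched in base R. How floundeR interprets `bins` and `outliers` is not described in this vignette; the sketch assumes that `bins` is the number of equal-width bins and that `outliers` is the fraction of the longest sequences to discard before binning. The simulated lengths are illustrative only.

```r
# Illustrative lengths only; not the package data.
set.seed(1)
lengths <- round(rlnorm(500, meanlog = 7, sdlog = 0.5))

bins <- 35          # number of equal-width bins (assumed meaning)
outliers <- 0.001   # fraction of the longest sequences to drop (assumed meaning)

# Drop the extreme upper tail so a few very long sequences do not stretch the axis.
upper <- quantile(lengths, probs = 1 - outliers)
kept <- lengths[lengths <= upper]

# Assign each retained length to one of `bins` equal-width intervals and count.
breaks <- seq(min(kept), max(kept), length.out = bins + 1)
binned <- cut(kept, breaks = breaks, include.lowest = TRUE)
counts <- table(binned)
head(counts)
```

The resulting table of counts per interval is the kind of tabular data that a length distribution plot is drawn from.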
Fasta object ...

floundeR

Please consider having a quick read on the following subjects