knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The fasta
file format is one of the simplest file formats for sequence data
and the files can contain either protein or nucleotide sequences. The fasta
format contains a minimal amount of additional information or metadata.
The fasta
file contains the following elements
>
character is used as a field delimiterfirst_word
after the >
delimiter is used to identify the sequence name
or accession
sequence
is contained following the delimiter line (and until the EOF or
subsequent delimiters) and may be either in the form of constant line width
records or single monolithic sequence records.The information within the fasta
record is thus limited to facets of the
sequence itself - e.g. length.
There is not really any mystery or complication with fasta
format sequence
files - the functionality is provided within floundeR
to support other
applications and workflows. The fasta
parsing functionality is provided
by the Rsamtools
package but could equally have been provided by numberous other packages such as
ShortRead
.
The aim of this vignette is to introduce the Fasta
R6 object and to show how
this can be used within the broader floundeR
environment for producing tabular
data and graphical visualisations.
library(Rsamtools) library(ggplot2)
The first step in a floundeR
based fasta
analysis is to load the floundeR
package. We will also load a collection of other packages - please check the
vignette code to see what has been loaded silently.
library(floundeR)
A fasta
format sequence file is provided within the accompanying floundeR
packaged data. Let's have a quick look at a fasta
format sequence file.
canonical_fasta <- flnDr("cluster_cons.fasta.bgz") print(canonical_fasta)
The extension of the file above shows that this is a bgzip
compressed fasta
file. This can be read directly using the R
readLines
command - let's have
a look at the first 10 lines contained within the file.
readLines(canonical_fasta, n=10)
The output above should reflect the description in the preamble. Since this is
an R
package let's also have a quick look at the file contents using the
Rsamtools
package instead.
fasta <- open(FaFile(canonical_fasta)) index <- scanFaIndex(fasta) # how many fasta entries in file? countFa(fasta) # let's pull out the first two entries scanFa(fasta, index[1:2]) # fasta is a connection and should thus be closed when done close(fasta)
This really covers the basics of fasta
sequence handling using R
. The
objectives of floundeR
are not to reproduce the capabilities of other packages
but to simplify analyses.
floundeR
Fasta
R6 object.The floundeR
package contained R6 objects to describe many bioinformatics
data types. There is a simple constructor for loading a fasta
format sequence
file.
fasta <- Fasta$new(canonical_fasta) print(fasta) fasta$as_tibble()
That's pretty lean data - not much to show or present.
Fasta
object?As described in the previous section, there is not really very much information
in the fasta
sequence format other than the sequence itself. The Fasta
R6
object can be exported as a SequencingSet
object.
fasta %>% to_sequencing_set()
The SequencingSet
object can also be used to access simple but primitive
summary statistics such as mean
sequence length, N50
length etc
fasta$sequencingset$N50 fasta$sequencingset$mean
The SequencingSet
in turn has a collection of methods that can be used to
structure and visualise the data. The first that we'll have a look at is the
$enumerate
method that returns an Angenieux
object for data visualisation.
knitr::include_graphics( fasta$sequencingset$enumerate$to_file("figure_8.png")$plot() )
The format for the plotting command is a little gnarly - please check the
vignettes on the Angenieux
R6 object for further details and information on
the logic and control of the presentation.
The final plot that makes sense with just sequence data is a length distribution
plot; this can be prepared with the command below. In this command we transform
the Fasta
object into a SequencingSet
and we request that a distribution
of binned sequence lengths be prepared.
knitr::include_graphics( fasta$sequencingset$read_length_bins(bins=35, outliers=0.001)$to_file("figure_9.png")$plot(style="stacked") )
Fasta
object ...floundeR
Please consider having a quick read on the following subjects
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.