Preparing FASTA files for pan-genomics
Preparing a FASTA file before starting comparisons of sequences in a pan-genome study.
The name of a FASTA formatted file with protein or nucleotide sequences for coding genes in a genome.
The Genome IDentifier tag, see below.
Name of file where the prepared sequences will be written.
Logical, indicating if the in.file contains protein (
A text containing a regular expression. Sequences whose header text matches this expression will be discarded.
This function will read a FASTA file and produce another, slightly modified, FASTA file which is prepared for genome-wise comparisons using hmmerScan or any other method.
The main purpose of panPrep is to make certain every sequence is labeled with a tag called a GID.tag (Genome IDentifier tag) identifying the genome. The tag consists of the text “GID” followed by an integer. This can be any integer as long as it is unique to every genome in the study. It can typically be the BioProject number or any other integer that is uniquely related to a specific genome. If a genome has the text “GID12345” as identifier, then the sequences in the file produced by panPrep will have header lines starting with “GID12345_seq1”, “GID12345_seq2”, “GID12345_seq3”, etc. This makes it possible to quickly identify which genome every sequence belongs to.
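The labeling scheme described above can be sketched in base R. This is an illustration only, not the package's implementation: the helper tag_headers is hypothetical, and keeping the original header text after the tag is an assumption.

```r
# Hypothetical helper (not part of micropan): prefix every header line
# with the GID.tag and a running sequence number.
tag_headers <- function(headers, GID.tag) {
  # Assumption: the original header text is kept after the new label.
  paste0(GID.tag, "_seq", seq_along(headers), " ", headers)
}

headers <- c("geneA hypothetical protein", "geneB kinase", "geneC transporter")
tagged <- tag_headers(headers, "GID12345")
print(tagged)
```

Because every header now starts with the same tag, the genome of origin can be recovered from any sequence with a simple pattern match on the header.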
The GID.tag is also added to the file name specified in out.file. For this reason the out.file must have a file extension containing letters only. By convention, we expect FASTA files to have one of the extensions .fsa, .faa, .fa or .fasta.
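To see why the extension must contain letters only, here is a minimal sketch of how a tag could be placed in front of the extension. The helper add_gid_to_filename and its regular expression are assumptions made for illustration and may differ from what panPrep does internally.

```r
# Hypothetical helper (not part of micropan): insert "_<GID.tag>" in front
# of a file extension that consists of letters only.
add_gid_to_filename <- function(out.file, GID.tag) {
  # Assumption: the extension is the final dot followed by letters only.
  sub("\\.([A-Za-z]+)$", paste0("_", GID.tag, ".\\1"), out.file)
}

tagged.name <- add_gid_to_filename("Mpneumoniae_309.fsa", "GID123")
print(tagged.name)

# An extension containing a digit is not recognised by this pattern,
# so the file name is returned unchanged.
untagged.name <- add_gid_to_filename("Mpneumoniae_309.fa2", "GID123")
print(untagged.name)
```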
panPrep will also remove very short sequences (< 10 amino acids), remove stop codon symbols (*), replace alien characters with X and convert all sequences to upper-case. If the input discard contains a regular expression, any sequences having a match to this in their header line are also removed. Example: If we use prodigalPredict to find proteins in a genome, partially predicted genes will have the text partial=10 or partial=01 in their header line. Using discard="partial=01|partial=10" will remove these from the data set.
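The cleaning steps above can be imitated on a toy data set in base R. This is a sketch under stated assumptions, not the code used by panPrep: the order of the steps is chosen for illustration, and treating every symbol outside the 20 standard amino acid letters as alien is an assumption.

```r
# Toy headers and protein sequences, made up for illustration.
headers <- c("seq1 partial=00", "seq2 partial=10", "seq3 partial=01",
             "seq4 partial=00", "seq5 partial=00")
sequences <- c("MKTAYIAKQRQISFVK*", "MSTNPKPQRKTKRNTNRRPQDVK",
               "MSTNPKPQRKTKRNTNRRPQDVK", "mktayiakqrBJ?sfvk", "MKV*")

# Discard sequences whose header matches the regular expression.
keep <- !grepl("partial=01|partial=10", headers)
headers <- headers[keep]
sequences <- sequences[keep]

# Convert to upper-case, then remove stop codon symbols.
sequences <- toupper(sequences)
sequences <- gsub("*", "", sequences, fixed = TRUE)

# Assumption: anything outside the 20 standard amino acids is alien.
sequences <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "X", sequences)

# Remove very short sequences (< 10 amino acids).
long.enough <- nchar(sequences) >= 10
headers <- headers[long.enough]
sequences <- sequences[long.enough]

print(headers)
print(sequences)
```

Only seq1 and seq4 survive in this toy example: seq2 and seq3 are discarded by the regular expression, and seq5 is too short.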
This function produces a FASTA formatted sequence file.
Lars Snipen and Kristian Liland.
# Using a FASTA file in the micropan package
# We need to uncompress it first...
extdata.path <- file.path(path.package("micropan"),"extdata")
filenames <- "Mpneumoniae_309_protein.fsa"
pth <- lapply( file.path( extdata.path, paste( filenames, ".xz", sep="" ) ), xzuncompress )

# ...then we prep it, using the GID.tag "GID123"
panPrep(file.path(extdata.path,filenames),GID.tag="GID123","Mpneumoniae_309.fsa")
# ...should produce a FASTA file named Mpneumoniae_309_GID123.fsa

# ...and compress the input file again...
pth <- lapply( file.path( extdata.path, filenames ), xzcompress )