View source: R/read.alignment.R
read.alignment | R Documentation |
Read a file in mase
, clustal
, phylip
, fasta
or msf
format.
These formats are used to store nucleotide or protein multiple alignments.
read.alignment(file, format, forceToLower = TRUE, oldclustal = FALSE, ...)
file |
the name of the file which the aligned sequences are to be read from.
If it does not contain an absolute or relative path, the file name is relative
to the current working directory, |
format |
a character string specifying the format of the file : |
forceToLower |
a logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case (introduced in seqinR release 1.1-3). |
oldclustal |
a logical defaulting to FALSE wether to use the old C function to read a clustal file (which is faster but stricter concerning sequence line length.) |
... |
For the |
The mase format is used to store nucleotide or protein
multiple alignments. The beginning of the file must contain a header
containing at least one line (but the content of this header may be
empty). The header lines must begin by ;;
. The body of the
file has the following structure: First, each entry must begin by
one (or more) commentary line. Commentary lines begin by the character
;
. Again, this commentary line may be empty. After the
commentaries, the name of the sequence is written on a separate
line. At last, the sequence itself is written on the following lines.
The CLUSTAL format (*.aln) is the format of the ClustalW multialignment tool output. It can be described as follows. The word CLUSTAL is on the first line of the file. The alignment is displayed in blocks of a fixed length, each line in the block corresponding to one sequence. Each line of each block starts with the sequence name (maximum of 10 characters), followed by at least one space character. The sequence is then displayed in upper or lower cases, '-' denotes gaps. The residue number may be displayed at the end of the first line of each block.
MSF is the multiple sequence alignment format of the GCG sequence analysis package. It begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type if its present.(optional). A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.(optional) A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information.(required) msf files contain some other information: the Name/Weight, a Separating Line which must include two slashes (//) to divide the name/weight information from the sequence alignment.(required) and the multiple sequence alignment.
PHYLIP is a tree construction program. The format is as follows: the number of sequences and their length (in characters) is on the first line of the file. The alignment is displayed in an interleaved or sequential format. The sequence names are limited to 10 characters and may contain blanks.
Sequence in fasta format begins with a single-line description (distinguished by a greater-than (>) symbol), followed by sequence data on the next line.
An object of class alignment
which is a list with the following components:
nb |
the number of aligned sequences |
nam |
a vector of strings containing the names of the aligned sequences |
seq |
a vector of strings containing the aligned sequences |
com |
a vector of strings containing the commentaries for each sequence or |
D. Charif, J.R. Lobry
citation("seqinr")
To read aligned sequences in NEXUS format, see the function
read.nexus
that was available in the CompPairWise
package
(not sure it is still maintained as of 09/09/09).
The NEXUS format was mainly used by the non-GPL commercial PAUP
software.
Related functions: as.matrix.alignment
, read.fasta
,
write.fasta
, reverse.align
, dist.alignment
.
mase.res <- read.alignment(file = system.file("sequences/test.mase", package = "seqinr"),
format = "mase")
clustal.res <- read.alignment(file = system.file("sequences/test.aln", package = "seqinr"),
format="clustal")
phylip.res <- read.alignment(file = system.file("sequences/test.phylip", package = "seqinr"),
format = "phylip")
msf.res <- read.alignment(file = system.file("sequences/test.msf", package = "seqinr"),
format = "msf")
fasta.res <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"),
format = "fasta")
#
# Quality control routine sanity checks:
#
data(mase); stopifnot(identical(mase, mase.res))
data(clustal); stopifnot(identical(clustal, clustal.res))
data(phylip); stopifnot(identical(phylip, phylip.res))
data(msf); stopifnot(identical(msf, msf.res))
data(fasta); stopifnot(identical(fasta, fasta.res))
#
# Example of using extra arguments from the read.fasta function, here to keep
# whole headers for sequences names.
#
whole.header.test <-
read.alignment(file = system.file("sequences/LTPs128_SSU_aligned_First_Two.fasta",
package = "seqinr"), format = "fasta", whole.header = TRUE)
whole.header.test$nam
# Sould be:
#
# [1] "D50541\t1\t1411\t1411bp\trna\tAbiotrophia defectiva\tAerococcaceae"
# [2] "KP233895\t1\t1520\t1520bp\trna\tAbyssivirga alkaniphila\tLachnospiraceae"
#
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.