read.dna: Read DNA Sequences in a File


Description

This function reads DNA sequences from a file and returns a matrix or a list of DNA sequences, with the names of the taxa read from the file as rownames or names, respectively. By default, the sequences are stored in binary format; otherwise (if as.character = TRUE) they are stored as lower-case characters.

Usage

read.dna(file, format = "interleaved", skip = 0,
         nlines = 0, comment.char = "#", seq.names = NULL,
         as.character = FALSE, as.matrix = NULL)

Arguments

file

a file name specified by either a variable of mode character, or a double-quoted string.

format

a character string specifying the format of the DNA sequences. Four choices are possible: "interleaved", "sequential", "clustal", or "fasta", or any unambiguous abbreviation of these.

skip

the number of lines of the input file to skip before beginning to read data.

nlines

the number of lines to be read (by default the file is read until its end).

comment.char

a single character; the remainder of the line after this character is ignored.

seq.names

the names to give to each sequence; by default the names read in the file are used.

as.character

a logical controlling whether to return the sequences as characters (TRUE) or as an object of class "DNAbin" (FALSE, the default).

as.matrix

(used if format = "fasta") one of the following three: (i) NULL: returns the sequences in a matrix if they are of the same length, otherwise in a list; (ii) TRUE: returns the sequences in a matrix, or stops with an error if they are of different lengths; (iii) FALSE: always returns the sequences in a list.
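
As an illustration of how skip, comment.char, and as.character combine, here is a minimal sketch. The file contents, the sequence names (Seq1, Seq2), and the header line are made up for the example; it assumes the ape package is installed and that a trailing comment on a sequence line is dropped as the comment.char description suggests.

```r
library(ape)

# Hypothetical file: one header line to skip, then sequential-format data
# with a trailing comment on one of the sequence lines.
f <- tempfile(fileext = ".txt")
writeLines(c(
  "written by an imaginary export tool",
  "2 20",
  "Seq1      ACGTACGTACGTACGTACGT # checked by hand",
  "Seq2      ACGTACGTACTTACGTACGA"
), f)

# skip = 1 drops the header line; "#" is the default comment character
x_bin <- read.dna(f, format = "sequential", skip = 1)
x_chr <- read.dna(f, format = "sequential", skip = 1, as.character = TRUE)

class(x_bin)        # binary storage, class "DNAbin"
dim(x_chr)          # taxa in rows, sites in columns
x_chr["Seq1", 1:4]  # lower-case characters

unlink(f) # clean-up
```

With as.character = TRUE the result is a plain character matrix, which is convenient for inspection but typically needs converting back (for instance with as.DNAbin) before use with functions such as dist.dna.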

Details

This function follows the interleaved and sequential formats defined in PHYLIP (Felsenstein, 1993) but with the original feature that there is no restriction on the lengths of the taxa names. For these two formats, the first line of the file must contain the dimensions of the data (the number of taxa and the number of nucleotides); the sequences are considered as aligned and thus must be of the same length for all taxa. For the FASTA format, the conventions defined in the URL below (see References) are followed; the sequences are taken as non-aligned. For all formats, the nucleotides can be arranged in any way with blanks and line-breaks inside (with the restriction that the first ten nucleotides must be contiguous for the interleaved and sequential formats, see below). The names of the sequences are read in the file unless the ‘seq.names’ option is used. Particularities for each format are detailed below.
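
The tolerance for blanks and line-breaks can be checked with a small sketch like the one below. The two files and the sequence names are invented for the example; both hold the same two 20-nucleotide sequences in sequential format, and each sequence starts with at least ten contiguous nucleotides, as the restriction above requires. The ape package is assumed to be installed.

```r
library(ape)

f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")

# Each sequence on a single line
writeLines(c(
  "2 20",
  "Seq1      ACGTACGTACGTACGTACGT",
  "Seq2      ACGTACGTACTTACGTACGA"
), f1)

# Same data, with a blank and a line-break after the first ten nucleotides
writeLines(c(
  "2 20",
  "Seq1      ACGTACGTAC GTACG",
  "          TACGT",
  "Seq2      ACGTACGTAC TTACG",
  "          TACGA"
), f2)

a <- read.dna(f1, format = "sequential")
b <- read.dna(f2, format = "sequential")

# Expected to be TRUE if the layout inside the sequences is ignored
same_layout <- identical(a, b)
same_layout

unlink(c(f1, f2)) # clean-up
```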

Value

a matrix or a list (if format = "fasta") of DNA sequences stored in binary format, or of mode character (if as.character = TRUE).
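
The following sketch shows how the return type for FASTA input is expected to depend on as.matrix and on whether the sequences share a length. The file contents and names are hypothetical, and the ape package is assumed to be installed.

```r
library(ape)

f_equal  <- tempfile(fileext = ".fas")
f_ragged <- tempfile(fileext = ".fas")
writeLines(c(">Seq1", "ACGTACGTAC", ">Seq2", "ACGTACGTAA"), f_equal)
writeLines(c(">Seq1", "ACGTACGTAC", ">Seq2", "ACGTAC"), f_ragged)

# as.matrix = NULL (default): matrix if all lengths agree, list otherwise
m  <- read.dna(f_equal,  format = "fasta")
l2 <- read.dna(f_ragged, format = "fasta")

# as.matrix = FALSE: a list even when the lengths agree
l1 <- read.dna(f_equal, format = "fasta", as.matrix = FALSE)

# as.matrix = TRUE with unequal lengths: an error, captured here
err <- tryCatch(
  read.dna(f_ragged, format = "fasta", as.matrix = TRUE),
  error = function(e) e
)

is.matrix(m)
is.list(l1)
is.list(l2)
inherits(err, "error")

unlink(c(f_equal, f_ragged)) # clean-up
```

Setting as.matrix = TRUE explicitly is a cheap way to assert that a FASTA file holds an alignment, since unequal lengths then stop with an error instead of silently returning a list.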

Author(s)

Emmanuel Paradis

References

Anonymous. FASTA format description. http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

Anonymous. IUPAC ambiguity codes. http://www.ncbi.nlm.nih.gov/SNP/iupac.html

Felsenstein, J. (1993) Phylip (Phylogeny Inference Package) version 3.5c. Department of Genetics, University of Washington. http://evolution.genetics.washington.edu/phylip/phylip.html

See Also

read.GenBank, write.dna, DNAbin, dist.dna, woodmouse

Examples

### a small extract from `data(woodmouse)'
cat("3 40",
"No305     NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"No304     ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"No306     ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna <- read.dna("exdna.txt", format = "sequential")
str(ex.dna)
ex.dna
### the same data in interleaved format...
cat("3 40",
"No305     NTTCGAAAAA CACACCCACT",
"No304     ATTCGAAAAA CACACCCACT",
"No306     ATTCGAAAAA CACACCCACT",
"          ACTAAAANTT ATCAGTCACT",
"          ACTAAAAATT ATCAACCACT",
"          ACTAAAAATT ATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna2 <- read.dna("exdna.txt")
### ... in clustal format...
cat("CLUSTAL (ape) multiple sequence alignment", "",
"No305     NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"No304     ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"No306     ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
"           ************************** ******  ****",
file = "exdna.txt", sep = "\n")
ex.dna3 <- read.dna("exdna.txt", format = "clustal")
### ... and in FASTA format
cat("> No305",
"NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"> No304",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"> No306",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna4 <- read.dna("exdna.txt", format = "fasta")
### The first three are the same!
identical(ex.dna, ex.dna2)
identical(ex.dna, ex.dna3)
identical(ex.dna, ex.dna4)
unlink("exdna.txt") # clean-up

gjuggler/ape documentation built on May 17, 2019, 6:03 a.m.