read.msa: Reading an MSA Object
In rphast: Interface to 'PHAST' Software for Comparative Genomics


Reads a multiple sequence alignment (MSA) from a file.

read.msa(filename, format = c(guess.format.msa(filename), "FASTA")[1],
  alphabet = NULL, features = NULL, do.4d = FALSE, ordered = (do.4d ==
  FALSE && is.null(features)), tuple.size = (if (do.4d) 3 else NULL),
  do.cats = NULL, refseq = NULL, offset = 0, seqnames = NULL,
  discard.seqnames = NULL, pointer.only = FALSE)
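
Several defaults in the usage above are expressions that depend on other arguments. The following base-R sketch mirrors those expressions so they can be inspected without reading a file. `resolve.defaults` is a hypothetical helper, not part of rphast, and the `guessed.format` argument stands in for whatever `guess.format.msa(filename)` returns; that an unsuccessful guess comes back as `NULL` is an assumption suggested by the `c(..., "FASTA")[1]` idiom.

```r
# Hypothetical helper (not part of rphast): reproduces the default-argument
# expressions from the read.msa usage line.
resolve.defaults <- function(guessed.format = NULL, features = NULL,
                             do.4d = FALSE) {
  list(
    # c(NULL, "FASTA")[1] is "FASTA", so an empty guess falls back to FASTA
    format = c(guessed.format, "FASTA")[1],
    # ordered defaults to TRUE only when neither do.4d nor features is used
    ordered = (do.4d == FALSE && is.null(features)),
    # codon-sized tuples when extracting four-fold degenerate sites
    tuple.size = (if (do.4d) 3 else NULL)
  )
}

# A guessed format wins over the FASTA fallback; the alignment stays ordered
d1 <- resolve.defaults(guessed.format = "MAF")

# With features and do.4d, the result is unordered and tuple.size is 3
# (data.frame() is only a placeholder for a real `feat` object)
d2 <- resolve.defaults(features = data.frame(), do.4d = TRUE)
```

Passing `format`, `ordered`, or `tuple.size` explicitly overrides these defaults.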

`filename`	The name of the input file containing an alignment.
`format`	Input file format: one of "FASTA", "MAF", "SS", "PHYLIP", "MPM". Must be correctly specified.
`alphabet`	The alphabet of non-missing-data characters in the alignment. Determined automatically from the alignment if not given.
`features`	An object of type `feat`. If provided, the return value will only contain portions of the alignment which fall within a feature. The alignment will not be ordered. The loaded regions can be further constrained with the do.4d or do.cats options. Note that if this object is passed as a pointer to a structure stored in C, the values will be altered by this function!
`do.4d`	Logical. If `TRUE`, the return value will contain only the columns corresponding to four-fold degenerate sites. Requires features to be specified.
`ordered`	Logical. If `FALSE`, the MSA object may not retain the original column order.
`tuple.size`	Integer. If given, and if pointer.only is `TRUE`, the MSA will be stored in sufficient statistics format, where each tuple contains tuple.size consecutive columns of the alignment.
`do.cats`	Character vector if features is provided; integer vector if cats.cycle is provided. If given, only the types of features named here will be represented in the (unordered) return alignment.
`refseq`	Character string specifying a FASTA format file with a reference sequence. If given, the reference sequence will be "filled in" wherever it is missing from the alignment.
`offset`	An integer giving the offset of the reference sequence from the beginning of the chromosome. Not used for MAF or SS format.
`seqnames`	A character vector. If provided, discard any sequence in the msa that is not named here. This is only implemented efficiently for MAF input files, but in this case, the reference sequence must be named.
`discard.seqnames`	A character vector. If provided, discard the sequences named here. This is only implemented efficiently for MAF input files, but in this case, the reference sequence must NOT be discarded.
`pointer.only`	If `TRUE`, MSA will be stored by reference as an external pointer to an object created by C code, rather than directly in R memory. This improves performance and may be necessary for large alignments, but reduces functionality. See `msa` for more details on MSA object storage options.
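
As a rough illustration of the `pointer.only` option, the sketch below reads the same FASTA file from the package's example archive twice, once into R memory and once as an external pointer, and prints the column count of each. It assumes rphast is installed and does nothing otherwise; the variable names `m.r` and `m.ptr` and the use of `tempdir()` are choices made for this sketch only.

```r
# Sketch only: runs when the rphast package is available.
if (requireNamespace("rphast", quietly = TRUE)) {
  library(rphast)

  # Extract the example FASTA alignment into a temporary directory
  exampleArchive <- system.file("extdata", "examples.zip", package = "rphast")
  unzip(exampleArchive, "ENr334-100k.fa", exdir = tempdir())
  fa <- file.path(tempdir(), "ENr334-100k.fa")

  # Default storage: the alignment is held in R memory
  m.r <- read.msa(fa, offset = 41405894)

  # Stored by reference as an external pointer to an object created in C
  m.ptr <- read.msa(fa, offset = 41405894, pointer.only = TRUE)

  # Both objects describe the same alignment
  print(ncol.msa(m.r))
  print(ncol.msa(m.ptr))

  unlink(fa)  # clean up
}
```

See `msa` for which operations remain available on a pointer-stored object.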

An MSA object.

If the input is in "MAF" format and features is specified, the resulting alignment will be stripped of gaps in the reference (1st) sequence.

Melissa J. Hubisz and Adam Siepel

msa, read.feat

exampleArchive <- system.file("extdata", "examples.zip", package="rphast")
files <- c("ENr334-100k.maf", "ENr334-100k.fa", "gencode.ENr334-100k.gff")
unzip(exampleArchive, files)

# Read a FASTA file, ENr334-100k.fa
# This file represents a 4-way alignment of the ENCODE region
# ENr334 starting from hg18 chr6 position 41405894
idx.offset <- 41405894
m1 <- read.msa("ENr334-100k.fa", offset=idx.offset)
m1

# Now read in only a subset represented in a feature file
f <- read.feat("gencode.ENr334-100k.gff")
f$seqname <- "hg18"  # need to tweak source name to match name in alignment
m1 <- read.msa("ENr334-100k.fa", features=f, offset=idx.offset)

# Can also subset on certain features
do.cats <- c("CDS", "5'flank", "3'flank")
m1 <- read.msa("ENr334-100k.fa", features=f, offset=idx.offset,
               do.cats=do.cats)

# Can read MAFs similarly, but don't need offset because
# MAF file is annotated with coordinates
m2 <- read.msa("ENr334-100k.maf", features=f, do.cats=do.cats)
# Also, note that when features is given and the file is
# in MAF format, the first sequence is automatically
# stripped of gaps
ncol.msa(m1)
ncol.msa(m2)
ncol.msa(m1, "hg18")

unlink(files) # clean up

msa object with 4 sequences and 105513 columns, stored in R
$names
[1] "hg18"    "canFam2" "mm9"     "rn4"    

$alphabet
[1] "ACGT"

$is.ordered
[1] TRUE

$offset
[1] 41405894

(alignment output suppressed)

[1] 1419
[1] 1375
[1] 1375