Reading an MSA Object

Description

Reads an MSA from a file.

Usage

read.msa(filename, format = c(guess.format.msa(filename), "FASTA")[1],
  alphabet = NULL, features = NULL, do.4d = FALSE, ordered = (do.4d ==
  FALSE && is.null(features)), tuple.size = (if (do.4d) 3 else NULL),
  do.cats = NULL, refseq = NULL, offset = 0, seqnames = NULL,
  discard.seqnames = NULL, pointer.only = FALSE)

Arguments

filename

The name of the input file containing an alignment.

format

Input file format: one of "FASTA", "MAF", "SS", "PHYLIP", or "MPM". By default the format is guessed with guess.format.msa; if given explicitly, it must be correctly specified.
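As a minimal sketch of the two ways to set format, the snippet below writes a toy FASTA file to a temporary path, asks guess.format.msa what it detects, and then reads the file with the format given explicitly. The sequence names and bases are made up for illustration, and the sketch assumes the rphast package is installed.

```r
library(rphast)

# Toy two-sequence alignment; names and bases are illustrative only
fa <- tempfile(fileext = ".fa")
writeLines(c(">human", "ACGTACGT", ">mouse", "ACGTAC-T"), fa)

# Ask rphast which format it detects for this file
fmt <- guess.format.msa(fa)
print(fmt)

# Passing the format explicitly skips the guess
m <- read.msa(fa, format = "FASTA")
print(ncol.msa(m))

unlink(fa)  # clean up
```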

alphabet

The alphabet of non-missing-data characters in the alignment. Determined automatically from the alignment if not given.

features

An object of type feat. If provided, the return value will only contain portions of the alignment which fall within a feature. The alignment will not be ordered. The loaded regions can be further constrained with the do.4d or do.cats options. Note that if this object is passed as a pointer to a structure stored in C, the values will be altered by this function!

do.4d

Logical. If TRUE, the return value will contain only the columns corresponding to four-fold degenerate sites. Requires features to be specified.

ordered

Logical. If FALSE, the MSA object may not retain the original column order.

tuple.size

Integer. If given, and if pointer.only is TRUE, MSA will be stored in sufficient statistics format, where each tuple contains tuple.size consecutive columns of the alignment.

do.cats

Character vector if features is provided; integer vector if cats.cycle is provided. If given, only the types of features named here will be represented in the (unordered) return alignment.

refseq

Character string specifying a FASTA format file with a reference sequence. If given, the reference sequence will be "filled in" wherever it is missing from the alignment.

offset

An integer giving the offset of the reference sequence from the beginning of the chromosome. Not used for MAF or SS format.

seqnames

A character vector. If provided, discard any sequence in the msa that is not named here. This is only implemented efficiently for MAF input files, but in this case, the reference sequence must be named.

discard.seqnames

A character vector. If provided, discard the sequences named here. This is only implemented efficiently for MAF input files, but in this case, the reference sequence must NOT be discarded.
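The seqnames and discard.seqnames arguments can be illustrated with a small sketch. It assumes the rphast package is installed and uses a made-up three-sequence FASTA file; per the descriptions above, filtering is not implemented efficiently for FASTA input, which does not matter at this size.

```r
library(rphast)

# Toy three-sequence alignment; names and bases are illustrative only
fa <- tempfile(fileext = ".fa")
writeLines(c(">human", "ACGTACGT",
             ">mouse", "ACGTAC-T",
             ">rat",   "ACGAACGT"), fa)

# Keep only the sequences named in seqnames
keep <- read.msa(fa, format = "FASTA", seqnames = c("human", "mouse"))
print(names(keep))

# Alternatively, name the sequences to throw away
drop <- read.msa(fa, format = "FASTA", discard.seqnames = "rat")
print(names(drop))

unlink(fa)  # clean up
```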

pointer.only

If TRUE, MSA will be stored by reference as an external pointer to an object created by C code, rather than directly in R memory. This improves performance and may be necessary for large alignments, but reduces functionality. See msa for more details on MSA object storage options.
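A short sketch of the two storage modes follows, assuming the rphast package is installed and using a made-up FASTA file. It reads the same alignment once into R memory and once as an external pointer, then queries both with ncol.msa.

```r
library(rphast)

# Toy two-sequence alignment; names and bases are illustrative only
fa <- tempfile(fileext = ".fa")
writeLines(c(">human", "ACGTACGT", ">mouse", "ACGTAC-T"), fa)

# Default: the alignment is stored directly in R memory
m.r <- read.msa(fa, format = "FASTA")

# pointer.only = TRUE: R holds only a reference to an object created in C
m.ptr <- read.msa(fa, format = "FASTA", pointer.only = TRUE)

# Both objects can be queried for their number of columns
print(ncol.msa(m.r))
print(ncol.msa(m.ptr))

unlink(fa)  # clean up
```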

Value

An MSA object.

Note

If the input is in "MAF" format and features is specified, the resulting alignment will be stripped of gaps in the reference (1st) sequence.

Author(s)

Melissa J. Hubisz and Adam Siepel

See Also

msa, read.feat

Examples

exampleArchive <- system.file("extdata", "examples.zip", package="rphast")
files <- c("ENr334-100k.maf", "ENr334-100k.fa", "gencode.ENr334-100k.gff")
unzip(exampleArchive, files)

# Read a fasta file, ENr334-100k.fa
# this file represents a 4-way alignment of the encode region
# ENr334 starting from hg18 chr6 position 41405894
idx.offset <- 41405894
m1 <- read.msa("ENr334-100k.fa", offset=idx.offset)
m1

# Now read in only a subset represented in a feature file
f <- read.feat("gencode.ENr334-100k.gff")
f$seqname <- "hg18"  # need to tweak source name to match name in alignment
m1 <- read.msa("ENr334-100k.fa", features=f, offset=idx.offset)

# Can also subset on certain features
do.cats <- c("CDS", "5'flank", "3'flank")
m1 <- read.msa("ENr334-100k.fa", features=f, offset=idx.offset,
               do.cats=do.cats)

# Can read MAFs similarly, but don't need offset because
# MAF file is annotated with coordinates
m2 <- read.msa("ENr334-100k.maf", features=f, do.cats=do.cats)
# Also, note that when features is given and the file is
# in MAF format, the first sequence is automatically
# stripped of gaps
ncol.msa(m1)
ncol.msa(m2)
ncol.msa(m1, "hg18")

unlink(files) # clean up