MultipleAlignment-class: MultipleAlignment objects
In Bioconductor/Biostrings: Efficient manipulation of biological strings

Description

The MultipleAlignment class is a container for storing multiple sequence alignments.

Usage

## Constructors:
DNAMultipleAlignment(x=character(), start=NA, end=NA, width=NA,
    use.names=TRUE, rowmask=NULL, colmask=NULL)
RNAMultipleAlignment(x=character(), start=NA, end=NA, width=NA,
    use.names=TRUE, rowmask=NULL, colmask=NULL)
AAMultipleAlignment(x=character(), start=NA, end=NA, width=NA,
    use.names=TRUE, rowmask=NULL, colmask=NULL)

## Read functions:
readDNAMultipleAlignment(filepath, format)
readRNAMultipleAlignment(filepath, format)
readAAMultipleAlignment(filepath, format)

## Write functions:
write.phylip(x, filepath)

## ... and more (see below)

Arguments

`x`	Either a character vector (with no NAs), or an XString, XStringSet or XStringViews object containing strings with the same number of characters. For `write.phylip`, `x` must be a MultipleAlignment object.
`start`, `end`, `width`	Either `NA`, a single integer, or an integer vector of the same length as `x` specifying how `x` should be "narrowed" (see `?narrow` in the IRanges package for the details).
`use.names`	`TRUE` or `FALSE`. Should names be preserved?
`filepath`	A character vector (of arbitrary length when reading, of length 1 when writing) containing the paths to the files to read or write. Note that special values like `""` or `"|cmd"` (typically supported by other I/O functions in R) are not supported here. Also `filepath` cannot be a connection.
`format`	Either `"fasta"` (the default), `"stockholm"`, `"phylip"`, or `"clustal"`.
`rowmask`	A NormalIRanges object specifying the rows to mask.
`colmask`	A NormalIRanges object specifying the columns to mask.
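
As a minimal sketch of how the constructor arguments fit together, the call below builds a small alignment from a named character vector. The toy sequences and object names are made up for illustration, and the masks are coerced with as() on the assumption that a NormalIRanges is what the constructor expects.

```r
library(Biostrings)

## Three toy sequences of equal length (hypothetical data)
seqs <- c(seqA = "ACGT--ACGT",
          seqB = "ACGA--ACGT",
          seqC = "AC-T--ACGA")

## Mask row 3 and columns 5-6 at construction time
aln <- DNAMultipleAlignment(seqs,
                            rowmask = as(IRanges(3, 3), "NormalIRanges"),
                            colmask = as(IRanges(5, 6), "NormalIRanges"))
rownames(aln)   # names are kept because use.names=TRUE
rowmask(aln)
colmask(aln)

## start/end narrow every sequence before the alignment is built
narrowed <- DNAMultipleAlignment(seqs, start = 7, end = 10)
ncol(narrowed)
```

Narrowing is applied to every input string, so the narrowed strings must still share a common width.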

Details

The MultipleAlignment class is designed to hold and represent multiple sequence alignments. The rows and columns within an alignment can be masked for ad hoc analyses.

Accessor methods

In the code snippets below, x is a MultipleAlignment object.

unmasked(x):

The underlying XStringSet object containing the multiple sequence alignment.

rownames(x):

NULL or a character vector of the same length as x containing a short user-provided description or comment for each sequence in x.

rowmask(x), rowmask(x, append, invert) <- value:

Gets and sets the NormalIRanges object representing the masked rows in x. The append argument takes "union", "replace" or "intersect" to indicate how to combine the new value with rowmask(x). The invert argument takes a single logical value indicating whether to invert the new mask before it is combined. The value argument can be of any class that is coercible to a NormalIRanges via the as function.

colmask(x), colmask(x, append, invert) <- value:

Gets and sets the NormalIRanges object representing the masked columns in x. The append argument takes "union", "replace" or "intersect" to indicate how to combine the new value with colmask(x). The invert argument takes a single logical value indicating whether to invert the new mask before it is combined. The value argument can be of any class that is coercible to a NormalIRanges via the as function.
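
The effect of append and invert is easiest to see on a small alignment. The following is a sketch on made-up toy sequences (the object names are hypothetical), showing how each setting changes the column mask; the row mask setter is expected to behave the same way.

```r
library(Biostrings)

aln <- DNAMultipleAlignment(c(seqA = "ACGT--ACGT",
                              seqB = "ACGA--ACGT",
                              seqC = "AC-T--ACGA"))

## Plain assignment replaces the mask: columns 1-4 are masked
colmask(aln) <- IRanges(1, 4)

## "union": add columns 7-8 to the existing mask (1-4 and 7-8)
m_union <- aln
colmask(m_union, append = "union") <- IRanges(7, 8)

## "intersect": keep only masked columns that also fall in 3-7 (3-4 and 7)
m_intersect <- m_union
colmask(m_intersect, append = "intersect") <- IRanges(3, 7)

## invert=TRUE: mask everything except columns 5-6 (1-4 and 7-10)
m_invert <- aln
colmask(m_invert, invert = TRUE) <- IRanges(5, 6)

colmask(m_union)
colmask(m_intersect)
colmask(m_invert)
```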

maskMotif(x, motif, min.block.width=1, ...):

Returns a MultipleAlignment object with a modified column mask based upon motifs found in the consensus string. The consensus string is computed over all columns but excludes the masked rows.

motif: The motif to mask.
min.block.width: The minimum width of the blocks to mask.
...: Additional arguments for matchPattern.

maskGaps(x, min.fraction, min.block.width):

Returns a MultipleAlignment object with a modified column mask based upon gaps in the columns. In particular, this mask is defined by min.block.width or more consecutive columns that have min.fraction or more of their non-masked rows containing gap codes.

min.fraction: A value in [0, 1] that indicates the minimum fraction needed to call a gap in the consensus string (default is 0.5).
min.block.width: A positive integer that indicates the minimum number of consecutive gaps to mask, as defined by min.fraction (default is 4).
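
To make the two thresholds concrete, here is a small sketch on made-up sequences (all names are hypothetical). Columns 5-9 have gaps in two of three rows and form a block of width 5, while column 12 has a single gap in one row, so only the first block should satisfy min.fraction=0.5 and min.block.width=4.

```r
library(Biostrings)

## Toy alignment: a 5-column gap block plus one isolated gap
gapped <- DNAMultipleAlignment(c(seqA = "ACGT-----ACGT",
                                 seqB = "ACGT-----AC-T",
                                 seqC = "ACGTAAAAAACGT"))

masked <- maskGaps(gapped, min.fraction = 0.5, min.block.width = 4)

## Inspect which columns were masked
colmask(masked)
```

Lowering min.block.width to 1 would still leave column 12 unmasked here, because only one third of its rows contain a gap.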

nrow(x):

Returns the number of sequences aligned in x.

ncol(x):

Returns the number of columns in x, i.e. the number of characters in each aligned sequence.

dim(x):

Equivalent to c(nrow(x), ncol(x)).

maskednrow(x):

Returns the number of masked aligned sequences in x.

maskedncol(x):

Returns the number of masked aligned characters in x.

maskeddim(x):

Equivalent to c(maskednrow(x), maskedncol(x)).

maskedratio(x):

Equivalent to maskeddim(x) / dim(x).

nchar(x):

Returns the number of unmasked aligned characters in x, i.e. ncol(x) - maskedncol(x).

alphabet(x):

Equivalent to alphabet(unmasked(x)).
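
The relationships between the dimension accessors above can be checked on a small example. This is only a sketch on made-up sequences, with one masked row and two masked columns.

```r
library(Biostrings)

aln <- DNAMultipleAlignment(c(seqA = "ACGT--ACGT",
                              seqB = "ACGA--ACGT",
                              seqC = "AC-T--ACGA"))
rowmask(aln) <- IRanges(3, 3)   # mask the third sequence
colmask(aln) <- IRanges(5, 6)   # mask the two all-gap columns

dim(aln)          # rows and columns of the underlying alignment
maskeddim(aln)    # number of masked rows and masked columns
maskedratio(aln)  # maskeddim(aln) / dim(aln)
nchar(aln)        # ncol(aln) - maskedncol(aln)
```

Masking never changes dim(x); it only changes the masked counts and whatever is derived from them.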

Coercion

In the code snippets below, x is a MultipleAlignment object.

as(from, "DNAStringSet"), as(from, "RNAStringSet"), as(from, "AAStringSet"), as(from, "BStringSet"): Creates an instance of the specified XStringSet subtype that contains the unmasked regions of the multiple sequence alignment in from, where from is a MultipleAlignment object.
as.character(x, use.names): Converts x to a character vector containing the unmasked regions of the multiple sequence alignment. use.names controls whether or not rownames(x) should be used to set the names of the returned vector (default is TRUE).
as.matrix(x, use.names): Returns a character matrix containing the "exploded" representation of the unmasked regions of the multiple sequence alignment. use.names controls whether or not rownames(x) should be used to set the row names of the returned matrix (default is TRUE).
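
The following sketch, on made-up sequences, illustrates that all three coercions drop the masked rows and columns rather than marking them.

```r
library(Biostrings)

aln <- DNAMultipleAlignment(c(seqA = "ACGT--ACGT",
                              seqB = "ACGA--ACGT",
                              seqC = "AC-T--ACGA"))
rowmask(aln) <- IRanges(3, 3)   # hide seqC
colmask(aln) <- IRanges(5, 6)   # hide the gap columns

chr <- as.character(aln)        # named character vector, 2 strings of 8 letters
mat <- as.matrix(aln)           # one letter per cell, 2 rows by 8 columns
set <- as(aln, "DNAStringSet")  # XStringSet holding the same unmasked letters

chr
dim(mat)
set
```

Use unmasked(x) instead when the full alignment is needed regardless of the masks.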

Utilities

In the code snippets below, x is a MultipleAlignment object.

consensusMatrix(x, as.prob, baseOnly): Creates an integer matrix containing the column frequencies of the underlying alphabet with masked columns being represented with NA values. If as.prob is TRUE, then probabilities are reported, otherwise counts are reported (the default). If baseOnly is TRUE, then the non-base letters are collapsed into an "other" category.
consensusString(x, ...): Creates a consensus string for x with the symbol "#" representing a masked column. See consensusString for details on the arguments.
consensusViews(x, ...): Similar to the consensusString method. It returns an XStringViews on the consensus string containing subsequence contigs of non-masked columns. Unlike the consensusString method, the masked columns in the underlying string contain a consensus value rather than the "#" symbol.
alphabetFrequency(x, as.prob, collapse): Creates an integer matrix containing the row frequencies of the underlying alphabet. If as.prob is TRUE, then probabilities are reported, otherwise counts are reported (the default). If collapse is TRUE, then returns the overall frequency instead of the frequency by row.
detail(x, invertColMask, hideMaskedCols): Opens a pager-driven display of the full object so that the entire alignment can be visually inspected with masked rows and columns removed. If hideMaskedCols is set to its default value of TRUE, the output hides all the masked columns. Otherwise, all columns are displayed along with a row indicating the masking status. If invertColMask is TRUE, any displayed mask is flipped so that it is consistent with Phylip-style files rather than with the mask actually stored in the MultipleAlignment object. Note that invertColMask is ignored when hideMaskedCols is TRUE, since in that case no masking information is shown. Masked rows are always hidden in the output.
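
As a small sketch of how masks show up in the consensus utilities, the toy alignment below (made-up sequences and names) masks one row and two columns, then inspects the consensus matrix and string.

```r
library(Biostrings)

aln <- DNAMultipleAlignment(c(seqA = "ACGT--ACGT",
                              seqB = "ACGA--ACGT",
                              seqC = "AC-T--ACGA"))
rowmask(aln) <- IRanges(3, 3)
colmask(aln) <- IRanges(5, 6)

## Counts per column; the masked columns 5 and 6 are reported as NA
cm <- consensusMatrix(aln, baseOnly = TRUE)
cm[, 4:7]

## One symbol per column; masked columns appear as "#"
cs <- consensusString(aln)
cs
```

Because the masked row is excluded, only seqA and seqB contribute to the counts and to the consensus.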

Display

The letters in a DNAMultipleAlignment or RNAMultipleAlignment object are colored when displayed by the show() method. Set global option Biostrings.coloring to FALSE to turn off this coloring.
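
One way to switch the coloring off temporarily is sketched below. The toy alignment is made up, and saving and restoring the previous option value is a general R idiom rather than something this page prescribes.

```r
library(Biostrings)

aln <- DNAMultipleAlignment(c(seqA = "ACGT--ACGT",
                              seqB = "ACGA--ACGT"))

## Turn coloring off, remembering the previous setting
old <- options(Biostrings.coloring = FALSE)
coloring_off <- getOption("Biostrings.coloring")
aln   # displayed without colors

## Restore whatever was set before
options(old)
```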

Author(s)

P. Aboyoun and M. Carlson

Examples

## create an object from file
origMAlign <-
  readDNAMultipleAlignment(filepath =
                           system.file("extdata",
                                       "msx2_mRNA.aln",
                                       package="Biostrings"),
                           format="clustal")

## list the names of the sequences in the alignment
rownames(origMAlign)

## rename the sequences to be the underlying species for MSX2
rownames(origMAlign) <- c("Human","Chimp","Cow","Mouse","Rat",
                          "Dog","Chicken","Salmon")
origMAlign

## See a detailed pager view
if (interactive()) {
  detail(origMAlign)
}

## operations to mask rows
## For columns, just use colmask() and do the same kinds of operations
rowMasked <- origMAlign
rowmask(rowMasked) <- IRanges(start=1,end=3)
rowMasked

## remove row masks
rowmask(rowMasked) <- NULL
rowMasked

## "select" rows of interest
rowmask(rowMasked, invert=TRUE) <- IRanges(start=4,end=7)
rowMasked

## or mask the rows that intersect with masked rows
rowmask(rowMasked, append="intersect") <- IRanges(start=1,end=5)
rowMasked

## TATA-masked
tataMasked <- maskMotif(origMAlign, "TATA")
colmask(tataMasked)

## automatically mask columns based on consecutive gaps
autoMasked <- maskGaps(origMAlign, min.fraction=0.5, min.block.width=4)
colmask(autoMasked)
autoMasked

## calculate frequencies
alphabetFrequency(autoMasked)
consensusMatrix(autoMasked, baseOnly=TRUE)[, 84:90]

## get consensus values
consensusString(autoMasked)
consensusViews(autoMasked)

## cluster the masked alignments
library(pwalign)
sdist <- pwalign::stringDist(as(autoMasked,"DNAStringSet"), method="hamming")
clust <- hclust(sdist, method = "single")
plot(clust)
fourgroups <- cutree(clust, 4)
fourgroups

## write out the alignment object (with current masks) to Phylip format
write.phylip(x = autoMasked, filepath = tempfile("foo.txt",tempdir()))

Bioconductor/Biostrings documentation built on June 10, 2025, 1:14 p.m.