msaCheckNames: Check and fix sequence names

Description

This function checks and fixes the sequence names of multiple alignment objects if they contain characters that might cause LaTeX problems when using msaPrettyPrint.

Usage

    msaCheckNames(x, replacement=" ", verbose=TRUE)

Arguments

x

an object of class MultipleAlignment (which includes objects of classes MsaAAMultipleAlignment, MsaDNAMultipleAlignment, and MsaRNAMultipleAlignment)

replacement

a character string that is substituted for each potentially problematic character.

verbose

if TRUE (default), a warning message is shown if potentially problematic characters are found. Otherwise, the function silently replaces these characters (see details below).

Details

The Biostrings package does not impose any restrictions on the names of sequences. Consequently, msa also allows all possible ASCII strings as sequence (row) names in multiple alignments. As soon as msaPrettyPrint is used for pretty-printing multiple sequence alignments, however, the sequence names are interpreted as plain LaTeX source code. Consequently, LaTeX errors may arise because of characters or words in the sequence names that LaTeX does not or cannot interpret as plain text correctly. This particularly includes appearances of special characters and backslash characters in the sequence names.

The msaCheckNames function takes a multiple alignment object and checks its sequence names for potentially problematic characters. Only the following characters are considered safe: letters (upper and lower case), digits, spaces, commas, colons, semicolons, periods, question and exclamation marks, dashes, braces, single quotes, and double quotes. All other characters are considered problematic. The function can be used for both checking and fixing sequence names. If called with verbose=TRUE (default), it issues a warning if a problematic character is found. Regardless of the verbose argument, the function invisibly returns a copy of x in whose sequence names all problematic characters have been replaced by the string supplied via the replacement argument (the default is a single space).
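The character rule described above can be approximated in base R with a single regular expression. The following is a minimal sketch, not the package's implementation: the helper names has_problem_chars and replace_problem_chars are hypothetical, "dashes" is taken to mean the ASCII hyphen, and "braces" is assumed to mean round parentheses, which the text does not confirm.

```r
## Hypothetical helpers (not part of msa) that mimic the rule described above.
## Assumptions: "braces" = round parentheses, "dashes" = ASCII hyphen.
allowed_complement <- "[^A-Za-z0-9 ,:;.?!()'\"-]"

## TRUE for every name containing at least one character outside the allowed set
has_problem_chars <- function(nms) {
    grepl(allowed_complement, nms, perl=TRUE)
}

## replace every character outside the allowed set
replace_problem_chars <- function(nms, replacement=" ") {
    gsub(allowed_complement, replacement, nms, perl=TRUE)
}

## "\\2" is a literal backslash followed by the digit 2
nms <- c("Seq. #1", "Seq. \\2", "Seq. ~3", "ok-name (v2)")

flagged <- has_problem_chars(nms)
fixed <- replace_problem_chars(nms, replacement="")

print(flagged)  ## the first three names are flagged, the last one is not
print(fixed)    ## "Seq. 1" "Seq. 2" "Seq. 3" "ok-name (v2)"
```

Running such a check on the names of the input sequences before alignment is one way to follow the advice of avoiding problematic names from the beginning.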

In any case, the best solution is to check sequence names carefully and to avoid problematic sequence names from the beginning.

Value

The function invisibly returns a copy of the argument x (therefore, an object of the same class as x), but with modified sequence/row names (see details above).
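Because the result is returned invisibly, calling the function at top level without an assignment does not auto-print anything. The result has to be assigned (or printed explicitly) to be used. The sketch below illustrates this return convention in base R only. The function tidy_names is a hypothetical toy stand-in, not msaCheckNames itself.

```r
## Toy stand-in (hypothetical, not part of msa) that returns its result
## invisibly, following the same convention as described above.
tidy_names <- function(nms, replacement=" ") {
    invisible(gsub("#", replacement, nms, fixed=TRUE))
}

nms <- c("Seq. #1", "Seq. 2")

tidy_names(nms)                    ## nothing is auto-printed at top level
fixed <- tidy_names(nms, "")       ## assignment captures the result
print(fixed)                       ## "Seq. 1" "Seq. 2"

## withVisible() reports whether a value would have been auto-printed
vis <- withVisible(tidy_names(nms))$visible
print(vis)                         ## FALSE
```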

Author(s)

Ulrich Bodenhofer <msa@bioinf.jku.at>

References

http://www.bioinf.jku.at/software/msa

U. Bodenhofer, E. Bonatesta, C. Horejs-Kainrath, and S. Hochreiter (2015). msa: an R package for multiple sequence alignment. Bioinformatics 31(24):3997-3999. DOI: 10.1093/bioinformatics/btv494.

See Also

msaPrettyPrint, MsaAAMultipleAlignment, MsaDNAMultipleAlignment, MsaRNAMultipleAlignment

Examples

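## load the package (also attaches Biostrings)
library(msa)

## create toy example
mySeqs <- DNAStringSet(c("ACGATCGATC", "ACGACGATC", "ACGATCCCCC"))
## note: "\2" is an octal escape (control character 0x02), not a literal backslash
names(mySeqs) <- c("Seq. #1", "Seq. \2", "Seq. ~3")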

## perform multiple alignment
myAlignment <- msa(mySeqs)
myAlignment

## check names
msaCheckNames(myAlignment)

## fix names
myAlignment <- msaCheckNames(myAlignment, replacement="", verbose=FALSE)
myAlignment

Example output

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, basename, cbind, colMeans, colSums, colnames,
    dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
    rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

use default substitution matrix
CLUSTAL 2.1  

Call:
   msa(mySeqs)

MsaDNAMultipleAlignment with 3 rows and 10 columns
    aln        names
[1] ACGACGATC- Seq. 
[2] ACGATCCCCC Seq. ~3
[3] ACGATCGATC Seq. #1
Con ACGATC??CC Consensus 
sequence names contain invalid characters
CLUSTAL 2.1  

Call:
   msa(mySeqs)

MsaDNAMultipleAlignment with 3 rows and 10 columns
    aln        names
[1] ACGACGATC- Seq. 
[2] ACGATCCCCC Seq. 3
[3] ACGATCGATC Seq. 1
Con ACGATC??CC Consensus 

msa documentation built on Nov. 8, 2020, 5:41 p.m.