MaskAlignment: Mask Highly Variable Regions of An Alignment
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description Usage Arguments Details Value Author(s) References See Also Examples

Automatically masks poorly aligned regions of an alignment based on sequence conservation and gap frequency.

MaskAlignment(myXStringSet,
              type = "sequences",
              windowSize = 5,
              threshold = 1,
              maxFractionGaps = 0.2,
              includeTerminalGaps = FALSE,
              correction = FALSE,
              showPlot = FALSE)

`myXStringSet`	An `AAStringSet`, `DNAStringSet`, or `RNAStringSet` object of aligned sequences.
`type`	Character string indicating the type of result desired. This should be (an abbreviation of) one of `"sequences"`, `"ranges"`, or `"values"`. (See value section below.)
`windowSize`	Integer value specifying the size of the region to the left and right of the center-point to use in calculating the moving average.
`threshold`	Numeric giving the average entropy in bits below which a region is masked.
`maxFractionGaps`	Numeric specifying the maximum faction of gaps in an alignment column to be masked.
`includeTerminalGaps`	Logical specifying whether or not to include terminal gaps ("." or "-" characters on each end of the sequences) into the calculation of gap fraction.
`correction`	Logical indicating whether to apply a small-sample size correction to columns with few letters (Yu et al., 2015).
`showPlot`	Logical specifying whether or not to show a plot of the positions that were kept or masked.

Poorly aligned regions of a multiple sequence alignment may lead to incorrect results in downstream analyses. One method to mitigate their effects is to mask columns of the alignment that may be poorly aligned, such as highly-variable regions or regions with many insertions and deletions (gaps).

Highly variable regions are detected by their signature of having a low information content. Here, information content is defined by the relative entropy of a column in the alignment (Yu et al., 2015), which is higher for conserved columns. The relative entropy is based on the background distribution of letter-frequencies in the alignment.

A moving average of windowSize nucleotides to the left and right of the center-point is applied to smooth noise in the information content signal along the sequence. Regions dropping below threshold bits or more than maxFractionGaps are masked.

If type is "sequences" then a MultipleAlignment object of the input type with masked columns where the input criteria are met. Otherwise, if type is "ranges" then an IRanges object giving the start and end positions of the masked columns. Else (type is "values") a data.frame containing one row per site in the alignment and three columns of information:

`"entropy"`	The entropy score of each column, in units of bits.
`"gaps"`	For each column, the fraction of gap characters ("-" or ".").
`"mask"`	A logical vector indicating whether or not the column met the criteria for masking.

Erik Wright eswright@pitt.edu

Yu, Y.-K., et al. (2015). Log-odds sequence logos. Bioinformatics, 31(3), 324-331. http://doi.org/10.1093/bioinformatics/btu634

AlignSeqs, IdClusters

fas <- system.file("extdata", "Streptomyces_ITS_aligned.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
masked_dna <- MaskAlignment(dna, showPlot=TRUE)

# display only unmasked nucleotides for use in downstream analyses
not_masked <- as(masked_dna, "DNAStringSet")
BrowseSeqs(not_masked)

# display only masked nucleotides that are covered by the mask
masked <- masked_dna
colmask(masked, append="replace", invert=TRUE) <- colmask(masked)
masked <- as(masked, "DNAStringSet")
BrowseSeqs(masked)

# display the complete DNA sequence set including the mask
masks <- lapply(width(colmask(masked_dna)), rep, x="+")
masks <- unlist(lapply(masks, paste, collapse=""))
masked_dna <- replaceAt(dna, at=IRanges(colmask(masked_dna)), value=masks)
BrowseSeqs(masked_dna)

# get the start and end ranges of masked columns
ranges <- MaskAlignment(dna, type="ranges")
ranges
replaceAt(dna, ranges) # remove the masked columns

# obtain the entropy scores of each column
values <- MaskAlignment(dna, type="values")
head(values)

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: RSQLite
IRanges object with 13 ranges and 0 metadata columns:
           start       end     width
       <integer> <integer> <integer>
   [1]        19        19         1
   [2]       170       213        44
   [3]       230       230         1
   [4]       235       235         1
   [5]       267       286        20
   ...       ...       ...       ...
   [9]       378       394        17
  [10]       399       406         8
  [11]       418       418         1
  [12]       420       442        23
  [13]       495       496         2
DNAStringSet object of length 88:
     width seq                                              names               
 [1]   473 TGTACACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC supercont3.1 of S...
 [2]   473 NNNNCACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC supercont3.1 of S...
 [3]   473 TGTACACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC supercont1.1 of S...
 [4]   473 CGTACACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC supercont1.1 of S...
 [5]   473 TGTACACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC supercont1.1 of S...
 ...   ... ...
[84]   473 TGTACACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC gi|297189896|ref|...
[85]   473 TGTACACACCGCCCGTCACGTCA...GGGGTGTCCGAATGGGGAAACC gi|224581106|ref|...
[86]   473 TGTACACACCGCCCGTCACGTCA...GGGGTGTCCGAATGGGGAAACC gi|224581106|ref|...
[87]   473 TGTACACACCGCCCGTCACGTCA...GGGGTGTCCGAATGGGGAAACC gi|224581106|ref|...
[88]   473 TGTACACACCGCCCGTCACGTCA...GGGGTTTCCGAATGGGGAAACC gi|224581108|ref|...
   entropy gaps  mask
1 1.895882    0 FALSE
2 1.260362    0 FALSE
3 1.964568    0 FALSE
4 2.087429    0 FALSE
5 1.889936    0 FALSE
6 2.211969    0 FALSE

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.

DECIPHER index

Package overview Classify Sequences Design Group-Specific FISH Probes Design Group-Specific Primers Design Microarray Probes Design Primers That Yield Group-Specific Signatures Finding Chimeric Sequences Getting Started DECIPHERing The Art of Multiple Sequence Alignment in R The Magic of Gene Finding

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

MaskAlignment: Mask Highly Variable Regions of An Alignment
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Example output

Related to MaskAlignment in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER Tools for curating, analyzing, and manipulating biological sequences

MaskAlignment: Mask Highly Variable Regions of An Alignment In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Example output

Related to MaskAlignment in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

MaskAlignment: Mask Highly Variable Regions of An Alignment
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences