injectHardMask: Injecting a hard mask in a sequence


Description

injectHardMask allows the user to "fill" the masked regions of a sequence with an arbitrary letter (typically the "+" letter).

Usage

injectHardMask(x, letter="+")

Arguments

x

A MaskedXString or XStringViews object.

letter

A single letter.

Details

The name of the injectHardMask function was chosen because of the primary use that it is intended for: converting a pile of active "soft masks" into a "hard mask". Here the pile of active "soft masks" refers to the active masks that have been put on top of a sequence. In Biostrings, the original sequence and the masks defined on top of it are bundled together in one of the dedicated containers for this: the MaskedBString, MaskedDNAString, MaskedRNAString and MaskedAAString containers (this is the MaskedXString family of containers). The original sequence is always stored unmodified in a MaskedXString object so no information is lost. This allows the user to activate/deactivate masks without having to worry about losing the letters that are in the regions that are masked/unmasked. Also this allows better memory management since the original sequence never needs to be copied, even when the set of active/inactive masks changes.
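As a minimal sketch of this bundling, assuming the Biostrings package is installed and reusing the sequence and mask from the Examples section, the following puts a mask on a DNAString and then shows that the original letters remain available through unmasked():

```r
library(Biostrings)

# Same sequence and mask as in the Examples section.
x <- DNAString("ACACAACTAGATAGNACTNNGAGAGACGC")
masks(x) <- Mask(mask.width=29, start=c(3, 10, 25), width=c(6, 8, 5))

class(x)          # x is now a member of the MaskedXString family
maskedwidth(x)    # number of letters under active masks

# The original letters are still stored, unmodified.
original <- unmasked(x)
as.character(original)

# Dropping the masks on a copy gives back a plain DNAString.
x1 <- x
masks(x1) <- NULL
class(x1)
```

Nothing in this sketch modifies the stored sequence; only the masks placed on top of it change.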

However, there are situations where the user might want to really get rid of the letters that are in some particular regions by replacing them with a junk letter (e.g. "+") that is guaranteed not to interfere with the analysis currently being done. For example, it's very likely that a set of motifs or short reads will not contain the "+" letter (this could easily be checked), so they will never hit the regions filled with "+". In a way, the regions filled with "+" behave as if they were masked; we call this kind of masking "hard masking".
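The check mentioned above could look like the following sketch. The motifs vector is hypothetical and made up for illustration; the sequence and mask are the ones from the Examples section, and Biostrings is assumed to be installed.

```r
library(Biostrings)

x <- DNAString("ACACAACTAGATAGNACTNNGAGAGACGC")
masks(x) <- Mask(mask.width=29, start=c(3, 10, 25), width=c(6, 8, 5))

# Hypothetical motifs, made up for this illustration.
motifs <- c("ACTNNGAGA", "GATAG")

# Check that the junk letter does not occur in any motif
# (fixed=TRUE because "+" is a regex metacharacter).
junk <- "+"
junk_in_motifs <- any(grepl(junk, motifs, fixed=TRUE))
junk_in_motifs

# Fill the masked regions with the junk letter.
hard <- injectHardMask(x, letter=junk)
as.character(hard)
```

If junk_in_motifs were TRUE, a different letter could be passed via the letter argument.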

Some important differences between "soft" and "hard" masking:

injectHardMask creates a (modified) copy of the original sequence. Using "soft masking" does not.

A "mask aware" function like alphabetFrequency or matchPattern really skips the masked regions when "soft masking" is used, i.e. it does not walk through the regions that are under active masks. This might lead to some speed improvements when a high percentage of the original sequence is masked. With "hard masking", the entire sequence is walked through.

Matches cannot span over masked regions with "soft masking". With "hard masking" they can.
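One way to observe the second difference is to compare letter counts under both kinds of masking. This is only a sketch, assuming Biostrings is installed and reusing the sequence and mask from the Examples section:

```r
library(Biostrings)

x <- DNAString("ACACAACTAGATAGNACTNNGAGAGACGC")
masks(x) <- Mask(mask.width=29, start=c(3, 10, 25), width=c(6, 8, 5))
hard <- injectHardMask(x)

# Soft masking: only the letters outside the active masks are counted.
soft_counts <- alphabetFrequency(x)
sum(soft_counts)

# Hard masking: every position is counted, the filled ones as "+".
hard_counts <- alphabetFrequency(hard)
sum(hard_counts)
hard_counts[["+"]]
```

The soft-masked total should equal the sequence length minus maskedwidth(x), while the hard-masked total should equal the full sequence length.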

Value

An XString object of the same length as the original object x if x is a MaskedXString object, or of the same length as subject(x) if x is an XStringViews object.
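The XStringViews case can be checked with a short sketch, assuming Biostrings is installed. It uses the same views as part A of the Examples section, with the subject wrapped explicitly in BString():

```r
library(Biostrings)

v <- Views(BString("abCDefgHIJK"), start=c(8, 3), end=c(14, 4))
filled <- injectHardMask(v)

# The result is a plain XString, not a views object.
is(filled, "XString")

# Its length is that of the subject, even though one view
# extends past the end of the subject.
length(filled) == length(subject(v))
```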

Author(s)

H. Pagès

See Also

maskMotif, MaskedXString-class, replaceLetterAt, chartr, XString, XStringViews-class

Examples

library(Biostrings)

## ---------------------------------------------------------------------
## A. WITH AN XStringViews OBJECT
## ---------------------------------------------------------------------
v2 <- Views("abCDefgHIJK", start=c(8, 3), end=c(14, 4))
injectHardMask(v2)
injectHardMask(v2, letter="=")

## ---------------------------------------------------------------------
## B. WITH A MaskedXString OBJECT
## ---------------------------------------------------------------------
mask0 <- Mask(mask.width=29, start=c(3, 10, 25), width=c(6, 8, 5))
x <- DNAString("ACACAACTAGATAGNACTNNGAGAGACGC")
masks(x) <- mask0
x
subject <- injectHardMask(x)

## Matches can span over masked regions with "hard masking":
matchPattern("ACggggggA", subject, max.mismatch=6)
## but not with "soft masking":
matchPattern("ACggggggA", x, max.mismatch=6)

Example output

11-letter BString object
seq: ++CD+++HIJK
11-letter BString object
seq: ==CD===HIJK
29-letter MaskedDNAString object (# for masking)
seq: AC######A########TNNGAGA#####
masks:
  maskedwidth maskedratio active
1          19   0.6551724   TRUE
Views on a 29-letter DNAString subject
subject: AC++++++A++++++++TNNGAGA+++++
views:
      start end width
  [1]     1   9     9 [AC++++++A]
  [2]    16  24     9 [++TNNGAGA]
Views on a 29-letter DNAString subject
subject: ACACAACTAGATAGNACTNNGAGAGACGC
views:
      start end width
  [1]    16  24     9 [ACTNNGAGA]

Biostrings documentation built on Nov. 8, 2020, 11:12 p.m.