Description Usage Arguments Details Accessor-like methods Subsequence extraction and related transformations Subsetting and appending Set operations Other methods Display Author(s) See Also Examples
The BStringSet class is a container for storing a set of
BString
objects and for making its manipulation
easy and efficient.
Similarly, the DNAStringSet (or RNAStringSet, or AAStringSet) class is
a container for storing a set of DNAString
(or RNAString
, or AAString
) objects.
All those containers derive directly (and with no additional slots) from the XStringSet virtual class.
1 2 3 4 5 6 7 8 9 10 11 12 13 | ## Constructors:
BStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
DNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
RNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
AAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
## Accessor-like methods:
## S4 method for signature 'character'
width(x)
## S4 method for signature 'XStringSet'
nchar(x, type="chars", allowNA=FALSE)
## ... and more (see below)
|
x |
Either a character vector (with no NAs), or an XString, XStringSet or XStringViews object. |
start,end,width |
Either |
use.names |
|
type,allowNA |
Ignored. |
The BStringSet
, DNAStringSet
, RNAStringSet
and
AAStringSet
functions are constructors that can be used to
turn input x
into an XStringSet object of the desired base type.
They also allow the user to "narrow" the sequences contained in x
via proper use of the start
, end
and/or width
arguments. In this context, "narrowing" means dropping a prefix or/and
a suffix of each sequence in x
.
The "narrowing" capabilities of these constructors can be illustrated
by the following property: if x
is a character vector
(with no NAs), or an XStringSet (or XStringViews) object,
then the 3 following transformations are equivalent:
BStringSet(x, start=mystart, end=myend, width=mywidth)
subseq(BStringSet(x), start=mystart, end=myend, width=mywidth)
BStringSet(subseq(x, start=mystart, end=myend, width=mywidth))
Note that, besides being more convenient, the first form is also more efficient on character vectors.
In the code snippets below,
x
is an XStringSet object.
length(x)
:
The number of sequences in x
.
width(x)
:
A vector of non-negative integers containing the number
of letters for each element in x
.
Note that width(x)
is also defined for a character vector
with no NAs and is equivalent to nchar(x, type="bytes")
.
names(x)
:
NULL
or a character vector of the same length as x
containing a short user-provided description or comment for each
element in x
.
These are the only data in an XStringSet object that can safely
be changed by the user. All the other data are immutable!
As a general recommendation, the user should never try to modify
an object by accessing its slots directly.
alphabet(x)
:
Return NULL
, DNA_ALPHABET
,
RNA_ALPHABET
or AA_ALPHABET
depending on
whether x
is a BStringSet, DNAStringSet, RNAStringSet or
AAStringSet object.
nchar(x)
:
The same as width(x)
.
In the code snippets below,
x
is a character vector (with no NAs),
or an XStringSet (or XStringViews) object.
subseq(x, start=NA, end=NA, width=NA)
:
Applies subseq
on each element in x
.
See ?subseq
for the details.
Note that this is similar to what substr
does on a
character vector. However there are some noticeable differences:
(1) the arguments are start
and stop
for
substr
;
(2) the SEW interface (start/end/width) interface of subseq
is richer (e.g. support for negative start or end values);
and (3) subseq
checks that the specified start/end/width values
are valid i.e., unlike substr
, it throws an error if
they define "out of limits" subsequences or subsequences with a
negative width.
narrow(x, start=NA, end=NA, width=NA, use.names=TRUE)
:
Same as subseq
. The only differences are: (1) narrow
has a use.names
argument; and (2) all the things narrow
and subseq
work on
(IRanges, XStringSet or
XStringViews objects for narrow
,
XVector or XStringSet objects for
subseq
). But they both work and do the same thing on an
XStringSet object.
threebands(x, start=NA, end=NA, width=NA)
:
Like the method for IRanges
objects, the
threebands
methods for character vectors and XStringSet
objects extend the capability of narrow
by returning the 3
set of subsequences (the left, middle and right subsequences)
associated to the narrowing operation.
See ?threebands
in the
IRanges package for the details.
subseq(x, start=NA, end=NA, width=NA) <- value
:
A vectorized version of the subseq<-
method for XVector objects.
See ?`subseq<-`
for the details.
In the code snippets below,
x
and values
are XStringSet objects,
and i
should be an index specifying the elements to extract.
x[i]
:
Return a new XStringSet object made of the selected elements.
x[[i]]
:
Extract the i-th XString
object from x
.
append(x, values, after=length(x))
:
Add sequences in values
to x
.
In the code snippets below,
x
and y
are XStringSet objects.
union(x, y)
:
Union of x
and y
.
intersect(x, y)
:
Intersection of x
and y
.
setdiff(x, y)
:
Asymmetric set difference of x
and y
.
setequal(x, y)
:
Set equality of x
to y
.
In the code snippets below,
x
is an XStringSet object.
unlist(x)
:
Turns x
into an XString object by combining the
sequences in x
together.
Fast equivalent to do.call(c, as.list(x))
.
as.character(x, use.names=TRUE)
:
Converts x
to a character vector of the same length as x
.
The use.names
argument controls whether or not names(x)
should be propagated to the names of the returned vector.
as.factor(x)
:
Converts x
to a factor, via as.character(x)
.
as.matrix(x, use.names=TRUE)
:
Returns a character matrix containing the "exploded" representation of
the strings. Can only be used on an XStringSet object with
equal-width strings.
The use.names
argument controls whether or not names(x)
should be propagated to the row names of the returned matrix.
toString(x)
:
Equivalent to toString(as.character(x))
.
show(x)
:
By default the show
method displays 5 head and 5 tail
lines. The number of lines can be altered by setting the global
options showHeadLines
and showTailLines
. If the
object length is less than the sum of the options, the full object
is displayed. These options affect GRanges, GAlignments, IRanges,
and XStringSet objects.
The letters in a DNAStringSet or RNAStringSet object are colored
when displayed by the show()
method. Set global option
Biostrings.coloring
to FALSE to turn off this coloring.
H. Pag<c3><a8>s
readDNAStringSet
and writeXStringSet
for reading/writing a DNAStringSet object (or other
XStringSet derivative) from/to a FASTA or FASTQ file.
XStringSet-comparison
XString objects.
XStringViews objects.
XStringSetList objects.
subseq
, narrow
,
and substr
.
compact
XVectorList objects.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 | ## ---------------------------------------------------------------------
## A. USING THE XStringSet CONSTRUCTORS ON A CHARACTER VECTOR OR FACTOR
## ---------------------------------------------------------------------
## Note that there is no XStringSet() constructor, but an XStringSet
## family of constructors: BStringSet(), DNAStringSet(), RNAStringSet(),
## etc...
x0 <- c("#CTC-NACCAGTAT", "#TTGA", "TACCTAGAG")
width(x0)
x1 <- BStringSet(x0)
x1
## 3 equivalent ways to obtain the same BStringSet object:
BStringSet(x0, start=4, end=-3)
subseq(x1, start=4, end=-3)
BStringSet(subseq(x0, start=4, end=-3))
dna0 <- DNAStringSet(x0, start=4, end=-3)
dna0 # 'options(Biostrings.coloring=FALSE)' to turn off coloring
names(dna0)
names(dna0)[2] <- "seqB"
dna0
## When the input vector contains a lot of duplicates, turning it into
## a factor first before passing it to the constructor will produce an
## XStringSet object that is more compact in memory:
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 999000, replace=TRUE)
dna2a <- DNAStringSet(x2)
dna2b <- DNAStringSet(factor(x2)) # slower but result is more compact
object.size(dna2a)
object.size(dna2b)
## ---------------------------------------------------------------------
## B. USING THE XStringSet CONSTRUCTORS ON A SINGLE SEQUENCE (XString
## OBJECT OR CHARACTER STRING)
## ---------------------------------------------------------------------
x3 <- "abcdefghij"
BStringSet(x3, start=2, end=6:2) # behaves like 'substring(x3, 2, 6:2)'
BStringSet(x3, start=-(1:6))
x4 <- BString(x3)
BStringSet(x4, end=-(1:6), width=3)
## Randomly extract 1 million 40-mers from C. elegans chrI:
extractRandomReads <- function(subject, nread, readlength)
{
if (!is.integer(readlength))
readlength <- as.integer(readlength)
start <- sample(length(subject) - readlength + 1L, nread,
replace=TRUE)
DNAStringSet(subject, start=start, width=readlength)
}
library(BSgenome.Celegans.UCSC.ce2)
rndreads <- extractRandomReads(Celegans$chrI, 1000000, 40)
## Notes:
## - This takes only 2 or 3 seconds versus several hours for a solution
## using substring() on a standard character string.
## - The short sequences in 'rndreads' can be seen as the result of a
## simulated high-throughput sequencing experiment. A non-realistic
## one though because:
## (a) It assumes that the underlying technology is perfect (the
## generated reads have no technology induced errors).
## (b) It assumes that the sequenced genome is exactly the same as the
## reference genome.
## (c) The simulated reads can contain IUPAC ambiguity letters only
## because the reference genome contains them. In a real
## high-throughput sequencing experiment, the sequenced genome
## of course doesn't contain those letters, but the sequencer
## can introduce them in the generated reads to indicate ambiguous
## base-calling.
## (d) The simulated reads come from the plus strand only of a single
## chromosome.
## - See the getSeq() function in the BSgenome package for how to
## circumvent (d) i.e. how to generate reads that come from the whole
## genome (plus and minus strands of all chromosomes).
## ---------------------------------------------------------------------
## C. USING THE XStringSet CONSTRUCTORS ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
probes
RNAStringSet(probes, start=2, end=-5) # does NOT copy the sequence data!
## ---------------------------------------------------------------------
## D. USING THE XStringSet CONSTRUCTORS ON AN ORDINARY list OF XString
## OBJECTS
## ---------------------------------------------------------------------
probes10 <- head(probes, n=10)
set.seed(33)
shuffled_nucleotides <- lapply(probes10, sample)
shuffled_nucleotides
DNAStringSet(shuffled_nucleotides) # does NOT copy the sequence data!
## Note that the same result can be obtained in a more compact way with
## just:
set.seed(33)
endoapply(probes10, sample)
## ---------------------------------------------------------------------
## E. USING subseq() ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
subseq(probes, start=2, end=-5)
subseq(probes, start=13, end=13) <- "N"
probes
## Add/remove a prefix:
subseq(probes, start=1, end=0) <- "--"
probes
subseq(probes, end=2) <- ""
probes
## Do more complicated things:
subseq(probes, start=4:7, end=7) <- c("YYYY", "YYY", "YY", "Y")
subseq(probes, start=4, end=6) <- subseq(probes, start=-2:-5)
probes
## ---------------------------------------------------------------------
## F. UNLISTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
unlist(probes)
## ---------------------------------------------------------------------
## G. COMPACTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
## As a particular type of XVectorList objects, XStringSet objects can
## optionally be compacted. Compacting is done typically before
## serialization. See ?compact for more information.
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
y <- subseq(probes[1:12], start=5)
probes@pool
y@pool
object.size(probes)
object.size(y)
y0 <- compact(y)
y0@pool
object.size(y0)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.