Count Cylinders (Fixed-Offset Patterns) in Character Vectors
Description
Count fixed tuples of not necessarily adjacent symbols/elements in a character vector.
Usage
cylinder.counts(x, cylinder, case=c("lower", "upper", "as is"), circular=TRUE)

Arguments
x 
a character vector or an object that can be coerced to a character vector. 
cylinder 
A vector of indices specifying the form of cylinders to count. See ‘Details’. 
case 
determines how labels for the array should be generated: in lowercase, in uppercase or left as is, in which case labels such as “b” and “B” will be seen as distinct symbols and counted separately. 
circular 
Determines if the vector should be treated as circular or not. The default is TRUE.

Details
cylinder represents a set of symbol patterns that one wishes to count in the sequence x. For example, if cylinder is c(1,3,5), then this function will count occurrences of all patterns of the form u.v.w, where u, v and w can be any symbol present in x and . stands for a symbol whose value is not relevant to the pattern.
Suppose that x is a sequence of the nucleotides a, c, g and t. Then, cylinder=1:2 will count the occurrences of all 16 dinucleotides: aa, ac, ag, at, ca, cc, .... In contrast, cylinder=c(1,3) will count 16 sets of trinucleotides: a.a, a.c, a.g, a.t, c.a, c.c, c.g, .... The dot “.” stands for any nucleotide, so that a.c represents the set aac, acc, agc, atc. In both of these examples, a 4 x 4 array of counts will be produced, but in the first case the array will represent counts of dinucleotides, while in the second case it will represent counts of groups of trinucleotides.
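The correspondence between cylinder indices and gapped patterns can be illustrated with a small base-R sketch. The helper cylinder_sketch below is hypothetical and is not part of this package; it assumes that index k in cylinder refers to the symbol k-1 positions after the start of the pattern, and it is meant only to show what is being counted, not how cylinder.counts is implemented.

```r
# Hypothetical helper (not part of the package): a naive base-R illustration
# of counting fixed-offset patterns in a character vector.
cylinder_sketch <- function(x, cylinder, circular = TRUE) {
  x <- as.character(x)
  n <- length(x)
  symbols <- sort(unique(x))
  # Assumption: index k in 'cylinder' means "k - 1 positions after the start".
  offsets <- cylinder - 1L
  # Circular: every position starts a pattern. Otherwise only those positions
  # from which the furthest offset still lies inside the vector.
  starts <- if (circular) seq_len(n) else seq_len(max(n - max(offsets), 0))
  # One factor per cylinder index, holding the symbol found at that offset.
  cols <- lapply(offsets, function(k) {
    factor(x[(starts + k - 1L) %% n + 1L], levels = symbols)
  })
  # Cross-tabulate the factors into an array with one dimension per index.
  unclass(do.call(table, unname(cols)))
}

# A short toy sequence, chosen arbitrarily for illustration.
x <- c("a", "c", "g", "t", "a", "a", "c", "t", "g", "g")
gapped <- cylinder_sketch(x, c(1, 3))  # patterns of the form u.v
full <- cylinder_sketch(x, 1:3)        # all trinucleotides uvw
# Summing the trinucleotide counts over the middle symbol gives the gapped counts.
all(gapped == apply(full, c(1, 3), sum))
```

Here gapped["a", "c"] is the number of positions at which a is followed, two places later, by c, whatever symbol lies in between.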
If circular is TRUE, the vector x is treated as circular, so that the sum of all the counts in the resulting array is equal to the length of the vector and the sums across all dimensions of the array are equivalent. That is, writing counts <- cylinder.counts(x, cylinder=c(1,3,5)) for some character sequence x, apply(counts,1,sum), apply(counts,2,sum) and apply(counts,3,sum) will all be identical.
On the other hand, if circular is FALSE, the sum of all the entries in the counts array will be less than the length of the vector and there will generally be a discrepancy between the sums over the various dimensions.
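The effect of circular on the marginal sums can be checked by hand for adjacent pairs (the case cylinder=1:2). The following is a minimal base-R sketch that mimics the counting with table rather than calling cylinder.counts; the toy sequence and the helper as_symbol are illustrative assumptions.

```r
# A toy sequence, chosen arbitrarily for illustration.
x <- c("a", "c", "g", "t", "a", "a", "c", "t")
n <- length(x)
symbols <- sort(unique(x))
# Hypothetical helper: fix the factor levels so both margins share the same labels.
as_symbol <- function(v) factor(v, levels = symbols)

# Circular: pair each symbol with its successor, wrapping the last onto the first.
circ <- table(first = as_symbol(x), second = as_symbol(x[c(2:n, 1)]))
# Non-circular: the last symbol has no successor, so one pair is lost.
lin <- table(first = as_symbol(x[-n]), second = as_symbol(x[-1]))

sum(circ) == n                        # total equals the length of the vector
all(rowSums(circ) == colSums(circ))   # both margins are the symbol frequencies
sum(lin) == n - 1                     # total falls short of the length
all(rowSums(lin) == colSums(lin))     # margins disagree for this sequence
```

In the circular case every symbol occurs exactly once in each position of a pair, which is why each margin reduces to the symbol frequencies of x.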
Value
An n-dimensional array of counts, where n is the length of cylinder.
Note
table is more efficient (by almost a factor of 2) at computing the counts of cylinders of length 1, whereas cylinder.counts is faster and uses less memory than table for cylinders of length greater than 1.
Author(s)
Andrew Hart and Servet Martínez
See Also
pair.counts, triple.counts, quadruple.counts, array2vector, table2vector
Examples
#Generate an IID uniform DNA sequence
seq <- simulateMarkovChain(5000, matrix(0.25, 4, 4), states=c("a","c","g","t"))
cylinder.counts(seq, 1) #essentially the same as unclass(table(seq))
cylinder.counts(seq, 1:5) #counts of all 5-mers in the sequence
#counts of all patterns of the form a.b where a and b represent
#specific symbols and . denotes an arbitrary symbol.
pat <- cylinder.counts(seq, c(1, 3))
#For example, pat["a","c"] gives the number of times that any of
#the following 4 words appears in the sequence: aac, acc, agc, atc.
identical(cylinder.counts(seq, c(1,3)), apply(cylinder.counts(seq, 1:3), c(1, 3), sum))
##some relationships between cylinder.counts and other functions
identical(cylinder.counts(seq, 1:2), pair.counts(seq))
identical(cylinder.counts(seq, 1:3), triple.counts(seq))
identical(cylinder.counts(seq, 1:4), quadruple.counts(seq))
#The following relationship means that counts on circular sequences are
#invariant under translation
identical(cylinder.counts(seq, 1:6), cylinder.counts(seq, 10:15))
#Treating seq as non-circular, most of the preceding relationships continue to hold
identical(cylinder.counts(seq, 1:2, circular=FALSE),
pair.counts(seq, circular=FALSE))
identical(cylinder.counts(seq, 1:3, circular=FALSE),
triple.counts(seq, circular=FALSE))
identical(cylinder.counts(seq, 1:4, circular=FALSE),
quadruple.counts(seq, circular=FALSE))
#The following relationship no longer holds; that is, non-circular counts
#are not invariant under translation.
identical(cylinder.counts(seq, 1:6, circular=FALSE),
cylinder.counts(seq, 10:15, circular=FALSE))
