Count Cylinders (Fixed-Offset Patterns) in Character Vectors
Count fixed tuples of not necessarily adjacent symbols/elements in a character vector.
A character vector or an object that can be coerced to a character vector.
A vector of indices specifying the form of cylinders to count. See ‘Details’.
Determines how labels for the array should be generated: in lowercase, in uppercase or left as is, in which case labels such as “b” and “B” will be seen as distinct symbols and counted separately.
Determines if the vector should be treated as circular or not. The default is TRUE.
cylinder represents a set of symbol patterns that one wishes to count in
x. For example, if cylinder=c(1,3,5),
this function will count occurrences of all patterns of the form u.v.w,
where u, v and w can be any symbol present in x and
. stands for a symbol whose value is not relevant to the pattern.
Suppose x is a sequence of the nucleotides a, c, g and t. Then
cylinder=1:2 will count the occurrences of
all 16 dinucleotides: aa, ac, ag, at, ca, cc, .... In contrast,
cylinder=c(1,3) will count 16 sets of trinucleotides: a.a, a.c, a.g, a.t,
c.a, c.c, c.g, .... The dot “.” stands for any
nucleotide, so that a.c represents the set
aac, acc, agc, atc. In both of these examples, a 4 x 4
array of counts will be produced, but in the first case the array will
represent counts of dinucleotides, while in the second case it will represent
counts of groups of trinucleotides.
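To make the idea of a cylinder concrete, the following is a minimal base R sketch of what counting such patterns on a circular vector amounts to. The helper count_cylinder_sketch is hypothetical and written only for illustration; it is not the package's implementation and ignores the case-handling argument.

```r
# Hypothetical helper, not part of any package: counts cylinder patterns
# on a circular character vector using only base R.
count_cylinder_sketch <- function(x, cylinder) {
  n <- length(x)
  symbols <- sort(unique(x))
  offsets <- cylinder - min(cylinder)  # positions relative to the first index
  # One factor per cylinder position; indices wrap around the end of x.
  cols <- lapply(offsets, function(k) {
    factor(x[((seq_len(n) - 1 + k) %% n) + 1], levels = symbols)
  })
  # Cross-tabulate the shifted copies and drop the "table" class.
  unclass(do.call(table, unname(cols)))
}

x <- c("a", "c", "g", "a", "t", "g", "a", "c")
di  <- count_cylinder_sketch(x, 1:2)      # dinucleotide counts
gap <- count_cylinder_sketch(x, c(1, 3))  # counts of patterns of the form u.v
tri <- count_cylinder_sketch(x, 1:3)      # trinucleotide counts

dim(gap)       # 4 4
gap["a", "g"]  # occurrences of a.g, i.e. aag, acg, agg or atg

# Summing trinucleotide counts over the middle symbol recovers the u.v counts.
all(gap == apply(tri, c(1, 3), sum))
```

In this sketch, each entry of gap pools the four trinucleotides that share the same first and third symbol, which is why the result is still a 4 x 4 array.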
If circular is TRUE, the vector
x is treated as circular so that the
sum of all the counts in the resulting array is equal to the length of the
vector and the sums across all dimensions of the array are equivalent, that is, if
counts <- cylinder.counts(x, cylinder=c(1,3,5))
for some character sequence x, then
apply(counts, 1, sum), apply(counts, 2, sum) and apply(counts, 3, sum)
will all be identical.
On the other hand, if circular is
FALSE, the sum of all the
entries in the counts array will be less than the length of the vector and
there will be a discrepancy between the sums over the various dimensions.
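The difference between the two modes can be illustrated with a small base R sketch. The helper count_135_sketch is hypothetical and handles only patterns of the form u.v.w (positions 1, 3 and 5); it assumes that non-circular counting simply skips windows that would run past the end of the vector, which may not match the package's behaviour in every detail.

```r
# Hypothetical helper (not the package's code): counts patterns u.v.w,
# i.e. positions 1, 3 and 5, with or without wrap-around.
count_135_sketch <- function(x, circular = TRUE) {
  n <- length(x)
  symbols <- sort(unique(x))
  # Start positions: every position if circular, otherwise only those
  # for which position start + 4 still lies inside the vector.
  starts <- if (circular) seq_len(n) else seq_len(max(n - 4, 0))
  pick <- function(k) {
    factor(x[((starts - 1 + k) %% n) + 1], levels = symbols)
  }
  unclass(table(pick(0), pick(2), pick(4)))
}

x <- c("a", "c", "g", "a", "t", "g", "a", "c", "c", "t")
circ <- count_135_sketch(x, circular = TRUE)
lin  <- count_135_sketch(x, circular = FALSE)

# Circular: the total equals length(x) and all three margins agree.
sum(circ) == length(x)
all(apply(circ, 1, sum) == apply(circ, 2, sum))
all(apply(circ, 2, sum) == apply(circ, 3, sum))

# Non-circular: fewer windows are counted and the margins can differ.
sum(lin) < length(x)
rbind(first = apply(lin, 1, sum), last = apply(lin, 3, sum))
```

In the circular case every position of x appears exactly once in each coordinate of the pattern, so each margin is just the vector of single-symbol counts. Without wrap-around, the first coordinate never sees the end of the vector and the last coordinate never sees the beginning, which is where the discrepancy comes from.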
An n-dimensional array of counts, where n is the length of cylinder.
table is more efficient (by almost a factor of 2) at computing the
counts of cylinders of length 1, whereas
cylinder.counts is faster and
uses less memory than table for cylinders of length greater than 1.
Andrew Hart and Servet Martínez
#Generate an IID uniform DNA sequence
seq <- simulateMarkovChain(5000, matrix(0.25, 4, 4), states=c("a","c","g","t"))
cylinder.counts(seq, 1) #essentially the same as unclass(table(seq))
cylinder.counts(seq, 1:5) #counts of all 5-mers in the sequence

#counts of all patterns of the form a.b where a and b represent
#specific symbols and . denotes an arbitrary symbol.
pat <- cylinder.counts(seq, c(1, 3))
#For example, pat["a","c"] gives the number of times that any of
#the following 4 words appears in the sequence: aac, acc, agc, atc.
identical(cylinder.counts(seq, c(1,3)), apply(cylinder.counts(seq, 1:3), c(1, 3), sum))

##some relationships between cylinder.counts and other functions
identical(cylinder.counts(seq, 1:2), pair.counts(seq))
identical(cylinder.counts(seq, 1:3), triple.counts(seq))
identical(cylinder.counts(seq, 1:4), quadruple.counts(seq))

#The following relationship means that counts on circular sequences are
#invariant under translation
identical(cylinder.counts(seq, 1:6), cylinder.counts(seq, 10:15))

#Treating seq as non circular, most of the preceding relationships continue to hold
identical(cylinder.counts(seq, 1:2, circular=FALSE), pair.counts(seq, circular=FALSE))
identical(cylinder.counts(seq, 1:3, circular=FALSE), triple.counts(seq, circular=FALSE))
identical(cylinder.counts(seq, 1:4, circular=FALSE), quadruple.counts(seq, circular=FALSE))

#The following relationship no longer holds; that is, non-circular counts
#are not invariant under translation.
identical(cylinder.counts(seq, 1:6, circular=FALSE), cylinder.counts(seq, 10:15, circular=FALSE))