cylinder.counts: Count Cylinders (Fixed-Offset Patterns) in Character Vectors
In spgs: Statistical Patterns in Genomic Sequences

cylinder.counts

R Documentation


Description

Count fixed tuples of not necessarily adjacent symbols/elements in a character vector.

Usage

cylinder.counts(x, cylinder, case=c("lower", "upper", "as is"), circular=TRUE)

Arguments

`x`	A character vector or an object that can be coerced to a character vector.
`cylinder`	A vector of indices specifying the form of cylinders to count. See ‘Details’.
`case`	Determines how labels for the array should be generated: in lowercase, in uppercase or left as is. In the last case, labels such as “b” and “B” are treated as distinct symbols and counted separately.
`circular`	Determines if the vector should be treated as circular or not. The default is `TRUE`, meaning that the start and end of the sequence will be joined together for the purpose of counting.
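
The effect of `case` on which symbols are counted together can be illustrated with base R alone. This is only a sketch of the idea, assuming that `case="lower"` behaves like applying `tolower` to the symbols before counting; it does not call `cylinder.counts`, and the vector `x` is made up for illustration.

```r
# Toy input mixing upper- and lowercase symbols (made up for illustration)
x <- c("a", "B", "b", "A", "b")

# Assumed analogue of case="lower": fold case first, so "b" and "B" merge
merged <- table(tolower(x))

# Assumed analogue of case="as is": "b" and "B" remain distinct symbols
as_is <- table(x)

merged["b"]     # "b" and "B" counted together
length(as_is)   # number of distinct symbols when case is preserved
```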

Details

cylinder represents a set of symbol patterns that one wishes to count in the sequence x. For example, if cylinder is c(1,3,5), then this function will count occurrences of all patterns of the form ‘u.v.w’, where ‘u’, ‘v’ and ‘w’ can be any symbol present in x and ‘.’ stands for a symbol whose value is not relevant to the pattern.

Suppose that x is a sequence of the nucleotides a, c, g and t. Then, cylinder=1:2 will count the occurrences of all 16 dinucleotides: aa, ac, ag, at, ca, cc, .... In contrast, cylinder=c(1,3) will count 16 sets of trinucleotides: a.a, a.c, a.g, a.t, c.a, c.c, c.g, .... The dot “.” stands for any nucleotide, so that a.c represents the set aac, acc, agc, atc. In both of these examples, a 4 × 4 array of counts will be produced, but in the first case the array will represent counts of dinucleotides, while in the second case it will represent counts of groups of trinucleotides.
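
To make the pattern semantics concrete, here is a minimal base-R sketch of cylinder counting. The helper `naive_cylinder_counts` is hypothetical and is not part of spgs; it illustrates the idea, not the package's implementation. In particular, its handling of non-circular sequences (only windows that fit entirely inside the sequence are counted) is an assumption.

```r
# Hypothetical helper, NOT part of spgs: a naive reference sketch.
naive_cylinder_counts <- function(x, cylinder, circular = TRUE) {
  x <- as.character(x)
  n <- length(x)
  symbols <- sort(unique(x))
  # Window start positions: every position if circular; otherwise only
  # those where the whole pattern fits inside the sequence (assumption).
  starts <- if (circular) seq_len(n) else seq_len(max(0, n - max(cylinder) + 1))
  # One factor per cylinder index; positions wrap around modulo n.
  cols <- lapply(cylinder, function(k) {
    factor(x[(starts + k - 2) %% n + 1], levels = symbols)
  })
  # Cross-tabulate the selected positions into an array of counts.
  unclass(do.call(table, unname(cols)))
}

x <- strsplit("acgatgaacc", "")[[1]]
pat <- naive_cylinder_counts(x, c(1, 3))
pat["a", "c"]   # 2: the words aac and acc each occur once in x
sum(pat)        # 10, the length of x

# Summing the trinucleotide counts over the middle position gives the
# same array of counts
tri <- naive_cylinder_counts(x, 1:3)
all(pat == apply(tri, c(1, 3), sum))
```

The middle symbol is simply never tabulated, which is why the result for c(1,3) is a 4 × 4 array rather than a 4 × 4 × 4 one.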

If circular is TRUE, the vector x is treated as circular, so that the sum of all the counts in the resulting array is equal to the length of the vector and the marginal sums over each dimension of the array are equal. That is, writing
counts <- cylinder.counts(x, cylinder=c(1,3,5))
for some character sequence x, then
apply(counts,1,sum), apply(counts,2,sum) and apply(counts,3,sum)
will all be identical.

On the other hand, if circular is FALSE, the sum of all the entries in the counts array will be less than the length of the vector and there will be a discrepancy between the sums over the various dimensions.
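
The difference between the two modes can be checked on a toy sequence with base R. The sketch below does not call `cylinder.counts`; the helper `count_windows` is hypothetical, and the non-circular rule it uses (count only windows that fit entirely inside the sequence) is an assumption about how the truncation works.

```r
x <- strsplit("acgatgaacc", "")[[1]]
n <- length(x)
cyl <- c(1, 3, 5)
symbols <- sort(unique(x))

# Hypothetical helper: tabulate the symbols at positions 1, 3 and 5 of
# every window that begins at one of the given start positions.
count_windows <- function(starts) {
  pick <- function(k) factor(x[(starts + k - 2) %% n + 1], levels = symbols)
  unclass(table(pick(cyl[1]), pick(cyl[2]), pick(cyl[3])))
}

circ <- count_windows(seq_len(n))                  # windows wrap around the end
lin  <- count_windows(seq_len(n - max(cyl) + 1))   # assumed non-circular rule

sum(circ) == n                                     # TRUE
all(apply(circ, 1, sum) == apply(circ, 3, sum))    # TRUE
sum(lin)                                           # 6, fewer than length(x)
all(apply(lin, 1, sum) == apply(lin, 3, sum))      # FALSE
```

In the circular case every position starts exactly one window, so each marginal is just the symbol frequency table of x. Without wrapping, the first and last coordinates see different stretches of the sequence, which is where the discrepancy comes from.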

Value

An n-dimensional array of counts, where n is the length of cylinder.

Note

table is more efficient (by almost a factor of 2) at computing the counts of cylinders of length 1, whereas cylinder.counts is faster and uses less memory than table for cylinders of length greater than 1.

Author(s)

Andrew Hart and Servet Martínez

Examples

library(spgs)
#Generate an IID uniform DNA sequence
seq <- simulateMarkovChain(5000, matrix(0.25, 4, 4), states=c("a","c","g","t"))
cylinder.counts(seq, 1) #essentially the same as unclass(table(seq))
cylinder.counts(seq, 1:5) #counts of all 5-mers in the sequence

#counts of all patterns of the form a.b where a and b represent
#specific symbols and . denotes an arbitrary symbol.
pat <- cylinder.counts(seq, c(1, 3))
#For example, pat["a","c"] gives the number of times that any of 
#the following 4 words appears in the sequence:  aac, acc, agc, atc.
identical(cylinder.counts(seq, c(1,3)), apply(cylinder.counts(seq, 1:3), c(1, 3), sum))

##some relationships between cylinder.counts and other functions
identical(cylinder.counts(seq, 1:2), pair.counts(seq))
identical(cylinder.counts(seq, 1:3), triple.counts(seq))
identical(cylinder.counts(seq, 1:4), quadruple.counts(seq))

#The following relationship means that counts on circular sequences are 
#invariant under translation
identical(cylinder.counts(seq, 1:6), cylinder.counts(seq, 10:15))

#Treating seq as non circular, most of the preceding relationships continue to hold
identical(cylinder.counts(seq, 1:2, circular=FALSE), 
  pair.counts(seq, circular=FALSE))
identical(cylinder.counts(seq, 1:3, circular=FALSE), 
  triple.counts(seq, circular=FALSE))
identical(cylinder.counts(seq, 1:4, circular=FALSE), 
  quadruple.counts(seq, circular=FALSE))
#The following relationship no longer holds; that is, non-circular counts
#are not invariant under translation.
identical(cylinder.counts(seq, 1:6, circular=FALSE), 
  cylinder.counts(seq, 10:15, circular=FALSE))

spgs documentation built on Oct. 3, 2023, 5:07 p.m.