The biogram package is a toolbox for the analysis of nucleic acid and protein sequences using n-grams. Possible applications include motif discovery, feature selection, clustering, and classification.
n-grams (k-tuples) are sets of n characters derived from the input sequence(s). They may form continuous sub-sequences or be discontinuous. For example, from the sequence of nucleotides AATA one can extract the following continuous 2-grams (bigrams): AA, AT and TA. Moreover, there are two possible bigrams separated by a single space: A_T and A_A, and one bigram separated by two spaces: A__A.
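The AATA example can be reproduced with a few lines of base R. This is only an illustrative sketch, not the biogram implementation; the helper name extract_bigrams and its gap argument are assumptions made for this example.

```r
# Base-R sketch (not biogram code): list the bigrams of one sequence whose
# two elements are separated by `gap` positions.
extract_bigrams <- function(sequence, gap = 0) {
  chars <- strsplit(sequence, "")[[1]]
  n_starts <- length(chars) - gap - 1
  # sequences too short for the requested gap yield no bigrams
  if (n_starts < 1) return(character(0))
  starts <- seq_len(n_starts)
  # underscores mark the skipped positions, as in the text above
  paste0(chars[starts], strrep("_", gap), chars[starts + gap + 1])
}

extract_bigrams("AATA")           # "AA" "AT" "TA"
extract_bigrams("AATA", gap = 1)  # "A_T" "A_A"
extract_bigrams("AATA", gap = 2)  # "A__A"
```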
Another important n-gram parameter is its position. Instead of just counting n-grams, one may want to count how many n-grams occur at a given position in multiple (e.g. related) sequences. For example, in the sequences AATA and AACA there is only one bigram at position 1: AA, but there are two bigrams at position 2: AT and AC. The following notation is used for position-specific n-grams: 1_AA, 2_AT, 2_AC.
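A base-R sketch of position-specific counting for the two sequences above may make the notation concrete. The helper positioned_bigrams is a hypothetical name used only here, and the code does not call biogram.

```r
# Base-R sketch (not biogram code): label each continuous bigram with the
# position at which it starts, using the position_ngram notation.
positioned_bigrams <- function(sequence) {
  chars <- strsplit(sequence, "")[[1]]
  if (length(chars) < 2) return(character(0))
  starts <- seq_len(length(chars) - 1)
  paste0(starts, "_", chars[starts], chars[starts + 1])
}

seqs <- c("AATA", "AACA")
all_pos <- unlist(lapply(seqs, positioned_bigrams))

# how many sequences carry each position-specific bigram
table(all_pos)
```

In this toy input 1_AA occurs in both sequences, while 2_AT and 2_AC each occur once.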
In the biogram package, the count_ngrams function is used for counting and extracting n-grams. Using the d argument, the user can specify the distance between elements of the n-grams. The pos argument can be used to enable position specificity.
We note that n-grams suffer from the curse of dimensionality. For example, for a peptide of length 6, $20^{n}$ n-grams and $6 \times 20^{n}$ positioned n-grams are possible. Data sets of such an enormous size are hard to manage and analyze in R.
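To give a feel for the growth, the two expressions above can be evaluated for small n. The variable names are chosen for this sketch only.

```r
# Evaluate the feature-space sizes quoted above for a 20-letter alphabet
# and a peptide of length 6.
alphabet_size <- 20
peptide_length <- 6
n <- 1:4

dims <- data.frame(
  n = n,
  ngrams = alphabet_size^n,                      # 20^n
  positioned = peptide_length * alphabet_size^n  # 6 * 20^n
)
dims
```

Already for trigrams (n = 3) this gives 8000 n-grams and 48000 positioned n-grams.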
The biogram package addresses both of the abovementioned problems, data management and analysis. It exploits an innate property of n-gram data, which can usually be represented by sparse matrices. Data storage relies on functionality from the slam package. To ease the selection of significant features, biogram provides the user with QuiPT, a very fast permutation test for binary data (see test_features).
Another way of reducing dimensionality is the aggregation of sequence residues into more general groups. For example, all positively-charged amino acids may be aggregated into one group. This action can be performed using the degenerate function.
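The idea of aggregation can be sketched in base R as a simple lookup from residues to group labels. This is not the degenerate function itself: the helper degenerate_sketch, the group labels and the two-group encoding below are illustrative assumptions, and, unlike a complete encoding, residues outside any group are left unchanged here.

```r
# Base-R sketch (not biogram code): replace each residue by the label of
# the group it belongs to.
degenerate_sketch <- function(sequence, groups) {
  # named lookup vector: residue -> group label
  lookup <- setNames(rep(names(groups), lengths(groups)), unlist(groups))
  out <- sequence
  hit <- sequence %in% names(lookup)
  out[hit] <- lookup[sequence[hit]]
  unname(out)
}

# illustrative grouping: p = positively charged, n = negatively charged
charge_groups <- list(p = c("K", "R"), n = c("D", "E"))
peptide <- c("K", "A", "R", "D", "E", "G")

degenerate_sketch(peptide, charge_groups)  # "p" "A" "p" "n" "n" "G"
```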
Encoding of amino acids can ease sequence analysis, but multidimensional objects such as aggregations of amino acids are not easily comparable. We introduced the encoding distance, a measure defining the distance between encodings. It can be computed using the calc_ed function.
Author(s): Michal Burdukiewicz, Piotr Sobczyk, Chris Lauber
library(biogram)
# use data set from package
data(human_cleave)
# first nine columns represent subsequent nine amino acids from cleavage sites
# degenerate the sequence to reduce the dimensionality of the problem
# (use five groups instead of 20 amino acids)
deg_seqs <- degenerate(human_cleave[, 1L:9],
                       list(`a` = c(1, 6, 8, 10, 11, 18),
                            `b` = c(2, 13, 14, 16, 17),
                            `c` = c(5, 19, 20),
                            `d` = c(7, 9, 12, 15),
                            `e` = c(3, 4)))
# EXAMPLE 1 - extract significant trigrams
# extract trigrams
trigrams <- count_ngrams(deg_seqs, 3, letters[1L:5], pos = TRUE)
# select features that differ between the two target groups using QuiPT
test1 <- test_features(human_cleave[, "tar"], trigrams)
# see a summary of the results
summary(test1)
# aggregate features in groups based on their p-value
gr <- cut(test1)
# get position map of the most significant n-grams
position_ngrams(gr[[1]])
# transform the most significant n-grams to more readable form
decode_ngrams(gr[[1]])
# EXAMPLE 2 - search for specific n-grams
# the n-grams of interest are a_a (a-gap-a) and c_c (c-gap-c) at the
# 3rd and 4th positions
# first, code the n-grams in biogram notation
coded <- code_ngrams(c("a_a", "c_c"))
# add position information
coded <- c(paste0("3_", coded), paste0("4_", coded))
# count only the features of interest
bigrams <- count_specified(deg_seqs, coded)
# test which of the features of interest are significant
test2 <- test_features(human_cleave[, "tar"], bigrams)
cut(test2)
Loading required package: slam
Total number of features: 690
Number of significant features: 70
Criterion used: Information Gain
Feature test: QuiPT
p-values adjustment method: BH
$`1`
[1] a_0 a_0 b_0
Levels: a_0 b_0 c_0 d_0 e_0
$`2`
[1] a_0 a_0 a_0 b_0 b_0 b_0
Levels: a_0 b_0 c_0 d_0 e_0
$`3`
[1] a_0 a_0 a_0 a_0 a_0 a_0 a_0 a_0 a_0 a_0 b_0 d_0
Levels: a_0 b_0 c_0 d_0 e_0
$`4`
[1] a_0 a_0 a_0 a_0 a_0 b_0 b_0 c_0 d_0 e_0
Levels: a_0 b_0 c_0 d_0 e_0
$`5`
[1] a_0 a_0 a_0 a_0 a_0 a_0 a_0 a_0
Levels: a_0 b_0 c_0 d_0 e_0
$`6`
[1] a_0 a_0
Levels: a_0 b_0 c_0 d_0 e_0
$`7`
[1] b_0
Levels: a_0 b_0 c_0 d_0 e_0
1_a.a.a_0.0 2_a.a.a_0.0 3_a.a.a_0.0 4_a.a.a_0.0 1_b.a.a_0.0 2_b.a.a_0.0
"aaa" "aaa" "aaa" "aaa" "baa" "baa"
3_b.a.a_0.0 1_a.b.a_0.0 3_a.b.a_0.0 3_a.c.a_0.0 3_a.d.a_0.0 3_a.e.a_0.0
"baa" "aba" "aba" "aca" "ada" "aea"
5_a.a.b_0.0 2_b.d.b_0.0
"aab" "bdb"
$`[0,0.0001]`
[1] "3_a.a_1"
$`(0.0001,0.01]`
[1] "3_c.c_1" "4_a.a_1"
$`(0.01,0.05]`
character(0)
$`(0.05,1]`
[1] "4_c.c_1"