count_multigrams: Detect and count multiple n-grams in sequences


View source: R/count_multigrams.R

Description

A convenient wrapper around count_ngrams for counting n-grams over multiple values of n and d.

Usage

count_multigrams(
  ns,
  ds = rep(0, length(ns)),
  seq,
  u,
  pos = FALSE,
  scale = FALSE,
  threshold = 0
)

Arguments

ns

numeric vector of n-grams' sizes. See Details.

ds

list of distances between elements of n-grams. Each element of the list is a vector used as distance for the respective n-gram size given by the ns parameter.

seq

a vector or matrix describing sequence(s).

u

integer, numeric or character vector of all possible unigrams.

pos

logical, if TRUE, position-specific n-grams are counted.

scale

logical, if TRUE output data is normalized. May be applied only to the counts of n-grams without position information. See Details.

threshold

integer, if not equal to 0, data is binarized into two groups (greater than or equal to threshold vs. smaller than threshold).
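
The threshold rule can be illustrated in base R, without the package. This is only a minimal sketch: binarize_counts is a hypothetical helper, not part of biogram, and it assumes the two groups are coded as 1 (at or above the threshold) and 0 (below it).

```r
# Hypothetical helper (not part of biogram): binarize a matrix of counts.
# Assumption: values >= threshold become 1, smaller values become 0.
binarize_counts <- function(counts, threshold = 0) {
  if (threshold == 0) {
    return(counts)  # a threshold of 0 leaves the counts untouched
  }
  binarized <- counts
  # Assigning into binarized[] keeps the dimensions and column names.
  binarized[] <- as.integer(counts >= threshold)
  binarized
}

# Toy count matrix: 2 sequences (rows), 3 n-grams (columns).
counts <- matrix(c(0L, 1L, 2L, 3L, 4L, 0L), nrow = 2,
                 dimnames = list(NULL, c("a", "b", "c")))
binarize_counts(counts, threshold = 2)
```

With threshold = 2, every count of 2 or more falls into the upper group and every smaller count into the lower one.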

Details

ns and ds must have equal length. Each element of ds is used as the d parameter for the corresponding element of ns. For example, if ns is c(4, 4, 4), ds must be a list of length 3. Each element of the ds list must have length 3 or 1, as required for the d parameter of the count_ngrams function.
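
The length rule can be checked before calling the function. The sketch below uses base R only. Both check_ds and ngram_offsets are hypothetical helpers, not part of biogram. It assumes that a valid d for an n-gram of size n has length 1 or n - 1 (consistent with "length 3 or 1" for n = 4), and that a distance of 0 means adjacent elements.

```r
# Hypothetical helper (not part of biogram): TRUE if ds matches ns.
# Assumption: each element of ds must have length 1 or n - 1.
check_ds <- function(ns, ds) {
  if (length(ns) != length(ds)) {
    return(FALSE)
  }
  ok <- vapply(seq_along(ns), function(i) {
    len <- length(ds[[i]])
    len == 1 || len == ns[i] - 1
  }, logical(1))
  all(ok)
}

# Hypothetical helper: offsets of the n-gram elements relative to the
# first element. Assumption: distance 0 means the next element is adjacent.
ngram_offsets <- function(n, d) {
  if (n == 1) {
    return(0)
  }
  d <- rep(d, length.out = n - 1)  # recycle a single distance
  c(0, cumsum(d + 1))
}

check_ds(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)))  # TRUE
check_ds(c(3, 1), list(c(1, 0), 0))                             # TRUE
check_ds(c(4, 4), list(c(2, 0), 1))    # FALSE: first d has length 2
ngram_offsets(4, c(2, 0, 1))           # 0 3 4 6
```

Under these assumptions, a 4-gram with distances c(2, 0, 1) takes the elements at offsets 0, 3, 4 and 6 from its starting position.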

Value

An integer matrix with named columns. The naming conventions are the same as in count_ngrams.

Examples

seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
count_multigrams(c(3, 1), list(c(1, 0), 0), seqs, 1L:4, pos = TRUE)
# if ds parameter is not present, n-grams are calculated for distance 0
count_multigrams(c(3, 1), seq = seqs, u = 1L:4)

# count 4-grams three times, each time with different distances between
# elements
count_multigrams(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)), 
                 seqs, 1L:4, pos = TRUE)

biogram documentation built on March 31, 2020, 5:14 p.m.