pattern2id: Convert regex and glob patterns to type IDs or fixed patterns


View source: R/pattern2fixed.R

Description

pattern2id converts regex or glob patterns to type IDs so that C++ functions can perform fast searches in a tokens object. The C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of the tokens object are matched. pattern2id also constructs an index of glob patterns for faster matching.
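As a rough illustration of what converting a pattern to type IDs means, the sketch below uses only base R. It assumes that a type ID is the position of a type in the types vector. glob_to_ids is a hypothetical helper and not part of quanteda.core; it performs the sequential search that the package's index is meant to avoid.

```r
# Hypothetical helper (not part of quanteda.core): resolve one glob
# pattern to type IDs, assuming an ID is a position in `types`.
glob_to_ids <- function(glob, types, case_insensitive = TRUE) {
  rx <- utils::glob2rx(glob)  # translate the glob into an anchored regex
  which(grepl(rx, types, ignore.case = case_insensitive))
}

types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
glob_to_ids("a*", types)  # positions of "A" and "AA"
glob_to_ids("c", types)   # position of "C" only
glob_to_ids("d", types)   # no match: integer(0)
```

Indexing types with the returned IDs, for example types[glob_to_ids("a*", types)], gives the matching types themselves, which is the kind of fixed pattern that pattern2fixed is described as returning.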

pattern2fixed converts regex and glob patterns to fixed patterns.

index_types is an auxiliary function for pattern2id that constructs an index of "glob" or "fixed" patterns to avoid expensive sequential search. For example, a type "cars" is indexed by keys "cars", "car?", "c*", "ca*", "car*" and "cars*" when valuetype="glob".
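The keys listed for "cars" can be reproduced with a short base R sketch. glob_keys is a hypothetical helper that generates only the keys named in the example above (the type itself, the type with its last character replaced by "?", and every prefix followed by "*"); the real index_types may generate other keys and stores type IDs rather than strings.

```r
# Hypothetical helper (not part of quanteda.core): list the glob keys
# named in the description for a single type.
glob_keys <- function(type) {
  n <- nchar(type)
  prefixes <- substring(type, 1, seq_len(n))  # every leading substring
  c(
    type,                                     # exact key
    paste0(substr(type, 1, n - 1), "?"),      # last character as "?"
    paste0(prefixes, "*")                     # each prefix followed by "*"
  )
}

glob_keys("cars")
```

With keys like these stored as names in a lookup table, a pattern such as "ca*" can be resolved by one keyed lookup instead of testing every type in turn.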

Usage

pattern2id(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE
)

pattern2fixed(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE
)

index_types(types, valuetype, case_insensitive, max_len = NULL)

Arguments

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

types

unique types of tokens obtained by types()

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

keep_nomatch

logical; if TRUE, keep patterns that are not found in types

max_len

maximum length of types to be indexed

Value

pattern2id returns a list of integer vectors containing type IDs

pattern2fixed returns a list of character vectors containing types

index_types returns a list of integer vectors containing type IDs with index keys as an attribute

Examples

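# Types against which the patterns are matched
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# Regex patterns to type IDs ("d" matches none of the types)
pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)

# Glob patterns to type IDs
pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)

# Regex patterns to fixed patterns
pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)

# Index types for glob matching (case-sensitive, max_len = 3),
# then search the index with the internal function search_glob
index <- index_types(c("xxx", "yyyy", "ZZZ"), "glob", FALSE, 3)
quanteda.core:::search_glob("yy*", attr(index, "type_search"), index)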

koheiw/quanteda.core documentation built on Sept. 21, 2020, 3:44 p.m.