re2_regexp: Compile regular expression pattern
In re2: R Interface to Google RE2 (C++) Regular Expression Library

re2_regexp

R Documentation

Compile regular expression pattern

Description

re2_regexp compiles a character string containing a regular expression and returns a pointer to the compiled object.

Usage

re2_regexp(pattern, ...)

Arguments

pattern

Character string containing a regular expression.

...

Options, which are (defaults in parentheses):

`encoding`	(`"UTF8"`) Text and pattern are UTF-8; set to `"Latin1"` to treat them as Latin-1.
`posix_syntax`	(`FALSE`) Restrict regexps to POSIX egrep syntax.
`longest_match`	(`FALSE`) Search for longest match, not first match.
`max_mem`	(see below) Approx. max memory footprint of RE2 C++ object.
`literal`	(`FALSE`) Interpret pattern as literal, not regexp.
`never_nl`	(`FALSE`) Never match `\n`, even if it is in the regexp.
`dot_nl`	(`FALSE`) Dot matches everything, including newline.
`never_capture`	(`FALSE`) Parse all parens as non-capturing.
`case_sensitive`	(`TRUE`) Match is case-sensitive (regexp can override with `(?i)` unless in `posix_syntax` mode).

The following options are only consulted when `posix_syntax=TRUE`. When `posix_syntax=FALSE`, these features are always enabled and cannot be turned off; to perform multi-line matching in that case, begin the regexp with `(?m)`.

`perl_classes`	(`FALSE`) Allow Perl's `\d \s \w \D \S \W`.
`word_boundary`	(`FALSE`) Allow Perl's `\b \B` (word boundary and not).
`one_line`	(`FALSE`) `^` and `$` only match beginning and end of text.

The `max_mem` option controls how much memory can be used to hold the compiled form of the regexp and its cached DFA graphs (the DFA is the execution engine that implements deterministic finite automaton search). The default is 8 MB.
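As a rough illustration of how a few of these options change matching, the sketch below compiles the same patterns with different settings. It assumes the re2 package is installed and uses only re2_regexp, re2_detect and re2_match; the sample strings are invented for the example.

```r
library(re2)

# dot_nl: by default `.` does not match a newline.
re2_detect("a\nb", re2_regexp("a.b"))                 # default dot_nl = FALSE
re2_detect("a\nb", re2_regexp("a.b", dot_nl = TRUE))  # `.` may now match "\n"

# longest_match: prefer the longest alternative over the first one listed.
first   <- re2_regexp("a|ab|abc")
longest <- re2_regexp("a|ab|abc", longest_match = TRUE)
re2_match("abc", first)
re2_match("abc", longest)

# Outside POSIX mode, multi-line anchors need (?m) in the pattern itself.
txt <- "foo\nbar"
re2_detect(txt, re2_regexp("^bar$"))      # anchors refer to the whole text
re2_detect(txt, re2_regexp("(?m)^bar$"))  # anchors refer to each line

# In POSIX mode, one_line selects between the two anchor behaviours.
re2_detect(txt, re2_regexp("^bar$", posix_syntax = TRUE))
re2_detect(txt, re2_regexp("^bar$", posix_syntax = TRUE, one_line = TRUE))
```

Options are fixed at compile time, so comparing two behaviours means compiling the pattern twice, as done here.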

Value

Compiled regular expression.
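Because the return value is a compiled pattern, it can be created once and passed to several matching calls. A minimal sketch, assuming the re2 package is installed and that re2_detect accepts a character vector as its first argument; the variable names and input strings are made up for illustration.

```r
library(re2)

# Compile once ...
version_re <- re2_regexp("version (\\d+)")

# ... then reuse the same compiled object for several calls.
inputs <- c("version 12", "no number here", "version 7 beta")
re2_detect(inputs, version_re)
re2_match("version 12", version_re)

# The compiled pattern is an external pointer, not a character string.
mode(version_re)
```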

Regexp Syntax

RE2 regular expression syntax is similar to Perl's with some of the more complicated things thrown away. In particular, backreferences and generalized assertions are not available, nor is ⁠\Z⁠.

See re2_syntax for the syntax supported by RE2 and a comparison with PCRE and Perl regexps.

For those not familiar with Perl's regular expressions, here are some examples of the most commonly used extensions:

`"hello (\\w+) world"`	--	\w matches a "word" character.
`"version (\\d+)"`	--	\d matches a digit.
`"hello\\s+world"`	--	\s matches any whitespace character.
`"\\b(\\w+)\\b"`	--	\b matches the empty string at a word boundary.
`"(?i)hello"`	--	(?i) turns on case-insensitive matching.
`"/\\*(.*?)\\*/"`	--	`.*?` matches . the minimum number of times possible.

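Double backslashes are needed in ordinary R string literals, because R itself treats the backslash as an escape character. They should NOT be used in raw string literals (available since R 4.0.0), where a single backslash is written instead: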

`r"(hello (\w+) world)"`	--	\w matches a "word" character.
`r"(version (\d+))"`	--	\d matches a digit.
`r"(hello\s+world)"`	--	\s matches any whitespace character.
`r"(\b(\w+)\b)"`	--	\b matches the empty string at a word boundary.
`r"((?i)hello)"`	--	(?i) turns on case-insensitive matching.
`r"(/\*(.*?)\*/)"`	--	`.*?` matches . the minimum number of times possible.
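The two spellings can be compared directly in base R. The sketch below needs R 4.0.0 or later for raw strings; the call into re2 at the end is optional and only runs if the package is installed. The variable names are illustrative.

```r
# Each pair denotes the same pattern text; only the R spelling differs.
escaped <- c("hello (\\w+) world", "version (\\d+)", "/\\*(.*?)\\*/")
raw     <- c(r"(hello (\w+) world)", r"(version (\d+))", r"(/\*(.*?)\*/)")
identical(escaped, raw)

# writeLines() prints the characters that the regexp engine receives,
# i.e. with a single backslash in front of each escape.
writeLines(escaped)

# Optional check against RE2 itself, if the re2 package is available.
if (requireNamespace("re2", quietly = TRUE)) {
  re2::re2_detect("/* a comment */", re2::re2_regexp(raw[3]))
}
```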

When using UTF-8 encoding, case-insensitive matching will perform simple case folding, not full case folding.
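The practical consequence of simple case folding is that only one-to-one character mappings are treated as equal. The sketch below assumes the re2 package is installed; `ci` is a hypothetical helper defined here for brevity, not part of the package.

```r
library(re2)

# Hypothetical convenience wrapper: compile a case-insensitive pattern.
ci <- function(pattern) re2_regexp(pattern, case_sensitive = FALSE)

# One-to-one case mappings are folded.
re2_detect("HELLO", ci("hello"))
re2_detect("\u00c9COLE", ci("\u00e9cole"))   # capital vs small e-acute

# Full case folding would equate sharp s (U+00DF) with "ss"; simple folding
# does not, because that mapping changes the number of characters.
re2_detect("STRASSE", ci("stra\u00dfe"))
```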

Examples

re2p <- re2_regexp("hello world")
stopifnot(mode(re2p) == "externalptr")

## UTF-8 and matching interface
# By default, pattern and input text are interpreted as UTF-8.
# The Latin1 option causes them to be interpreted as Latin-1.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
re2_detect(x, re2_regexp("fa\xE7", encoding = "Latin1"))

## Case insensitive
re2_detect("fOobar ", re2_regexp("Foo", case_sensitive = FALSE))

## Literal string (as opposed to regular expression)
## Matches only when 'literal' option is TRUE
re2_detect("foo\\$bar", re2_regexp("foo\\$b", literal = TRUE))
re2_detect("foo\\$bar", re2_regexp("foo\\$b", literal = FALSE))

## Use of never_nl
re <- re2_regexp("(abc(.|\n)*def)", never_nl = FALSE)
re2_match("abc\ndef\n", re)
re <- re2_regexp("(abc(.|\n)*def)", never_nl = TRUE)
re2_match("abc\ndef\n", re)

re2 documentation built on April 4, 2025, 1:42 a.m.