regex | R Documentation |
This help page documents the regular expression patterns supported by
grep and related functions grepl, regexpr, gregexpr, sub and gsub, as
well as by strsplit and optionally by agrep and agrepl.
A ‘regular expression’ is a pattern that describes a set of
strings. Two types of regular expressions are used in R,
extended regular expressions (the default) and
Perl-like regular expressions used by perl = TRUE.
There is also fixed = TRUE which can be considered to use a
literal regular expression.
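As an illustrative sketch (the input vector x is invented for this example), the difference between the default mode and fixed = TRUE can be seen with a pattern that contains a metacharacter:

```r
x <- c("a.c", "abc")   # hypothetical example data

# Default (extended regular expression): '.' matches any single character
ere_hits <- grepl("a.c", x)

# fixed = TRUE: the pattern is taken literally, so '.' is just a dot
fixed_hits <- grepl("a.c", x, fixed = TRUE)

print(ere_hits)    # TRUE TRUE
print(fixed_hits)  # TRUE FALSE
```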
Other functions which use regular expressions (often via the use of
grep) include apropos, browseEnv, help.search, list.files and ls.
These will all use extended regular expressions.
Patterns are described here as they would be printed by cat
(do remember that backslashes need to be doubled when entering R
character strings, e.g. from the keyboard).
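A small sketch of the doubling rule, using an arbitrary pattern chosen for illustration. The raw-string form in the last step assumes R 4.0.0 or later:

```r
# The pattern \. (backslash, dot) must be typed with a doubled backslash
pattern <- "\\."

cat(pattern, "\n")          # shows the pattern as this page writes it: \.
n_chars <- nchar(pattern)   # 2: one backslash and one dot
print(n_chars)

# Assumption: R >= 4.0.0, where raw strings avoid the doubling
same_as_raw <- identical(pattern, r"(\.)")
print(same_as_raw)          # TRUE
```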
Long regular expression patterns may or may not be accepted: the POSIX standard only requires up to 256 bytes.
This section covers the regular expressions allowed in the default
mode of grep, grepl, regexpr, gregexpr, sub, gsub, regexec and
strsplit. They use an implementation of the POSIX 1003.2 standard:
that allows some scope for interpretation and the interpretations here
are those currently used by R. The implementation supports some
extensions to the standard.
Regular expressions are constructed analogously to arithmetic
expressions, by using various operators to combine smaller
expressions. The whole expression matches zero or more characters
(read ‘character’ as ‘byte’ if useBytes = TRUE).
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
Escaping non-metacharacters with a backslash is implementation-dependent. The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as LF, \r as CR and \t as TAB. (Note that these will be interpreted by R's parser in literal character strings.)
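For instance, quoting a metacharacter changes what a pattern matches. The following is a minimal sketch with made-up input strings:

```r
x <- c("1+1", "11", "cost: $5")   # hypothetical example data

# Unescaped: '+' is a quantifier, so the pattern means "one or more 1s, then 1"
unescaped <- grepl("1+1", x)

# Escaped: '\+' (typed "\\+" in R) matches a literal plus sign
escaped_plus <- grepl("1\\+1", x)

# '$' normally anchors the end of a line; escaped, it matches a dollar sign
escaped_dollar <- grepl("\\$5", x)

print(unescaped)       # FALSE TRUE FALSE
print(escaped_plus)    # TRUE FALSE FALSE
print(escaped_dollar)  # FALSE FALSE TRUE
```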
A character class is a list of characters enclosed between
[ and ] which matches any single character in that list;
unless the first character of the list is the caret ^, when it
matches any character not in the list. For example, the
regular expression [0123456789] matches any single digit, and
[^abc] matches anything except the characters a,
b or c. A range of characters may be specified by
giving the first and last characters, separated by a hyphen. (Because
their interpretation is locale- and implementation-dependent,
character ranges are best avoided. Some but not all implementations
include both cases in ranges when doing caseless matching.) The only
portable way to specify all ASCII letters is to list them all as the
character class
[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz].
(The
current implementation uses numerical order of the encoding, normally a
single-byte encoding or Unicode points.)
Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
[:alnum:]
    Alphanumeric characters: [:alpha:] and [:digit:].
[:alpha:]
    Alphabetic characters: [:lower:] and [:upper:].
[:blank:]
    Blank characters: space and tab, and possibly other
    locale-dependent characters such as non-breaking space.
[:cntrl:]
    Control characters. In ASCII, these characters have octal codes
    000 through 037, and 177 (DEL). In another character set, these
    are the equivalent characters, if any.
[:digit:]
    Digits: 0 1 2 3 4 5 6 7 8 9.
[:graph:]
    Graphical characters: [:alnum:] and [:punct:].
[:lower:]
    Lower-case letters in the current locale.
[:print:]
    Printable characters: [:alnum:], [:punct:] and space.
[:punct:]
    Punctuation characters:
    ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:space:]
    Space characters: tab, newline, vertical tab, form feed, carriage
    return, space and possibly other locale-dependent characters.
[:upper:]
    Upper-case letters in the current locale.
[:xdigit:]
    Hexadecimal digits:
    0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
For example, [[:alnum:]] means [0-9A-Za-z], except the
latter depends upon the locale and the character encoding, whereas the
former is independent of locale and character set. (Note that the
brackets in these class names are part of the symbolic names, and must
be included in addition to the brackets delimiting the bracket list.)
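A short sketch of named classes used inside a bracket list. The input string is invented, and only ASCII characters are used so the result should not depend on the locale:

```r
x <- "Hello, World! 123"   # hypothetical example data

# Outer brackets delimit the list, inner [:punct:] names the class
no_punct <- gsub("[[:punct:]]", "", x)

# A caret right after the opening bracket negates the whole list
only_alnum <- gsub("[^[:alnum:]]", "", x)

print(no_punct)    # "Hello World 123"
print(only_alnum)  # "HelloWorld123"
```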
Most metacharacters lose their special meaning inside a character
class. To include a literal ], place it first in the list.
Similarly, to include a literal ^, place it anywhere but first.
Finally, to include a literal -, place it first or last (or,
for perl = TRUE only, precede it by a backslash). (Only
^ - \ ] are special inside character classes.)
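The placement rules can be sketched as follows. The test strings and the filler character x inside the classes are arbitrary choices for illustration:

```r
x <- c("a]b", "a^b", "a-b", "ab")   # hypothetical example data

has_bracket <- grepl("[]]", x)    # ']' placed first is literal
has_caret   <- grepl("[x^]", x)   # '^' placed anywhere but first is literal
has_hyphen  <- grepl("[x-]", x)   # '-' placed last is literal

# With perl = TRUE a hyphen may also be escaped, so x\-z is not a range
escaped_hyphen <- grepl("[x\\-z]", c("-", "y"), perl = TRUE)

print(has_bracket)     # TRUE FALSE FALSE FALSE
print(has_caret)       # FALSE TRUE FALSE FALSE
print(has_hyphen)      # FALSE FALSE TRUE FALSE
print(escaped_hyphen)  # TRUE FALSE
```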
The period . matches any single character. The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]). Symbols \d, \s, \D and \S denote the digit and space classes and their negations (these are all extensions).
The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. The symbols \< and \> match the empty string at the beginning and end of a word. The symbol \b matches the empty string at either edge of a word, and \B matches the empty string provided it is not at an edge of a word. (The interpretation of ‘word’ depends on the locale and implementation: these are all extensions.)
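As a rough illustration of line anchors and word boundaries (the example strings are made up, and the notion of a ‘word’ may differ in other locales):

```r
x <- c("cat", "concatenate", "the cat sat")   # hypothetical example data

whole_string <- grepl("^cat$", x)       # anchored at both ends
whole_word   <- grepl("\\bcat\\b", x)   # 'cat' as a separate word
word_edges   <- grepl("\\<cat\\>", x)   # same idea using \< and \>
inside_word  <- grepl("\\Bcat\\B", x)   # 'cat' strictly inside a longer word

print(whole_string)  # TRUE FALSE FALSE
print(whole_word)    # TRUE FALSE TRUE
print(word_edges)    # TRUE FALSE TRUE
print(inside_word)   # FALSE TRUE FALSE
```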
A regular expression may be followed by one of several repetition quantifiers:
?
    The preceding item is optional and will be matched at most once.
*
    The preceding item will be matched zero or more times.
+
    The preceding item will be matched one or more times.
{n}
    The preceding item is matched exactly n times.
{n,}
    The preceding item is matched n or more times.
{n,m}
    The preceding item is matched at least n times, but not more than
    m times.
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to ‘minimal’ by appending ?
to the quantifier. (There are further quantifiers that allow
approximate matching: see the TRE documentation.)
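The difference between greedy and minimal repetition, and an interval quantifier, can be sketched like this (the input strings are invented for the example):

```r
x <- "<b>bold</b> and <i>italic</i>"   # hypothetical example data

# Greedy: .+ runs from the first '<' to the last '>', i.e. the whole string
greedy <- sub("<.+>", "", x)

# Minimal: .+? stops at the first '>' it can, so only "<b>" is removed
minimal <- sub("<.+?>", "", x)

# Interval: between two and three 'a's, anchored to the whole string
interval <- grepl("^a{2,3}$", c("a", "aa", "aaa", "aaaa"))

print(greedy)    # ""
print(minimal)   # "bold</b> and <i>italic</i>"
print(interval)  # FALSE TRUE TRUE FALSE
```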
Regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating the substrings that match the concatenated subexpressions.
Two regular expressions may be joined by the infix operator |;
the resulting regular expression matches any string matching either
subexpression. For example, abba|cde matches either the
string abba or the string cde. Note that alternation
does not work inside character classes, where | has its literal
meaning.
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
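A brief sketch of how these precedence rules play out, using invented test strings:

```r
x <- c("ab", "abab", "abb", "ac")   # hypothetical example data

# Repetition binds tighter than concatenation: '+' applies to 'b' only
rep_on_b <- grepl("^ab+$", x)

# Parentheses make the repetition apply to the whole of 'ab'
rep_on_ab <- grepl("^(ab)+$", x)

# Alternation binds loosest: this reads as (^ab) or (c$)
loose_alt <- grepl("^ab|c$", x)

# Parentheses restrict the alternation to the part between the anchors
tight_alt <- grepl("^(ab|c)$", x)

print(rep_on_b)   # TRUE FALSE TRUE FALSE
print(rep_on_ab)  # TRUE TRUE FALSE FALSE
print(loose_alt)  # TRUE TRUE TRUE TRUE
print(tight_alt)  # TRUE FALSE FALSE FALSE
```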
The backreference \N, where N = 1 ... 9, matches the substring previously matched by the Nth parenthesized subexpression of the regular expression. (This is an extension for extended regular expressions: POSIX defines them only for basic ones.)
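A minimal sketch of backreferences, first inside a pattern and then in a sub replacement (the words used are arbitrary examples):

```r
x <- c("hello", "world", "bookkeeper")   # hypothetical example data

# \1 must repeat whatever the first group matched: finds doubled characters
has_double <- grepl("(.)\\1", x)

# In the replacement of sub, \1 and \2 refer to the captured groups
swapped <- sub("^(\\w+) (\\w+)$", "\\2 \\1", "hello world")

print(has_double)  # TRUE FALSE TRUE
print(swapped)     # "world hello"
```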
The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and
strsplit switches to the PCRE library that implements regular
expression pattern matching using the same syntax and semantics as
Perl 5.x, with just a few differences.
For complete details please consult the man pages for PCRE, especially
man pcrepattern and man pcreapi, on your system or
from the sources at https://www.pcre.org. (The version in use can be
found by calling extSoftVersion. It need not be the version
described in the system's man page. PCRE1 (reported as version < 10.00
by extSoftVersion) has been feature-frozen for some time
(essentially 2012); the man pages at
https://www.pcre.org/original/doc/html/ should be a good match.
PCRE2 (PCRE version >= 10.00) has man pages at
https://www.pcre.org/current/doc/html/.)
Perl regular expressions can be computed byte-by-byte or
(UTF-8) character-by-character: the latter is used in all multibyte
locales and if any of the inputs are marked as UTF-8 (see
Encoding), or as Latin-1 except in a Latin-1 locale.
All the regular expressions described for extended regular expressions
are accepted except \< and \>: in Perl all backslashed
metacharacters are alphanumeric and backslashed symbols always are
interpreted as a literal character. { is not special if it
would be the start of an invalid interval specification. There can be
more than 9 backreferences (but the replacement in sub
can only refer to the first 9).
Character ranges are interpreted in the numerical order of the characters, either as bytes in a single-byte locale or as Unicode code points in UTF-8 mode. So in either case [A-Za-z] specifies the set of ASCII letters.
In UTF-8 mode the named character classes only match ASCII characters: see \p below for an alternative.
The construct (?...) is used for Perl extensions in a variety of ways depending on what immediately follows the ?.
Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x). These can be concatenated, so for example, (?im) sets caseless multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx). These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set ‘ungreedy’ mode (so matching is minimal unless ? is used as part of the repetition quantifier, when it is greedy). Initially none of these options are set.
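The mode options can be tried out with a two-line subject string. This is only a sketch with made-up input, and every call needs perl = TRUE:

```r
x <- "first line\nSecond line"   # hypothetical two-line subject

case_default <- grepl("second", x, perl = TRUE)          # case matters
case_i       <- grepl("(?i)second", x, perl = TRUE)      # caseless

line_default <- grepl("^Second", x, perl = TRUE)         # ^ only at subject start
line_m       <- grepl("(?m)^Second", x, perl = TRUE)     # ^ after each newline

dot_default  <- grepl("first.+Second", x, perl = TRUE)      # dot stops at newline
dot_s        <- grepl("(?s)first.+Second", x, perl = TRUE)  # dot matches newline

# (?U) makes quantifiers minimal by default
ungreedy <- sub("(?U)<.+>", "", "<a><b>", perl = TRUE)

print(c(case_default, case_i))   # FALSE TRUE
print(c(line_default, line_m))   # FALSE TRUE
print(c(dot_default, dot_s))     # FALSE TRUE
print(ungreedy)                  # "<b>"
```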
If you want to remove the special meaning from a sequence of characters, you can do so by putting them between \Q and \E. This is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in Perl, $ and @ cause variable interpolation.
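For example, a literal that contains several metacharacters can be wrapped in \Q and \E instead of escaping each one. The strings below are invented for illustration:

```r
x <- c("price: $5.00 (approx.)", "price: 5x00")   # hypothetical example data

# Everything between \Q and \E is literal, including $ . ( and )
quoted <- grepl("\\Q$5.00 (approx.)\\E", x, perl = TRUE)

print(quoted)  # TRUE FALSE
```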
The escape sequences \d, \s and \w represent
any decimal digit, space character and ‘word’ character
(letter, digit or underscore in the current locale: in UTF-8 mode only
ASCII letters and digits are considered) respectively, and their
upper-case versions represent their negation. Vertical tab was not
regarded as a space character in a C locale before PCRE 8.34.
Sequences \h, \v, \H and \V match
horizontal and vertical space or the negation. (In UTF-8 mode, these
do match non-ASCII Unicode code points.)
There are additional escape sequences: \cx is cntrl-x for any x, \ddd is the octal character (for up to three digits unless interpretable as a backreference, as \1 to \7 always are), and \xhh specifies a character by two hex digits. In a UTF-8 locale, \x{h...} specifies a Unicode code point by one or more hex digits. (Note that some of these will be interpreted by R's parser in literal character strings.)
Outside a character class, \A matches at the start of a subject (even in multiline mode, unlike ^), \Z matches at the end of a subject or before a newline at the end, \z matches only at end of a subject, and \G matches at first matching position in a subject (which is subtly different from Perl's end of the previous match). \C matches a single byte, including a newline, but its use is warned against. In UTF-8 mode, \R matches any Unicode newline character (not just CR), and \X matches any number of Unicode characters that form an extended Unicode sequence. \X, \R and \B cannot be used inside a character class (with PCRE1, they are treated as characters X, R and B; with PCRE2 they cause an error).
A hyphen (minus) inside a character class is treated as a range, unless it is first or last character in the class definition. It can be quoted to represent the hyphen literal (\-). PCRE1 allows an unquoted hyphen at some other locations inside a character class where it cannot represent a valid range, but PCRE2 reports an error in such cases.
In UTF-8 mode, some Unicode properties may be supported via
\p{xx} and \P{xx} which match characters with and
without property xx respectively. For a list of supported
properties see the PCRE documentation, but for example Lu is
‘upper case letter’ and Sc is ‘currency symbol’.
(This support depends on the PCRE library being compiled with
‘Unicode property support’ which can be checked via
pcre_config. PCRE2 when compiled with Unicode support always
also supports Unicode properties.)
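A hedged sketch of a property match, assuming the PCRE library in use was built with Unicode property support. The subject contains a non-ASCII character written with a \u escape, so R marks it as UTF-8 and UTF-8 mode is used:

```r
# Hypothetical example data: the euro sign is written as \u20ac
x <- "Price: \u20ac10 or $12"

# \p{Sc} matches any currency symbol (assumes Unicode property support)
no_currency <- gsub("\\p{Sc}", "", x, perl = TRUE)

print(no_currency)  # "Price: 10 or 12" when property support is available
```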
The sequence (?# marks the start of a comment which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part at all in the pattern matching.
If the extended option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.
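Both comment forms can be sketched as follows. The phone-number-like pattern and its comments are invented for the example:

```r
# Inline comment: (?# ... ) is ignored by the matcher
inline <- grepl("ab(?# this part is a comment)c", "abc", perl = TRUE)

# Extended mode: unescaped whitespace is ignored and '#' starts a comment
# that runs to the next newline in the pattern
pat <- "(?x) \\d{3}  # three digits\n - \\d{4}  # hyphen, then four digits"
extended <- grepl(pat, c("555-1234", "5551234"), perl = TRUE)

print(inline)    # TRUE
print(extended)  # TRUE FALSE
```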
The pattern (?:...) groups characters just as parentheses do but does not make a backreference.
Patterns (?=...) and (?!...) are zero-width positive and
negative lookahead assertions: they match if an attempt to
match the ... forward from the current position would succeed
(or not), but use up no characters in the string being processed.
Patterns (?<=...) and (?<!...) are the lookbehind
equivalents: they do not allow repetition quantifiers nor \C
in ....
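The assertions can be illustrated with a few invented strings. Note that the text matched inside an assertion is not part of the match itself:

```r
x <- c("price 100 USD", "price 200 EUR")   # hypothetical example data

# Positive lookahead: digits only when followed by " USD"
usd_amount <- regmatches(x, regexpr("\\d+(?= USD)", x, perl = TRUE))

# Negative lookahead: 'foo' not followed by 'bar'
not_bar <- grepl("foo(?!bar)", c("foobar", "foobaz"), perl = TRUE)

# Positive lookbehind: replace 'b' only when preceded by 'a'
after_a <- gsub("(?<=a)b", "X", "abab cb", perl = TRUE)

print(usd_amount)  # "100"
print(not_bar)     # FALSE TRUE
print(after_a)     # "aXaX cb"
```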
regexpr and gregexpr support ‘named capture’. If
groups are named, e.g., "(?<first>[A-Z][a-z]+)" then the
positions of the matches are also returned by name. (Named
backreferences are not supported by sub.)
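A sketch of reading named capture positions from the attributes returned by regexpr. The subject string and the second group name, last, are invented for the example:

```r
x <- "Ada Lovelace"   # hypothetical example data
m <- regexpr("(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)", x, perl = TRUE)

# Start positions and lengths of each group, with the group names attached
starts      <- attr(m, "capture.start")
lens        <- attr(m, "capture.length")
group_names <- attr(m, "capture.names")

# Extract the captured text from the positions
parts <- substring(x, starts, starts + lens - 1L)

print(group_names)  # "first" "last"
print(parts)        # "Ada" "Lovelace"
```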
Atomic grouping, possessive quantifiers and conditional and recursive patterns are not covered here.
This help page is based on the TRE documentation and the POSIX
standard, and the pcre2pattern man page from PCRE2 10.35.
grep, apropos, browseEnv, glob2rx, help.search, list.files,
ls, strsplit and agrep.
The TRE regexp syntax.
The POSIX 1003.2 standard at https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html.
The pcre2pattern or pcrepattern man page
(found as part of https://www.pcre.org/original/pcre.txt), and
details of Perl's own implementation at
https://perldoc.perl.org/perlre.