regexprCapture: Extract text with regexp capture groups

View source: R/stringUtils.R

regexprCapture    R Documentation

Extract text with regexp capture groups

Description

Applies a (Perl) regular expression with capture groups to text strings and returns a matrix. Each matrix column holds the text matched by one capture group (in order); each matrix row is the result of applying the regexp to one element of the text data. If a capture group does not match, the empty string is returned unless use.na = TRUE is set, in which case NA is returned. In either case, if a capture group matches but matches nothing (e.g. when * is used to match 0 or more characters and 0 are matched), an empty string is returned.

Usage

regexprCapture(re, data, use.na = FALSE)

Arguments

re

The (Perl) regular expression as a string, with capture groups. May use named capture groups ((?<name>...)). Any \ used must be doubled, e.g. a group capturing zero or more whitespace characters would be written (\\s*).

data

A vector of strings to search in. The rows in the returned matrix will be the captured text from successive elements of this vector.

use.na

Set to TRUE to return NA as the matched text for capture groups that fail to match. Defaults to FALSE.
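The backslash doubling required for re comes from R's string-literal syntax rather than from the regular expression itself. The short sketch below illustrates this with base R only; the raw-string form assumes R 4.0.0 or later, and the variable names are made up for illustration.

```r
# In a regular R string literal, "\\" denotes a single backslash character,
# so the regexp engine receives (\s*) when the source code says "(\\s*)".
doubled <- "(\\s*)"
nchar(doubled)                       # 5 characters: ( \ s * )

# Assumption: R >= 4.0.0, where raw strings avoid the doubling.
raw <- r"((\s*))"
identical(doubled, raw)              # TRUE

# Either spelling behaves the same when handed to a Perl-style matcher.
grepl("^\\s*$", "   ", perl = TRUE)  # TRUE: the string is all whitespace
```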

Details

This is implemented using regexprMatches.
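The internals of regexprMatches are not shown here. As a rough illustration of how capture-group extraction can be done in base R, the sketch below uses regexpr() with perl = TRUE and its capture.start, capture.length and capture.names attributes. The function name captureSketch is hypothetical, the treatment of use.na (marking whole rows where the regexp fails) is an assumption based on the Examples, and the sketch assumes re contains at least one capture group.

```r
# Hypothetical stand-in for illustration; not the package's own code.
captureSketch <- function(re, data, use.na = FALSE) {
  # perl = TRUE makes regexpr() record per-group match positions.
  m <- regexpr(re, data, perl = TRUE)
  starts     <- attr(m, "capture.start")   # one row per element of data
  lens       <- attr(m, "capture.length")  # one column per capture group
  groupNames <- attr(m, "capture.names")   # "" for unnamed groups

  # substring() recycles data down the columns of the position matrices.
  # A failed match has start -1 and length -1, which yields "".
  out <- matrix(substring(data, starts, starts + lens - 1L),
                nrow = length(data), ncol = ncol(starts))
  if (any(nzchar(groupNames))) colnames(out) <- groupNames

  # Assumption: use.na marks every group in a row where the regexp failed.
  if (use.na) {
    failed <- as.integer(m) == -1L
    out[failed, ] <- NA_character_
  }
  out
}

sketchRe   <- "\\s*(?<name>.*?)\\s*<\\s*(?<email>.+)\\s*>\\s*"
sketchData <- c("Stuart R. Jefferys <srj@unc.edu>", "no email", "<just@an.email>")
captureSketch(sketchRe, sketchData, use.na = TRUE)
```

Only the first match in each string is examined, because regexpr() stops at the first match; whether the package function behaves the same way is not stated in this page.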

Value

A matrix with one column per regular expression capture group and one row per data element. Columns will be named if named capture groups are used.
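Because the result is an ordinary character matrix, named capture groups can be used as column names when indexing. The sketch below builds a matrix by hand that mirrors the use.na = TRUE output shown in the Examples (so it runs without the package) and then selects columns and rows from it; the variable names are illustrative only.

```r
# Hand-built copy of the documented use.na = TRUE result, for illustration.
res <- matrix(
  c("Stuart R. Jefferys", "nonya business", NA, "",
    "srj@unc.edu", "nobody@nowhere.com", NA, "just@an.email"),
  ncol = 2,
  dimnames = list(NULL, c("name", "email"))
)

# Select one capture group by its name.
emails <- res[, "email"]

# With use.na = TRUE, NA distinguishes "regexp failed" from "matched nothing".
matched <- !is.na(res[, "email"])

# Keep only the rows where the regexp matched; drop = FALSE keeps a matrix.
res[matched, , drop = FALSE]
```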

Examples

# Capture group: (...)
# Named capture group: (?<name>...)
# Lazy quantifier: *?
regExp <- "\\s*(?<name>.*?)\\s*<\\s*(?<email>.+)\\s*>\\s*"
data <- c('Stuart R. Jefferys <srj@unc.edu>',
          'nonya business <nobody@nowhere.com>',
          'no email', '<just@an.email>' )

regexprCapture(regExp, data)
#=>      name                 email
#=> [1,] "Stuart R. Jefferys" "srj@unc.edu"
#=> [2,] "nonya business"     "nobody@nowhere.com"
#=> [3,] ""                   ""
#=> [4,] ""                   "just@an.email"

regexprCapture(regExp, data, use.na=TRUE)
#=>      name                 email
#=> [1,] "Stuart R. Jefferys" "srj@unc.edu"
#=> [2,] "nonya business"     "nobody@nowhere.com"
#=> [3,] NA                   NA
#=> [4,] ""                   "just@an.email"


jefferys/JefferysRUtils documentation built on Jan. 12, 2024, 9:18 p.m.