knitr::opts_chunk$set(
  comment = "#>",
  tidy = FALSE,
  error = FALSE,
  fig.width = 8,
  fig.height = 8)

rematch2

Match Regular Expressions with a Nicer 'API'

Linux Build Status Windows Build status CRAN RStudio mirror downloads Coverage Status

A small wrapper on regular expression matching functions regexpr and gregexpr to return the results in tidy data frames.


Installation

source("https://install-github.me/MangoTheCat/rematch2")

Rematch vs rematch2

Note that rematch2 is not compatible with the original rematch package. There are at least three major changes: The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second. In the result, .match is the last column instead of the first. * rematch2 returns tibble data frames. See https://github.com/hadley/tibble.

Usage

First match

library(rematch2)

With capture groups:

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)

Named capture groups:

isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)

A slightly more complex example:

github_repos <- c(
    "metacran/crandb",
    "jeroenooms/curl@v0.9.3",
    "jimhester/covr#47",
    "hadley/dplyr@*release",
    "mangothecat/remotes@550a3c7d3f9e1493a2ba",
    "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
    "^(?:%s%s%s%s|(?<catchall>.*))$",
    owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)

All matches

Extract all names, and also first names and last names:

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
not$first
not$last
not$.match

Match positions

re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.

pos <- re_exec(notables, name_rex)
pos

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:

pos$first$match
pos$first$start
pos$first$end

re_exec_all is very similar, but these queries return lists, with arbitrary number of matches:

allpos <- re_exec_all(notables, name_rex)
allpos
allpos$first$match
allpos$first$start
allpos$first$end

License

MIT © Mango Solutions



MangoTheCat/rematch2 documentation built on May 7, 2019, 2:11 p.m.