README.md

fugly

This package provides a single function (str_capture) for using named capture groups to extract values from strings. A key requirement for readability is that the names of the capture groups are specified inline as part of the regex, and not in an external vector or as separate names.

fugly::str_capture() is implemented as a wrapper around stringr, because stringr itself does not yet support named capture groups (see the issues for stringr and stringi).

fugly::str_capture() is very similar to functions in a number of existing packages. See the table below for a comparison.

| Method                      | Speed    | Inline capture group naming | Robust |
|-----------------------------|----------|-----------------------------|--------|
| fugly::str_capture          | Fast     | Yes                         | No     |
| rr4r::rr4r_extract_groups   | Fast     | Yes                         | Yes    |
| nc::capture_first_vec       | Fast     | No                          | Yes    |
| tidyr::extract              | Fast     | No                          | Yes    |
| utils::strcapture           | Middling | No                          | Yes    |
| unglue::unglue              | Slow     | Yes                         | Yes    |
| ore::ore_search             | Slow     | Yes                         | Yes    |

What do I mean when I say fugly::str_capture() is unsafe/dodgy/non-robust?

What’s in the box?

Installation

You can install from GitHub with:

# install.packages('remotes')
remotes::install_github('coolbutuseless/fugly')

Example 1

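In the following example, the pattern defines two named capture groups: {name}, which is given no regex of its own, and {age=\\d+}, which supplies a regex after the =. The result is a data.frame with one column per capture group: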

library(fugly)

string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)

str_capture(string, pattern = "Name:{name} Age:{age=\\d+}")
#>   name age
#> 1 greg  27
#> 2 mary  34
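
To make the inline naming idea concrete, here is a minimal base R sketch of how a pattern such as the one above could be translated into an ordinary regex with unnamed capture groups, with the names re-attached afterwards. This is not fugly's actual implementation (which wraps stringr). The helper name capture_sketch, the rule for valid group names, and the lazy `.*?` default for groups without a regex are all assumptions made for illustration.

```r
# Hypothetical helper for illustration only; not fugly's actual implementation.
capture_sketch <- function(string, pattern) {
  # Assumed rule for what counts as a group name
  name_re <- "[A-Za-z.][A-Za-z0-9._]*"

  # 1. Give bare placeholders an explicit (assumed) lazy default:
  #    {name} becomes {name=.*?}
  pattern <- gsub(paste0("\\{(", name_re, ")\\}"), "{\\1=.*?}", pattern, perl = TRUE)

  # 2. Collect the group names in order of appearance
  token_re    <- paste0("\\{(", name_re, ")=([^}]*)\\}")
  tokens      <- regmatches(pattern, gregexpr(token_re, pattern, perl = TRUE))[[1]]
  group_names <- sub(token_re, "\\1", tokens, perl = TRUE)

  # 3. Replace every placeholder with an ordinary capture group
  regex <- gsub(token_re, "(\\2)", pattern, perl = TRUE)

  # 4. Match each string and keep only the captured groups
  #    (a row of NA where the pattern did not match)
  matches <- regmatches(string, regexec(regex, string, perl = TRUE))
  rows <- lapply(matches, function(x) {
    if (length(x) > 0) x[-1] else rep(NA_character_, length(group_names))
  })

  # 5. Bind the rows and re-attach the names taken from the pattern
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- group_names
  out
}

string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)
capture_sketch(string, pattern = "Name:{name} Age:{age=\\d+}")
```

The sketch returns every column as character. It also mis-parses any custom regex that itself contains a closing brace (for example `\\d{2}`), so it should be read as a demonstration of the idea rather than a drop-in replacement.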

Example 2

A more complicated example:

string <- c(
'{"type":"Feature","properties":{"hash":"1348778913c0224a","number":"27","street":"BANAMBILA STREET","unit":"","city":"ARANDA","district":"","region":"ACT","postcode":"2614","id":"GAACT714851647"},"geometry":{"type":"Point","coordinates":[149.0826143,-35.2545558]}}',
'{"type":"Feature","properties":{"hash":"dc776871c868bc7e","number":"139","street":"BOUVERIE STREET","unit":"UNIT 711","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC423944917"},"geometry":{"type":"Point","coordinates":[144.9617149,-37.8032551]}}',
'{"type":"Feature","properties":{"hash":"8197f34a40ccad47","number":"6","street":"MOGRIDGE STREET","unit":"","city":"WARWICK","district":"","region":"QLD","postcode":"4370","id":"GAQLD155949502"},"geometry":{"type":"Point","coordinates":[152.0230999,-28.2230133]}}',
'{"type":"Feature","properties":{"hash":"18edc96308fc1a8e","number":"22","street":"ORR STREET","unit":"UNIT 507","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC424282716"},"geometry":{"type":"Point","coordinates":[144.9653484,-37.8063371]}}'
)


str_capture(string, pattern = '"number":"{number}","street":"{street}".*?"coordinates":\\[{coords}\\]')
#>   number           street                  coords
#> 1     27 BANAMBILA STREET 149.0826143,-35.2545558
#> 2    139  BOUVERIE STREET 144.9617149,-37.8032551
#> 3      6  MOGRIDGE STREET 152.0230999,-28.2230133
#> 4     22       ORR STREET 144.9653484,-37.8063371

Simple Benchmark

I acknowledge that this isn’t the greatest benchmark, but it is relevant to my current use-case.

# remotes::install_github("jonclayden/ore")
# remotes::install_github("yutannihilation/rr4r")
# remotes::install_github('qinwf/re2r') 
library(ore)
library(rr4r)
library(unglue)
library(ggplot2)
library(tidyr)

# meaningless strings for benchmarking
N <- 1000
string <- paste0("Information name:greg age:", seq(N))


res <- bench::mark(
  `fugly::str_capture()` = fugly::str_capture(string, "name:{name} age:{age=\\d+}"),
  `unglue::unglue()` = unglue::unglue_data(string, "Information name:{name} age:{age=\\d+}"),
  `utils::strcapture()` = utils::strcapture(
    "Information name:(.*?) age:(\\d+)", string,
    proto = data.frame(name = character(), age = character())
  ),
  `ore::ore_search()` = do.call(
    rbind.data.frame,
    lapply(
      ore_search(ore('name:(?<name>.*?) age:(?<age>\\d+)', encoding = 'utf8'), string, all = TRUE),
      function(x) {x$groups$matches}
    )
  ),
  `rr4r::rr4r_extract_groups()` = rr4r::rr4r_extract_groups(string, "name:(?P<name>.*?) age:(?P<age>\\d+)"),
  `nc::capture_first_vec() PCRE` = nc::capture_first_vec(string, "Information name:", name=".*?", " age:", age="\\d+", engine = 'PCRE'),
  `tidyr::extract()` = tidyr::extract(data.frame(x = string), x, into = c('name', 'age'), regex = 'name:(.*?) age:(\\d+)'),
  check = FALSE
)
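
Because check = FALSE turns off bench::mark()'s comparison of results, the benchmark itself does not confirm that the methods extract the same values. The following is a small sanity check, sketched in base R only, that compares utils::strcapture() against a hand-rolled regexec()/regmatches() extraction on the same benchmark strings. The variable names a, b and m are illustrative, and stringsAsFactors = FALSE is added to the prototype as a precaution for older R versions.

```r
# Same meaningless strings as in the benchmark
N <- 1000
string <- paste0("Information name:greg age:", seq(N))

# Approach 1: utils::strcapture() with an explicit prototype
a <- utils::strcapture(
  "Information name:(.*?) age:(\\d+)", string,
  proto = data.frame(name = character(), age = character(),
                     stringsAsFactors = FALSE)
)

# Approach 2: regexec() + regmatches(), dropping the full match
# and naming the columns by hand
m <- regmatches(string, regexec("name:(.*?) age:(\\d+)", string))
b <- as.data.frame(do.call(rbind, lapply(m, `[`, -1)), stringsAsFactors = FALSE)
names(b) <- c("name", "age")

# Stops with an error if the two approaches disagree
stopifnot(identical(a$name, b$name), identical(a$age, b$age))
```

The same comparison could be repeated for the other packages once they are installed. Their return types differ (for example, lists or tibbles rather than plain data frames), which is presumably why the benchmark disables the built-in check.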

Related Software

Acknowledgements



coolbutuseless/fugly documentation built on Dec. 19, 2021, 6:03 p.m.