knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are several "helper" functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}
The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the variable group or output column:
show.patterns(
  "variable: (?<variable>.*)",       #repetitive regex string
  list("variable: ", variable=".*"), #repetitive nc R code
  nc::field("variable", ": ", ".*")) #helper function avoids repetition
Note that the first version above has a named capture group, whereas the second and third patterns generated by nc have an un-named capture group and some non-capturing groups (but they all match the same pattern).
Another example:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
Another example:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
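To see what such a pattern does when it is used for matching, the sketch below passes an nc::field pattern to nc::capture_first_vec, the matching function used later in this document. The subject strings and the name field.dt are made up for illustration.

```r
# Hypothetical subjects of the form "variable: value".
subject <- c("variable: foo", "variable: bar")

# nc::field("variable", ": ", ".*") matches the literal text
# "variable: " and then captures the rest of the string in a
# group (and output column) named variable.
field.dt <- nc::capture_first_vec(
  subject,
  nc::field("variable", ": ", ".*"))
print(field.dt)
```

The result should be a data table with a single character column, variable, which holds the text after the colon and space in each subject.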
Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                  #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),      #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))  #quantifier helper function
Another example with a named capture group inside an optional non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
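As a rough illustration of an optional group in a full pattern, the sketch below matches genomic ranges in which the end position may be absent. The subjects, the chrom and chromStart sub-patterns, and the name range.dt are assumptions made for this example; only the nc::quantifier call comes from the text above.

```r
# Hypothetical subjects: the second one has no "-end" part.
subject <- c("chr1:100-200", "chr2:300")

# The quantifier "?" makes the whole group "-" + chromEnd
# optional, so both subjects match the pattern.
range.dt <- nc::capture_first_vec(
  subject,
  chrom="chr[0-9]+",
  ":",
  chromStart="[0-9]+",
  nc::quantifier("-", chromEnd="[0-9]+", "?"))
print(range.dt)
```

When the optional group does not match, as for the second subject, the chromEnd column contains an empty string.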
We also provide a helper function, nc::alternatives, for defining regex patterns with alternation. The following three patterns are equivalent:
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
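The sketch below shows one way the alternation pattern might be used for matching. The subjects and the name alt.dt are hypothetical; each subject matches exactly one of the two alternatives.

```r
# Hypothetical subjects: the first matches "bar+", the second "fo+".
subject <- c("barr", "foo")

# Each named alternative becomes its own output column; the
# column for the alternative that did not match is empty.
alt.dt <- nc::capture_first_vec(
  subject,
  nc::alternatives(first="bar+", second="fo+"))
print(alt.dt)

# Count how many subjects matched each alternative.
n.matched <- sapply(alt.dt, function(x) sum(x != ""))
print(n.matched)
```

Because each alternative has its own name, counting the non-empty entries in each column gives the number of subjects matched by that alternative.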
nc::alternatives_with_shared_groups for alternatives with identical named sub-pattern groups

Sometimes each alternative is just a re-arrangement of the same sub-patterns. For example, consider the following subjects, each of which is a date in one of two formats.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
In each of the two formats, the month consists of three lower-case letters, the day consists of two digits, and the year consists of four digits. Is there a single pattern that can match each of these subjects? Yes, such a pattern can be defined using the code below,
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(month, " ", day, ", ", year),
  list(day, " ", month, " ", year))
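In the code above, we used nc::alternatives_with_shared_groups, which requires two kinds of arguments: named arguments (month, day, year), which define the sub-pattern groups shared by the alternatives, and un-named arguments (the last two), which define the alternative patterns, each of which can refer to the sub-pattern groups by name.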
The pattern can be used for matching, and the result is a data table with one column for each unique name,
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
After having parsed the dates into these three columns, we can add a date column:
Sys.setlocale(locale="C")#to recognize months in English.
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
nc::altlist for named alternatives

For most use cases, nc::alternatives_with_shared_groups is sufficient, but one case where it does not work is when you want to name each alternative (for example, to easily count how many matches to each alternative there were). In that case you can instead use nc::altlist as in the code below,
shared.groups <- nc::altlist(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer))
alt.args <- with(shared.groups, list(
  american=list(month, " ", day, ", ", year),
  european=list(day, " ", month, " ", year)))
pattern <- do.call(nc::alternatives, alt.args)
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
match.dt[, lapply(.SD, function(x)sum(x!="")), .SDcols=names(alt.args)]