capture_first_vec: Capture first match in each character vector element
In nc: Named Capture to Data Tables

capture_first_vec

R Documentation

Capture first match in each character vector element

Description

Use a regular expression (regex) with capture groups to extract the first matching text from each of several subject strings. For all matches in one multi-line text file or string use capture_all_str. For the first match in every row of a data.frame, using a different regex for each column, use capture_first_df. For reading regularly named files, use capture_first_glob. For matching column names in a wide data frame and then melting/reshaping those columns to a taller/longer data frame, see capture_melt_single and capture_melt_multiple. To simplify the definition of the regex you can use field, quantifier, and alternatives.

Usage

capture_first_vec(..., 
    nomatch.error = getOption("nc.nomatch.error", 
        TRUE), engine = getOption("nc.engine", 
        "PCRE"), type.convert = getOption("nc.type.convert", 
        FALSE))

Arguments

`...`	subject, name1=pattern1, fun1, etc. The first argument must be a character vector of length>0 (subject strings to parse with a regex). Arguments after the first specify the regex/conversion and must be string/list/function. All character strings are pasted together to obtain the final regex used for matching. Each string/list with a named argument in R becomes a capture `group` in the regex, and the name is used for the corresponding column of the output data table. Each function must be un-named, and is used to convert the previous capture `group`. Each un-named list becomes a non-capturing `group`. Elements in each list are parsed recursively using these rules.
`nomatch.error`	if TRUE (default), stop with an error if any subject does not match; otherwise subjects that do not match are reported as missing/NA rows of the result.
`engine`	character string, one of PCRE, ICU, RE2
`type.convert`	Default conversion function, which will be used on each capture `group`, unless a specific conversion is specified for that `group`. If TRUE, use `type.convert`; if FALSE, use `identity`; otherwise must be a function of at least one argument (character), returning an atomic vector of the same length.

Value

data.table with one row for each subject, and one column for each capture group.

Author(s)

Toby Hocking <toby.hocking@r-project.org> [aut, cre]

Examples


chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222") # two possible matches.
## Find the first match in each element of the subject character
## vector. Named argument values are used to create capture groups
## in the generated regex, and argument names become column names in
## the result.
(dt.chr.cols <- nc::capture_first_vec(
  chr.pos.vec,
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+"))

## Even when no type conversion functions are specified, the result
## is always a data.table:
str(dt.chr.cols)

## Conversion functions are used to convert the previously named
## group, and patterns may be saved in lists for re-use.
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
int.pattern <- list("[0-9,]+", keep.digits)
range.pattern <- list(
  chrom="chr.*?",
  ":",
  chromStart=int.pattern,
  list( # un-named list becomes non-capturing group.
    "-",
    chromEnd=int.pattern
  ), "?") # chromEnd is optional.
(dt.int.cols <- nc::capture_first_vec(
  chr.pos.vec, range.pattern))

## Conversion functions used to create non-char columns.
str(dt.int.cols)

## NA used to indicate no match or missing subject.
na.vec <- c(
  "this will not match",
  NA, # neither will this.
  chr.pos.vec)
nc::capture_first_vec(na.vec, range.pattern, nomatch.error=FALSE)

## another subject from https://adventofcode.com/2024/day/14
## type.convert=TRUE means to use utils::type.convert as default
## conversion function
pvxy.subject <- c("p=0,4 v=3,-3","p=6,3 v=-1,-3")
nc::capture_first_vec(
  pvxy.subject,
  "p=",
  px="[0-9]",
  ",",
  py="[0-9]",
  " v=",
  vx="[-0-9]+",
  ",",
  vy="[-0-9]+",
  type.convert=TRUE)

## to do the same as above but with less repetition:
g <- function(prefix,suffix)nc::group(
  name=paste0(prefix,suffix),
  "[-0-9]+")
xy <- function(prefix)list(
  prefix,
  "=",
  g(prefix,"x"),
  ",",
  g(prefix,"y"))
nc::capture_first_vec(
  pvxy.subject,
  xy("p"),
  " ",
  xy("v"),
  type.convert=TRUE)

## or use a sub-pattern list without type.convert arg:
ipat <- list("[-0-9]+", as.integer)
nc::capture_first_vec(
  pvxy.subject,
  "p=",
  px=ipat,
  ",",
  py=ipat,
  " v=",
  vx=ipat,
  ",",
  vy=ipat)

nc documentation built on April 21, 2026, 9:06 a.m.