Capture all matches in a single subject string

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The nc::capture_all_str function is for the common case of extracting each match from a multi-line text file (a single large subject string). In this section we demonstrate how to extract data tables from such loosely structured text data. For example, consider the following track hub meta-data file:

trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="nc")
trackDb.vec <- readLines(trackDb.txt.gz)

Some representative lines from that file are shown below.

cat(trackDb.vec[78:107], sep="\n")

Match all tracks in the text file

Each block of text begins with "track" and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:

tracks.dt <- nc::capture_all_str(
  trackDb.vec, 
  "track ",
  track="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")
str(tracks.dt)

The result is a data.table with one row for each track block that matches the regex. There are two character columns: track is a unique name, and fields is a string with the rest of the data in that block:

tracks.dt[, .(track, fields.start=substr(fields, 1, 30))]
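The same pattern can be tried on a small, made-up subject to see how each block becomes one row. This is only a sketch: the names toy.subject and toy.dt, and the track names and field values inside them, are invented for illustration and do not come from trackDb.txt.gz.

```r
library(nc)

## Hypothetical subject: two track blocks, each ended by an empty line.
toy.subject <- c(
  "track bcell_Peaks",
  "type bigBed 5",
  "color 0,0,255",
  "",
  "track tcell_Coverage",
  "type bigWig",
  "")

## The lines are pasted together with newlines before matching,
## so the pattern can use \n to describe line boundaries.
toy.dt <- nc::capture_all_str(
  toy.subject,
  "track ",
  track="\\S+",
  fields="(?:\n[^\n]+)*", # zero or more non-empty lines after the track line
  "\n")
toy.dt$track
toy.dt$fields
```

Each element of the fields column starts with a newline, because the non-capturing group consumes the newline that precedes every field line.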

Match all fields in each track

Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:

(fields.dt <- tracks.dt[, nc::capture_all_str(
  fields,
  "\\s+",
  variable=".*?",
  " ",
  value="[^\n]+"),  
  by=track])
str(fields.dt)

Note that because by=track was specified, nc::capture_all_str is called for each unique value of track (i.e. each row). The results are combined into a single data.table with one row for each field. This data.table can be easily queried, e.g.

fields.dt[
  J("tcell_McGill0107Coverage", "bigDataUrl"),
  value,
  on=.(track, variable)]
fields.dt[, .(count=.N), by=variable][order(count)]
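To make the role of by=track concrete, here is a minimal sketch on a hand-built table. The table toy.tracks and its contents are hypothetical and stand in for tracks.dt. The field pattern is the same as above.

```r
library(data.table)
library(nc)

## Hypothetical stand-in for tracks.dt: one row per track block.
toy.tracks <- data.table(
  track=c("a", "b"),
  fields=c("\ntype bigBed 5\ncolor 0,0,255", "\ntype bigWig"))

## capture_all_str runs once per track; data.table stacks the
## per-track results and adds the track column to each row.
toy.fields <- toy.tracks[, nc::capture_all_str(
  fields,
  "\\s+",
  variable=".*?", # lazy, so it stops at the first space
  " ",
  value="[^\n]+"),
  by=track]
toy.fields
```

Track a contributes two rows and track b contributes one, so the result has three rows in total.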

For more information about data.table syntax, read vignette("datatable-intro", package="data.table").

Match all tracks and some fields with one regex

In the examples above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the track name, split into separate columns (using a single regex for the track).

cell.sample.type <- list(
  cellType="[^ ]*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks")
nc::capture_all_str(trackDb.vec, cell.sample.type)

Note that the pattern above defines nested capture groups via named lists (e.g. sampleID is nested inside sampleName). The pattern below matches either the previously specified track pattern, or any other type of track name:

sample.or.anything <- list(
  cell.sample.type,
  "|",
  "[^\n]+")
track.pattern.old <- list(
  "track ",
  track=sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern.old)

Notice the repetition of track in the pattern above. This can be avoided by using the nc::field helper function, which takes three arguments that are pasted together to form a pattern: the field name (used both as a literal pattern and as the capture group name), a pattern to match between the field name and the value, and the value pattern. The example above can thus be rewritten as below:

track.pattern <- nc::field("track", " ", sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern)
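As a quick check of what nc::field saves, the sketch below compares a hand-written pattern with the nc::field version on a tiny made-up subject. The names toy.lines, by.hand and with.field are hypothetical.

```r
library(nc)

## Hypothetical two-line subject.
toy.lines <- c("track foo", "type bigWig")

## Hand-written: "track" appears as a literal and as a group name.
by.hand <- nc::capture_all_str(
  toy.lines, "track", " ", track="[^\n]+")

## nc::field: "track" is written once and used for both purposes.
with.field <- nc::capture_all_str(
  toy.lines, nc::field("track", " ", "[^\n]+"))

identical(by.hand$track, with.field$track)
```

Both calls should capture the same value in a column named track.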

Finally, we use nc::field again to capture the type field:

any.lines.pattern <- "(?:\n[^\n]+)*"
nc::capture_all_str(
  trackDb.vec,
  track.pattern,
  any.lines.pattern,
  "\\s+",
  nc::field("type", " ", "[^\n]+"))

Exercise for the reader (easy): modify the above regex in order to capture the bigDataUrl field, and three additional columns (red, green, blue) from the color field. Assume that bigDataUrl occurs before color in each track. Note that this ordering assumption is a limitation of the single regex approach; using two regexes, as described in previous sections, could extract any/all fields, even if they appear in different orders in different tracks.

Exercise for the reader (hard): note that the last code block only matches tracks which define the type field. How would you optionally match the type field? Hint: the current any.lines.pattern can match the type field.

Parsing SweeD output files

Thanks to Marc Tollis for providing the example data used in this section (from the SweeD bioinformatics program). Some representative lines from one output file are shown below.

info.txt.gz <- system.file(
  "extdata", "SweeD_Info.txt.gz", package="nc")
info.vec <- readLines(info.txt.gz)
info.vec[20:50]

The Alignment numbers must be matched with the numbers after the double slashes in the report file:

report.txt.gz <- system.file(
  "extdata", "SweeD_Report.txt.gz", package="nc")
report.vec <- readLines(report.txt.gz)
cat(report.vec[1:10], sep="\n")
cat(report.vec[1000:1010], sep="\n")

The goal is to produce a bed file, which has tab-separated values with four columns: chrom, chromStart, chromEnd, Likelihood. The chrom values appear in the info file (Chromosome) so we will need to join the two files based on alignment ID. First we capture all alignments in the info file:

(info.dt <- nc::capture_all_str(
  info.vec,
  "Alignment ",
  alignment="[0-9]+",
  "\n\n\t\tChromosome:\t\t",
  chrom=".*",
  "\n"))

Then we capture all alignment/csv blocks in the report file:

(report.dt <- nc::capture_all_str(
  report.vec,
  "//",
  alignment="[0-9]+",
  "\n",
  csv="[^/]+"
)[, {
  data.table::fread(text=csv)
}, by=alignment])

Note that because by=alignment was specified, fread is called for each unique value of alignment (i.e. each row). The results are combined into a single data.table with all of the csv data from the original file, plus the additional alignment column. Next, we join this table to the previous table in order to get the chrom column:

(join.dt <- report.dt[info.dt, on=.(alignment)])
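The join syntax X[Y, on=...] looks up, for each row of Y, the matching rows of X. A minimal sketch with made-up tables follows. The tables toy.report and toy.info and their values are hypothetical, not SweeD output.

```r
library(data.table)

## Hypothetical stand-ins for report.dt and info.dt.
toy.report <- data.table(
  alignment=c("1", "1", "2"),
  Position=c(10, 20, 5),
  Likelihood=c(0.1, 0.5, 0.2))
toy.info <- data.table(
  alignment=c("1", "2"),
  chrom=c("chr1", "chr2"))

## For each alignment in toy.info, pull the matching report rows;
## the chrom column is carried along from toy.info.
toy.join <- toy.report[toy.info, on=.(alignment)]
toy.join
```

The alignment columns must have the same type in both tables (character here, as produced by nc::capture_all_str without a conversion function), otherwise the join may fail or need a conversion first.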

Finally the desired bed table can be created via

join.dt[, .(
  chrom,
  chromStart=as.integer(Position-1),
  chromEnd=as.integer(Position),
  Likelihood)]
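The code above builds the table but does not write it to disk. One possible way to produce the tab-separated file is sketched below, assuming data.table::fwrite and a temporary output path. The table toy.bed and the file name are hypothetical, and omitting the header row is a choice made here, not something the text above prescribes.

```r
library(data.table)

## Hypothetical table with the four desired columns.
toy.bed <- data.table(
  chrom=c("chr1", "chr1"),
  Position=c(10, 20),
  Likelihood=c(0.1, 0.5)
)[, .(
  chrom,
  chromStart=as.integer(Position-1), # start is one less than the position
  chromEnd=as.integer(Position),
  Likelihood)]

## Write tab-separated values without a header row.
bed.file <- tempfile(fileext=".bed")
data.table::fwrite(toy.bed, bed.file, sep="\t", col.names=FALSE)
bed.lines <- readLines(bed.file)
bed.lines
```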

Exercise for the reader (easy): notice that the code above for creating info.dt involves repetition in the pattern and group names (alignment, Alignment, chrom, Chromosome). Re-write the pattern using nc::field in order to eliminate that repetition.

Exercise for the reader (hard): notice that Chromosome is only the first field -- how could you extract the other fields as well? Hint: use nc::field in a helper function in order to avoid repetition.

nc documentation built on Sept. 1, 2023, 1:07 a.m.