```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The goal of this package is to simplify the creation of transparent parsers for
structured text files generated by machines like laboratory instruments. For
example, we use the package to construct parsers for files generated by plate
readers. The data generated by these instruments can usually be exported to
text or spreadsheet files. Such files consist of lines of text organized in
higher-order structures like headers with metadata, blocks of measured values,
etc. It is often convenient to analyze the data in a program like R. To do
that you need a parser that processes these files and creates R objects as
output. The `parcr` package simplifies the task of creating such parsers.
The parsers that are created with this package make extensive use of functional programming. If this topic is new to you, please read about functional programming, in particular the chapters *Function factories* and *Function operators*, in Hadley Wickham's book *Advanced R*.
The `parcr` package contains a set of functions that allow you to create simple
parsers with higher-order functions: functions that can take functions as input
and produce functions as output. These are sometimes called *combinators*. The
ideas behind the package are described in a paper by @Hutton1992. A number of
the functions described in this paper are implemented, with modifications, in
the current package. The package was also heavily inspired by the Ramble
package, which is written in the same vein but without the explicit parsing of
structured text files in mind.
## The output of a parser: a `list`

The parsers constructed with the functions from this package generate a `list`
as output. The parsers read the input vector from left to right. When a parser
fails, the output is the empty list `list()`. However, if the parser is
successful, it produces a `list` with two elements. An element called `L` (the
left part) contains the output generated from the part of the input vector
that was successfully parsed, and an element called `R` (the right part)
contains the remainder that was not parsed. When an entire character vector is
parsed, the content of the `R` element equals `character(0)`. The content of
the `L` part can be shaped to your desire. This is demonstrated in the example
of the fasta file parser later in this document.
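As a minimal illustration of this `L`/`R` structure (using the `literal()` parser introduced below, and assuming `parcr` is installed):

```r
library(parcr)

# Parse only the first element; the second element stays unconsumed
out <- literal("a")(c("a", "b"))
out[["L"]]  # the successfully parsed part
out[["R"]]  # the unconsumed remainder: "b"
```

Accessing the two parts with `[["L"]]` and `[["R"]]` is how the fasta example later in this document extracts its results.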
Please realize that every function described below is a higher-order function:
its output is a function. In its turn, this function can take a character
vector as its input. For example, `literal("a")` yields a function. To use
that function as a parser you have to provide it with a character vector, the
object that needs to be parsed, as input:
```r
literal("a")(c("a","att"))
```
This parser tests whether the next element in its input is literally the string
`"a"`. It will succeed in the example above, but will only consume the first
element of its input and then stop. However, you can also use a higher-order
function like `literal("a")` as input to other higher-order functions to create
more complex parsers. For example, the function `then` takes two parsers `p1`
and `p2` as arguments, `then(p1, p2)`, and applies them in sequence to the
input. In the `parcr` package the function is implemented in the infix form
`%then%`, which makes parser constructs more readable. The composite parser:
```r
literal("a") %then% literal("att")
```
looks for an element with string "a" followed by an element with string "att". Its application to the same vector:
```r
(literal("a") %then% literal("att"))(c("a","att"))
```
will completely consume the input. In this way, using a number of standard parsers defined in the package, you can quickly construct flexible parsers for complex input. Furthermore, the functions also allow you to construct a desired R object as output while parsing.
## The parser functions in the `parcr` package

We will now discuss all of the parser combinator functions present in the package. You should also study their help pages. In particular, the *Pseudocode* listed for each of them should help you to understand their properties.
### The six fundamental parsers

The six fundamental parsers allow you to construct a parser that will completely consume input, or to fail when the input does not satisfy the specifications of the parser.
```r
library(parcr)
```
- `succeed(o)`: where `o` is any kind of R-object.
- `fail()`
The `succeed` and `fail` parsers are the nuts and bolts of parser
construction. The `succeed` parser always succeeds, without consuming any
input, whereas the `fail` parser always fails.

The `succeed` parser constructs a `list` object with a 'left' or `L`-element
that contains the parsed result of the consumed part of the input vector and a
'right' or `R`-element that contains the unconsumed part of the vector. The
`L`-element can contain any R-object that is constructed during parsing.

While `succeed` never fails, `fail` always does, regardless of the input
vector. To signal failure it returns a special form of the empty list
`list()`, namely a `marker` object printed as the icon `[]`.
**Important**: It is unlikely that you will ever use these two functions to construct parsers.
Examples:
```r
succeed("A")("abc")
succeed(data.frame(title = "Keisri hull", author = "Jaan Kross"))(c("Unconsumed", "text"))
```
```r
fail()("abc")
```
The basic functions for recognizing the content of the current element, i.e. the left-most element of the input vector, are:
- `literal(c)`: tests whether the current element equals the string `c`.
- `satisfy(b)`: tests whether the current element satisfies the function `b()`, where `b()` is a logical function: it takes a string as input and returns `TRUE` or `FALSE`.
- `eof()`: tests whether the input is at its end.

`eof()` is a special function that detects the end of a character vector or,
if that character vector represents the lines of a text file, the end of the
file (EOF). In fact, it detects `character(0)` in the input vector and, when
successful, it turns the `R`-side of the output into an empty list (`list()`)
to signal that the end of the vector was detected.
Examples:
```r
literal('abc')(c('abc','def'))
```
```r
starts_with_a <- function(x) { grepl("^a", x) }
satisfy(starts_with_a)(c('abc','def'))
```
And here is an example of an unsuccessful parser:
```r
literal('a')(c('ab','a'))
```
An application of `eof()` to detect that we parsed the input completely:
```r
(literal("a") %then% literal("att") %then% eof())(c("a","att"))
```
Notice how the `R`-element differs from just

```r
(literal("a") %then% literal("att"))(c("a","att"))
```
- `p1 %or% p2`: applies the alternative parsers `p1` and `p2` to the current element and returns the result of the first successful parser, or failure when both fail.
- `p1 %then% p2`: applies parser `p1` to the current element and `p2` to the next element.

The `%or%` combinator enables us to try alternative parsers on the current
element, whereas the `%then%` combinator enables us to test sequences of
elements in a character vector.

Note that `%or%` uses lazy evaluation, which means that the output of `%or%`
depends on the order of `p1` and `p2`: if both would in principle succeed then
only the result of `p1` is returned.
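This order dependence can be made visible by tagging each alternative with `%ret%` (introduced below). A minimal sketch, assuming `parcr` is loaded:

```r
library(parcr)

first  <- literal("a") %ret% "matched by first"
second <- literal("a") %ret% "matched by second"

# Both alternatives would succeed on "a", but because of lazy evaluation
# %or% only evaluates and returns the first one's result
out <- (first %or% second)("a")
out[["L"]]
```

Swapping the two operands would yield the other tag, so with `%or%` you should place the more specific alternative first.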
We also have two variations of the `%then%` combinator, `%xthen%` and
`%thenx%`, which do test but then discard the result from the second or the
first argument, respectively:

- `p1 %xthen% p2`: where `p1` and `p2` are parsers; discards the result from `p2`.
- `p1 %thenx% p2`: where `p1` and `p2` are parsers; discards the result from `p1`.
Examples:
```r
(literal('A') %or% satisfy(starts_with_a))(c('abc','def'))
(literal('A') %then% satisfy(starts_with_a))(c('A', 'abc'))
(literal('>') %thenx% satisfy(starts_with_a))(c('>', 'abc'))
```
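For completeness, a sketch of `%xthen%` on the same input, the mirror image of the `%thenx%` example above (assuming `parcr` is loaded):

```r
library(parcr)
starts_with_a <- function(x) grepl("^a", x)

# %xthen% keeps the result of the left parser and discards the right one,
# so only ">" ends up in the L element
out <- (literal('>') %xthen% satisfy(starts_with_a))(c('>', 'abc'))
out[["L"]]
```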
As said, the six fundamental parsers allow you to construct a parser that will completely consume input. However, when this parser succeeds its output will, apart from the fact that every element is put in a list, be equal to the input. In general, this is not very useful if you want to use the output in other code. Therefore, we have two functions that allow you to modify the output of a successful parser. The basic functions for modifying the output of a parser are:
- `p %ret% c`: when parser `p` is successful it returns the object `c` (a string or `NULL`).
- `p %using% f`: when parser `p` is successful, function `f()` is applied to the input and its output is stored as the result.

Examples:
```r
(literal('a') %ret% "We have an 'a'")(c('a','b'))
(satisfy(starts_with_a) %using% toupper)(c('abc','d'))
```
### Derived parsers

Derived parsers are constructed from the six fundamental parsers.
- `zero_or_one(p)`: where `p` is a parser.
- `zero_or_more(p)`: where `p` is a parser.
- `one_or_more(p)`: where `p` is a parser.
- `exactly(n, p)`: where `n` is an integer and `p` is a parser.
- `match_n(n, p)`: where `n` is an integer and `p` is a parser.

`zero_or_one`, `zero_or_more` and `one_or_more` do exactly what their names
suggest. You should realize that these are greedy parsers: they consume as
many strings as can be successfully parsed by `p`. Similarly, `exactly` is a
greedy parser, and it fails when there are fewer or more than `n` consecutive
strings that can be successfully parsed by `p`. On the other hand, `match_n`
is not greedy: it consumes `n`, but no more, strings that can be successfully
parsed with `p`.
Examples:
This parser will fail on its input because there are too many strings starting with `"a"`:
```r
zero_or_one(satisfy(starts_with_a))(c('acc','aat','cgg'))
```
The following is a successful parse. Note that its result is not merely `[]`,
which would have indicated failure, but an `L,R`-list with an empty list in
the `L`-element.
```r
zero_or_more(satisfy(starts_with_a))(c('cat','gac','cct'))
```
```r
one_or_more(satisfy(starts_with_a))(c('att','aac','cct'))
exactly(2, satisfy(starts_with_a))(c('att','aac','cct'))
match_n(1, satisfy(starts_with_a))(c('att','aac','cct'))
```
### `match_s`
When constructing a parser you will often need to recognize as well as process
strings. For example, you may want to recognize multiple integers in a line,
extract these, and then return them as a numeric vector, ignoring other
elements like comments in these strings. This could be achieved by combining
`satisfy()` and subsequently `%using%`, like:

```r
satisfy(has_integers) %using% process_integers
```

where `has_integers` is a boolean function and `process_integers` is a
function that recognizes, extracts and rearranges numbers into a numeric
vector. You will often find that `has_integers` and `process_integers` use the
same regular expressions. Then it may be more efficient to combine these,
which is what the `match_s()` function does.
The `match_s()` parser takes a simple (not higher-order) function `s` to
process the string from the current element and returns the result from that
function. The function `s` has to be constructed in such a way that it returns
the empty `list()` when the string does not satisfy the criteria that the user
sets.
Example:
Here `numbers` is a function that recognizes and returns numbers (in fact,
positive integers) in a string:
```r
numbers <- function(x) {
  m <- gregexpr("[[:digit:]]+", x)
  matches <- as.numeric(regmatches(x, m)[[1]])
  if (length(matches) == 0) {
    return(list()) # we signal parser failure when no numbers were found
  } else {
    return(matches)
  }
}

match_s(numbers)(" 101 12 187 # a comment on these numbers")
```
- `by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)`: where `p` is a parser.
- `by_symbol(p, finish = TRUE)`: where `p` is a parser.

Although you can use the string processing functions from base or the stringr
package to parse and process individual elements of a character vector, it is
also possible to parse substrings by first splitting a string. `by_split` uses
a `split` pattern to first split the incoming string and then applies the
parser `p` to it. `by_symbol` splits the incoming string into individual
symbols and then applies the parser `p`. The `finish` boolean indicates
whether the parser should completely consume the split string. Under the hood
these functions use the function `strsplit()`, and its `split`, `fixed` and
`perl` arguments are passed on.
Examples:
```r
starts_with_a <- function(x) grepl("^a", x[1])
# don't forget to use satisfy(); it turns starts_with_a into a parser
by_split(one_or_more(satisfy(starts_with_a)), ",", fixed = TRUE)("atggc,acggg,acttg")
```
```r
by_symbol(literal(">") %thenx% one_or_more(literal("b")), finish = FALSE)(">bb")
```
**Note**: Parsers become slow when using these two functions extensively. If
that bothers you then you should use the `match_s()` or `satisfy()` and
`%using%` parsers together with string processing functions like `grepl()` and
`grep()`, or the ones from stringr, to process strings. Those parsers will be
much faster.
- `EmptyLine()`
- `Spacer()`
- `MaybeEmpty()`
The function `EmptyLine()` detects and returns an empty line. Empty lines are
either the string `""` or strings consisting entirely of space-like characters
as identified by the regular expression `\\s`. `Spacer()` detects one or more
consecutive empty lines and discards these, whereas `MaybeEmpty()` detects
zero or more empty lines and discards these.

An additional function `Ignore()` ignores all lines, whether empty or not,
until the end of the file. This is sometimes useful when the interesting part
of a file has been parsed and all else can be ignored until the end of the
file.
Note that I write these functions with capital letters. I use this convention here and in the example below to indicate that these functions parse higher-order structures (higher than the individual strings) in the input.
Examples:
```r
EmptyLine()("")
Spacer()(c(" ", "\t\t\t", "atgcc"))
MaybeEmpty()(c("ggacc", "gatccg", "atgcc"))
(literal("Interesting") %then% Ignore() %then% eof())(c("Interesting", LETTERS))
```
## An example: parsing fasta files

As an example of a somewhat realistic application, let's try to write a parser for fasta-formatted files with mixed nucleotide and protein sequences. Such a fasta file could look like the example below:
```r
data("fastafile")
cat(paste0(fastafile, collapse = "\n"))
```
where the first two are nucleotide sequences and the last is a protein sequence[^1].
[^1]: It is not clear to me whether mixing of sequence types is allowed in the fasta format. I guess not, because a protein sequence consisting entirely of glycine (G), alanine (A), threonine (T) and cysteine (C) would not be distinguishable from a nucleotide sequence. Such protein sequences would be extremely rare. Anyway, I demonstrate here that apart from this ambiguous case it is easy to parse them from a single file.
Since fasta files are text files, we could read such a file using
`readLines()`. Below we simulate the result of reading the file above by
loading the `fastafile` data set present in the package. It consists of a
character vector.
```r
data("fastafile")
```
We can distinguish the following higher-order components in a fasta file:

- A header line: the symbol `>` followed by a title[^2].
- A nucleotide sequence: one or more lines with letters from the set `{G,A,T,C}`.
- A protein sequence: one or more lines with letters from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.

[^2]: Note that real fasta headers and sequences can have more complicated formats than I pretend here.
It now becomes clear what I mean when I say that the package allows us to
write transparent parsers: the description above of the structure of fasta
files can be translated straight into code for a `Fasta()` parser:
```r
Fasta <- function() {
  one_or_more(SequenceBlock()) %then% eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then% Header() %then% (NuclSequence() %or% ProtSequence())
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString())
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString())
}
```
Notice that these elements are functions taking no input, hence the empty
argument brackets `()` behind their names. They can take input when needed,
for example to change their behavior (like `match_n()`, or see the other
example below).
Now we need to define the string-parsers `Header()`, `NuclSequenceString()`
and `ProtSequenceString()` that recognize and process these elements in the
character vector `fastafile`. We use functions from stringr to do this in
three helper functions, and we use `match_s()` to create parsers from these.
```r
# returns the title after the ">" in the sequence header
parse_header <- function(line) {
  # Study stringr::str_match() to understand what we do here
  m <- stringr::str_match(line, "^>(\\w+)")
  if (is.na(m[1])) {
    return(list()) # signal failure: no title found
  } else {
    return(m[2])
  }
}

# returns a nucleotide sequence string
parse_nucl_sequence_line <- function(line) {
  # The line must consist of GATC from the start (^) until the end ($)
  m <- stringr::str_match(line, "^([GATC]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid nucleotide sequence string
  } else {
    return(m[2])
  }
}

# returns a protein sequence string
parse_prot_sequence_line <- function(line) {
  # The line must consist of ARNDBCEQZGHILKMFPSTWYV from the start (^) until
  # the end ($)
  m <- stringr::str_match(line, "^([ARNDBCEQZGHILKMFPSTWYV]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid protein sequence string
  } else {
    return(m[2])
  }
}
```
Then we define the parsers.
```r
Header <- function() {
  match_s(parse_header)
}

NuclSequenceString <- function() {
  match_s(parse_nucl_sequence_line)
}

ProtSequenceString <- function() {
  match_s(parse_prot_sequence_line)
}
```
Now we have all the elements that we need to apply the `Fasta()` parser.
```r
Fasta()(fastafile)
```
Apart from `match_s()`, we have used only the six fundamental parsers.
Therefore, the output is almost the same as the parsed input. This is not very
useful because it is difficult to extract the individual sequences and titles
from it; we would have to write a sort of parser again to process this output.
To mend this, we have to modify the output of the parsers. The first thing
that we will do is to let every sequence block be returned as an element of a
list. To achieve this we extend the `SequenceBlock` parser by changing its
output with the `%using%` operator:
```r
SequenceBlock <- function() {
  MaybeEmpty() %then% Header() %then%
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}
```
Now the result is a list of three lists, one for each sequence block.
```r
Fasta()(fastafile)[["L"]]
```
In principle, this output is easier to extract information from, but we can
improve on it. First, we want the sequences to appear as one long string, not
as separate character vectors corresponding to the lines in the sequence
block. Therefore, we extend the `NuclSequence` and `ProtSequence` parsers by
collapsing their output:
```r
NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) paste0(x, collapse = "")
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) paste0(x, collapse = "")
}
```
Then we get
```r
Fasta()(fastafile)[["L"]]
```
This looks much better: we know that the first element in each of these lists
is the title and the second element is the complete sequence. Then why not
just attach a name to these elements? This would make extracting the
information even easier. Furthermore, we also report whether the sequence is a
nucleotide or a protein sequence by adding a `type` tag.
```r
Header <- function() {
  match_s(parse_header) %using%
    function(x) list(title = unlist(x))
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) list(type = "Nucl", sequence = paste0(x, collapse = ""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) list(type = "Prot", sequence = paste0(x, collapse = ""))
}
```
Finally, we have our desired output.
```r
d <- Fasta()(fastafile)[["L"]]
d
```
Let's present the result more concisely using the names of these elements:
```r
invisible(lapply(d, function(x) { cat(x$type, x$title, x$sequence, "\n") }))
```
## Parsers with parameters

In the examples above we showed how to create parsers without parameters. It is sometimes easy and useful to create parsers with parameters. The parameters are used to change the behavior of the parsers. For example, when writing online course material I use a simple structured question template that is converted to html when the syllabus is generated. It consists mostly of markdown content. Its parser makes use of parametrized parsers. The structure of such a question template document is as follows[^3]:
[^3]: I simplified the template and code for this example. In fact the content is processed differently depending on the type of element, meaning that `Content()` is a function of `type`. Furthermore, questions are automatically numbered.
```r
qtemp <- c(
  "#### INTRO",
  "## Title about a set of questions",
  "",
  "This is optional introductory text to a set of questions.",
  "Titles preceded by four hashes are not allowed in a question template.",
  "",
  "#### QUESTION",
  "This is the first question",
  "",
  "#### TIP",
  "This would be a tip. tips are optional, and multiple tips can be given. Tips are",
  "wrapped in hide-reveal style html elements.",
  "",
  "#### TIP",
  "This would be a second tip.",
  "",
  "#### ANSWER",
  "The answer to the question is optional and is wrapped in a hide-reveal html element.",
  "",
  "#### QUESTION",
  "This is the second question. No tips for this one",
  "",
  "#### ANSWER",
  "Answer to the second question"
)
```
```r
cat(paste0(c(qtemp, "", "<optionally more questions>"), collapse = "\n"))
```
I stored this example content in a vector `qtemp` to parse it later.
You will notice the recurring structure of a header with four hashes (`####`)
and some text following it. These headers represent four types of elements:
intro, question, tip and answer. Instead of writing separate parsers, we can
create a generic parser for such elements:
```r
HeaderAndContent <- function(type) {
  (Header(type) %then% Content()) %using%
    function(x) list(list(type = type, content = unlist(x)))
}
```
Then we define each of the four parsers as:
```r
Intro    <- function() HeaderAndContent("intro")
Question <- function() HeaderAndContent("question")
Tip      <- function() HeaderAndContent("tip")
Answer   <- function() HeaderAndContent("answer")
```
The function `Header(type)` is defined as
```r
Header <- function(type) satisfy(header(type)) %ret% NULL

# This must also be a generic function: a function that generates a function
# to recognize a header of type 'type'
header <- function(type) {
  function(x) grepl(paste0("^####\\s+", toupper(type), "\\s*"), x)
}
```
The content consists of one or more lines not starting with `####`, which
includes empty lines. We discard trailing empty lines.
```r
Content <- function() {
  one_or_more(match_s(content)) %using%
    function(x) stringr::str_trim(paste0(x, collapse = "\n"), "right")
}

content <- function(x) {
  if (grepl("^####", x)) list() else x
}
```
The complete template is defined as follows:
```r
Template <- function() {
  zero_or_more(Intro()) %then%
    one_or_more(QuestionBlock()) %then%
    eof()
}
```
where `QuestionBlock()` is defined using the previously defined elements as
```r
QuestionBlock <- function() {
  Question() %then%
    zero_or_more(Tip()) %then%
    zero_or_one(Answer()) %using%
    function(x) list(x)
}
```
We can now parse the input. We wrap the `Template()` parser in the
`reporter()` function to get proper error messages and warnings, if
applicable. Furthermore, only the `L`-element, the parsed input, is returned.
```r
reporter(Template())(qtemp)
```