`minilexer` provides a tool for simple tokenising/lexing of text files. It aims to help get unsupported text data formats into R quickly. For complicated parsing (especially of computer programs) you'll want to use the more formally correct lexing/parsing provided by the `rly` package or the `dparser` package.

Note: as of version 0.1.6, the `TokenStream` handler has been removed.
You can install `minilexer` from GitHub with:

```r
remotes::install_github('coolbutuseless/minilexer')
```
Currently the package provides one function: `minilexer::lex(text, patterns)`, which uses the regular expressions in `patterns` to split `text` into a character vector of tokens. The `patterns` argument is a named vector of character strings representing the regular expressions for the elements to match within the text.
I will emphasise the *mini* in `minilexer`: this is not a rigorous or formally complete lexer, but it suits 90% of my needs for turning text data formats into tokens.
Use `lex()` to split a sentence into tokens:

```r
library(minilexer)

sentence_patterns <- c(
  word       = "\\w+",
  whitespace = "\\s+",
  fullstop   = "\\.",
  comma      = ","
)

sentence <- "Hello there, Rstats."

lex(sentence, sentence_patterns)
```
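If the whitespace tokens are just noise for your format, they can be dropped after lexing. The sketch below assumes `lex()` returns a character vector whose names are the names of the patterns that matched; inspect the returned object in your own session before relying on that structure.

```r
# A minimal sketch: lex() is assumed (not verified here) to return a
# character vector whose names are the matching pattern names.
tokens <- lex(sentence, sentence_patterns)

# Drop the whitespace tokens, keeping everything else
tokens[names(tokens) != "whitespace"]
```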
Use `lex()` to split some simplified R code into tokens:

```r
R_patterns <- c(
  number     = "-?\\d*\\.?\\d+",
  name       = "\\w+",
  equals     = "==",
  assign     = "<-|=",
  plus       = "\\+",
  lbracket   = "\\(",
  rbracket   = "\\)",
  newline    = "\n",
  whitespace = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- lex(R_code, R_patterns)
R_tokens
```
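Once the text has been tokenised, downstream processing is often easier with the tokens in tabular form. The snippet below is a minimal sketch, again assuming the result of `lex()` is a character vector named by token type; the reshaping itself is plain base R.

```r
# A minimal sketch, assuming R_tokens is a character vector named by token
# type: reshape it into a data.frame of type/value pairs and drop whitespace.
token_df <- data.frame(
  type  = names(R_tokens),
  value = unname(R_tokens),
  stringsAsFactors = FALSE
)

token_df[token_df$type != "whitespace", ]
```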