minilexer: A Simple Lexer in R

minilexer provides a tool for simple tokenising/lexing of text files.

minilexer aims to make it quick to get unsupported text data formats into R.

For complicated parsing (especially of computer programs) you’ll want to use the more formally correct lexing/parsing provided by the rly package or the dparser package.

Note: As of version 0.1.6, the TokenStream handler has been removed.



Package Overview

Currently the package provides one function, lex(), which splits text into named tokens using a named vector of regular expression patterns.

Example: Use lex() to split a sentence into tokens

sentence_patterns <- c(
  word        = "\\w+",
  whitespace  = "\\s+",
  fullstop    = "\\.",
  comma       = ","
)

sentence <- "Hello there, Rstats."

lex(sentence, sentence_patterns)
##       word whitespace       word      comma whitespace       word 
##    "Hello"        " "    "there"        ","        " "   "Rstats" 
##   fullstop 
##        "."
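
To make the idea behind this kind of lexing concrete, here is a rough base-R sketch of a regex-driven tokeniser. It is not minilexer's implementation: sketch_lex() is a hypothetical helper written for illustration only, and it assumes that patterns are tried in the order given and that the first one matching at the current position wins.

```r
# Hypothetical helper for illustration only; NOT part of minilexer.
# Assumption: patterns are tried in order and the first match wins.
sketch_lex <- function(text, patterns) {
  tokens <- character(0)
  labels <- character(0)

  while (nchar(text) > 0) {
    matched <- FALSE
    for (label in names(patterns)) {
      # Anchor the pattern so it can only match at the start of the remaining text
      m   <- regexpr(paste0("^(?:", patterns[[label]], ")"), text, perl = TRUE)
      len <- attr(m, "match.length")
      if (m == 1 && len > 0) {
        tokens  <- c(tokens, substr(text, 1, len))
        labels  <- c(labels, label)
        text    <- substr(text, len + 1, nchar(text))  # consume the token
        matched <- TRUE
        break
      }
    }
    if (!matched) stop("No pattern matches at: ", text)
  }

  # Return the tokens as a character vector named by pattern label
  setNames(tokens, labels)
}

sentence_patterns <- c(
  word        = "\\w+",
  whitespace  = "\\s+",
  fullstop    = "\\.",
  comma       = ","
)

sketch_tokens <- sketch_lex("Hello there, Rstats.", sentence_patterns)
```

In this sketch, text that no pattern matches raises an error rather than being skipped silently, which tends to surface gaps in a pattern set early.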

Example: Use lex() to split some simplified R code into tokens

R_patterns <- c(
  number      = "-?\\d*\\.?\\d+",
  name        = "\\w+",
  equals      = "==",
  assign      = "<-|=",
  plus        = "\\+",
  lbracket    = "\\(",
  rbracket    = "\\)",
  newline     = "\n",
  whitespace  = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- lex(R_code, R_patterns)
R_tokens
##       name whitespace     assign whitespace     number whitespace 
##        "x"        " "       "<-"        " "        "3"        " " 
##       plus whitespace     number whitespace       plus whitespace 
##        "+"        " "      "4.2"        " "        "+"        " " 
##       name   lbracket     number   rbracket 
##    "rnorm"        "("        "1"        ")"
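
Because the result is a plain named character vector, ordinary R subsetting is enough for post-processing. As a small sketch, the vector below is typed in by hand from the output printed above (so the snippet runs without the package) and the whitespace tokens are dropped by name. The code_tokens name is just an illustrative choice.

```r
# Token vector copied by hand from the printed output above,
# so this snippet does not need minilexer to be installed.
R_tokens <- c(
  name = "x", whitespace = " ", assign = "<-", whitespace = " ",
  number = "3", whitespace = " ", plus = "+", whitespace = " ",
  number = "4.2", whitespace = " ", plus = "+", whitespace = " ",
  name = "rnorm", lbracket = "(", number = "1", rbracket = ")"
)

# Keep every token whose label is not "whitespace"
code_tokens <- R_tokens[names(R_tokens) != "whitespace"]

# Pull out only the numeric literals and convert them
numbers <- as.numeric(code_tokens[names(code_tokens) == "number"])
```

Filtering on names() like this is usually the first step before handing the tokens to whatever parsing logic a particular data format needs.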

