README.md
In coolbutuseless/minilexer: A Simple Tool for Lexing Text Data

minilexer provides a tool for simple tokenising/lexing of text files.

minilexer aims to make it quick to get unsupported text data formats into R.

For complicated parsing (especially of computer programs) you’ll want to use the more formally correct lexing/parsing provided by the rly package or the dparser package.

Note: As of version 0.1.6, the TokenStream handler has been removed.

remotes::install_github('coolbutuseless/minilexer')

Currently, the package provides one function:

minilexer::lex(text, patterns) for splitting the text into tokens.
- This function uses the user-defined regular expressions (patterns) to split text into a character vector of tokens.
- The patterns argument is a named vector of character strings representing regular expressions for elements to match within the text.
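
To make the description above concrete, here is a minimal base-R sketch of regex-driven tokenising with a named pattern vector. It is not the minilexer implementation: `sketch_lex` is a hypothetical helper written only for illustration, and it assumes that patterns are tried in the order given and that characters matching no pattern are silently skipped.

```r
# Hypothetical helper, NOT the minilexer source: a base-R sketch of
# tokenising text with a named vector of regular expressions.
sketch_lex <- function(text, patterns) {
  # Join all patterns into a single alternation. With perl = TRUE the
  # earliest alternative that matches at a position is the one used.
  combined <- paste0("(?:", patterns, ")", collapse = "|")

  # Extract every non-overlapping match, scanning left to right
  tokens <- regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]

  # Label each token with the name of the first pattern that matches it in full
  labels <- vapply(tokens, function(tok) {
    hit <- vapply(
      patterns,
      function(p) grepl(paste0("^(?:", p, ")$"), tok, perl = TRUE),
      logical(1)
    )
    names(patterns)[which(hit)[1]]
  }, character(1), USE.NAMES = FALSE)

  # Return a character vector of tokens named by token type
  setNames(tokens, labels)
}

demo_patterns <- c(word = "\\w+", whitespace = "\\s+", fullstop = "\\.", comma = ",")
demo_tokens   <- sketch_lex("Hello there, Rstats.", demo_patterns)
demo_tokens
```

The result has the same shape as the `lex()` output shown in the examples: a character vector whose names are the token types.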

Introducing the `minilexer` package

minilexer provides a tool for simple tokenising/lexing of text files.

I will emphasise the mini in minilexer: this is not a rigorous or formally complete lexer, but it suits 90% of my needs for turning text data formats into tokens.

For complicated parsing (especially of computer programs) you’ll probably want to use the more formally correct lexing/parsing provided by the rly package or the dparser package.

Example: Use `lex()` to split a sentence into tokens

library(minilexer)

# Named vector of regular expressions, one per token type
sentence_patterns <- c(
  word        = "\\w+",
  whitespace  = "\\s+",
  fullstop    = "\\.",
  comma       = ","
)

sentence <- "Hello there, Rstats."

lex(sentence, sentence_patterns)

##       word whitespace       word      comma whitespace       word 
##    "Hello"        " "    "there"        ","        " "   "Rstats" 
##   fullstop 
##        "."
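
Because the result is a named character vector, ordinary subsetting on the names is enough to filter tokens by type. The sketch below copies the token vector from the output above by hand, so it runs without the package installed. The variable names `tokens`, `content` and `words` are illustrative only.

```r
# Token vector copied from the output shown above
tokens <- c(
  word = "Hello", whitespace = " ", word = "there", comma = ",",
  whitespace = " ", word = "Rstats", fullstop = "."
)

# Drop whitespace tokens by filtering on the names
content <- tokens[names(tokens) != "whitespace"]

# Keep only the words, discarding the names
words <- unname(tokens[names(tokens) == "word"])
words
```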

Example: Use `lex()` to split some simplified R code into tokens

R_patterns <- c(
  number      = "-?\\d*\\.?\\d+",
  name        = "\\w+",
  equals      = "==",
  assign      = "<-|=",
  plus        = "\\+",
  lbracket    = "\\(",
  rbracket    = "\\)",
  newline     = "\n",
  whitespace  = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- lex(R_code, R_patterns)
R_tokens

##       name whitespace     assign whitespace     number whitespace 
##        "x"        " "       "<-"        " "        "3"        " " 
##       plus whitespace     number whitespace       plus whitespace 
##        "+"        " "      "4.2"        " "        "+"        " " 
##       name   lbracket     number   rbracket 
##    "rnorm"        "("        "1"        ")"
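
Note that `R_patterns` lists `equals` ("==") before `assign` ("<-|="). If patterns are tried in the order given (an assumption here, though the ordering in the example suggests it), a more specific pattern needs to come before a more general one that matches a prefix of it. The base-R sketch below illustrates the effect with plain regex alternation. It does not call `lex()`, and the variable names are illustrative.

```r
code <- "a == b"

# Alternation with "==" first, mirroring the order used in R_patterns
ordered <- "==|<-|="
# Same alternatives with the general "=" placed before "=="
reversed <- "<-|=|=="

# perl = TRUE gives ordered alternation: the first alternative that
# matches at a position wins, even if a later one would match more text.
ordered_tokens  <- regmatches(code, gregexpr(ordered,  code, perl = TRUE))[[1]]
reversed_tokens <- regmatches(code, gregexpr(reversed, code, perl = TRUE))[[1]]

ordered_tokens   # "=="
reversed_tokens  # "=" "="
```

The same reasoning applies to `number` appearing before `name`: `\\w+` also matches digits, so if `name` came first, the integer part of a number would likely be labelled as a name.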

coolbutuseless/minilexer documentation built on May 14, 2019, 6:09 a.m.
