xlex: Parse xlsx (Excel) formulas into tokens

Description Usage Arguments Details Value Examples

View source: R/xlex.R

Description

xlex takes an Excel formula and separates it into tokens. The name is a bad pun on 'Excel' and 'lexer'. It returns a dataframe, one row per token, giving the token itself, its type (e.g. number, or error), and its level.

The level is a number to show the depth of a token within nested function calls. The token A2 in the formula IF(A1=1,A2,MAX(A3,A4)) is at level 1. Tokens A3 and A4 are at level 2. The token IF is at level 0, which is the outermost level.

The output isn't enough to enable computation or validation of formulas, but it is enough to investigate the structure of formulas and spreadsheets. It has been tested on millions of formulas in the Enron corpus.

Usage

1
xlex(x)

Arguments

x

Character vector of length 1, giving the formula.

Details

The different types of tokens are:

Every part of the original formula is returned as a token, so the original formula can be reconstructed by concatenating the tokens. If that doesn't work, please report it at https://github.com/nacnudus/tidyxl/issues with a reproducible example (e.g. using the reprex package).

The XLParser project was a great help in creating the grammar. https://github.com/spreadsheetlab/XLParser.

Value

A data frame (a tibble, if you use the tidyverse) one row per token, giving the token itself, its type (e.g. number, or error), and its level.

A class attribute xlex is added, so that the base::print() generic can be specialised to print the tree prettily.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
# All explicit cell references/addresses are returned as a single 'ref' token.
xlex("A1")
xlex("A$1")
xlex("$A1")
xlex("$A$1")
xlex("A1:B2")
xlex("1:1") # Whole row
xlex("A:B") # Whole column

# If one part of an address is a name or a function, then the colon ':' is
# regarded as a 'range operator', so is tagged 'operator'.
xlex("A1:SOME.NAME")
xlex("SOME_FUNCTION():B2")
xlex("SOME_FUNCTION():SOME.NAME")

# Sheet names are recognised by the terminal exclamation mark '!'.
xlex("Sheet1!A1")
xlex("'Sheet 1'!A1")       # Quoted names may contain some punctuation
xlex("'It''s a sheet'!A1") # Quotes are escaped by doubling

# Sheets can be ranged together in so-called 'three-dimensional formulas'.
# Both sheets are returned in a single 'sheet' token.
xlex("Sheet1:Sheet2!A1")
xlex("'Sheet 1:Sheet 2'!A1") # Quotes surround both (rather than each) sheet

# Sheets from other files are prefixed by the filename, which Excel
# normalizes the filenames into indexes.  Either way, xlex() includes the
# file/index in the 'sheet' token.
xlex("[1]Sheet1!A1")
xlex("'[1]Sheet 1'!A1") # Quotes surround both the file index and the sheet
xlex("'C:\\My Documents\\[file.xlsx]Sheet1'!A1")

# Function names are recognised by the terminal open-parenthesis '('.  There
# is no distinction between custom functions and built-in Excel functions.
# The open-parenthesis is tagged 'fun_open', and the corresponding
# close-parenthesis at the end of the arguments is tagged 'fun_close'.
xlex("MAX(1,2)")
xlex("_xll.MY_CUSTOM_FUNCTION()")

# Named ranges (properly called 'named formulas') are a last resort after
# attempting to match a function (ending in an open parenthesis '(') or a
# sheet (ending in an exclamation mark '!')
xlex("MY_NAMED_RANGE")

# Some cell addresses/references, functions and names can look alike, but
# xlex() should always make the right choice.
xlex("XFD1")     # A cell in the maximum column in Excel
xlex("XFE1")     # Beyond the maximum column, must be a named range/formula
xlex("A1048576") # A cell in the maximum row in Excel
xlex("A1048577") # Beyond the maximum row, must be a named range/formula
xlex("LOG10")    # A cell address
xlex("LOG10()")  # A log function
xlex("LOG:LOG")  # The whole column 'LOG'
xlex("LOG")      # Not a cell address, must be a named range/formula
xlex("LOG()")    # Another log function
xlex("A1.2!A1")  # A sheet called 'A1.2'

# Text is surrounded by double-quotes.
xlex("\"Some text\"")
xlex("\"Some \"\"text\"\"\"") # Double-quotes within text are escaped by

# Numbers are signed where it makes sense, and can be scientific
xlex("1")
xlex("1.2")
xlex("-1")
xlex("-1-1")
xlex("-1+-1")
xlex("MAX(-1-1)")
xlex("-1.2E-3")

# Booleans can be constants or functions, and names can look like booleans,
# but xlex() should always make the right choice.
xlex("TRUE")
xlex("TRUEISH")
xlex("TRUE!A1")
xlex("TRUE()")

# Errors are tagged 'error'
xlex("#DIV/0!")
xlex("#N/A")
xlex("#NAME?")
xlex("#NULL!")
xlex("#NUM!")
xlex("#REF!")
xlex("#VALUE!")

# Operators with more than one character are treated as single tokens
xlex("1<>2")
xlex("1<=2")
xlex("1<2")
xlex("1=2")
xlex("1&2")
xlex("1 2")
xlex("(1,2)")
xlex("1%")   # postfix operator

# The union operator is a comma ',', which is the same symbol that is used
# to separate function arguments or array columns.  It is tagged 'operator'
# only when it is inside parentheses that are not function parentheses or
# array curly braces.  The curly braces are tagged 'array_open' and
# 'array_close'.
tidyxl::xlex("A1,B2") # invalid formula, defaults to 'union' to avoid a crash
tidyxl::xlex("(A1,B2)")
tidyxl::xlex("MAX(A1,B2)")
tidyxl::xlex("SMALL((A1,B2),1)")

# Function arguments are separated by commas ',', which are tagged
# 'separator'.
xlex("MAX(1,2)")

# Nested functions are marked by an increase in the 'level'.  The level
# increases inside parentheses, rather than at the parentheses.  Curly
# braces, for arrays, have the same behaviour, as do subexpressions inside
# ordinary parenthesis, tagged 'paren_open' and 'paren_close'.  To see the
# levels explicitly (rather than by the pretty printing), print as a normal
# data frame or tibble by specifying `pretty = FALSE`.
# class with as.data.frame.
xlex("MAX(MIN(1,2),3)")
xlex("{1,2;3,4}")
xlex("1*(2+3)")
print(xlex("1*(2+3)"), pretty = FALSE)

# Arrays are marked by opening and closing curly braces, with comma ','
# between columns, and semicolons ';' between rows  Commas and semicolons are
# both tagged 'separator'.  Arrays contain only constants, which are
# booleans, numbers, text, and errors.
xlex("MAX({1,2;3,4})")
xlex("=MAX({-1E-2,TRUE;#N/A,\"Hello, World!\"})")

# Structured references are surrounded by square brackets.  Subexpressions
# may also be surrounded by square brackets, but xlex() returns the whole
# expression in a single 'structured_ref' token.
xlex("[@col2]")
xlex("SUM([col22])")
xlex("Table1[col1]")
xlex("Table1[[col1]:[col2]]")
xlex("Table1[#Headers]")
xlex("Table1[[#Headers],[col1]]")
xlex("Table1[[#Headers],[col1]:[col2]]")

# DDE calls (Dynamic Data Exchange) are normalized by Excel into indexes.
# Either way, xlex() includes everything in one token.
xlex("[1]!'DDE_parameter=1'")
xlex("'Quote'|'NYSE'!ZAXX")
# Meaningless spaces that appear in some old files are returned as 'space'
# tokens, so that the original formula can still be recovered by
# concatenating all the tokens.  Spaces between function names and their open
# parenthesis have not been observed, so are not permitted.
xlex(" MAX( A1 ) ")

# print.xlex() invisibly returns the original argument, so that it can be
# used in magrittr pipelines.
str(print(xlex("ROUND(A1*2")))

Example output

root    
<U+00B0>-- A1  ref 
root     
<U+00B0>-- A$1  ref 
root     
<U+00B0>-- $A1  ref 
root      
<U+00B0>-- $A$1  ref 
root       
<U+00B0>-- A1:B2  ref 
root     
<U+00B0>-- 1:1  ref 
root     
<U+00B0>-- A:B  ref 
root           
<U+00A6>-- A1         ref
<U+00A6>-- :          operator
<U+00B0>-- SOME.NAME  name 
root               
<U+00A6>-- SOME_FUNCTION  function
<U+00A6>-- (              fun_open
<U+00A6>-- )              fun_close
<U+00A6>-- :              operator
<U+00B0>-- B2             ref 
root               
<U+00A6>-- SOME_FUNCTION  function
<U+00A6>-- (              fun_open
<U+00A6>-- )              fun_close
<U+00A6>-- :              operator
<U+00B0>-- SOME.NAME      name 
root         
<U+00A6>-- Sheet1!  sheet
<U+00B0>-- A1       ref 
root            
<U+00A6>-- 'Sheet 1'!  sheet
<U+00B0>-- A1          ref 
root                  
<U+00A6>-- 'It''s a sheet'!  sheet
<U+00B0>-- A1                ref 
root                
<U+00A6>-- Sheet1:Sheet2!  sheet
<U+00B0>-- A1              ref 
root                    
<U+00A6>-- 'Sheet 1:Sheet 2'!  sheet
<U+00B0>-- A1                  ref 
root            
<U+00A6>-- [1]Sheet1!  sheet
<U+00B0>-- A1          ref 
root               
<U+00A6>-- '[1]Sheet 1'!  sheet
<U+00B0>-- A1             ref 
root                                      
<U+00A6>-- 'C:\My Documents\[file.xlsx]Sheet1'!  sheet
<U+00B0>-- A1                                    ref 
root       
<U+00A6>-- MAX    function
<U+00B0>-- (      fun_open
    <U+00A6>-- 1  number
    <U+00A6>-- ,  separator
    <U+00B0>-- 2  number
<U+00B0>-- )      fun_close 
root                         
<U+00A6>-- _xll.MY_CUSTOM_FUNCTION  function
<U+00A6>-- (                        fun_open
<U+00B0>-- )                        fun_close 
root                
<U+00B0>-- MY_NAMED_RANGE  name 
root      
<U+00B0>-- XFD1  ref 
root      
<U+00B0>-- XFE1  name 
root          
<U+00B0>-- A1048576  ref 
root          
<U+00B0>-- A1048577  name 
root       
<U+00B0>-- LOG10  ref 
root       
<U+00A6>-- LOG10  function
<U+00A6>-- (      fun_open
<U+00B0>-- )      fun_close 
root         
<U+00B0>-- LOG:LOG  ref 
root     
<U+00B0>-- LOG  name 
root     
<U+00A6>-- LOG  function
<U+00A6>-- (    fun_open
<U+00B0>-- )    fun_close 
root       
<U+00A6>-- A1.2!  sheet
<U+00B0>-- A1     ref 
root             
<U+00B0>-- "Some text"  text 
root                 
<U+00B0>-- "Some ""text"""  text 
root   
<U+00B0>-- 1  number 
root     
<U+00B0>-- 1.2  number 
root    
<U+00B0>-- -1  number 
root    
<U+00A6>-- -1  number
<U+00A6>-- -   operator
<U+00B0>-- 1   number 
root    
<U+00A6>-- -1  number
<U+00A6>-- +   operator
<U+00B0>-- -1  number 
root       
<U+00A6>-- MAX    function
<U+00B0>-- (      fun_open
    <U+00A6>-- -  operator
    <U+00A6>-- 1  number
    <U+00A6>-- -  operator
    <U+00B0>-- 1  number
<U+00B0>-- )      fun_close 
root         
<U+00B0>-- -1.2E-3  number 
root      
<U+00B0>-- TRUE  bool 
root         
<U+00B0>-- TRUEISH  name 
root       
<U+00A6>-- TRUE!  sheet
<U+00B0>-- A1     ref 
root      
<U+00A6>-- TRUE  function
<U+00A6>-- (     fun_open
<U+00B0>-- )     fun_close 
root         
<U+00B0>-- #DIV/0!  error 
root      
<U+00B0>-- #N/A  error 
root        
<U+00B0>-- #NAME?  error 
root        
<U+00B0>-- #NULL!  error 
root       
<U+00B0>-- #NUM!  error 
root       
<U+00B0>-- #REF!  error 
root         
<U+00B0>-- #VALUE!  error 
root    
<U+00A6>-- 1   number
<U+00A6>-- <>  operator
<U+00B0>-- 2   number 
root    
<U+00A6>-- 1   number
<U+00A6>-- <=  operator
<U+00B0>-- 2   number 
root   
<U+00A6>-- 1  number
<U+00A6>-- <  operator
<U+00B0>-- 2  number 
root   
<U+00A6>-- 1  number
<U+00A6>-- =  operator
<U+00B0>-- 2  number 
root   
<U+00A6>-- 1  number
<U+00A6>-- &  operator
<U+00B0>-- 2  number 
root   
<U+00A6>-- 1  number
<U+00A6>--    operator
<U+00B0>-- 2  number 
root       
<U+00B0>-- (      paren_open
    <U+00A6>-- 1  number
    <U+00A6>-- ,  operator
    <U+00B0>-- 2  number
<U+00B0>-- )      paren_close 
root   
<U+00A6>-- 1  number
<U+00B0>-- %  operator 
root    
<U+00A6>-- A1  ref
<U+00A6>-- ,   operator
<U+00B0>-- B2  ref 
root        
<U+00B0>-- (       paren_open
    <U+00A6>-- A1  ref
    <U+00A6>-- ,   operator
    <U+00B0>-- B2  ref
<U+00B0>-- )       paren_close 
root        
<U+00A6>-- MAX     function
<U+00B0>-- (       fun_open
    <U+00A6>-- A1  ref
    <U+00A6>-- ,   separator
    <U+00B0>-- B2  ref
<U+00B0>-- )       fun_close 
root            
<U+00A6>-- SMALL       function
<U+00B0>-- (           fun_open
    <U+00B0>-- (       paren_open
        <U+00A6>-- A1  ref
        <U+00A6>-- ,   operator
        <U+00B0>-- B2  ref
    <U+00A6>-- )       paren_close
    <U+00A6>-- ,       separator
    <U+00B0>-- 1       number
<U+00B0>-- )           fun_close 
root       
<U+00A6>-- MAX    function
<U+00B0>-- (      fun_open
    <U+00A6>-- 1  number
    <U+00A6>-- ,  separator
    <U+00B0>-- 2  number
<U+00B0>-- )      fun_close 
root           
<U+00A6>-- MAX        function
<U+00B0>-- (          fun_open
    <U+00A6>-- MIN    function
    <U+00B0>-- (      fun_open
        <U+00A6>-- 1  number
        <U+00A6>-- ,  separator
        <U+00B0>-- 2  number
    <U+00A6>-- )      fun_close
    <U+00A6>-- ,      separator
    <U+00B0>-- 3      number
<U+00B0>-- )          fun_close 
root       
<U+00B0>-- {      open_array
    <U+00A6>-- 1  number
    <U+00A6>-- ,  separator
    <U+00A6>-- 2  number
    <U+00A6>-- ;  separator
    <U+00A6>-- 3  number
    <U+00A6>-- ,  separator
    <U+00B0>-- 4  number
<U+00B0>-- }      close_array 
root       
<U+00A6>-- 1      number
<U+00A6>-- *      operator
<U+00B0>-- (      paren_open
    <U+00A6>-- 2  number
    <U+00A6>-- +  operator
    <U+00B0>-- 3  number
<U+00B0>-- )      paren_close 
  level        type token
1     0      number     1
2     0    operator     *
3     0  paren_open     (
4     1      number     2
5     1    operator     +
6     1      number     3
7     0 paren_close     )
root           
<U+00A6>-- MAX        function
<U+00B0>-- (          fun_open
    <U+00B0>-- {      open_array
        <U+00A6>-- 1  number
        <U+00A6>-- ,  separator
        <U+00A6>-- 2  number
        <U+00A6>-- ;  separator
        <U+00A6>-- 3  number
        <U+00A6>-- ,  separator
        <U+00B0>-- 4  number
    <U+00B0>-- }      close_array
<U+00B0>-- )          fun_close 
root                         
<U+00A6>-- =                        operator
<U+00A6>-- MAX                      function
<U+00B0>-- (                        fun_open
    <U+00B0>-- {                    open_array
        <U+00A6>-- -1E-2            number
        <U+00A6>-- ,                separator
        <U+00A6>-- TRUE             bool
        <U+00A6>-- ;                separator
        <U+00A6>-- #N/A             error
        <U+00A6>-- ,                separator
        <U+00B0>-- "Hello, World!"  text
    <U+00B0>-- }                    close_array
<U+00B0>-- )                        fun_close 
root         
<U+00B0>-- [@col2]  structured_ref 
root             
<U+00A6>-- SUM          function
<U+00B0>-- (            fun_open
    <U+00B0>-- [col22]  structured_ref
<U+00B0>-- )            fun_close 
root              
<U+00B0>-- Table1[col1]  structured_ref 
root                       
<U+00B0>-- Table1[[col1]:[col2]]  structured_ref 
root                  
<U+00B0>-- Table1[#Headers]  structured_ref 
root                           
<U+00B0>-- Table1[[#Headers],[col1]]  structured_ref 
root                                  
<U+00B0>-- Table1[[#Headers],[col1]:[col2]]  structured_ref 
root                       
<U+00B0>-- [1]!'DDE_parameter=1'  DDE 
root                     
<U+00B0>-- 'Quote'|'NYSE'!ZAXX  DDE 
root        
<U+00A6>--         operator
<U+00A6>-- MAX     function
<U+00B0>-- (       fun_open
    <U+00A6>--     operator
    <U+00A6>-- A1  ref
    <U+00B0>--     operator
<U+00A6>-- )       fun_close
<U+00B0>--         operator 
root        
<U+00A6>-- ROUND   function
<U+00B0>-- (       fun_open
    <U+00A6>-- A1  ref
    <U+00A6>-- *   operator
    <U+00B0>-- 2   number 
Classes 'xlex', 'tbl_df', 'tbl' and 'data.frame':	5 obs. of  3 variables:
 $ level: int  0 0 1 1 1
 $ type : chr  "function" "fun_open" "ref" "operator" ...
 $ token: chr  "ROUND" "(" "A1" "*" ...

tidyxl documentation built on Nov. 16, 2020, 5:09 p.m.