Reading and processing tokens from a text file is usually done in three steps: load the file, cut it into tokens, and act upon the resulting vector of strings.
The Tokenizer aims to simplify and streamline this process when tokens must be processed in a sequential manner.
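For comparison, the classic three-step pattern can be sketched in base R alone (this is the pattern the Tokenizer replaces; the temporary file is only used for illustration):

```r
# The classic three-step pattern in base R:
# 1. load the file, 2. cut into tokens, 3. act on the vector of strings.
path <- tempfile()
writeLines("alpha beta gamma", path)

text   <- readLines(path)                   # step 1: load the file
tokens <- unlist(strsplit(text, "[ \t]+"))  # step 2: split on blanks/tabs
for (tok in tokens) print(tok)              # step 3: act on each token

unlink(path)
```

Note that this loads the whole file into memory at once, whereas the Tokenizer yields one token at a time.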
# tok <- Tokenizer$new(filename=NA, skipEmptyTokens=TRUE)
# Tokenizer$getDelimiters()
# Tokenizer$setDelimiters(delims)
# Tokenizer$nextToken()
# Tokenizer$close()
# getOffset()
# setOffset(offset)
filename: The file to open.
skipEmptyTokens: Whether empty tokens ("") shall be skipped or returned.
delims: An integer vector holding the ASCII codes of characters that serve as delimiters. If not set, it defaults to blank, tab, carriage return and linefeed (the last two together form a Windows newline).
offset: An integer vector of length >= 2, where the first component holds the upper 32 bits of the offset and the second component holds the lower 32 bits.
An R6Class generator object.
While the life-cycle of the Tokenizer still requires the user to act in three phases, it abstracts away the nasties of file access and leverages the underlying operating system's prefetching. Most of all, bookkeeping is much simpler: the user only has to keep track of the object returned by the constructor and is free to pass it around between functions without caring about its current state. The Tokenizer will also try to close open files by itself before it is garbage collected.
The Tokenizer is object-oriented, so functions on any instance can be called in an OO style or in a more imperative style:
OO style: tok$nextToken()
imperative style: nextToken(tok)
Both calls will give the same result.
A new Tokenizer object, backed by a memory mapped file and the delimiters set to the default values.
new()
Create a new instance of a Tokenizer
nextToken()
Obtain the next token, i.e. the string reaching from the character after the last delimiter up to the next delimiter from the current list of delimiters. It returns NA on all invocations once the end of the file is reached.
setDelimiters()
Set the list of delimiters. It is given as an integer vector of (extended) ASCII-character values, i.e. in the range [0..255].
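Since the delimiters are plain (extended) ASCII values, they can be computed from characters with base R; for instance, the default set of blank, tab, carriage return and linefeed corresponds to:

```r
# ASCII codes of the default delimiters: blank, tab, CR, LF
as.integer(charToRaw(" \t\r\n"))  # 32  9 13 10
```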
getDelimiters()
Get the current list of delimiters.
close()
Close the file behind the tokenizer. Future calls to nextToken() will return NA. It is considered good style to close the file manually to avoid too many open handles. The file will be closed automatically when there are no more references to the Tokenizer and it is garbage collected, or upon exiting the R session.
print()
Prints the name of the currently opened file.
getOffset()
Get the offset relative to the beginning of the file of the next token.
setOffset()
Set the offset where reading should continue.
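The two-component offset encoding described under Arguments can be illustrated in plain R. The helper below is hypothetical (not part of the package) and only demonstrates the upper/lower 32-bit split; it works for positions whose halves fit into a signed 32-bit integer:

```r
# Hypothetical helper: split a file position into the two 32-bit
# components expected by setOffset() (upper 32 bits first, then lower).
split_offset <- function(pos) {
  c(as.integer(pos %/% 2^32),  # upper 32 bits
    as.integer(pos %% 2^32))   # lower 32 bits
}
split_offset(5)          # small file: c(0L, 5L)
split_offset(2^32 + 7)   # position beyond 4 GiB: c(1L, 7L)
```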
While it may be tempting to clone a tokenizer object in order to split a file into different tokens from a given start position, this is not supported: file state cannot be synchronized between the clones, leading to unpredictable results when one of them closes the underlying shared file.
For efficiency reasons, the Tokenizer will not re-stat the file once it has been successfully opened. In particular, a change of the file size can lead to unpredictable behaviour.
The sequence \r\n (carriage return followed by linefeed) will be interpreted as two distinct delimiters, yielding an empty token between them, if skipEmptyTokens=FALSE. The default setting is TRUE.
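The effect of treating \r and \n as two separate delimiters can be mimicked with base strsplit(), which shows the same empty-string artifact between them:

```r
# Splitting on \r and \n separately produces an empty token between them:
unlist(strsplit("one\r\ntwo", "[\r\n]"))  # "one" ""  "two"
```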
## Not run:
tok <- Tokenizer$new("tokenfile.txt")
tok$nextToken()
tok$print() # or just 'tok'
tok$getDelimiters()
tok$setDelimiters(c(59L, 0xaL)) # new delimiters: ';', newline
tok$setDelimiters(as.integer(charToRaw(";\n"))) # the same
tok$nextToken()
tok$setDelimiters(Tokenizer$new()$getDelimiters()) # reset to default
while (!is.na(s <- tok$nextToken())) print(s) # print the remaining tokens of the file
tok$close() # good style, but not required
## End(Not run)