knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(simpletrie)

simpletrie

simpletrie is a simple trie implementation in plain R code.

A trie is a structure for representing certain data as a tree of nodes. This data representation allows for fast queries of certain types.

simpletrie can build a trie from 175k words in about 3s.

What's in the box

Installation

You can install from GitHub with:

# install.package('remotes')
remotes::install_github('coolbutuseless/simpletrie')

Example: Finding sub-words from a given word

My main use-case for simpletrie is for looking up scrabble words i.e. to find all possible sub-words of a given word with:

library(simpletrie)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read the ENABLE word list
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
words <- tolower(readLines("working/enable1.txt"))
length(words)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create trie strcture
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
system.time({
  trie <- trie_create(words)
})

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Find all words which can be made with 'hi.rstats'
# '.' is a wildcard, and can match any letter. (Like a blank tile in scrabble)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
system.time({
  subwords <- trie_find_subwords(trie, 'hirstats.')
})

length(subwords)
head(subwords, 100)

What's a trie?

A trie is an ordered data structure represented as a tree of nodes.

The image below shows a very small trie which represents the words: 'cat', 'cog', 'dog' and 'dote.'

This representation allows for very fast queries of certain types.

```{dot echo=FALSE, out.width = "50%"} digraph D { root [label="root node"] c a t o1 [label="o"] g1 [label="g"] d1 [label="d"] o2 [label="o"] g2 [label="g"] t3 [label="t"] e4 [label="e"]

root -> c; root -> d1;

c -> a -> t; c -> o1 -> g1; d1 -> o2 -> g2; o2 -> t3 -> e4; } ```

Technical bits

The trie is implemented here as nested environments. Environments are created with new.env(parent = emptyenv(), hash = FALSE).

Using the emptyenv() as parent saves some memory, as the parent environment is never needed in this process.

hash is set to FALSE as using a hash table expands the memory usage by quite a bit - even if size is set to limit the initial hash table size.

Using a hash table on the environment doesn't appreciably change the speed of trie creation or preforming look-ups.

Environments are used instead of lists as I make use of the reference semantics of environment objects when building the trie.

Related Software + Resources

Acknowledgements



coolbutuseless/simpletrie documentation built on July 28, 2024, 3:49 p.m.