`sbo` provides utilities for building and evaluating text predictors based on Stupid Back-off N-gram models in R. It includes functions such as:

- `kgram_freqs()`: extract $k$-gram frequency tables from a text corpus.
- `sbo_predictor()`: train a next-word predictor via Stupid Back-off.
- `eval_sbo_predictor()`: test text predictions against an independent corpus.

You can install the latest release of `sbo` from CRAN:
```r
install.packages("sbo")
```
You can install the development version of `sbo` from GitHub:

```r
# install.packages("devtools")
devtools::install_github("vgherard/sbo")
```
This example shows how to build a text predictor with `sbo`:

```r
library(sbo)
p <- sbo_predictor(sbo::twitter_train,           # 50k tweets, example dataset
                   N = 3,                        # Train a 3-gram model
                   dict = sbo::twitter_dict,     # Top 1k words appearing in corpus
                   .preprocess = sbo::preprocess, # Preprocessing transformation
                   EOS = ".?!:;"                 # End-Of-Sentence characters
)
```
The object `p` can now be used to generate predictive text as follows:

```r
predict(p, "i love")   # a character vector
predict(p, "you love") # another character vector
predict(p, c("i love", "you love", "she loves",
             "we love", "you love", "they love")
) # a character matrix
```
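The function list above also mentions `kgram_freqs()` and `eval_sbo_predictor()`, which the example does not demonstrate. A minimal sketch of how they might fit together, assuming the package's `sbo::twitter_test` dataset is available as a held-out corpus (see the package reference for the exact evaluation output):

```r
library(sbo)

# Extract k-gram frequency tables first; a predictor can then be
# trained from the frequencies without a second pass over the corpus.
freqs <- kgram_freqs(sbo::twitter_train,
                     N = 3,
                     dict = sbo::twitter_dict,
                     .preprocess = sbo::preprocess,
                     EOS = ".?!:;")
p <- sbo_predictor(freqs)

# Test next-word predictions against an independent corpus.
# (sbo::twitter_test is assumed here; see ?eval_sbo_predictor.)
res <- eval_sbo_predictor(p, test = sbo::twitter_test)
mean(res$correct)  # fraction of correctly predicted next words
```

Splitting the workflow this way is useful when you want to tune predictor settings repeatedly: the expensive $k$-gram counting happens once, and predictors are rebuilt cheaply from the stored frequencies.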
For more general-purpose utilities for working with $n$-gram models, you can also check out my package `{kgrams}`.
For help, see the `sbo` website.