
kgrams

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

kgrams provides tools for training and evaluating $k$-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. It is based on a C++ back-end, which makes kgrams fast, coupled with an accessible R API that aims to streamline model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.

For beginners

If you have no idea what $k$-gram models are and didn't get here by accident, you can check out my hands-on tutorial post on $k$-gram language models using R at DataScience+.

Installation

Released version

You can install the latest release of kgrams from CRAN with:

install.packages("kgrams")

Development version

You can install the development version from my R-universe with:

install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")

Example

This example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare's play "Much Ado About Nothing" using kgrams.

library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)

We can now use this language_model to compute sentence and word continuation probabilities:

# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."
              ), 
            model = mkn
            )
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
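To see how sentence and continuation probabilities relate, here is a minimal base-R sketch of the chain rule: a sentence probability is the product of the continuation probabilities of its tokens, including an end-of-sentence token. The numbers and the `<BOS>`/`<EOS>` labels below are made up for illustration and are not outputs of the model above; the per-token perplexity shown is one common convention and is not necessarily the exact normalization kgrams uses.

```r
# Hypothetical continuation probabilities P(word | context) for a short
# sentence; these values are illustrative, not produced by kgrams.
cond_probs <- c(
  "did | <BOS>"          = 0.02,
  "he | did"             = 0.10,
  "break | did he"       = 0.05,
  "<EOS> | did he break" = 0.20
)

# Chain rule: sum the log-probabilities (numerically safer than multiplying)
log_prob <- sum(log(cond_probs))
sentence_prob <- exp(log_prob)

# Per-token perplexity: exponential of the average negative log-probability
perplexity <- exp(-log_prob / length(cond_probs))

sentence_prob
perplexity
```

Working in log space matters in practice because products of many small probabilities quickly underflow for longer sentences.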

Here are some sentences sampled from the language model's distribution at temperatures t = c(1, 0.1, 10):

# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
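The effect of temperature can be sketched in base R. The snippet below assumes the usual temperature transformation, in which each probability is raised to the power 1/t and the result is renormalized; see the kgrams documentation for the exact definition used by sample_sentences(). The helper `apply_temperature()` and the toy distribution `p` are hypothetical and not part of the kgrams API.

```r
# Hypothetical helper: rescale a probability vector by temperature t,
# assuming the transformation p_i^(1/t) / sum_j p_j^(1/t).
apply_temperature <- function(p, t) {
  stopifnot(t > 0, all(p >= 0))
  w <- exp(log(p) / t)  # equivalent to p^(1/t); zero entries stay zero
  w / sum(w)            # renormalize so the result sums to one
}

# Toy next-word distribution (made-up numbers)
p <- c(the = 0.6, a = 0.3, of = 0.1)

p_cold <- apply_temperature(p, t = 0.1)  # concentrates mass on the mode
p_hot  <- apply_temperature(p, t = 10)   # flattens towards uniform

round(p_cold, 4)
round(p_hot, 4)
```

Under this assumption, t = 1 leaves the distribution unchanged, low temperatures make the sampled sentences more predictable, and high temperatures make them closer to random word sequences.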

Getting Help

For further help, you can consult the reference page of the kgrams website or open an issue on the GitHub repository of kgrams. A vignette illustrating the process of building language models in depth is available on the website.



vgherard/kgrams documentation built on Nov. 17, 2024, 8:56 p.m.