kgrams provides tools for training and evaluating $k$-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. It is based on a C++ back-end, which makes kgrams fast, coupled with an accessible R API that aims to streamline model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.
If you have no idea about what $k$-gram models are and didn't get here by accident, you can check out my hands-on tutorial post on $k$-gram language models using R at DataScience+.
You can install the latest release of kgrams from CRAN with:
install.packages("kgrams")
You can install the development version from my R-universe with:
install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")
This example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare's play "Much Ado About Nothing" using kgrams.
library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
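To make "k-gram frequency counts" concrete, here is a minimal base-R sketch that counts k-grams in a toy corpus. It is only a conceptual illustration, not how kgrams computes its tables: the corpus and the helper `count_kgrams()` are hypothetical, tokenization is a plain split on spaces, and no sentence-boundary padding is applied.

```r
# Toy corpus: one sentence per element (hypothetical data, not from kgrams)
sentences <- c("a b a b", "a b c")

# Count all k-grams of order k in space-tokenized sentences.
# count_kgrams() is an illustrative helper, not part of the kgrams API.
count_kgrams <- function(sentences, k) {
  tokenized <- strsplit(sentences, " ", fixed = TRUE)
  grams <- unlist(lapply(tokenized, function(tokens) {
    # A sentence shorter than k tokens contributes no k-grams
    if (length(tokens) < k) return(character(0))
    starts <- seq_len(length(tokens) - k + 1)
    vapply(
      starts,
      function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
      character(1)
    )
  }))
  table(grams)
}

unigrams <- count_kgrams(sentences, 1)
bigrams <- count_kgrams(sentences, 2)
print(unigrams)
print(bigrams)
```

A smoothed language model is then built from count tables of this kind for every order up to N.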
We can now use this language_model to compute sentence and word continuation probabilities:
# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."
              ),
            model = mkn
            )
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
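The two kinds of probability are related by the chain rule: a sentence probability is the product of the continuation probabilities of its words, each conditioned on the preceding context. The sketch below illustrates this with an unsmoothed (maximum-likelihood) bigram model in base R. Everything in it is an assumption made for illustration: the toy corpus, the boundary tokens `<s>` and `</s>`, and the helpers `tokenize()`, `p_cont()` and `p_sentence()` are hypothetical and are not part of kgrams.

```r
# Toy corpus (hypothetical); "<s>" and "</s>" are illustrative boundary tokens
corpus <- c("a b", "a c", "a b")
tokenize <- function(s) c("<s>", strsplit(s, " ", fixed = TRUE)[[1]], "</s>")

# Collect every (context, word) pair observed in the corpus
pairs <- do.call(rbind, lapply(corpus, function(s) {
  tok <- tokenize(s)
  data.frame(context = head(tok, -1), word = tail(tok, -1))
}))

# Maximum-likelihood continuation probability P(word | context).
# Returns NaN for a context never seen in the corpus (0 / 0).
p_cont <- function(word, context) {
  in_context <- pairs$context == context
  sum(in_context & pairs$word == word) / sum(in_context)
}

# Sentence probability: product of continuation probabilities (chain rule),
# including the probability of ending the sentence
p_sentence <- function(s) {
  tok <- tokenize(s)
  prod(mapply(p_cont, tail(tok, -1), head(tok, -1)))
}

print(p_cont("b", "a"))
print(p_sentence("a b"))
print(p_sentence("a c"))
```

With unsmoothed estimates, any sentence containing an unseen bigram gets probability zero (or NaN for an unseen context), which is the problem that smoothers such as modified Kneser-Ney are designed to address.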
Here are some sentences sampled from the language model's distribution at temperatures t = c(1, 0.1, 10):
# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
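To get a feel for what the temperature does, the base-R sketch below rescales a single next-word distribution, assuming the usual convention in which probabilities at temperature t are proportional to exp(log(p) / t). The helper `apply_temperature()` and the toy probabilities are hypothetical; this is not the internal sampling code of kgrams.

```r
# Rescale a probability vector p at temperature t:
# the result is proportional to exp(log(p) / t), i.e. to p^(1 / t).
# apply_temperature() is an illustrative helper, not part of the kgrams API.
apply_temperature <- function(p, t) {
  stopifnot(t > 0)
  w <- exp(log(p) / t)
  w / sum(w)
}

# Toy next-word distribution (made-up numbers)
p <- c(the = 0.7, a = 0.2, an = 0.1)

p_unit <- apply_temperature(p, 1)    # unchanged distribution
p_cold <- apply_temperature(p, 0.1)  # mass concentrates on the likeliest word
p_hot <- apply_temperature(p, 10)    # distribution flattens towards uniform

print(round(rbind(p_unit, p_cold, p_hot), 3))
```

Under this convention, low temperatures favour the most probable continuations, while high temperatures push sampling towards uniformly random words.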
For further help, you can consult the reference page of the kgrams website or open an issue on the GitHub repository of kgrams. A vignette is available on the website, illustrating the process of building language models in depth.