knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(kgrams)
kgrams
provides R users with a set of tools for training, tuning and exploring $k$-gram language models[^word_level]. It gives support for a number of common Natural Language Processing (NLP) tasks: from the basic ones, such as extracting (tokenizing) $k$-grams from a text and predicting sentence or continuation probabilities, to more advanced ones such as computing language model perplexities [^perplexity] and sampling sentences according the language model's probability distribution.
Furthermore, it supports many classical $k$-gram smoothing methods, including the well-known modified Kneser-Ney algorithm, first described in [@chen1999empirical], and widely considered the best performing smoothing technique for $k$-gram models.
$k$-gram language models are notoriously demanding from the space point of view, and many of the toolkits available for $k$-gram based NLP employ various techniques and data structures to achieve the data compression required by the large scales of industry (and, sometimes, academic) applications (see [@Pibiri_2019] for a recent review). On the other hand, at such large scales, neural language models are often the most economic and best performing choice, and this is likely to become more and more so in the future. In developing kgrams
, I made no special attempt at data compression,
and $k$-grams and count estimates are stored in plain C++ STL hash-tables, which can grow rapidly large as the size of corpora and dictionaries increases.
On the other hand, most focus is put on providing a fast, time efficient implementation, with intuitive interfaces for text processing and for model evaluation, and a reasonably large choice of pre-implemented smoothing algorithms, making kgrams
suitable for small- and medium-scale language model experiments, for rapidly building baseline models, and for pedagogical purposes.
In my point of view, the interest in $k$-gram language models is mainly
pedagogical, as they provide very simple algorithms (together with all their
limitations) for learning a natural language's sentence probability
distribution. Nonetheless, and for the same reasons, $k$-gram models can also
provide a useful, quick baseline for model building with more complex
algorithms. An R implementation of classical $k$-gram smoothing techniques is
lacking at the time of writing, and the goal of kgrams
is to fill this gap.
In the following Sections, I illustrate the prototypical workflow for building a $k$-gram language model with kgrams
, show how to compute probabilities and perplexities, and (for the sake of fun!) generate random text at different temperatures.
[^word_level]: Here and below, when we talk about "language models", we always refer to word-level language models. In particular, a $k$-gram is a $k$-tuple of words.
[^perplexity]: Perplexity is a standard evaluation metric for language models, based on the model's sentence probability distribution cross-entropy with the empirical distribution of a test corpus. It is described in some more detail in this Subsection.
This section illustrates the typical workflow for building a $k$-gram language
model with kgrams
. In summary, this involves the following main steps:
We illustrate all these steps in the following.
kgrams
offers two options for reading the text corpora used in its computations, which are basically in-memory and out-of-memory solutions:
character
vector.connection
. This solution includes, for instance, reading text from a file,
from an URL, or from the standard input.The out-of-memory solution can be useful for training over large corpora without the need of storing the entire text into the RAM.
In this vignette, for illustration, we will use the example dataset kgrams::much_ado
(William Shakespeare's "Much Ado About Nothing").
One can (and usually should) apply some transformations to the raw training corpus before feeding it as input to the $k$-gram counting algorithm. In particular, the algorithm considers as a sentence each entry of its pre-processed input, and pads each sentence with Begin-Of-Sentence (BOS) and End-Of-Sentence (EOS) tokens. It considers as a word any substring of a sentence delimited by (one or more) space characters.
For the moment, we only need to define the functions used for preprocessing and
sentence tokenization. We will use the following functions, which leverage on the basic utilities kgrams::preprocess()
and kgrams::tknz_sent()
, and perform some additional steps, since we will be reading raw HTML lines from the URL connection created above.
.preprocess <- function(x) { # Remove speaker name and locations (boldfaced in original html) x <- gsub("<b>[A-z]+</b>", "", x) # Remove other html tags x <- gsub("<[^>]+>||<[^>]+$||^[^>]+>$", "", x) # Apply standard preprocessing including lower-case x <- kgrams::preprocess(x) # Collapse to a single string to avoid splitting into more sentences at the end of lines x <- paste(x, collapse = " ") return(x) } .tknz_sent <- function(x) { # Tokenize sentences x <- kgrams::tknz_sent(x, keep_first = TRUE) # Remove empty sentences x <- x[x != ""] return(x) }
We can now obtain $k$-gram frequency counts from Shakespeare with a single command, using the function kgram_freqs()
. The following stores $k$-gram counts for $k$-grams of order less than or equal to $N = 5$:
freqs <- kgram_freqs(much_ado, # Read Shakespeare's text from connection N = 5, # Store k-gram counts for k <= 5 .preprocess = .preprocess, # preprocess text .tknz_sent = .tknz_sent, # tokenize sentences verbose = FALSE ) freqs
The object freqs
is an object of class kgram_freqs
, i.e. a $k$-gram frequency table. We can obtain a first informative summary of what this object contains by calling summary()
:
summary(freqs)
The parameter V
is the size of the dictionary, which was created behind the scenes by kgram_freqs()
, using all words encountered in the text. In alternative, we could have used a pre-specified dictionary through the argument dict
, and specify whether new words (not present in the original dictionary) should be added to it, or be replaced by an Unknown-Word (UNK) token, by the argument open_dict
; see ?kgram_freqs
for further details. The number of distinct unigrams is greater than the size of the dictionary, because the former also includes the special BOS and EOS tokens.
Notice that the functions .preprocess()
and .tknz_sent()
we defined above
are passed as arguments of kgram_freqs()
[^preprocessing]. These are also saved in the final
kgram_freqs
object, and are by default applied also to inputs at prediction time.
The following shows how to query $k$-gram counts from the frequency table created above [^string_concat]:
# Query some simple unigrams and bigrams query(freqs, c("leonato", "enter leonato", "thy", "smartphones")) # Query k-grams at the beginning or end of a sentence query(freqs, c(BOS() %+% BOS() %+% "i", "love" %+% EOS())) # Total number of words processed query(freqs, "") # Total number of sentences processed query(freqs, EOS())
The most important use of kgram_freqs
objects is to create language models, as we illustrate in the next step.
[^preprocessing]: Strictly speaking, a single argument .preprocess
would suffice, as the processed input is (symbolically) .tknz_sent(.preprocess(input))
. Having two separate arguments for preprocessing and sentence tokenization has a couple of advantages, as explained in ?kgram_freqs
.
[^string_concat]: The string concatenation operator %+%
is equivalent to paste(lhs, rhs)
. Also, the helpers BOS()
, EOS()
and UNK()
return the BOS, EOS and UNK tokens, respectively.
kgrams
provides support for creating language models using several classical
smoothing techniques. The list of smoothers currently supported by kgrams
can be retrieved through:
smoothers()
The documentation page ?smoothers
provides a list of original references for the various smoothers. We will use Interpolated Kneser-Ney smoothing [@Kneser1995ImprovedBF; see also @chen1999empirical], which goes under the code "kn"
. We can get some usage help for this method through the command:
info("kn")
As shown above, Kneser-Ney has one parameter $D$, which is the discount applied to bare $k$-gram frequency counts or continuation counts. We will initialize the model with $D = 0.75$, and later tune this parameter through a test corpus.
To train a language model with the $k$-gram counts stored in freqs
, use:
kn <- language_model(freqs, "kn", D = 0.75) kn
This will create a language_model
object, which can be used to obtain word continuation and sentence probabilities. Let us first get a summary of our final model:
summary(kn)
The parameter D
can be accessed and modified through the functions parameters()
and param()
, which have a similar interface to the base R function attributes()
and attr()
:
parameters(kn) param(kn, "D") param(kn, "D") <- 0.6 param(kn, "D") param(kn, "D") <- 0.75
We can also modify the order of the $k$-gram model, by choosing any number less than or equal to $N = 5$ (since we stored up to $5$-gram counts):
param(kn, "N") <- 4 # 'kn' uses only 1:4-grams param(kn, "N") param(kn, "N") <- 5 # 'kn' uses also 5-grams
In the next section we show how to use this language model for basic tasks such as predicting word and sentence probabilities, and for more complex tasks such as computing perplexities and generating random text.
language_model
objectsSo far we have created a language_model
object using Interpolated Kneser-Ney as smoothing method. In this section we show how to:
We can obtain both sentence probabilities and word continuation probabilities through the function probability()
. This is generic on the first argument, which can be a character
for sentence probabilities, or a word_context
expression for continuation probabilities.
Sentence probabilities can be obtained as follows (the first two are sentences from the training corpus):
probability(c("Did he break out into tears?", "I see, lady, the gentleman is not in your books.", "We are predicting sentence probabilities." ), model = kn )
Behind the scenes, the same .preprocess()
and .tknz_sent()
functions used during training are being applied to the input. We can turn off this behaviour by explicitly specifying the .preprocess
and .tknz_sent
arguments of probability()
.
Word continuation probabilities are the conditional probabilities of words following some given context. For instance, the probability that the words "tears"
or "pieces"
will follow the context "Did he break out into"
are computed as follows:
probability("tears" %|% "Did he break out into", model = kn) probability("pieces" %|% "Did he break out into", model = kn)
The operator %|%
takes as input a character vector on its left-hand side, i.e. the list of candidate words, and a length one character vector on its right-hand side, i.e. the given context. If the context has more than $N - 1$ words (where $N$ is the order of the language model, five in our case), only the last $N - 1$ words are used for prediction.
We can sample sentences from the probability distribution defined by our language model using sample_sentences()
. For instance:
set.seed(840) sample_sentences(model = kn, n = 10, max_length = 10 )
The sampling is performed word by word, and the output is truncated if no EOS
token is found after sampling max_length
words.
We can also sample with a temperature different from one. The temperature transformation of a probability distribution $p(i)$ is defined by:
$$p_t(i) = \dfrac{\exp(\log{p(i)} / t)} {Z(t)},$$ where $Z(t)$ is the partition function, defined in such a way that $\sum _i p_t(i) \equiv 1$. Intuitively, higher and lower temperatures make the original probability distribution smoother and rougher, respectively. By making a physical analogy, we can think of less probable words as states with higher energies, and the effect of higher (lower) temperatures is to make more (less) likely to excite these high energy states.
We can test the effects of temperature on our Shakespeare-inspired language model, by changing the parameter t
of sample_sentences()
(notice that the default t = 1
corresponds to the original distribution):
sample_sentences(model = kn, n = 10, max_length = 10, t = 0.1 # low temperature ) sample_sentences(model = kn, n = 10, max_length = 10, t = 10 # high temperature )
As explained above, sampling with low temperature gives much more weight to probable sentences, and indeed the output is very repetitive. On the contrary, high temperature makes sentence probabilities more uniform, and in fact our output above looks quite random.
Perplexity is a standard evaluation metric for the overall performance of a language model. It is given by $P = e^H$, where $H$ is the cross-entropy of the language model sentence probability distribution against a test corpus empirical distribution:
$$ H = - \dfrac{1}{W}\sum _s\ \ln (\text {Prob}(s)) $$ Here $W$ is total number of words in the test corpus (following Ref. [@chen1999empirical], we include counts of the EOS token, but not the BOS token, in $W$), and the sum extends over all sentences in the test corpus. Perplexity does not give direct information on the performance of a language model in a specific end-to-end task, but is often found to correlate with the latter, which provides a practical justification for the use of this metric. Notice that better models are associated with lower perplexities, and that $H$ is proportional to the negative log-likelihood of the corpus under the language model assumption.
Perplexities can be computed in kgrams
using the function perplexity()
, which can read text both from a character
vector and from a connection
. We will take our test corpus again from Shakespeare's opus, specifically the play "A Midsummer Night's Dream", which is example data from kgrams
namespace:
midsummer[840]
We can compute the perplexity of our Kneser-Ney $5$-gram model kn
against this corpus as follows:
perplexity(midsummer, model = kn)
We can use perplexity to tune our model parameter $D$. We compute perplexity over a grid of values for D
and plot the results. We do this for the $k$-gram models with $k \in {2, 3, 4, 5}$:
D_grid <- seq(from = 0.5, to = 0.99, by = 0.01) FUN <- function(D, N) { param(kn, "N") <- N param(kn, "D") <- D perplexity(midsummer, model = kn) } P_grid <- lapply(2:5, function(N) sapply(D_grid, FUN, N = N)) oldpar <- par(mar = c(2, 2, 1, 1)) plot(D_grid, P_grid[[1]], type = "n", xlab = "D", ylab = "Perplexity", ylim = c(300, 500)) lines(D_grid, P_grid[[1]], col = "red") lines(D_grid, P_grid[[2]], col = "chartreuse") lines(D_grid, P_grid[[3]], col = "blue") lines(D_grid, P_grid[[4]], col = "black") par(oldpar)
We see that the optimal choices for D
are close to its maximum allowed value
D = 1
, for which the performance of the 2-gram model is slightly worse than the higher order models, and that the 5-gram model performs generally worse than the 3-gram and 4-gram models. Indeed, the optimized perplexities for the various $k$-gram orders are given by:
sapply(c("2-gram" = 1, "3-gram" = 2, "4-gram" = 3, "5-gram" = 4), function(N) min(P_grid[[N]]) )
which shows that the best performing model is the 4-gram one, while it seems that the 5-gram model is starting to overfit (which is not very surprising, given the ridiculously small size of our training corpus!).
In this vignette I have shown how to implement and explore $k$-gram language models in R using kgrams
. For further help, you can consult the reference page of the kgrams
website. Development of kgrams
takes place on its GitHub repository. If you find a bug,
please let me know by opening an issue on GitHub, and if you have any ideas or proposals for improvement, please feel welcome to send a pull request, or simply an e-mail at vgherard840@gmail.com.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.