```r
options(knitr.table.format = "html")
options(max.print = 100, scipen = 999, width = 800)
knitr::opts_chunk$set(echo = FALSE, cache = FALSE, prompt = FALSE, eval = TRUE,
                      tidy = TRUE, root.dir = "..", fig.height = 8, fig.width = 20,
                      comment = NA, message = FALSE, warning = FALSE)
knitr::opts_knit$set(width = 100, figr.prefix = TRUE, figr.link = TRUE)
knitr::knit_hooks$set(inline = function(x) { prettyNum(x, big.mark = ",") })
```

```r
library(data.table)
library(MASS)
library(extrafont)
library(ggplot2)
library(gridExtra)
library(kfigr)
library(kableExtra)
library(NLPStudio)
```
"Prediction is very difficult, especially if it's about the future."
- Niels Bohr, Nobel laureate in physics
In this article, we set ourselves the natural language processing (NLP) task of word prediction using a generative n-gram language model, and in doing so, we will show that it's actually not that hard.
Generative language models are used in a range of NLP applications such as chatbots for messaging apps, digital assistants, statistical machine translation, speech recognition, handwriting recognition, spelling correction, augmentative and alternative communication (AAC), and word prediction. In fact, the demand for NLP skills is hot!
The focus of this article is word prediction using an n-gram language model with a smoothing algorithm developed by Slava Katz in his 1987 paper "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer" [@Katz1987]. By the end of this piece, you will:
- ultimately be able to design and implement your own generative language model in R, Python, or any other language of your choosing.
First, we'll build the foundation:
1. Generative Language Models
Let's get started!
Wikipedia offers the following definition:
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length $m$, it assigns a probability $P(w_{1},\ldots,w_{m})$ to the whole sequence.
Such a model, for example, would predict that the following sentence has a much higher probability of appearing in a text:
“I’m just one stomach flu away from my goal weight.”
-- The Devil Wears Prada (2006)
than does:
"Stomach, goal flu I'm from my just weight one away"
Language models that assign probabilities to entire sentences may also be used to predict the next word in a sentence by assigning probabilities to candidate words, given the preceding sequence of words, or 'history'. Probabilities of words and sequences of words are based upon their observed frequencies in a collection of written texts, known as a corpus of text.
Let's begin by estimating the probability of the word "weight" given its history “I’m just one stomach flu away from my goal”, which we notate as the conditional probability, $P(w|h)$ or $P(w_n|w_1^{n-1})$. Concretely we want to know:
$$P(\text{weight} \mid \text{I'm just one stomach flu away from my goal})$$
In theory, we might estimate the probability of the word $w_n$ following the history $w_1, w_2, \ldots, w_{n-1}$ by counting the number of times the full sequence $w_1, w_2, \ldots, w_n$ occurs and normalizing that count by the number of times the history occurs:
$$P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, w_2, \ldots, w_n)}{C(w_1, w_2, \ldots, w_{n-1})}$$

This is known as maximum likelihood estimation, or MLE, because it is the estimate that maximizes the likelihood of the training data. Yet even with a corpus as large as the internet, new words and sequences are created all the time, and most long histories are never observed at all, so their probabilities cannot be estimated this way. We need a smarter way to estimate the probability of the next word given its history.
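To see why, here is the estimator written out for our running example:

$$P(\text{weight} \mid \text{I'm just one stomach flu away from my goal}) = \frac{C(\text{I'm just one stomach flu away from my goal weight})}{C(\text{I'm just one stomach flu away from my goal})}$$

The nine-word history in the denominator may never occur even once in a very large corpus, and the ten-word sequence in the numerator is rarer still, so the estimate is usually zero or undefined.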
Russian mathematician Andrey Markov modeled language as a "memoryless" stochastic process in which the conditional probability distribution of the next word depends only upon the present word [@Markov1906]. We can therefore approximate the probability of a word given its entire history by the probability of the word given only the previous word:

$$P({w_n}|{w_1^{n-1}}) \approx P({w_n}|{w_{n-1}})$$

The Markov assumption is the basis for the simplest language model that assigns probabilities to sentences and sequences of words: the N-gram model.
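Applied to our example, the bigram version of the Markov assumption reduces the intractable conditional to one that depends only on the single word "goal":

$$P(\text{weight} \mid \text{I'm just one stomach flu away from my goal}) \approx P(\text{weight} \mid \text{goal})$$

which can be estimated from the counts of "goal weight" and "goal" alone.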
An N-gram is simply a sequence of $N$ words. A unigram is a sequence of one word (if there is such a thing), a bigram is a sequence of two words, such as "my goal", and a trigram is a sequence of three words, as in "from my goal". The trigram model is fairly standard, although models of up to 5-grams are not uncommon.
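To make the terminology concrete, here is a minimal base-R sketch (the `ngrams` helper and the naive whitespace tokenization are purely illustrative) that extracts the unigrams, bigrams, and trigrams from the history of our example sentence:

```r
# Naive whitespace tokenizer; real corpora need proper text cleaning.
tokens <- unlist(strsplit(tolower("I'm just one stomach flu away from my goal"), "\\s+"))

# Build n-grams by sliding a window of size n over the token vector.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

ngrams(tokens, 1)  # unigrams: "i'm" "just" "one" ...
ngrams(tokens, 2)  # bigrams:  "i'm just" "just one" ... "my goal"
ngrams(tokens, 3)  # trigrams: "i'm just one" ... "from my goal"
```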
The language model that estimates the probability of a word using only the previous word is the bigram model. This generalizes to the trigram model, which takes into account the previous two words, and to the N-gram model, which considers the previous $N-1$ words. Thus the general equation for the N-gram approximation of the conditional probability of the next word in a sentence is [@Jurafsky2016]:
$$P({w_n}|{w_1^{n-1}}) \approx P({w_n}|{w_{n-N+1}^{n-1}})$$
We can now compute the maximum likelihood estimate of the N-gram probabilities as follows:
$$P({w_n}|{w_{n-N+1}^{n-1}}) = \frac{C({w_{n-N+1}^{n-1}},{w_n})}{C({w_{n-N+1}^{n-1}})}$$
This equation uses the relative frequency of a particular sequence and its prefix to estimate N-gram probabilities.
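As a rough sketch of this estimate in R, consider a toy two-sentence corpus and bigram counts built with data.table (loaded in the setup code above); the corpus, the column names, and the decision to ignore sentence boundaries are all simplifications for illustration:

```r
library(data.table)

# Toy corpus; a real model is trained on millions of sentences, and
# sentence boundaries are ignored here purely for brevity.
corpus <- c("i'm just one stomach flu away from my goal weight",
            "from my goal weight to my dream weight")
tokens <- unlist(strsplit(corpus, "\\s+"))

# Bigram MLE: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
bigrams <- data.table(prefix = head(tokens, -1), word = tail(tokens, -1))
counts  <- bigrams[, .(c_bigram = .N), by = .(prefix, word)]
counts[, c_prefix := sum(c_bigram), by = prefix]
counts[, p_mle := c_bigram / c_prefix]

counts[prefix == "goal"]  # P(weight | goal) = 1 in this tiny corpus
```

The same pattern extends to trigrams and beyond by keying the counts on the previous two (or $N-1$) words instead of one.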
Intrinsic evaluation requires the language model to deal with unseen words when computing perplexity. There are essentially two techniques for estimating the probabilities of unseen words. The first is to choose a fixed vocabulary in advance and convert any word in the training set that is not in the vocabulary to an unknown word token, `<UNK>`, whose probability is then estimated like that of any other word. The second is to build the vocabulary from the training data itself, for example by replacing every word that occurs fewer than some small number of times with `<UNK>`.
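A minimal sketch of the first (closed-vocabulary) approach in R; the token vector, vocabulary, and cutoff are invented for illustration:

```r
# Illustrative training tokens and a fixed, pre-chosen vocabulary.
train_tokens <- c("i'm", "just", "one", "stomach", "flu", "away",
                  "from", "my", "goal", "weight", "my", "goal")
vocab <- c("from", "my", "goal", "weight", "<UNK>")

# Map every out-of-vocabulary token to <UNK>; n-gram counts are then
# estimated over the converted token stream as usual.
closed_tokens <- ifelse(train_tokens %in% vocab, train_tokens, "<UNK>")
table(closed_tokens)
```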
Smoothing refers to a class of techniques for dealing with unseen events by adjusting the maximum likelihood estimates, moving low (and zero) probabilities upward and high probabilities downward. This not only prevents zero probabilities but also improves the overall predictive accuracy of the language model. The most widely used techniques are:

- Add-k smoothing [@Jeffreys61]: one of the simplest techniques; the count of each n-gram is increased by $\delta$, typically $0 < \delta \leq 1$.
- Good-Turing smoothing [@Good1953]: estimates the probability of unseen events by reallocating the probability mass of n-grams that occur $r + 1$ times in the training data to n-grams that occur $r$ times. Because the method does not combine higher-order models with lower-order models, it is typically used together with other smoothing techniques [@Chen1998].
- Jelinek-Mercer smoothing [@JelMer80]: interpolates higher-order maximum likelihood estimates with lower-order probabilities, where the interpolation factor $\lambda_{w_{i-n+1}^{i-1}}$ is estimated for each history using held-out data.
- Katz smoothing [@Katz1987]: extends the Good-Turing estimate by combining higher-order models with lower-order models.
- Witten-Bell smoothing [@Bell:1990:TC:77753]: extends Jelinek-Mercer smoothing in that the $n$th-order smoothed model is defined recursively as a linear interpolation between the $n$th-order maximum likelihood model and the $(n-1)$th-order smoothed model. The interpolation parameter $\lambda_{w_{i-n+1}^{i-1}}$ is computed by counting the number of unique words that follow the history $w_{i-n+1}^{i-1}$.
- Absolute discounting [@Ney1994a]: like Jelinek-Mercer smoothing, combines higher- and lower-order models; however, instead of multiplying the higher-order maximum likelihood distribution by a factor $\lambda_{w_{i-n+1}^{i-1}}$, a discount $D \leq 1$ is subtracted from each non-zero count.
- Kneser-Ney smoothing [@Kneser1995]: an extension of absolute discounting in which the lower-order distribution is proportional not to the count of an n-gram, but to the number of distinct words it follows in the training text. Modified Kneser-Ney refines the discounting scheme by using separate discounts $D_1$, $D_2$, and $D_{3+}$ for n-grams with counts of one, two, and three or more, respectively.
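Of these, Add-k smoothing is the easiest to see in code. A minimal sketch, with an arbitrary $\delta$ and a made-up vocabulary size:

```r
# Add-k smoothed bigram probability:
# P(w | h) = (C(h, w) + delta) / (C(h) + delta * V), where V is the vocabulary size.
# delta = 1 corresponds to Laplace (add-one) smoothing.
add_k_prob <- function(c_bigram, c_prefix, vocab_size, delta = 0.5) {
  (c_bigram + delta) / (c_prefix + delta * vocab_size)
}

# Seen bigram, e.g. C("goal weight") = 2 with C("goal") = 2:
add_k_prob(c_bigram = 2, c_prefix = 2, vocab_size = 10000)
# An unseen bigram now gets a small but non-zero probability:
add_k_prob(c_bigram = 0, c_prefix = 2, vocab_size = 10000)
```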