# knitr chunk options and inline number formatting for the article
options(knitr.table.format = "html")
options(max.print = 100, scipen = 999, width = 800)
knitr::opts_chunk$set(echo = FALSE,
                      cache = FALSE,
                      prompt = FALSE,
                      eval = TRUE,
                      tidy = TRUE,
                      root.dir = "..",
                      fig.height = 8,
                      fig.width = 20,
                      comment = NA,
                      message = FALSE,
                      warning = FALSE)
knitr::opts_knit$set(width = 100, figr.prefix = TRUE, figr.link = TRUE)

# Print inline numbers with thousands separators
knitr::knit_hooks$set(inline = function(x) {
  prettyNum(x, big.mark = ",")
})

# Packages used throughout the analysis
library(data.table)
library(MASS)
library(extrafont)
library(ggplot2)
library(gridExtra)
library(kfigr)
library(kableExtra)
library(NLPStudio)

"Prediction is very difficult, especially if it's about the future."
- Nils Bohr, Nobel laureate in physics

In this article, we set ourselves the natural language processing (NLP) task of word prediction using a generative n-gram language model, and in doing so, we will show that it's actually not that hard.

Why should I read this?

Generative language models are used in a range of NLP applications such as chatbots for messaging apps, digital assistants, statistical machine translation, speech recognition, handwriting recognition, spelling correction, Augmentative and Alternative Communication (AAC), and word prediction. In fact, the demand for NLP skills is hot!

  1. According to Upwork, NLP ranks first among the fastest-growing skills in the global job market. The demand for workers with experience in NLP grew by more than 200% in Q4 2016, faster than any other skill set [@Upwork2017].
  2. Forbes ranked natural language generation, speech recognition, text analytics, and natural language processing among the Top 10 Hot Artificial Intelligence (AI) Technologies [@Press2018].
  3. By 2021, more than 50% of enterprises will be spending more per annum on language-model-based bots and chatbot creation than on traditional mobile app development [@Gartner2018].

What will I learn?

The focus of this article is word prediction using an n-gram language model with a smoothing algorithm developed by Slava Katz in his 1987 paper "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer" [@Katz1987]. By the end of this piece, you will be able to design and implement your own generative language model in R, Python, or any other language of your choosing.

Ok, what's the plan?

First, we'll build the foundation:

  1. Generative language models
  2. Maximum likelihood estimation
  3. The Markov assumption
  4. N-grams

Then we'll turn to the practical complications of out-of-vocabulary words and smoothing, working our way up to the Katz back-off model.

Let's get started!

What is a language model?

Wikipedia offers the following definition:

A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability $P(w_{1},\ldots,w_{m})$ to the whole sequence.

Such a model, for example, would predict that the following sentence has a much higher probability of appearing in a text:

“I’m just one stomach flu away from my goal weight.”

-- Devil Wears Prada (2006)

than does:

"Stomach, goal flu I'm from my just weight one away"

Language models that assign probabilities to entire sentences may also be used to predict the next word in a sentence by assigning probabilities to candidate words, given the preceding sequence of words, or 'history'. Probabilities of words and sequences of words are based upon their observed frequencies in a collection of written texts, known as a corpus of text.

Maximum Likelihood Estimation

Let's begin by estimating the probability of the word "weight" given its history “I’m just one stomach flu away from my goal”, which we notate as the conditional probability, $P(w|h)$ or $P(w_n|w_1^{n-1})$. Concretely we want to know:

$$P(\text{weight} \mid \text{I'm just one stomach flu away from my goal})$$

Theoretically, we might estimate the probability of the word $w_n$ given the history $w_1, w_2, \ldots, w_{n-1}$ by counting the number of times the full sequence $w_1, w_2, \ldots, w_n$ occurs and normalizing that by the number of times the history occurs, as follows:

$$P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, w_2, \ldots, w_{n})}{C(w_1, w_2, \ldots, w_{n-1})}$$

This is known as maximum likelihood estimation, or MLE, because it is the estimate that maximizes the likelihood of the training data. Yet, even with a corpus as large as the internet, new language and new sequences are created all the time, and most long sequences are never observed at all, leaving their counts at zero. We need a smarter way to estimate the probability of the next word, given its history.
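To see the problem concretely, here is a minimal sketch using a tiny, made-up corpus; the corpus, the `count_seq` helper, and the test sentences are all hypothetical and purely illustrative:

```r
# A tiny, hypothetical training corpus (for illustration only).
corpus <- c("i'm just one stomach flu away from my goal weight",
            "my goal weight is within reach",
            "one more goal for the team")

# Count how many times a word sequence (given as a single string) occurs.
count_seq <- function(corpus, sequence) {
  sum(lengths(regmatches(corpus, gregexpr(sequence, corpus, fixed = TRUE))))
}

history <- "i'm just one stomach flu away from my goal"

# MLE: C(history, w) / C(history)
count_seq(corpus, paste(history, "weight")) / count_seq(corpus, history)  # 1
count_seq(corpus, paste(history, "line"))   / count_seq(corpus, history)  # 0

# For a history that never appears in the corpus, the ratio is 0/0 (NaN).
count_seq(corpus, "one stomach bug away from my goal weight") /
  count_seq(corpus, "one stomach bug away from my goal")                  # NaN
```

Even this short history pushes most continuations to a count of zero; with realistic histories, both numerator and denominator are almost always zero.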

Markov Assumption

Russian mathematician Andrey Markov observed that language production can be modeled as a "memoryless" stochastic process in which the conditional probability distribution of a future word depends only upon the present word [@Markov1906]. We can therefore approximate the probability of a word given its entire history by the probability of the word given only the prior word:

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$

The Markov assumption is the basis for the simplest language model that assigns probabilities to sentences and sequences of words, the N-gram model.
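Applied to our running example, this is nothing more than a direct substitution into the approximation above:

$$P(\text{weight} \mid \text{I'm just one stomach flu away from my goal}) \approx P(\text{weight} \mid \text{goal})$$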

N-Grams

An N-gram is simply a sequence of $N$ words. A unigram is a sequence of one word (if there is such a thing), a bigram is a sequence of two words, such as "my goal", and a trigram is a sequence of three words, as in "from my goal". The trigram model is fairly standard, although models up to 5-grams are not uncommon.
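As a quick illustration, extracting the N-grams of a tokenized sentence is just a matter of sliding a window of length $N$ over it. The `ngrams` helper below is a hypothetical convenience function, not an NLPStudio API:

```r
# Slide a window of length n over a token vector and paste each window together.
ngrams <- function(tokens, n) {
  m <- length(tokens) - n + 1
  vapply(seq_len(m),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- strsplit("i'm just one stomach flu away from my goal weight", " ")[[1]]

ngrams(tokens, 2)  # bigrams, e.g. "my goal"
ngrams(tokens, 3)  # trigrams, e.g. "from my goal"
```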

The language model that estimates the probability of the next word using only the previous word is the bigram model. This generalizes to the trigram model, which takes into account the previous two words, and to the N-gram model, which considers the previous $N-1$ words. Thus the general equation for the N-gram approximation of the conditional probability of the next word in a sentence is [@Jurafsky2016]:

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$

We can now compute the maximum likelihood estimate of the N-gram probabilities as follows:

$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}, w_n)}{C(w_{n-N+1}^{n-1})}$$

Equation `r figr("mleEstimation-I", F, type="Equation")` uses the relative frequency of a particular sequence and its prefix to estimate N-gram probabilities.
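Here is a minimal sketch of the bigram ($N = 2$) case of this estimate, using the data.table package loaded above; the sentences and all object names are illustrative only:

```r
library(data.table)

# Hypothetical training sentences (illustration only).
sentences <- c("i'm just one stomach flu away from my goal weight",
               "my goal weight is within reach")

# Build bigram counts C(w_{n-1}, w_n) from whitespace-tokenized sentences.
tokens  <- strsplit(sentences, "\\s+")
bigrams <- rbindlist(lapply(tokens, function(w) {
  data.table(w1 = head(w, -1), w2 = tail(w, -1))
}))
counts <- bigrams[, .(c_bigram = .N), by = .(w1, w2)]

# MLE: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
counts[, c_history := sum(c_bigram), by = w1]
counts[, p_mle := c_bigram / c_history]

# "weight" follows "goal" in both sentences, so P(weight | goal) = 1 here.
counts[w1 == "goal" & w2 == "weight"]
```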

Out-of-Vocabulary (OOV) Words

Intrinsic evaluations require language models to deal with unseen words in the perplexity calculation. There are essentially two techniques for training probabilities of unseen words. The first is to choose a fixed vocabulary in advance, convert any word in the training set that is not in the vocabulary to an unknown word token `<UNK>`, then estimate the probabilities for `<UNK>` from its counts in the training set. The other method, should a fixed vocabulary not be available, is to convert the first occurrence of each word type in the training set to `<UNK>`, then train the model as normal [@Jurafsky2016].
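A minimal sketch of both approaches, assuming a whitespace-tokenized training set and an `<UNK>` pseudo-word; the token vector and vocabulary below are made up for illustration:

```r
train_tokens <- c("my", "goal", "weight", "is", "my", "only", "goal")

# Approach 1: fixed vocabulary chosen in advance; everything else becomes <UNK>.
vocabulary   <- c("my", "goal", "weight")
closed_vocab <- ifelse(train_tokens %in% vocabulary, train_tokens, "<UNK>")
closed_vocab
#> [1] "my" "goal" "weight" "<UNK>" "my" "<UNK>" "goal"

# Approach 2: no fixed vocabulary; the first occurrence of each word type
# becomes <UNK>, then the model is trained as usual.
first_seen <- !duplicated(train_tokens)
open_vocab <- replace(train_tokens, first_seen, "<UNK>")
open_vocab
#> [1] "<UNK>" "<UNK>" "<UNK>" "<UNK>" "my" "<UNK>" "goal"
```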

Smoothing

Smoothing refers to a class of techniques for dealing with unseen events by adjusting the maximum likelihood estimates: low probabilities (including zeros) are adjusted upward and high probabilities downward. This not only prevents zero probabilities, but improves the predictive accuracy of the language model overall. The principal techniques are:

- Add-k smoothing [@Jeffreys61] is one of the simplest: the count of each n-gram is increased by $\delta$, typically $0 < \delta \leq 1$ (a short sketch follows this list).
- Good-Turing smoothing [@Good1953] estimates the probability of unseen events by reallocating the probability mass of n-grams that occur $r + 1$ times in the training data to the n-grams that occur $r$ times. Since the algorithm doesn't combine higher-order models with lower-order models, it is typically used together with other smoothing techniques [@Chen1998].
- Jelinek-Mercer smoothing [@JelMer80] interpolates higher-order maximum likelihood probabilities with lower-order probabilities, where the interpolation factor $\lambda_{w_{i-n+1}^{i-1}}$ is estimated for each history using held-out data.
- Katz smoothing [@Katz1987] extends the Good-Turing estimate by adding the combination of higher-order models with lower-order models.
- Witten-Bell smoothing [@Bell:1990:TC:77753] extends Jelinek-Mercer smoothing in that the $n$th-order smoothed model is defined recursively as a linear interpolation between the $n$th-order maximum likelihood model and the $(n-1)$th-order smoothed model. The interpolation parameter $\lambda_{w_{i-n+1}^{i-1}}$ is computed by counting the number of unique words that follow the history $w_{i-n+1}^{i-1}$.
- Absolute discounting [@Ney1994a], like Jelinek-Mercer smoothing, combines higher- and lower-order models; however, instead of multiplying the higher-order maximum likelihood distribution by a factor $\lambda_{w_{i-n+1}^{i-1}}$, a discount $D \leq 1$ is subtracted from each non-zero count.
- Kneser-Ney smoothing [@Kneser1995] extends absolute discounting by making the lower-order distribution proportional not to a word's count, but to the number of distinct words that precede it in the training text.
- Modified Kneser-Ney [@Chen1998] enhances the discounting regime by using separate discounts $d_1$, $d_2$, and $d_{3+}$ for N-grams with counts of 1, 2, and 3 or more, respectively.
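As a minimal sketch of the simplest of these, add-k smoothing for bigrams, where $V$ denotes the vocabulary size and both the choice of $k$ (the $\delta$ above) and the counts are made up for illustration:

```r
# Add-k smoothed bigram estimate:
#   P(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + k) / (C(w_{n-1}) + k * V)
# where V is the vocabulary size and 0 < k <= 1.
add_k_bigram <- function(c_bigram, c_history, V, k = 0.5) {
  (c_bigram + k) / (c_history + k * V)
}

V <- 10000  # hypothetical vocabulary size

# A bigram seen twice after a history seen twice no longer gets P = 1 ...
add_k_bigram(c_bigram = 2, c_history = 2, V = V)   # ~0.0005

# ... and an unseen bigram no longer gets P = 0.
add_k_bigram(c_bigram = 0, c_history = 2, V = V)   # ~0.0001
```

High counts are discounted and the freed probability mass is spread over unseen bigrams, which is exactly the behavior described above.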


