language_model: k-gram Language Models

View source: R/language_model.R

language_model    R Documentation

k-gram Language Models

Description

Build a k-gram language model.

Principal methods supported by objects of class language_model

  • probability(): compute word continuation and sentence probabilities. See probability.

  • sample_sentences(): generate random text by sampling from the language model probability distribution at arbitrary temperature. See sample_sentences.

  • perplexity(): compute the language model perplexity on a test corpus. See perplexity.
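The three methods listed above can be sketched on a toy model as follows. The corpus, the choice of the "add_k" smoother and the sampling settings are illustrative assumptions, not recommendations.

```r
library(kgrams)

# Toy 2-gram model; add-k smoothing is chosen here (an assumption) so that
# unseen k-grams still receive non-zero probability.
freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "add_k", k = 1)

# Continuation probability of the word "a" given the context "b"
p_word <- probability("a" %|% "b", model)

# Probability of a whole sentence
p_sent <- probability("a b a", model)

# Three random sentences of at most 10 words, sampled at temperature t = 1
sents <- sample_sentences(model, n = 3, max_length = 10, t = 1)

# Perplexity of the model on a small test text
ppl <- perplexity("a b a b", model)
```

See ?probability, ?sample_sentences and ?perplexity for the full argument lists of each method.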

Usage

language_model(object, ...)

## S3 method for class 'language_model'
language_model(object, ...)

## S3 method for class 'kgram_freqs'
language_model(object, smoother = "ml", N = param(object, "N"), ...)

Arguments

object

an object which stores the information required to build the k-gram model. At present, this must be either a kgram_freqs object or a language_model object of which a copy is desired (see Details).

...

possible additional parameters required by the smoother.

smoother

a length one character vector. Indicates the smoothing technique to be applied to compute k-gram continuation probabilities. A list of available smoothers can be obtained with smoothers(), and further information on a particular smoother through info().

N

a length one integer. Maximum order of k-grams to use in the language model. This must be less than or equal to the order of the underlying kgram_freqs object.
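As a minimal sketch of how the smoother and N arguments fit together, the snippet below lists the available smoothers, inspects one of them, and builds a 2-gram model from a 3-gram frequency table. The toy corpus and the value D = 0.5 are assumptions made for illustration.

```r
library(kgrams)

# Names of the available smoothing techniques
available <- smoothers()

# Print a description of the Kneser-Ney smoother and its parameters
info("kn")

# Frequency table of order 3 built from a toy corpus
freqs3 <- kgram_freqs("a a b a a b a b a b a b", 3)

# A model of lower order than the table: N = 2 <= 3 is allowed
model2 <- language_model(freqs3, smoother = "kn", D = 0.5, N = 2)

# Order actually used by the model
order_used <- param(model2, "N")
```

Requesting an N larger than the order of the kgram_freqs object is not allowed, so the frequency table should be built with the highest order that may be needed later.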

Details

These generics are used to construct objects of class language_model. The language_model method is only needed to create copies of language_model objects (that is to say, new copies which are not altered by methods which modify the original object in place, see e.g. parameters). The discussion below focuses on language models and the kgram_freqs method.
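The copy semantics described above can be illustrated with the following sketch, which assumes the param<- replacement method documented under parameters; the corpus and parameter values are arbitrary.

```r
library(kgrams)

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)

# Independent copy created through the language_model method
model_copy <- language_model(model)

# Modify the discount parameter of the original object
param(model, "D") <- 0.75

# The original reflects the change, the copy keeps the old value
d_original <- param(model, "D")
d_copy <- param(model_copy, "D")
```

Taking a copy before changing parameters is a simple way to compare two settings of the same smoother side by side.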

kgrams supports several k-gram language models, including Interpolated Kneser-Ney, Stupid Backoff and others (see smoothers). The objects created by language_model() have methods for computing word continuation and sentence probabilities (see probability), random text generation (see sample_sentences) and other types of language modeling tasks such as computing perplexities and word prediction accuracies.

Smoothers often have tuning parameters, which need to be specified by (exact) name through the ... arguments; otherwise, language_model() will use default values and, once per session, issue a warning. info(smoother) lists all parameters needed by a specific smoother, together with their allowed parameter space.
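A short sketch of passing tuning parameters by name, using the "add_k" smoother and its parameter k as an example (the value 0.1 is an arbitrary assumption):

```r
library(kgrams)

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)

# List the parameters required by the add-k smoother
info("add_k")

# No parameter supplied: the default value of k is used,
# and a warning may be issued (once per session)
model_default <- language_model(freqs, "add_k")

# Parameter supplied by its exact name through '...'
model_k <- language_model(freqs, "add_k", k = 0.1)

# Inspect all parameters of the resulting model
pars <- parameters(model_k)
k_value <- param(model_k, "k")
```

Specifying every parameter explicitly avoids the warning and makes the model definition reproducible.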

The run-time of language_model() may vary substantially between smoothing methods, depending on whether a method requires the computation of additional quantities beyond k-gram counts (this is, for instance, the case for the Kneser-Ney smoother).

Value

A language_model object.

Author(s)

Valerio Gherardi

Examples

# Create an interpolated Kneser-Ney 2-gram language model

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)
model
summary(model)
probability("a" %|% "b", model)

# For more examples, see ?probability, ?sample_sentences and ?perplexity.


kgrams documentation built on Oct. 6, 2023, 5:06 p.m.