sample_sentences: Random Text Generation

View source: R/sample_sentences.R

sample_sentences                R Documentation

Random Text Generation

Description

Sample sentences from a language model's probability distribution.

Usage

sample_sentences(model, n, max_length, t = 1)

Arguments

model

an object of class language_model.

n

an integer. Number of sentences to sample.

max_length

an integer. Maximum length of sampled sentences.

t

a positive number. Sampling temperature (optional); see Details.

Details

This function samples sentences according to the probability distribution prescribed by the language model, with an optional temperature parameter. The temperature transform of a probability distribution is defined by p(t) = exp(log(p) / t) / Z(t), where Z(t) is the partition function, fixed by the normalization condition sum(p(t)) = 1.
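
For concreteness, the temperature transform can be illustrated on a plain probability vector with a few lines of base R. This is a minimal sketch; temperature_transform() and the vector p below are illustrative only and not part of the kgrams API.

temperature_transform <- function(p, t) {
  stopifnot(t > 0, all(p >= 0))
  w <- exp(log(p) / t)   # unnormalized weights exp(log(p) / t)
  w / sum(w)             # dividing by the partition function Z(t) enforces sum(p(t)) = 1
}

p <- c(0.7, 0.2, 0.1)
temperature_transform(p, t = 1)     # t = 1 returns p unchanged
temperature_transform(p, t = 100)   # high temperature: close to uniform
temperature_transform(p, t = 0.01)  # low temperature: mass concentrates on the mode

At t = 1 the original distribution is recovered; larger values of t flatten the distribution towards uniform, while smaller values sharpen it around the most probable word.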

Sampling is performed word by word, using the already sampled words as context, starting from the Begin-Of-Sentence context (i.e., N - 1 BOS tokens). Sampling stops either when an End-Of-Sentence token is encountered, or when the sentence exceeds max_length, in which case a truncated output is returned.

Some language models may give a non-zero probability to the Unknown word token, but this token is never produced in text generated by sample_sentences(): when randomly sampled, it is simply ignored.

Finally, a word of caution on some special smoothers. The "sbo" smoother (Stupid Backoff) does not produce normalized continuation probabilities, but rather continuation scores; in this case, sampling is performed under the assumption that Stupid Backoff scores are proportional to actual probabilities. The "ml" smoother (Maximum Likelihood) does not assign probabilities when the k-gram count of the context is zero; when this happens, the next word is chosen uniformly at random from the model's dictionary.
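
In pseudo-code, the sampling procedure described above (including the handling of the Unknown word token) can be sketched as follows. Note that sample_one_sentence(), next_word_probs() and the literal token strings "<EOS>" and "<UNK>" are illustrative assumptions, not part of the kgrams API.

sample_one_sentence <- function(model, max_length, t = 1) {
  # next_word_probs() is a hypothetical helper, assumed to return a named
  # probability vector over the dictionary plus the EOS and UNK tokens,
  # given the current context (with the "ml" smoother and an unseen
  # context, this vector would be uniform over the dictionary).
  context <- character(0)  # the model itself starts from N - 1 BOS tokens
  sentence <- character(0)
  repeat {
    p <- next_word_probs(model, context)        # hypothetical
    p <- exp(log(p) / t); p <- p / sum(p)       # apply the temperature transform
    w <- sample(names(p), size = 1, prob = p)
    if (w == "<UNK>") next                      # Unknown word token is never emitted
    if (w == "<EOS>") break                     # stop at End-Of-Sentence
    sentence <- c(sentence, w)
    context <- c(context, w)
    if (length(sentence) >= max_length) break   # return a truncated output
  }
  paste(sentence, collapse = " ")
}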

Value

a character vector of length n. Random sentences generated from the language model's distribution.

Author(s)

Valerio Gherardi

Examples

# Sample sentences from an 8-gram Kneser-Ney model trained on Shakespeare's
# "Much Ado About Nothing"

### Prepare the model and set seed
freqs <- kgram_freqs(much_ado, 8, .tknz_sent = tknz_sent)
model <- language_model(freqs, "kn", D = 0.75)
set.seed(840)

sample_sentences(model, n = 3, max_length = 10)

### Sampling at high temperature
sample_sentences(model, n = 3, max_length = 10, t = 100)

### Sampling at low temperature
sample_sentences(model, n = 3, max_length = 10, t = 0.01)
