# probability: Language Model Probabilities (kgrams: Classical k-gram Language Models)

## Description

Compute sentence probabilities and word continuation conditional probabilities from a language model.

## Usage

```r
probability(object, model, .preprocess = attr(model, ".preprocess"), ...)

## S3 method for class 'kgrams_word_context'
probability(object, model, .preprocess = attr(model, ".preprocess"), ...)

## S3 method for class 'character'
probability(
  object,
  model,
  .preprocess = attr(model, ".preprocess"),
  .tknz_sent = attr(model, ".tknz_sent"),
  ...
)
```

## Arguments

• `object`: a character vector for sentence probabilities, or a word-context conditional expression created with the conditional operator `%|%` (see word_context) for word continuation probabilities.

• `model`: an object of class `language_model`.

• `.preprocess`: a function taking a character vector as input and returning a character vector as output. Preprocessing transformation applied to the input before computing probabilities.

• `...`: further arguments passed to or from other methods.

• `.tknz_sent`: a function taking a character vector as input and returning a character vector as output. Optional sentence tokenization step applied before computing sentence probabilities.

## Details

The generic function `probability()` is used to obtain both sentence unconditional probabilities (such as Prob("I was starting to feel drunk")) and word continuation conditional probabilities (such as Prob("you" | "i love")). In plain words, these probabilities answer the following related but conceptually different questions:

• Sentence probability Prob(s): if we extract a single sentence at random (from a corpus of text, say), what is the probability that we obtain exactly 's'?

• Continuation probability Prob(w|c): what is the probability that a given context 'c' will be followed exactly by the word 'w'?
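In a k-gram model the two quantities are linked by the chain rule: a sentence probability is the product of the continuation probabilities of its words, including the closing End-Of-Sentence token. The base-R sketch below illustrates this. The function `sentence_prob()`, the `cont_prob` argument, and the `"<BOS>"` / `"<EOS>"` placeholder strings are hypothetical names introduced for illustration and are not part of kgrams; the padding with `N - 1` Begin-Of-Sentence tokens is an assumption of the sketch.

```r
# Chain-rule sketch: Prob(s) as a product of continuation probabilities.
# `cont_prob(word, context)` is a hypothetical user-supplied function that
# returns Prob(word | context); it is NOT a kgrams function.
sentence_prob <- function(sentence, cont_prob, N = 2,
                          bos = "<BOS>", eos = "<EOS>") {
  # Whitespace tokenization: one or more spaces delimit words.
  words <- strsplit(trimws(sentence), "\\s+")[[1]]
  # Assumed padding: N - 1 BOS tokens in front, one EOS token at the end.
  padded <- c(rep(bos, N - 1), words, eos)
  p <- 1
  for (i in seq(N, length(padded))) {
    # The context of word i is the N - 1 tokens immediately before it.
    context <- padded[seq(i - N + 1, length.out = N - 1)]
    p <- p * cont_prob(padded[i], context)
  }
  p
}

# Toy model: every continuation has probability 1/4, whatever the context.
uniform <- function(word, context) 1 / 4

# Three factors: Prob(a | BOS) * Prob(b | a) * Prob(EOS | b)
sentence_prob("a b", uniform)  # 0.015625
```

With a real model the `uniform` placeholder would be replaced by the model's own continuation probabilities.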

In order to compute continuation probabilities (i.e. Prob(w|c)), one must create conditional expressions with the infix operator `%|%`, as shown in the examples below. Both `probability` and `%|%` are vectorized with respect to words (left-hand side of `%|%`), but the context (right-hand side of `%|%`) must be a length-one character vector.

Word tokenization of the input works as in `query()`: anything delimited by one or more white spaces is tokenized as a word. For sentence probabilities, Begin-Of-Sentence (BOS) and End-Of-Sentence (EOS) paddings are implicitly added to the input; specifying them explicitly does not produce wrong results, because BOS and EOS tokens in the input are ignored by `probability()` (see the examples below). For continuation probabilities, any context of more than `N - 1` words (where `N` is the k-gram order of the language model) is truncated to the last `N - 1` words.
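The truncation rule for contexts can be pictured with a few lines of base R. This is only a sketch of the behaviour described above, not the package's implementation; `truncate_context()` is a hypothetical helper name.

```r
# Sketch of context truncation for a model of order N.
# `truncate_context()` is a hypothetical helper, not a kgrams function.
truncate_context <- function(context, N) {
  # Split on runs of white space, as described for word tokenization.
  words <- strsplit(trimws(context), "\\s+")[[1]]
  # Keep only the last N - 1 words (all of them if there are fewer).
  paste(tail(words, N - 1), collapse = " ")
}

truncate_context("i really   love", 2)  # "love"
truncate_context("i really   love", 3)  # "really love"
```

Under this rule, for a bigram model (`N = 2`), Prob("you" | "i really love") and Prob("you" | "love") refer to the same quantity.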

By default, the same `.preprocess()` and `.tknz_sent()` functions used during model building are applied to the input, but this can be overridden with arbitrary functions. Note that `.tknz_sent` can be useful (for sentence probabilities) if, e.g., the input is a length-one unprocessed character vector.
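A minimal sketch of such an override, assuming the kgrams package is installed and that the toy model below was built with the default (identity) preprocessing and sentence tokenization. `tolower()` is base R and `tknz_sent()` is the sentence tokenizer shipped with kgrams; the input strings are made up for illustration.

```r
library(kgrams)

# Toy bigram model built without any preprocessing or sentence tokenization.
f <- kgram_freqs("a b b a b a b", 2)
m <- language_model(f, "add_k", k = 1)

# Override the stored preprocessing: lower-case the input before scoring.
p_lower <- probability("A B", m, .preprocess = tolower)
p_plain <- probability("a b", m)

# Override the stored sentence tokenizer: split a single unprocessed string
# into sentences, then score each sentence separately.
p_split <- probability("a b. b a", m, .tknz_sent = tknz_sent)
```

Here `p_lower` should coincide with `p_plain`, since both calls score the same lower-cased sentence, and `p_split` should contain one probability per tokenized sentence.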

## Value

A numeric vector containing the probabilities of the sentences or word continuations.

## Author(s)

Valerio Gherardi

## Examples

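```r
library(kgrams)

# Usage of probability()

f <- kgram_freqs("a b b a b a b", 2)
m <- language_model(f, "add_k", k = 1)

# Continuation probabilities of several words given the BOS context
probability(c("a", "b", EOS(), UNK()) %|% BOS(), m) # c(0.4, 0.2, 0.2, 0.2)
# An unknown-word context still yields a valid probability
probability("a" %|% UNK(), m) # not NA
```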
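The commented values in the example above can be checked by hand. The base-R sketch below assumes the textbook add-k estimate, Prob(w|c) = (count(c w) + k) / (count(c) + k V), with a vocabulary of size V = 4 made of the two observed words plus the EOS and UNK tokens. Both the formula and the vocabulary size are assumptions that happen to reproduce the stated output; they are not taken from the package source. The `"<BOS>"`, `"<EOS>"` and `"<UNK>"` strings are placeholders.

```r
# Hand computation of add-k bigram probabilities for the toy corpus.
k <- 1

# Training sentence with the implicit BOS / EOS padding made explicit.
tokens <- c("<BOS>", strsplit("a b b a b a b", " ")[[1]], "<EOS>")

# Assumed vocabulary: observed words plus the EOS and UNK placeholders.
vocab <- c("a", "b", "<EOS>", "<UNK>")

# Align each token with its successor to enumerate the bigrams.
first  <- head(tokens, -1)
second <- tail(tokens, -1)

add_k <- function(word, context) {
  bigram_count  <- sum(first == context & second == word)
  context_count <- sum(first == context)
  (bigram_count + k) / (context_count + k * length(vocab))
}

# Continuation probabilities of every vocabulary item after BOS.
p_bos <- sapply(vocab, add_k, context = "<BOS>")
p_bos  # a = 0.4, b = 0.2, <EOS> = 0.2, <UNK> = 0.2
```

The only bigram starting with BOS is "BOS a", observed once, which gives (1 + 1) / (1 + 4) = 0.4 for "a" and (0 + 1) / (1 + 4) = 0.2 for each of the other three tokens.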

kgrams documentation built on Nov. 16, 2021, 9:22 a.m.