sbo_predictions: Stupid Back-off text predictions

Description

Train a text predictor via Stupid Back-off

Usage

sbo_predictor(object, ...)

predictor(object, ...)

## S3 method for class 'character'
sbo_predictor(
  object,
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predictor(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'sbo_predtable'
sbo_predictor(object, ...)

sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'character'
sbo_predtable(
  object,
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

Arguments

object

either a character vector or an object inheriting from classes sbo_kgram_freqs or sbo_predtable. Defines the method to use for training.

...

further arguments passed to or from other methods.

N

a length one integer. Order 'N' of the N-gram model.

dict

a sbo_dictionary, a character vector or a formula. For more details see kgram_freqs.

.preprocess

a function for corpus preprocessing. For more details see kgram_freqs.

EOS

a length one character vector. String listing End-Of-Sentence characters. For more details see kgram_freqs.

lambda

a length one numeric. Penalization in the Stupid Back-off algorithm.

L

a length one integer. Maximum number of next-word predictions for a given input (top scoring predictions are retained).

filtered

a character vector. Words to exclude from next-word predictions. The strings '<UNK>' and '<EOS>' are reserved keywords referring to the Unknown-Word and End-Of-Sentence tokens, respectively.

Details

These functions are generics used to train a text predictor with Stupid Back-Off. The functions predictor() and predtable() are aliases for sbo_predictor() and sbo_predtable(), respectively.
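As background on the role of lambda, the following is a minimal base-R sketch of the usual Stupid Back-Off scoring rule: a word is scored by its relative frequency after the full context and, if that k-gram was never observed, by lambda times its score under the context shortened by one word. The counts, the corpus size and the helpers count_of() and sbo_score() are hypothetical and only illustrate the idea; they are not the package's C++ implementation.

```r
# Toy k-gram counts (hypothetical, for illustration only).
counts <- c("i love you" = 2, "i love" = 4, "love you" = 3,
            "love" = 6, "you" = 10)
total_words <- 50  # hypothetical number of words in the corpus

# Count of a k-gram, treating unobserved k-grams as zero.
count_of <- function(key) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Stupid Back-Off score of `word` given `context` (a character vector).
sbo_score <- function(word, context, lambda = 0.4) {
  if (length(context) == 0) {
    # Empty context: fall back to the unigram relative frequency.
    return(count_of(word) / total_words)
  }
  kgram  <- paste(c(context, word), collapse = " ")
  prefix <- paste(context, collapse = " ")
  if (count_of(kgram) > 0) {
    return(count_of(kgram) / count_of(prefix))
  }
  # Unobserved k-gram: drop the first context word and penalize by lambda.
  lambda * sbo_score(word, context[-1], lambda)
}

sbo_score("you", c("i", "love"))   # observed trigram: 2 / 4
sbo_score("you", c("we", "love"))  # one back-off step: 0.4 * 3 / 6
sbo_score("you", c("we", "like"))  # two back-off steps: 0.4 * 0.4 * 10 / 50
```

Each back-off step multiplies the score by lambda, so smaller values of lambda favor predictions supported by longer observed contexts.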

The sbo_predictor data structure carries all information required for prediction in a compact and efficient (upon retrieval) way, by directly storing the top L next-word predictions for each k-gram prefix observed in the training corpus.
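The idea of storing predictions per prefix can be pictured with a simplified base-R sketch: a lookup table maps each observed k-gram prefix to its top L next words, so that prediction reduces to finding the longest stored suffix of the input. The names pred_table, fallback and lookup() and all table contents are hypothetical; the actual internal layout of sbo_predictor objects is not shown here.

```r
# Hypothetical prediction table: k-gram prefix -> top L = 3 next words.
pred_table <- list(
  "i love" = c("you", "it", "this"),
  "love"   = c("you", "it", "the")
)
# Hypothetical predictions for the empty prefix (most frequent words).
fallback <- c("the", "to", "i")

# Return the stored predictions for the longest suffix of `words`
# that appears as a prefix in the table.
lookup <- function(words) {
  while (length(words) > 0) {
    key <- paste(words, collapse = " ")
    if (key %in% names(pred_table)) {
      return(pred_table[[key]])
    }
    words <- words[-1]  # drop the first word and retry with a shorter prefix
  }
  fallback
}

lookup(c("i", "love"))       # stored under "i love"
lookup(c("we", "love"))      # "we love" is absent, falls back to "love"
lookup(c("hello", "world"))  # nothing stored, uses the empty-prefix entry
```

Because the ranking is done once at training time, no scores need to be computed when a prediction is requested.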

sbo_predictor objects are meant for interactive use. If training is computationally heavy, one can store a "raw" version of the text predictor in an object of class sbo_predtable, which can be safely saved to disk (e.g. with save()). The saved object can be restored in another R session, and the corresponding sbo_predictor object can then be rebuilt quickly by calling the generic constructor sbo_predictor() on it (see example below).

sbo_predictor() and sbo_predtable() return objects of class sbo_predictor and sbo_predtable, respectively. The latter contains Stupid Back-Off prediction tables, storing the next-word predictions for each k-gram prefix observed in the text, whereas the former is an external pointer to an equivalent (but processed) C++ structure.

Both objects have the following attributes:

Value

A sbo_predictor object for sbo_predictor(), a sbo_predtable object for sbo_predtable().

Author(s)

Valerio Gherardi

See Also

predict.sbo_predictor

Examples

# Train a text predictor directly from corpus
p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")


# Train a text predictor from previously computed 'kgram_freqs' object
p <- sbo_predictor(twitter_freqs)


# Load a text predictor from a Stupid Back-Off prediction table
p <- sbo_predictor(twitter_predtable)


# Predict from Stupid Back-Off text predictor
p <- sbo_predictor(twitter_predtable)
predict(p, "i love")


# Build Stupid Back-Off prediction tables directly from corpus
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000, 
                   .preprocess = preprocess, EOS = ".?!:;")


# Build Stupid Back-Off prediction tables from kgram_freqs object
t <- sbo_predtable(twitter_freqs)

## Not run: 
# Save and reload a 'sbo_predtable' object with base::save()
save(t, file = "t.rda")
load("t.rda")
# Rebuild the predictor from the restored prediction table
p <- sbo_predictor(t)

## End(Not run)

sbo documentation built on Dec. 6, 2020, 1:06 a.m.