knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
Prediction by Partial Matching (PPM) is a flexible and robust algorithm
for generating incremental probabilistic predictions for symbolic sequences
[@Cleary1984].
Many variants of PPM exist in the literature.
This package, ppm
, implements the PPM variant developed by @Bunton1996
and subsequently used by @Pearce2005 in the Information Dynamics Of Music
(IDyOM) model.
It also implements the PPM-Decay algorithm of @Harrison2020,
which applies a flexible memory-decay kernel to PPM such that
historic events are downweighted compared to recent events.
Please cite the ppm
package by referring to this latter paper:
Harrison, Peter M. C., Roberta Bianco, Maria Chait, and Marcus T. Pearce. 2020. “PPM-Decay: A Computational Model of Auditory Prediction with Memory Decay.” bioRxiv. https://doi.org/10.1101/2020.01.09.900266.
You might also want to mention the particular version of the package you used. To check the installed version, you can run the following code in your R console:
asNamespace("ppm")$`.__NAMESPACE__.`$spec[["version"]]
In addition to the GitHub repository,
the source code for the ppm
package is permanently archived on Zenodo,
and is available at the following DOI: https://doi.org/10.5281/zenodo.2620414.
This DOI will always point to the latest version of the ppm
package,
but you can also find version-specific DOIs on the Zenodo page.
To install from GitHub:
if (!require("devtools")) install.packages("devtools") devtools::install_github("pmcharrison/ppm")
The standard PPM model is a variable-order Markov order model that blends together the predictions of multiple n-gram models. An n-gram model generates predictions by tabulating the occurrences of subsequences of n tokens -- n-grams -- in the training dataset. These n-gram models are combined using a technique called smoothing.
Many versions of PPM exist corresponding to different smoothing and order selection techniques. The implementation in this package uses interpolated smoothing as introduced by Bunton (1996), along with a selection of different escape methods which determine the weighting parameters for the n-gram models. Update exclusion is also supported. The implementation also supports the order selection technique from PPM, which was designed to help PPM extend to arbitrarily long n*-gram orders. Note however that the present implementation is not optimized for large order bounds; with an infinite order bound, model training has computational complexity quadratic in the length of the input, whereas alternative implementations can achieve linear complexity in the length of the input.
You can create a PPM model object using the new_ppm_simple
function:
library(ppm) mod <- new_ppm_simple(alphabet_size = 5)
See ?new_ppm_simple
for information on configuring the model.
Note how we provided an alphabet_size
parameter to the model.
This compulsory parameter refers to the size of the alphabet over which
the model will be trained and will generate predictions.
For example, if we were modelling the letters A-Z, we might set
this parameter to 26, corresponding to the 26 letters in the English alphabet.
Input sequences are coded as integers within this alphabet. For example, we might define a sequence as follows:
seq_1 <- c(1, 3, 3, 1, 2, 1, 3)
The factor
function from base R can be useful for making this encoding
from a series of textual tokens.
Factor objects are represented as integer vectors under the hood,
but they come with an associated mapping between the integers and
their labels, which makes them easier to manipulate within R.
For example:
print(letters) seq_2 <- factor(c("a", "b", "r", "a", "c", "a", "d", "a", "b", "r", "a"), levels = c("a", "b", "c", "d", "r")) print(seq_2) print(as.integer(seq_2))
The ppm package treats factor objects like their underlying integer representations.
When modelling factor objects, it's recommended to pass the underlying factor representation
to the PPM model upon initialisation. In this case, the alphabet_size
parameter need not
be provided.
mod <- new_ppm_simple(alphabet_levels = c("a", "b", "c", "d", "r"))
You feed sequences to the PPM model using the model_seq
function.
By default, the model processes these sequences incrementally,
one symbol at a time.
It simultaneously learns from these incoming symbols,
and generates predictions for future symbols based
on what came before.
For example:
res <- model_seq(mod, seq_2) print(res)
The output of model_seq
is a tibble with several columns.
Each row corresponds to a different element of the input sequence,
organised in order of presentation.
The row describes what happened when the model tried to predict
this element, conditioned on the preceding elements in the sequence.
Each row has the following fields:
symbol
- The integer encoding of the corresponding symbol in the sequence.model_order
- The highest-order n-gram model used for generating predictions.information_content
- The negative log probability, or information content,
of the observed symbol according to the model.entropy
- The entropy of the model's predictive distribution when predicting
that element of the sequence.distribution
- The predictive probability distribution when predicting
that element of the sequence.Often we are particularly interested in the information_content
field,
which gives us an index of the model's surprisal at different
parts of the input sequence.
plot(res$information_content, xlab = "Position", ylab = "Information content (bits)", type = "l", ylim = c(0, 5))
The model_seq
function changes the input PPM model object,
even the absence of the assignment operator <-
.
Typically, the PPM model will have been updated with the contents of
the training sequence.
If we present the same sequence again, it should prove to be much more predictable (red line):
res_2 <- model_seq(mod, seq_2) plot(res$information_content, xlab = "Position", ylab = "Information content (bits)", type = "l", ylim = c(0, 5)) points(res_2$information_content, type = "l", col = "red")
By setting generate = TRUE
, we can also instruct the model to generate new samples
based on the statistics that it has learned so far. In this case the first argument to
model_seq
should be an integer corresponding to the desired length of the generated sequence.
res_3 <- model_seq(mod, seq = 20, generate = TRUE) res_3$symbol
The original PPM algorithm has a perfect memory, and weights all historic observations equally when generating predictions. The PPM-Decay modifies this behaviour, introducing a customisable memory decay kernel that determines the weight of historic observations as a function of the time and number of events observed since the original event. For example, a decay kernel for modelling auditory prediction might resemble the following:
knitr::include_graphics("man/figures/example-decay-kernel.png")
In its most general form (illustrated above), the decay kernel comprises three phases:
The parameters for these different phases, in particular durations and relative weights, are customisable. Each phase can be disabled separately to produce simpler families of decay kernels. For example, the default parameters define a one-stage exponential decay kernel; adding a buffer phase and retrieval noise produces the two-stage decay kernel in @Harrison2020.
The new_ppm_decay
function is used to create a new PPM-Decay model.
It works in a similar way to new_ppm_simple
, described above for PPM models.
The function has many customisable parameters (see ?new_ppm_decay
)
that support the specification of various kinds of decay kernels,
including single-stage exponential decays,
two-stage exponential decays,
exponential decays with non-zero asymptotes,
and kernels combining a flat buffer period with subsequent exponential decays.
mod_decay <- new_ppm_decay(alphabet_size = 5, ltm_half_life = 2)
When modelling a sequence with a PPM-Decay model, you need to specify both the sequence itself and a numeric vector corresponding to the timepoints of the symbol observations.
seq_2_time <- seq_along(seq_2) print(seq_2_time) res_4 <- model_seq(mod_decay, seq_2, time = seq_2_time) plot(res_4$information_content, xlab = "Position", ylab = "Information content (bits)", type = "l", ylim = c(0, 5)) points(res_4$information_content, type = "l", col = "blue")
Here the original PPM output is plotted in black, the PPM-Decay model in blue.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.