predict: Compute the probability of categorical sequences using a... In PST: Probabilistic Suffix Trees and Variable Length Markov Chains

Description

Compute the probability (likelihood) of categorical sequences using a Probabilistic Suffix Tree

Usage

 1 2 ## S4 method for signature 'PSTf' predict(object, data, cdata, group, L=NULL, p1=NULL, output="prob", decomp=FALSE, base=2) 

Arguments

 object a probabilistic suffix tree, i.e., an object of class "PSTf" as returned by the pstree, prune or tune function. data a sequence object, i.e., an object of class 'stslist' as created by TraMineR seqdef function, containing the sequences to predict. cdata not implemented yet. group if object is a segmented PST, providing a vector of group membership so that each sequence probability will be predicted with the conditional probability distributions for the group it belongs to. If object is a segmented PST and group is not provided, each sequence will be predicted by each of the submodel, and the output will be a matrix with nbgroup columns, where nbgroup is the number of segments in the PST. L integer. Maximal context length for sequence prediction. This is the same as pruning the PST by removing all nodes of depth

Details

A probabilistic suffix tree (PST) allows to compute the likelihood of any sequence built on the alphabet of the learning sample. This feature is called sequence prediction. The likelihood of the sequence a-b-a-a-b given a PST S1 fitted to the example sequence s1 (see example) is

P^{S1}(abaab)= P^{S1}(a) \times P^{S1}(b|a) \times P^{S1}(a|ab) \times P^{S1}(a|aba) \times P^{S1}(b|abaa)

The probability of each of the state is retrieved from the PST. To get for example P(a|a-b-a), the tree is scanned for the node labelled with the string a-b-a, and if this node does not exist, it is scanned for the node labelled with the longest suffix of this string, that is b-a, and so on. The node a-b-a is not found in the tree (it has been removed during the pruning stage), and the longest suffix of a-b-a found is b-a. The probability P(a|b-a) is then used instead of P(a|a-b-a).

The sequence likelihood is returned by the predict function. By setting decomp=TRUE the output is a matrix containing the probability of each of the symbol composing the sequence. The score P^S(x) of a sequence x represents the probability that the VLMC model stored by the PST S generates x. It can be turned into a more readable prediction quality measure such as the average log-loss

logloss(S,x)=-\frac{1}{\ell} ∑_{i=1}^{\ell} \log_{2} P^{S}(x_{i}| x_{1}, …, x_{i-1})=-\frac{1}{\ell} \log_{2} P^{S}(x)

by using 'output=logloss'. The returned value is the average log-loss of each state in the sequence, which allows to compare the prediction for sequences of unequal lengths. The average log-loss can be interpreted as a residual, that is the distance between the prediction of a sequence by a PST S and the perfect prediction P(x)=1 yielding logloss(P^{S},x)=0. The lower the value of logloss(P^{S},s) the better the sequence is predicted. For more details, see Gabadinho 2016.

Value

Either a vector of sequence probabilities (decomp=FALSE) or a matrix (if decomp=FALSE) containing for each sequence (row) the probability of each state in columns.

References

Gabadinho, A. & Ritschard, G. (2016) Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), 1-39.

Examples

 1 2 3 4 5 6 7 8 data(s1) s1 <- seqdef(s1) S1 <- pstree(s1, L=3, nmin=2, ymin=0.001) S1 <- prune(S1, gain="G1", C=1.20, delete=FALSE) predict(S1, s1, decomp=TRUE) predict(S1, s1) 

Example output

Loading required package: TraMineR

TraMineR stable version 2.0-7 (Built: "Sat,)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.

PST version 0.94 (Built: 2017-09-22)
Website: http://r-forge.r-project.org/projects/pst
[>] 2 distinct states appear in the data:
1 = a
2 = b
[>] state coding:
[alphabet]  [label]  [long label]
1  a           a        a
2  b           b        b
[>] 1 sequences in the data set
[>] min/max sequence length: 27/27
[>] 1 sequence(s) - min/max length: 27/27
[>] max. depth L=3, nmin=2, ymin=0.001
[L]  [nodes]
0        1
1        2
2        4
3        7
[>] computing sequence(s) likelihood ... (0.024 secs)
[>] total time: 0.312 secs
[>] pruning results:
[L]  [nodes]  [pruned]
3        7         2
2        4         0
1        2         0
[>] computing sequence(s) likelihood ... (0.159 secs)
[>] 1 sequence(s) - min/max length: 27/27
[>] max. context length: L=3
[>] 2pruned nodes removed
[>] found 10 distinct context(s)
[>] total time: 0.007 secs
[1]       [2]   [3]       [4]  [5]  [6]       [7]  [8]  [9]      [10]
[1] 0.4814815 0.6153846 0.625 0.5714286 0.75 0.75 0.5714286 0.75 0.75 0.5714286
[11] [12]      [13] [14] [15]      [16]      [17]      [18]      [19]
[1] 0.75 0.25 0.6666667  0.5  0.5 0.4285714 0.3333333 0.4285714 0.6666667
[20]      [21] [22] [23] [24]      [25]      [26]      [27]
[1] 0.3333333 0.5714286 0.25  0.8 0.75 0.4285714 0.6666667 0.6666667
[>] 1 sequence(s) - min/max length: 27/27
[>] max. context length: L=3
[>] 2pruned nodes removed
[>] found 10 distinct context(s)
[>] total time: 0.007 secs
prob
[1] 7.58916e-08


PST documentation built on May 29, 2017, 5:16 p.m.