Compute probabilistic divergence between two PST


| Argument | Description |
|---|---|
| `x` | a probabilistic suffix tree, i.e., an object of class |
| `y` | a probabilistic suffix tree, i.e., an object of class |
| `method` | character. Method for computing distances. So far only one method is available. |
| `l` | integer. Length of the sequence(s) to generate. |
| `ns` | integer. Number of sequences to generate. |
| `symetric` | logical. If |
| `output` | character. See |

The function computes a probabilistic divergence measure between PST *S_{A}* and *S_{B}*, based on the measure originally proposed by Juang and Rabiner (1985) and Rabiner (1989) for the comparison of two (hidden) Markov models *S_{A}* and *S_{B}*:

*
d(S_{A}, S_{B})=\frac{1}{\ell} [\log P^{S_{A}}(x)-\log P^{S_{B}}(x)]=\frac{1}{\ell}\log \frac{P^{S_{A}}(x)}{P^{S_{B}}(x)}
*

where *x=x_{1}, …, x_{\ell}* is a sequence generated by model *S_{A}*, *P^{S_{A}}(x)* is the probability of *x* given model *S_{A}* and *P^{S_{B}}(x)* is the probability of *x* given model *S_{B}*. The ratio between the two sequence likelihoods measures how many times more likely the sequence *x* is to have been generated by *S_{A}* than by *S_{B}*.

As the number *n* of generated sequences on which the measure is computed (or the length of a single sequence) approaches infinity, the expected value of *d(S_{A}, S_{B})* converges to *d_{KL}(S_{A}, S_{B})* (Falkhausen-1995; He-2000), the Kullback-Leibler (KL) divergence (also called information gain), which is used in information theory to measure the difference between two probability distributions.
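To make the link with the KL divergence concrete, consider the simplest special case. This is an illustration only, not the derivation given in the cited papers: assume both models are memoryless, so that each symbol is drawn independently with probabilities *p_{A}(s)* and *p_{B}(s)* (notation introduced here for the example). The log-likelihood ratio then splits into a sum over positions, and taking the expectation under *S_{A}* gives:

```latex
\begin{aligned}
E_{x \sim S_{A}}\left[d(S_{A}, S_{B})\right]
  &= \frac{1}{\ell} \sum_{t=1}^{\ell}
     E_{x_{t} \sim p_{A}}\left[\log \frac{p_{A}(x_{t})}{p_{B}(x_{t})}\right] \\
  &= \sum_{s} p_{A}(s) \log \frac{p_{A}(s)}{p_{B}(s)}
   = d_{KL}(p_{A}, p_{B})
\end{aligned}
```

In this memoryless case the identity holds for any *\ell*. For models with memory, such as PST, the result is stated above as a limit.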

The `pdist` function uses the following procedure to compute the divergence between two PST:

1. generate a random sample of *n* sequences (of length *\ell*) with model *S_{A}* using the `generate` method
2. predict the sequences with *S_{A}* and with *S_{B}*
3. compute *d_{i}(S_{A}, S_{B})=\frac{1}{\ell} [\log P^{S_{A}}(x_{i})-\log P^{S_{B}}(x_{i})], \; i=1, …, n*
4. the expected value *E(d(S_{A}, S_{B}))* is the divergence between models *S_{A}* and *S_{B}* and is estimated as *\hat{E}(d(S_{A}, S_{B}))=\frac{1}{n} \sum_{i=1}^{n} d_{i}(S_{A}, S_{B})*
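The procedure above can be sketched in base R without the PST package. The following is a minimal illustration, assuming two hypothetical first-order Markov chains (`A` and `B`, with made-up transition matrices and a shared initial distribution `init`) in place of fitted PST. The helpers `simulate_chain`, `log_lik` and `mc_divergence` are illustrative names and are not functions of the PST package; `pdist` itself may differ in its details.

```r
## Illustrative only: two hypothetical first-order Markov chains stand in
## for the PST S_A and S_B. None of these helpers belong to the PST package.
states <- c("a", "b")
init   <- c(a = 0.5, b = 0.5)                 # shared initial distribution (assumption)
A <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE,
            dimnames = list(states, states))  # transition matrix of model S_A
B <- matrix(c(0.6, 0.4,
              0.5, 0.5), nrow = 2, byrow = TRUE,
            dimnames = list(states, states))  # transition matrix of model S_B

## Step 1: generate one sequence of length l from a chain
simulate_chain <- function(P, init, l) {
  x <- character(l)
  x[1] <- sample(names(init), 1, prob = init)
  for (t in seq_len(l)[-1]) {
    x[t] <- sample(colnames(P), 1, prob = P[x[t - 1], ])
  }
  x
}

## Step 2: log-probability of a sequence under a chain
log_lik <- function(x, P, init) {
  l <- length(x)
  ll <- log(init[[x[1]]])
  if (l > 1) {
    ## character matrix indexing picks P[from, to] for each transition
    ll <- ll + sum(log(P[cbind(x[-l], x[-1])]))
  }
  ll
}

## Steps 3 and 4: per-sequence divergences d_i and their mean
mc_divergence <- function(A, B, init, l, n) {
  d <- replicate(n, {
    x <- simulate_chain(A, init, l)
    (log_lik(x, A, init) - log_lik(x, B, init)) / l
  })
  list(d = d, estimate = mean(d))
}

set.seed(1)
res <- mc_divergence(A, B, init, l = 11, n = 2000)
length(res$d)    # one divergence value per generated sequence
res$estimate     # Monte Carlo estimate of E(d(S_A, S_B))
```

The vector `res$d` plays the role of the per-sequence values *d_{i}*, and `res$estimate` the role of *\hat{E}(d(S_{A}, S_{B}))*. Individual *d_{i}* can be negative even though their mean estimates a non-negative divergence.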

For more details, see Gabadinho and Ritschard (2016).

If `output="all"`, a vector containing the divergence value for each generated sequence; if `output="mean"`, the mean, i.e. the expected value, which is the divergence between models.

Alexis Gabadinho

Gabadinho, A. and Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. *Journal of Statistical Software*, **72**(3), pp. 1-39.

Juang, B. H. and Rabiner, L. R. (1985). A probabilistic distance measure for hidden Markov models. *AT&T Technical Journal*, **64**(2), pp. 391-408.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. *Proceedings of the IEEE*, **77**(2), pp. 257-286.

```
## loading the packages used below
library(TraMineR)
library(PST)
## activity calendar for year 2000
## from the Swiss Household Panel
## see ?actcal
data(actcal)
## selecting individuals aged 20 to 59
actcal <- actcal[actcal$age00>=20 & actcal$age00 <60,]
## defining a sequence object
actcal.lab <- c("> 37 hours", "19-36 hours", "1-18 hours", "no work")
actcal.seq <- seqdef(actcal,13:24,labels=actcal.lab)
## building a PST segmented by age group
gage10 <- cut(actcal$age00, c(20,30,40,50,60), right=FALSE,
    labels=c("20-29","30-39", "40-49", "50-59"))
actcal.pstg <- pstree(actcal.seq, nmin=2, ymin=0.001, group=gage10)
## pruning
C99 <- qchisq(0.99,4-1)/2
actcal.pstg.opt <- prune(actcal.pstg, gain="G2", C=C99)
## extracting PST for age groups 20-29 and 30-39
g1.pst <- subtree(actcal.pstg.opt, group=1)
g2.pst <- subtree(actcal.pstg.opt, group=2)
## generating 5000 sequences with g1.pst
## and computing 5000 distances
dist.g1_g2 <- pdist(g1.pst, g2.pst, l=11)
hist(dist.g1_g2)
## the probabilistic distance is the mean
## of the 5000 distances
mean(dist.g1_g2)
```
