Compute probabilistic divergence between two PST

Description

Compute probabilistic divergence between two PST

Usage

## S4 method for signature 'PSTf,PSTf'
pdist(x,y, method="cp", l, ns=5000, symetric=FALSE, output="all")

Arguments

x

a probabilistic suffix tree, i.e., an object of class "PSTf" as returned by the pstree, prune or tune function.

y

a probabilistic suffix tree, i.e., an object of class "PSTf" as returned by the pstree, prune or tune function.

method

character. Method for computing distances. So far only one method is available.

l

integer. Length of the sequence(s) to generate.

ns

integer. Number of sequences to generate.

symetric

logical. If TRUE, the symmetric version of the measure is returned, see details.

output

character. See value.

Details

The function computes a probabilistic divergence measure between two PSTs S_{A} and S_{B}, based on the measure originally proposed in Juang-1985 and Rabiner-1989 for the comparison of two (hidden) Markov models S_{A} and S_{B}:

d(S_{A}, S_{B})=\frac{1}{\ell} [\log P^{S_{A}}(x)-\log P^{S_{B}}(x)]=\frac{1}{\ell}\log \frac{P^{S_{A}}(x)}{P^{S_{B}}(x)}

where x=x_{1}, …, x_{\ell} is a sequence generated by model S_{A}, P^{S_{A}}(x) is the probability of x given model S_{A} and P^{S_{B}}(x) is the probability of x given model S_{B}. The ratio between the two sequence likelihoods measures how many times more likely the sequence x is to have been generated by S_{A} than by S_{B}.
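As a purely illustrative worked case (the probabilities are invented for illustration, and natural logarithms are assumed), take a sequence of length \ell=4 with P^{S_{A}}(x)=0.02 and P^{S_{B}}(x)=0.005:

```latex
d(S_{A}, S_{B})=\frac{1}{4}\log \frac{0.02}{0.005}=\frac{1}{4}\log 4 \approx 0.347
```

Here x is four times more likely under S_{A} than under S_{B}, which gives a positive value; a sequence that is more likely under S_{B} would give a negative value.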

As the number n of generated sequences on which the measure is computed (or the length of a single sequence) approaches infinity, the expected value of d(S_{A}, S_{B}) converges to d_{KL}(S_{A}, S_{B}) (Falkhausen-1995, He-2000), the Kullback-Leibler (KL) divergence, also called information gain, which is used in information theory to measure the difference between two probability distributions.

The pdist function uses the following procedure to compute the divergence between two PST:

  • generate a random sample of n sequences (of length \ell) with model S_{A} using the generate method

  • predict the sequences with S_{A} and with S_{B}

  • compute

    d_{i}(S_{A}, S_{B})=\frac{1}{\ell} [\log P^{S_{A}}(x_{i})-\log P^{S_{B}}(x_{i})], \; i=1, …, n

  • the expected value

    E(d(S_{A}, S_{B}))

    is the divergence between models S_{A} and S_{B} and is estimated as

    \hat{E}(d(S_{A}, S_{B}))=\frac{1}{n} \sum_{i=1}^{n} d_{i}(S_{A}, S_{B})
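The procedure above can be sketched in base R with toy models. The sketch below does not use PST objects or the package's generate and predict methods: two hypothetical first-order Markov chains (init.A, trans.A, init.B, trans.B) stand in for S_{A} and S_{B}, and the helpers gen.seq and log.prob are illustrative names that are not part of the PST package. For such simple chains the expectation of d can also be computed exactly, which gives a check on the simulated estimate.

```r
## Toy illustration in base R; no PST objects are involved.
## Two hypothetical first-order Markov chains stand in for S_A and S_B.
set.seed(1)

states <- c("a", "b")
init.A <- c(a = 0.5, b = 0.5)
init.B <- c(a = 0.5, b = 0.5)
trans.A <- matrix(c(0.9, 0.1,
                    0.2, 0.8), 2, 2, byrow = TRUE,
                  dimnames = list(states, states))
trans.B <- matrix(c(0.6, 0.4,
                    0.5, 0.5), 2, 2, byrow = TRUE,
                  dimnames = list(states, states))

## generate one sequence of length l from a chain (illustrative helper)
gen.seq <- function(l, init, trans) {
  x <- character(l)
  x[1] <- sample(states, 1, prob = init)
  for (t in seq_len(l - 1)) {
    x[t + 1] <- sample(states, 1, prob = trans[x[t], ])
  }
  x
}

## log-probability of a sequence under a chain (illustrative helper)
log.prob <- function(x, init, trans) {
  l <- length(x)
  lp <- log(init[[x[1]]])
  if (l > 1) lp <- lp + sum(log(trans[cbind(x[-l], x[-1])]))
  lp
}

l <- 11    # sequence length
n <- 2000  # number of generated sequences

## step 1: random sample of n sequences generated by model A
seqs <- replicate(n, gen.seq(l, init.A, trans.A), simplify = FALSE)

## steps 2 and 3: score each sequence with both models and compute d_i
d <- sapply(seqs, function(x) {
  (log.prob(x, init.A, trans.A) - log.prob(x, init.B, trans.B)) / l
})

## step 4: the estimate is the mean of the d_i
d.hat <- mean(d)

## exact expectation of d for these toy chains, for comparison
kl.row <- rowSums(trans.A * log(trans.A / trans.B))
pi.t <- init.A
exact <- sum(init.A * log(init.A / init.B))
for (t in seq_len(l - 1)) {
  exact <- exact + sum(pi.t * kl.row)
  pi.t <- as.vector(pi.t %*% trans.A)
}
exact <- exact / l

c(estimate = d.hat, exact = exact)
```

Individual values in d can be negative, since some sequences generated by the first model happen to be more likely under the second, but their mean should be close to the exact (non-negative) expectation.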

For more details, see Gabadinho 2016.

Value

If output="all", a vector containing the divergence value for each generated sequence. If output="mean", the mean of these values, i.e., the estimated expected value, which is the divergence between the two models.

Author(s)

Alexis Gabadinho

References

Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.

Juang, B. H. and Rabiner, L. R. (1985). A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64(2), pp. 391-408.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), pp. 257-286.

Examples

## activity calendar for year 2000
## from the Swiss Household Panel
## see ?actcal
data(actcal)

## selecting individuals aged 20 to 59
actcal <- actcal[actcal$age00>=20 & actcal$age00 <60,]

## defining a sequence object
actcal.lab <- c("> 37 hours", "19-36 hours", "1-18 hours", "no work")
actcal.seq <- seqdef(actcal,13:24,labels=actcal.lab)

## building a PST segmented by age group
gage10 <- cut(actcal$age00, c(20,30,40,50,60), right=FALSE,
	labels=c("20-29","30-39", "40-49", "50-59"))

actcal.pstg <- pstree(actcal.seq, nmin=2, ymin=0.001, group=gage10)

## pruning
C99 <- qchisq(0.99,4-1)/2
actcal.pstg.opt <- prune(actcal.pstg, gain="G2", C=C99)

## extracting PST for age groups 20-29 and 30-39
g1.pst <- subtree(actcal.pstg.opt, group=1)
g2.pst <- subtree(actcal.pstg.opt, group=2)

## generating 5000 sequences with g1.pst 
## and computing 5000 distances
dist.g1_g2 <- pdist(g1.pst, g2.pst, l=11)
hist(dist.g1_g2)

## the probabilistic distance is the mean
## of the 5000 distances
mean(dist.g1_g2)