cprob: Empirical conditional probability distributions of order 'L'
In PST: Probabilistic Suffix Trees and Variable Length Markov Chains


Compute the empirical conditional probability distributions of order L from a set of sequences


## S4 method for signature 'stslist'
cprob(object, L, cdata=NULL, context, stationary=TRUE, nmin=1, prob=TRUE, 
weighted=TRUE, with.missing=FALSE, to.list=FALSE)

`object`	a sequence object, that is, an object of class `stslist` as created by the TraMineR `seqdef` function.
`L`	integer. Context length.
`cdata`	under development
`context`	character. An optional subsequence (a character string where symbols are separated by '-') for which the conditional probability distribution is to be computed.
`stationary`	logical. If `FALSE`, probability distributions are computed for each sequence position L+1, ..., l, where l is the maximum sequence length. If `TRUE`, the probability distributions are stationary, that is, time-homogeneous.
`nmin`	integer. Minimal frequency of a context. See details.
`prob`	logical. If `TRUE` the probability distributions are returned. If `FALSE` the function returns the empirical counts on which the probability distributions are computed.
`weighted`	logical. If `TRUE` case weights attached to the sequence object are used in the computation of the probabilities.
`with.missing`	logical. If `FALSE`, only contexts containing no missing status are considered.
`to.list`	logical. If `TRUE` and `stationary=TRUE`, a list instead of a matrix is returned. See `value`.

The empirical conditional probability \hat{P}(σ | c) of observing a symbol σ \in A after the subsequence c=c_{1}, …, c_{k} of length k=L is computed as

\hat{P}(σ | c) = \frac{N(cσ)}{∑_{α \in A} N(cα)}

where

N(c)=∑_{i=1}^{\ell} 1 \left[x_{i}, …, x_{i+|c|-1}=c \right], \; x=x_{1}, …, x_{\ell}, \; c=c_{1}, …, c_{k}

is the number of occurrences of the subsequence c in the sequence x and cσ is the concatenation of the subsequence c and the symbol σ.
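The counting rule above can be sketched in a few lines of base R. This is only an illustration of the formula on a made-up sequence, not the code used inside `cprob`; the helper names `count_occ` and `cond_prob` and the toy sequence `x` are assumptions introduced here.

```r
# Toy sequence over the alphabet {a, b} (hypothetical, not a package data set)
x <- c("a", "b", "a", "a", "b", "a", "b", "b")
alphabet <- c("a", "b")

# N(c): number of positions in x at which the subsequence `sub` starts
count_occ <- function(x, sub) {
  k <- length(sub)
  if (k > length(x)) return(0)
  starts <- seq_len(length(x) - k + 1)
  hits <- vapply(starts, function(i) all(x[i:(i + k - 1)] == sub), logical(1))
  as.numeric(sum(hits))
}

# P-hat(sigma | c) = N(c sigma) / sum over alpha in A of N(c alpha)
cond_prob <- function(x, context, alphabet) {
  counts <- vapply(alphabet,
                   function(s) count_occ(x, c(context, s)),
                   numeric(1))
  counts / sum(counts)
}

# Distribution of the next symbol after the context "a" (L = 1)
print(cond_prob(x, "a", alphabet))
```

In this toy sequence the context "a" is followed once by "a" and three times by "b", so the sketch gives 0.25 and 0.75. Passing a longer vector as `context`, for example `c("a", "b")`, gives the order-2 case.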

Considering a - possibly weighted - sample of m sequences having weights w^{j}, \; j=1 … m, the function N(c) is replaced by

N(c)=∑_{j=1}^{m} w^{j} ∑_{i=1}^{\ell} 1 \left[x_{i}^{j}, …, x_{i+|c|-1}^{j}=c \right], \; c=c_{1}, …, c_{k}

where x^{j}=x_{1}^{j}, …, x_{\ell}^{j} is the jth sequence in the sample. For more details, see Gabadinho & Ritschard (2016).
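The weighted count can be illustrated the same way. The following base R sketch applies the weighted formula to a made-up sample of three sequences; the data, the weights `w`, and the helper names (`count_occ`, `weighted_count`, `weighted_cond_prob`) are hypothetical and are not taken from the package.

```r
# Hypothetical toy sample: three sequences with case weights w^j
seqs <- list(
  c("a", "b", "a", "b"),
  c("b", "b", "a", "a"),
  c("a", "a", "a", "b")
)
w <- c(1.0, 0.5, 2.0)
alphabet <- c("a", "b")

# Unweighted N(c) within a single sequence
count_occ <- function(x, sub) {
  k <- length(sub)
  if (k > length(x)) return(0)
  starts <- seq_len(length(x) - k + 1)
  hits <- vapply(starts, function(i) all(x[i:(i + k - 1)] == sub), logical(1))
  as.numeric(sum(hits))
}

# Weighted N(c): sum over sequences j of w^j times the count in sequence j
weighted_count <- function(seqs, w, sub) {
  sum(w * vapply(seqs, count_occ, numeric(1), sub = sub))
}

# P-hat(sigma | c) computed from the weighted counts
weighted_cond_prob <- function(seqs, w, context, alphabet) {
  counts <- vapply(alphabet,
                   function(s) weighted_count(seqs, w, c(context, s)),
                   numeric(1))
  counts / sum(counts)
}

# Weighted versus unweighted (all weights equal to 1) distribution after "a"
print(weighted_cond_prob(seqs, w, "a", alphabet))
print(weighted_cond_prob(seqs, rep(1, length(seqs)), "a", alphabet))
```

Here the weighted counts after "a" are 4.5 for "a" and 4 for "b", whereas the unweighted counts are 3 and 3, so the weights shift the estimate toward the more heavily weighted third sequence.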

If stationary=TRUE, a matrix is returned (or a list if to.list=TRUE) with one row for each subsequence of length L that appears in object with a frequency of at least nmin. If stationary=FALSE, a list is returned where each element corresponds to one subsequence and contains a matrix with the probability distribution at each position p where a state is preceded by the subsequence.

Alexis Gabadinho

Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.

## Load the package (this also attaches TraMineR and RColorBrewer)
library(PST)

## Example with the single sequence s1
data(s1)
s1 <- seqdef(s1)
cprob(s1, L=0, prob=FALSE)
cprob(s1, L=1, prob=TRUE)

## Preparing a sequence object with the SRH data set
data(SRH)
state.list <- levels(SRH$p99c01)
## sequential color palette
mycol5 <- rev(brewer.pal(5, "RdYlGn"))
SRH.seq <- seqdef(SRH, 5:15, alphabet=state.list, states=c("G1", "G2", "M", "B2", "B1"), 
	labels=state.list, weights=SRH$wp09lp1s, right=NA, cpal=mycol5)
names(SRH.seq) <- 1999:2009

## Example 1: 0th order: weighted and unweighted counts
cprob(SRH.seq, L=0, prob=FALSE, weighted=FALSE)
cprob(SRH.seq, L=0, prob=FALSE, weighted=TRUE)

## Example 2: 2nd order: weighted and unweighted probability distributions
cprob(SRH.seq, L=2, prob=TRUE, weighted=FALSE)
cprob(SRH.seq, L=2, prob=TRUE, weighted=TRUE)

Loading required package: TraMineR

TraMineR stable version 2.0-7 (Built: "Sat,)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.

Loading required package: RColorBrewer

PST version 0.94 (Built: 2017-09-22)
Website: http://r-forge.r-project.org/projects/pst
 [>] 2 distinct states appear in the data: 
     1 = a
     2 = b
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  a           a        a
     2  b           b        b
 [>] 1 sequences in the data set
 [>] min/max sequence length: 27/27
 [>] 1 sequences, min/max length: 27/27
 [>] computing prob., L=0, 1 distinct context(s)
 [>] total time: 0.007 secs
   a  b [n]
e 13 14  27
 [>] 1 sequences, min/max length: 27/27
 [>] computing prob., L=1, 2 distinct context(s)
 [>] total time: 0.004 secs
          a         b [n]
a 0.3846154 0.6153846  13
b 0.5384615 0.4615385  13
 [>] found missing values ('NA') in sequence data
 [>] preparing 2612 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] state coding:
       [alphabet]       [label]  [long label] 
     1  very well        G1       very well
     2  well             G2       well
     3  so, so (average) M        so, so (average)
     4  not very well    B2       not very well
     5  not well at all  B1       not well at all
 [>] sum of weights: 2653.77 - min/max: 0.232673704624176/4.55576086044312
 [>] 2612 sequences in the data set
 [>] min/max sequence length: 11/11
 [>] 2612 sequences, min/max length: 11/11
 [>] computing prob., L=0, 1 distinct context(s)
 [>] total time: 0.26 secs
    G1    G2    M  B2 B1   [n]
e 6224 17616 3591 371 52 27854
 [>] 2612 sequences, min/max length: 11/11
 [>] computing prob., L=0, 1 distinct context(s)
 [>] total time: 0.036 secs
        G1       G2        M       B2       B1   [n]
e 6201.074 17830.85 3758.909 368.4725 51.10312 27854
 [>] 2612 sequences, min/max length: 11/11
 [>] computing prob., L=2, 35 distinct context(s)
 [>] removing 11 context(s) containing missing values
 [>] total time: 0.024 secs
              G1         G2          M          B2           B1  [n]
B1-B1 0.00000000 0.00000000 0.14285714 0.285714286 0.5714285714    7
B1-B2 0.00000000 0.12500000 0.00000000 0.625000000 0.2500000000    8
B1-G1 1.00000000 0.00000000 0.00000000 0.000000000 0.0000000000    1
B1-G2 0.00000000 1.00000000 0.00000000 0.000000000 0.0000000000    5
B1-M  0.00000000 0.11111111 0.55555556 0.333333333 0.0000000000    9
B2-B1 0.00000000 0.00000000 0.16666667 0.416666667 0.4166666667   12
B2-B2 0.00000000 0.09756098 0.43902439 0.317073171 0.1463414634   41
B2-G1 0.22222222 0.55555556 0.22222222 0.000000000 0.0000000000    9
B2-G2 0.06666667 0.53333333 0.34444444 0.055555556 0.0000000000   90
B2-M  0.02702703 0.24324324 0.56756757 0.135135135 0.0270270270  111
G1-B2 0.21428571 0.35714286 0.35714286 0.071428571 0.0000000000   14
G1-G1 0.58900634 0.38266385 0.02663848 0.001691332 0.0000000000 2365
G1-G2 0.29896497 0.64888535 0.04936306 0.002786624 0.0000000000 2512
G1-M  0.25000000 0.52450980 0.20588235 0.014705882 0.0049019608  204
G2-B1 0.10000000 0.30000000 0.50000000 0.000000000 0.1000000000   10
G2-B2 0.06741573 0.58426966 0.30337079 0.044943820 0.0000000000   89
G2-G1 0.34759867 0.60294817 0.04660010 0.002853067 0.0000000000 2103
G2-G2 0.12129573 0.77948089 0.09298999 0.005518087 0.0007153076 9786
G2-M  0.06004289 0.63187991 0.28162974 0.024303074 0.0021443888 1399
M-B1  0.00000000 0.33333333 0.33333333 0.222222222 0.1111111111    9
M-B2  0.00000000 0.23364486 0.52336449 0.214953271 0.0280373832  107
M-G1  0.31428571 0.54857143 0.13142857 0.005714286 0.0000000000  175
M-G2  0.06699929 0.65146115 0.25944405 0.019957234 0.0021382751 1403
M-M   0.02576336 0.36832061 0.53625954 0.065839695 0.0038167939 1048
 [>] 2612 sequences, min/max length: 11/11
 [>] computing prob., L=2, 35 distinct context(s)
 [>] removing 11 context(s) containing missing values
 [>] total time: 0.032 secs
              G1         G2          M          B2           B1  [n]
B1-B1 0.00000000 0.00000000 0.09818135 0.335028651 0.5667899946    7
B1-B2 0.00000000 0.12965429 0.00000000 0.553821776 0.3165239369    8
B1-G1 1.00000000 0.00000000 0.00000000 0.000000000 0.0000000000    1
B1-G2 0.00000000 1.00000000 0.00000000 0.000000000 0.0000000000    5
B1-M  0.00000000 0.20500861 0.53019242 0.264798970 0.0000000000    9
B2-B1 0.00000000 0.00000000 0.18658460 0.408762718 0.4046526865   12
B2-B2 0.00000000 0.08762715 0.49743688 0.304345521 0.1105904504   41
B2-G1 0.15269255 0.56929607 0.27801138 0.000000000 0.0000000000    9
B2-G2 0.06736504 0.50905666 0.35812437 0.065453926 0.0000000000   90
B2-M  0.02400026 0.24139846 0.58999910 0.119801242 0.0248009471  111
G1-B2 0.19126499 0.37109769 0.35074171 0.086895609 0.0000000000   14
G1-G1 0.58674034 0.38404366 0.02747219 0.001743816 0.0000000000 2365
G1-G2 0.29732333 0.65016030 0.04976264 0.002753734 0.0000000000 2512
G1-M  0.24415949 0.51633832 0.21334358 0.020040363 0.0061182509  204
G2-B1 0.04016650 0.30014000 0.61271478 0.000000000 0.0469787227   10
G2-B2 0.06616120 0.53516824 0.33857735 0.060093210 0.0000000000   89
G2-G1 0.34300931 0.60359596 0.05129085 0.002103880 0.0000000000 2103
G2-G2 0.11902674 0.77887958 0.09646083 0.004992888 0.0006399592 9786
G2-M  0.05424277 0.63001545 0.28661320 0.026701475 0.0024271068 1399
M-B1  0.00000000 0.32258387 0.26163294 0.262630848 0.1531523341    9
M-B2  0.00000000 0.21069475 0.54266810 0.214763222 0.0318739189  107
M-G1  0.35606381 0.51180852 0.12949028 0.002637382 0.0000000000  175
M-G2  0.06042470 0.65290600 0.26436511 0.019544672 0.0027595167 1403
M-M   0.02771553 0.36685274 0.53683261 0.064524580 0.0040745414 1048