# cprob: Empirical conditional probability distributions of order 'L' In PST: Probabilistic Suffix Trees and Variable Length Markov Chains

## Description

Compute the empirical conditional probability distributions of order L from a set of sequences.

## Usage

    ## S4 method for signature 'stslist'
    cprob(object, L, cdata=NULL, context, stationary=TRUE, nmin=1, prob=TRUE,
          weighted=TRUE, with.missing=FALSE, to.list=FALSE)

## Arguments

| Argument | Description |
|---|---|
| `object` | A sequence object, that is, an object of class stslist as created by the TraMineR seqdef function. |
| `L` | Integer. Context length. |
| `cdata` | Under development. |
| `context` | Character. An optional subsequence (a character string where symbols are separated by '-') for which the conditional probability distribution is to be computed. |
| `stationary` | Logical. If FALSE, probability distributions are computed for each sequence position L+1 ... l, where l is the maximum sequence length. If TRUE, the probability distributions are stationary, that is, time-homogeneous. |
| `nmin` | Integer. Minimal frequency of a context. See Details. |
| `prob` | Logical. If TRUE, the probability distributions are returned. If FALSE, the function returns the empirical counts on which the probability distributions are computed. |
| `weighted` | Logical. If TRUE, case weights attached to the sequence object are used in the computation of the probabilities. |
| `with.missing` | Logical. If FALSE, only contexts containing no missing status are considered. |
| `to.list` | Logical. If TRUE and stationary=TRUE, a list instead of a matrix is returned. See Value. |

## Details

The empirical conditional probability $\hat{P}(\sigma \mid c)$ of observing a symbol $\sigma \in A$ after the subsequence $c = c_{1}, \ldots, c_{k}$ of length $k = L$ is computed as

$$\hat{P}(\sigma \mid c) = \frac{N(c\sigma)}{\sum_{\alpha \in A} N(c\alpha)}$$

where

$$N(c) = \sum_{i=1}^{\ell} 1\left[x_{i}, \ldots, x_{i+|c|-1} = c\right], \; x = x_{1}, \ldots, x_{\ell}, \; c = c_{1}, \ldots, c_{k}$$

is the number of occurrences of the subsequence $c$ in the sequence $x$, and $c\sigma$ is the concatenation of the subsequence $c$ and the symbol $\sigma$.
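The counting and normalisation above can be sketched in base R without the PST package. This is only an illustration of the formula, not the package's implementation: the helper names `count_occurrences` and `cond_prob` and the toy sequence `x` are assumptions made for this example.

```r
# Illustrative sketch (not PST code): N(pattern) is the number of
# occurrences of `pattern` in the sequence `x`. Both arguments are
# character vectors holding one symbol per element.
count_occurrences <- function(x, pattern) {
  k <- length(pattern)
  n <- length(x)
  if (k > n) return(0)
  starts <- seq_len(n - k + 1)
  hits <- vapply(starts,
                 function(i) all(x[i:(i + k - 1)] == pattern),
                 logical(1))
  sum(hits)
}

# P(sigma | context) = N(context sigma) / sum over alpha of N(context alpha)
# Note: a context that never occurs gives 0/0 = NaN in this sketch.
cond_prob <- function(x, context, alphabet) {
  counts <- vapply(alphabet,
                   function(s) as.numeric(count_occurrences(x, c(context, s))),
                   numeric(1))
  counts / sum(counts)
}

# Hypothetical toy sequence over the alphabet {a, b}
x <- c("a", "b", "a", "a", "b", "b", "a")
alphabet <- c("a", "b")

# "a" is followed once by "a" and twice by "b": a = 1/3, b = 2/3
print(cond_prob(x, "a", alphabet))

# Order 0 (empty context): relative frequencies of the symbols
print(cond_prob(x, character(0), alphabet))
```

An empty context reduces the estimate to the plain symbol frequencies, which is what a call with L=0 describes.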

Considering a (possibly weighted) sample of $m$ sequences having weights $w^{j}, \; j = 1, \ldots, m$, the function $N(c)$ is replaced by

$$N(c) = \sum_{j=1}^{m} w^{j} \sum_{i=1}^{\ell} 1\left[x_{i}^{j}, \ldots, x_{i+|c|-1}^{j} = c\right], \; c = c_{1}, \ldots, c_{k}$$

where $x^{j} = x_{1}^{j}, \ldots, x_{\ell}^{j}$ is the $j$th sequence in the sample. For more details, see Gabadinho & Ritschard (2016).
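A weighted sample only changes how occurrences are tallied: each sequence contributes its count multiplied by its weight. The base R sketch below illustrates this under stated assumptions; the helper names, the two toy sequences and their weights are hypothetical and are not taken from the PST package or its data sets.

```r
# Illustrative sketch (not PST code): occurrences of `pattern` in one sequence
count_occurrences <- function(x, pattern) {
  k <- length(pattern)
  n <- length(x)
  if (k > n) return(0)
  starts <- seq_len(n - k + 1)
  sum(vapply(starts,
             function(i) all(x[i:(i + k - 1)] == pattern),
             logical(1)))
}

# Weighted N(pattern): sum over sequences j of w^j times the count in x^j
weighted_count <- function(seqs, weights, pattern) {
  per_seq <- vapply(seqs,
                    function(x) as.numeric(count_occurrences(x, pattern)),
                    numeric(1))
  sum(weights * per_seq)
}

# Conditional distribution after `context`, built from the weighted counts
weighted_cond_prob <- function(seqs, weights, context, alphabet) {
  counts <- vapply(alphabet,
                   function(s) weighted_count(seqs, weights, c(context, s)),
                   numeric(1))
  counts / sum(counts)
}

# Hypothetical sample of two sequences with case weights 2 and 0.5
seqs <- list(c("a", "b", "a", "b"), c("a", "a", "b"))
weights <- c(2, 0.5)
alphabet <- c("a", "b")

# Weighted: N(aa) = 0.5, N(ab) = 4.5, so a = 0.1 and b = 0.9
print(weighted_cond_prob(seqs, weights, "a", alphabet))

# Unit weights: N(aa) = 1, N(ab) = 3, so a = 0.25 and b = 0.75
print(weighted_cond_prob(seqs, c(1, 1), "a", alphabet))
```

Comparing the two calls shows how up-weighting the first sequence pulls the estimate towards its transitions, which mirrors the difference between weighted=TRUE and weighted=FALSE.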

## Value

If stationary=TRUE, a matrix with one row for each subsequence of length L and minimal frequency nmin appearing in object. If stationary=FALSE, a list where each element corresponds to one subsequence and contains a matrix with the probability distribution at each position p where a state is preceded by the subsequence.
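To make the non-stationary case more concrete, the following base R sketch computes, for a single context, one distribution per position. It is a rough illustration of the idea only: the function `position_probs`, the row labels, and the toy data are assumptions for this example, the sketch requires a context of length at least 1, and the actual layout of the object returned by cprob may differ.

```r
# Illustrative sketch (not PST code): position-wise distributions for one
# context. `seqs` is a character matrix with one sequence per row and one
# position per column; `context` is a character vector of length L >= 1.
position_probs <- function(seqs, context, alphabet) {
  L <- length(context)
  stopifnot(L >= 1, L < ncol(seqs))
  positions <- (L + 1):ncol(seqs)
  out <- matrix(NA_real_, nrow = length(positions), ncol = length(alphabet),
                dimnames = list(paste0("pos", positions), alphabet))
  for (p in positions) {
    # Sequences whose L symbols just before position p equal the context
    hit <- apply(seqs[, (p - L):(p - 1), drop = FALSE], 1,
                 function(r) all(r == context))
    counts <- table(factor(seqs[hit, p], levels = alphabet))
    # Rows stay NA when the context never precedes position p
    if (sum(counts) > 0) {
      out[paste0("pos", p), ] <- as.numeric(counts) / sum(counts)
    }
  }
  out
}

# Hypothetical sample: three sequences of length 3 over {a, b}
seqs <- rbind(c("a", "b", "a"),
              c("a", "b", "b"),
              c("b", "b", "a"))
alphabet <- c("a", "b")

# Context "b": at position 2 only the third sequence matches (next symbol b);
# at position 3 all three match (next symbols a, b, a)
res <- position_probs(seqs, "b", alphabet)
print(res)
```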

## References

Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.

## Examples

    ## Load PST (attaches TraMineR) and RColorBrewer (provides brewer.pal())
    library(PST)
    library(RColorBrewer)

    ## Example with the single sequence s1
    data(s1)
    s1 <- seqdef(s1)
    cprob(s1, L=0, prob=FALSE)
    cprob(s1, L=1, prob=TRUE)

    ## Preparing a sequence object with the SRH data set
    data(SRH)
    state.list <- levels(SRH$p99c01)
    ## sequential color palette
    mycol5 <- rev(brewer.pal(5, "RdYlGn"))
    SRH.seq <- seqdef(SRH, 5:15, alphabet=state.list,
        states=c("G1", "G2", "M", "B2", "B1"), labels=state.list,
        weights=SRH$wp09lp1s, right=NA, cpal=mycol5)
    names(SRH.seq) <- 1999:2009

    ## Example 1: 0th order: weighted and unweighted counts
    cprob(SRH.seq, L=0, prob=FALSE, weighted=FALSE)
    cprob(SRH.seq, L=0, prob=FALSE, weighted=TRUE)

    ## Example 2: 2nd order: weighted and unweighted probability distrib.
    cprob(SRH.seq, L=2, prob=TRUE, weighted=FALSE)
    cprob(SRH.seq, L=2, prob=TRUE, weighted=TRUE)

### Example output

    Loading required package: TraMineR

    TraMineR stable version 2.0-7 (Built: "Sat,)
    Website: http://traminer.unige.ch
    Please type 'citation("TraMineR")' for citation information.

    PST version 0.94 (Built: 2017-09-22)
    Website: http://r-forge.r-project.org/projects/pst
    [>] 2 distinct states appear in the data:
    1 = a
    2 = b
    [>] state coding:
       [alphabet]  [label]  [long label]
    1  a           a        a
    2  b           b        b
    [>] 1 sequences in the data set
    [>] min/max sequence length: 27/27
    [>] 1 sequences, min/max length: 27/27
    [>] computing prob., L=0, 1 distinct context(s)
    [>] total time: 0.007 secs
       a  b [n]
    e 13 14  27
    [>] 1 sequences, min/max length: 27/27
    [>] computing prob., L=1, 2 distinct context(s)
    [>] total time: 0.004 secs
              a         b [n]
    a 0.3846154 0.6153846  13
    b 0.5384615 0.4615385  13
    [>] found missing values ('NA') in sequence data
    [>] preparing 2612 sequences
    [>] coding void elements with '%' and missing values with '*'
    [>] state coding:
       [alphabet]       [label]  [long label]
    1  very well        G1       very well
    2  well             G2       well
    3  so, so (average) M        so, so (average)
    4  not very well    B2       not very well
    5  not well at all  B1       not well at all
    [>] sum of weights: 2653.77 - min/max: 0.232673704624176/4.55576086044312
    [>] 2612 sequences in the data set
    [>] min/max sequence length: 11/11
    [>] 2612 sequences, min/max length: 11/11
    [>] computing prob., L=0, 1 distinct context(s)
    [>] total time: 0.26 secs
        G1    G2    M  B2 B1   [n]
    e 6224 17616 3591 371 52 27854
    [>] 2612 sequences, min/max length: 11/11
    [>] computing prob., L=0, 1 distinct context(s)
    [>] total time: 0.036 secs
            G1       G2        M       B2       B1   [n]
    e 6201.074 17830.85 3758.909 368.4725 51.10312 27854
    [>] 2612 sequences, min/max length: 11/11
    [>] computing prob., L=2, 35 distinct context(s)
    [>] removing 11 context(s) containing missing values
    [>] total time: 0.024 secs
                  G1         G2          M          B2           B1  [n]
    B1-B1 0.00000000 0.00000000 0.14285714 0.285714286 0.5714285714    7
    B1-B2 0.00000000 0.12500000 0.00000000 0.625000000 0.2500000000    8
    B1-G1 1.00000000 0.00000000 0.00000000 0.000000000 0.0000000000    1
    B1-G2 0.00000000 1.00000000 0.00000000 0.000000000 0.0000000000    5
    B1-M  0.00000000 0.11111111 0.55555556 0.333333333 0.0000000000    9
    B2-B1 0.00000000 0.00000000 0.16666667 0.416666667 0.4166666667   12
    B2-B2 0.00000000 0.09756098 0.43902439 0.317073171 0.1463414634   41
    B2-G1 0.22222222 0.55555556 0.22222222 0.000000000 0.0000000000    9
    B2-G2 0.06666667 0.53333333 0.34444444 0.055555556 0.0000000000   90
    B2-M  0.02702703 0.24324324 0.56756757 0.135135135 0.0270270270  111
    G1-B2 0.21428571 0.35714286 0.35714286 0.071428571 0.0000000000   14
    G1-G1 0.58900634 0.38266385 0.02663848 0.001691332 0.0000000000 2365
    G1-G2 0.29896497 0.64888535 0.04936306 0.002786624 0.0000000000 2512
    G1-M  0.25000000 0.52450980 0.20588235 0.014705882 0.0049019608  204
    G2-B1 0.10000000 0.30000000 0.50000000 0.000000000 0.1000000000   10
    G2-B2 0.06741573 0.58426966 0.30337079 0.044943820 0.0000000000   89
    G2-G1 0.34759867 0.60294817 0.04660010 0.002853067 0.0000000000 2103
    G2-G2 0.12129573 0.77948089 0.09298999 0.005518087 0.0007153076 9786
    G2-M  0.06004289 0.63187991 0.28162974 0.024303074 0.0021443888 1399
    M-B1  0.00000000 0.33333333 0.33333333 0.222222222 0.1111111111    9
    M-B2  0.00000000 0.23364486 0.52336449 0.214953271 0.0280373832  107
    M-G1  0.31428571 0.54857143 0.13142857 0.005714286 0.0000000000  175
    M-G2  0.06699929 0.65146115 0.25944405 0.019957234 0.0021382751 1403
    M-M   0.02576336 0.36832061 0.53625954 0.065839695 0.0038167939 1048
    [>] 2612 sequences, min/max length: 11/11
    [>] computing prob., L=2, 35 distinct context(s)
    [>] removing 11 context(s) containing missing values
    [>] total time: 0.032 secs
                  G1         G2          M          B2           B1  [n]
    B1-B1 0.00000000 0.00000000 0.09818135 0.335028651 0.5667899946    7
    B1-B2 0.00000000 0.12965429 0.00000000 0.553821776 0.3165239369    8
    B1-G1 1.00000000 0.00000000 0.00000000 0.000000000 0.0000000000    1
    B1-G2 0.00000000 1.00000000 0.00000000 0.000000000 0.0000000000    5
    B1-M  0.00000000 0.20500861 0.53019242 0.264798970 0.0000000000    9
    B2-B1 0.00000000 0.00000000 0.18658460 0.408762718 0.4046526865   12
    B2-B2 0.00000000 0.08762715 0.49743688 0.304345521 0.1105904504   41
    B2-G1 0.15269255 0.56929607 0.27801138 0.000000000 0.0000000000    9
    B2-G2 0.06736504 0.50905666 0.35812437 0.065453926 0.0000000000   90
    B2-M  0.02400026 0.24139846 0.58999910 0.119801242 0.0248009471  111
    G1-B2 0.19126499 0.37109769 0.35074171 0.086895609 0.0000000000   14
    G1-G1 0.58674034 0.38404366 0.02747219 0.001743816 0.0000000000 2365
    G1-G2 0.29732333 0.65016030 0.04976264 0.002753734 0.0000000000 2512
    G1-M  0.24415949 0.51633832 0.21334358 0.020040363 0.0061182509  204
    G2-B1 0.04016650 0.30014000 0.61271478 0.000000000 0.0469787227   10
    G2-B2 0.06616120 0.53516824 0.33857735 0.060093210 0.0000000000   89
    G2-G1 0.34300931 0.60359596 0.05129085 0.002103880 0.0000000000 2103
    G2-G2 0.11902674 0.77887958 0.09646083 0.004992888 0.0006399592 9786
    G2-M  0.05424277 0.63001545 0.28661320 0.026701475 0.0024271068 1399
    M-B1  0.00000000 0.32258387 0.26163294 0.262630848 0.1531523341    9
    M-B2  0.00000000 0.21069475 0.54266810 0.214763222 0.0318739189  107
    M-G1  0.35606381 0.51180852 0.12949028 0.002637382 0.0000000000  175
    M-G2  0.06042470 0.65290600 0.26436511 0.019544672 0.0027595167 1403
    M-M   0.02771553 0.36685274 0.53683261 0.064524580 0.0040745414 1048


PST documentation built on May 29, 2017, 5:16 p.m.