
The function `seqcost` offers several ways to generate substitution costs (intended to reflect state dissimilarities) and, where applicable, indel costs. The available methods are: `"CONSTANT"` (same cost for all substitutions), `"TRATE"` (derived from the observed transition rates), `"FUTURE"` (Chi-squared distance between conditional state distributions `lag` positions ahead), `"FEATURES"` (Gower distance between state features), and `"INDELS"` and `"INDELSLOG"` (both based on estimated indel costs). The substitution-cost matrix is intended to serve as the `sm` argument of the `seqdist` function, which computes distances between sequences. `seqsubm` is an alias that returns only the substitution-cost matrix, i.e., no indel.


`seqdata` |
A sequence object as returned by the seqdef function. |

`method` |
String. How to generate the costs. One of |

`cval` |
Scalar. For method |

`with.missing` |
Logical. Should an additional entry be added in the matrix for the missing states?
If |

`miss.cost` |
Scalar or vector. Cost for substituting the missing state. Default is |

`miss.cost.fixed` |
Logical. Should the substitution cost for missing be set as the |

`time.varying` |
Logical. If |

`weighted` |
Logical. Should weights in |

`transition` |
String. Only used if |

`lag` |
Integer. For methods |

`state.features` |
Data frame with features values for each state. |

`feature.weights` |
Vector of feature weights with length equal to the number of columns of |

`feature.type` |
List of feature types. See |

`proximities` |
Logical: should state proximities be returned instead of substitution costs? |

`...` |
Arguments passed to |

The substitution-cost matrix has dimension *ns × ns*, where
*ns* is the number of states in the alphabet of the
sequence object. The element *(i,j)* of the matrix is the cost of
substituting state *i* with state *j*. It defines the dissimilarity between the states *i* and *j*.

With method `CONSTANT`, the substitution costs are all set equal to the `cval` value, the default value being 2.
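The constant-cost construction can be sketched in a few lines of base R. This is only an illustration: the three-state alphabet is hypothetical, and the zero diagonal (no cost for substituting a state with itself) is an assumption of the sketch.

```r
# Hypothetical alphabet; in practice it would come from the sequence object.
states <- c("A", "B", "C")
ns <- length(states)
cval <- 2  # default constant substitution cost

# Every off-diagonal entry gets cval.
sm <- matrix(cval, nrow = ns, ncol = ns, dimnames = list(states, states))

# Assumption of this sketch: substituting a state with itself costs nothing.
diag(sm) <- 0
sm
```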

With method `TRATE` (transition rates), the transition probabilities between all pairs of states are first computed (using the seqtrate function). Then, the substitution cost between states *i* and *j* is obtained with the formula

*SC(i,j) = cval - P(i|j) - P(j|i)*

where *P(i|j)* is the probability of transition from state *j* to *i* `lag` positions ahead. The default `cval` value is 2. When `time.varying=TRUE` and `transition="both"`, the substitution cost at position *t* is set as

*SC(i,j,t) = cval - P(i|j,t-1) - P(j|i,t-1) - P(i|j,t) - P(j|i,t)*

where *P(i|j,t-1)* is the probability of moving from state *j* at *t-1* to *i* at *t*. Here, the default `cval` value is 4.
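To make the time-constant formula concrete, here is a minimal base R sketch that estimates transition probabilities from a toy set of sequences and applies *SC(i,j) = cval - P(i|j) - P(j|i)*. The sequences are made up, the estimation is done by hand rather than with seqtrate, and the zero diagonal is an assumption of the sketch.

```r
# Three made-up sequences of length 4 over the alphabet A, B, C.
seqs <- rbind(
  c("A", "A", "B", "B"),
  c("A", "B", "B", "C"),
  c("B", "B", "C", "C")
)
states <- c("A", "B", "C")
ns <- length(states)
lag <- 1
cval <- 2

# Pair the state at position t with the state at position t + lag.
from <- factor(as.vector(seqs[, 1:(ncol(seqs) - lag)]), levels = states)
to   <- factor(as.vector(seqs[, (1 + lag):ncol(seqs)]), levels = states)

# Row-normalised counts: tr[j, i] estimates P(i | j), moving from j to i.
counts <- table(from, to)
tr <- matrix(prop.table(counts, margin = 1), ns, ns,
             dimnames = list(states, states))

# SC(i, j) = cval - P(i | j) - P(j | i); the sum tr + t(tr) is symmetric.
sm <- cval - tr - t(tr)
diag(sm) <- 0  # assumption: no cost for substituting a state with itself
round(sm, 3)
```

Pairs of states that are never observed to follow one another (here A and C) keep the full cost `cval`, while frequent transitions lower the cost.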

With method `FUTURE`, the cost between *i* and *j* is the Chi-squared distance between the vectors *d(alphabet | i)* and *d(alphabet | j)* of probabilities of transition from states *i* and *j*, respectively, to all the states in the alphabet `lag` positions ahead:

*SC(i,j) = ChiDist(d(alphabet | i), d(alphabet | j))*
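The idea can be illustrated with a small base R sketch. The transition matrix below is hypothetical, and the chi-squared distance is taken in one common form (squared differences weighted by the inverse column totals); the exact normalisation used by `seqcost` may differ, so treat this as an illustration only.

```r
states <- c("A", "B", "C")

# Hypothetical transition probabilities: row i holds d(alphabet | i).
tr <- matrix(c(0.6, 0.3, 0.1,
               0.2, 0.5, 0.3,
               0.0, 0.2, 0.8),
             nrow = 3, byrow = TRUE, dimnames = list(states, states))

# Assumed weighting: each destination state is weighted by the inverse
# of its column total, as in a standard chi-squared distance.
col_mass <- colSums(tr)
chi_dist <- function(p, q, w) sqrt(sum((p - q)^2 / w))

ns <- nrow(tr)
sm <- matrix(0, ns, ns, dimnames = dimnames(tr))
for (i in seq_len(ns)) {
  for (j in seq_len(ns)) {
    sm[i, j] <- chi_dist(tr[i, ], tr[j, ], col_mass)
  }
}
round(sm, 3)
```

States with similar futures (similar rows of the transition matrix) end up with a small substitution cost.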

With method `FEATURES`, each state is characterized by the variables `state.features`, and the cost between *i* and *j* is computed as the Gower distance between their vectors of `state.features` values.
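For purely numeric features with equal weights, the Gower distance reduces to the mean of the range-scaled absolute differences, which can be sketched in base R as follows. The states and features are hypothetical; for mixed feature types a general implementation such as `daisy(..., metric = "gower")` from the cluster package would be needed.

```r
# Hypothetical features for four family states.
state.features <- data.frame(
  children = c(0, 0, 1, 1),
  married  = c(0, 1, 0, 1),
  row.names = c("S", "M", "C", "MC")
)

# Scale each feature by its range so that every feature contributes
# a difference between 0 and 1.
ranges <- sapply(state.features, function(x) diff(range(x)))
scaled <- sweep(as.matrix(state.features), 2, ranges, "/")

# Gower distance for numeric features with equal weights:
# the mean absolute difference over the scaled features.
sm <- as.matrix(dist(scaled, method = "manhattan")) / ncol(scaled)
sm
```

Here states differing on one feature are at distance 0.5 and states differing on both are at distance 1.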

With methods `INDELS` and `INDELSLOG`, values of indels are first derived from the state relative frequencies *f_i*. For `INDELS`, *indel_i = 1/f_i* is used, and for `INDELSLOG`, *indel_i = log[2/(1 + f_i)]*. Substitution costs are then set as *SC(i,j) = indel_i + indel_j*.
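Both formulas can be sketched directly in base R. The observed states are made up, the frequencies are unweighted, and the zero diagonal of the substitution matrices is an assumption of the sketch.

```r
# Made-up pool of observed states: A is frequent, C is rare.
observed <- c(rep("A", 5), rep("B", 3), rep("C", 2))

# Relative state frequencies f_i.
counts <- table(observed)
f <- setNames(as.numeric(counts), names(counts)) / sum(counts)

# State-dependent indel costs: rare states are costlier to insert or delete.
indel_inv <- 1 / f              # INDELS
indel_log <- log(2 / (1 + f))   # INDELSLOG

# SC(i, j) = indel_i + indel_j for i != j.
sm_inv <- outer(indel_inv, indel_inv, "+")
sm_log <- outer(indel_log, indel_log, "+")
diag(sm_inv) <- 0  # assumption: no cost for substituting a state with itself
diag(sm_log) <- 0

round(indel_log, 3)
round(sm_log, 3)
```

The log variant keeps the costs within a narrow range (between 0 and log 2), whereas *1/f_i* grows without bound as a state becomes rarer.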

For all methods but `INDELS` and `INDELSLOG`, the indel is set as *max(sm)/2* when `time.varying=FALSE` and as *1* otherwise.
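A short sketch of the time-constant rule, using a hypothetical substitution-cost matrix: with the indel set to half the largest substitution cost, deleting one state and inserting another (cost 2 × indel) is never cheaper than the corresponding substitution.

```r
# Hypothetical substitution-cost matrix for three states.
sm <- matrix(c(0.0, 1.0, 2.0,
               1.0, 0.0, 1.5,
               2.0, 1.5, 0.0),
             nrow = 3, byrow = TRUE)

# Single indel cost derived from the matrix.
indel <- max(sm) / 2
indel

# A deletion followed by an insertion costs 2 * indel = max(sm),
# so no single substitution is more expensive than that pair of operations.
all(sm <= 2 * indel)
```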

For `seqcost`, a list of two elements, `indel` and `sm` or `prox`:

`indel` |
The indel cost. Either a scalar or a vector of size |

`sm` |
The substitution-cost matrix (or array) when |

`prox` |
The state proximity matrix when |

`sm` and `prox` are, when `time.varying = FALSE`, a matrix of size *ns × ns*, where *ns* is the number of states in the alphabet of the sequence object. When `time.varying = TRUE`, they are a three-dimensional array of size *ns × ns × L*, where *L* is the maximum sequence length.

For `seqsubm`, only one element, the matrix (or array) `sm`.

Gilbert Ritschard and Matthias Studer (and Alexis Gabadinho for first version of `seqsubm`)

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. *Journal of Statistical Software* **40**(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in `R` with the `TraMineR` package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Studer, M. & Ritschard, G. (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", *Journal of the Royal Statistical Society, Series A*. **179**(2), 481-511. doi: 10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". *LIVES Working Papers*, **33**. NCCR LIVES, Switzerland, 2014. doi: 10.12682/lives.2296-1658.2014.33

```
library(TraMineR)

## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' example data set.
data(biofam)
biofam.seq <- seqdef(biofam[501:600,10:25])

## Indel and substitution costs based on log of inverse state frequencies
lifcost <- seqcost(biofam.seq, method="INDELSLOG")
## Here lifcost$indel is a vector
biofam.om <- seqdist(biofam.seq, method="OM", indel=lifcost$indel, sm=lifcost$sm)

## Optimal matching using transition rates based substitution-cost matrix
## and the associated indel cost
## Here trcost$indel is a scalar
trcost <- seqcost(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=trcost$indel, sm=trcost$sm)

## Using costs based on FUTURE with a forward lag of 4
fucost <- seqcost(biofam.seq, method="FUTURE", lag=4)
biofam.om <- seqdist(biofam.seq, method="OM", indel=fucost$indel, sm=fucost$sm)

## Optimal matching using a unique substitution cost of 2
## and an insertion/deletion cost of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM", indel=3, sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10,1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1[,1:13], weights=ex1$weights)

## Unweighted
subm <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", indel=subm$indel, sm=subm$sm, with.missing=TRUE)

## Weighted
subm.w <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", indel=subm.w$indel, sm=subm.w$sm, with.missing=TRUE)

## Element-wise comparison of the unweighted and weighted distance matrices
ex1.om == ex1.omw
```

```
TraMineR stable version 2.0-11.1 (Built: 2019-05-12)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.
[>] 8 distinct states appear in the data:
1 = 0
2 = 1
3 = 2
4 = 3
5 = 4
6 = 5
7 = 6
8 = 7
[>] state coding:
[alphabet] [label] [long label]
1 0 0 0
2 1 1 1
3 2 2 2
4 3 3 3
5 4 4 4
6 5 5 5
7 6 6 6
8 7 7 7
[>] 100 sequences in the data set
[>] min/max sequence length: 16/16
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.04 secs
[>] creating substitution-cost matrix using transition rates ...
[>] computing transition probabilities for states 0/1/2/3/4/5/6/7 ...
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.089 secs
[>] creating substitution-cost matrix using common future...
[>] computing transition probabilities for states 0/1/2/3/4/5/6/7 ...
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.017 secs
[>] creating 8x8 substitution-cost matrix using 2 as constant value
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.017 secs
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 16 26 12 26 16 16 26 26 26
[2,] 16 0 22 20 18 6 22 24 16 16
[3,] 26 22 0 14 22 22 12 2 22 22
[4,] 12 20 14 0 22 20 8 16 22 22
[5,] 26 18 22 22 0 18 24 24 18 18
[6,] 16 6 22 20 18 0 24 24 12 14
[7,] 16 22 12 8 24 24 0 12 24 24
[8,] 26 24 2 16 24 24 12 0 24 24
[9,] 26 16 22 22 18 12 24 24 0 2
[10,] 26 16 22 22 18 14 24 24 2 0
[>] found missing values ('NA') in sequence data
[>] preparing 7 sequences
[>] coding void elements with '%' and missing values with '*'
[!] sequence with index: 7 contains only missing values.
This may produce inconsistent results.
[>] 4 distinct states appear in the data:
1 = A
2 = B
3 = C
4 = D
[>] state coding:
[alphabet] [label] [long label]
1 A A A
2 B B B
3 C C C
4 D D D
[>] sum of weights: 60 - min/max: 0/29.3
[>] 7 sequences in the data set
[>] min/max sequence length: 10/13
[>] including missing values as an additional state
[>] 7 sequences with 5 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 7 distinct sequences
[>] min/max sequence length: 10/13
[>] computing distances using the OM metric
[>] elapsed time: 0.009 secs
[>] including missing values as an additional state
[>] 7 sequences with 5 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 7 distinct sequences
[>] min/max sequence length: 10/13
[>] computing distances using the OM metric
[>] elapsed time: 0.009 secs
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
```
