seqcost: Generate substitution and indel costs (TraMineR package)

Description

The function `seqcost` offers several ways to generate substitution costs (intended to reflect state dissimilarities) and, where applicable, indel costs. Available methods are: `"CONSTANT"` (same cost for all substitutions), `"TRATE"` (derived from the observed transition rates), `"FUTURE"` (Chi-squared distance between conditional state distributions `lag` positions ahead), `"FEATURES"` (Gower distance between state features), and `"INDELS"` and `"INDELSLOG"` (based on estimated indel costs). The substitution-cost matrix is intended to serve as the `sm` argument of the `seqdist` function, which computes distances between sequences. `seqsubm` is an alias that returns only the substitution-cost matrix, i.e., no indel.

Usage

```
seqcost(seqdata, method, cval = NULL, with.missing = FALSE, miss.cost = NULL,
  time.varying = FALSE, weighted = TRUE, transition = "both", lag = 1,
  miss.cost.fixed = NULL, state.features = NULL, feature.weights = NULL,
  feature.type = list(), proximities = FALSE)

seqsubm(...)
```

Arguments

`seqdata` A sequence object as returned by the `seqdef` function.

`method` String. How to generate the costs. One of `"CONSTANT"` (same cost for all substitutions), `"TRATE"` (derived from the observed transition rates), `"FUTURE"` (Chi-squared distance between conditional state distributions `lag` positions ahead), `"FEATURES"` (Gower distance between state features), `"INDELS"`, `"INDELSLOG"` (based on estimated indel costs).

`cval` Scalar. For method `"CONSTANT"`, the single substitution cost. For method `"TRATE"`, a base value from which transition probabilities are subtracted. If `NULL`, `cval=2` is used, unless `transition` is `"both"` and `time.varying` is `TRUE`, in which case `cval=4`.

`with.missing` Logical. Should an additional entry be added in the matrix for the missing states? If `TRUE`, the ‘missing’ state is also added to the alphabet. Set as `TRUE` if you want to use the costs for distances between sequences containing non deleted (non void) missing values. Forced as `FALSE` when there are no non-void missing values in `seqdata`. See Gabadinho et al. (2010) for more details on the options for handling missing values when creating the state sequence object with `seqdef`.

`miss.cost` Scalar or vector. Cost for substituting the missing state. Default is `cval`.

`miss.cost.fixed` Logical. Should the substitution cost for missing be set as the `miss.cost` value? When `NULL` (default) it will be set as `FALSE` when `method = "INDELS"` or `"INDELSLOG"`, and `TRUE` otherwise.

`time.varying` Logical. If `TRUE` return an `array` with a distinct matrix for each time unit. Time is the third dimension (subscript) of the returned array. Time varying works only with `method='CONSTANT'`, `'TRATE'`, `'INDELS'`, and `'INDELSLOG'`.

`weighted` Logical. Should weights in `seqdata` be used when applicable?

`transition` String. Only used if `method="TRATE"` and `time.varying=TRUE`. On which transition are rates based? Should be one of `"previous"` (from previous state), `"next"` (to next state) or `"both"`.

`lag` Integer. For methods `TRATE` and `FUTURE` only. Time ahead to which transition rates are computed (default is `lag=1`).

`state.features` Data frame with features values for each state.

`feature.weights` Vector of feature weights with length equal to the number of columns of `state.features`.

`feature.type` List of feature types. See `daisy` for details.

`proximities` Logical: should state proximities be returned instead of substitution costs?

`...` Arguments passed to `seqcost`.

Details

The substitution-cost matrix has dimension ns*ns, where ns is the number of states in the alphabet of the sequence object. The element (i,j) of the matrix is the cost of substituting state i with state j. It defines the dissimilarity between the states i and j.

With method `CONSTANT`, the substitution costs are all set equal to the `cval` value, the default value being 2.

With method `TRATE` (transition rates), the transition probabilities between all pairs of states are first computed (using the `seqtrate` function). Then, the substitution cost between states i and j is obtained with the formula

SC(i,j) = cval - P(i|j) - P(j|i)

where P(i|j) is the probability of transition from state j to i `lag` positions ahead. Default `cval` value is 2. When `time.varying=TRUE` and `transition="both"`, the substitution cost at position t is set as

SC(i,j,t) = cval - P(i|j,t-1) - P(j|i,t-1) - P(i|j,t) - P(j|i,t)

where P(i|j,t-1) is the probability of a transition from state j at t-1 to state i at t. Here, the default `cval` value is 4.
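As a rough illustration of the time-constant formula SC(i,j) = cval - P(i|j) - P(j|i), the base-R sketch below applies it to a made-up transition-rate matrix. The matrix `tr`, the helper name `trate_cost`, the convention that `tr[i, j]` holds the probability of moving from state i to state j, and the zero diagonal are assumptions for illustration only; in practice the rates come from `seqtrate` and the costs from `seqcost`.

```r
# Hypothetical 3-state transition-rate matrix (rows: origin, columns: destination).
# tr[i, j] plays the role of P(j|i); each row sums to 1.
tr <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.7, 0.1,
               0.0, 0.3, 0.7),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Illustrative helper (not part of TraMineR): SC(i,j) = cval - P(i|j) - P(j|i)
trate_cost <- function(tr, cval = 2) {
  sm <- cval - tr - t(tr)
  diag(sm) <- 0  # assumed here: substituting a state with itself costs nothing
  sm
}

sm <- trate_cost(tr)
sm
```

Because both P(i|j) and P(j|i) are subtracted, the resulting matrix is symmetric, and pairs of states with frequent transitions between them (here A and B, or B and C) end up cheaper to substitute than pairs with rare transitions.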

With method `FUTURE`, the cost between i and j is the Chi-squared distance between the vectors d(alphabet | i) and d(alphabet | j) of probabilities of transition from states i and j, respectively, to all the states in the alphabet `lag` positions ahead:

SC(i,j) = ChiDist(d(alphabet | i), d(alphabet | j))
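The idea behind the `FUTURE` formula can be sketched in base R as follows. The transition matrix `tr` is made up, the helper `chi_dist_matrix` is hypothetical, and the normalizing weights `m` (average share of each destination state) are an assumption: the exact normalization used internally by TraMineR may differ.

```r
# Hypothetical transition-rate matrix `lag` positions ahead:
# row i is the conditional distribution d(alphabet | i).
tr <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.7, 0.1,
               0.1, 0.3, 0.6),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Illustrative helper (not part of TraMineR): pairwise Chi-squared distance
# between the rows of a matrix of conditional distributions.
chi_dist_matrix <- function(tr) {
  m <- colSums(tr) / sum(tr)  # assumed weights: average share of each destination state
  ns <- nrow(tr)
  sm <- matrix(0, ns, ns, dimnames = dimnames(tr))
  for (i in seq_len(ns)) {
    for (j in seq_len(ns)) {
      sm[i, j] <- sqrt(sum((tr[i, ] - tr[j, ])^2 / m))
    }
  }
  sm
}

sm_future <- chi_dist_matrix(tr)
sm_future
```

Two states that lead to the same future distribution get a cost of 0 under this sketch, which is the intuition behind the "common future" label shown in the example output.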

With method `FEATURES`, each state is characterized by the variables `state.features`, and the cost between i and j is computed as the Gower distance between their vectors of `state.features` values.
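To make the Gower distance concrete, here is a minimal base-R sketch restricted to numeric features. The feature table, its column names, and the helper `gower_numeric` are hypothetical; the documentation above points to `daisy` for the feature types, which also covers non-numeric features that this sketch ignores.

```r
# Hypothetical numeric features for four states (all names are made up).
state.features <- data.frame(
  in_couple  = c(0, 0, 1, 1),
  n_children = c(0, 1, 0, 2),
  row.names  = c("A", "B", "C", "D")
)

# Illustrative helper (not part of TraMineR): Gower distance for numeric
# features, i.e. the weighted mean of range-scaled absolute differences.
gower_numeric <- function(x, w = rep(1, ncol(x))) {
  x <- as.matrix(x)
  rng <- apply(x, 2, function(col) diff(range(col)))  # range of each feature
  ns <- nrow(x)
  d <- matrix(0, ns, ns, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(ns)) {
    for (j in seq_len(ns)) {
      d[i, j] <- sum(w * abs(x[i, ] - x[j, ]) / rng) / sum(w)
    }
  }
  d
}

d <- gower_numeric(state.features)
d
```

In this toy table, states A and D differ maximally on both features and are therefore at distance 1, while A and B differ only by half the range of one feature and are at distance 0.25.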

With methods `INDELS` and `INDELSLOG`, values of indels are first derived from the state relative frequencies f_i. For `INDELS`, indel_i = 1/f_i is used, and for `INDELSLOG`, indel_i = log[2/(1 + f_i)]. Substitution costs are then set as SC(i,j) = indel_i + indel_j.
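The two indel formulas and the resulting substitution costs can be reproduced by hand in base R. The frequencies `f` below are made up, the helper `sm_from_indels` is hypothetical, and the zero diagonal is an assumption of this sketch.

```r
# Hypothetical relative state frequencies f_i (they sum to 1).
f <- c(A = 0.5, B = 0.3, C = 0.2)

indel_inv <- 1 / f             # method "INDELS":    indel_i = 1/f_i
indel_log <- log(2 / (1 + f))  # method "INDELSLOG": indel_i = log[2/(1 + f_i)]

# Illustrative helper (not part of TraMineR): SC(i,j) = indel_i + indel_j
sm_from_indels <- function(indel) {
  sm <- outer(indel, indel, "+")
  diag(sm) <- 0  # assumed here: no cost for substituting a state with itself
  sm
}

sm_inv <- sm_from_indels(indel_inv)
sm_log <- sm_from_indels(indel_log)
sm_log
```

Under both formulas, rarer states receive larger indel costs; the logarithmic variant compresses the range, so very rare states do not dominate the distances as strongly as with 1/f_i.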

For all methods but `INDELS` and `INDELSLOG`, the indel is set as max(sm)/2 when `time.varying=FALSE` and as 1 otherwise.
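For the time-constant case, the indel rule can be checked on a constant-cost matrix. This is a small base-R sketch with a hypothetical 4-state alphabet and an assumed zero diagonal, not the package's own code.

```r
# Constant substitution costs for a hypothetical 4-state alphabet
ns <- 4
cval <- 2
sm <- matrix(cval, ns, ns)
diag(sm) <- 0  # assumed here: zero cost on the diagonal

# Indel rule stated above for time.varying = FALSE
indel <- max(sm) / 2
indel  # 1
```

With this rule, a deletion followed by an insertion costs 2 * indel = max(sm), so no single substitution is more expensive than replacing a state through one deletion and one insertion.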

Value

For `seqcost`, a list of two elements, `indel` and `sm` or `prox`:

`indel` The indel cost. Either a scalar or a vector of size ns. When `time.varying=TRUE`

`sm` The substitution-cost matrix (or array) when `proximities = FALSE` (default).

`prox` The state proximity matrix when `proximities = TRUE`.

When `time.varying = FALSE`, `sm` and `prox` are matrices of size ns * ns, where ns is the number of states in the alphabet of the sequence object. When `time.varying = TRUE`, they are three-dimensional arrays of size ns * ns * L, where L is the maximum sequence length.

For `seqsubm`, only one element, the matrix (or array) `sm`.

Author(s)

Gilbert Ritschard and Matthias Studer (and Alexis Gabadinho for first version of `seqsubm`)

References

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in `R` with the `TraMineR` package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Studer, M. and G. Ritschard (2016). "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures". Journal of the Royal Statistical Society, Series A, 179(2), 481-511. doi: 10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland, 2014. doi: 10.12682/lives.2296-1658.2014.33

See Also

`seqtrate`, `seqdef`, `seqdist`.

Examples

```
## Load the package (provides seqdef, seqcost, seqsubm, seqdist and the example data)
library(TraMineR)

## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' example data set.
data(biofam)
biofam.seq <- seqdef(biofam[501:600,10:25])

## Indel and substitution costs based on log of inverse state frequencies
lifcost <- seqcost(biofam.seq, method="INDELSLOG")
## Here lifcost$indel is a vector
biofam.om <- seqdist(biofam.seq, method="OM", indel=lifcost$indel, sm=lifcost$sm)

## Optimal matching using transition rates based substitution-cost matrix
## and the associated indel cost
## Here trcost$indel is a scalar
trcost <- seqcost(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=trcost$indel, sm=trcost$sm)

## Using costs based on FUTURE with a forward lag of 4
fucost <- seqcost(biofam.seq, method="FUTURE", lag=4)
biofam.om <- seqdist(biofam.seq, method="OM", indel=fucost$indel, sm=fucost$sm)

## Optimal matching using a unique substitution cost of 2
## and an insertion/deletion cost of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM", indel=3, sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10,1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1[,1:13], weights=ex1$weights)

## Unweighted
subm <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", indel=subm$indel, sm=subm$sm, with.missing=TRUE)

## Weighted
subm.w <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", indel=subm.w$indel, sm=subm.w$sm, with.missing=TRUE)

## Compare the unweighted and weighted distance matrices element by element
ex1.om == ex1.omw
```

Example output

```
TraMineR stable version 2.0-11.1 (Built: 2019-05-12)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.

[>] 8 distinct states appear in the data:
1 = 0
2 = 1
3 = 2
4 = 3
5 = 4
6 = 5
7 = 6
8 = 7
[>] state coding:
[alphabet]  [label]  [long label]
1  0           0        0
2  1           1        1
3  2           2        2
4  3           3        3
5  4           4        4
6  5           5        5
7  6           6        6
8  7           7        7
[>] 100 sequences in the data set
[>] min/max sequence length: 16/16
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.04 secs
[>] creating substitution-cost matrix using transition rates ...
[>] computing transition probabilities for states 0/1/2/3/4/5/6/7 ...
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.089 secs
[>] creating substitution-cost matrix using common future...
[>] computing transition probabilities for states 0/1/2/3/4/5/6/7 ...
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.017 secs
[>] creating 8x8 substitution-cost matrix using 2 as constant value
[>] 100 sequences with 8 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 76 distinct sequences
[>] min/max sequence length: 16/16
[>] computing distances using the OM metric
[>] elapsed time: 0.017 secs
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0   16   26   12   26   16   16   26   26    26
 [2,]   16    0   22   20   18    6   22   24   16    16
 [3,]   26   22    0   14   22   22   12    2   22    22
 [4,]   12   20   14    0   22   20    8   16   22    22
 [5,]   26   18   22   22    0   18   24   24   18    18
 [6,]   16    6   22   20   18    0   24   24   12    14
 [7,]   16   22   12    8   24   24    0   12   24    24
 [8,]   26   24    2   16   24   24   12    0   24    24
 [9,]   26   16   22   22   18   12   24   24    0     2
[10,]   26   16   22   22   18   14   24   24    2     0
[>] found missing values ('NA') in sequence data
[>] preparing 7 sequences
[>] coding void elements with '%' and missing values with '*'
[!] sequence with index: 7 contains only missing values.
This may produce inconsistent results.
[>] 4 distinct states appear in the data:
1 = A
2 = B
3 = C
4 = D
[>] state coding:
[alphabet]  [label]  [long label]
1  A           A        A
2  B           B        B
3  C           C        C
4  D           D        D
[>] sum of weights: 60 - min/max: 0/29.3
[>] 7 sequences in the data set
[>] min/max sequence length: 10/13
[>] including missing values as an additional state
[>] 7 sequences with 5 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 7 distinct sequences
[>] min/max sequence length: 10/13
[>] computing distances using the OM metric
[>] elapsed time: 0.009 secs
[>] including missing values as an additional state
[>] 7 sequences with 5 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
[>] 7 distinct sequences
[>] min/max sequence length: 10/13
[>] computing distances using the OM metric
[>] elapsed time: 0.009 secs
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
```

TraMineR documentation built on June 3, 2021, 5:06 p.m.