seqcost: Generate substitution and indel costs

Description

The function seqcost offers several methods for generating substitution costs (intended to reflect state dissimilarities) and, where applicable, indel costs. The available methods are: "CONSTANT" (same cost for all substitutions), "TRATE" (derived from the observed transition rates), "FUTURE" (Chi-squared distance between conditional state distributions lag positions ahead), "FEATURES" (Gower distance between state features), "INDELS", "INDELSLOG" (based on estimated indel costs). The substitution-cost matrix is intended to serve as the sm argument of the seqdist function, which computes distances between sequences. seqsubm is an alias that returns only the substitution-cost matrix, i.e., without the indel cost.

Usage

seqcost(seqdata, method, cval = NULL, with.missing = FALSE, miss.cost = NULL,
  time.varying = FALSE, weighted = TRUE, transition = "both", lag = 1,
  miss.cost.fixed = NULL, state.features = NULL, feature.weights = NULL,
  feature.type = list(), proximities = FALSE)

seqsubm(...)

Arguments

seqdata

A sequence object as returned by the seqdef function.

method

String. How to generate the costs. One of "CONSTANT" (same cost for all substitutions), "TRATE" (derived from the observed transition rates), "FUTURE" (Chi-squared distance between conditional state distributions lag positions ahead), "FEATURES" (Gower distance between state features), "INDELS", "INDELSLOG" (based on estimated indel costs).

cval

Scalar. For method "CONSTANT", the single substitution cost.
For method "TRATE", a base value from which transition probabilities are subtracted.
If NULL, cval=2 is used, unless transition is "both" and time.varying is TRUE, in which case cval=4.

with.missing

Logical. Should an additional entry be added in the matrix for the missing states? If TRUE, the ‘missing’ state is also added to the alphabet. Set to TRUE if you want to use the costs for distances between sequences containing non-deleted (non-void) missing values. Forced to FALSE when there are no non-void missing values in seqdata. See Gabadinho et al. (2010) for more details on the options for handling missing values when creating the state sequence object with seqdef.

miss.cost

Scalar or vector. Cost for substituting the missing state. Default is cval.

miss.cost.fixed

Logical. Should the substitution cost for missing be set as the miss.cost value? When NULL (default), it is set as FALSE when method = "INDELS" or "INDELSLOG", and as TRUE otherwise.

time.varying

Logical. If TRUE, return an array with a distinct matrix for each time unit. Time is the third dimension (subscript) of the returned array. Time-varying costs are available only with methods "CONSTANT", "TRATE", "INDELS", and "INDELSLOG".

weighted

Logical. Should weights in seqdata be used when applicable?

transition

String. Only used if method="TRATE" and time.varying=TRUE. On which transition are rates based? Should be one of "previous" (from previous state), "next" (to next state) or "both".

lag

Integer. For methods TRATE and FUTURE only. Time ahead to which transition rates are computed (default is lag=1).

state.features

Data frame with features values for each state.

feature.weights

Vector of feature weights with length equal to the number of columns of state.features.

feature.type

List of feature types. See daisy for details.

proximities

Logical: should state proximities be returned instead of substitution costs?

...

Arguments passed to seqcost.

Details

The substitution-cost matrix has dimension ns*ns, where ns is the number of states in the alphabet of the sequence object. The element (i,j) of the matrix is the cost of substituting state i with state j. It defines the dissimilarity between the states i and j.

With method CONSTANT, the substitution costs are all set equal to the cval value, the default value being 2.

With method TRATE (transition rates), the transition probabilities between all pairs of states are first computed (using the seqtrate function). Then, the substitution cost between states i and j is obtained with the formula

SC(i,j) = cval - P(i|j) -P(j|i)

where P(i|j) is the probability of transition from state j to i lag positions ahead. Default cval value is 2. When time.varying=TRUE and transition="both", the substitution cost at position t is set as

SC(i,j,t) = cval - P(i|j,t-1) -P(j|i,t-1) - P(i|j,t) - P(j|i,t)

where P(i|j,t-1) is the probability to transit from state j at t-1 to i at t. Here, the default cval value is 4.
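
The time-constant formula can be checked by hand. The following is a minimal sketch, assuming the biofam subset used in the Examples section and assuming that seqtrate returns a matrix whose element (i,j) is the estimated probability of moving from state i to state j one position ahead; the object names tr, sc.manual and sc are illustrative only.

```r
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

# Transition rates: tr[i, j] is assumed to estimate P(state j at t+1 | state i at t)
tr <- seqtrate(biofam.seq)

# Apply SC(i,j) = cval - P(i|j) - P(j|i) with the default cval = 2
cval <- 2
sc.manual <- cval - tr - t(tr)
diag(sc.manual) <- 0   # substituting a state with itself costs nothing

# Costs returned by seqcost with the same (default) settings
sc <- seqcost(biofam.seq, method = "TRATE")$sm

# Compare values only, since the dimnames of the two objects may differ
all.equal(unname(sc.manual), unname(sc))
```

Because the two transition probabilities are summed, the resulting matrix is symmetric even though the transition-rate matrix itself usually is not.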

With method FUTURE, the cost between i and j is the Chi-squared distance between the vectors d(alphabet | i) and d(alphabet | j) of probabilities of transition from states i and j, respectively, to all the states in the alphabet lag positions ahead:

SC(i,j) = ChiDist(d(alphabet | i), d(alphabet | j))

With method FEATURES, each state is characterized by the variables state.features, and the cost between i and j is computed as the Gower distance between their vectors of state.features values.
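
As an illustration of the FEATURES method, the sketch below builds a small feature table for the eight biofam states. The binary coding in state.feat (left home, married, has a child, divorced) is a hypothetical coding chosen for this example, not one prescribed by the package; rows are assumed to follow the alphabet order 0 to 7. The Gower distance is computed through daisy, so the cluster package is assumed to be installed.

```r
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

# Hypothetical features, one row per state in alphabet order (0..7)
state.feat <- data.frame(
  left     = c(0, 1, 0, 1, 0, 1, 1, 1),
  married  = c(0, 0, 1, 1, 0, 0, 1, 0),
  child    = c(0, 0, 0, 0, 1, 1, 1, 0),
  divorced = c(0, 0, 0, 0, 0, 0, 0, 1)
)

# Substitution costs as Gower distances between the rows of state.feat
featcost <- seqcost(biofam.seq, method = "FEATURES",
                    state.features = state.feat)
featcost$sm
```

States that share more feature values receive a lower substitution cost, which is the point of describing states by features rather than treating them as unrelated categories.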

With methods INDELS and INDELSLOG, values of indels are first derived from the state relative frequencies f_i. For INDELS, indel_i = 1/f_i is used, and for INDELSLOG, indel_i = log[2/(1 + f_i)]. Substitution costs are then set as SC(i,j) = indel_i + indel_j.
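
The relation between state frequencies, indels and substitution costs can be inspected as follows. This is a sketch under the assumption that the relative frequencies f_i correspond to the Percent column returned by seqstatf divided by 100; indel.manual is an illustrative name.

```r
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

lifcost <- seqcost(biofam.seq, method = "INDELSLOG")
indel <- lifcost$indel   # one indel cost per state
sm <- lifcost$sm

# State relative frequencies f_i (assumed: overall frequencies from seqstatf)
f <- seqstatf(biofam.seq)$Percent / 100

# INDELSLOG formula: indel_i = log[2 / (1 + f_i)]
indel.manual <- log(2 / (1 + f))
all.equal(as.numeric(indel), indel.manual)

# Off the diagonal, SC(i,j) = indel_i + indel_j
all.equal(as.numeric(sm[1, 2]), as.numeric(indel[1] + indel[2]))
```

Rare states get a higher indel cost than frequent ones, so substitutions involving rare states are the most expensive.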

For all methods but INDELS and INDELSLOG, the indel is set as max(sm)/2 when time.varying=FALSE and as 1 otherwise.
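
For instance, with a constant substitution cost of 2 the returned indel should be 1. A short sketch, again assuming the biofam subset from the Examples section:

```r
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

ccost <- seqcost(biofam.seq, method = "CONSTANT", cval = 2)

# The indel is a single value set to half the largest substitution cost
ccost$indel
max(ccost$sm) / 2
```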

Value

For seqcost, a list of two elements, indel and sm or prox:

indel

The indel cost. Either a scalar or a vector of size ns. When time.varying=TRUE

sm

The substitution-cost matrix (or array) when proximities = FALSE (default).

prox

The state proximity matrix when proximities = TRUE.

sm and prox are, when time.varying = FALSE, a matrix of size ns * ns, where ns is the number of states in the alphabet of the sequence object. When time.varying = TRUE, they are a three dimensional array of size ns * ns * L, where L is the maximum sequence length.

For seqsubm, only one element, the matrix (or array) sm.
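
The shapes of the returned objects can be verified with a short sketch such as the one below, which assumes the biofam subset from the Examples section (8 states, sequences of length 16); the object names are illustrative.

```r
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

# Time-constant costs: a list holding the indel and an ns x ns matrix
trcost <- seqcost(biofam.seq, method = "TRATE")
names(trcost)
dim(trcost$sm)

# Time-varying costs: sm becomes an ns x ns x L array
tvcost <- seqcost(biofam.seq, method = "TRATE", time.varying = TRUE)
dim(tvcost$sm)

# seqsubm returns the substitution costs alone, not a list
sm.only <- seqsubm(biofam.seq, method = "CONSTANT", cval = 2)
is.matrix(sm.only)
```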

Author(s)

Gilbert Ritschard and Matthias Studer (and Alexis Gabadinho for first version of seqsubm)

References

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Studer, M. and G. Ritschard (2016). What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society, Series A 179(2), 481-511. doi: 10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). A Comparative Review of Sequence Dissimilarity Measures. LIVES Working Papers 33. NCCR LIVES, Switzerland. doi: 10.12682/lives.2296-1658.2014.33

See Also

seqtrate, seqdef, seqdist.

Examples

## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' example data set.
data(biofam)
biofam.seq <- seqdef(biofam[501:600,10:25])

## Indel and substitution costs based on log of inverse state frequencies
lifcost <- seqcost(biofam.seq, method="INDELSLOG")
## Here lifcost$indel is a vector
biofam.om <- seqdist(biofam.seq, method="OM", indel=lifcost$indel, sm=lifcost$sm)

## Optimal matching using transition rates based substitution-cost matrix
## and the associated indel cost
## Here trcost$indel is a scalar
trcost <- seqcost(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=trcost$indel, sm=trcost$sm)

## Using costs based on FUTURE with a forward lag of 4
fucost <- seqcost(biofam.seq, method="FUTURE", lag=4)
biofam.om <- seqdist(biofam.seq, method="OM", indel=fucost$indel, sm=fucost$sm)

## Optimal matching using a unique substitution cost of 2
## and an insertion/deletion cost of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM", indel=3, sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10,1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1[,1:13], weights=ex1$weights)

## Unweighted
subm <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", indel=subm$indel, sm=subm$sm, with.missing=TRUE)

## Weighted
subm.w <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", indel=subm.w$indel, sm=subm.w$sm, with.missing=TRUE)

ex1.om == ex1.omw

Example output

TraMineR stable version 2.0-11.1 (Built: 2019-05-12)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.

 [>] 8 distinct states appear in the data: 
     1 = 0
     2 = 1
     3 = 2
     4 = 3
     5 = 4
     6 = 5
     7 = 6
     8 = 7
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  0           0        0
     2  1           1        1
     3  2           2        2
     4  3           3        3
     5  4           4        4
     6  5           5        5
     7  6           6        6
     8  7           7        7
 [>] 100 sequences in the data set
 [>] min/max sequence length: 16/16
 [>] 100 sequences with 8 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
 [>] 76 distinct sequences
 [>] min/max sequence length: 16/16
 [>] computing distances using the OM metric
 [>] elapsed time: 0.04 secs
 [>] creating substitution-cost matrix using transition rates ...
 [>] computing transition probabilities for states 0/1/2/3/4/5/6/7 ...
 [>] 100 sequences with 8 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
 [>] 76 distinct sequences
 [>] min/max sequence length: 16/16
 [>] computing distances using the OM metric
 [>] elapsed time: 0.089 secs
 [>] creating substitution-cost matrix using common future...
 [>] computing transition probabilities for states 0/1/2/3/4/5/6/7 ...
 [>] 100 sequences with 8 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
 [>] 76 distinct sequences
 [>] min/max sequence length: 16/16
 [>] computing distances using the OM metric
 [>] elapsed time: 0.017 secs
 [>] creating 8x8 substitution-cost matrix using 2 as constant value
 [>] 100 sequences with 8 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
 [>] 76 distinct sequences
 [>] min/max sequence length: 16/16
 [>] computing distances using the OM metric
 [>] elapsed time: 0.017 secs
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0   16   26   12   26   16   16   26   26    26
 [2,]   16    0   22   20   18    6   22   24   16    16
 [3,]   26   22    0   14   22   22   12    2   22    22
 [4,]   12   20   14    0   22   20    8   16   22    22
 [5,]   26   18   22   22    0   18   24   24   18    18
 [6,]   16    6   22   20   18    0   24   24   12    14
 [7,]   16   22   12    8   24   24    0   12   24    24
 [8,]   26   24    2   16   24   24   12    0   24    24
 [9,]   26   16   22   22   18   12   24   24    0     2
[10,]   26   16   22   22   18   14   24   24    2     0
 [>] found missing values ('NA') in sequence data
 [>] preparing 7 sequences
 [>] coding void elements with '%' and missing values with '*'
 [!] sequence with index: 7 contains only missing values.
     This may produce inconsistent results.
 [>] 4 distinct states appear in the data: 
     1 = A
     2 = B
     3 = C
     4 = D
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  A           A        A
     2  B           B        B
     3  C           C        C
     4  D           D        D
 [>] sum of weights: 60 - min/max: 0/29.3
 [>] 7 sequences in the data set
 [>] min/max sequence length: 10/13
 [>] including missing values as an additional state
 [>] 7 sequences with 5 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
 [>] 7 distinct sequences
 [>] min/max sequence length: 10/13
 [>] computing distances using the OM metric
 [>] elapsed time: 0.009 secs
 [>] including missing values as an additional state
 [>] 7 sequences with 5 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
 [>] 7 distinct sequences
 [>] min/max sequence length: 10/13
 [>] computing distances using the OM metric
 [>] elapsed time: 0.009 secs
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

TraMineR documentation built on June 3, 2021, 5:06 p.m.