
The function `seqcost` proposes different ways to generate substitution costs (supposed to reflect state dissimilarities) and possibly indel costs. Proposed methods are: `"CONSTANT"` (same cost for all substitutions), `"TRATE"` (derived from the observed transition rates), `"FUTURE"` (Chi-squared distance between conditional state distributions `lag` positions ahead), `"FEATURES"` (Gower distance between state features), and `"INDELS"` and `"INDELSLOG"` (based on estimated indel costs). The substitution-cost matrix is intended to serve as the `sm` argument of the `seqdist` function, which computes distances between sequences. `seqsubm` is an alias that returns only the substitution-cost matrix, i.e., no indel.

```
seqcost(seqdata, method, cval = NULL, with.missing = FALSE, miss.cost = NULL,
  time.varying = FALSE, weighted = TRUE, transition = "both", lag = 1,
  missing.trate = FALSE, state.prop = NULL, prop.weights = NULL,
  prop.type = list(), proximities = FALSE, miss.cost.fixed = !missing.trate,
  state.features = state.prop, feature.weights = prop.weights,
  feature.type = prop.type)

seqsubm(...)
```

| Argument | Description |
|---|---|
| `seqdata` | A sequence object as returned by the `seqdef` function. |
| `method` | Character string. How to generate the costs. One of `"CONSTANT"`, `"TRATE"`, `"FUTURE"`, `"FEATURES"`, `"INDELS"`, `"INDELSLOG"`. |
| `cval` | Scalar. For method `"CONSTANT"`, the single substitution cost; for method `"TRATE"`, the base value from which transition rates are subtracted. |
| `with.missing` | Logical. Should an additional entry be added in the matrix for the missing states? If … |
| `miss.cost` | Scalar or vector. Cost for substituting the missing state. Default is … |
| `miss.cost.fixed` | Logical. Should the substitution cost for missing be set as the … |
| `time.varying` | Logical. If … |
| `weighted` | Logical. Should weights in … |
| `transition` | Character string. Only used if … |
| `lag` | Integer. For methods `"TRATE"` and `"FUTURE"`: number of positions ahead at which the transition rates are computed. |
| `missing.trate` | Logical. Deprecated, use `miss.cost.fixed` instead. |
| `state.features` | Data frame with feature values for each state. |
| `feature.weights` | Vector of feature weights with length equal to the number of columns of `state.features`. |
| `feature.type` | List of feature types. See … |
| `state.prop` | Deprecated. Use `state.features`. |
| `prop.weights` | Deprecated. Use `feature.weights`. |
| `prop.type` | Deprecated. Use `feature.type`. |
| `proximities` | Logical. Should state proximities be returned instead of substitution costs? |
| `...` | Arguments passed to `seqcost`. |

The substitution-cost matrix has dimension *ns × ns*, where *ns* is the number of states in the alphabet of the sequence object. The element *(i,j)* of the matrix is the cost of substituting state *i* with state *j*. It defines the dissimilarity between states *i* and *j*.

With method `CONSTANT`, the substitution costs are all set equal to the `cval` value, the default value being 2.

With method `TRATE` (transition rates), the transition rates between all pairs of states are first computed (using the `seqtrate` function). Then, the substitution cost between states *i* and *j* is obtained with the formula

*SC(i,j) = cval - P(i,j) - P(j,i)*

where *P(i,j)* is the rate of transition from state *i* to state *j* `lag` positions ahead.
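As a rough illustration of this formula, the base-R sketch below derives `TRATE`-style costs from a small transition-rate matrix. The matrix `P` and its values are made up for the example (in practice the rates would be estimated from the sequences), and setting the diagonal to 0 is an assumption about how a state is compared with itself.

```r
## Hypothetical transition rates between three states (each row sums to 1).
## In practice these would be estimated from the sequence object.
states <- c("A", "B", "C")
P <- matrix(c(0.8, 0.1, 0.1,
              0.2, 0.7, 0.1,
              0.0, 0.3, 0.7),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))

cval <- 2  # base value from which the transition rates are subtracted

## SC(i,j) = cval - P(i,j) - P(j,i)
sm <- cval - P - t(P)
diag(sm) <- 0  # assumption: substituting a state with itself costs nothing

sm
```

Because *P(i,j)* and *P(j,i)* enter the formula symmetrically, the resulting matrix is symmetric, and frequent transitions between two states lower the cost of substituting one for the other.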

With method `FUTURE`, the cost between *i* and *j* is the Chi-squared distance between the vectors *d(alphabet | i)* and *d(alphabet | j)* of rates of transition from states *i* and *j*, respectively, to all the states in the alphabet `lag` positions ahead:

*SC(i,j) = ChiDist(d(alphabet | i), d(alphabet | j))*
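The idea can be sketched in base R as follows. The transition rates are made up, `chidist` is a hypothetical helper, and weighting each squared difference by the inverse of the column total is the usual correspondence-analysis convention, taken here as an assumption; the exact normalization used by `seqcost` may differ.

```r
## Hypothetical transition rates: row i is the distribution d(alphabet | i)
states <- c("A", "B", "C")
P <- matrix(c(0.8, 0.1, 0.1,
              0.2, 0.7, 0.1,
              0.0, 0.3, 0.7),
            nrow = 3, byrow = TRUE, dimnames = list(states, states))

## Hypothetical helper: chi-squared distance between two row profiles,
## each squared difference being divided by the column total (assumption)
chidist <- function(p, q, col.tot) sqrt(sum((p - q)^2 / col.tot))

col.tot <- colSums(P)
ns <- nrow(P)
sm <- matrix(0, ns, ns, dimnames = dimnames(P))
for (i in seq_len(ns)) {
  for (j in seq_len(ns)) {
    sm[i, j] <- chidist(P[i, ], P[j, ], col.tot)
  }
}

sm
```

Two states that tend to lead to the same future states get a small cost, whatever the rate of direct transition between them.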

With method `FEATURES`, each state is characterized by the variables `state.features`, and the cost between *i* and *j* is computed as the Gower distance between their vectors of `state.features` values.
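For numeric features, the Gower distance reduces to a weighted mean of range-scaled absolute differences. The sketch below shows this special case in base R; the feature values and the helper `gower_numeric` are hypothetical, and mixed-type features would call for a full implementation such as `daisy` from the `cluster` package.

```r
## Hypothetical numeric features for three states (values are made up)
state.features <- data.frame(
  income   = c(10, 20, 40),
  autonomy = c(0, 1, 1),
  row.names = c("A", "B", "C")
)
feature.weights <- c(1, 1)

## Hypothetical helper: Gower distance restricted to numeric features,
## i.e. weighted mean of absolute differences scaled by each feature's range
gower_numeric <- function(x, w = rep(1, ncol(x))) {
  x <- as.matrix(x)
  ranges <- apply(x, 2, function(v) diff(range(v)))
  scaled <- sweep(x, 2, ranges, "/")
  ns <- nrow(x)
  d <- matrix(0, ns, ns, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(ns)) {
    for (j in seq_len(ns)) {
      d[i, j] <- sum(w * abs(scaled[i, ] - scaled[j, ])) / sum(w)
    }
  }
  d
}

sm <- gower_numeric(state.features, feature.weights)
sm
```

Each scaled difference lies between 0 and 1, so the resulting costs also lie between 0 and 1.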

With methods `INDELS` and `INDELSLOG`, values of indels are first derived from the state relative frequencies *f_i*. For `INDELS`, *indel_i = 1/f_i*, and for `INDELSLOG`, *indel_i = log[2/(1 + f_i)]*. Substitution costs are then set as *SC(i,j) = indel_i + indel_j*.
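A minimal base-R sketch of both variants, using made-up relative frequencies; zeroing the diagonal is an assumption about how a state is compared with itself.

```r
## Hypothetical state relative frequencies (they sum to 1)
f <- c(A = 0.5, B = 0.3, C = 0.2)

indel.freq <- 1 / f             # INDELS:    indel_i = 1 / f_i
indel.log  <- log(2 / (1 + f))  # INDELSLOG: indel_i = log[2 / (1 + f_i)]

## SC(i,j) = indel_i + indel_j
sm.freq <- outer(indel.freq, indel.freq, "+")
sm.log  <- outer(indel.log, indel.log, "+")

## Assumption: substituting a state with itself costs nothing
diag(sm.freq) <- 0
diag(sm.log)  <- 0

sm.freq
sm.log
```

In both variants the indel cost decreases with the state frequency, so rare states are the most expensive to insert, delete, or substitute; the logarithmic form keeps the costs within a much narrower range.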

For all methods but `INDELS` and `INDELSLOG`, the indel is set as *max(sm)/2* when `time.varying=FALSE` and as *1* otherwise.
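The non-time-varying rule can be illustrated with a made-up cost matrix (the values of `sm` below are hypothetical):

```r
## Hypothetical substitution-cost matrix for three states
sm <- matrix(c(0.0, 1.7, 1.9,
               1.7, 0.0, 1.6,
               1.9, 1.6, 0.0),
             nrow = 3, byrow = TRUE)

## Single indel cost: half of the largest substitution cost
indel <- max(sm) / 2

## No substitution costs more than one deletion plus one insertion
all(sm <= 2 * indel)
```

With this choice, the most expensive substitution costs exactly as much as a deletion followed by an insertion, and every other substitution costs less.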

A list of two elements, `indel` and `sm`:

| Element | Description |
|---|---|
| `indel` | The indel cost. Either a scalar or a vector of size *ns*. |
| `sm` | The substitution-cost matrix of size *ns × ns*. |

For `seqsubm`, only the matrix `sm`.

Matthias Studer and Alexis Gabadinho (first version) (with Gilbert Ritschard for the help page)

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. *Journal of Statistical Software* **40**(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in `R` with the `TraMineR` package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Studer, M. & Ritschard, G. (2015), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", *Journal of the Royal Statistical Society, Series A*. **179**(2), 481-511. DOI: http://dx.doi.org/10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". *LIVES Working Papers*, **33**. NCCR LIVES, Switzerland, 2014. DOI: http://dx.doi.org/10.12682/lives.2296-1658.2014.33

```
## Defining a sequence object with columns 10 to 25
## in the 'biofam' example data set
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)

## Optimal matching using transition rates based substitution-cost matrix
## and insertion/deletion costs of 3
trcost <- seqcost(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=3, sm=trcost$sm)

## Using the insertion/deletion cost returned by seqcost
biofam.om <- seqdist(biofam.seq, method="OM", indel=trcost$indel, sm=trcost$sm)

## Using costs based on FUTURE with a forward lag of 4
fucost <- seqcost(biofam.seq, method="FUTURE", lag=4)
biofam.om <- seqdist(biofam.seq, method="OM", indel=fucost$indel, sm=fucost$sm)

## Optimal matching using a unique substitution cost of 2
## and an insertion/deletion cost of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM", indel=3, sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10, 1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1, 1:13, weights=ex1$weights)

## Unweighted
subm <- seqcost(ex1.seq, method="TRATE", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", sm=subm$sm, with.missing=TRUE)

## Weighted
subm.w <- seqcost(ex1.seq, method="TRATE", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", sm=subm.w$sm, with.missing=TRUE)

## Element-wise comparison of the two distance matrices
ex1.om == ex1.omw
```

seqdist2 documentation built on May 21, 2017, 12:46 a.m.
