Distances (dissimilarities) between sequences



Computes pairwise dissimilarities between sequences, or dissimilarities between each sequence and a reference sequence. Several dissimilarity measures are available, including optimal matching (OM) and many of its variants, distances based on the count of common attributes, and distances between within-sequence state distributions.


seqdistOO(seqdata, method, refseq=NULL, norm=FALSE,
        indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE,
        kweights=rep(1.0, ncol(seqdata)), tpow=1,
        expcost=0.5, context=1-2*expcost,
        link="mean", h=0.5, nu=0,
        transindel="constant", otto, previous=FALSE,
        addcolumn=TRUE, numcpu=1,
        breaks=NULL, step=1, overlap=FALSE, weighted=TRUE)



seqdata: State sequence object defined with the seqdef function.


method: Character string indicating the metric to be used. One of "OM", "OMloc", "OMslen", "OMspell", "OMstran", "LCS", "HAM", "DHD", "TWED", "SVRspell", "NMS", "NMSMST", "LCP", "RLCP", "CHI2", "EUCLID" (see details).


refseq: Optional baseline sequence to compute the distances from. Can be the index of a sequence in the state sequence object, 0 for the most frequent sequence, or an external sequence passed as a sequence object with a single row and the same alphabet as seqdata.


norm: String or logical. One of "none", "maxlength", "gmean", "maxdist", "YujianBo". If TRUE, the normalization method is selected automatically (see details). Default is FALSE.


indel: Scalar or numeric vector. The insertion/deletion cost (for all OM variants, including TWED). Default is 1. Ignored with non-OM metrics. If it is a vector, the indel costs are state dependent, and the vector length should equal the number of states.


sm: Substitution-cost matrix (for the OM variants, TWED, SVRspell, HAM and DHD methods). Can also be one of the seqcost methods "TRATE" or "CONSTANT" for building the sm matrix. Default is NA. A valid value MUST be given for method="OM".


with.missing: Logical. Must be TRUE when sequences contain non-deleted gaps (missing values). See details.


full.matrix: Logical. If TRUE (default), the full distance matrix is returned. This is for compatibility with earlier versions of the seqdist function. If FALSE, an object of class dist is returned, that is, a vector containing only values from the upper triangle of the distance matrix. Objects of class dist are smaller and can be passed directly as arguments to most clustering functions. Ignored when refseq is set.


kweights: Numeric vector whose element at position k is the weight applied to subsequences of length k (for all NMS distances).


expcost: Scalar. Cost of spell length transformation (the exact interpretation is distance dependent). For OMspell and OMloc.


context: Scalar. Cost of local insertion (for OMloc only).


link: Function name. Function used to compute some internal costs.
For "OMloc", the function used in the dynamic computation of insertion/deletion costs; one of "mean" (default), "min" (minimum) or "previous" (previous state only).
For "OMslen", the function used to compute substitution costs; one of "mean" (arithmetic average) or "gmean" (geometric mean, as in the original proposition of Halpin 2010).


h: Scalar. Exponential weight of spell length (for OMslen).


tpow: Scalar. Exponential weight of spell length (for OMspell and SVRspell).


transindel: Character string. Method for computing transition indel costs (for OMstran). One of "constant" (single indel of 1), "subcost" (based on substitution costs), "prob" (based on transition probabilities).


otto: Scalar. Origin-transition trade-off weight (for OMstran).


previous: Logical. If TRUE, add the transition from the previous state (for OMstran).


addcolumn: Logical. If TRUE, repeat the first column (if previous=TRUE) and the last column (for OMstran).


breaks: An optional list containing the spells used to build the intervals (for CHI2 and EUCLID).


step: The length of the interval (only used if breaks is NULL) (for CHI2 and EUCLID).


weighted: Logical. Should distributions account for the sequence weights in the state sequence object (see seqdef)? Set to FALSE to ignore the weights (for CHI2, OMstran).


overlap: Logical. Are overlapping spells allowed? (for CHI2, EUCLID).


nu: Scalar. Stiffness (for TWED). See Halpin (2014), p. 88.


numcpu: Scalar. Number of CPUs to be used.


The seqdistOO function returns a matrix of distances between sequences or a vector of distances to the reference sequence when refseq is set. The available metrics (see 'method' option) include

  • Edit distances: optimal matching (OM), localized OM (OMloc), spell length sensitive OM (OMslen), OM of spells (OMspell), OM of sequences of transitions (OMstran), Hamming (HAM), dynamic Hamming (DHD), and the time warp edit distance (TWED).

  • Metrics based on counts of common attributes: distance based on the longest common subsequence (LCS), on the longest common prefix (LCP), on the longest common suffix (RLCP), on the number of matching subsequences (NMS), and the subsequence vectorial representation metric (SVRspell).

  • Distances between state distributions: Euclidean (EUCLID), Chi-squared (CHI2).
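As an illustration of the last family, here is a minimal base R sketch of a Euclidean distance between state distributions. It assumes a single interval covering the whole sequence and distributions expressed as proportions, which may differ from the scaling the function applies. The helpers state_distribution and euclid_dist are hypothetical names and are not part of the package.

```r
# Proportion of positions spent in each state of a given alphabet.
state_distribution <- function(x, alphabet) {
  counts <- table(factor(x, levels = alphabet))
  as.numeric(counts) / length(x)
}

# Euclidean distance between the state distributions of two sequences
# (hypothetical helper, not the package implementation).
euclid_dist <- function(x, y, alphabet) {
  px <- state_distribution(x, alphabet)
  py <- state_distribution(y, alphabet)
  sqrt(sum((px - py)^2))
}

alphabet <- c("A", "B", "C")
s1 <- c("A", "A", "B", "B")
s2 <- c("B", "B", "A", "A")   # same time in each state, different order
s3 <- c("A", "A", "A", "A")

d12 <- euclid_dist(s1, s2, alphabet)  # 0: the ordering of states is ignored
d13 <- euclid_dist(s1, s3, alphabet)  # sqrt(0.5)
```

Because only the time spent in each state matters, s1 and s2 are at distance zero even though their states appear in a different order; the breaks, step and overlap arguments exist to recover some sensitivity to timing by comparing distributions interval by interval.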

See Studer and Ritschard (2014) for a description and the comparison of the above dissimilarity measures except TWED for which we refer to Marteau (2009) and Halpin (2014).

Except for a few dissimilarity measures (LCS, LCP, RLCP) that do not require parameters, each method can be controlled with the parameters listed in the following table.

method              parameters
------------------  ------------------------------------------------
OM                  sm, indel
OMloc               sm, expcost, context
OMslen              sm, indel, h
OMspell             sm, indel, expcost
OMstran             sm, indel, otto, previous, addcolumn, transindel
TWED                sm, indel, h, nu
EUCLID, CHI2        step, breaks, overlap
NMS, NMSMST         kweights
SVRspell            sm, kweights, tpow
------------------  ------------------------------------------------

In addition, on multiprocessor computers, the number of CPUs to be used can be specified with numcpu.

For DHD, sm, when supplied, should be a series of matrices grouped in a 3-dimensional array, with the third index referring to the position in the sequence.
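The following base R sketch shows one way to lay out such position-specific costs and how a DHD-style distance sums them. The cost values, the dimnames and the helper dhd_dist are hypothetical and serve only to illustrate the 3-dimensional structure; they are not taken from the package.

```r
states <- c("A", "B")
n_pos <- 3

# One substitution-cost matrix per position, stacked along the third dimension.
sm3d <- array(0, dim = c(2, 2, n_pos), dimnames = list(states, states, NULL))
for (t in seq_len(n_pos)) {
  sm3d["A", "B", t] <- sm3d["B", "A", t] <- t   # hypothetical costs 1, 2, 3
}

# DHD-style distance: sum of the position-specific substitution costs
# (hypothetical helper; sequences must have the same length).
dhd_dist <- function(x, y, sm) {
  stopifnot(length(x) == length(y))
  sum(vapply(seq_along(x), function(t) sm[x[t], y[t], t], numeric(1)))
}

d <- dhd_dist(c("A", "A", "B"), c("A", "B", "A"), sm3d)  # 0 + 2 + 3 = 5
```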

When sm is not specified, a single cost of 2 is used for all substitutions, except for HAM for which it is 1, and for DHD for which the costs are derived from the transition rates at the successive positions.
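To make the roles of sm and indel concrete, here is a minimal base R sketch of the classic optimal matching recursion, using a constant substitution cost of 2 and an indel cost of 1. The function om_dist is a hypothetical helper written for this illustration; it is not the package's implementation and ignores normalization, missing values and the OM variants.

```r
# Dynamic-programming edit distance with substitution costs sm and a
# single indel cost (hypothetical helper, for illustration only).
om_dist <- function(x, y, sm, indel = 1) {
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * indel          # delete all of the first i states of x
  D[1, ] <- (0:m) * indel          # insert all of the first j states of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      D[i + 1, j + 1] <- min(
        D[i, j + 1] + indel,       # deletion
        D[i + 1, j] + indel,       # insertion
        D[i, j] + sm[x[i], y[j]]   # substitution (cost 0 on a match)
      )
    }
  }
  D[n + 1, m + 1]
}

states <- c("A", "B", "C")
sm <- matrix(2, 3, 3, dimnames = list(states, states))
diag(sm) <- 0

d_sub <- om_dist(c("A", "B"), c("A", "C"), sm)       # 2: one substitution
d_del <- om_dist(c("A", "B", "C"), c("A", "C"), sm)  # 1: one deletion
```

With a substitution cost of 2 and an indel cost of 1, a substitution costs exactly as much as a deletion followed by an insertion.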

Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist", the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007), which preserves the triangle inequality.
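The two main rules can be written as one-line base R functions. This is a sketch under stated assumptions: the function names are hypothetical, and Elzinga's similarity ratio is turned into a distance here by taking one minus the ratio, which is an assumption of this illustration rather than something stated above.

```r
# Abbott's rule ("maxlength"): distance divided by the length of the
# longer sequence.
norm_maxlength <- function(d, len_x, len_y) {
  d / max(len_x, len_y)
}

# Elzinga's rule ("gmean"): similarity divided by the geometric mean of the
# two lengths, converted to a distance as 1 - ratio (assumption of this sketch).
norm_gmean <- function(similarity, len_x, len_y) {
  1 - similarity / sqrt(len_x * len_y)
}

abbott  <- norm_maxlength(4, 8, 10)  # 4 / 10 = 0.4
elzinga <- norm_gmean(3, 4, 9)       # 1 - 3 / 6 = 0.5
```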

When sequences contain gaps and the gaps=NA option was passed to seqdef (i.e., when there are non-deleted missing values), the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap. This makes the user aware that the sequences contain gaps. For methods that need an sm value, the function expects a substitution cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution cost matrices returned by seqcost and seqsubm include these additional entries when the functions are called with with.missing=TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).


When refseq is NULL (default), the whole matrix (or a dist object when full.matrix=FALSE) of pairwise distances between sequences is returned. Otherwise a vector with distances from the sequences in the data sequence object to the reference sequence specified with refseq is returned.
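The difference between the two return forms can be seen with base R alone. The sketch below uses a small hypothetical distance matrix rather than actual function output, and relies only on as.dist, as.matrix, hclust and cutree from the stats package.

```r
# Hypothetical symmetric distance matrix for four sequences.
m <- matrix(c(0, 2, 4, 6,
              2, 0, 2, 4,
              4, 2, 0, 2,
              6, 4, 2, 0), nrow = 4, byrow = TRUE)

d <- as.dist(m)          # compact form: n * (n - 1) / 2 = 6 values
n_values <- length(d)
m_back <- as.matrix(d)   # back to the full 4 x 4 matrix

# Clustering functions such as hclust() take a dist object as input.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
```

A full matrix returned with full.matrix=TRUE can be converted with as.dist before clustering, so the choice mainly affects memory use.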


Studer, M. & Ritschard, G. (2015), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: http://dx.doi.org/10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: http://dx.doi.org/10.12682/lives.2296-1658.2014.33

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. DOI: http://dx.doi.org/10.1007/978-3-319-04969-4_5

Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. DOI: http://dx.doi.org/10.1109/TPAMI.2008.76

Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions On Pattern Analysis And Machine Intelligence, 29(6), 1091–1095. DOI: http://dx.doi.org/10.1109/TPAMI.2007.1078

See also all references in Studer and Ritschard (2014, 2015)

See Also

seqcost, seqsubm, seqdef, and seqdistmc for multichannel distances.


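## optimal matching distances with substitution cost matrix
## derived from transition rates
library(TraMineR)        # seqdef, seqsubm, seqdist and the example data
library(TraMineRextras)  # assumed here to be the package providing seqdistOO
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
biofam.om <- seqdistOO(biofam.seq, method="OM", indel=3, sm="TRATE")

## LCS distances to an external sequence
## (the external sequence must share the alphabet of biofam.seq)
refs <- seqdef("(0,5)-(3,5)-(4,6)", informat="SPS",
               alphabet=alphabet(biofam.seq))
biofam.ref <- seqdist(biofam.seq, method="LCS", refseq=refs)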

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1, 1:13)

subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE)
ex1.om <- seqdistOO(ex1.seq, method="OM", sm=subm, with.missing=TRUE)
