Distances (dissimilarities) between sequences
Description
Computes pairwise dissimilarities between sequences or dissimilarities with a reference sequence. Several dissimilarity measures can be chosen, including optimal matching (OM) and many of its variants, distances based on the count of common attributes, and distances between within-sequence state distributions.
Usage
seqdistOO(seqdata, method, refseq=NULL, norm=FALSE,
    indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE,
    kweights=rep(1.0, ncol(seqdata)), tpow=1,
    expcost=0.5, context=1-2*expcost,
    link="mean", h=0.5, nu=0,
    transindel="constant", otto, previous=FALSE,
    addcolumn=TRUE, numcpu=1,
    breaks=NULL, step=1, overlap=FALSE, weighted=TRUE)

Arguments
seqdata 
State sequence object defined with the 
method 
Character string indicating the metric to be used. One of

refseq 
Optional baseline sequence to compute the distances from. Can be the index of a sequence in the state sequence object, 0 for the most frequent sequence, or an external sequence passed as a sequence object with a single row and same alphabet as seqdata assigned to it. 
norm 
String or logical. One of 
indel 
Scalar or numeric vector. The insertion/deletion cost (for all OM variants including TWED). Default is 1. Ignored with non-OM metrics. If it is a vector, the indel costs are state dependent (the vector length should be equal to the number of states). 
sm 
Substitution-cost matrix (for the OM variants, TWED, SVRspell, HAM and DHD methods). Can also be one of the 
with.missing 
Logical. Must be 
full.matrix 
Logical. If 
kweights 
Numeric vector whose k-th element is the weight applied to subsequences of length k (for all NMS distances). 
expcost 
Scalar. Cost of spell length transformation (the exact interpretation is distance dependent). For 
context 
Scalar. Cost of local insertion (for 
link 
Function name. Function used to compute some internal costs. 
h 
Scalar. Exponential weight of spell length (for 
tpow 
Scalar. Exponential weight of spell length (for 
transindel 
Character string. Method for computing transition indel costs. One of 
otto 
Scalar. Origin-transition trade-off weight (for 
previous 
if 
addcolumn 
if 
breaks 
An optional list containing the spells used to build the intervals (for 
step 
The length of the interval (only used if 
weighted 
Logical. Should distributions account for the sequence weights in the state sequence object? (see 
overlap 
Logical. Are overlapping spells allowed (for 
nu 
Scalar. Stiffness (for 
numcpu 
Scalar. Number of CPUs to use. 
Details
The seqdistOO function returns a matrix of distances between sequences or a vector of distances to the reference sequence when refseq is set. The available metrics (see the 'method' option) include:
Edit distances: optimal matching (OM), localized OM (OMloc), spell-length-sensitive OM (OMslen), OM of spells (OMspell), OM of sequences of transitions (OMstran), Hamming (HAM), dynamic Hamming (DHD), and the time warp edit distance (TWED).
Metrics based on counts of common attributes: distance based on the longest common subsequence (LCS), on the longest common prefix (LCP), on the longest common suffix (RLCP), on the number of matching subsequences (NMS), and the subsequence vectorial representation metric (SVRspell).
Distances between state distributions: Euclidean (EUCLID) and Chi-squared (CHI2).
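To make the difference between an edit-type distance and a common-attribute distance concrete, here is a minimal base R sketch on two made-up sequences. The helper names (hamming_distance, lcs_length, lcs_distance) are hypothetical and are not part of the package; the LCS distance is assumed to take the usual form, the sum of the two lengths minus twice the length of the longest common subsequence.

```r
# Two toy state sequences of equal length (made-up data).
x <- c("A", "A", "B", "C")
y <- c("A", "B", "B", "C")

# Hamming-type distance with a unit substitution cost:
# the number of positions at which the two sequences differ.
hamming_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x != y)
}

# Length of the longest common subsequence, by dynamic programming.
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  tab <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (x[i] == y[j]) {
        tab[i + 1, j + 1] <- tab[i, j] + 1L
      } else {
        tab[i + 1, j + 1] <- max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

# Assumed form of the LCS-based distance: |x| + |y| - 2 * LCS(x, y).
lcs_distance <- function(x, y) {
  length(x) + length(y) - 2 * lcs_length(x, y)
}

d_ham <- hamming_distance(x, y)
d_lcs <- lcs_distance(x, y)
```

The Hamming-type measure compares states position by position, whereas the LCS-based measure only asks how much ordered content the two sequences share.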
See Studer and Ritschard (2014) for a description and comparison of the above dissimilarity measures, except TWED, for which we refer to Marteau (2009) and Halpin (2014).
Except for a few dissimilarity measures (LCS, LCP, RLCP, NMS) that do not require parameters, each method can be controlled with some parameters as described in the following table.
method          parameters
--------------  ---------------------------------------------------
OM              sm, indel
OMloc           sm, expcost, context
OMslen          sm, indel, h
OMspell         sm, indel, expcost
OMstran         sm, indel, otto, previous, addcolumn, transindel
TWED            sm, indel, h, nu
HAM, DHD        sm
EUCLID, CHI2    step, breaks, overlap
NMS, NMSMST     kweights
SVRspell        sm, kweights, tpow
In addition, on multiprocessor computers, the number of CPUs to be used can be specified with numcpu.
For DHD, sm, when specified, should be a series of matrices grouped in a 3-dimensional array with the third index referring to the position in the sequence.
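The layout of such a position-dependent cost array can be sketched in base R as follows. The states, the cost values and the helper dhd_distance are made up for illustration; this is only a sketch of the idea of summing position-specific substitution costs, not the package's implementation.

```r
# Hypothetical two-state alphabet and sequences of length 3.
states <- c("A", "B")
len <- 3

# One substitution-cost matrix per position, stacked along the third index.
sm3 <- array(0, dim = c(length(states), length(states), len),
             dimnames = list(states, states, NULL))
for (pos in seq_len(len)) {
  # Made-up costs: substituting A and B costs `pos` at position `pos`.
  sm3["A", "B", pos] <- sm3["B", "A", pos] <- pos
}

# Position-wise distance: sum the cost of each mismatch, taking the cost
# from the matrix attached to that position.
dhd_distance <- function(x, y, sm3) {
  stopifnot(length(x) == length(y), length(x) == dim(sm3)[3])
  sum(vapply(seq_along(x),
             function(pos) sm3[x[pos], y[pos], pos],
             numeric(1)))
}

x <- c("A", "A", "B")
y <- c("B", "A", "A")
d_dhd <- dhd_distance(x, y, sm3)  # mismatches at positions 1 and 3
```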
When sm is not specified, a single cost of 2 is used for all substitutions, except for HAM, for which it is 1, and for DHD, for which the costs are derived from the transition rates at the successive positions.
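As an illustration of how sm and indel enter an OM-type distance, the following base R sketch builds a constant substitution-cost matrix with cost 2 and computes a generic weighted edit distance by dynamic programming. The function om_distance and the toy alphabet are hypothetical; this is the textbook recursion, given under the assumption of a single scalar indel cost, and not the package's own code.

```r
# Constant substitution costs of 2, zero on the diagonal (toy alphabet).
states <- c("A", "B", "C")
sm <- matrix(2, nrow = length(states), ncol = length(states),
             dimnames = list(states, states))
diag(sm) <- 0

# Generic edit distance with insertions, deletions and substitutions.
om_distance <- function(x, y, sm, indel = 1) {
  n <- length(x)
  m <- length(y)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel   # delete every element of x
  d[1, ] <- (0:m) * indel   # insert every element of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + indel,          # deletion
        d[i + 1, j] + indel,          # insertion
        d[i, j] + sm[x[i], y[j]]      # substitution (0 if states match)
      )
    }
  }
  d[n + 1, m + 1]
}

x <- c("A", "A", "B", "C")
y <- c("A", "B", "B", "C")
d_om <- om_distance(x, y, sm, indel = 1)
```

With a substitution cost of 2 and an indel cost of 1, a substitution costs exactly as much as one deletion followed by one insertion, so raising the substitution cost further would not change the result.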
Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist" the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality.
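The normalization rules described above can be written down as small base R helpers. All function names are hypothetical, and the exact formulas used by the package may differ in detail (for instance in how zero-length sequences are handled, which this sketch ignores). The Yujian and Bo form is assumed to be 2d / (alpha (l1 + l2) + d), with alpha the indel cost, as given in their 2007 paper.

```r
# d: raw distance, s: raw similarity (e.g. LCS length),
# l1, l2: lengths of the two sequences, dmax: maximal possible distance.

# Abbott: distance divided by the length of the longer sequence.
norm_maxlength <- function(d, l1, l2) d / max(l1, l2)

# Distance divided by its maximal possible value.
norm_maxdist <- function(d, dmax) d / dmax

# Elzinga: similarity divided by the geometric mean of the two lengths,
# turned into a dissimilarity by taking the complement to 1 (assumption).
norm_gmean <- function(s, l1, l2) 1 - s / sqrt(l1 * l2)

# Yujian and Bo (2007), assumed form with a single indel cost.
norm_yujianbo <- function(d, l1, l2, indel = 1) {
  2 * d / (indel * (l1 + l2) + d)
}

# Toy values: two sequences of length 4 with raw distance 2 and LCS length 3.
n_abbott <- norm_maxlength(2, 4, 4)
n_maxdist <- norm_maxdist(2, dmax = 8)
n_gmean <- norm_gmean(3, 4, 4)
n_yb <- norm_yujianbo(2, 4, 4, indel = 1)
```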
When sequences contain gaps and the gaps=NA option was passed to seqdef (i.e. when there are non-deleted missing values), the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap. This is to make the user aware that there are gaps in the sequences. For methods that need an sm value, the function expects a substitution cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution cost matrices returned by seqcost and seqsubm will include these additional entries when the functions are called with with.missing=TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).
Value
When refseq is NULL (default), the whole matrix (or a dist object when full.matrix=FALSE) of pairwise distances between sequences is returned. Otherwise a vector with distances from the sequences in the data sequence object to the reference sequence specified with refseq is returned.
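The two output shapes can be illustrated with base R alone. The matrix below is made-up data standing in for a full pairwise distance matrix; as.dist and hclust are standard functions from the stats package, and nothing here calls the package documented on this page.

```r
# Made-up symmetric matrix standing in for a full pairwise distance matrix.
ids <- c("s1", "s2", "s3")
full <- matrix(c(0, 2, 4,
                 2, 0, 6,
                 4, 6, 0),
               nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# A dist object stores only the lower triangle, which is the form
# expected by functions such as hclust().
d <- as.dist(full)
hc <- hclust(d, method = "average")

# For a symmetric measure, distances to a reference sequence that is itself
# in the data correspond to one column of the full matrix.
to_s1 <- full[, "s1"]
```

A dist object is the more compact choice when the result is only passed on to clustering, while the full matrix is convenient for direct indexing.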
References
Studer, M. & Ritschard, G. (2015), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: http://dx.doi.org/10.1111/rssa.12125
Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: http://dx.doi.org/10.12682/lives.2296-1658.2014.33
Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.
Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.
Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. DOI: http://dx.doi.org/10.1007/978-3-319-04969-4_5
Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. DOI: http://dx.doi.org/10.1109/TPAMI.2008.76
Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions On Pattern Analysis And Machine Intelligence, 29(6), 1091–1095. DOI: http://dx.doi.org/10.1109/TPAMI.2007.1078
See also all references in Studer and Ritschard (2014, 2015)
See Also
seqsubm, seqcost, seqdef, and, for multichannel distances, seqdistmc.
Examples
## TraMineR provides seqdef, seqsubm, seqdist and the biofam and ex1 data;
## seqdistOO must be available from the package this page documents
library(TraMineR)

## optimal matching distances with substitution cost matrix
## derived from transition rates
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
biofam.om <- seqdistOO(biofam.seq, method="OM", indel=3, sm="TRATE")

## LCS distances to an external sequence
refs <- seqdef("(0,5)-(3,5)-(4,6)", informat="SPS",
               alphabet=alphabet(biofam.seq))
biofam.ref <- seqdist(biofam.seq, method="LCS", refseq=refs)
hist(biofam.ref)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1, 1:13)

## substitution costs including an entry for the missing state
subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE)
ex1.om <- seqdistOO(ex1.seq, method="OM", sm=subm, with.missing=TRUE)
