Computes pairwise dissimilarities between sequences or dissimilarities with a reference sequence. Several dissimilarity measures can be chosen, including optimal matching (OM) and many of its variants, distances based on the count of common attributes, and distances between within-sequence state distributions.
seqdistOO(seqdata, method, refseq=NULL, norm=FALSE,
    indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE,
    kweights=rep(1.0, ncol(seqdata)), tpow=1,
    expcost=0.5, context=1-2*expcost,
    link="mean", h=0.5, nu=0,
    transindel="constant", otto, previous=FALSE,
    addcolumn=TRUE, numcpu=1,
    breaks=NULL, step=1, overlap=FALSE, weighted=TRUE)
seqdata |
State sequence object defined with the |
method |
Character string indicating the metric to be used. One of
|
refseq |
Optional baseline sequence to compute the distances from. Can be the index of a sequence in the state sequence object, 0 for the most frequent sequence, or an external sequence passed as a sequence object with a single row and same alphabet as seqdata assigned to it. |
norm |
String or logical. One of |
indel |
Scalar or numeric vector. The insertion/deletion cost (for all OM variants including TWED). Default is 1. Ignored with non-OM metrics. If it is a vector, the indel costs are state dependent (the vector length should be equal to the number of states). |
sm |
Substitution-cost matrix (for OM variants, TWED, SVRspell, HAM and DHD methods). Can also be one of the |
with.missing |
Logical. Must be |
full.matrix |
Logical. If |
kweights |
Numerical vector with at position k the weight applied to subsequences of length k (for all NMS distances). |
expcost |
Scalar. Cost of spell length transformation (the exact interpretation is distance dependent). For |
context |
Scalar. Cost of local insertion (for |
link |
Function name. Function used to compute some internal costs. |
h |
Scalar. Exponential weight of spell length (for |
tpow |
Scalar. Exponential weight of spell length (for |
transindel |
Character string. Method for computing transition indel costs. One of |
otto |
Scalar. Origin-transition trade-off weight (for |
previous |
if |
addcolumn |
if |
breaks |
An optional list containing the spells used to build the intervals (for |
step |
The length of the interval (only used if |
weighted |
Logical. Should distributions account for the sequence weights in the state sequence object? (see |
overlap |
Logical. Are overlapping spells allowed (for |
nu |
Scalar. Stiffness (for |
numcpu |
Scalar. Number of CPUs to be used. |
The seqdistOO function returns a matrix of distances between sequences or a vector of distances to the reference sequence when refseq is set. The available metrics (see the method argument) include:
Edit distances: optimal matching (OM), localized OM (OMloc), spell-length-sensitive OM (OMslen), OM of spells (OMspell), OM of sequences of transitions (OMstran), Hamming (HAM), dynamic Hamming (DHD), and the time warp edit distance (TWED).
Metrics based on counts of common attributes: distances based on the longest common subsequence (LCS), the longest common prefix (LCP), the longest common suffix (RLCP), the number of matching subsequences (NMS), the minimum shared time (NMSMST), and the subsequence vectorial representation metric (SVRspell).
Distances between state distributions: Euclidean (EUCLID) and Chi-squared (CHI2).
See Studer and Ritschard (2014) for a description and comparison of the above dissimilarity measures, except TWED, for which we refer to Marteau (2009) and Halpin (2014).
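To make the role of the substitution costs and the indel cost concrete, here is a minimal base R sketch of the classic OM dynamic program for two sequences. The helper om_distance() is hypothetical and is not part of TraMineR; it assumes a single constant indel cost and a substitution-cost matrix whose dimnames are the state labels.

```r
# Hypothetical helper (not a TraMineR function): classic OM distance
# between two state sequences given as character vectors.
om_distance <- function(x, y, sm, indel = 1) {
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the cheapest cost of turning the first i states
  # of x into the first j states of y
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- (0:n) * indel
  d[1, ] <- (0:m) * indel
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + indel,       # delete x[i]
        d[i + 1, j] + indel,       # insert y[j]
        d[i, j] + sm[x[i], y[j]]   # substitute x[i] by y[j]
      )
    }
  }
  d[n + 1, m + 1]
}

states <- c("A", "B", "C")
# Constant substitution cost of 2, zero on the diagonal
sm <- matrix(2, 3, 3, dimnames = list(states, states))
diag(sm) <- 0

x <- c("A", "A", "B", "C")
y <- c("A", "B", "B", "C")
om_xy <- om_distance(x, y, sm, indel = 1)
print(om_xy)
```

With a substitution cost equal to twice the indel cost, a substitution is never cheaper than a deletion followed by an insertion, which is why the choice of sm and indel shapes the resulting distances.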
Except for a few dissimilarity measures (LCS, LCP, RLCP, NMS) that do not require parameters, each method can be controlled with the parameters listed in the following table.
method | parameters |
------------------ | --------------------------------- |
OM | sm, indel |
OMloc | sm, expcost, context, link |
OMslen | sm, indel, h, link |
OMspell | sm, indel, expcost, tpow |
OMstran | sm, indel, otto, previous, addcolumn, transindel |
TWED | sm, indel, h, nu |
HAM, DHD | sm |
EUCLID, CHI2 | step, breaks, overlap |
NMS, NMSMST | kweights |
SVRspell | sm, kweights, tpow |
------------------ | --------------------------------- |
In addition, on multiprocessor computers, the number of CPUs to be used can be specified with numcpu.
The sm argument is required for "OM", "OMloc", "OMslen", "OMspell", "OMstran", "TWED", and "SVRspell", and is optional for "HAM" and "DHD". For HAM, a default substitution cost of 1 is used when sm=NULL, and for DHD the default costs are derived from the transition rates at the successive positions. When provided for DHD, sm should be a series of matrices grouped in a 3-dimensional array with the third index referring to the position in the sequence. For "SVRspell", the sm argument must be a matrix of state proximities rather than a matrix of state dissimilarities as in all other cases.
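As an illustration of the position-dependent layout described for DHD, the base R sketch below stacks one substitution-cost matrix per position along the third dimension of an array. The cost values and the helper dhd_distance() are hypothetical and only show the indexing convention; they are not TraMineR code.

```r
states <- c("A", "B")
n_pos <- 3

# One 2 x 2 substitution-cost matrix per position, stacked along dimension 3
sm3d <- array(0, dim = c(2, 2, n_pos), dimnames = list(states, states, NULL))
for (pos in seq_len(n_pos)) {
  # Illustrative costs that grow with the position in the sequence
  sm3d["A", "B", pos] <- pos
  sm3d["B", "A", pos] <- pos
}

# Hypothetical helper: sum of the position-specific substitution costs
# between two sequences of equal length
dhd_distance <- function(x, y, sm3d) {
  stopifnot(length(x) == length(y), length(x) <= dim(sm3d)[3])
  sum(vapply(seq_along(x),
             function(pos) sm3d[x[pos], y[pos], pos],
             numeric(1)))
}

dhd_xy <- dhd_distance(c("A", "B", "A"), c("A", "A", "B"), sm3d)
print(dhd_xy)
```

Here the mismatch at position 3 costs more than the one at position 2, which is the kind of time-varying cost that a single two-dimensional matrix cannot express.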
Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist", the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality.
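The two normalizations described in words above can be sketched in base R on an LCS example. This is a rough illustration under stated assumptions: lcs_length() is a hypothetical helper, the raw LCS distance is taken as the sum of the two lengths minus twice the LCS length, and the "gmean" value is written as one minus the normalized similarity so that it reads as a distance. It is not the package's internal code.

```r
# Hypothetical helper: length of the longest common subsequence
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  l <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      l[i + 1, j + 1] <- if (x[i] == y[j]) {
        l[i, j] + 1
      } else {
        max(l[i, j + 1], l[i + 1, j])
      }
    }
  }
  l[n + 1, m + 1]
}

x <- c("A", "A", "B", "C")
y <- c("A", "B", "C")

sim <- lcs_length(x, y)                    # similarity: LCS length
d_raw <- length(x) + length(y) - 2 * sim   # assumed raw LCS distance

# Elzinga-style: similarity over the geometric mean of the two lengths
d_gmean <- 1 - sim / sqrt(length(x) * length(y))
# Abbott-style: raw distance over the length of the longer sequence
d_maxlength <- d_raw / max(length(x), length(y))

print(c(raw = d_raw, gmean = d_gmean, maxlength = d_maxlength))
```

The two rules generally give different values for the same pair, so normalized distances from different settings should not be mixed in one analysis.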
When sequences contain gaps and the gaps=NA option was passed to seqdef (i.e. when there are non-deleted missing values), the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap. This is to make users aware that there are gaps in their sequences. For methods that need an sm value, seqdistOO expects a substitution cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution cost matrices returned by seqcost and seqsubm include these additional entries when the functions are called with with.missing=TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).
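To show what "a row and a column entry for the missing state" looks like, the base R sketch below extends a hand-made substitution-cost matrix by one state. The missing-state symbol "*" and the cost of 2 are assumptions chosen for illustration; in practice the symbol is whatever was set with the nr option of seqdef.

```r
states <- c("A", "B", "C")
# Constant substitution cost of 2 between distinct states
sm <- matrix(2, 3, 3, dimnames = list(states, states))
diag(sm) <- 0

miss_symbol <- "*"   # assumed symbol for the missing state
miss_cost <- 2       # assumed cost of substituting to or from the missing state

# Add one row and one column for the missing state
all_states <- c(states, miss_symbol)
sm_miss <- matrix(miss_cost, 4, 4, dimnames = list(all_states, all_states))
sm_miss[states, states] <- sm
diag(sm_miss) <- 0

print(sm_miss)
```

A matrix of this shape treats the missing state as one more element of the alphabet, which is what a hand-built sm needs when with.missing=TRUE.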
When refseq is NULL (default), the whole matrix (or a dist object when full.matrix=FALSE) of pairwise distances between sequences is returned. Otherwise, a vector with distances from the sequences in the data sequence object to the reference sequence specified with refseq is returned.
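The relation between the full matrix and the dist form can be seen with base R alone. The sketch below uses a made-up symmetric matrix as a stand-in for a pairwise result; the values and the labels s1 to s3 are hypothetical.

```r
# Toy symmetric distance matrix standing in for a pairwise result
labs <- c("s1", "s2", "s3")
m <- matrix(c(0, 2, 4,
              2, 0, 3,
              4, 3, 0),
            nrow = 3, byrow = TRUE, dimnames = list(labs, labs))

d <- as.dist(m)          # compact lower-triangle form, class "dist"
m_back <- as.matrix(d)   # back to a full symmetric matrix

# Distances to one reference sequence are a single column of the full matrix
to_ref <- m[, "s2"]
print(to_ref)
```

The dist form stores only n(n-1)/2 values, which saves memory for large sets of sequences, and functions such as hclust accept it directly.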
Studer, M. & Ritschard, G. (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: http://dx.doi.org/10.1111/rssa.12125
Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: http://dx.doi.org/10.12682/lives.2296-1658.2014.33
Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.
Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.
Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. DOI: http://dx.doi.org/10.1007/978-3-319-04969-4_5
Marteau, P.-F. (2009). Time Warp Edit Distance with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. DOI: http://dx.doi.org/10.1109/TPAMI.2008.76
Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions On Pattern Analysis And Machine Intelligence, 29(6), 1091–1095. DOI: http://dx.doi.org/10.1109/TPAMI.2007.1078
See also all references in Studer and Ritschard (2014, 2016).
seqcost, seqsubm, seqdef, and for multichannel distances seqdistmc.
## optimal matching distances with substitution cost matrix
## derived from transition rates
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
biofam.om <- seqdistOO(biofam.seq, method="OM", indel=3, sm="TRATE")

## LCS distances to an external sequence
refs <- seqdef("(0,5)-(3,5)-(4,6)", informat="SPS",
    alphabet=alphabet(biofam.seq))
biofam.ref <- seqdist(biofam.seq, method="LCS", refseq=refs)
hist(biofam.ref)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1, 1:13)
subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE)
ex1.om <- seqdistOO(ex1.seq, method="OM", sm=subm, with.missing=TRUE)