seqdistOO: Distances (dissimilarities) between sequences
In seqdist2: Distance Between State Sequences


Computes pairwise dissimilarities between sequences, or dissimilarities between each sequence and a reference sequence. Several dissimilarity measures are available, including optimal matching (OM) and many of its variants, distances based on counts of common attributes, and distances between within-sequence state distributions.

seqdistOO(seqdata, method, refseq=NULL, norm=FALSE,
        indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE,
        kweights=rep(1.0, ncol(seqdata)), tpow=1,
        expcost=0.5, context=1-2*expcost,
        link="mean", h=0.5, nu=0,
        transindel="constant", otto, previous=FALSE,
        addcolumn=TRUE, numcpu=1,
        breaks=NULL, step=1, overlap=FALSE, weighted=TRUE)

`seqdata`	State sequence object defined with the `seqdef` function.
`method`	Character string indicating the metric to be used. One of `"OM", "OMloc", "OMslen", "OMspell", "OMstran", "LCS", "HAM", "DHD", "TWED", "SVRspell", "NMS", "NMSMST", "LCP", "RLCP", "CHI2", "EUCLID"` (see details)
`refseq`	Optional baseline sequence to compute the distances from. Can be the index of a sequence in the state sequence object, 0 for the most frequent sequence, or an external sequence passed as a sequence object with a single row and same alphabet as seqdata assigned to it.
`norm`	String or logical. One of `"none", "maxlength", "gmean", "maxdist", "YujianBo"`. If `TRUE` the normalization method is automatically selected (see details). Default is `FALSE`.
`indel`	Scalar or numeric vector. The insertion/deletion cost (for all OM variants including TWED). Default is 1. Ignored with non-OM metrics. If it is a vector, the indel costs are state dependent (the vector length should equal the number of states).
`sm`	substitution-cost matrix (for OM variants, TWED, SVRspell, HAM and DHD method). Can also be one of the `seqcost` methods `"TRATE"` or `"CONSTANT"` for building a `sm` matrix. Default is `NA`. A valid value MUST be given for `method="OM"`.
`with.missing`	Logical. Must be `TRUE` when sequences contain non deleted gaps (missing values). See details.
`full.matrix`	Logical. If `TRUE` (default), the full distance matrix is returned. This is for compatibility with earlier versions of the `seqdist` function. If `FALSE`, an object of class `dist` is returned, that is, a vector containing only values from the upper triangle of the distance matrix. Objects of class `dist` are smaller and can be passed directly as arguments to most clustering functions. Ignored when `refseq` is set.
`kweights`	Numerical vector with at position k the weight applied to subsequences of length k (for all NMS distances).
`expcost`	Scalar. Cost of spell length transformation (the exact interpretation is distance dependent). For `OMspell` and `OMloc`.
`context`	Scalar. Cost of local insertion (for `OMloc` only).
`link`	Function name. Function used to compute some internal costs. For `OMloc`, the function used in the dynamic computation of insertion/deletion costs; one of `"mean"` (default), `"min"` (minimum) or `"previous"` (previous state only). For `OMslen`, the function used to compute substitution costs; one of `"mean"` (arithmetic average) or `"gmean"` (geometric mean, as in the original proposal of Halpin 2010).

`h`	Scalar. Exponential weight of spell length (for `OMslen`).
`tpow`	Scalar. Exponential weight of spell length (for `OMspell` and `SVRspell`).
`transindel`	Character string. Method for computing transition indel costs. One of `"constant"` (single indel of 1), `"subcost"` (based on substitution costs), `"prob"` (based on transition probabilities) (for `OMstran`).
`otto`	Scalar. Origin-transition trade-off weight (for `OMstran`).
`previous`	Logical. If `TRUE`, add the transition from the previous state (for `OMstran`).
`addcolumn`	Logical. If `TRUE`, duplicate the first (if `previous=TRUE`) and the last column (for `OMstran`).
`breaks`	An optional list containing the spells used to build the intervals (for `CHI2`).
`step`	The length of the interval (only used if `breaks` is not `NULL`) (for `CHI2`).
`weighted`	Logical. Should distributions account for the sequence weights in the state sequence object? (see `seqdef`.) Set as `FALSE` if you want to ignore the weights (for `CHI2, OMstran`).
`overlap`	Logical. Are overlapping spells allowed (for `CHI2, EUCLID`).
`nu`	Scalar. Stiffness (for `TWED`). See Halpin (2014), p 88.
`numcpu`	Scalar. Number of CPUs to be used.

The seqdistOO function returns a matrix of distances between sequences or a vector of distances to the reference sequence when refseq is set. The available metrics (see 'method' option) include

Edit distances: optimal matching (OM), localized OM (OMloc), spell length sensitive OM (OMslen), OM of spells (OMspell), OM of sequences of transitions (OMstran), Hamming (HAM), dynamic Hamming (DHD), and the time warp edit distance (TWED).
Metrics based on counts of common attributes: distance based on the longest common subsequence (LCS), on the longest common prefix (LCP), on the longest common suffix (RLCP), on the number of matching subsequences (NMS), on the minimum shared time (NMSMST), and the subsequence vectorial representation metric (SVRspell).
Distances between state distributions: Euclidean (EUCLID), Chi-squared (CHI2).

See Studer and Ritschard (2014) for a description and the comparison of the above dissimilarity measures except TWED for which we refer to Marteau (2009) and Halpin (2014).
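To make the edit-distance family concrete, here is a minimal base R sketch of the standard dynamic-programming recurrence behind an OM-type distance, with a Hamming-type distance as the position-wise special case. The helper names `om_dist` and `ham_dist`, the toy alphabet, and the two toy sequences are hypothetical; this is an illustration of the textbook recurrence, not the package's implementation.

```r
# Hypothetical toy alphabet and constant substitution costs (0 on the diagonal).
states <- c("A", "B", "C")
sm <- matrix(2, nrow = 3, ncol = 3, dimnames = list(states, states))
diag(sm) <- 0

# OM-type distance: cheapest mix of substitutions (costs in sm) and indels.
om_dist <- function(x, y, sm, indel = 1) {
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * indel          # delete every state of x
  D[1, ] <- (0:m) * indel          # insert every state of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      D[i + 1, j + 1] <- min(
        D[i, j] + sm[x[i], y[j]],  # substitution (or match at cost 0)
        D[i, j + 1] + indel,       # deletion
        D[i + 1, j] + indel        # insertion
      )
    }
  }
  D[n + 1, m + 1]
}

# Hamming-type distance: position-wise substitution costs, no indels allowed.
ham_dist <- function(x, y, sm) {
  stopifnot(length(x) == length(y))
  sum(sm[cbind(x, y)])             # character pairs index sm by dimnames
}

x <- c("A", "A", "B", "C")
y <- c("A", "B", "B", "C")
om_dist(x, y, sm, indel = 1)
ham_dist(x, y, sm)
```

With an indel cost of 1 and a constant substitution cost of 2, as here, a substitution never beats a deletion followed by an insertion, so the OM value should coincide with the LCS-based distance for the same pair.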

Except for a few dissimilarity measures (LCS, LCP, RLCP) that take no parameters, each method is controlled by the parameters listed in the following table.

method	parameters
------------------	---------------------------------
`OM`	`sm, indel`
`OMloc`	`sm, expcost, context, link`
`OMslen`	`sm, indel, h, link`
`OMspell`	`sm, indel, expcost, tpow`
`OMstran`	`sm, indel, otto, previous, addcolumn, transindel`
`TWED`	`sm, indel, h, nu`
`HAM, DHD`	`sm`
`EUCLID, CHI2`	`step, breaks, overlap`
`NMS, NMSMST`	`kweights`
`SVRspell`	`sm, kweights, tpow`
------------------	---------------------------------

In addition, on multiprocessor computers, the number of CPUs to use can be set with numcpu.

The sm argument is required for "OM", "OMloc", "OMslen", "OMspell", "OMstran", "TWED", and "SVRspell" and is optional for "HAM" and "DHD". For HAM, a default substitution cost of 1 is used when sm is not supplied, and for DHD the default costs are derived from the transition rates at the successive positions. When provided for DHD, sm should be a series of matrices grouped in a 3-dimensional array, with the third index referring to the position in the sequence. For "SVRspell", the sm argument must be a matrix of state proximities rather than a matrix of state dissimilarities as in all other cases.
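The two shapes of sm described above can be sketched in base R as follows. The alphabet, the number of positions, and the cost values are hypothetical, and whether seqdistOO expects particular dimnames on these objects is not stated here, so treat the labelling as an assumption.

```r
states <- c("A", "B", "C")   # hypothetical alphabet
n_pos  <- 4                  # hypothetical sequence length

# Constant substitution-cost matrix: 0 on the diagonal, 2 elsewhere.
sm_const <- matrix(2, nrow = length(states), ncol = length(states),
                   dimnames = list(states, states))
diag(sm_const) <- 0

# Position-dependent costs (DHD-style): one matrix per position,
# stacked along the third dimension of an array.
sm_dhd <- array(0, dim = c(length(states), length(states), n_pos),
                dimnames = list(states, states, NULL))
for (t in seq_len(n_pos)) {
  # Illustrative choice only: costs grow linearly with the position.
  sm_dhd[, , t] <- sm_const * t / n_pos
}

dim(sm_dhd)          # states x states x positions
sm_dhd["A", "B", 2]  # cost of substituting A by B at position 2
```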

Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist" the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality.
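The arithmetic behind three of these rules can be sketched in base R. The helper names are hypothetical. Turning Elzinga's normalized similarity into a distance as one minus that similarity, and using the indel cost as the weight in the Yujian and Bo formula, are assumptions of this sketch rather than statements about the package's code.

```r
# Abbott ("maxlength"): raw distance divided by the longer sequence length.
norm_maxlength <- function(d, l1, l2) d / max(l1, l2)

# Elzinga ("gmean"): similarity divided by the geometric mean of the lengths,
# reported here as a distance (1 - normalized similarity; an assumption).
norm_gmean <- function(sim, l1, l2) 1 - sim / sqrt(l1 * l2)

# Yujian and Bo (2007): 2 * d / (indel * (l1 + l2) + d) for an edit distance d.
norm_yujianbo <- function(d, l1, l2, indel = 1) {
  2 * d / (indel * (l1 + l2) + d)
}

# Hypothetical pair: lengths 4 and 9, longest common subsequence of length 3,
# hence a raw distance of 4 + 9 - 2 * 3 = 7.
l1 <- 4; l2 <- 9; lcs_len <- 3
d_raw <- l1 + l2 - 2 * lcs_len

norm_maxlength(d_raw, l1, l2)
norm_gmean(lcs_len, l1, l2)
norm_yujianbo(d_raw, l1, l2)
```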

When sequences contain gaps and the gaps=NA option was passed to seqdef, that is, when there are non-deleted missing values, the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap, so that users are made aware of the gaps in their sequences. For methods that need an sm value, seqdistOO expects a substitution cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution cost matrices returned by seqcost and seqsubm include these additional entries when the functions are called with with.missing=TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).

When refseq is NULL (default), the whole matrix (or a dist object when full.matrix=FALSE) of pairwise distances between sequences is returned. Otherwise a vector with distances from the sequences in the data sequence object to the reference sequence specified with refseq is returned.
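The relation between the two return shapes can be illustrated with base R alone. The 3 x 3 matrix below is a hypothetical stand-in for a seqdistOO result, not actual output; `as.dist`, `as.matrix` and `hclust` are standard functions from the stats package.

```r
# Hypothetical symmetric distance matrix for three sequences.
ids <- paste0("s", 1:3)
full <- matrix(c(0, 2, 4,
                 2, 0, 6,
                 4, 6, 0),
               nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Compact form: a dist object keeps each pair of sequences only once.
d <- as.dist(full)
length(d)

# The full matrix can be rebuilt from the dist object when needed.
back <- as.matrix(d)

# dist objects can be passed directly to clustering functions.
hc <- hclust(d, method = "average")

# Distances to a single reference sequence form a plain vector,
# here the column of the full matrix for the first sequence.
to_ref <- full[, "s1"]
```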

Studer, M. & Ritschard, G. (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: http://dx.doi.org/10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: http://dx.doi.org/10.12682/lives.2296-1658.2014.33

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. DOI: http://dx.doi.org/10.1007/978-3-319-04969-4_5

Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. DOI: http://dx.doi.org/10.1109/TPAMI.2008.76

Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions On Pattern Analysis And Machine Intelligence, 29(6), 1091–1095. DOI: http://dx.doi.org/10.1109/TPAMI.2007.1078

See also all references in Studer and Ritschard (2014, 2016).

seqcost, seqsubm, seqdef, and for multichannel distances seqdistmc.

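## Load TraMineR (seqdef and example data) and seqdist2 (seqdistOO)
library(TraMineR)
library(seqdist2)
## optimal matching distances with substitution cost matrix
## derived from transition rates
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)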
biofam.om <- seqdistOO(biofam.seq, method="OM", indel=3, sm="TRATE")

## LCS distances to an external sequence
refs <- seqdef("(0,5)-(3,5)-(4,6)", informat="SPS",
            alphabet=alphabet(biofam.seq))
biofam.ref <- seqdistOO(biofam.seq, method="LCS", refseq=refs)
hist(biofam.ref)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1, 1:13)

subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE)
ex1.om <- seqdistOO(ex1.seq, method="OM", sm=subm, with.missing=TRUE)