seqdistOO: Distances (dissimilarities) between sequences



Description

Computes pairwise dissimilarities between sequences, or dissimilarities to a reference sequence. Several dissimilarity measures are available, including optimal matching (OM) and many of its variants, distances based on the count of common attributes, and distances between within-sequence state distributions.

Usage

seqdistOO(seqdata, method, refseq=NULL, norm=FALSE,
        indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE,
        kweights=rep(1.0, ncol(seqdata)), tpow=1,
        expcost=0.5, context=1-2*expcost,
        link="mean", h=0.5, nu=0,
        transindel="constant", otto, previous=FALSE,
        addcolumn=TRUE, numcpu=1,
        breaks=NULL, step=1, overlap=FALSE, weighted=TRUE)

Arguments

seqdata

State sequence object defined with the seqdef function.

method

Character string indicating the metric to be used. One of "OM", "OMloc", "OMslen", "OMspell", "OMstran", "LCS", "HAM", "DHD", "TWED", "SVRspell", "NMS", "NMSMST", "LCP", "RLCP", "CHI2", "EUCLID" (see details)

refseq

Optional baseline sequence to compute the distances from. Can be the index of a sequence in the state sequence object, 0 for the most frequent sequence, or an external sequence passed as a sequence object with a single row and same alphabet as seqdata assigned to it.

norm

String or logical. One of "none", "maxlength", "gmean", "maxdist", "YujianBo". If TRUE the normalization method is automatically selected (see details). Default is FALSE.

indel

Scalar or numeric vector. The insertion/deletion cost (for all OM variants including TWED). Default is 1. Ignored with non-OM metrics. If it is a vector, the indel costs are state dependent (the vector length should equal the number of states).

sm

Substitution-cost matrix (for the OM variants, TWED, SVRspell, HAM and DHD methods). Can also be one of the seqcost methods "TRATE" or "CONSTANT", in which case the sm matrix is built with that method. Default is NA. A valid value MUST be given for method="OM".

with.missing

Logical. Must be TRUE when sequences contain non-deleted gaps (missing values). See details.

full.matrix

Logical. If TRUE (default), the full distance matrix is returned. This is for compatibility with earlier versions of the seqdist function. If FALSE, an object of class dist is returned, that is, a vector containing only values from the upper triangle of the distance matrix. Objects of class dist are smaller and can be passed directly as arguments to most clustering functions. Ignored when refseq is set.

kweights

Numerical vector with at position k the weight applied to subsequences of length k (for all NMS distances).

expcost

Scalar. Cost of spell length transformation (the exact interpretation is distance dependent). For OMspell and OMloc.

context

Scalar. Cost of local insertion (for OMloc only).

link

Function name. Function used to compute some internal costs.
For OMloc, the function used in the dynamic computation of insertion/deletion costs; one of "mean" (default), "min" (minimum) or "previous" (previous state only).
For "OMslen", the function used to compute substitution costs; one of "mean" (arithmetic average) or "gmean" (geometric mean as in the original proposition of Halpin 2010).

h

Scalar. Exponential weight of spell length (for OMslen).

tpow

Scalar. Exponential weight of spell length (for OMspell and SVRspell).

transindel

Character string. Method for computing transition indel costs. One of "constant" (single indel of 1), "subcost" (based on substitution costs), "prob" (based on transition probabilities) (for OMstran).

otto

Scalar. Origin-transition trade-off weight (for OMstran).

previous

Logical. If TRUE, add the transition from the previous state (for OMstran).

addcolumn

Logical. If TRUE, duplicate the first (if previous=TRUE) and the last column (for OMstran).

breaks

An optional list containing the spells used to build the intervals (for CHI2).

step

The length of the intervals (only used if breaks is NULL) (for CHI2).

weighted

Logical. Should distributions account for the sequence weights in the state sequence object (see seqdef)? Set to FALSE to ignore the weights (for CHI2, OMstran).

overlap

Logical. Are overlapping spells allowed? (for CHI2, EUCLID).

nu

Scalar. Stiffness (for TWED). See Halpin (2014), p. 88.

numcpu

Scalar. Number of CPUs to use.

Details

The seqdistOO function returns a matrix of distances between sequences, or a vector of distances to the reference sequence when refseq is set. The available metrics are those listed for the method argument.

See Studer and Ritschard (2014) for a description and the comparison of the above dissimilarity measures except TWED for which we refer to Marteau (2009) and Halpin (2014).

Except for a few dissimilarity measures (LCS, LCP, RLCP, NMS) that do not require parameters, each method can be controlled with some parameters as described in the following table.

method              parameters
------------------  --------------------------------------------------
OM                  sm, indel
OMloc               sm, expcost, context, link
OMslen              sm, indel, h, link
OMspell             sm, indel, expcost, tpow
OMstran             sm, indel, otto, previous, addcolumn, transindel
TWED                sm, indel, h, nu
HAM, DHD            sm
EUCLID, CHI2        step, breaks, overlap
NMS, NMSMST         kweights
SVRspell            sm, kweights, tpow
------------------  --------------------------------------------------

In addition, on multiprocessor computers, the number of CPUs to be used can be set with numcpu.
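
To make the role of the sm and indel parameters listed for OM more concrete, here is a minimal base R sketch of an optimal matching (edit) distance computed by dynamic programming. The function om_dist and the toy data are hypothetical illustrations; this is not the seqdistOO implementation, and it handles only a scalar indel cost.

```r
# Illustrative sketch only: a textbook edit-distance recursion, not seqdistOO.
# x, y : character vectors of states
# sm   : substitution-cost matrix with state names as dimnames
# indel: scalar insertion/deletion cost
om_dist <- function(x, y, sm, indel = 1) {
  n <- length(x)
  m <- length(y)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel   # delete all remaining states of x
  d[1, ] <- (0:m) * indel   # insert all remaining states of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j] + sm[x[i], y[j]],  # substitution (cost 0 on a match)
        d[i, j + 1] + indel,       # deletion
        d[i + 1, j] + indel        # insertion
      )
    }
  }
  d[n + 1, m + 1]
}

# Hypothetical three-state alphabet with a constant substitution cost of 2
states <- c("A", "B", "C")
sm <- matrix(2, 3, 3, dimnames = list(states, states))
diag(sm) <- 0

x <- c("A", "A", "B", "C")
y <- c("A", "B", "B", "C")
om_dist(x, y, sm, indel = 1)
```

With a substitution cost of 2 and an indel cost of 1, a substitution costs the same as one deletion plus one insertion, which is the cost setting usually associated with the LCS distance.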

The sm argument is required for "OM", "OMloc", "OMslen", "OMspell", "OMstran", "TWED", and "SVRspell" and is optional for "HAM" and "DHD". For HAM a default substitution cost of 1 is used when no sm is given, and for DHD the default costs are derived from the transition rates at the successive positions. When provided for DHD, sm should be a series of matrices grouped in a 3-dimensional array with the third index referring to the position in the sequence. For "SVRspell", the sm argument must be a matrix of state proximities rather than a matrix of state dissimilarities as in all other cases.
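
The difference between a single substitution-cost matrix (HAM) and a position-dependent 3-dimensional array (DHD) can be sketched in base R as follows. The helpers ham_dist and dhd_dist, the alphabet, and the cost values are assumptions made up for illustration; they only mirror the layout described above and are not the package's code.

```r
# Illustrative sketch only: hypothetical alphabet, costs and helper functions.
states <- c("A", "B", "C")
n_pos <- 4  # sequence length

# One constant-cost matrix (cost 1 off the diagonal), as a HAM-style default
sm_const <- matrix(1, 3, 3, dimnames = list(states, states))
diag(sm_const) <- 0

# One matrix per position, stacked along the third index (DHD-style layout)
sm_dhd <- array(rep(sm_const, n_pos), dim = c(3, 3, n_pos),
                dimnames = list(states, states, NULL))
sm_dhd[, , 4] <- 2 * sm_const  # made-up: mismatches at position 4 cost twice as much

# Sum of position-wise substitution costs with a single matrix
ham_dist <- function(x, y, sm) {
  sum(vapply(seq_along(x), function(t) sm[x[t], y[t]], numeric(1)))
}

# Same sum, but the cost matrix changes with the position t
dhd_dist <- function(x, y, sm3) {
  sum(vapply(seq_along(x), function(t) sm3[x[t], y[t], t], numeric(1)))
}

x <- c("A", "B", "B", "C")
y <- c("A", "C", "B", "A")
ham_dist(x, y, sm_const)  # two mismatching positions, each of cost 1
dhd_dist(x, y, sm_dhd)    # the mismatch at position 4 now weighs 2
```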

Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist" the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality.
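
As a worked illustration of the two rules named above, the following base R sketch computes an LCS-based distance for two short sequences and normalizes it both ways. It assumes that the raw LCS distance is |x| + |y| - 2 LCS(x, y) and that the "gmean" rule yields one minus the similarity divided by the geometric mean of the lengths; both are assumptions made for this example, and lcs_length is a hypothetical helper, not a package function.

```r
# Illustrative sketch only: formulas are assumptions stated in the lead-in.
# Length of the longest common subsequence by dynamic programming
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  L <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1, j + 1] <- if (x[i] == y[j]) {
        L[i, j] + 1
      } else {
        max(L[i, j + 1], L[i + 1, j])
      }
    }
  }
  L[n + 1, m + 1]
}

x <- c("A", "A", "B", "C")
y <- c("A", "B", "C")

lcs <- lcs_length(x, y)                    # common subsequence A-B-C
raw <- length(x) + length(y) - 2 * lcs     # assumed raw LCS distance

# Abbott-style rule: distance divided by the length of the longer sequence
d_maxlength <- raw / max(length(x), length(y))

# Elzinga-style rule: similarity divided by the geometric mean of the lengths,
# turned back into a distance by subtracting from 1
d_gmean <- 1 - lcs / sqrt(length(x) * length(y))

c(raw = raw, maxlength = d_maxlength, gmean = d_gmean)
```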

When sequences contain gaps and the gaps=NA option was passed to seqdef (i.e. when there are non-deleted missing values), the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap, which makes the user aware that the sequences contain gaps. For methods that need an sm value, seqdistOO expects a substitution cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution cost matrices returned by seqcost and seqsubm include these additional entries when the functions are called with with.missing=TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).

Value

When refseq is NULL (default), the whole matrix (or a dist object when full.matrix=FALSE) of pairwise distances between sequences is returned. Otherwise a vector with distances from the sequences in the data sequence object to the reference sequence specified with refseq is returned.
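
The relation between the two return formats can be illustrated with base R alone. The small matrix below is made-up data standing in for a matrix returned with full.matrix=TRUE; the sketch only shows how a full symmetric matrix relates to an object of class dist and how the latter feeds a clustering function.

```r
# Illustrative sketch only: m is a made-up 3 x 3 distance matrix.
ids <- paste0("s", 1:3)
m <- matrix(c(0, 2, 4,
              2, 0, 6,
              4, 6, 0),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

d <- as.dist(m)   # compact form: one value per pair of sequences
length(d)         # n * (n - 1) / 2 values instead of n * n

# Converting back recovers the full symmetric matrix
all(as.matrix(d) == m)

# A dist object can be passed directly to clustering functions such as hclust
hc <- hclust(d, method = "average")
```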

References

Studer, M. & Ritschard, G. (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: http://dx.doi.org/10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: http://dx.doi.org/10.12682/lives.2296-1658.2014.33

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. DOI: http://dx.doi.org/10.1007/978-3-319-04969-4_5

Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. DOI: http://dx.doi.org/10.1109/TPAMI.2008.76

Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions On Pattern Analysis And Machine Intelligence, 29(6), 1091–1095. DOI: http://dx.doi.org/10.1109/TPAMI.2007.1078

See also all references in Studer and Ritschard (2014, 2016).

See Also

seqcost, seqsubm, seqdef, and for multichannel distances seqdistmc.

Examples

## optimal matching distances with substitution cost matrix
## derived from transition rates
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
biofam.om <- seqdistOO(biofam.seq, method="OM", indel=3, sm="TRATE")

## LCS distances to an external sequence
refs <- seqdef("(0,5)-(3,5)-(4,6)", informat="SPS",
            alphabet=alphabet(biofam.seq))
biofam.ref <- seqdistOO(biofam.seq, method="LCS", refseq=refs)
hist(biofam.ref)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1,1:13)

subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE)
ex1.om <- seqdistOO(ex1.seq, method="OM", sm=subm, with.missing=TRUE)

seqdist2 documentation built on May 2, 2019, 4:53 p.m.
