Create a substitution-cost matrix

Description

The substitution-cost matrix is used when computing distances between sequences by optimal matching. The function builds the matrix either from a constant value or from the transition rates computed from the sequence data; other methods may be added in the future.

Usage

seqsubm(seqdata, method, cval=NULL, with.missing=FALSE,
        miss.cost=NULL, time.varying=FALSE, weighted=TRUE,
        transition="both", lag=1, missing.trate=FALSE)

Arguments

seqdata

a sequence object as returned by the seqdef function.

method

method used to compute the substitution costs. Currently available: a constant value (method="CONSTANT") or costs derived from transition rates (method="TRATE").

cval

the constant substitution cost if method "CONSTANT" is chosen. For method "TRATE", the base value from which transition probabilities are subtracted. If NULL, cval=2, unless transition is set to "both" and time.varying is TRUE, in which case cval=4.

with.missing

if TRUE, an additional entry is added in the matrix for the missing states. Hence, a new "missing" state is added to the list of "valid" states. Use this if you want to compute distances with missing values inside the sequences. See Gabadinho et al. (2010) for more details on the options for handling missing values when computing distances between sequences.

miss.cost

the substitution cost for the missing state. Defaults to cval.

time.varying

Logical. If TRUE, returns an array containing a distinct matrix for each time unit. Time is the third dimension (subscript) of the array.

weighted

Logical. If TRUE compute transition rates using weights specified in seqdata.

transition

Only used if time.varying=TRUE. If transition="both", the costs use the transition rates from both the previous and the next state. Can also be set to "previous" or "next".

lag

Integer. Only used with method="TRATE". Time between the two states considered when computing transition rates (1 by default).

missing.trate

Logical. Only used with method="TRATE". If TRUE, substitution costs involving the missing state are also based on transition rates. If FALSE (the default), the substitution costs for the missing state are set to miss.cost.
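
As a rough illustration of how cval, with.missing and miss.cost interact in the constant case, the following base-R sketch builds a small matrix by hand. It does not call TraMineR: the helper name constant_costs, its missing.label argument and the toy alphabet are hypothetical, and the "*" label for the missing state and the zero diagonal are assumptions made for the illustration.

```r
# Hypothetical helper: a hand-built constant substitution-cost matrix.
# It mimics the documented behaviour; it is not TraMineR's implementation.
constant_costs <- function(states, cval = 2, with.missing = FALSE,
                           miss.cost = NULL, missing.label = "*") {
  # miss.cost falls back to cval, as stated in the argument description
  if (is.null(miss.cost)) miss.cost <- cval
  labels <- if (with.missing) c(states, missing.label) else states
  ns <- length(labels)
  costs <- matrix(cval, nrow = ns, ncol = ns,
                  dimnames = list(labels, labels))
  if (with.missing) {
    # The row and column of the extra "missing" state use miss.cost
    costs[missing.label, ] <- miss.cost
    costs[, missing.label] <- miss.cost
  }
  # Assumption: substituting a state with itself costs nothing
  diag(costs) <- 0
  costs
}

sm <- constant_costs(c("A", "B", "C"), with.missing = TRUE, miss.cost = 1)
print(sm)
```

With with.missing=TRUE the matrix gains one row and one column, so an alphabet of three states yields a 4 x 4 matrix.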

Details

The substitution-cost matrix has dimension ns*ns, where ns is the number of states in the alphabet of the sequence object. The element (i,j) of the matrix is the cost of substituting state i with state j.

With the "CONSTANT" method, the substitution costs are the same for all the states, with a default value of 2. An alternative value can be provided by the user. When the "TRATE" (transition rates) method is chosen, the transition rates between all states are computed using the seqtrate function. The substitution cost between states i and j is obtained with the formula

SC(i,j) = cval - P(i,j) - P(j,i)

where P(i,j) is the transition rate from state i to j.
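
The formula can be checked by hand on a small example. The sketch below uses base R only and made-up toy sequences; it estimates transition rates at lag 1 by simple counting and then applies the formula. It is meant as an illustration of the arithmetic, not as a copy of what seqtrate or seqsubm do internally, and setting the diagonal to zero is an assumption of the sketch.

```r
# Toy data: 4 made-up sequences of length 5 over the alphabet {A, B, C}
seqs <- rbind(
  c("A", "A", "B", "B", "C"),
  c("A", "B", "B", "C", "C"),
  c("A", "A", "A", "B", "C"),
  c("B", "B", "C", "C", "C")
)
states <- c("A", "B", "C")

# Pair each state at position t with the state at position t + 1 (lag 1)
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)
counts <- table(from, to)

# Transition rates P(i,j): each row of counts divided by its row total
P <- matrix(as.numeric(counts / rowSums(counts)), nrow = length(states),
            dimnames = list(states, states))

# SC(i,j) = cval - P(i,j) - P(j,i)
cval <- 2
SC <- cval - P - t(P)
diag(SC) <- 0  # assumption: a state substituted with itself costs nothing
print(round(SC, 3))
```

Because each rate lies between 0 and 1, the off-diagonal costs obtained with cval=2 lie between 0 and 2, and pairs of states with frequent transitions between them get the lowest costs.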

Author(s)

Matthias Studer and Alexis Gabadinho (first version) (with Gilbert Ritschard for the help page)

References

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

See Also

seqtrate, seqdef, seqdist.

Examples

## Defining a sequence object with columns 10 to 25
## in the 'biofam' example data set
data(biofam)
biofam.seq <- seqdef(biofam,10:25)

## Optimal matching using transition rates based substitution-cost matrix
## and insertion/deletion costs of 3
trcost <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq,method="OM",indel=3,sm=trcost)

## Optimal matching using constant value (2) substitution-cost matrix
## and insertion/deletion costs of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM",indel=3,sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10,1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1,1:13, weights=ex1$weights)

## Unweighted
subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", sm=subm, with.missing=TRUE)

## Weighted
subm.w <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", sm=subm.w, with.missing=TRUE)

ex1.om == ex1.omw
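
The weighted and unweighted calls above can give different costs because the transition rates themselves change when cases carry unequal weights. The base-R sketch below illustrates this on made-up data, under the assumption that each transition is counted with the weight of the sequence it belongs to; the sequences, the weights and the helper name rates are hypothetical and the code does not call TraMineR.

```r
# Toy data: 3 made-up sequences with hypothetical case weights
seqs <- rbind(
  c("A", "A", "B"),
  c("A", "B", "B"),
  c("B", "B", "A")
)
w <- c(1, 3, 1)
states <- c("A", "B")

# Lag-1 transitions; as.vector() reads the matrix column by column
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)

# One weight per transition: the sequence weights repeat for each column
wt <- rep(w, times = ncol(seqs) - 1)

# Hypothetical helper: transition rates from (possibly weighted) counts
rates <- function(weights) {
  counts <- xtabs(weights ~ from + to)
  matrix(as.numeric(counts / rowSums(counts)), nrow = length(states),
         dimnames = list(states, states))
}

P.unweighted <- rates(rep(1, length(from)))
P.weighted   <- rates(wt)
print(P.unweighted)
print(P.weighted)
```

Here the second sequence has three times the weight of the others, so its A to B transition dominates the weighted rate.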
