# similarity-methods: Compute Similarities In arulesSequences: Mining Frequent Sequences

## Description

Provides the generic function `similarity` and the S4 method to compute similarities among a collection of sequences.

`is.subset` and `is.superset` find subsequence or supersequence relationships among a collection of sequences.

## Usage

```r
similarity(x, y = NULL, ...)

## S4 method for signature 'sequences'
similarity(x, y = NULL,
    method = c("jaccard", "dice", "cosine", "subset"), strict = FALSE)

## S4 method for signature 'sequences'
is.subset(x, y = NULL, proper = FALSE)

## S4 method for signature 'sequences'
is.superset(x, y = NULL, proper = FALSE)
```

## Arguments

| Argument | Description |
|----------|-------------|
| `x, y` | an object. |
| `...` | further (unused) arguments. |
| `method` | a string specifying the similarity measure to use (see details). |
| `strict` | a logical value specifying if strict itemset matching should be used. |
| `proper` | a logical value specifying if only strict relationships (omitting equality) should be indicated. |

## Details

The number of common elements of two sequences is the number of elements that occur in a longest common subsequence. The following similarity measures are implemented:

`jaccard`:

The number of common elements divided by the total number of elements (the sum of the lengths of the sequences minus the length of the longest common subsequence).

`dice`:

Uses two times the number of common elements.

`cosine`:

Uses the square root of the product of the sequence lengths for the denominator.

`subset`:

Zero if the first sequence is not a subsequence of the second. Otherwise the number of common elements divided by the number of elements in the first sequence.
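The four measures can be written compactly as follows. This is a sketch under assumed notation that does not appear in the package documentation: $c$ is the number of common elements (the length of a longest common subsequence), and $n$ and $m$ are the lengths of the first and second sequence. The denominator for `dice` is not spelled out above; the sum of the lengths used here is the standard Dice coefficient and is an assumption.

$$
\text{jaccard} = \frac{c}{n + m - c}, \qquad
\text{dice} = \frac{2c}{n + m}, \qquad
\text{cosine} = \frac{c}{\sqrt{n\,m}}, \qquad
\text{subset} =
\begin{cases}
c / n & \text{if the first sequence is a subsequence of the second} \\
0 & \text{otherwise}
\end{cases}
$$

Under these definitions $\text{dice} = 2\,\text{jaccard} / (1 + \text{jaccard})$, which agrees with the example output on this page: a Jaccard value of 0.2 corresponds to a Dice value of 0.3333333, and 0.5 corresponds to 0.6666667.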

If `strict = TRUE` the elements (itemsets) of the sequences must be equal to be matched. Otherwise matches are quantified by the similarity of the itemsets (as specified by `method`) thresholded at 0.5, and the common sequence by the sum of the similarities.

## Value

For `similarity`, returns an object of class `dsCMatrix` if the result is symmetric (or `method = "subset"`) and an object of class `dgCMatrix` otherwise.

For `is.subset` and `is.superset`, returns an object of class `lgCMatrix`.

## Note

Computation of the longest common subsequence of two sequences of length `n, m` takes `O(n*m)` time.
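The quadratic cost comes from the usual dynamic-programming table with one cell per pair of positions. The following base R sketch illustrates the idea; it is not the package's implementation. The function names `lcs_length` and `seq_similarity` are hypothetical, each element is assumed to be a single symbol compared by exact equality (a stand-in for `strict = TRUE`), and the `dice` denominator is assumed to be the sum of the sequence lengths.

```r
# Length of a longest common subsequence of two vectors,
# filled in with an (n + 1) x (m + 1) table, hence O(n * m) time.
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # L[i + 1, j + 1] holds the LCS length of x[1:i] and y[1:j]
  L <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1L, j + 1L] <- if (x[i] == y[j]) {
        L[i, j] + 1L                       # extend a common subsequence
      } else {
        max(L[i, j + 1L], L[i + 1L, j])    # drop one element from x or y
      }
    }
  }
  L[n + 1L, m + 1L]
}

# Similarity of two sequences from the LCS length k and the lengths n, m.
seq_similarity <- function(x, y,
                           method = c("jaccard", "dice", "cosine", "subset")) {
  method <- match.arg(method)
  n <- length(x)
  m <- length(y)
  k <- lcs_length(x, y)
  switch(method,
    jaccard = k / (n + m - k),
    dice    = 2 * k / (n + m),
    cosine  = k / sqrt(n * m),
    # with exact matching, x is a subsequence of y exactly when k == n
    subset  = if (k == n) k / n else 0
  )
}

x <- c("A", "B", "C", "D")
y <- c("B", "D", "E")
lcs_length(x, y)                 # 2 (the common subsequence B, D)
seq_similarity(x, y, "jaccard")  # 2 / (4 + 3 - 2) = 0.4
seq_similarity(x, y, "subset")   # 0, since x is not a subsequence of y
```

Real `sequences` objects hold itemsets rather than single symbols, so this sketch does not cover the thresholded matching used when `strict = FALSE`.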

The supported set of operations for the above matrix classes depends on package Matrix. In case of problems, expand to full storage representation using `as(x, "matrix")` or `as.matrix(x)`.

For efficiency use `as(x, "dist")` to convert a symmetric result matrix for clustering.
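As a minimal sketch of the clustering step, the following uses only base R. The matrix `s` is typed in by hand from the default `similarity(z)` example output on this page and stands in for an expanded result such as `as.matrix(similarity(z))`. Converting similarities to dissimilarities as `1 - s` and using average linkage are assumptions made for illustration, not something the package prescribes.

```r
# Similarities copied from the example output; "." entries in the
# sparse print-out are zeros.
s <- matrix(c(1.00, 0.20, 0.25, 0,
              0.20, 1.00, 0.50, 0,
              0.25, 0.50, 1.00, 0,
              0,    0,    0,    1),
            nrow = 4, byrow = TRUE,
            dimnames = list(as.character(1:4), as.character(1:4)))

# Clustering functions expect dissimilarities, so flip the scale
# (assumption: 1 - similarity is an acceptable dissimilarity here).
d <- as.dist(1 - s)

# Agglomerative clustering on the "dist" object, cut into two groups.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
groups
```

With these values, sequences 1 to 3 fall into one group and sequence 4, which shares nothing with the others, forms its own group.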

## Author(s)

Christian Buchta

## See Also

Class `sequences`, method `dissimilarity`.

## Examples

```r
## use example data
data(zaki)
z <- as(zaki, "timedsequences")
similarity(z)

# require equality
similarity(z, strict = TRUE)

## emphasize common
similarity(z, method = "dice")

##
is.subset(z)
is.subset(z, proper = TRUE)
```

### Example output

```Loading required package: arules

Attaching package: 'arules'

The following objects are masked from 'package:base':

abbreviate, write

4 x 4 sparse Matrix of class "dsCMatrix"
     1   2    3 4
1 1.00 0.2 0.25 .
2 0.20 1.0 0.50 .
3 0.25 0.5 1.00 .
4 .    .   .    1
4 x 4 sparse Matrix of class "dsCMatrix"
     1   2    3 4
1 1.00 0.2 0.25 .
2 0.20 1.0 0.50 .
3 0.25 0.5 1.00 .
4 .    .   .    1
4 x 4 sparse Matrix of class "dsCMatrix"
          1         2         3 4
1 1.0000000 0.3333333 0.4000000 .
2 0.3333333 1.0000000 0.6666667 .
3 0.4000000 0.6666667 1.0000000 .
4 .         .         .         1
4 x 4 sparse Matrix of class "lgCMatrix"
  1 2 3 4
1 | . . .
2 . | . .
3 | | | .
4 . . . |
4 x 4 sparse Matrix of class "lgCMatrix"
  1 2 3 4
1 . . . .
2 . . . .
3 | | . .
4 . . . .
```

arulesSequences documentation built on July 2, 2020, 4:09 a.m.