
Smith Waterman

Smith-Waterman is an algorithm for identifying similarities between sequences. It finds an optimal local alignment between two sequences of letters and is explained in detail at https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm.

This package implements the algorithm for sequences of letters as well as sequences of words, which makes it useful for text analytics researchers.
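
To make the idea of a local alignment score concrete, here is a minimal base-R sketch of the Smith-Waterman scoring recurrence. It is an illustration only, not the package's implementation: the function name `sw_sketch` and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for this example. Because it works on any vector of tokens, the same function can be applied to letters or to words.

```r
# Illustrative sketch only: sw_sketch() and its default scores are
# assumptions for this example, not the package's own code.
sw_sketch <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  n <- length(a)
  m <- length(b)
  # H[i + 1, j + 1] is the best score of a local alignment ending at a[i], b[j]
  H <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (a[i] == b[j]) match else mismatch
      H[i + 1L, j + 1L] <- max(
        0L,                  # start a fresh local alignment
        H[i, j] + s,         # align a[i] with b[j]
        H[i, j + 1L] + gap,  # a[i] aligned against a gap
        H[i + 1L, j] + gap   # b[j] aligned against a gap
      )
    }
  }
  # Position of the highest score marks where the best local alignment ends
  end <- which(H == max(H), arr.ind = TRUE)[1, ]
  list(score = max(H), a_to = unname(end[1]) - 1L, b_to = unname(end[2]) - 1L)
}

letters_a <- strsplit("tournelly", "")[[1]]
letters_b <- strsplit("bourelly", "")[[1]]
res <- sw_sketch(letters_a, letters_b)
res$score  # 13: seven matching letters (2 each) minus one gap for the "n"
```

The floor at zero is what makes the alignment local: a poorly matching prefix never drags down the score of a well-matching region that follows it.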

Example usage

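The package was set up in order to easily match names, match translations, find relevant sequences of text in other texts and get an alignment overview as a data.frame.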

We show some examples of these use cases below.

library(text.alignment)

Example matching 2 names

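# Align two records that share a first name but differ in surname and occupation
a <- "Gaspard   Tournelly cardeur à laine"
b <- "Gaspard   Bourelly cordonnier"
smith_waterman(a, b)

# Align an abbreviated record with the full one, letter by letter
a <- "Gaspard   T.  cardeur à laine"
b <- "Gaspard   Tournelly cardeur à laine"
smith_waterman(a, b, type = "characters")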

Example matching 2 translations

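# Read the two example texts shipped with the package into single strings
a <- system.file(package = "text.alignment", "extdata", "example1.txt")
a <- readLines(a)
a <- paste(a, collapse = "\n")
b <- system.file(package = "text.alignment", "extdata", "example2.txt")
b <- readLines(b)
b <- paste(b, collapse = "\n")
# Print both texts for inspection
cat(a, sep = "\n")
cat(b, sep = "\n")
# Align the texts word by word instead of letter by letter
smith_waterman(a, b, type = "words")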
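
Word-level alignment compares whole tokens rather than single letters, so the texts first have to be split into words. The base-R sketch below shows one simple way to do that. The helper `to_words` is hypothetical, and splitting on whitespace after lower-casing is an assumption made for this example; the package's own tokenizer may behave differently.

```r
# Hypothetical helper: lower-case a text and split it into word tokens
# on runs of whitespace. Not the package's tokenizer.
to_words <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[[:space:]]+"))
  tokens[nzchar(tokens)]  # drop empty tokens from leading whitespace
}

words_a <- to_words("Gaspard   Tournelly cardeur")
words_b <- to_words("Gaspard   Bourelly cordonnier")

# Words of the first text that also occur in the second
shared <- words_a[words_a %in% words_b]
```

With word tokens, "Tournelly" and "Bourelly" count as a plain mismatch, whereas a letter-level comparison would still reward the letters they share.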

Find relevant sequences of texts in other texts

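# Look for the phrase "Lange rei" inside text b
x <- smith_waterman("Lange rei", b)
# Tokens of b covered by the alignment
x$b$tokens[x$b$alignment$from:x$b$alignment$to]
# Start and end positions of the match in b
overview <- as.data.frame(x)
overview$b_from
overview$b_to
# Extract the matching part of b using those positions
substr(overview$b, overview$b_from, overview$b_to)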

Get alignment overview as a data.frame

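# Convert the alignment to a data.frame labelled with an alignment_id
x <- smith_waterman(a, b)
x <- as.data.frame(x, alignment_id = "matching-a-to-b")
# Inspect the columns of the overview
str(x)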


DIGI-VUB/text.alignment documentation built on Sept. 18, 2023, 7:26 a.m.