Home

/

GitHub

/

In sigbertklinke/findMatch: Matches observations based on strings with the Levensthein distance

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(findMatch)

Creating some random data sets

For one point in time

A small dataset

d <- generateTestData(6)
head(d)

A small dataset with an additional variable points

d <- generateTestData(6, points=function(n) { sample(0:20, size=n, replace=TRUE)} )
head(d)

A small dataset with overwriting the birthplace with a vector

d <- generateTestData(6, birthplace=c("Berlin", "Hamburg", "Köln", "München"))
head(d)

A small dataset with overwriting the birthplace with a function

d <- generateTestData(6, birthplace=function(n) {
                            sample(c("Berlin", "Hamburg", "Köln", "München"), 
                                   size=n, replace=TRUE, 
                                   prob=c(3520031, 1787408, 1060582, 1450381))
                      }) 
head(d)

For more two points in time

Two small data sets with 6 observations at t1 and 4 observations at t2 without overlapping observations

d <- generateTestData(c(6, 4))
str(d)

For two data sets with 6 observations only in t1, 4 observations only in t2 and 5 observations only in t1 and t2 you have to construct a list of vectors. Each vector has as first entry the number of observations and as further entries the number of the timepoints these observations should be. For example

c(6, 1) means 6 observations only at t1 or
c(4, 2) means 4 observations only at t2 or
c(5, 1, 2) means 5 observations at t1 and t2.

This creates two data frames with an appropriate observation structure

# t1: 6+5=11 observations
# t2: 4+5=9 observations
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
str(n)
d <- generateTestData(n)
str(d)

For more than two points in time

Three data frames with

6 observations only in t1,
4 observations only in t2,
2 observations only in t3,
5 observations in t1 and t2,
8 observations in t1 and t3,
3 observations in t2 and t3 and
7 observations in t1, t2 and t3.

# t1: 6+5+8+7=26 observations
# t2: 4+5+3+7=19 observations
# t3: 2+8+3+7=20 observations
n <- list(c(6, 1), c(4, 2), c(2, 3), c(5, 1, 2), c(8, 1, 3), c(3, 2, 3), c(7, 1, 2, 3))
str(n)
d <- generateTestData(n)
str(d)

Find matches between two points in time

At first we generate two test data sets and then match on the code variables using the Levenshtein distance. We most likely found 5 matches.

n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
data  <- generateTestData(n)
vars  <-  c("code", "code")
match <- findMatch(data, vars)
# 
summary(match)
# 
head(match)

The summary tells us that we found 5 perfect matches with Levensthein distances of zero.

Each line in head should read as follows:

line.1: observation number in data frame 1
line.2: observation number in data frame 2
uid.1, uid.2: unique ids over all data sets for observation, if not given then dataset:lineno is used
idn.0.ZDV: common code created from vars
idn.1.code, idn.2.code: codes compared with
leven.1, leven.2: Levensthein distances between common code and codes in the two data frames
leven.V3: sum of leven.1 and leven.2

We may allow for a larger Levenshtein distance and find two more possible matches, but they are not exact

set.seed(0)
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
data  <- generateTestData(n)
vars  <-  c("code", "code")
match <- findMatch(data, vars, dmax=5)
# 
summary(match)
# 
head(match)

Find matches between more than two points in time

sigbertklinke/findMatch documentation built on July 12, 2019, 9:22 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

sigbertklinke/findMatch
Matches observations based on strings with the Levensthein distance

In sigbertklinke/findMatch: Matches observations based on strings with the Levensthein distance

Creating some random data sets

For one point in time

For more two points in time

For more than two points in time

Find matches between two points in time

Find matches between more than two points in time

R Package Documentation

Browse R Packages

We want your feedback!

sigbertklinke/findMatch Matches observations based on strings with the Levensthein distance

In sigbertklinke/findMatch: Matches observations based on strings with the Levensthein distance

Creating some random data sets

For one point in time

For more two points in time

For more than two points in time

Find matches between two points in time

Find matches between more than two points in time

R Package Documentation

Browse R Packages

We want your feedback!

sigbertklinke/findMatch
Matches observations based on strings with the Levensthein distance