knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(findMatch)

Creating some random data sets

For one point in time

A small dataset

d <- generateTestData(6)
head(d)

A small dataset with an additional variable points

d <- generateTestData(6, points=function(n) { sample(0:20, size=n, replace=TRUE)} )
head(d)

A small dataset with overwriting the birthplace with a vector

d <- generateTestData(6, birthplace=c("Berlin", "Hamburg", "Köln", "München"))
head(d)

A small dataset with overwriting the birthplace with a function

d <- generateTestData(6, birthplace=function(n) {
                            sample(c("Berlin", "Hamburg", "Köln", "München"), 
                                   size=n, replace=TRUE, 
                                   prob=c(3520031, 1787408, 1060582, 1450381))
                      }) 
head(d)

For more two points in time

Two small data sets with 6 observations at t1 and 4 observations at t2 without overlapping observations

d <- generateTestData(c(6, 4))
str(d)

For two data sets with 6 observations only in t1, 4 observations only in t2 and 5 observations only in t1 and t2 you have to construct a list of vectors. Each vector has as first entry the number of observations and as further entries the number of the timepoints these observations should be. For example

This creates two data frames with an appropriate observation structure

# t1: 6+5=11 observations
# t2: 4+5=9 observations
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
str(n)
d <- generateTestData(n)
str(d)

For more than two points in time

Three data frames with

# t1: 6+5+8+7=26 observations
# t2: 4+5+3+7=19 observations
# t3: 2+8+3+7=20 observations
n <- list(c(6, 1), c(4, 2), c(2, 3), c(5, 1, 2), c(8, 1, 3), c(3, 2, 3), c(7, 1, 2, 3))
str(n)
d <- generateTestData(n)
str(d)

Find matches between two points in time

At first we generate two test data sets and then match on the code variables using the Levenshtein distance. We most likely found 5 matches.

n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
data  <- generateTestData(n)
vars  <-  c("code", "code")
match <- findMatch(data, vars)
# 
summary(match)
# 
head(match)

The summary tells us that we found 5 perfect matches with Levensthein distances of zero.

Each line in head should read as follows:

We may allow for a larger Levenshtein distance and find two more possible matches, but they are not exact

set.seed(0)
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
data  <- generateTestData(n)
vars  <-  c("code", "code")
match <- findMatch(data, vars, dmax=5)
# 
summary(match)
# 
head(match)

Find matches between more than two points in time



sigbertklinke/findMatch documentation built on July 12, 2019, 9:22 a.m.