knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(findMatch)
A small dataset
d <- generateTestData(6)
head(d)
A small dataset with an additional variable points
d <- generateTestData(6, points = function(n) {
  sample(0:20, size = n, replace = TRUE)
})
head(d)
A small dataset that overrides the birthplace with a vector
d <- generateTestData(6, birthplace = c("Berlin", "Hamburg", "Köln", "München"))
head(d)
A small dataset that overrides the birthplace with a function
d <- generateTestData(6, birthplace = function(n) {
  sample(c("Berlin", "Hamburg", "Köln", "München"),
         size = n, replace = TRUE,
         prob = c(3520031, 1787408, 1060582, 1450381))
})
head(d)
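The weights passed to prob above do not sum to one; base R's sample() accepts such weights and rescales them internally. The sketch below uses only base R and does not call findMatch. It shows the shares these weights imply and draws a larger sample to compare against them. The names cities, weights, shares and draw are illustrative only.

```r
# Illustrative names; only base R is used here.
cities  <- c("Berlin", "Hamburg", "Köln", "München")
weights <- c(3520031, 1787408, 1060582, 1450381)

# Shares implied by the raw weights (they sum to one after rescaling).
shares <- prop.table(setNames(weights, cities))
print(round(shares, 3))

# sample() accepts unnormalised weights in prob.
set.seed(1)
draw <- sample(cities, size = 1000, replace = TRUE, prob = weights)
print(table(draw) / length(draw))
```

With a large enough sample, the observed frequencies should be close to the implied shares.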
Two small data sets with 6 observations at t1 and 4 observations at t2 without overlapping observations
d <- generateTestData(c(6, 4))
str(d)
To get two data sets with 6 observations only in t1, 4 observations only in t2, and 5 observations in both t1 and t2, you have to construct a list of vectors. The first entry of each vector is the number of observations; the remaining entries are the timepoints at which these observations should appear. For example, c(6, 1) means 6 observations only at t1, c(4, 2) means 4 observations only at t2, and c(5, 1, 2) means 5 observations at both t1 and t2. This creates two data frames with an appropriate observation structure
# t1: 6+5=11 observations
# t2: 4+5=9 observations
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
str(n)
d <- generateTestData(n)
str(d)
Three data frames with the following numbers of observations
# t1: 6+5+8+7=26 observations
# t2: 4+5+3+7=19 observations
# t3: 2+8+3+7=20 observations
n <- list(c(6, 1), c(4, 2), c(2, 3), c(5, 1, 2), c(8, 1, 3), c(3, 2, 3), c(7, 1, 2, 3))
str(n)
d <- generateTestData(n)
str(d)
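The per-timepoint totals in the comments can be checked from the specification list itself. The following is a minimal base R sketch that does not call findMatch; the helper count_at is hypothetical and only re-does the arithmetic, assuming the first entry of each vector is the count and the rest are timepoints, as described above.

```r
n <- list(c(6, 1), c(4, 2), c(2, 3), c(5, 1, 2),
          c(8, 1, 3), c(3, 2, 3), c(7, 1, 2, 3))

# Hypothetical helper: sum the counts of all vectors that list timepoint t.
count_at <- function(spec, t) {
  sum(vapply(spec, function(v) if (t %in% v[-1]) v[1] else 0, numeric(1)))
}

counts <- vapply(1:3, function(t) count_at(n, t), numeric(1))
names(counts) <- paste0("t", 1:3)
print(counts)
```

The result should agree with the 26, 19 and 20 observations stated in the comments.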
First we generate two test data sets and then match them on the code variables using the Levenshtein distance. We will most likely find 5 matches.
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
data <- generateTestData(n)
vars <- c("code", "code")
match <- findMatch(data, vars)
# summary(match)
# head(match)
The summary tells us that we found 5 perfect matches with Levenshtein distances of zero. Each line in head should read as follows:
line.1: observation number in data frame 1
line.2: observation number in data frame 2
uid.1, uid.2: unique IDs over all data sets for the observation; if not given, dataset:lineno is used
idn.0.ZDV: common code created from vars
idn.1.code, idn.2.code: the codes from the two data frames that are compared with the common code
leven.1, leven.2: Levenshtein distances between the common code and the codes in the two data frames
leven.V3: sum of leven.1 and leven.2
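To make the distance columns concrete, here is a small base R sketch using utils::adist(), which computes the generalized Levenshtein distance. It is a stand-in for illustration, not necessarily what findMatch uses internally. The codes are made up, and how findMatch builds the common code is not shown here, so common is simply assumed.

```r
# Made-up codes; not output of generateTestData() or findMatch().
common <- "ABCD12"   # assumed common code
code1  <- "ABCD12"   # identical to the common code
code2  <- "ABCE1"    # one substitution (D -> E) and one deletion (2)

# adist() returns a distance matrix; drop() reduces the 1x1 case to a number.
leven1 <- drop(utils::adist(common, code1))
leven2 <- drop(utils::adist(common, code2))
total  <- leven1 + leven2   # analogous to the sum described for leven.V3

print(c(leven1 = leven1, leven2 = leven2, total = total))
```

A perfect match in the sense used above corresponds to both distances being zero.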
We may allow for a larger Levenshtein distance and find two more possible matches, but they are not exact.
set.seed(0)
n <- list(c(6, 1), c(4, 2), c(5, 1, 2))
data <- generateTestData(n)
vars <- c("code", "code")
match <- findMatch(data, vars, dmax = 5)
# summary(match)
# head(match)
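The effect of a larger distance threshold can be illustrated without findMatch. The sketch below uses base R only, with made-up codes and a hypothetical cutoff variable; how findMatch applies dmax internally is not shown in this document, so this only illustrates the general idea that a looser threshold admits inexact pairs.

```r
# Made-up codes for two timepoints; not generated by findMatch.
codes_t1 <- c("MAER05", "SCHU11", "BECK23")
codes_t2 <- c("MAER05", "SCHV11", "WOLF30")

# Pairwise generalized Levenshtein distances (rows: t1, columns: t2).
d <- utils::adist(codes_t1, codes_t2)
dimnames(d) <- list(codes_t1, codes_t2)
print(d)

cutoff <- 1                                  # hypothetical threshold
n_exact <- sum(d == 0)                       # pairs with identical codes
n_close <- sum(d <= cutoff)                  # pairs within the threshold
print(which(d <= cutoff, arr.ind = TRUE))    # row/column indices of candidates
```

Raising the cutoff adds candidate pairs whose codes differ slightly, which is why such matches need to be reviewed rather than accepted as exact.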