findMatch: findMatch

View source: R/findMatch.R

Description

Finds matches between two or more data sets on a text variable (code or e-mail) using Levenshtein distances. For a detailed application, see the vignette.
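
The idea of distance-based matching can be sketched with base R alone. The snippet below is an illustration under stated assumptions, not the package's implementation: the code vectors are made up, and utils::adist() is used directly to compute Levenshtein distances, with pairs kept when the distance lies below a threshold of 3.

```r
# Illustration only: uses utils::adist() directly, not findMatch internals.
codes1 <- c("ANNA01", "BERT77", "CARL12")   # hypothetical codes, data set 1
codes2 <- c("ANNA10", "bert77", "DORA99")   # hypothetical codes, data set 2

# Matrix of Levenshtein distances: rows = codes1, columns = codes2
d <- adist(codes1, codes2)
dimnames(d) <- list(codes1, codes2)
print(d)

# Candidate pairs whose distance is below the threshold
dmax <- 3
candidates <- which(d < dmax, arr.ind = TRUE)
print(data.frame(code1    = codes1[candidates[, "row"]],
                 code2    = codes2[candidates[, "col"]],
                 distance = d[candidates]))

# Ignoring case brings "BERT77" and "bert77" together as well
d_nocase <- adist(codes1, codes2, ignore.case = TRUE)
print(d_nocase[2, 2])
```

In this toy data only "ANNA01" and "ANNA10" qualify in the case-sensitive comparison, because "BERT77" and "bert77" differ in four characters.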

Usage

findMatch(data, ...)

## Default S3 method:
findMatch(data, vars, dmax = 3, exclude = c("", "."),
  ignore.case = FALSE, unique.id = NULL, output = 50,
  cmpfunc = NULL, ...)

Arguments

data

list of data frames

...

further parameters for cmp

vars

vector of variables. One for each data frame.

dmax

maximal Levenshtein distance for matching in text variables ($l(t_{i1}, t_{j2}) < dmax$), defaults to 3

exclude

entries to be excluded from the unique values, defaults to c('', '.')

ignore.case

if FALSE, the unique values are case sensitive; if TRUE, case is ignored

unique.id

vector of variables which contain a unique ID over all data sets. If not given then filename:lineno will be used.

output

number of observations to analyse before progress information is displayed

cmpfunc

function for comparison of strings of form fun(x, y, ignore.case, ...) (default: adist)
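
A custom comparison function might look like the following sketch. The name cmp_trimmed is hypothetical, and the sketch assumes that, like adist, the function should return a matrix of distances; the documentation above only fixes the form fun(x, y, ignore.case, ...).

```r
# Hypothetical comparison function in the documented form
# fun(x, y, ignore.case, ...). It removes all whitespace before
# computing Levenshtein distances with utils::adist().
# ASSUMPTION: a distance matrix, as returned by adist(), is the expected result.
cmp_trimmed <- function(x, y, ignore.case = FALSE, ...) {
  x <- gsub("[[:space:]]+", "", x)
  y <- gsub("[[:space:]]+", "", y)
  utils::adist(x, y, ignore.case = ignore.case, ...)
}

# Whitespace and case differences vanish when ignore.case = TRUE
print(cmp_trimmed(" ab 12", "AB12", ignore.case = TRUE))

# Case-sensitive comparison: the two letters differ
print(cmp_trimmed("ab12", "AB12"))
```

If that assumption holds, such a function could be passed as cmpfunc = cmp_trimmed in a call to findMatch.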

Details

The result is a list with three elements:

line

a matrix with the line numbers of the matching observations

idn

a matrix with the common ID ZDV and the original text variables in the data sets

leven

a matrix with the Levenshtein distance between the common ID and the original text variables in the data sets

Value

a list structure with possibly matched observations

Examples

set.seed(0)
# create two data sets where the second consists of
# 200 obs. only in t1, 200 obs. in t1 and t2 and
# 100 obs. only in t2
n <- list(c(200, 1), c(200, 1, 2), c(100, 2))
x <- generateTestData(n)
# match by code
match <- findMatch(x, c('code', 'code'))
head(match)
summary(match)

sigbertklinke/findMatch documentation built on July 12, 2019, 9:22 a.m.