match_cases: Fuzzy matching of cases between linelists


View source: R/match_cases.R

Description

This function matches cases between two linelists on specified columns, using a user-specified distance threshold for each pair of columns.

Usage

match_cases(x, y, by, max_dist, match_fun = NULL, output = c("index",
  "merged"), mode = c("inner", "left", "right", "full", "semi", "anti"))

Arguments

x

Linelist 1 as a dataframe.

y

Linelist 2 as a dataframe.

by

Linelist columns to match cases on. This can be a character vector indicating column names found in both linelists, a 2-column integer matrix indicating the pairs of columns to be matched in linelist 1 and linelist 2, or a 2-column character matrix indicating the names of the columns to be matched in linelist 1 and linelist 2.
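The three accepted forms of by can be sketched as follows. The column names and positions used here are hypothetical and chosen only for illustration; they do not come from any real linelist.

```r
## Form 1: character vector of column names present in BOTH linelists
## (hypothetical names)
by_shared <- c("age", "onset_date")

## Form 2: 2-column integer matrix; each row pairs a column position in
## linelist 1 (first column) with a column position in linelist 2 (second)
by_index <- matrix(c(1L, 1L,
                     2L, 3L),
                   ncol = 2, byrow = TRUE)

## Form 3: 2-column character matrix; each row pairs a column name in
## linelist 1 with a column name in linelist 2 (hypothetical names)
by_names <- matrix(c("age_a", "age_b",
                     "onset_a", "onset_b"),
                   ncol = 2, byrow = TRUE)
```

In the matrix forms, each row describes one column pair, so the number of rows is the number of comparisons made per candidate match.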

max_dist

A numeric vector giving the cutoff distance for fuzzy matching of each column pair. This can be a single value applied to all column pairs, or a vector with one cutoff per column pair. Distances between numeric columns are the absolute difference between values; distances between Date columns are the absolute difference in days; distances between character columns are calculated using the stringdist function from the stringdist package.
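As a rough illustration of the three distance types described above, the sketch below computes them by hand on made-up values. It does not reproduce the internals of match_cases. Two assumptions are made: a pair is treated as a match when its distance is less than or equal to the cutoff, and base R's adist is used as a stand-in when the stringdist package is not installed.

```r
## Illustrative values only; not taken from sample_linelists
num_a <- c(10, 25)
num_b <- c(12, 40)
d_num <- abs(num_a - num_b)                 # absolute difference

date_a <- as.Date(c("2020-01-01", "2020-01-10"))
date_b <- as.Date(c("2020-01-04", "2020-01-10"))
d_date <- abs(as.numeric(date_a - date_b))  # absolute difference in days

chr_a <- c("john", "mary")
chr_b <- c("jon", "marie")
if (requireNamespace("stringdist", quietly = TRUE)) {
  ## element-wise string distance with the package's default method
  d_chr <- stringdist::stringdist(chr_a, chr_b)
} else {
  ## Fallback (assumption): base R Levenshtein distance as a stand-in
  d_chr <- mapply(function(a, b) utils::adist(a, b)[1, 1],
                  chr_a, chr_b, USE.NAMES = FALSE)
}

## Assumption: a pair is within threshold when distance <= cutoff
max_dist <- c(5, 1, 5)
within_threshold <- cbind(numeric   = d_num  <= max_dist[1],
                          character = d_chr  <= max_dist[2],
                          date      = d_date <= max_dist[3])
within_threshold
```

Here only the first row pair falls within all three cutoffs, which is the situation in which a pair would plausibly be reported as a match.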

match_fun

An optional list of functions for customised evaluations of matches. Each function must accept two vectors as arguments and return a logical vector of the same length indicating whether a comparison is a match or not. The list must be of the same length as max_dist.
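A minimal sketch of what such a list might look like is given below. The function names are hypothetical, and the sketch only checks that each function meets the stated contract (two vectors in, a logical vector of the same length out). How match_cases combines these functions with max_dist is not described on this page, so the final call is left commented out as an untested assumption.

```r
## Hypothetical custom match functions: each takes two vectors and returns
## a logical vector of the same length
within_five <- function(a, b) abs(a - b) <= 5
same_ignoring_case <- function(a, b) tolower(trimws(a)) == tolower(trimws(b))
within_seven_days <- function(a, b) abs(as.numeric(a - b)) <= 7

## One function per column pair, in the same order as the column pairs in by
match_fun <- list(within_five, same_ignoring_case, within_seven_days)

## Check the contract on made-up vectors
within_five(c(10, 25), c(12, 40))
same_ignoring_case(c("John ", "mary"), c("john", "marie"))
within_seven_days(as.Date(c("2020-01-01", "2020-01-10")),
                  as.Date(c("2020-01-04", "2020-02-10")))

## Untested assumption about usage:
## match_cases(linelist_a, linelist_b, by, max_dist, match_fun = match_fun)
```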

output

If "index", returns a dataframe of matched indices between the linelists. If "merged", returns a merged linelist.

mode

The type of join when returning a merged linelist. One of "inner", "left", "right", "full", "semi", "anti".

Value

A dataframe of matching indices if output = "index", a merged linelist if output = "merged".

Author(s)

Finlay Campbell (finlaycampbell93@gmail.com)

Examples

data(sample_linelists)
linelist_a <- sample_linelists$linelist_a
linelist_b <- sample_linelists$linelist_b

## examine linelists
head(linelist_a)
head(linelist_b)

## specify matching columns
by <- matrix(c("numeric_a", "numeric_b",
               "character_a", "character_b",
               "date_a", "date_b"),
             ncol = 2, byrow = TRUE)

## define thresholds
max_dist <- c(5, 1, 5)

## find matching case indices
matches <- match_cases(linelist_a, linelist_b, by, max_dist)
head(matches)

## merge linelists
linelist <- match_cases(linelist_a, linelist_b, by, max_dist, output = "merged")
head(linelist)

finlaycampbell/casematcher documentation built on May 8, 2020, 8:29 p.m.