match_rows: Fuzzy matching of cases between linelists

View source: R/match_rows.R

Description

This function matches cases between linelists on specified columns using user-specified matching thresholds.

Usage

match_rows(
  x,
  y,
  by,
  score_fun = NULL,
  rescale = TRUE,
  na_score = 0,
  output = c("scores", "merged", "review"),
  top_n = NULL,
  min_score = NULL
)

Arguments

x

A dataframe containing the columns specified in the first column of the by argument.

y

A dataframe containing the columns specified in the second column of the by argument.

by

Linelist columns to match cases on. This can be a character vector of column names found in both linelists, a 2-column integer matrix giving the pairs of column indices to be matched in x (first column) and y (second column), or a 2-column character matrix giving the pairs of column names to be matched in x (first column) and y (second column).
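The three accepted forms of by can be sketched as follows. The column names (id, onset, case_id, onset_date) and the column positions are hypothetical and only illustrate the shapes described above.

```r
## Form 1: character vector of column names present in both linelists
## (hypothetical names)
by_shared <- c("id", "onset")

## Form 2: 2-column integer matrix; each row pairs a column index in x
## (first column) with a column index in y (second column)
by_index <- matrix(c(1L, 2L,
                     3L, 5L),
                   ncol = 2, byrow = TRUE)

## Form 3: 2-column character matrix; each row pairs a column name in x
## (first column) with a column name in y (second column)
by_names <- matrix(c("id", "case_id",
                     "onset", "onset_date"),
                   ncol = 2, byrow = TRUE)
```

Using byrow = TRUE keeps each pair of columns on one line of the call, which makes the pairing easier to read.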

score_fun

An optional list of functions for customised evaluations of matches. Each function must accept two vectors as arguments and return a numeric vector of the same length indicating the quality of the match.
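A function satisfying this contract might look like the sketch below. The name exact_match_score is hypothetical, and the convention that larger values indicate a better match is an assumption that this page does not state; check the package source before relying on it.

```r
## Hypothetical scoring function: returns 1 where both values are present
## and equal, and 0 otherwise (including when either value is NA).
## Assumes larger scores mean better matches, which this page does not confirm.
exact_match_score <- function(a, b) {
  stopifnot(length(a) == length(b))
  as.numeric(!is.na(a) & !is.na(b) & a == b)
}

## Compare two vectors element by element
exact_match_score(c("Ann", "Bob", NA), c("Ann", "Rob", "Eve"))
## [1] 1 0 0
```

How the list passed to score_fun is aligned with the rows of by is not documented on this page, so that step is left out of the sketch.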

rescale

A logical indicating whether scores for each variable should be rescaled between 0 and 1.

na_score

A numeric indicating the value to substitute for scores that are NA. NA handling can also be specified per variable by providing custom scoring functions to score_fun.

output

If "scores", returns a dataframe of matching scores. If "merged", returns a merged linelist using the matched indices. If "review", returns a dataframe for manual review of matches.

top_n

An optional integer indicating the number of matches to keep per row of the x dataframe, sorted by match score.

min_score

An optional numeric indicating the minimum match score required to keep a match.

Value

Depending on the value of output, a dataframe containing either the matching scores, a merged linelist, or the matches for manual review.

Author(s)

Finlay Campbell (finlaycampbell93@gmail.com)

Examples

data(sample_linelists)

## examine linelists
head(sample_linelists$linelist_a)
head(sample_linelists$linelist_b)

## specify matching columns
by <- matrix(c("numeric_a", "numeric_b",
               "character_a", "character_b",
               "date_a", "date_b"),
             ncol = 2, byrow = TRUE)

## find matching case indices
matches <- match_rows(
  sample_linelists$linelist_a,
  sample_linelists$linelist_b,
  by
)
head(matches)

finlaycampbell/rowmatcher documentation built on May 26, 2020, 12:14 a.m.