pattern_join: common_join, pattern_join, similarity_join

View source: R/pattern_join.R

pattern_join    R Documentation

common_join, pattern_join, similarity_join

Description

pattern_join and similarity_join join two data.frame objects based on regex patterns or on similarities to a reference, respectively. The first data.frame contains a dirty column (i.e. "real-world" data originating from a free text field) that needs grouping, categorizing, or classifying based on its content. The second data.frame maps its rows to that dirty column, using the unique patterns (pattern_join) or references (similarity_join) given in one of its own columns (specified by parameter by, see below). common_join is a "common trunk" used by both pattern_join and similarity_join; it can also be used to create custom join functions, provided a custom matcher function is given (see below).
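To make the idea of joining by pattern concrete, here is a rough base R sketch. It is not the package's implementation: the toy data, the column names, and the rule of taking the first matching pattern are all assumptions made for illustration.

```r
# Hypothetical toy data, not shipped with the package
dirty <- data.frame(
  model = c("Boeing 737-800", "A320 neo", "boeing 747"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  pattern = c("737", "A320"),
  type = c("narrow-body (Boeing)", "narrow-body (Airbus)"),
  stringsAsFactors = FALSE
)

# For each dirty entry, find the index of the first matching pattern
# (assumption: first match wins; NA if no pattern matches)
hit <- vapply(dirty$model, function(m) {
  idx <- which(vapply(lookup$pattern, grepl, logical(1), x = m,
                      USE.NAMES = FALSE))
  if (length(idx) > 0) idx[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

# Attach the matching lookup row to each dirty row; unmatched rows get NA
joined <- cbind(dirty, lookup[hit, , drop = FALSE])
rownames(joined) <- NULL
joined
```

The third entry matches no pattern, so its pattern and type columns are NA.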

Usage

common_join(
  x,
  y,
  by,
  nomatch_cutoff = 0.2,
  x_split_cutoff = 500,
  multicore = TRUE,
  matcher = NULL
)

pattern_join(
  x,
  y,
  by,
  nomatch_cutoff = 0.2,
  x_split_cutoff = 500,
  multicore = TRUE
)

similarity_join(
  x,
  y,
  by,
  nomatch_cutoff = 0.4,
  x_split_cutoff = 500,
  multicore = TRUE
)

Arguments

x

the first data.frame, containing the dirty column

y

the second data.frame, containing a column with regex patterns (pattern_join) or references (similarity_join)

by

character vector of length 1 specifying the columns to join by. It is either named (e.g. c("field name in x" = "field name containing patterns in y")) or unnamed (e.g. "field name in both x and y"; here, x and y contain the same column name).

nomatch_cutoff

used by similarity_join: a numeric between 0 and 1 specifying the similarity (using the optimal string alignment metric, see stringsim) below which NA is joined to the original data (meaning: the entry is treated as a "no match")
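As a worked illustration of the cutoff, the sketch below computes a similarity as 1 minus the edit distance divided by the length of the longer string. It uses base R's adist (Levenshtein distance) as a stand-in for the optimal string alignment metric; for the pair shown, both distances are 5. The helper best_ref is hypothetical and only mirrors the rule described above.

```r
# Similarity = 1 - distance / length of the longer string.
# adist() (base R, package utils) returns the Levenshtein distance, which
# equals the optimal string alignment distance for this particular pair.
a <- "Schneemann"
b <- "Mustermann"
sim <- 1 - drop(adist(a, b)) / max(nchar(a), nchar(b))
sim  # 0.5

# Hypothetical helper: return NA when the similarity is below the cutoff
best_ref <- function(sim, ref, cutoff) {
  if (sim < cutoff) NA_character_ else ref
}

best_ref(sim, b, cutoff = 0.4)   # "Mustermann" is kept
best_ref(sim, b, cutoff = 0.51)  # NA, i.e. treated as "no match"
```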

x_split_cutoff

integer specifying the number of rows above which x is split into smaller data.frame objects; this is necessary because the joining algorithm cannot handle data.frames with many thousands of rows.
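The splitting step can be pictured with the following base R sketch. The helper split_rows is hypothetical, and the package's internal splitting may work differently; the sketch only shows how a data.frame can be cut into chunks of at most x_split_cutoff rows.

```r
# Hypothetical helper, not part of the package
split_rows <- function(x, x_split_cutoff = 500) {
  # Nothing to do if x is small enough
  if (nrow(x) <= x_split_cutoff) return(list(x))
  # Assign consecutive rows to chunks of at most x_split_cutoff rows
  chunk_id <- ceiling(seq_len(nrow(x)) / x_split_cutoff)
  unname(split(x, chunk_id))
}

x <- data.frame(id = 1:1200)
chunks <- split_rows(x, x_split_cutoff = 500)
vapply(chunks, nrow, integer(1))  # 500 500 200
```

Each chunk could then be joined separately and the results bound together again, which is also where multiple cores could help.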

multicore

logical specifying if multiple cores should be used or not; it defaults to TRUE, although benefits in speed only arise if nrow(x) is substantially greater than x_split_cutoff.

matcher

to create a custom join function using common_join, specify here a function accepting two character vectors and returning a matrix with a custom matching metric; e.g. for similarity_join the custom matching function is function(x1, x2) stringdist::stringsimmatrix(a = x2, b = x1, method = "osa").
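A custom matcher could look like the sketch below. It assumes that the matrix should have the same orientation as the stringsimmatrix call quoted above (one row per element of x2, one column per element of x1) and that higher values mean better matches. The name lev_matcher is hypothetical; it uses base R's adist (Levenshtein distance) rather than the optimal string alignment metric.

```r
# Hypothetical custom matcher: Levenshtein-based similarity in [0, 1]
lev_matcher <- function(x1, x2) {
  d <- adist(x2, x1)                            # rows: x2, columns: x1
  longest <- outer(nchar(x2), nchar(x1), pmax)  # length of the longer string
  1 - d / longest
}

m <- lev_matcher(c("Bergerx", "Mueler"), c("Berger", "Mueller"))
dim(m)  # 2 2

# Possible use with common_join (not run; arguments as listed under Usage):
# common_join(dirty, reference, by = c("description" = "reference"),
#             matcher = lev_matcher)
```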

Value

a tibble of x and y, merged on the matches found between the columns specified by argument by.

See Also

pattern_join is similar to regex_join (package fuzzyjoin)

Examples

library(kungfu)    # join functions; assumed to also provide 'airplanes' and 'model_type'
library(magrittr)  # provides the pipe operator %>%

# pattern_join 'airplanes' with 'model_type' by columns 'model' and 'pattern'
airplanes_model_type <- pattern_join(airplanes, model_type, c("model" = "pattern"), multicore = FALSE)

# test data for similarity_join
dirty <- data.frame(sample = 1:6, description = c("Bergerx", "Mueler", "Horsst", "Kinga", "Mannn", "Schneemann"))
reference <- data.frame(reference = c("Berger", "Mueller", "Horst", "King", "Mann", "Mustermann"))

# similarity_join with default nomatch_cutoff
dirty %>% similarity_join(reference, by = c("description" = "reference"), multicore = FALSE)

# to avoid mapping "Schneemann" to "Mustermann" (similarity 0.5), increase nomatch_cutoff (default 0.4) to at least 0.51
dirty %>% similarity_join(reference, by = c("description" = "reference"), nomatch_cutoff = 0.51, multicore = FALSE)

joheli/kungfu documentation built on March 25, 2024, 10:10 a.m.