pattern_join: common_join, pattern_join, similarity_join
In joheli/kungfu: kungfu - kicks stubborn data into shape

View source: R/pattern_join.R

pattern_join

R Documentation

common_join, pattern_join, similarity_join

Description

pattern_join and similarity_join join two data.frame objects based on regex patterns or on similarity to a reference, respectively. The first data.frame contains a dirty column (i.e. "real-world" data originating from a free-text field) that needs grouping, categorizing, or classifying based on its content. The second data.frame maps its rows to that dirty column, using the unique patterns (pattern_join) or references (similarity_join) given in one of its own columns (specified by the parameter by, see below). common_join is a "common trunk" used by both pattern_join and similarity_join; it can also be used to create custom join functions, provided a custom matcher function is supplied (see below).
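The idea of joining on regex patterns can be illustrated with a minimal base R sketch. This is not the package's implementation: the data, the column names, and the choice to keep only the first matching pattern per row are assumptions made for illustration.

```r
# Hypothetical dirty data and pattern table (not the package's example data)
dirty <- data.frame(
  id = 1:4,
  model = c("Boeing 737-800", "A320 neo", "boeing 747", "Cessna 172"),
  stringsAsFactors = FALSE
)
patterns <- data.frame(
  pattern = c("(?i)boeing", "^A3[0-9]{2}"),
  type = c("Boeing", "Airbus"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per dirty entry, one column per pattern
hits <- sapply(patterns$pattern, function(p) grepl(p, dirty$model, perl = TRUE))

# Index of the first matching pattern per row, NA if no pattern matches
# (assumption: how kungfu resolves multiple matches is not stated in this page)
idx <- apply(hits, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)

# Attach the matching row of the pattern table; unmatched rows receive NA
result <- cbind(dirty, patterns[idx, , drop = FALSE])
rownames(result) <- NULL
result
```

Rows without a matching pattern ("Cessna 172" here) carry NA in the joined columns.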

Usage

common_join(
  x,
  y,
  by,
  nomatch_cutoff = 0.2,
  x_split_cutoff = 500,
  multicore = TRUE,
  matcher = NULL
)

pattern_join(
  x,
  y,
  by,
  nomatch_cutoff = 0.2,
  x_split_cutoff = 500,
  multicore = TRUE
)

similarity_join(
  x,
  y,
  by,
  nomatch_cutoff = 0.4,
  x_split_cutoff = 500,
  multicore = TRUE
)

Arguments

`x`	the first `data.frame`
`y`	the second `data.frame`, containing a column with regex patterns (`pattern_join`) or reference strings (`similarity_join`)
`by`	character of length 1 naming the columns to join on: either named (e.g. `c("field name in x" = "field name containing patterns in y")`) or unnamed (e.g. `"field name in both x and y"`; here, `x` and `y` share the same column name).
`nomatch_cutoff`	used by `similarity_join`: a numeric between 0 and 1 specifying the similarity (measured with the optimal string alignment metric, see `stringdist::stringsim`) below which `NA` is joined to the original data (meaning: the entry is treated as a "no match")
`x_split_cutoff`	integer specifying the number of rows above which `x` is split into smaller `data.frame` objects; this is necessary because the joining algorithm cannot handle `data.frame` objects with many thousands of rows.
`multicore`	logical specifying if multiple cores should be used or not; it defaults to `TRUE`, although benefits in speed only arise if `nrow(x)` is substantially greater than `x_split_cutoff`.
`matcher`	to create a custom join function using `common_join`, specify here a function accepting two character vectors and returning a matrix with a custom matching metric; e.g. for `similarity_join` the custom matching function is `function(x1, x2) stringdist::stringsimmatrix(a = x2, b = x1, method = "osa")`.
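A custom matcher might look like the following sketch. It substitutes plain Levenshtein distance from base R's `utils::adist` for the optimal string alignment metric, and the name `levenshtein_matcher` is hypothetical. The only property taken from the description above is the orientation of the result: one row per element of `x2` and one column per element of `x1`, as in `stringsimmatrix(a = x2, b = x1)`.

```r
# Hypothetical custom matcher: Levenshtein-based similarity in [0, 1]
levenshtein_matcher <- function(x1, x2) {
  # Edit distances, rows = x2, columns = x1
  d <- utils::adist(x2, x1)
  # Length of the longer string for every pair, used for normalization
  longest <- outer(nchar(x2), nchar(x1), pmax)
  sim <- 1 - d / longest
  # Two empty strings are treated as identical (avoids 0 / 0)
  sim[longest == 0] <- 1
  sim
}

m <- levenshtein_matcher(
  x1 = c("Bergerx", "Mueler"),
  x2 = c("Berger", "Mueller", "Horst")
)
dim(m)  # 3 rows (x2) by 2 columns (x1)
```

Given the signature in Usage, such a function could presumably be supplied as `common_join(x, y, by, matcher = levenshtein_matcher)`; that call is not verified here.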

Value

A tibble merging x and y based on the matches or similarities found between the columns specified by argument by.

Examples

library(kungfu)
# the pipe operator %>% used below is provided by magrittr
library(magrittr)
# pattern_join 'airplanes' with 'model_type' by columns 'model' and 'pattern'
airplanes_model_type <- pattern_join(airplanes, model_type, c("model" = "pattern"), multicore = FALSE)
# test data for similarity_join
dirty <- data.frame(sample = 1:6, description = c("Bergerx", "Mueler", "Horsst", "Kinga", "Mannn", "Schneemann"))
reference <- data.frame(reference = c("Berger", "Mueller", "Horst", "King", "Mann", "Mustermann"))
# similarity_join with default nomatch_cutoff
dirty %>% similarity_join(reference, by = c("description" = "reference"), multicore = FALSE)
# to avoid mapping "Schneemann" to "Mustermann", increase nomatch_cutoff (default 0.4) to at least 0.51
dirty %>% similarity_join(reference, by = c("description" = "reference"), nomatch_cutoff = 0.51, multicore = FALSE)
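The cutoff of 0.51 in the last example can be reproduced by hand. The sketch below uses base R's `utils::adist` (Levenshtein distance) as a stand-in for the optimal string alignment metric, and assumes the similarity is normalized as 1 minus the distance divided by the length of the longer string.

```r
a <- "Schneemann"
b <- "Mustermann"

# Edit distance between the two names (Levenshtein, via base R)
d <- utils::adist(a, b)[1, 1]

# Similarity normalized by the longer string (assumed normalization)
sim <- 1 - d / max(nchar(a), nchar(b))
sim  # 0.5

# A pair counts as a match only if its similarity is not below the cutoff
sim >= 0.4   # TRUE: joined with the default cutoff
sim >= 0.51  # FALSE: treated as "no match" with the raised cutoff
```

Five of the ten characters differ, so the similarity is exactly 0.5, which lies above the default cutoff of 0.4 but below 0.51.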

joheli/kungfu documentation built on March 25, 2024, 10:10 a.m.