hungarian_merge: Merge Two Data Frames using the Hungarian Method


View source: R/hungarian_merge.R


Merge two data frames by a common column such that (by default) the Damerau-Levenshtein distance between matched values is minimized.


hungarian_merge(x, y, by.x = NULL, by.y = NULL, FUN = NULL,
  distance_col = FALSE, ...)


x, y

data frames where nrow(x) >= nrow(y), or objects to be coerced to one.

by.x, by.y

specifications of the columns used for merging; these columns must not contain duplicates.


FUN

function to be used to calculate the distance between potential matches.


distance_col

should a distance column be included in the output?


...

parameters passed to FUN.


Finds the optimal matches using (a fast version of) the Hungarian method as implemented in assignment.

The merge is performed such that every row of y is matched with exactly one row of x. If x has more rows than y, some rows of x remain unmatched.

The function is most useful if x and y are lists with different but unique realizations of names from a third master list, for example two lists of county names that are spelled slightly differently.
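The idea behind the matching can be illustrated with a small base-R sketch. It is not the package's implementation: `best_assignment` is a hypothetical helper that searches all one-to-one matchings by brute force instead of using the Hungarian method, and `utils::adist` (plain Levenshtein distance) stands in for the Damerau-Levenshtein default. It only shows what "every row of y gets exactly one row of x at minimal total distance" means on a tiny input.

```r
# Hypothetical helper (not part of the package): exhaustive search for the
# minimum-cost one-to-one matching. `cost` has one row per element of y and
# one column per element of x, with nrow(cost) <= ncol(cost) and
# non-negative entries. Only feasible for very small inputs.
best_assignment <- function(cost) {
  n_y <- nrow(cost)
  n_x <- ncol(cost)
  stopifnot(n_y <= n_x)
  best <- list(total = Inf, match = NULL)

  # Assign y[i] to every still-unused x[j] in turn and recurse.
  search <- function(i, used, total, match) {
    if (total >= best$total) return(invisible(NULL))  # cannot improve, prune
    if (i > n_y) {
      best <<- list(total = total, match = match)     # new best full matching
      return(invisible(NULL))
    }
    for (j in setdiff(seq_len(n_x), used)) {
      search(i + 1, c(used, j), total + cost[i, j], c(match, j))
    }
  }

  search(1, integer(0), 0, integer(0))
  best
}

# County names as in the examples ("\u00fc" is the escaped form of "ü").
x_names <- c("Rosenheim", "Rosenheim, Stadt",
             "M\u00fcnchen", "M\u00fcnchen, Stadt")
y_names <- c("Rosenheim, Landkreis", "Rosenheim, kreisefreie Stadt",
             "M\u00fcnchen, Landeshauptstadt")

# Levenshtein distances: rows index y_names, columns index x_names.
cost <- utils::adist(y_names, x_names)
res <- best_assignment(cost)

# One x name per y name; one x name is left unmatched.
data.frame(y = y_names,
           x = x_names[res$match],
           distance = cost[cbind(seq_along(y_names), res$match)])
```

The exhaustive search grows factorially with the number of rows, which is why a dedicated assignment solver is the practical choice for real data.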

See Also

assignment and stringdist

# Matching German county names

dat1 <- data.frame(gem=c("Rosenheim", "Rosenheim, Stadt", "München", "München, Stadt") , size=rnorm(4))

dat2 <- data.frame(kr=c("Rosenheim, Landkreis", "Rosenheim, kreisefreie Stadt", "München, Landeshauptstadt") , pop=rpois(3,10))

hungarian_merge(dat1,dat2,by.x="gem", by.y="kr",distance_col=TRUE )

# User-defined function

dat1 <- data.frame(id=c(12,5,1,100), size=rnorm(4))

dat2 <- data.frame(id=c(10,1000,0,5), size=rnorm(4))

hungarian_merge(dat1,dat2,by.x="id", by.y="id", FUN=function(x,y)abs(x-y), distance_col=TRUE )

sumtxt/datatools documentation built on Oct. 7, 2018, 11:18 p.m.