match.data.frame: Identify the row of 'y' best matching each row of 'x'
In Ecfun: Functions for Ecdat


For each row of x[, by.x], find the best matching row of y[, by.y], with the best match defined by grep. and split.

grep. and split must either be missing or have the same length as by.x and by.y. If grep.[i] and split[i] are NA, do a complete match of x[, by.x[i]] and y[, by.y[i]]. Otherwise, for each row j, look for a match for strsplit(x[j, by.x[i]], split[i])[[1]][1] among strsplit(y[, by.y[i]], split[i]). See details.
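The segment lookup described above can be sketched in base R. This is a minimal illustration, not code from Ecfun: the values `x_val`, `y_vals` and `split_i` are hypothetical, and `fixed = TRUE` in `strsplit` is an assumption made here so the split string is treated literally.

```r
# Hypothetical inputs, chosen only to illustrate the idea
x_val   <- "Rogers (AL)"                       # one cell of x[, by.x[i]]
y_vals  <- c("Smith", "Rogers (MI)", "Rogers") # the column y[, by.y[i]]
split_i <- " "                                 # stands in for split[i]

# First segment of the x value
x_first <- strsplit(x_val, split_i, fixed = TRUE)[[1]][1]

# All segments of every y value (a list with one element per row of y)
y_segs <- strsplit(y_vals, split_i, fixed = TRUE)

# Rows of y in which any segment equals the first segment of x
hits <- which(vapply(y_segs, function(s) x_first %in% s, logical(1)))
```

Here two rows of y share the first segment, so this column alone does not give a unique match and other columns would be needed to decide.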

match.data.frame(x, y, by, by.x=by, by.y=by, grep., split, sep=':')

`x, y`	data.frames
`by, by.x, by.y`	names of columns of `x` and `y` to match.
`grep.`	a character vector of the type of match for each element of `by.x` and `by.y`. If `NA`, require a perfect match. Alternatives are `grep` and `agrep` to find a match for the first segment in strsplit(x, split=split[i]) among any of the segments of strsplit(y, split=split[i]). Use `fixed=TRUE` with the calls to these functions. NOTE: These alternatives are not examined if a unique match is found between x[, by.x[is.na(grep.) & is.na(split)]] and the corresponding columns of `y`.
`split`	A character vector of `split` characters to pass to `strsplit`; `strsplit` is not called if `is.na(split)`.
`sep`	a `sep` argument to use with `paste` to produce a matching key for the columns of `x` and `y` for which perfect matches are required. If `sep` is missing and `grep.` is supplied, then sep <- ' ', except for the columns where `grep.` is `NA`.
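As an illustration of how a pasted key can drive the perfect-match columns, the following base R sketch builds one key per row and looks it up with `match`. The data frames and the name `by_cols` are made up for this example, and the code is not taken from the package.

```r
# Hypothetical data; both columns are treated as perfect-match columns
x <- data.frame(state = c("AL", "MI"),
                surname = c("Rogers", "Rogers"),
                stringsAsFactors = FALSE)
y <- data.frame(state = c("NY", "MI", "AL"),
                surname = c("Rogers", "Rogers", "Rogers"),
                stringsAsFactors = FALSE)
by_cols <- c("state", "surname")

# One key per row, e.g. "AL:Rogers"
key_x <- do.call(paste, c(x[by_cols], sep = ":"))
key_y <- do.call(paste, c(y[by_cols], sep = ":"))

# Position in y of the first row whose key equals each key of x
idx <- match(key_x, key_y)
```

A separator that cannot occur in the data keeps distinct rows from collapsing to the same key.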

1. Check by.x, by.y, grep. and split. If((missing(by.x) | missing(by.y)) && missing(by)) by <- names(x)

2. fullMatch <- (is.na(grep.) & is.na(split)). Create keyfx and keyfy by pasting columns of x[, by.x[fullMatch]] and y[, by.y[fullMatch]]. Also create x. and y. = strsplit of x[, by.x[!fullMatch]].

3. Iterate over rows of x looking for the best match. This includes an inner loop over columns of x[, by.x[!fullMatch]], stopping on the first unique match. Return (-1) if no unique match is found.
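Steps 2 and 3 can be approximated by the simplified sketch below. It is a stand-in written for this page, not the Ecfun source: `sketch_match` is a hypothetical name, it assumes at least one perfect-match column, it always uses `grep(..., fixed = TRUE)` for the remaining columns (never `agrep`), and it returns NA when no unique match is found.

```r
# Simplified, hypothetical stand-in for steps 2 and 3; not the package code
sketch_match <- function(x, y, by, grep., split, sep = ":") {
  fullMatch <- is.na(grep.) & is.na(split)
  # Step 2: keys built from the perfect-match columns
  # (assumes any(fullMatch) is TRUE)
  keyfx <- do.call(paste, c(x[by[fullMatch]], sep = sep))
  keyfy <- do.call(paste, c(y[by[fullMatch]], sep = sep))
  out <- rep(NA_integer_, nrow(x))
  for (j in seq_len(nrow(x))) {
    # Candidate rows of y whose key equals the key of row j of x
    cand <- which(keyfy == keyfx[j])
    # Step 3: narrow the candidates column by column, stopping as soon
    # as at most one candidate is left
    for (i in which(!fullMatch)) {
      if (length(cand) <= 1) break
      pat <- x[j, by[i]]
      if (!is.na(split[i])) {
        pat <- strsplit(pat, split[i], fixed = TRUE)[[1]][1]
      }
      cand <- cand[grep(pat, y[cand, by[i]], fixed = TRUE)]
    }
    if (length(cand) == 1) out[j] <- cand
  }
  out
}

# Hypothetical data to exercise the sketch
x <- data.frame(state = c("AL", "MI"), surname = c("Rogers", "Rogers"),
                stringsAsFactors = FALSE)
y <- data.frame(state = c("NY", "MI", "AL", "MI"),
                surname = c("Rogers", "Rogers (MI)", "Rogers (AL)", "Jones"),
                stringsAsFactors = FALSE)
res <- sketch_match(x, y, by = c("state", "surname"),
                    grep. = c(NA, "grep"), split = c(NA, " "))
```

In this data the key alone settles the first row of x, so the inner loop exits immediately; the second row has two candidates and needs the surname column to decide.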

an integer vector of length nrow(x) containing the index of the best matching row of y or NA if no adequate match was found.
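An index vector of this shape can be used directly to pull the matched rows of y. The sketch below uses a made-up data frame and a made-up index `idx` (not output from the function) to show how NA entries behave when indexing.

```r
# Hypothetical reference table and hypothetical match result
y <- data.frame(id = c("a", "b", "c"), value = c(10, 20, 30),
                stringsAsFactors = FALSE)
idx <- c(3L, NA, 1L)

# Indexing by NA yields a row of NAs, so the result has one row per row of x
matched <- y[idx, ]

# Logical flag for the rows of x that found a match
found <- !is.na(idx)
```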

Spencer Graves

strsplit, is.na, grep, agrep, match, row.match, join, match_df, classify

newdata <- data.frame(state=c("AL", "MI","NY"),
                      surname=c("Rogers", "Rogers", "Smith"),
                      givenName=c("Mike R.", "Mike K.", "Al"),
                      stringsAsFactors=FALSE)
reference <- data.frame(state=c("NY", "NY", "MI", "AL", "NY", "MI"),
                      surname=c("Smith", "Rogers", "Rogers (MI)",
                                "Rogers (AL)", "Smith", 'Jones'),
                      givenName=c("John", "Mike", "Mike", "Mike",
                                "T. Albert", 'Al Thomas'),
                      stringsAsFactors=FALSE)
newInRef <- match.data.frame(newdata, reference,
       grep.=c(NA, 'agrep', 'agrep'))


all.equal(newInRef, c(4, 3, 5))