matchEpiData: Find duplicates in one or two data sets
In Hackout3/epimatch: Tools for finding close matches in epi data


Find duplicates in one or two data sets

matchEpiData(dat1, dat2 = NULL, funlist = list(), thresh = 0.05, giveWeight = FALSE)

`dat1`	An input linelist
`dat2`	An optional extra linelist
`funlist`	A list of lists, one per comparison, each containing:
	d1vars - variable names for dataset 1
	d2vars - variable names for dataset 2
	fun - name of the function to apply to these variables
	extraparams - extra parameters to pass to the function
	weights - a weight vector to scale each matrix (not used in processFunctionList)
`thresh`	A threshold below which two rows are considered nearly identical.
`giveWeight`	A logical indicating whether the output should be a list of weights (TRUE) or a list of indices (FALSE, the default).
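Because funlist is a nested list, a misspelled or missing field is easy to overlook. The following is a minimal sketch of a shape check; `check_funlist` is a hypothetical helper written for illustration and is not part of epimatch. It assumes only that each entry needs the `d1vars`, `d2vars`, and `fun` fields described above.

```r
# Hypothetical helper, NOT part of epimatch: checks that every entry of a
# funlist carries the fields described in the Arguments section.
check_funlist <- function(funlist) {
  required <- c("d1vars", "d2vars", "fun")
  for (i in seq_along(funlist)) {
    # Names present in this entry; fields set to NULL still keep their name.
    missing_fields <- setdiff(required, names(funlist[[i]]))
    if (length(missing_fields) > 0) {
      stop(sprintf("funlist[[%d]] is missing: %s",
                   i, paste(missing_fields, collapse = ", ")))
    }
  }
  invisible(TRUE)
}

# A well-formed specification with two comparisons.
fl <- list(
  list(d1vars = "ID", d2vars = "ID",
       fun = "nameDists", extraparams = NULL),
  list(d1vars = c("Surname", "OtherNames"),
       d2vars = c("SurnameLab", "OtherNameLab"),
       fun = "nameDists", extraparams = NULL)
)
ok <- check_funlist(fl)

# An entry without d2vars is rejected with an informative message.
bad <- tryCatch(check_funlist(list(list(d1vars = "ID", fun = "nameDists"))),
                error = function(e) conditionMessage(e))
```

Running such a check before calling matchEpiData can surface a malformed specification early.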

This function takes one or two data sets, a list of functions to apply to specific columns of those data sets, and a threshold that determines what counts as a match. It returns a list from returnMatches in which each element represents a different potential match. Each element is itself a two-element list containing either the indices or the weights of the samples whose distance fell below the threshold.
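The general idea of thresholding weighted distances can be sketched in base R. This is an illustration under stated assumptions and not the epimatch implementation: the toy data, the `norm_dist` helper, the use of normalised edit distance, and the weighted-average combination are all assumptions made for the example.

```r
# Illustration only; NOT the epimatch implementation.
# Two toy linelists (hypothetical data).
d1 <- data.frame(ID = c("A001", "A002", "A003"),
                 Name = c("John Smith", "Mary Jones", "Ali Khan"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c("A002", "A001"),
                 Name = c("Marie Jones", "Jon Smith"),
                 stringsAsFactors = FALSE)

# Assumed distance: edit distance divided by the longer string's length,
# so 0 means identical and 1 means entirely different.
norm_dist <- function(a, b) {
  d <- utils::adist(a, b)                       # length(a) x length(b) matrix
  d / pmax(outer(nchar(a), nchar(b), pmax), 1)  # guard against empty strings
}

# One distance matrix per comparison, combined as a weighted average
# (the combination rule is an assumption for this sketch).
w <- c(ID = 1, Name = 0.5)
combined <- (w[["ID"]] * norm_dist(d1$ID, d2$ID) +
             w[["Name"]] * norm_dist(d1$Name, d2$Name)) / sum(w)

# Keep the row pairs whose combined distance falls below the threshold,
# closest pairs first.
thresh <- 0.25
hits <- which(combined < thresh, arr.ind = TRUE)
hits <- hits[order(combined[hits]), , drop = FALSE]
matches <- data.frame(d1 = hits[, "row"],
                      d2 = hits[, "col"],
                      dist = combined[hits])
matches
```

Each row of `matches` pairs a row index in the first data set with a row index in the second, which mirrors the index output described above. A lower threshold keeps only closer pairs.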

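A list with one element per potential match, as returned by returnMatches. Each element is a two-element list holding the matching indices, or the weights when giveWeight = TRUE.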

## Loading Data
indata <- system.file("files", package = "epimatch")
indata <- dir(indata, full.names = TRUE)
x <- lapply(indata, read.csv, stringsAsFactors = FALSE)
names(x) <- basename(indata)

# We will use one data set from the case information and lab results
case <- x[["CaseInformationForm.csv"]]
lab <- x[["LaboratoryResultsForm7.csv"]]

# This will get all of the indices that match the ID and Names with a
# threshold of 0.25
res <- matchEpiData(dat1 = case,
                    dat2 = lab,
                    funlist = list(
                    list(d1vars = "ID",
                         d2vars = "ID",
                         fun = "nameDists",
                         extraparams = NULL,
                         weight = 1),
                    list(d1vars = c("Surname", "OtherNames"),
                         d2vars = c("SurnameLab", "OtherNameLab"),
                         fun = "nameDists",
                         extraparams = NULL,
                         weight = 0.5)
                    ),
                    thresh = 0.25)
# List of indices
res

# Printing out the matching names in decreasing order of matching
invisible(lapply(res, function(i) {
   print(case[i$d1, c("Surname", "OtherNames")])
   print(lab[i$d2, c("SurnameLab", "OtherNameLab")])
   cat("\n\t--------\n")
 }))