matchEpiData: Find duplicates in one or two data sets

Description

Find duplicates in one or two data sets

Usage

matchEpiData(dat1, dat2 = NULL, funlist = list(), thresh = 0.05,
  giveWeight = FALSE)

Arguments

dat1

An input linelist

dat2

An optional extra linelist

funlist

A list of lists, one per comparison, each containing:

  • d1vars - variable names for dataset 1

  • d2vars - variable names for dataset 2

  • fun - function name to process on these variables

  • extraparams - extra parameters that need to be applied with the function.

  • weights - a weight vector to scale each matrix (not used in processFunctionList).
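As an illustration of the structure described above, the following sketch builds a two-entry funlist by hand. The column names and the "nameDists" function name are taken from the Examples on this page; the weight field is spelled `weight` here because that is how the Examples spell it. Treat this as a sketch of the shape of the argument, not a statement about what the package validates.

```r
# One inner list per comparison; field names follow the bullet list above.
id_entry <- list(
  d1vars = "ID",        # column(s) to use from dat1
  d2vars = "ID",        # column(s) to use from dat2
  fun = "nameDists",    # function name, as used in this page's Examples
  extraparams = NULL,   # no extra arguments for the function
  weight = 1            # spelled `weight` in this page's Examples
)

name_entry <- list(
  d1vars = c("Surname", "OtherNames"),
  d2vars = c("SurnameLab", "OtherNameLab"),
  fun = "nameDists",
  extraparams = NULL,
  weight = 0.5
)

# The outer list is what gets passed as the funlist argument.
funlist <- list(id_entry, name_entry)

# Inspect the nesting without printing every value in full.
str(funlist, max.level = 2)
```

Building each entry as a named object first makes it easier to reuse or reorder comparisons than writing one deeply nested call.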

thresh

A threshold below which two rows are considered nearly identical.

giveWeight

A logical value indicating whether the output should contain weights (TRUE) or indices (FALSE, the default).

Details

This function takes one or two data sets, a list of functions to apply to specific columns of the data, and a threshold that determines what counts as a match. It returns the list produced by returnMatches, in which each element represents one potential match. Each element is itself a two-element list whose components hold either the indices or the weights of the rows that matched below the threshold.
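The idea of matching "below a threshold" can be sketched in base R. This is not the package's implementation: the toy data frames, the `norm_dist` helper, and the choice to combine per-column distances as a weighted average are all assumptions made for illustration.

```r
# Hypothetical toy data; not shipped with epimatch.
d1 <- data.frame(ID = c("A1", "B2", "C3"),
                 Surname = c("Smith", "Jones", "Brown"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c("A1", "B3", "Z9"),
                 Surname = c("Smyth", "Jones", "Green"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: edit distance from utils::adist(), scaled to [0, 1]
# by the length of the longer string in each pair.
norm_dist <- function(a, b) {
  raw <- utils::adist(a, b)
  longest <- outer(nchar(a), nchar(b), pmax)
  raw / longest
}

# One distance matrix per comparison (rows index d1, columns index d2).
dist_id <- norm_dist(d1$ID, d2$ID)
dist_name <- norm_dist(d1$Surname, d2$Surname)

# Assumption: combine the matrices as a weighted average.
w <- c(1, 0.5)
combined <- (w[1] * dist_id + w[2] * dist_name) / sum(w)

# Row pairs whose combined distance falls below the threshold are candidates.
thresh <- 0.25
hits <- which(combined < thresh, arr.ind = TRUE)
hits
```

Lowering `thresh` makes matching stricter, while raising it admits pairs that differ in more characters.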

Value

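A list, as returned by returnMatches, with one element per potential match. Each element is a two-element list containing either indices (the default) or weights, depending on giveWeight.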

Examples

## Loading Data
indata <- system.file("files", package = "epimatch")
indata <- dir(indata, full.names = TRUE)
x <- lapply(indata, read.csv, stringsAsFactors = FALSE)
names(x) <- basename(indata)

# We will use one data set from the case information and lab results
case <- x[["CaseInformationForm.csv"]]
lab <- x[["LaboratoryResultsForm7.csv"]]

# This will get all of the indices that match the ID and Names with a
# threshold of 0.25
res <- matchEpiData(dat1 = case,
                    dat2 = lab,
                    funlist = list(
                      list(d1vars = "ID",
                           d2vars = "ID",
                           fun = "nameDists",
                           extraparams = NULL,
                           weight = 1),
                      list(d1vars = c("Surname", "OtherNames"),
                           d2vars = c("SurnameLab", "OtherNameLab"),
                           fun = "nameDists",
                           extraparams = NULL,
                           weight = 0.5)
                    ),
                    thresh = 0.25)
# List of indices
res

# Printing out the matching names in decreasing order of matching
invisible(lapply(res, function(i) {
   print(case[i$d1, c("Surname", "OtherNames")])
   print(lab[i$d2, c("SurnameLab", "OtherNameLab")])
   cat("\n\t--------\n")
 }))

Hackout3/epimatch documentation built on May 6, 2019, 9:48 p.m.