block: block


View source: R/block.R

Description

Calculates the Cartesian product of records in two dataframes and filters out pairs which are unlikely to be matches. Blocks can be formed by comparing exact values, by comparing values which are within a range of each other, and by string comparisons using vector encodings. Encoded vectors can be blocked using the binary method or by clustering the encoded vectors.
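The exact and numeric-range steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy dataframes, the column names (`idA`, `idB`, `state`, `birth_year`) and the range of 2 are all hypothetical.

```r
# Toy data; all column names here are illustrative assumptions.
dfA <- data.frame(idA = 1:3,
                  state = c("NY", "CA", "NY"),
                  birth_year = c(1980, 1975, 1990),
                  stringsAsFactors = FALSE)
dfB <- data.frame(idB = 1:4,
                  state = c("NY", "NY", "CA", "TX"),
                  birth_year = c(1981, 1995, 1975, 1980),
                  stringsAsFactors = FALSE)

# Full Cartesian product: every record in dfA paired with every record in dfB.
# With by = NULL, merge() returns the cross join and suffixes shared column names.
pairs <- merge(dfA, dfB, by = NULL, suffixes = c(".A", ".B"))

# Exact blocking: keep pairs that agree on state.
exact_ok <- pairs$state.A == pairs$state.B

# Numeric range blocking: keep pairs whose birth years differ by at most 2.
range_ok <- abs(pairs$birth_year.A - pairs$birth_year.B) <= 2

# Surviving candidate pairs, one row per pair.
blocks <- pairs[exact_ok & range_ok, c("idA", "idB")]
```

Materializing the full product is only practical for small inputs; it is used here to keep the filtering logic explicit.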

Usage

block(dfA, dfB, cols.exact = NULL, cols.numeric = NULL,
  numeric.range = NULL, cols.encoder = NULL,
  encoder.model.path = NULL, encoder.trainA = NULL,
  encoder.trainB = NULL, encoder.block.method = "binary",
  encoder.nclusters = 5, encoder.maxiter = 1000,
  known.matches = NULL, dim.latent = 8, dim.encode = 64,
  dim.decode = 64, max.length = 12, num.encode.layers = 2,
  num.decode.layers = 2, batch.size = 32, epochs = 500, lr = 5e-04,
  validation.split = 0.2, save.dir = "~/blocking_models/",
  reconstruct = TRUE, reconstruct.n = 5, reconstruct.display = 20,
  earlystop = FALSE, earlystop.patience = 10, tensorboard = FALSE,
  tensorboard.runid = as.character(Sys.time()), verbose = 2,
  n.cores = parallel::detectCores() - 1)

Arguments

dfA

Dataframe to be linked to dfB

dfB

Dataframe to be linked to dfA. For deduplication, this is dfA itself

cols.exact

List of exact match columns

cols.numeric

List of numeric range columns

numeric.range

Range within which values in the numeric columns are compared

cols.encoder

List of encoder columns

encoder.model.path

Path to encoder model

encoder.trainA

Vector of names to train

encoder.trainB

Vector of matching names to train

encoder.block.method

Blocking method for encoded vectors, either "binary" or "cluster"

encoder.nclusters

Number of clusters if blocking by cluster

encoder.maxiter

Max iterations in kmeans clustering

known.matches

Dataframe of known matches with first column having indices of matches from dfA and second column having indices of known matches from dfB

dim.latent

Number of latent dimensions

dim.encode

Number of encoding dimensions

dim.decode

Number of decoding dimensions

max.length

Maximum length of characters

num.encode.layers

Number of encoder layers

num.decode.layers

Number of decoder layers

batch.size

Training batch size

epochs

Number of training epochs

lr

Learning rate

validation.split

Fraction of the training data held out for validation

save.dir

Path to the save directory

reconstruct

Whether or not to show reconstructions

reconstruct.n

How many reconstructions to show

reconstruct.display

After how many epochs to show reconstructions

earlystop

If TRUE, stop training early once validation loss is no longer decreasing; if FALSE, train for all epochs

earlystop.patience

Number of epochs to wait while validation loss does not decrease before stopping training early

tensorboard

TRUE if tensorboard metrics are to be recorded. Logs are recorded in the /tmp/ directory

tensorboard.runid

Unique identifier for the run to separate tensorboard logs

verbose

Verbosity level for training output: 0 = silent, 1 = minimal, 2 = verbose

n.cores

Number of cores to parallelize over
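To illustrate how encoder.nclusters and encoder.maxiter relate to k-means, the following sketch clusters two sets of encoded vectors with `stats::kmeans` and pairs records that share a cluster. It assumes the two sets are clustered jointly; the page does not say how the package fits the clusters, and the simulated matrices and `id` columns are hypothetical stand-ins for real encodings.

```r
set.seed(1)

# Hypothetical encoded vectors: one row per record, one column per latent dimension.
# Two well-separated groups are simulated so the clusters are easy to see.
encoded.A <- rbind(matrix(rnorm(6, mean = -5), ncol = 2),
                   matrix(rnorm(6, mean = 5), ncol = 2))
encoded.B <- rbind(matrix(rnorm(4, mean = -5), ncol = 2),
                   matrix(rnorm(4, mean = 5), ncol = 2))

# Assumption: fit k-means on both sets together so they share cluster labels.
# centers plays the role of encoder.nclusters, iter.max of encoder.maxiter.
km <- kmeans(rbind(encoded.A, encoded.B), centers = 2, iter.max = 1000, nstart = 5)

nA <- nrow(encoded.A)
clusterA <- km$cluster[seq_len(nA)]
clusterB <- km$cluster[nA + seq_len(nrow(encoded.B))]

# Candidate pairs: records from A and B that fall in the same cluster.
pairs <- merge(data.frame(idA = seq_along(clusterA), cluster = clusterA),
               data.frame(idB = seq_along(clusterB), cluster = clusterB),
               by = "cluster")
```

More clusters produce smaller blocks and fewer candidate pairs, at the risk of separating true matches that land in neighbouring clusters.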

Value

blocklist object with 7 values

dfA

dataframeA with encoded vectors appended

dfB

dataframeB with encoded vectors appended

blocks

data.table with each row representing one pair of records

block.metrics

Metrics on block quality

encoder

encoder model if encoder was used

encoded.A

matrix of encoded values from dataframeA

encoded.B

matrix of encoded values from dataframeB
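As a rough illustration of what metrics on block quality can look like, the sketch below computes two commonly used blocking measures from a table of candidate pairs and a table of known matches. Whether block.metrics reports these particular quantities is not stated here; the toy `blocks` and `known.matches` tables and the `key` helper are hypothetical.

```r
# Hypothetical sizes of the two input dataframes.
nA <- 3
nB <- 4

# Hypothetical candidate pairs kept after blocking (row indices into dfA and dfB).
blocks <- data.frame(idA = c(1, 2, 3), idB = c(1, 3, 2))

# Known matches: first column indexes dfA, second column indexes dfB.
known.matches <- data.frame(idA = c(1, 2, 3), idB = c(1, 3, 4))

# Helper (illustrative): turn each pair into a single comparable key.
key <- function(df) paste(df[[1]], df[[2]], sep = "_")

# Pairs completeness: share of known matches that survived blocking.
retained <- key(known.matches) %in% key(blocks)
pairs_completeness <- mean(retained)

# Reduction ratio: share of the full Cartesian product that blocking removed.
reduction_ratio <- 1 - nrow(blocks) / (nA * nB)
```

A good blocking scheme keeps both values high: most known matches are retained while most of the Cartesian product is discarded.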


kailin-lu/recordlinkR documentation built on May 4, 2019, 7:37 a.m.