block: block
In kailin-lu/recordlinkR: recordlinkR

Calculates the Cartesian product of records in two dataframes and filters out pairs that are unlikely to be matches. Blocks can be formed by comparing exact values, by comparing values that are within a range of each other, and by comparing strings using vector encodings. Encoded vectors can be blocked using the binary method or by clustering.
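The exact and numeric-range ideas described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy dataframes, the column names (`state`, `birth_year`) and the `year_range` value are hypothetical, chosen only to show how blocking shrinks the Cartesian product.

```r
# Toy data; column names are hypothetical.
dfA <- data.frame(idA = 1:4,
                  state = c("NY", "NY", "CA", "TX"),
                  birth_year = c(1980, 1991, 1975, 1988))
dfB <- data.frame(idB = 1:4,
                  state = c("NY", "CA", "CA", "TX"),
                  birth_year = c(1981, 1975, 1990, 1970))

# Exact blocking: keep only pairs that agree on `state`.
pairs <- merge(dfA, dfB, by = "state", suffixes = c(".A", ".B"))

# Numeric range blocking: keep pairs whose birth years differ by at most 2.
year_range <- 2
pairs <- pairs[abs(pairs$birth_year.A - pairs$birth_year.B) <= year_range, ]

n_all  <- nrow(dfA) * nrow(dfB)  # size of the full Cartesian product
n_kept <- nrow(pairs)            # candidate pairs left after blocking
```

Here the 16 possible pairs are reduced to the 2 that share a state and have birth years within the chosen range.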

block(dfA, dfB, cols.exact = NULL, cols.numeric = NULL,
  numeric.range = NULL, cols.encoder = NULL,
  encoder.model.path = NULL, encoder.trainA = NULL,
  encoder.trainB = NULL, encoder.block.method = "binary",
  encoder.nclusters = 5, encoder.maxiter = 1000,
  known.matches = NULL, dim.latent = 8, dim.encode = 64,
  dim.decode = 64, max.length = 12, num.encode.layers = 2,
  num.decode.layers = 2, batch.size = 32, epochs = 500, lr = 5e-04,
  validation.split = 0.2, save.dir = "~/blocking_models/",
  reconstruct = TRUE, reconstruct.n = 5, reconstruct.display = 20,
  earlystop = FALSE, earlystop.patience = 10, tensorboard = FALSE,
  tensorboard.runid = as.character(Sys.time()), verbose = 2,
  n.cores = parallel::detectCores() - 1)

`dfA`	Dataframe to be linked to `dfB`
`dfB`	Dataframe to be linked to `dfA`, if doing deduplication this is `dfA`
`cols.exact`	List of exact match columns
`cols.numeric`	List of numeric range columns
`numeric.range`	Range within which values in `cols.numeric` are considered close enough to block together
`cols.encoder`	List of encoder columns
`encoder.model.path`	Path to encoder model
`encoder.trainA`	Vector of names to train
`encoder.trainB`	Vector of matching names to train
`encoder.block.method`	binary or cluster
`encoder.nclusters`	Number of clusters if `encoder.block.method` is cluster
`encoder.maxiter`	Max iterations in kmeans clustering
`known.matches`	Dataframe of known matches with first column having indices of matches from dfA and second column having indices of known matches from dfB
`dim.latent`	Number of latent dimensions
`dim.encode`	Number of encoding dimensions
`dim.decode`	Number of decoding dimensions
`max.length`	Maximum length of characters
`num.encode.layers`	Number of encoder layers
`num.decode.layers`	Number of decoder layers
`batch.size`	Training batch size
`epochs`	Number of training epochs
`lr`	Learning rate
`validation.split`	Fraction of the training data held out for validation
`save.dir`	Path to the save directory
`reconstruct`	Whether or not to show reconstructions
`reconstruct.n`	How many reconstructions to show
`reconstruct.display`	After how many epochs to show reconstructions
`earlystop`	TRUE to stop early when validation loss is no longer decreasing; if FALSE, train for all epochs
`earlystop.patience`	Number of epochs to wait while validation loss does not decrease before stopping training early
`tensorboard`	TRUE if tensorboard metrics are to be recorded. Logs are recorded in the /tmp/ directory
`tensorboard.runid`	Unique identifier for the run to separate tensorboard logs
`verbose`	Verbosity level for training output, 0 = silent, 1 = minimal, 2 = verbose
`n.cores`	Number of cores to parallelize over
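As a rough illustration of blocking by cluster, the sketch below groups stand-in encoded vectors with `stats::kmeans`, using values that mirror the defaults `encoder.nclusters = 5`, `encoder.maxiter = 1000` and `dim.latent = 8`. The random matrices replace real encoder output, and pooling both sources before clustering is an assumption; the package's own procedure may differ.

```r
set.seed(1)

# Stand-ins for encoded vectors: 8 columns, mirroring the default dim.latent = 8.
encA <- matrix(rnorm(20 * 8), nrow = 20)
encB <- matrix(rnorm(30 * 8), nrow = 30)

# Cluster all encoded rows together so both sources share the same centers.
km <- kmeans(rbind(encA, encB), centers = 5, iter.max = 1000)
clusterA <- km$cluster[seq_len(nrow(encA))]
clusterB <- km$cluster[nrow(encA) + seq_len(nrow(encB))]

# Candidate pairs are those whose records fall in the same cluster.
pairs <- merge(data.frame(idA = seq_along(clusterA), cluster = clusterA),
               data.frame(idB = seq_along(clusterB), cluster = clusterB),
               by = "cluster")
```

Fewer clusters keep more candidate pairs per block, while more clusters prune harder at the risk of separating true matches.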

blocklist object with 7 values

`dfA`	dataframeA with encoded vectors appended
`dfB`	dataframeB with encoded vectors appended
`blocks`	data.table with each row representing one pair of records
`block.metrics`	Metrics on block quality
`encoder`	encoder model if encoder was used
`encoded.A`	matrix of encoded values from dataframeA
`encoded.B`	matrix of encoded values from dataframeB
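The source does not say which quantities `block.metrics` contains. As a hedged illustration of how block quality is commonly judged, the sketch below computes two standard measures, reduction ratio and pairs completeness, from a hypothetical table of candidate pairs and a `known.matches`-style dataframe (first column indexing dfA, second indexing dfB). All data and names here are made up for the example.

```r
# Hypothetical candidate pairs, one row per record pair.
blocks <- data.frame(idA = c(1, 1, 2, 3, 4),
                     idB = c(1, 2, 2, 3, 1))

# Known matches: first column indexes dfA, second column indexes dfB.
known <- data.frame(idA = c(1, 2, 4), idB = c(1, 2, 4))

n_A <- 4  # number of records in dfA
n_B <- 4  # number of records in dfB

# Reduction ratio: share of the Cartesian product removed by blocking.
reduction_ratio <- 1 - nrow(blocks) / (n_A * n_B)

# Pairs completeness: share of known matches that survive blocking.
key <- function(d) paste(d$idA, d$idB)
pairs_completeness <- mean(key(known) %in% key(blocks))
```

In this toy case blocking removes 11 of 16 pairs (reduction ratio 0.6875) but keeps only 2 of the 3 known matches.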