vrmatch: Voter Registration Database Snapshot Matching


View source: R/vrmatch.R

Description

This function performs probabilistic record linkage between each pair of consecutive, user-supplied snapshots of the voter file. By default, records that match exactly on all fields between two snapshots are excluded from the record linkage, for computational reasons. When there are multiple pairs to match, the function uses a loop rather than a functional alternative such as purrr::map, because loading and wrangling all snapshots simultaneously will often crash the machine.
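The pairwise loop described above can be sketched as follows. This is a minimal illustration rather than the package's code: the snapshot labels and the match_pair() helper are hypothetical stand-ins for loading two snapshots and linking them.

```r
# Hypothetical snapshot labels; in practice these would come from `date_df`.
snapshot_labels <- c("2018-04-05", "2018-04-12", "2018-04-19")

# Hypothetical stand-in for "load two snapshots and link them".
match_pair <- function(label_a, label_b) {
  paste(label_a, "->", label_b)
}

# One result per consecutive pair of snapshots.
results <- vector("list", length(snapshot_labels) - 1)
for (i in seq_len(length(snapshot_labels) - 1)) {
  # Only the two snapshots of the current pair are handled per iteration.
  results[[i]] <- match_pair(snapshot_labels[i], snapshot_labels[i + 1])
}
names(results) <- snapshot_labels[-1]
```

Processing one pair per iteration keeps at most two snapshots loaded at a time, which is the motivation given above for preferring a loop.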

Usage

vrmatch(date_df, exact_exclude = TRUE, sample_exact = FALSE,
  sample_id = FALSE, sample_size = NULL, sample_perc = NULL,
  block = FALSE, path_clean = "clean_df", path_changes = "changes",
  path_reports = "reports", path_matches = "matches",
  clean_prefix = "df_cleaned_", clean_suffix = "",
  exist_files = FALSE, varnames, varnames_str, varnames_num = NULL,
  varnames_id = NULL, partial.match = NULL, varnames_block = NULL,
  vars_change = NULL, n.cores = NULL, file_type = ".Rda",
  date_label = "date_label", nrow = "nrow", seed = 123, ...)

Arguments

date_df

Data frame containing the list of snapshots.

exact_exclude

Whether to exclude full exact matches between snapshots when doing probabilistic record linkage. Defaults to TRUE.

sample_exact

Whether to add random samples of full exact matches to correct for the underlying population's value distribution in each field. Defaults to FALSE.

sample_id

Whether to add random samples of ID matches (records with some changes) to correct for the underlying population's value distribution in each field. Defaults to FALSE.

sample_size

Sample size of the random sample to add. Defaults to NULL.

sample_perc

Sample percentage of the random sample to add. Defaults to NULL. If both 'sample_size' and 'sample_perc' are NULL, 'sample_perc' is set to 0.01 (1%). When both are supplied, 'sample_perc' is chosen over 'sample_size'.
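The interaction between 'sample_size' and 'sample_perc' can be sketched as below. The helper resolve_sample_n() is hypothetical and not part of the package; the rounding rule and the cap at the number of available records are assumptions made for illustration.

```r
# Hypothetical helper: how many records to sample out of `n_records`.
resolve_sample_n <- function(n_records, sample_size = NULL, sample_perc = NULL) {
  # If neither is given, fall back to 1 percent, as documented.
  if (is.null(sample_size) && is.null(sample_perc)) {
    sample_perc <- 0.01
  }
  if (!is.null(sample_perc)) {
    # 'sample_perc' is chosen over 'sample_size'; rounding is an assumption.
    n <- round(n_records * sample_perc)
  } else {
    n <- sample_size
  }
  # Capping at the available records is an assumption.
  min(n, n_records)
}

n_default <- resolve_sample_n(10000)
n_size    <- resolve_sample_n(10000, sample_size = 500)
n_both    <- resolve_sample_n(10000, sample_size = 500, sample_perc = 0.02)
```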

block

Whether to employ blocking. Defaults to FALSE.

path_clean

Path to the cleaned snapshots. Defaults to "clean_df".

path_changes

Path where the extracted changes are output to. Defaults to "changes".

path_reports

Path where the summarized changes are output to. Defaults to "reports".

path_matches

Path where the match outcomes are output to. Defaults to "matches".

clean_prefix

File prefixes for cleaned snapshots. Defaults to "df_cleaned_".

clean_suffix

File suffixes for cleaned snapshots. Defaults to empty string.
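As a rough illustration of how the path, prefix, and suffix arguments might combine, the sketch below builds file names for two snapshots. The naming scheme (prefix, then date label, then suffix, then file type) is an assumption inferred from the argument descriptions, and the date labels are hypothetical.

```r
# Documented defaults.
path_clean   <- "clean_df"
clean_prefix <- "df_cleaned_"
clean_suffix <- ""
file_type    <- ".Rda"

# Hypothetical snapshot labels.
date_labels <- c("2018-04-05", "2018-04-12")

# Assumed scheme: <path_clean>/<prefix><label><suffix><file_type>
snapshot_files <- file.path(
  path_clean,
  paste0(clean_prefix, date_labels, clean_suffix, file_type)
)
```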

exist_files

Whether previously performed match outcomes exist. Defaults to FALSE.

varnames

Variables on which to perform probabilistic record linkage.

varnames_str

String variables for matching.

varnames_num

Numeric variables for matching. Defaults to NULL, in which case it will be setdiff(varnames, varnames_str).
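The default for 'varnames_num' described above amounts to the following. The variable names are hypothetical examples.

```r
# Hypothetical matching variables.
varnames     <- c("first_name", "last_name", "birth_year", "zip")
varnames_str <- c("first_name", "last_name")
varnames_num <- NULL

# When left as NULL, every non-string variable is treated as numeric.
if (is.null(varnames_num)) {
  varnames_num <- setdiff(varnames, varnames_str)
}
```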

varnames_id

Voter ID variables, if any exist; these are excluded from probabilistic record linkage (PRL) when IDs match. Defaults to NULL.

partial.match

Variables to be partially matched. Defaults to NULL, in which case all of varnames_str are used.

varnames_block

Nested list of variables or their combinations for blocking passes.

vars_change

Variables to track changes of. Defaults to NULL, which will then track all variables.

n.cores

Number of cores over which to parallelize the matching. Defaults to NULL, in which case half of the available threads are used.
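A default of half the available threads could be computed roughly as follows. This is a sketch using parallel::detectCores(); whether the package uses this exact call, and the floor of one core, are assumptions.

```r
n.cores <- NULL

if (is.null(n.cores)) {
  # detectCores() may return NA on some platforms.
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L
  # Use half of the threads, but never fewer than one.
  n.cores <- max(1L, floor(available / 2))
}
```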

file_type

Input file type. Defaults to ".Rda".

date_label

Labels for dates (i.e., snapshot IDs), in 'date_df'. Defaults to "date_label".

nrow

Name of the list element that will contain the number of rows of the input data frames. Defaults to "nrow".

seed

Seed to set. Defaults to 123.

...

Other parameters for fastLink.

Value

A nested list of matched dataframes, fastLink output, and arguments.


sysilviakim/voterdiffR documentation built on June 22, 2020, 6:51 p.m.