adjust_dups: Dedupe Ties in a vrmatch Output


View source: R/adjust_dups.R

Description

This function takes the 'vrmatch' output and removes tied duplicate matches from the probabilistic record linkage results.

Usage

adjust_dups(match, dedup_ids = c("lVoterUniqueID", "sAffNumber"))

Arguments

match

The vrmatch output in which duplicates are to be corrected.

dedup_ids

Voter IDs used in detecting and correcting duplicates. Defaults to c("lVoterUniqueID", "sAffNumber").

Details

Tied duplicates arise when (1) snapshot A was not deduplicated but (2) snapshot B was. The old duplicates in A that exactly match a record in B (on the matching variables, not necessarily on all variables) force that record in B to be duplicated so that it is matched to every one of the duplicates in A.

For instance, if you ask for 'dfA[c(1, 1, 2), ]' and 'dfA[c(1, 2), ]' to be matched, you will get three matched outcomes, 'dfA[c(1, 1, 2), ]', even when 'fastLink::dedupeMatches' has been called. It should be stressed that this is not a bug in fastLink, which has no means of distinguishing these perfect ties.
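The tie described above can be reproduced without fastLink. The following is a minimal sketch, assuming a hypothetical two-row 'dfA' with made-up 'name' and 'dob' fields, in which an exact join from base R stands in for the record linkage step:

```r
# Hypothetical toy data; the column names and values are illustrative only.
dfA <- data.frame(
  name = c("Ann Lee", "Bob Roe"),
  dob = c("1980-01-01", "1975-05-05"),
  stringsAsFactors = FALSE
)

snapA <- dfA[c(1, 1, 2), ]  # snapshot A still contains a duplicate of row 1
snapB <- dfA[c(1, 2), ]     # snapshot B has been deduplicated

# An exact join on the matching fields stands in for the linkage step.
# The single "Ann Lee" record in B pairs with both copies in A.
pairs <- merge(snapA, snapB, by = c("name", "dob"))
n_pairs <- nrow(pairs)  # 3 matched pairs, although B has only 2 records
print(pairs)
```

Because the two copies in A are identical on every matching field, nothing in the matching variables alone can say which pair to keep.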

However, for practical applications, we must sometimes correct for these duplicates, to see which observations have truly changed.

In this function, we use the internal voter ID to correct for these duplicates. Suppose that in snapshot A there are records with IDs a1 and a2 that share the same name, address, and date of birth, and that in snapshot B the duplicate a2 has been deleted by the Registrar of Voters. Between the two matches (1) a1-a1 and (2) a2-a1, we drop (2) and classify a2 as "record only in A".
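The rule can be sketched in base R as follows. This is an illustration of the idea under stated assumptions, not the package's actual implementation: the 'matches' data frame and its 'id_A' and 'id_B' columns are hypothetical stand-ins for the matched pairs and their internal voter IDs.

```r
# Hypothetical matched pairs: one row per (record in A, record in B) match.
matches <- data.frame(
  id_A = c("a1", "a2", "a3"),
  id_B = c("a1", "a1", "a3"),
  stringsAsFactors = FALSE
)

# A pair is tied when its B record was matched to more than one A record.
tied <- matches$id_B %in% matches$id_B[duplicated(matches$id_B)]

# Among tied pairs, keep only those whose internal IDs agree.
keep <- !tied | matches$id_A == matches$id_B
corrected <- matches[keep, ]

# A records whose matches were dropped become "record only in A".
only_A <- setdiff(matches$id_A, corrected$id_A)
print(corrected)
print(only_A)
```

Here the pair a2-a1 is dropped and a2 is reclassified as a record only in A. Note that this simplified sketch would drop every pair in a tied group in which no IDs agree, so it does not cover ties that the ID cannot resolve.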

Sometimes the following case occurs: a2-a3 vs. a1-a1. In most such cases, a2 is present in 'only_B': it was simply pushed aside by a1 because it had the same values in the fields used for matching. Hence we recommend calling this function after correcting for nonmatches that in fact share the same internal IDs (i.e., false negatives).

There are still exceptions, for which we break ties using other variables specified by 'tie_breakers'.

Note that this function is only relevant when 'vrmatch' was run without using the reference ID to exclude exact matches. In addition, the EM object and other outputs from fastLink are not updated after the correction and should not be used for inference.

Value

Corrected vrmatch output.


sysilviakim/voterdiffR documentation built on June 22, 2020, 6:51 p.m.