#' Beta Record Linkage
#'
#' Beta record linkage methodology for probabilistic bipartite record linkage: the task of merging two duplicate-free
#' datafiles that lack unique identifiers.
#' This function runs all the steps of beta record linkage: creates comparisons of the records, runs Gibbs sampler on bipartite matchings,
#' and derives point estimate of bipartite matching (this determines the final linkage). The parameters of \code{BRL} consist of
#' all the parameters needed to run \code{\link{compareRecords}}, \code{\link{bipartiteGibbs}} and \code{\link{linkRecords}},
#' except for intermediate input/output, and in addition to a parameter \code{burn} for the burn-in period of the Gibbs sampler.
#'
#' @param df1,df2 two datasets to be linked, of class \code{data.frame}, with rows representing records and columns
#' representing fields. Without loss of generality,
#' \code{df1} is assumed to have no less records than \code{df2}.
#' @param flds a vector indicating the fields to be used in the linkage. Either a \code{character} vector, in which case
#' all entries need to be names of columns of \code{df1} and \code{df2}, or a \code{numeric} vector
#' indicating the columns in \code{df1} and \code{df2} to be used in the linkage. If provided as a
#' \code{numeric} vector it is assumed that the columns of \code{df1} and \code{df2} are organized such that
#' it makes sense to compare the columns
#' \code{df1[,flds]} and \code{df2[,flds]} in that order.
#' @param flds1,flds2 vectors indicating the fields of \code{df1} and \code{df2} to be used in the linkage.
#' Either \code{character} vectors, in which case
#' all entries need to be names of columns of \code{df1} and \code{df2}, respectively, or \code{numeric} vectors
#' indicating the columns in \code{df1} and \code{df2} to be used in the linkage. It is assumed that
#' it makes sense to compare the columns
#' \code{df1[,flds1]} and \code{df2[,flds2]} in that order. These arguments are ignored if \code{flds} is specified.
#' If none of \code{flds,flds1,flds2} are specified, the columns with the same names in \code{df1} and \code{df2}
#' are compared, if any.
#' @param types a vector of characters indicating the comparison type per comparison field. The options
#' are: \code{"lv"} for comparisons based on the Levenshtein edit distance normalized to \eqn{[0,1]}, with \eqn{0}
#' indicating no disagreement and \eqn{1} indicating maximum disagreement;
#' \code{"bi"} for binary comparisons (agreement/disagreement); \code{"nu"} for numeric comparisons computed as
#' the absolute difference.
#' The default is \code{"lv"}. Fields compared with the \code{"lv"} option are first transformed to \code{character}
#' class. Factors with different levels compared using the \code{"bi"} option are transformed to factors with the union
#' of the levels. Fields compared with the \code{"nu"} option need to be of class \code{numeric}.
#' @param breaks break points for the comparisons to obtain levels of disagreement.
#' It can be a list of length equal to the number of comparison fields, containing one numeric vector with the break
#' points for each comparison field, where entries corresponding to comparison type \code{"bi"} are ignored.
#' It can also be a named list of length two with elements 'lv' and 'nu'
#' containing numeric vectors with the break
#' points for all Levenshtein-based and numeric comparisons, respectively.
#' Finally, it can be a numeric vector with the break points for all comparison fields of type \code{"lv"} and \code{"nu"},
#' which might be meaningful only if all the non-binary comparisons are of a single type, either \code{"lv"} or \code{"nu"}.
#' For comparisons based on the normalized Levenshtein distance, a vector of length \eqn{L} of break
#' points for the interval \eqn{[0,1]} leads to \eqn{L+1} levels of disagreement. Similarly, for comparisons based on the absolute
#' difference, the break points are for the interval \eqn{[0,\infty)}.
#' The default is \code{breaks=c(0,.25,.5)}, which might be meaningful only for comparisons of type \code{"lv"}.
#'
#' @param nIter number of iterations of Gibbs sampler.
#'
#' @param burn number of iterations to discard as part of the burn-in period.
#'
#' @param a,b hyper-parameters of the Dirichlet priors for the \eqn{m} and \eqn{u} parameters
#' in the model for the comparison data among matches and non-matches, respectively.
#' These can be vectors with as many
#' entries as disagreement levels among all comparison fields. If specified as positive constants, they
#' get recycled to the required length. If not specified, flat priors are taken.
#'
#' @param aBM,bBM hyper-parameters of beta prior on bipartite matchings. Default is \code{aBM=bBM=1}.
#'
#' @param seed seed to be used for pseudo-random number generation. By default it sets \code{seed=0}.
#'
#' @param lFNM individual loss of a false non-match in the loss functions of Sadinle (2017), default \code{lFNM=1}.
#'
#' @param lFM1 individual loss of a false match of type 1 in the loss functions of Sadinle (2017), default \code{lFM1=1}.
#'
#' @param lFM2 individual loss of a false match of type 2 in the loss functions of Sadinle (2017), default \code{lFM2=2}.
#'
#' @param lR individual loss of 'rejecting' to make a decision in the loss functions of Sadinle (2017), default \code{lR=Inf}.
#'
#'
#' @details Beta record linkage (BRL, Sadinle, 2017) is a methodology for probabilistic bipartite record linkage, that is, the task of merging two
#' duplicate-free datafiles that lack unique identifiers. This is accomplished by using the common partially identifying information
#' of the entities contained in the datafiles. The duplicate-free requirement means that we expect each entity to be represented maximum once in
#' each datafile. This methodology should not be used with datafiles that contain duplicates nor should it be used for deduplicating a single datafile.
#'
#' The first step of BRL, accomplished by the function \code{\link{compareRecords}}, consists of constructing comparison vectors for each pair of records from the two datafiles.
#' The current implementation allows binary comparisons (agree/disagree), numerical comparisons based on the absolute difference,
#' and Levenshtein-based comparisons.
#' This can be easily extended to other comparison types, so a resourceful user should be able to construct an object that recreates
#' the output of \code{\link{compareRecords}} for other types of comparisons (so long as they get transformed to levels of disagreement), and still be able to run the next step outside
#' the function \code{BRL}.
#'
#' The second step of BRL, accomplished by the function \code{\link{bipartiteGibbs}}, consists of running a Gibbs sampler that explores the space of bipartite matchings
#' representing the plausible ways of linking the datafiles. This sampler is derived from a model for the comparison data and a \emph{beta} prior
#' distribution on the space of bipartite matchings. See Sadinle (2017) for details.
#'
#' The third step of BRL, accomplished by the function \code{\link{linkRecords}}, consists of deriving a point estimate of the bipartite matching
#' (which gives us the optimal way of linking the datafiles)
#' by minimizing the expected value of
#' a loss function that uses different penalties for different types of linkage errors. The current implementation only supports the
#' Bayes point estimates of bipartite matchings that can be obtained in closed form according to Theorems 1, 2 and 3 of Sadinle (2017).
#' The losses have to be positive numbers and satisfy one of three conditions:
#' \enumerate{
#' \item Conditions of Theorem 1 of Sadinle (2017):
#' \code{(lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2)}
#' \item Conditions of Theorem 2 of Sadinle (2017):
#' \code{((lFM2 >= lFM1) & (lFM1 >= 2*lR)) | ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM))}
#' \item Conditions of Theorem 3 of Sadinle (2017):
#' \code{(lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)}
#' }
#' If one of the last two conditions is satisfied, the point estimate might be partial, meaning that there
#' might be some records in datafile 2 for which the point estimate does not include a linkage decision.
#' For combinations of losses not supported here, the linear sum assignment problem outlined by Sadinle (2017)
#' needs to be solved.
#'
#' @return A vector containing the point estimate of the bipartite matching, as in the output of \code{\link{linkRecords}}. If \code{lR != Inf} the output might be a partial estimate.
#' A number smaller or equal to \code{n1} in entry \code{j} indicates the record in datafile 1 to which record \code{j} in datafile 2
#' gets linked, a number \code{n1+j} indicates that record \code{j} does not get linked to any record in datafile 1, and the value \code{-1}
#' indicates a 'rejection' to link, meaning that the correct linkage decision is not clear.
#'
#' @references Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. \emph{Journal of the
#' American Statistical Association} 112(518), 600-612. [\href{https://doi.org/10.1080/01621459.2016.1148612}{Published}] [\href{https://arxiv.org/abs/1601.06630}{arXiv}]
#'
#' @seealso \code{\link{compareRecords}} for examples on how to work with different types of comparison data,
#' \code{\link{bipartiteGibbs}} for Gibbs sampler on bipartite matchings, and \code{\link{linkRecords}} for examples
#' on full and partial point estimates of the true bipartite matching that indicates which records to link.
#'
#' @examples
#' data(twoFiles)
#'
#' (Zhat <- BRL(df1, df2, flds=c("gname", "fname", "age", "occup"),
#' types=c("lv","lv","bi","bi")))
#'
#' n1 <- nrow(df1)
#'
#' Ztrue <- df2ID
#'
#' ## number of links (estimated matches)
#' nLinks <- sum(Zhat <= n1)
#'
#' ## number of actual matches according to the ground truth
#' nMatches <- sum(Ztrue <= n1)
#'
#' ## number of links that are actual matches
#' nCorrectLinks <- sum(Zhat[Zhat<=n1]==Ztrue[Zhat<=n1])
#'
#' ## compute measures of performance
#'
#' ## precision
#' nCorrectLinks/nLinks
#'
#' ## recall
#' nCorrectLinks/nMatches
#'
#' ## the linked record pairs
#' cbind( df1[Zhat[Zhat<=n1],], df2[Zhat<=n1,] )
#'
#' ## finally, note that we could run BRL step by step as follows
#'
#' ## create comparison data
#' myCompData <- compareRecords(df1, df2,
#' flds=c("gname", "fname", "age", "occup"),
#' types=c("lv","lv","bi","bi"))
#'
#' ## Gibbs sampling from posterior of bipartite matchings
#' chain <- bipartiteGibbs(myCompData)
#'
#' ## bipartite matching Bayes estimate derived from the loss functions of Sadinle (2017)
#' Zhat2 <- linkRecords(chain$Z, n1=n1)
#'
#' identical(Zhat, Zhat2)
#'
BRL <- function(df1, df2, flds=NULL, flds1=NULL, flds2=NULL, types=NULL, breaks=c(0,.25,.5),
nIter=1000, burn=round(nIter*.1), a=1, b=1, aBM=1, bBM=1, seed=0,
lFNM=1, lFM1=1, lFM2=2, lR=Inf){
# 'burn' is not a parameter of the other functions, so we control it here
if( !is.numeric(burn) | (burn<0) | (burn>=nIter) )
stop("burn should be an integer that satisfies 0 <= burn < nIter")
# create comparison data
myCompData <- compareRecords(df1, df2, flds, flds1, flds2, types, breaks)
# Gibbs sampling from posterior of bipartite matchings
chain <- bipartiteGibbs(myCompData, nIter, a, b, aBM, bBM, seed)
# bipartite matching Bayes estimate derived from the loss functions of Sadinle (2017)
Zhat <- linkRecords(chain$Z[,setdiff(1:nIter,seq_len(burn)),drop=FALSE], n1=nrow(df1), lFNM, lFM1, lFM2, lR)
return(Zhat)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.