R/BRL.R

Defines functions BRL

Documented in BRL

#' Beta Record Linkage
#'
#' Beta record linkage methodology for probabilistic bipartite record linkage: the task of merging two duplicate-free 
#' datafiles that lack unique identifiers.  
#' This function runs all the steps of beta record linkage: creates comparisons of the records, runs Gibbs sampler on bipartite matchings,
#' and derives point estimate of bipartite matching (this determines the final linkage).  The parameters of \code{BRL} consist of 
#' all the parameters needed to run \code{\link{compareRecords}}, \code{\link{bipartiteGibbs}} and \code{\link{linkRecords}},
#' except for intermediate input/output, and in addition to a parameter \code{burn} for the burn-in period of the Gibbs sampler.
#' 
#' @param df1,df2 two datasets to be linked, of class \code{data.frame}, with rows representing records and columns
#' 				representing fields.  Without loss of generality, 
#' 				\code{df1} is assumed to have no less records than \code{df2}.
#' @param flds	a vector indicating the fields to be used in the linkage.  Either a \code{character} vector, in which case  
#' 				all entries need to be names of columns of \code{df1} and \code{df2}, or a \code{numeric} vector
#' 				indicating the columns in \code{df1} and \code{df2} to be used in the linkage.  If provided as a 
#' 				\code{numeric} vector it is assumed that the columns of \code{df1} and \code{df2} are organized such that 
#' 				it makes sense to compare the columns
#' 				\code{df1[,flds]} and \code{df2[,flds]} in that order.  
#' @param flds1,flds2	vectors indicating the fields of \code{df1} and \code{df2} to be used in the linkage.  
#' 				Either \code{character} vectors, in which case  
#' 				all entries need to be names of columns of \code{df1} and \code{df2}, respectively, or \code{numeric} vectors
#' 				indicating the columns in \code{df1} and \code{df2} to be used in the linkage.  It is assumed that
#'				it makes sense to compare the columns
#' 				\code{df1[,flds1]} and \code{df2[,flds2]} in that order.  These arguments are ignored if \code{flds} is specified.
#'				If none of \code{flds,flds1,flds2} are specified, the columns with the same names in \code{df1} and \code{df2} 
#'				are compared, if any.  
#' @param types	a vector of characters indicating the comparison type per comparison field.  The options
#'				are: \code{"lv"} for comparisons based on the Levenshtein edit distance normalized to \eqn{[0,1]}, with \eqn{0}  
#'				indicating no disagreement and \eqn{1} indicating maximum disagreement;  
#'				\code{"bi"} for binary comparisons (agreement/disagreement); \code{"nu"} for numeric comparisons computed as 
#'				the absolute difference. 
#' 				The default is \code{"lv"}.  Fields compared with the \code{"lv"} option are first transformed to \code{character}
#'				class.  Factors with different levels compared using the \code{"bi"} option are transformed to factors with the union 
#' 				of the levels.  Fields compared with the \code{"nu"} option need to be of class \code{numeric}.
#' @param breaks	break points for the comparisons to obtain levels of disagreement.  
#'				It can be a list of length equal to the number of comparison fields, containing one numeric vector with the break 
#'				points for each comparison field, where entries corresponding to comparison type \code{"bi"} are ignored.  
#'				It can also be a named list of length two with elements 'lv' and 'nu' 
#'				containing numeric vectors with the break 
#'				points for all Levenshtein-based and numeric comparisons, respectively.  
#'				Finally, it can be a numeric vector with the break points for all comparison fields of type \code{"lv"} and \code{"nu"},
#'				which might be meaningful only if all the non-binary comparisons are of a single type, either \code{"lv"} or \code{"nu"}.  
#'				For comparisons based on the normalized Levenshtein distance, a vector of length \eqn{L} of break 
#'				points for the interval \eqn{[0,1]} leads to \eqn{L+1} levels of disagreement.  Similarly, for comparisons based on the absolute 
#'				difference, the break points are for the interval \eqn{[0,\infty)}.  
#'				The default is \code{breaks=c(0,.25,.5)}, which might be meaningful only for comparisons of type \code{"lv"}.
#' 
#' @param nIter	number of iterations of Gibbs sampler.
#' 
#' @param burn	number of iterations to discard as part of the burn-in period.
#' 
#' @param a,b	hyper-parameters of the Dirichlet priors for the \eqn{m} and \eqn{u} parameters
#' 				in the model for the comparison data among matches and non-matches, respectively.
#' 				These can be vectors with as many 
#'				entries as disagreement levels among all comparison fields.  If specified as positive constants, they 
#'				get recycled to the required length.  If not specified, flat priors are taken.  
#' 
#' @param aBM,bBM	hyper-parameters of beta prior on bipartite matchings. Default is \code{aBM=bBM=1}.
#' 
#' @param seed	seed to be used for pseudo-random number generation.  By default it sets \code{seed=0}.
#' 
#' @param lFNM		individual loss of a false non-match in the loss functions of Sadinle (2017), default \code{lFNM=1}.
#' 
#' @param lFM1		individual loss of a false match of type 1 in the loss functions of Sadinle (2017), default \code{lFM1=1}.
#' 
#' @param lFM2		individual loss of a false match of type 2 in the loss functions of Sadinle (2017), default \code{lFM2=2}.
#' 
#' @param lR		individual loss of 'rejecting' to make a decision in the loss functions of Sadinle (2017), default \code{lR=Inf}.
#'
#' 
#' @details	Beta record linkage (BRL, Sadinle, 2017) is a methodology for probabilistic bipartite record linkage, that is, the task of merging two
#'			duplicate-free datafiles that lack unique identifiers.  This is accomplished by using the common partially identifying information 
#' 			of the entities contained in the datafiles.  The duplicate-free requirement means that we expect each entity to be represented maximum once in 
#'			each datafile.  This methodology should not be used with datafiles that contain duplicates nor should it be used for deduplicating a single datafile. 
#' 
#' 			The first step of BRL, accomplished by the function \code{\link{compareRecords}}, consists of constructing comparison vectors for each pair of records from the two datafiles.  
#'			The current implementation allows binary comparisons (agree/disagree), numerical comparisons based on the absolute difference, 
#' 			and Levenshtein-based comparisons.  
#' 			This can be easily extended to other comparison types, so a resourceful user should be able to construct an object that recreates 
#' 			the output of \code{\link{compareRecords}} for other types of comparisons (so long as they get transformed to levels of disagreement), and still be able to run the next step outside 
#' 			the function \code{BRL}.
#' 
#' 			The second step of BRL, accomplished by the function \code{\link{bipartiteGibbs}}, consists of running a Gibbs sampler that explores the space of bipartite matchings 
#' 			representing the plausible ways of linking the datafiles.  This sampler is derived from a model for the comparison data and a \emph{beta} prior 
#' 			distribution on the space of bipartite matchings.  See Sadinle (2017) for details. 
#' 
#' 			The third step of BRL, accomplished by the function \code{\link{linkRecords}}, consists of deriving a point estimate of the bipartite matching 
#' 			(which gives us the optimal way of linking the datafiles) 
#' 			by minimizing the expected value of 
#' 			a loss function that uses different penalties for different types of linkage errors.  The current implementation only supports the 
#' 			Bayes point estimates of bipartite matchings that can be obtained in closed form according to Theorems 1, 2 and 3 of Sadinle (2017).
#' 			The losses have to be positive numbers and satisfy one of three conditions:
#' 			\enumerate{
#' 						\item Conditions of Theorem 1 of Sadinle (2017):
#'						\code{(lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2)}
#' 						\item Conditions of Theorem 2 of Sadinle (2017):
#'						\code{((lFM2 >= lFM1) & (lFM1 >= 2*lR)) | ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM))}
#' 						\item Conditions of Theorem 3 of Sadinle (2017):
#'						\code{(lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)}
#' 			}
#' 			If one of the last two conditions is satisfied, the point estimate might be partial, meaning that there
#' 			might be some records in datafile 2 for which the point estimate does not include a linkage decision.
#' 			For combinations of losses not supported here, the linear sum assignment problem outlined by Sadinle (2017)
#' 			needs to be solved.
#' 
#' @return 	A vector containing the point estimate of the bipartite matching, as in the output of \code{\link{linkRecords}}.  If \code{lR != Inf} the output might be a partial estimate.
#'			A number smaller or equal to \code{n1} in entry \code{j} indicates the record in datafile 1 to which record \code{j} in datafile 2 
#'			gets linked, a number \code{n1+j} indicates that record \code{j} does not get linked to any record in datafile 1, and the value \code{-1} 
#'			indicates a 'rejection' to link, meaning that the correct linkage decision is not clear.
#'  
#' @references Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. \emph{Journal of the
#' American Statistical Association} 112(518), 600-612. [\href{https://doi.org/10.1080/01621459.2016.1148612}{Published}] [\href{https://arxiv.org/abs/1601.06630}{arXiv}]
#' 
#' @seealso \code{\link{compareRecords}} for examples on how to work with different types of comparison data, 
#' 			\code{\link{bipartiteGibbs}} for Gibbs sampler on bipartite matchings, and \code{\link{linkRecords}} for examples 
#'			on full and partial point estimates of the true bipartite matching that indicates which records to link.
#' 
#' @examples
#' data(twoFiles)
#' 
#' (Zhat <- BRL(df1, df2, flds=c("gname", "fname", "age", "occup"), 
#' 				types=c("lv","lv","bi","bi")))
#' 
#' n1 <- nrow(df1)
#' 
#' Ztrue <- df2ID
#' 
#' ## number of links (estimated matches)
#' nLinks <- sum(Zhat <= n1) 
#' 
#' ## number of actual matches according to the ground truth
#' nMatches <- sum(Ztrue <= n1) 
#' 
#' ## number of links that are actual matches
#' nCorrectLinks <- sum(Zhat[Zhat<=n1]==Ztrue[Zhat<=n1])
#' 
#' ## compute measures of performance 
#' 
#' ## precision 
#' nCorrectLinks/nLinks
#' 
#' ## recall
#' nCorrectLinks/nMatches
#' 
#' ## the linked record pairs
#' cbind( df1[Zhat[Zhat<=n1],], df2[Zhat<=n1,] )
#' 
#' ## finally, note that we could run BRL step by step as follows
#' 
#' ## create comparison data 
#' myCompData <- compareRecords(df1, df2, 
#'                              flds=c("gname", "fname", "age", "occup"), 
#'                              types=c("lv","lv","bi","bi"))
#' 
#' ## Gibbs sampling from posterior of bipartite matchings
#' chain <- bipartiteGibbs(myCompData)
#' 
#' ## bipartite matching Bayes estimate derived from the loss functions of Sadinle (2017)
#' Zhat2 <- linkRecords(chain$Z, n1=n1)
#' 
#' identical(Zhat, Zhat2)
#' 

BRL <- function(df1, df2, flds=NULL, flds1=NULL, flds2=NULL, types=NULL, breaks=c(0,.25,.5), 
				nIter=1000, burn=round(nIter*.1), a=1, b=1, aBM=1, bBM=1, seed=0,
				lFNM=1, lFM1=1, lFM2=2, lR=Inf){
	
	# 'burn' is not a parameter of the other functions, so we control it here
	if( !is.numeric(burn) | (burn<0) | (burn>=nIter) ) 
		stop("burn should be an integer that satisfies 0 <= burn < nIter")
	
	# create comparison data 
	myCompData <- compareRecords(df1, df2, flds, flds1, flds2, types, breaks)
	# Gibbs sampling from posterior of bipartite matchings
	chain <- bipartiteGibbs(myCompData, nIter, a, b, aBM, bBM, seed)
	# bipartite matching Bayes estimate derived from the loss functions of Sadinle (2017)
	Zhat <- linkRecords(chain$Z[,setdiff(1:nIter,seq_len(burn)),drop=FALSE], n1=nrow(df1), lFNM, lFM1, lFM2, lR)
	
	return(Zhat)
}
msadinle/BRL documentation built on Jan. 12, 2020, 1 a.m.