specify_prior: Specify the Prior Distributions
In multilink: Multifile Record Linkage and Duplicate Detection

Description

Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.

Usage

specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)

Arguments

`comparison_list`	The output from a call to `create_comparison_data` or `reduce_comparison_data`. Note that in order to correctly specify the prior, if `reduce_comparison_data` is used to reduce the number of record pairs that are potential matches, then the output of `reduce_comparison_data` (not `create_comparison_data`) should be used for this argument.
`mus, nus`	The hyperparameters of the Dirichlet priors for the `m` and `u` parameters for the comparisons among matches and non-matches, respectively. These are positive `numeric` vectors which have length equal to the number of columns of `comparison_list$comparisons` times the number of file pairs `(comparison_list$K * (comparison_list$K + 1) / 2)`. If set to `NA`, flat priors are used. We recommend using flat priors for `m` and `u`.
`flat`	A `numeric` indicator of whether a flat prior for partitions should be used. `flat` should be `1` if a flat prior is used, and `flat` should be `0` if a structured prior is used. If a flat prior is used, the remaining arguments should be set to `NA`. Otherwise, the remaining arguments should be specified. We do not recommend using a flat prior for partitions in general.
`alphas`	The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive `numeric` vector of length `2 ^ comparison_list$K - 1`. The indexing of these hyperparameters is based on the `comparison_list$K`-bit binary representation of the inclusion patterns of the overlap table. To give a few examples, suppose `comparison_list$K` is `3`. `1` in `3`-bit binary is `001`, so `alphas[1]` is the hyperparameter for the `001` cell of the overlap table, representing clusters containing only records from the third file. `2` in `3`-bit binary is `010`, so `alphas[2]` is the hyperparameter for the `010` cell of the overlap table, representing clusters containing only records from the second file. `3` in `3`-bit binary is `011`, so `alphas[3]` is the hyperparameter for the `011` cell of the overlap table, representing clusters containing only records from the second and third files. If set to `NA`, the hyperparameters will all be set to `1`.
`dup_upper_bound`	A `numeric` vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file `k`, `dup_upper_bound[k]` should be between `1` and `comparison_list$file_sizes[k]`. That is, even if you do not want to impose an upper bound, one is implicitly in place: the number of records in the file. If set to `NA`, the upper bound for file `k` will be set to `1` if no duplicates are allowed for that file, or `comparison_list$file_sizes[k]` if duplicates are allowed for that file.
`dup_count_prior_family`	A `character` vector indicating the prior distribution family used for the number of duplicates in each cluster, for each file. Currently the only option is `"Poisson"` for a Poisson prior, truncated to lie between `1` and `dup_upper_bound[k]`. The mean parameter of the Poisson distribution is specified using the `dup_count_prior_pars` argument. If set to `NA`, a Poisson prior with mean `1` will be used.
`dup_count_prior_pars`	A `list` containing the parameters for the prior distribution for the number of duplicates in each cluster, for each file. For file `k`, when `dup_count_prior_family[k]="Poisson"`, `dup_count_prior_pars[[k]]` is a positive constant representing the mean of the Poisson prior.
`n_prior_family`	A `character` indicating the prior distribution family used for `n`, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using `reduce_comparison_data`. Currently there are two options: `"uniform"` for a uniform prior for `n`, i.e. `p(n) \propto 1`, and `"scale"` for a scale prior for `n`, i.e. `p(n) \propto 1/n`. If set to `NA`, a uniform prior will be used.
`n_prior_pars`	Currently set to `NA`. When more prior distribution families for `n` are implemented, this will be a vector of parameters for those priors.
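
The binary indexing of `alphas` described above can be illustrated with a short base R sketch. The helper `inclusion_pattern` is hypothetical and not part of multilink, and `K` is an assumed value standing in for `comparison_list$K`.

```r
# Number of files; an assumed value standing in for comparison_list$K
K <- 3

# Hypothetical helper (not part of multilink): the K-bit binary representation
# of index i, where the j-th bit indicates whether file j is represented
inclusion_pattern <- function(i, K) {
  bits <- (i %/% 2^((K - 1):0)) %% 2
  paste(bits, collapse = "")
}

# One inclusion pattern per element of alphas (length 2^K - 1)
patterns <- vapply(seq_len(2^K - 1), inclusion_pattern, character(1), K = K)
names(patterns) <- paste0("alphas[", seq_along(patterns), "]")
print(patterns)
```

Reading the output for `K = 3`, `alphas[1]` maps to `001` (third file only) and `alphas[3]` maps to `011` (second and third files), matching the examples given for the `alphas` argument.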

Details

The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.

For the hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
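
As a rough illustration of the required length of mus and nus, the following base R sketch computes it from the formula in the argument description. The values of `K` and `n_comparison_cols` are assumptions chosen for illustration, and whether the package fills in exactly a vector of ones when mus=NA or nus=NA is also an assumption; the sketch relies only on the fact that a Dirichlet prior with all hyperparameters equal to 1 is flat.

```r
# Assumed values for illustration only
K <- 3                    # stands in for comparison_list$K
n_comparison_cols <- 10   # stands in for ncol(comparison_list$comparisons)

# K * (K + 1) / 2 counts unordered file pairs, including each file paired
# with itself
n_file_pairs <- K * (K + 1) / 2

# Required length of the mus and nus vectors
hyper_length <- n_comparison_cols * n_file_pairs

# A Dirichlet prior with all hyperparameters equal to 1 is flat
mus_flat <- rep(1, hyper_length)
nus_flat <- rep(1, hyper_length)
```

An informative specification would replace some entries with other positive values while keeping the same length.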

For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:

Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
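
The components for n and for the number of duplicates can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `file_sizes` and `duplicates` are made-up inputs, `log_trunc_pois` is a hypothetical helper, the range of n is assumed to run from 1 to the total number of records, and the log priors for n are written only up to an additive constant.

```r
# Made-up inputs for illustration only
file_sizes <- c(50, 60, 40)   # number of records in each file
duplicates <- c(0, 1, 1)      # 1 if duplicates are allowed in a file, else 0

# Log prior for n, up to an additive constant, assuming n ranges over
# 1, ..., total number of records
n_values <- seq_len(sum(file_sizes))
log_n_uniform <- rep(0, length(n_values))   # p(n) proportional to 1
log_n_scale <- -log(n_values)               # p(n) proportional to 1 / n

# Default upper bounds as described for dup_upper_bound = NA: 1 when no
# duplicates are allowed, otherwise the file size
dup_upper_bound <- ifelse(duplicates == 1, file_sizes, 1)

# Hypothetical helper: log density of a Poisson distribution truncated to
# lie between 1 and upper, renormalized to sum to 1 on that range
log_trunc_pois <- function(upper, pois_mean = 1) {
  log_p <- dpois(seq_len(upper), lambda = pois_mean, log = TRUE)
  log_p - log(sum(exp(log_p)))
}

# One vector of log densities per file, using the default mean of 1
log_dup_count_prior <- lapply(dup_upper_bound, log_trunc_pois, pois_mean = 1)
```

When a file has an upper bound of 1, the truncated prior puts all of its mass on a single record per cluster, so its only log density value is 0.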

Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.

Value

A list containing:

mus: The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
nus: The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
flat: A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
no_dups: A numeric indicator of whether duplicates are disallowed in all of the files.
alphas: The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
alpha_0: The sum of alphas.
dup_upper_bound: A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] is between 1 and comparison_list$file_sizes[k]. That is, even if you do not want to impose an upper bound, one is implicitly in place: the number of records in the file.
log_dup_count_prior: A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
log_n_prior: A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
nus_specified: The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.
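
The relationship between the alphas argument (length 2 ^ K - 1) and the returned alphas (length 2 ^ K, first element 0) can be sketched in base R as below. The names `alphas_in` and `alphas_out` are hypothetical, and reading the leading 0 as the cell for the empty inclusion pattern is an interpretation rather than something stated in the documentation.

```r
K <- 3   # assumed number of files, standing in for comparison_list$K

# Hyperparameters as supplied to the alphas argument (default: all ones)
alphas_in <- rep(1, 2^K - 1)

# Returned form as described above: length 2^K with a leading 0, so that
# element i + 1 holds the hyperparameter for overlap-table cell i
alphas_out <- c(0, alphas_in)

# alpha_0 is the sum of the hyperparameters
alpha_0 <- sum(alphas_out)
```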

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi:10.1080/01621459.2021.2013242

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
 alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
 dup_count_prior_family = NA, dup_count_prior_pars = NA,
 n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)

multilink documentation built on July 9, 2023, 6:42 p.m.