specify_prior | R Documentation |
Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.
specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)
comparison_list |
the output from a call to
|
mus, nus |
The hyperparameters of the Dirichlet priors for the |
flat |
A |
alphas |
The hyperparameters for the Dirichlet-multinomial overlap table
prior, a positive |
dup_upper_bound |
A |
dup_count_prior_family |
A |
dup_count_prior_pars |
A |
n_prior_family |
A |
n_prior_pars |
Currently set to |
The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.
For the hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
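To illustrate why an informative prior tends to be overwhelmed, the following base R sketch compares posterior means under a flat and an informative Dirichlet prior for a single comparison field. It is a simplified illustration, not the package's implementation: the counts and the informative hyperparameters are made up, and it assumes that a flat prior means all hyperparameters equal to 1.

```r
# Hypothetical counts of agreement levels for one comparison field among
# matches. These numbers are invented for illustration only.
level_counts <- c(9000, 700, 200, 100)

# Assumption: a "flat" Dirichlet prior has every hyperparameter equal to 1.
flat_prior <- rep(1, length(level_counts))
# A hypothetical informative prior favouring the higher disagreement levels.
informative_prior <- c(1, 1, 10, 20)

# Posterior mean of a Dirichlet prior combined with multinomial counts.
posterior_mean <- function(prior, counts) {
  (prior + counts) / sum(prior + counts)
}

flat_post <- posterior_mean(flat_prior, level_counts)
informative_post <- posterior_mean(informative_prior, level_counts)

# Largest absolute difference between the two posterior means.
max_diff <- max(abs(flat_post - informative_post))
print(round(rbind(flat = flat_post, informative = informative_post), 4))
print(max_diff)
```

With ten thousand observed comparisons, the two posterior means differ by well under one percentage point at every level, which is the sense in which the prior choice matters little here.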
For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:
Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.
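The components of the structured prior described above can be sketched in base R as log densities. This is only one plausible reading, not the package's code: the file sizes are hypothetical, the uniform prior for n is assumed to range over 1 to the total number of records, and the Poisson prior is assumed to be truncated to the counts 1 through dup_upper_bound[k] and renormalized. The package may parameterize these differently.

```r
# Hypothetical file sizes for three files (not taken from the package data).
file_sizes <- c(50, 60, 70)
total_records <- sum(file_sizes)

# Assumption: a uniform prior on the number of clusters n over
# 1..total_records, stored as a vector of log probabilities.
log_n_prior <- rep(-log(total_records), total_records)

# Assumption: a Poisson prior truncated to 1..upper_bound and renormalized.
# `log_truncated_poisson` is a hypothetical helper, not a package function.
log_truncated_poisson <- function(upper_bound, lambda = 1) {
  counts <- seq_len(upper_bound)
  log_dens <- dpois(counts, lambda = lambda, log = TRUE)
  log_dens - log(sum(exp(log_dens)))
}

# File 1 is assumed duplicate-free (bound of 1); files 2 and 3 allow up to
# 10 records per cluster.
dup_upper_bound <- c(1, 10, 10)
log_dup_count_prior <- lapply(dup_upper_bound, log_truncated_poisson)

print(exp(log_dup_count_prior[[2]][1:3]))
```

Under this reading, a bound of 1 forces all prior mass onto a single record per cluster, which corresponds to the no-duplicates case.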
a list containing:
mus
The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
nus
The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
flat
A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
no_dups
A numeric indicator of whether no duplicates are allowed in all of the files.
alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
alpha_0
The sum of alphas.
dup_upper_bound
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.
log_dup_count_prior
A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
log_n_prior
A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
nus_specified
The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.
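The shapes of alphas, alpha_0, and dup_upper_bound described above can be checked with a short base R sketch. The values are hypothetical, and the sketch assumes that comparison_list$K is the number of files and that the returned alphas is the user-supplied vector of length 2 ^ K - 1 with a leading 0 prepended, which is consistent with the examples passing rep(1, 7) for three files.

```r
# Hypothetical number of files; stands in for comparison_list$K.
K <- 3

# User-supplied hyperparameters, as in alphas = rep(1, 7) for three files.
user_alphas <- rep(1, 2^K - 1)

# Assumption: the returned vector prepends a 0, giving length 2 ^ K.
alphas <- c(0, user_alphas)
alpha_0 <- sum(alphas)

# Hypothetical file sizes; stands in for comparison_list$file_sizes.
file_sizes <- c(50, 60, 70)
dup_upper_bound <- c(1, 10, 70)

# Each bound should lie between 1 and the size of the corresponding file.
bounds_ok <- all(dup_upper_bound >= 1 & dup_upper_bound <= file_sizes)

print(length(alphas))
print(alpha_0)
print(bounds_ok)
```

Setting a bound equal to the file size, as for the third file here, is the way to express that no tighter upper bound is intended.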
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
  n_prior_pars = NA)