preflight: Check whether dataset is ready for linkage

View source: R/prep.R

preflightR Documentation

Check whether dataset is ready for linkage

Description

preflight checks whether a dataset is ready for linkage

Usage

preflight(
  dat,
  vrbs = NULL,
  modstring = c("m_boost_stel_rein", "m_rf_baptisms_sparse", "m_rf_baptisms_full")
)

Arguments

dat

the dataset to check, as a data.table.

vrbs

a vector of the names of variables to check. Defaults to NULL, in which case variable names are extracted from the model specified with modstring

modstring

the name of the pretrained model to check against

Details

Preflight does a number of checks on the dataset.

  • The type of the variables

  • The share of missing observations by variable, and whether they are coded as missing values or empty strings

  • The share of length one observations by variable (these might break linkage)

  • The share of case (lower, upper, title) by variable. capelinker currently does not automatically convert to a single case, but leaves this choice to the user. Different cases do count for the string distance.

  • The share of accented letters and non-alphabetic symbols by variable. A gain, these are not automatically fixed, but left to the user's discretion. Note that some analphabetics (for instance "." or "-") could be frequently and consistently applied in the data. String distance calculation does not fail on accented letters, but it does count towards the string distance.

  • The set of unique characters in each of the variables.

  • Whether the variables required for one of the pretrained models in capelinker are present and correctly named.

Capelinker also checks the dataset for the requirements of a number of pretrainined models in the capelinker package. The following models are available:

  • m_boost_stel_rein a model linking households from one year to the next in the opgaafrollen (default).

  • m_rf_baptisms_sparse a model linking parents in baptism records to marriage records, based on minimal information: male surname (mlast), male first name (mfirst), female first name (wfirst, female surname not used because it would typically not be reported in the baptism records), and year of marriage/baptism (year).

  • m_rf_baptisms_full a model linking parents in baptism and marriage records, using additional information: initials, profession, and soundex distances of the names. Performance is not much better than the sparse model.

The following variables are expected by all these models:

  • mlast the male surname.

  • mfirst the male first name.

  • wlast the female last name.

  • wfistthe female first name.

  • minitils male initials in the form JF (so no K, and no punctuation)

  • winitils female initials in the form JF (so no K, and no punctuation)

  • year the year of observation of the two records

The baptism record models also expect and check:

  • mprof the male profession

The opgaafrollen model also expects and checks:

  • settlerchildren

Value

Text to the console showing the results of the tests.

Examples

d2 = data.table::data.table(mlast = c("jongh", "Jong", "smit (Smid)"), persid = c(1:3))
preflight(d2)


rijpma/capelinker documentation built on Nov. 7, 2024, 3:06 a.m.