verify_ids: Verify record consistency across databases

View source: R/verify.R

verify_ids    R Documentation

Description

Compares demographic information across datasets to determine whether the entity identified by a given ID is the same in every dataset.

Usage

verify_ids(
  dat_list,
  id_col,
  unique_id_col,
  file = NULL,
  database_col = "database",
  variables = NULL,
  tolerances = NULL,
  extra_metrics = NULL,
  extra_cols = NULL,
  verbose = TRUE,
  ...
)

Arguments

dat_list

A named list of data.frames, one per database. The list names are stored in database_col.

id_col

The name of the ID, or primary key, column. For consistency, the column should have the same name in every dataset.

unique_id_col

The name of the row ID, or surrogate key, column. For consistency, the column should have the same name in every dataset.

file

If not NULL, the path where the output spreadsheet will be saved.

database_col

The name of the column that stores the dat_list names.
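
As a rough illustration of what dat_list and database_col describe, the base R sketch below tags each data.frame with the name of its list element and stacks the results. It is a minimal sketch under stated assumptions, not anara's internal code: the toy data, the column values, and the stacking approach are all hypothetical.

```r
# Toy inputs; the names and values are illustrative only.
dat_list <- list(
  database1 = data.frame(participant_id = c(1L, 2L), grade = c(3L, 4L)),
  database2 = data.frame(participant_id = c(1L, 2L), grade = c(3L, 5L))
)
database_col <- "database"

# Tag each data.frame with the name of its list element.
tagged <- Map(
  function(dat, name) {
    dat[[database_col]] <- name
    dat
  },
  dat_list,
  names(dat_list)
)

# Stack the tagged data.frames and drop the automatic row names.
stacked <- do.call(rbind, tagged)
rownames(stacked) <- NULL
stacked
```

After stacking, every row records which database it came from, which is the role the description assigns to database_col.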

variables

A character vector naming the integer or character columns to compare across datasets.

tolerances

If not NULL, a named list of tolerances. Each name must be one of the variable names passed to variables, and the meaning of each tolerance depends on the variable's type:

  • If the variable is an integer, the tolerance is the maximum difference allowed.

  • If the variable is a character, the tolerance is the maximum dissimilarity allowed, on a scale from 0 to 1.
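
The two tolerance rules can be sketched in base R as follows. This is only an illustration of the idea: the helper names are hypothetical, the integer check assumes an absolute difference, and the string dissimilarity is assumed to be Levenshtein distance (utils::adist) divided by the length of the longer string. The documentation does not say which dissimilarity measure anara uses, so the numbers below may differ from the package's.

```r
# Integer variables: two values agree when their difference is within the
# tolerance. ASSUMPTION: the difference is taken in absolute value.
within_int_tolerance <- function(a, b, tolerance = 0) {
  abs(a - b) <= tolerance
}

# Character variables: dissimilarity scaled to [0, 1].
# ASSUMPTION: Levenshtein distance divided by the longer string's length;
# anara may use a different measure. Works on single strings only.
string_dissimilarity <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) {
    return(0)
  }
  as.numeric(utils::adist(a, b)) / longest
}

within_chr_tolerance <- function(a, b, tolerance = 0) {
  string_dissimilarity(a, b) <= tolerance
}

within_int_tolerance(3L, 4L, tolerance = 1)  # TRUE: difference of 1
within_int_tolerance(3L, 5L, tolerance = 1)  # FALSE: difference of 2

# One deleted letter out of 11 characters gives a dissimilarity of 1/11.
within_chr_tolerance("Ms. Johnson", "Ms. Jonson", tolerance = 0.05)  # FALSE
within_chr_tolerance("Ms. Johnson", "Ms. Jonson", tolerance = 0.1)   # TRUE
```

Under this assumed measure, a tolerance of 0 demands an exact match, and small values such as 0.05 admit only minor spelling differences in long strings.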

extra_metrics

A metrics() call that contains a collection of metric() calls

extra_cols

A character vector naming columns to include in the output verification spreadsheet, mainly for reference and support during manual inspection.

verbose

If TRUE, enables logging.

...

Extra parameters passed to anara::fix_format

Value

A data.frame in the fix format

Examples

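if (FALSE) {
  # Not run: dat_1 and dat_2 stand for data.frames that contain the columns named below.
  anara::verify_ids(
    list(
      database1 = dat_1,
      database2 = dat_2
    ),
    id_col = "participant_id",
    unique_id_col = "unique_id",
    variables = c("female", "grade", "teacher_name", "form"),
    tolerances = list(
      form = 0,            # no difference allowed
      teacher_name = 0.05  # dissimilarity of at most 0.05
    ),
    # Reference columns carried into the output for manual inspection
    extra_cols = c(
      "start", "end",
      "incdnt_01", "incdnt_01_o", "incdnt_02", "incdnt_02_o"
    ),
    file = file.path("path", "to", "issues.csv")
  )
}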
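
The example above refers to dat_1 and dat_2 without defining them. The sketch below builds two hypothetical toy data.frames with the ID and comparison columns that example names, then runs a couple of base R sanity checks on the keys. All values are invented for illustration, the extra_cols columns are left out for brevity, and the checks are a suggestion rather than something verify_ids is documented to require.

```r
# Hypothetical toy data: participant_id identifies the entity (primary key),
# unique_id identifies the row (surrogate key).
dat_1 <- data.frame(
  unique_id = c("a-001", "a-002"),
  participant_id = c(101L, 102L),
  female = c(1L, 0L),
  grade = c(3L, 4L),
  teacher_name = c("Ms. Johnson", "Mr. Lee"),
  form = c(1L, 2L),
  stringsAsFactors = FALSE
)
dat_2 <- data.frame(
  unique_id = c("b-001", "b-002"),
  participant_id = c(101L, 102L),
  female = c(1L, 0L),
  grade = c(3L, 5L),
  teacher_name = c("Ms. Jonson", "Mr. Lee"),
  form = c(1L, 2L),
  stringsAsFactors = FALSE
)
dat_list <- list(database1 = dat_1, database2 = dat_2)

# Check that every dataset has the columns the call will refer to.
required <- c("participant_id", "unique_id", "female", "grade", "teacher_name", "form")
has_all_cols <- vapply(
  dat_list,
  function(dat) all(required %in% names(dat)),
  logical(1)
)

# Check that the row ID is unique within each dataset.
unique_ids_ok <- vapply(
  dat_list,
  function(dat) anyDuplicated(dat[["unique_id"]]) == 0L,
  logical(1)
)

has_all_cols
unique_ids_ok
```

In this toy data, participant 102 has a different grade in each database and participant 101 has a slightly different teacher_name spelling, which are the kinds of discrepancies the comparison is meant to surface.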

nyuglobalties/anara documentation built on July 17, 2024, 4:05 p.m.