match_statistics: Match statistics between a parent table and a child table.

View source: R/match-statistics.R

match_statistics R Documentation

Match statistics between a parent table and a child table.

Description

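Counts the parent records that are found (and not found) in the child table, and the child records that are found (and not found) in the parent table, when the two tables are matched on the join columns. Records with an NA in any join column are also counted. These statistics can reveal childless parents and orphaned children before the tables are joined.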

Usage

match_statistics(d_parent, d_child, join_columns)

Arguments

d_parent

A data.frame of the parent table.

d_child

A data.frame of the child table.

join_columns

A character vector of the column names used to join the parent and child tables.

Details

If a nonexistent column is passed to join_columns, an error is thrown that names the offending column.
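A check of this kind can be sketched in base R. The helper below, check_join_columns(), is hypothetical and is not part of OuhscMunge; its name, arguments, and error wording are assumptions chosen only to illustrate how a missing join column might be reported.

```r
# Hypothetical helper (not part of OuhscMunge): stop on the first join column
# that is absent from a data.frame, and name that column in the error.
check_join_columns <- function(d, join_columns, table_name) {
  missing_columns <- setdiff(join_columns, colnames(d))
  if (length(missing_columns) > 0L) {
    stop(
      "The ", table_name, " table is missing the join column `",
      missing_columns[1], "`."
    )
  }
  invisible(TRUE)
}

d <- data.frame(parent_id = 1:3, letter = c("a", "b", "c"))

# Every requested column exists, so this call passes silently.
check_join_columns(d, "parent_id", "parent")

# `index` does not exist in `d`, so the error message names it.
message_text <- tryCatch(
  check_join_columns(d, c("parent_id", "index"), "parent"),
  error = function(e) conditionMessage(e)
)
```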

More information about the 'parent' and 'child' terminology and concepts can be found in the Hierarchical Database Model Wikipedia entry, among many other sources.

Value

A numeric array of the following elements:

  • parent_in_child The count of parent records found in the child table.

  • parent_not_in_child The count of parent records not found in the child table.

  • parent_na_any The count of parent records with an NA in at least one of the join columns.

  • deadbeat_proportion The proportion of parent records not found in the child table.

  • child_in_parent The count of child records found in the parent table.

  • child_not_in_parent The count of child records not found in the parent table.

  • child_na_any The count of child records with an NA in at least one of the join columns.

  • orphan_proportion The proportion of child records not found in the parent table.
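To make the eight element names concrete, here is a minimal base R sketch for a single join column. It is not the package's implementation. The toy tables are invented, and the sketch assumes each proportion is taken over all rows of its own table. Only the parent table contains an NA, so the question of whether an NA key matches another NA key does not arise.

```r
# Toy tables; the values are illustrative only.
parent <- data.frame(parent_id = c(1L, 2L, 3L, NA))
child  <- data.frame(parent_id = c(2L, 2L, 3L, 9L))

# TRUE where a record's key appears in the other table.
parent_matched <- parent$parent_id %in% child$parent_id
child_matched  <- child$parent_id  %in% parent$parent_id

stats <- c(
  parent_in_child     = sum(parent_matched),
  parent_not_in_child = sum(!parent_matched),
  parent_na_any       = sum(is.na(parent$parent_id)),
  # Assumption: the denominator is every parent row.
  deadbeat_proportion = mean(!parent_matched),
  child_in_parent     = sum(child_matched),
  child_not_in_parent = sum(!child_matched),
  child_na_any        = sum(is.na(child$parent_id)),
  # Assumption: the denominator is every child row.
  orphan_proportion   = mean(!child_matched)
)
stats
```

In this toy data, parents 2 and 3 have children, while parent 1 and the NA parent do not. Child 9 is the only orphan.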

Note

The join_columns parameter is passed directly to dplyr::semi_join() and dplyr::anti_join().
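As a rough illustration of what that implies, the sketch below passes a character vector of two column names as the by argument of dplyr::semi_join() and dplyr::anti_join(). The toy tables are invented, and the way the row counts are assembled here is an assumption, not the package's own code.

```r
# Toy tables that share two join columns; the values are illustrative only.
parent <- data.frame(
  letter = c("a", "a", "b"),
  index  = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)
child <- data.frame(
  letter = c("a", "b", "c"),
  index  = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)
join_columns <- c("letter", "index")

# semi_join() keeps the rows of the first table that have a match in the second.
parent_in_child     <- nrow(dplyr::semi_join(parent, child, by = join_columns))
# anti_join() keeps the rows of the first table that have no match in the second.
parent_not_in_child <- nrow(dplyr::anti_join(parent, child, by = join_columns))

# Swapping the two tables gives the child-side counts.
child_in_parent     <- nrow(dplyr::semi_join(child, parent, by = join_columns))
child_not_in_parent <- nrow(dplyr::anti_join(child, parent, by = join_columns))
```

Because both functions return rows of the first table only, each pair of counts sums to that table's row count.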

Author(s)

Will Beasley

Examples

ds_parent <- data.frame(
  parent_id         = 1L:10L,
  letter            = rep(letters[1:5], each=2),
  index             = rep(1:2, times=5),
  dv                = runif(10),
  stringsAsFactors  = FALSE
)
ds_child <- data.frame(
  child_id          = 101:140,
  parent_id         = c(4, 5, rep(6L:14L, each=4), 15, 16),
  letter            = rep(letters[3:12], each=4),
  index             = rep(1:2, each=2, length.out=40),
  dv                = runif(40),
  stringsAsFactors  = FALSE
)

#Match on one column:
match_statistics(ds_parent, ds_child, join_columns="parent_id")

#Match on two columns:
match_statistics(ds_parent, ds_child, join_columns=c("letter", "index"))

## Produce better format for humans to read
match_statistics_display(ds_parent, ds_child, join_columns="parent_id")
match_statistics_display(ds_parent, ds_child, join_columns=c("letter", "index"))

OuhscBbmc/OuhscMunge documentation built on March 2, 2024, 11:44 a.m.