check_dates: Produce a dictionary of non-valid date values within a...

View source: R/check_dates.R

check_dates    R Documentation

Produce a dictionary of non-valid date values within a dataset, for use in subsequent data cleaning

Description

The resulting cleaning dictionary can be manually reviewed to fill in, for each non-valid date value, either an appropriate replacement value or a missing-value keyword indicating that the value should be converted to NA. The completed dictionary can then be used with function clean_dates.

As with check_numeric, values are considered 'non-valid' if they cannot be coerced using a given function. The default date-coercing function is parse_dates, which can handle a wide variety of date formats, but the user can instead specify a simpler function such as as.Date. The user may also specify additional expressions that indicate a non-valid date value. For example, the expression date_admit > Sys.Date() could be used to check for admission dates in the future.
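The coercion rule described above can be illustrated with a minimal base R sketch. It is not the package's implementation: parse_iso is a hypothetical stand-in for the fn argument, and the sketch assumes a value counts as non-valid when it is present in the raw data but becomes NA after coercion.

```r
# Raw date values as they might appear in a dataset (illustrative only)
raw <- c("2021-03-10", "15/03/2021", "unknown", NA)

# Hypothetical coercing function: accepts only ISO 8601 dates (YYYY-MM-DD)
parse_iso <- function(x) as.Date(x, format = "%Y-%m-%d")

parsed <- parse_iso(raw)

# Non-valid: a value was supplied, but coercion returned NA
non_valid <- !is.na(raw) & is.na(parsed)

data.frame(value = raw, date = parsed, non_valid = non_valid)
```

A stricter coercing function flags more values: here "15/03/2021" is flagged only because parse_iso accepts a single format.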

Usage

check_dates(
  x,
  vars,
  vars_id,
  queries = list(),
  dict_clean = NULL,
  fn = parse_dates,
  na = ".na",
  populate_na = FALSE
)

Arguments

x

A data frame with one or more columns to check

vars

Names of date columns within x to check

vars_id

Vector of one or more ID columns within x on which corrections should be conditional.

queries

Optional list of expressions to check for non-valid dates. May include the selector .x, which stands in for each of the date variables specified in argument vars. For example:

list(
  date_admit > date_exit,  # admission later than exit
  .x > Sys.Date()          # any date in future
)
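One way to picture the .x stand-in is as a placeholder that is replaced by each column name in turn before the expression is evaluated. The base R sketch below illustrates that idea only; expand_query is a hypothetical helper, and nothing here is taken from the package's internals.

```r
# Hypothetical helper: replace the `.x` placeholder in a quoted expression
# with each column name in `vars`, giving one concrete expression per column
expand_query <- function(query, vars) {
  lapply(vars, function(v) {
    do.call(substitute, list(query, list(.x = as.name(v))))
  })
}

dat <- data.frame(
  date_admit = as.Date(c("2021-01-05", "2031-01-01")),
  date_exit  = as.Date(c("2021-01-09", "2021-01-03"))
)

# A fixed cutoff is used instead of Sys.Date() so the result is reproducible
queries <- expand_query(
  quote(.x > as.Date("2025-01-01")),
  vars = c("date_admit", "date_exit")
)

# Evaluate each expanded expression using the columns of `dat`
flags <- lapply(queries, function(q) eval(q, envir = dat))
```

In this sketch flags holds one logical vector per date column, marking the rows where that column lies after the cutoff.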

dict_clean

Optional dictionary of value-replacement pairs (e.g. produced by a prior run of check_dates). Must include columns "variable", "value", "replacement", and all columns specified by vars_id.

fn

Function to parse raw date values. Defaults to parse_dates. Any value not coercible by fn will be flagged as a "Non-valid date".

na

Keyword to use within column "replacement" for values that should be converted to NA. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to NA.
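The three states a "replacement" entry can be in are sketched below in base R. The dictionary rows and the status labels are made up for illustration and are not output of the package.

```r
na_keyword <- ".na"  # same as the default for argument `na`

# Illustrative dictionary fragment after partial manual review
dict <- data.frame(
  value       = c("15/03/2021", "unknown", "3 mars"),
  replacement = c("2021-03-15", ".na", NA),
  stringsAsFactors = FALSE
)

# Classify each entry: a missing replacement has not been reviewed yet,
# the keyword means "verified, convert to NA", anything else is a correction
status <- ifelse(
  is.na(dict$replacement),
  "not yet verified",
  ifelse(dict$replacement == na_keyword, "convert to NA", "replace")
)

cbind(dict, status)
```

Without the keyword, the second and third rows would be indistinguishable, since both would have a missing replacement.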

populate_na

Logical indicating whether to pre-populate column "replacement" with values specified by keyword na, for queries of type "Non-valid date". If most non-valid dates in x are non-correctable, pre-populating the keyword na can save time during the manual verification/correction phase. Defaults to FALSE.

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

  • columns specified in vars_id

  • variable: column name of date variable within x

  • value: raw date value

  • date: parsed date value

  • replacement: correct value that should replace a given non-valid value

  • query: which query was triggered by the given raw date value (if any)

Note that, unlike functions check_numeric and check_categorical, which only return rows corresponding to non-valid values, this function returns all date values corresponding to any observation (i.e. row) with at least one non-valid date value. This is to provide context for the non-valid value and aid in making the appropriate correction.
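The row-context behaviour described in the note can be sketched in base R as follows. This assumes a simplified long-format table with a single ID column and an ISO-only parser, both chosen for illustration. It is not the package's code.

```r
# One row per (id, date variable) pair; illustrative values only
long <- data.frame(
  id       = c("A", "A", "B", "B"),
  variable = c("date_onset", "date_admit", "date_onset", "date_admit"),
  value    = c("2021-03-01", "2021-03-04", "2021-03-02", "unknown"),
  stringsAsFactors = FALSE
)

# Parse the raw values and flag those that could not be coerced
long$date  <- as.Date(long$value, format = "%Y-%m-%d")
long$query <- ifelse(
  !is.na(long$value) & is.na(long$date),
  "Non-valid date",
  NA_character_
)

# Keep every date value belonging to an observation with at least one flag
ids_flagged <- unique(long$id[!is.na(long$query)])
context <- long[long$id %in% ids_flagged, ]
context
```

Observation "B" contributes both of its rows, including the valid onset date, which gives a reviewer context for choosing a replacement for "unknown".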

Examples

# load example dataset
data(ll1)

# basic output
check_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id"
)

# add additional queries to evaluate
check_dates(
  ll1,
  vars = c("date_onset", "date_admit", "date_exit"),
  vars_id = "id",
  queries = list(
    date_onset > date_admit,
    date_admit > date_exit,
    .x > as.Date("2021-01-01")
  )
)


epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.