In mdlincoln/dateSampler: Generate Randomized Dates

dateSampler

Generate randomized dates within separately-specified ranges of years, months, and days.

Data generated from archival sources often contains fragmentary dates, in which many records may be precise to the day, but some will be missing components such as the precise day, the month, or even the exact year.

Multiple imputation can be used to simulate the effect of these missing values when producing models based on such fragmentary data. Unlike assinging categorical or numeric values, however, generating random dates under certain restrictions can be difficult. dateSampler generates randomized dates that abide to restrictions on the year, month, and day components of a given date.

Installation

# install.packages("devtools")
devtools::install_github("mdlincoln/dateSampler")

Usage

Say you have a record that falls within January of 1875:

set.seed(101)
library(dateSampler)
sample_date(year_min = 1875, year_max = 1875, month_min = 1, month_max = 1, n = 5)

However, it may also be that you know the year and the day of the record, but not the month. If this day happens to place an implicit restriction on the month (e.g. 31 rules out September, April, June, November, and February as a possible month), sample_date similarly restricts its results, regenerating a new date if its initial pass returns an illegal combination:

sample_date(year_min = 1875, year_max = 1875, day_min = 31, day_max = 31, n = 5)

(Messages may be suppressed by passing quiet = TRUE)

You may also set ranges for each of these components:

sample_date(year_min = 1875, year_max = 1892, month_min = 1, month_max = 5, n = 5)

If you pass a range that can only produce illegal dates, however, sample_date will throw an error:

sample_date(year_min = 1875, month_min = 2, month_max = 2, day_min = 30, n = 5)

Expanding data frames

sample_date_df is a convenience function that generates replicates of a given data frame and appends new sampled dates. New dates are sampled rowwise, allowing you to specify different date component restrictions for each row. This can be useful when used as part of a modeling pipeline in which you need to use multiple imputation.

head(sample_date_df(iris, year_min = 1850:1999, n = 5), n = 10)

TBD: non-standard evaluation of column names to make sample_date_df extra pipe-friendly

Custom predicates

A custom predicate function can also be supplied to sample_date (and sample_date_df, if passed inside a list()) in order to intorduce more complex restrictions on the dates returned. The predicate function must take a year, month, and day value, and return FALSE if the supplied date is invalid, prompting sample_date to generate a new date.

For example, if you wished to produce dates that were only Mondays, you could produce a predicate function like so:

is_monday <- function(y, m, d) {
  lubridate::wday(lubridate::ymd(paste(y, m, d, sep = "-"), quiet = TRUE)) == 2
}

sample_date(1850, n = 5, .p = is_monday, quiet = TRUE)