query: Data validation queries with tidy, stackable output

query    R Documentation

Description

Find observations within a data frame matching a given query (a logical expression relating to one or more variables), and return tidy output that can be stacked across different queries on different variables. Stackability is achieved by pivoting the columns indicated in the query expression to long-form, e.g. "variable1", "value1", "variable2", "value2", ...

The query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (see section Using a dot-selector). If both selectors are used in a given query expression, the sets of variables they respectively match can either be "crossed" such that all combinations are evaluated, or evaluated in parallel.

By default, only the data columns referenced in the query expression are returned, but additional columns can optionally be added with argument cols_base.
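
As a rough illustration of stacking, the sketch below runs two queries on different variables and row-binds the results. The data frame visits and its columns are hypothetical and not part of the package, and the code assumes queryr is installed (e.g. from the epicentre-msf/queryr GitHub repository).

```r
library(queryr)  # assumption: installed from GitHub (epicentre-msf/queryr)

# hypothetical toy data, for illustration only
visits <- data.frame(
  id = 1:4,
  age = c(34, -1, 51, 27),
  date_admit = as.Date(c("2024-01-05", "2024-01-09", "2024-01-12", "2024-01-20")),
  date_exit  = as.Date(c("2024-01-07", "2024-01-08", "2024-01-15", "2024-01-18"))
)

# two queries on different variables, both keeping the id column
q_age   <- query(visits, age < 0, cols_base = id)
q_dates <- query(visits, date_exit < date_admit, cols_base = id)

# with the defaults pivot_long = TRUE and as_chr = TRUE, the two outputs
# should share the long-form variable/value columns and so can be stacked
stacked <- dplyr::bind_rows(q_age, q_dates)
```

The first query references one variable and the second references two, so the stacked result would be expected to have missing values in the second variable/value pair for rows coming from the first query.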

Usage

query(
  data,
  cond,
  cols_dotx,
  cols_doty,
  crossed = FALSE,
  cols_base,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE,
  count = FALSE
)

Arguments

data

A data frame

cond

An expression to evaluate with respect to variables within data. The expression can refer to multiple variables through a dot-selector (".x" and/or ".y"), e.g. .x > 0, with the columns that each selector refers to specified separately using arguments cols_dotx/cols_doty.

cols_dotx, cols_doty

Tidy-selection of one or more columns represented by a .x or .y selector. Only used if cond contains the relevant selector. See section Using a dot-selector below.

crossed

If cond contains both a .x and .y selector, should the variables matched by cols_dotx and cols_doty be "crossed" such that all combinations are evaluated (TRUE), or should they be evaluated in parallel (FALSE)? The latter requires that cols_dotx and cols_doty match the same number of variables. Defaults to FALSE.

cols_base

(Optional) Tidy-selection of other columns within data to retain in the output. Can optionally be set for an entire session using option "queryr_cols_base", e.g. options(queryr_cols_base = quote(id:site)).

pivot_long

Logical indicating whether to pivot the variables referenced within cond to a long (i.e. stackable) format, with default column names "variable1", "value1", "variable2", "value2", ... Defaults to TRUE.

pivot_var

Prefix for pivoted variable column(s). Defaults to "variable". Only used if pivot_long = TRUE.

pivot_val

Prefix for pivoted value column(s). Defaults to "value". Only used if pivot_long = TRUE.

as_chr

Logical indicating whether to coerce the columns referenced in the query expression cond to character prior to returning. This enables row-binding multiple queries with variables of different classes, but is only important if pivot_long = TRUE. Defaults to TRUE.

count

Logical indicating whether to summarize the output by counting the number of unique combinations across all returned columns (with count column "n"). Defaults to FALSE.
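
A brief sketch of two of the arguments above, using the example dataset ll from the Examples section. It assumes queryr is installed and that ll contains the columns id and site, as the examples below suggest.

```r
library(queryr)  # assumption: installed from GitHub (epicentre-msf/queryr)
data(ll)

# count = TRUE: summarize matches by unique combination of returned columns,
# here the site plus the pivoted variable/value columns
future_counts <- query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = site,
  count = TRUE
)

# set cols_base once for the session rather than in every call
old <- options(queryr_cols_base = quote(id:site))

future_dates <- query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date")
)

# restore the previous option value
options(old)
```

Setting the option is mainly useful when many queries are run against the same dataset and should all carry the same identifying columns.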

Value

A data frame reflecting the rows of data that match the given query. Returned columns include:

  • (optional) columns matched by argument cols_base

  • columns referenced within the query expression (pivoted to long form by default)

  • (optional) count column "n" (if count = TRUE)

Using a dot-selector

A query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (arguments cols_dotx and cols_doty).

When cond contains a .x selector, the query expression is evaluated repeatedly with each relevant variable from cols_dotx individually substituted into the .x position of the expression. The results of these multiple 'subqueries' are then combined with dplyr::bind_rows.
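
The substitution described above can be sketched in base R as follows. This is only a conceptual illustration under simplifying assumptions, not the package's actual implementation; the data frame df, the helper run_subquery, and the column names are all hypothetical.

```r
# hypothetical toy data, for illustration only
df <- data.frame(id = 1:4, a = c(1, -2, 3, -4), b = c(-1, 2, 3, 4))

cond <- quote(.x < 0)  # query expression containing the .x selector
cols <- c("a", "b")    # columns that .x stands for

# hypothetical helper: evaluate the expression for one column
run_subquery <- function(col) {
  # substitute the column name into the .x position, e.g. `a < 0`
  expr <- do.call(substitute, list(cond, list(.x = as.name(col))))
  keep <- eval(expr, df)
  keep <- !is.na(keep) & keep  # treat NA as "no match"
  data.frame(
    id = df$id[keep],
    variable1 = rep(col, sum(keep)),
    value1 = as.character(df[[col]][keep]),
    stringsAsFactors = FALSE
  )
}

# one subquery per column, then row-bind the results
result <- do.call(rbind, lapply(cols, run_subquery))
```

Here result holds one row per matching observation and column: two rows for column a and one row for column b.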

If cond contains both a .x and .y selector, the sets of variables matched by cols_dotx and cols_doty respectively can either be "crossed" such that all combinations are evaluated, or evaluated in parallel. Evaluating in parallel requires that the number of variables matched by cols_dotx and cols_doty is the same.

Consider a hypothetical query checking that, if a patient has a particular symptom, the date of onset of that symptom is not missing, e.g.:
cond = .x == "Yes" & is.na(.y)
cols_dotx = c(symptom_fever, symptom_headache)
cols_doty = c(date_symptom_fever, date_symptom_headache)

If argument crossed is FALSE, the relevant variables from cols_dotx and cols_doty will be evaluated in parallel, as in:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_headache == "Yes" & is.na(date_symptom_headache)

Conversely, if argument crossed is TRUE, all combinations of the relevant variables will be evaluated, which for this particular query wouldn't make sense:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_fever == "Yes" & is.na(date_symptom_headache) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_fever) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_headache)
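
The difference between the two pairing modes can be sketched in base R. The helper build_cond is hypothetical and only generates the text of each subquery for inspection; it does not reflect how the package builds its expressions internally.

```r
cols_dotx <- c("symptom_fever", "symptom_headache")
cols_doty <- c("date_symptom_fever", "date_symptom_headache")

# parallel: pair the i-th .x column with the i-th .y column,
# which requires both sets to have the same length
stopifnot(length(cols_dotx) == length(cols_doty))
pairs_parallel <- data.frame(x = cols_dotx, y = cols_doty, stringsAsFactors = FALSE)

# crossed: every combination of a .x column with a .y column
pairs_crossed <- expand.grid(x = cols_dotx, y = cols_doty, stringsAsFactors = FALSE)

# hypothetical helper: write out the expression for each pair
build_cond <- function(x, y) sprintf('%s == "Yes" & is.na(%s)', x, y)

conds_parallel <- build_cond(pairs_parallel$x, pairs_parallel$y)
conds_crossed  <- build_cond(pairs_crossed$x, pairs_crossed$y)
```

With two columns in each set, the parallel mode yields two subqueries and the crossed mode yields four, matching the listings above (though possibly in a different order).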

Note that if a dot-selector is used with argument pivot_long = FALSE, the row-binding of multiple subqueries may result in a sparse output with respect to the variables represented by the dot-selector, because for each subquery only the columns matched by expression cond are returned.
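
To see the effect of pivot_long on a dot-selector query, the two forms can be compared side by side. This sketch assumes queryr is installed and uses the example dataset ll; no particular output is implied, since it depends on the data and the current date.

```r
library(queryr)  # assumption: installed from GitHub (epicentre-msf/queryr)
data(ll)

# default long form: one variable/value pair per matching observation
out_long <- query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site
)

# wide form: each subquery returns its own date column, so the
# row-bound result may contain many missing values
out_wide <- query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site,
  pivot_long = FALSE
)

# compare the column layouts of the two outputs
names(out_long)
names(out_wide)
```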

Examples

# load example dataset, an epidemiological 'linelist'
data(ll)

# find observations where date_exit is earlier than date_admit
query(
  ll,
  date_exit < date_admit,
  cols_base = id:site
)

# find any date value in the future using a .x column selector
query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site
)

# incorporate an external object into the query expression
lab_result_valid <- c("Positive", "Negative", "Inc.", NA)

query(
  ll,
  !lab_result %in% lab_result_valid,
  cols_base = id:site
)


epicentre-msf/queryr documentation built on July 17, 2025, 12:22 a.m.