query: Data validation queries with tidy, stackable output
In epicentre-msf/queryr: Data Validation Queries With Tidy Output

query

R Documentation

Description

Find observations within a data frame matching a given query (a logical expression relating to one or more variables), and return tidy output that can be stacked across different queries on different variables. Stackability is achieved by pivoting the columns indicated in the query expression to long-form, e.g. "variable1", "value1", "variable2", "value2", ...

The query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (see section Using a dot-selector). If both selectors are used in a given query expression, the sets of variables they respectively match can either be "crossed" such that all combinations are evaluated, or evaluated in parallel.

By default, only the data columns referenced in the query expression are returned, but additional columns can optionally be added with argument cols_base.

Usage

query(
  data,
  cond,
  cols_dotx,
  cols_doty,
  crossed = FALSE,
  cols_base,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE,
  count = FALSE
)

Arguments

`data`	A data frame
`cond`	An expression to evaluate with respect to variables within `data`. The expression can refer to several variables at once through a dot-selector (`.x` and/or `.y`), e.g. `.x > 0`; the columns that each selector stands for are then specified separately with arguments `cols_dotx`/`cols_doty`.
`cols_dotx`, `cols_doty`	Tidy-selection of one or more columns represented by a .x or .y selector. Only used if `cond` contains the relevant selector. See section Using a dot-selector below.
`crossed`	Logical, only relevant if `cond` contains both a .x and a .y selector. If `TRUE`, the variables matched by `cols_dotx` and `cols_doty` are "crossed" such that all combinations are evaluated. If `FALSE`, they are evaluated in parallel, which requires that `cols_dotx` and `cols_doty` match the same number of variables. Defaults to `FALSE`.
`cols_base`	(Optional) Tidy-selection of other columns within `data` to retain in the output. Can optionally be set for an entire session using option "queryr_cols_base", e.g. `options(queryr_cols_base = quote(id:site))`.
`pivot_long`	Logical indicating whether to pivot the variables referenced within `cond` to a long (i.e. stackable) format, with default column names "variable1", "value1", "variable2", "value2", ... Defaults to `TRUE`.
`pivot_var`	Prefix for pivoted variable column(s). Defaults to "variable". Only used if `pivot_long = TRUE`.
`pivot_val`	Prefix for pivoted value column(s). Defaults to "value". Only used if `pivot_long = TRUE`.
`as_chr`	Logical indicating whether to coerce the columns referenced in the query expression `cond` to character prior to returning. This enables row-binding multiple queries with variables of different classes, but is only important if `pivot_long = TRUE`. Defaults to `TRUE`.
`count`	Logical indicating whether to summarize the output by counting the number of unique combinations across all returned columns (with count column "n"). Defaults to `FALSE`.

Value

A data frame reflecting the rows of data that match the given query. Returned columns include:

(optional) columns matched by argument cols_base
columns referenced within the query expression (pivoted to long form by default)
(optional) count column "n" (if count = TRUE)

Using a dot-selector

A query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (arguments cols_dotx and cols_doty).

When cond contains a .x selector, the query expression is evaluated repeatedly with each relevant variable from cols_dotx individually substituted into the .x position of the expression. The results of these multiple 'subqueries' are then combined with dplyr::bind_rows.
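The substitution described above can be illustrated with a minimal base-R sketch. This is not queryr's implementation: the toy data frame `df`, its column names, and the helper `run_subquery()` are hypothetical, and `rbind()` stands in for `dplyr::bind_rows` to keep the example free of dependencies.

```r
# Toy data (hypothetical): two date columns, each with one far-future value
df <- data.frame(
  id = 1:3,
  date_admit = as.Date(c("2020-01-01", "2999-01-01", "2020-01-03")),
  date_exit  = as.Date(c("2020-01-05", "2020-01-06", "2999-01-07"))
)

# Query expression containing a .x selector
cond <- quote(.x > Sys.Date())

# Hypothetical helper: substitute one column name into the .x position,
# evaluate the resulting expression within df, and return matches in long form
run_subquery <- function(var) {
  expr <- do.call(substitute, list(cond, list(.x = as.name(var))))
  hits <- which(eval(expr, df))  # which() also drops NA results
  data.frame(
    id = df$id[hits],
    variable1 = rep(var, length(hits)),
    value1 = as.character(df[[var]][hits]),  # character values stack across classes
    stringsAsFactors = FALSE
  )
}

# One subquery per column matched by the selector, stacked row-wise
result <- do.call(rbind, lapply(c("date_admit", "date_exit"), run_subquery))
result
```

Because every subquery returns the same columns (`id`, `variable1`, `value1`), the results stack cleanly regardless of which variable produced each row.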

If cond contains both a .x and .y selector, the sets of variables matched by cols_dotx and cols_doty respectively can either be "crossed" such that all combinations are evaluated, or evaluated in parallel. Evaluating in parallel requires that the number of variables matched by cols_dotx and cols_doty is the same.

Consider a hypothetical query checking that, if a patient has a particular symptom, the date of onset of that symptom is not missing. For example:
cond = .x == "Yes" & is.na(.y)
cols_dotx = c(symptom_fever, symptom_headache)
cols_doty = c(date_symptom_fever, date_symptom_headache)

If argument crossed is FALSE, the relevant variables from cols_dotx and cols_doty will be evaluated in parallel, as in:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_headache == "Yes" & is.na(date_symptom_headache)

Conversely, if argument crossed is TRUE, all combinations of the relevant variables will be evaluated, which for this particular query wouldn't make sense:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_fever == "Yes" & is.na(date_symptom_headache) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_fever) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_headache)
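The difference between the two modes comes down to how the two sets of column names are paired. The following base-R sketch shows one way to picture it; it is an illustration only, not queryr's internal code, and the helper `subqueries()` is hypothetical.

```r
# Column names from the example above
cols_dotx <- c("symptom_fever", "symptom_headache")
cols_doty <- c("date_symptom_fever", "date_symptom_headache")

# Parallel (crossed = FALSE): pair by position, so the lengths must match
stopifnot(length(cols_dotx) == length(cols_doty))
pairs_parallel <- data.frame(x = cols_dotx, y = cols_doty,
                             stringsAsFactors = FALSE)

# Crossed (crossed = TRUE): every combination of the two sets
pairs_crossed <- expand.grid(x = cols_dotx, y = cols_doty,
                             stringsAsFactors = FALSE)

# Hypothetical helper: write out the subquery implied by each pair
subqueries <- function(pairs) {
  sprintf('%s == "Yes" & is.na(%s)', pairs$x, pairs$y)
}

subqueries(pairs_parallel)  # 2 subqueries
subqueries(pairs_crossed)   # 4 subqueries
```

With n variables on each side, parallel evaluation yields n subqueries while crossing yields n squared, so crossing is best reserved for queries where every combination is meaningful.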

Note that if a dot-selector is used with argument pivot_long = FALSE, the row-binding of multiple subqueries may result in a sparse output with respect to the variables represented by the dot-selector, because for each subquery only the columns matched by expression cond are returned.
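To see why the unpivoted output becomes sparse, consider this base-R sketch with two made-up subquery results, each carrying only the column its own subquery referenced. The data and the helper `fill_cols()` are hypothetical; the helper only mimics the fill-with-NA behaviour of `dplyr::bind_rows`.

```r
# Hypothetical wide-format results from two subqueries of the same .x query
sub1 <- data.frame(id = 2L, date_admit = "2999-01-01", stringsAsFactors = FALSE)
sub2 <- data.frame(id = 3L, date_exit = "2999-01-07", stringsAsFactors = FALSE)

# Hypothetical helper: add any missing columns as NA, then align column order
all_cols <- union(names(sub1), names(sub2))
fill_cols <- function(d) {
  d[setdiff(all_cols, names(d))] <- NA
  d[all_cols]
}

# Row-binding leaves NA wherever a subquery did not reference a column
wide <- rbind(fill_cols(sub1), fill_cols(sub2))
wide
```

Each row is populated only in the column its subquery tested, which is the sparsity that the default long format avoids.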

Examples

# load example dataset, an epidemiological 'linelist'
data(ll)

# find observations where date_exit is earlier than date_admit
query(
  ll,
  date_exit < date_admit,
  cols_base = id:site
)

# find any date value in the future using a .x column selector
query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site
)

# incorporate an external object into the query expression
lab_result_valid <- c("Positive", "Negative", "Inc.", NA)

query(
  ll,
  !lab_result %in% lab_result_valid,
  cols_base = id:site
)

epicentre-msf/queryr documentation built on July 17, 2025, 12:22 a.m.