query2: Data validation queries across two data frames

View source: R/query2.R

query2R Documentation

Data validation queries across two data frames

Description

Find observations matching a query that concerns two data frames, and return tidy, stackable output. Entails three steps:

  1. separately query each of the two data frames using query

  2. combine the resulting query outputs based on a given join type (semi, anti, left, or inner)

  3. execute a third query on the joined output

Each of the query steps is optional — unspecified query expressions are replaced with TRUE such that all rows of the relevant input are returned.

Usage

query2(
  data1,
  data2,
  cond1,
  cond2,
  cols_base1,
  cols_base2,
  join_type,
  join_by,
  cond3,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE
)

Arguments

data1

Data frame to query (#1)

data2

Data frame to query (#2)

cond1

(Optional) Expression to evaluate with respect to data1. If missing will be set to TRUE to select all rows.

cond2

(Optional) Expression to evaluate with respect to data2. If missing will be set to TRUE to select all rows.

cols_base1

(Optional) Tidy-selection of other columns within data1 to retain in the final output. Can be set for an entire session using option "queryr_cols_base", e.g. options(queryr_cols_base = quote(id:site)).

cols_base2

(Optional) Tidy-selection of other columns within data2 to retain in the final output.

join_type

How to join the output from the two initial queries ("semi", "anti", "left", or "inner"). Based on dplyr join types.

join_by

A character vector of variables to join by. If the join key columns have different names in data1 and data2, use a named vector. For example, by = c("a" = "b") will match data1$a to data2$b.

cond3

(Optional) Expression to evaluate with respect to the joined output of the two initial queries. If missing will be set to TRUE to select all rows.

Note that if join_type is a filtering join ("anti" or "semi"), only variables from data1 can be referenced in cond3 (referencing a variable that only exists in data2 will result in an error).

If join_type is instead a mutating join ("left" or "inner"), all variables from data1 and data2 will be available to cond3, even if not otherwise referenced with cond1/cond2 or cols_base1/cols_base2.

pivot_long

Logical indicating whether to pivot the variables referenced within the query expression(s) to a long (i.e. stackable) format, with default column names "variable1", "value1", "variable2", "value2", ... Defaults to TRUE. If cond3 is specified and pivot_long is TRUE, the pivot happens only in the final query (i.e. cond3).

pivot_var

Prefix for pivoted variable column(s). Defaults to "variable". Only used if pivot_long = TRUE.

pivot_val

Prefix for pivoted value column(s). Defaults to "value". Only used if pivot_long = TRUE.

as_chr

Logical indicating whether to coerce the columns referenced in the query expression(s) to character prior to returning. This enables row-binding multiple queries with variables of different classes, but is only important if pivot_long = TRUE. Defaults to TRUE.

Value

A data frame reflecting the rows of data1 that match the given query. Returned columns include:

  • Columns matched by argument cols_base1

  • Columns matched by argument cols_base2 (only if join type is "left" or "inner")

  • Columns referenced within the relevant condition statements (pivoted to long form by default).

    If the join type is a mutating join ("left" or "inner"), variables from data1 or data2 referenced in any of the condition statements (cond1, cond2, or cond3) will appear in the output. However, with a filtering join ("anti" or "semi") only variables from data1 will appear in the output.

Examples

# example datasets: two related epidemiological linelists
data(ll)  # ll from treatment center (all cases, confirmed and non-confirmed)
data(sll) # summary linelist (only confirmed/probable cases)

# find patients in ll that don't appear in sll
query2(
  ll,
  sll,
  cols_base1 = c(id, site, status),
  join_type = "anti",
  join_by = c("id" = "tc_id")
)

# find patients with different outcome status in ll vs sll
query2(
  ll,
  sll,
  cols_base1 = id:site,
  join_type = "inner",
  join_by = c("id" = "tc_id"),
  cond3 = status != sll_status
)


epicentre-msf/queryr documentation built on July 17, 2025, 12:22 a.m.