query | R Documentation |
Find observations within a data frame matching a given query (a logical expression relating to one or more variables), and return tidy output that can be stacked across different queries on different variables. Stackability is achieved by pivoting the columns indicated in the query expression to long-form, e.g. "variable1", "value1", "variable2", "value2", ...
The query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (see section Using a dot-selector). If both selectors are used in a given query expression, the sets of variables they respectively match can either be "crossed" such that all combinations are evaluated, or evaluated in parallel.
By default, only the data columns referenced in the query expression are returned, but additional columns can optionally be added with argument cols_base.
query(
  data,
  cond,
  cols_dotx,
  cols_doty,
  crossed = FALSE,
  cols_base,
  pivot_long = TRUE,
  pivot_var = "variable",
  pivot_val = "value",
  as_chr = TRUE,
  count = FALSE
)
data | A data frame |
cond | An expression to evaluate with respect to variables within |
cols_dotx, cols_doty | Tidy-selection of one or more columns represented by a .x or .y selector. Only used if |
crossed | if |
cols_base | (Optional) Tidy-selection of other columns within |
pivot_long | Logical indicating whether to pivot the variables referenced within |
pivot_var | Prefix for pivoted variable column(s). Defaults to "variable". Only used if |
pivot_val | Prefix for pivoted value column(s). Defaults to "value". Only used if |
as_chr | Logical indicating whether to coerce the columns referenced in the query expression |
count | Logical indicating whether to summarize the output by counting the number of unique combinations across all returned columns (with count column "n"). Defaults to |
A data frame reflecting the rows of data that match the given query.
Returned columns include:
(optional) columns matched by argument cols_base
columns referenced within the query expression (pivoted to long form by default)
(optional) count column "n" (if count = TRUE)
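To make the idea of stackable output concrete, the sketch below builds long-form results by hand in base R. It is an illustration only, not the package's implementation: the helper pivot_matches() and the data frame df are hypothetical, and the column names simply follow the "variable1", "value1" pattern described above. Values are coerced to character here on the assumption that this is what lets a numeric result and a date result share one value column.

```r
# Hypothetical data, invented for this illustration
df <- data.frame(
  id = c("A1", "A2", "A3"),
  age = c(34, -2, 51),
  date_exit = as.Date(c("2020-03-04", "2020-03-02", "2020-03-12")),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the package): keep the rows flagged by
# `keep` and return the id plus one variable/value pair in long form
pivot_matches <- function(data, keep, var) {
  rows <- data[which(keep), , drop = FALSE]
  data.frame(
    id = rows$id,
    variable1 = rep(var, nrow(rows)),
    value1 = as.character(rows[[var]]),
    stringsAsFactors = FALSE
  )
}

# Two queries on different variables of different types
q_age <- pivot_matches(df, df$age < 0, "age")
q_date <- pivot_matches(df, df$date_exit > as.Date("2020-03-10"), "date_exit")

# Both results share the same columns, so they stack directly
stacked <- rbind(q_age, q_date)
print(stacked)
```

Because each result carries the variable name alongside its value, the stacked table still records which check flagged each row.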
A query expression can optionally incorporate up to two dot-selectors (".x" and ".y"), which each refer to a set of variables specified separately using tidy-selection (arguments cols_dotx and cols_doty).
When cond contains a .x selector, the query expression is evaluated repeatedly with each relevant variable from cols_dotx individually substituted into the .x position of the expression. The results of these multiple 'subqueries' are then combined with dplyr::bind_rows.
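The substitute-and-combine behaviour described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's code: the data frame df is invented, vars stands in for the columns that cols_dotx would match, a fixed cutoff date replaces Sys.Date() so the result is stable, and base rbind() is used in place of dplyr::bind_rows to avoid dependencies.

```r
# Hypothetical data, invented for this illustration
df <- data.frame(
  id = c("A1", "A2"),
  date_admit = as.Date(c("2020-03-01", "2999-01-01")),
  date_exit = as.Date(c("2999-06-01", "2020-03-04")),
  stringsAsFactors = FALSE
)

# Query expression containing a .x placeholder
cond <- quote(.x > as.Date("2100-01-01"))
vars <- c("date_admit", "date_exit")  # stand-in for columns matched by cols_dotx

subqueries <- lapply(vars, function(v) {
  # Replace .x with the current column name, e.g. date_admit > as.Date(...)
  expr <- do.call(substitute, list(cond, list(.x = as.name(v))))
  # Evaluate the subquery using the data frame's columns as variables
  keep <- which(eval(expr, df))
  data.frame(
    id = df$id[keep],
    variable1 = rep(v, length(keep)),
    value1 = as.character(df[[v]][keep]),
    stringsAsFactors = FALSE
  )
})

# Combine the per-variable results into one table
result <- do.call(rbind, subqueries)
print(result)
```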
If cond contains both a .x and .y selector, the sets of variables matched by cols_dotx and cols_doty respectively can either be "crossed" such that all combinations are evaluated, or evaluated in parallel. Evaluating in parallel requires that the number of variables matched by cols_dotx and cols_doty is the same.
Consider a hypothetical query checking that, if a patient has a particular
symptom, the date of onset of that symptom is not missing. E.g.
cond = .x == "Yes" & is.na(.y)
cols_dotx = c(symptom_fever, symptom_headache)
cols_doty = c(date_symptom_fever, date_symptom_headache)
If argument crossed is FALSE, the relevant variables from cols_dotx and cols_doty will be evaluated in parallel, as in:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_headache == "Yes" & is.na(date_symptom_headache)
Conversely, if argument crossed is TRUE, all combinations of the relevant variables will be evaluated, which for this particular query wouldn't make sense:
symptom_fever == "Yes" & is.na(date_symptom_fever)
symptom_fever == "Yes" & is.na(date_symptom_headache) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_fever) # not relevant
symptom_headache == "Yes" & is.na(date_symptom_headache)
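The difference between parallel and crossed evaluation comes down to how the two sets of variable names are paired. The sketch below reproduces the pairings in base R; pair_vars() and as_subquery() are hypothetical helpers written for this illustration and are not functions from the package.

```r
cols_dotx <- c("symptom_fever", "symptom_headache")
cols_doty <- c("date_symptom_fever", "date_symptom_headache")

# Hypothetical helper (not part of the package): list the .x/.y variable
# pairs that would be evaluated under each setting of `crossed`
pair_vars <- function(x, y, crossed = FALSE) {
  if (crossed) {
    # Every combination; y varies fastest within each x
    grid <- expand.grid(y = y, x = x, stringsAsFactors = FALSE)
    data.frame(x = grid$x, y = grid$y, stringsAsFactors = FALSE)
  } else {
    # Parallel evaluation needs the same number of variables on each side
    stopifnot(length(x) == length(y))
    data.frame(x = x, y = y, stringsAsFactors = FALSE)
  }
}

# Write each pair out as the subquery it corresponds to
as_subquery <- function(pairs) {
  sprintf('%s == "Yes" & is.na(%s)', pairs$x, pairs$y)
}

q_parallel <- as_subquery(pair_vars(cols_dotx, cols_doty, crossed = FALSE))
q_crossed <- as_subquery(pair_vars(cols_dotx, cols_doty, crossed = TRUE))
print(q_parallel)  # 2 subqueries
print(q_crossed)   # 4 subqueries
```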
Note that if a dot-selector is used with argument pivot_long = FALSE, the row-binding of multiple subqueries may result in a sparse output with respect to the variables represented by the dot-selector, because for each subquery only the columns matched by expression cond are returned.
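A small base R sketch shows where the sparseness comes from. The two wide data frames below are invented stand-ins for unpivoted subquery results, and fill_missing() is a hypothetical helper that emulates how dplyr::bind_rows fills columns absent from one input with NA.

```r
# Hypothetical wide (unpivoted) results of two subqueries, one per variable
sub_admit <- data.frame(id = "A2", date_admit = "2999-01-01",
                        stringsAsFactors = FALSE)
sub_exit <- data.frame(id = "A1", date_exit = "2999-06-01",
                       stringsAsFactors = FALSE)

all_cols <- union(names(sub_admit), names(sub_exit))

# Add any missing columns as NA so the two results can be row-bound
fill_missing <- function(d, cols) {
  for (nm in setdiff(cols, names(d))) d[[nm]] <- NA
  d[cols]
}

wide <- rbind(fill_missing(sub_admit, all_cols),
              fill_missing(sub_exit, all_cols))
print(wide)
```

Each row has a value only in the column its own subquery referenced, so the number of NA cells grows with the number of variables matched by the dot-selector.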
# load example dataset, an epidemiological 'linelist'
data(ll)

# find observations where date_exit is earlier than date_admit
query(
  ll,
  date_exit < date_admit,
  cols_base = id:site
)

# find any date value in the future using a .x column selector
query(
  ll,
  .x > Sys.Date(),
  cols_dotx = starts_with("date"),
  cols_base = id:site
)

# incorporate an external object into the query expression
lab_result_valid <- c("Positive", "Negative", "Inc.", NA)

query(
  ll,
  !lab_result %in% lab_result_valid,
  cols_base = id:site
)