knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Filtering relies on a filter
key in the feature JSON, for which the corresponding value is an object:
{ "output_feature_name": "..." ..., "filter": { ... }, ... }
This vignette will describe the types of filters available and how they can be expressed in this JSON object.
The most basic filter would be to accept all rows where the value in a given column meets a certain criterion.
For example, we may only want to look at rows for which the value in the age
column is greater than 18.
There are three pieces of information here which must be encoded:
column
: the name of the column to filter ontype
: the type of filter to applyvalue
: the value to compare against. The type of this must match the type of the column specified.The type
key can take one of the following values:
"in"
In this case, the value
key can either be a single value or an array of multiple values. This filers for rows where the value in the column matches one of the provided values.
"gt"
/ "lt"
/ "gt_eq"
/ "lt_eq"
The value in the column must be greater than, less than, greater than or equal to, or less than or equal to the value
respectively.
For these filters, only one value
can be passed.
"date_gt"
/ "date_lt"
/ "date_gt_eq"
/ "date_lt_eq"
The same as above, but to be used for dates.
This distinction is mandatory because JSON does not have a native date type, so the value
must be passed as a string in the format YYYY-MM-DD
.
Thus, for example, if we assume that the column age
contains integers, then
{ "filter": { "column": "age", "type": "in", "value": [18, 19, 20] } }
selects for people who are 18, 19, or 20 years old; and likewise if the date_of_birth
column contains dates, then
{ "filter": { "column": "date_of_birth", "type": "date_gt_eq", "value": "2000-01-01" } }
selects for people born on or after January 1, 2000.
As a reminder, the table to be filtered is passed as the source_table
key in the feature JSON.
This is described in the introductory vignette.
Filters may also be combined using the Boolean logic operators AND, OR, and NOT.
Instead of providing the three keys column
, type
, and value
, only type
is specified, together with subfilter
, which is itself an object which maps filter names to filter objects.
An example is probably the easiest way to explain this. If we wanted to select for people who are not 18, 19, or 20 years old, we could write:
{ "filter": { "type": "not", "subfilter": { "age_filter": { "column": "age", "type": "in", "value": [18, 19, 20] } } } }
The filter name, age_filter
, can be any string you want: it is used only for logging purposes.
Notice how the object that age_filter
refers to, i.e.,
{ "column": "age", "type": "in", "value": [18, 19, 20] }
is itself just a basic filter as has already been described above.
or
and and
filters work in exactly the same way, with the sole difference being that to accomplish any meaningful results you would generally want the subfilter
object to contain two or more filters that work in tandem.
For example, the below filters for 18-year-olds who have a height of over 180 (presumably centimetres; although this depends on the input table!):
{ "filter": { "type": "and", "subfilter": { "age_filter": { "column": "age", "type": "in", "value": [18] }, "height_filter": { "column": "height", "type": "gt", "value": 180 } } } }
Again, age_filter
and height_filter
are just names for logging purposes, and themselves refer to basic filters.
Finally, note that the subfilters need not necessarily be basic filters, though: they can be compound filters themselves. So it is possible to nest filters as deeply as you like to accomplish the desired result.
library(eider)
Consider the following input table with five rows:
input_table <- data.frame( id = c(1, 1, 1, 2, 3), date = as.Date(c( "2024-01-01", "2024-01-02", "2024-01-03", "2023-01-01", "2023-01-01" )), diagnosis = c("A", "B", "C", "B", "C") ) input_table
and suppose we would like to count the number of times each person has had a diagnosis of "B".
This is a transformation_type
of "count"
; but if we do not perform any filtering, "count"
will simply return the number of times each person appears in this dataframe.
The following JSON is provided as json_examples/filter1.json
:
writeLines(readLines("json_examples/filter1.json"))
Passing this into run_pipeline()
will give us the following result:
results <- run_pipeline( data_sources = list(input_table = input_table), feature_filenames = "json_examples/filter1.json" ) results$features
Instead of this, if we specify a filter for the diagnosis
column (in json_examples/filter2.json
), we can achieve the desired result:
writeLines(readLines("json_examples/filter2.json"))
results <- run_pipeline( data_sources = list(input_table = input_table), feature_filenames = "json_examples/filter2.json" ) results$features
Note how patient 3 has a count of 0, because they have never been diagnosed with "B".
This is specified by the absent_default_value
key in the JSON.
If we would like to find diagnoses of "B" only in 2024, then we can add a date filter to the mix, and combine it with the diagnosis filter with an AND filter (json_examples/filter3.json
):
writeLines(readLines("json_examples/filter3.json"))
results <- run_pipeline( data_sources = list(input_table = input_table), feature_filenames = "json_examples/filter3.json" ) results$features
Patient 2's diagnosis has been excluded this time because the corresponding entry came in 2023.
Filtering is used widely in the Gallery examples. See, for example:
NA
values);Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.