knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(interfacer)
interfacer is designed to work with list columns, as generated by purrr. purrr style list columns may contain any arbitrary data type within a list. Consider the following complex dataframe for example, which includes a single regular factor column, a nested dataframe as a list column, a nested S3 lm object as a list column and a nested matrix as a list column:
tmp = iris %>%
  tidyr::nest(by_species = -Species) %>%
  dplyr::mutate(
    model = purrr::map(by_species, ~ stats::lm(Sepal.Length ~ Sepal.Width, .x)),
    quantiles = purrr::map(by_species, ~ sapply(.x, quantile))
  )

tmp %>% dplyr::glimpse()
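To see what each list column actually holds, it can help to look at the class of the individual elements. The following is a minimal base R sketch, not part of interfacer: it rebuilds a similar nested structure by hand (the names `nested` and `element_class` are illustrative only) and reports the class of every element in each list column.

```r
# Build a nested data frame similar to `tmp`, using base R only.
by_species <- split(iris[, 1:4], iris$Species)
nested <- data.frame(Species = factor(names(by_species)))
nested$by_species <- unname(by_species)
nested$model <- lapply(
  nested$by_species,
  function(d) stats::lm(Sepal.Length ~ Sepal.Width, data = d)
)
nested$quantiles <- lapply(nested$by_species, function(d) sapply(d, quantile))

# Illustrative helper: the first class of every element in a list column.
element_class <- function(col) {
  vapply(col, function(x) class(x)[1], character(1))
}

classes <- lapply(nested[c("by_species", "model", "quantiles")], element_class)
classes
```

Each list column here is homogeneous (all dataframes, all lm objects, or all matrices), which is the kind of structure a per-column specification can describe.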
interfacer can be used to both represent and validate this data structure. Here the initial specifications were generated using iclip(tmp) and hand modified:
# Pasted from `iclip(tmp)` with minor modification:
i_tmp = interfacer::iface(
  Species = enum(`setosa`,`versicolor`,`virginica`) ~ "the Species column",
  by_species = list(i_by_species) ~ "the by_species column",
  model = list(of_type(lm)) ~ "the model column",
  quantiles = list(matrix) ~ "the quantiles column",
  .groups = NULL
)

i_by_species = interfacer::iface(
  Sepal.Length = numeric ~ "the Sepal.Length column",
  Sepal.Width = numeric ~ "the Sepal.Width column",
  Petal.Length = numeric ~ "the Petal.Length column",
  Petal.Width = numeric ~ "the Petal.Width column",
  .groups = NULL
)
We can then test that the input matches this specification:
tmp %>% iconvert(i_tmp) %>% dplyr::glimpse()
Such specifications could be used for validation or for controlling function dispatch. However, validation of nested dataframes is potentially computationally expensive, because each individual nested dataframe must be completely validated. This can create a high overhead when there are a large number of small nested dataframes.
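The scaling behaviour can be illustrated with a small base R sketch. The `check_frame()` function below is a hypothetical stand-in for a per-dataframe validator and is not interfacer's implementation; it only counts how often it is called, to show that the work grows with the number of nested dataframes rather than with the number of rows.

```r
# Hypothetical stand-in for a per-dataframe validator (NOT interfacer code).
# It counts its own invocations so the per-group cost is visible.
n_calls <- 0
check_frame <- function(d, required) {
  n_calls <<- n_calls + 1
  missing_cols <- setdiff(required, names(d))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  all(vapply(d[required], is.numeric, logical(1)))
}

# Split a small dataset into many tiny groups, as nesting would.
groups <- split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)

# One full check per nested dataframe.
ok <- vapply(groups, check_frame, logical(1), required = c("mpg", "wt"))

n_calls == length(groups)
```

With many small groups the fixed cost of each check dominates, which is why a large number of small nested dataframes is the unfavourable case.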
Another example of a nested list column using the diamonds dataframe demonstrates this overhead, where 276 nested dataframes need to be validated individually. This takes a few seconds on my machine.
i_diamonds_cat = interfacer::iface(
  cut = enum(`Fair`,`Good`,`Very Good`,`Premium`,`Ideal`, .ordered=TRUE) ~ "the cut column",
  color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column",
  clarity = enum(`I1`,`SI2`,`SI1`,`VS2`,`VS1`,`VVS2`,`VVS1`,`IF`, .ordered=TRUE) ~ "the clarity column",
  data = list(i_diamonds_data) ~ "A nested data column must be specified as a list",
  .groups = FALSE
)

i_diamonds_data = interfacer::iface(
  carat = numeric ~ "the carat column",
  depth = numeric ~ "the depth column",
  table = numeric ~ "the table column",
  price = integer ~ "the price column",
  x = numeric ~ "the x column",
  y = numeric ~ "the y column",
  z = numeric ~ "the z column",
  .groups = FALSE
)

nested_diamonds = ggplot2::diamonds %>%
  tidyr::nest(data = c(-cut,-color,-clarity))

system.time(
  nested_diamonds %>% iconvert(i_diamonds_cat) %>% dplyr::glimpse()
)
In this example the price column is removed before nesting, so the nested dataframes no longer match i_diamonds_data. Errors in the validation of nested columns are bubbled up to the top level.
try(
  ggplot2::diamonds %>%
    dplyr::select(-price) %>%
    tidyr::nest(data = c(-cut,-color,-clarity)) %>%
    iconvert(i_diamonds_cat) %>%
    dplyr::glimpse()
)
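As a rough illustration of what "bubbling up" means, the base R sketch below re-raises an error found in a nested item with extra context, so that the caller sees a single top-level error. The `validate_nested()` function is hypothetical and makes no claim about how interfacer reports nested errors internally.

```r
# Hypothetical sketch (NOT interfacer code): check each nested dataframe
# and re-raise the first failure with the position of the offending item.
validate_nested <- function(frames, required) {
  for (i in seq_along(frames)) {
    tryCatch(
      {
        missing_cols <- setdiff(required, names(frames[[i]]))
        if (length(missing_cols) > 0) {
          stop("missing columns: ", paste(missing_cols, collapse = ", "))
        }
      },
      error = function(e) {
        stop("nested item ", i, ": ", conditionMessage(e), call. = FALSE)
      }
    )
  }
  invisible(TRUE)
}

# Nested frames that lack a required column, analogous to dropping `price`.
frames <- split(iris[, c("Sepal.Width", "Petal.Length")], iris$Species)

msg <- tryCatch(
  validate_nested(frames, required = c("Sepal.Length", "Sepal.Width")),
  error = function(e) conditionMessage(e)
)
msg
```

The top-level message names both the failing nested item and the missing column, which is the information needed to locate the problem.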
interfacer does work with nested dataframes, but there is a performance hit if there are nested columns with iface specifications. If this capability is used, care must be taken to keep data validation performant.