Finding Features in Data

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

When you are presented with longitudinal data, it is useful to summarise the data into a format where you have one row per key, that is, one row per unique identifier in the data. If you aren't sure what this means, see the vignette "Longitudinal Data Structures".

For example, say you want to find features in the wages data, which looks like this:

library(brolgar)
wages

You can return a dataset that has one row per key, containing, say, the minimum value of ln_wages for each key:

wages_min <- wages %>%
  features(ln_wages, 
           list(min = min))

wages_min

This one-row-per-key format makes further summaries straightforward. For example, we can plot the distribution of the minimum values:

library(ggplot2)
ggplot(wages_min,
       aes(x = min)) + 
  geom_density()

We call these summaries features of the data.

This vignette discusses how to calculate these features of the data.

Calculating features

We can calculate features of longitudinal data using the features function (from fabletools, made available in brolgar).

features works by specifying the data, the variable to summarise, and the feature to calculate:

features(<DATA>, <VARIABLE>, <FEATURE>)

or with the pipe:

<DATA> %>% features(<VARIABLE>, <FEATURE>)

As an example, we can calculate a five number summary (minimum, 25th percentile, median, 75th percentile, and maximum) of the data using feat_five_num, like so:

wages_five <- wages %>%
  features(ln_wages, feat_five_num)

wages_five

Here we are taking the wages data, piping it to features, and then telling it to summarise the ln_wages variable, using feat_five_num.
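To build intuition for what features returns, here is a rough base R sketch of the idea: split the values by key, apply each summary function to each group, and bind the results into one row per key. The toy data frame and the helper summarise_by_key are hypothetical and are not part of brolgar or fabletools, which do this work for you on tsibble objects.

```r
# Toy longitudinal data: three keys (ids) with a few observations each.
# The values are made up for illustration only.
toy <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3),
  value = c(1.2, 1.5, 1.9, 2.0, 1.8, 0.9, 1.1, 1.0)
)

# Hypothetical helper: returns one row per key and one column per function.
summarise_by_key <- function(data, key, variable, funs) {
  # Split the variable into one vector per key
  groups <- split(data[[variable]], data[[key]])
  rows <- lapply(names(groups), function(k) {
    # Apply every summary function to this key's values
    stats <- lapply(funs, function(f) f(groups[[k]]))
    data.frame(key = k, stats)
  })
  do.call(rbind, rows)
}

toy_summary <- summarise_by_key(toy, "id", "value",
                                list(min = min, max = max))
toy_summary
```

The result has three rows, one for each id, which mirrors the shape of wages_five above.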

There are several handy functions for calculating features of the data that brolgar provides. These all start with feat_.

You can, for example, find the keys whose values only increase or only decrease with feat_monotonic:

wages_mono <- wages %>%
  features(ln_wages, feat_monotonic)

wages_mono

These features can then be used to identify the individuals whose values only increase, like so:

library(dplyr)
wages_mono %>%
  filter(increase)

These individuals can then be joined back to the original data:

wages_mono_join <- wages_mono %>%
  filter(increase) %>%
  left_join(wages, by = "id")

wages_mono_join

And these could be plotted:

ggplot(wages_mono_join,
       aes(x = xp,
           y = ln_wages,
           group = id)) + 
  geom_line()

To see these individuals in the context of the full data, we can create a plot with gghighlight that highlights those that only increase, using gghighlight(increase). Since increase is a logical, this tells gghighlight to highlight the lines where it is TRUE.

library(gghighlight)
wages_mono %>%
  left_join(wages, by = "id") %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() + 
  gghighlight(increase)

To explore the available features, see the function reference.

Creating your own features

To create your own features or summaries to pass to features, you provide a named list of functions. For example:

library(brolgar)
feat_three <- list(min = min,
                   med = median,
                   max = max)

feat_three

These are then passed to features like so:

wages %>%
  features(ln_wages, feat_three)

heights %>%
  features(height_cm, feat_three)
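The functions in the list do not have to be existing base R functions. As a minimal sketch, the list below defines two custom summaries as anonymous functions. The names feat_spread, spread, and n_non_missing are hypothetical and chosen only for this illustration. The functions are first checked on a plain vector, and the call to features only runs if brolgar is installed.

```r
# Hypothetical custom features: each takes a numeric vector and
# returns a single value.
feat_spread <- list(
  # Difference between the largest and smallest non-missing value
  spread = function(x, ...) diff(range(x, na.rm = TRUE)),
  # Number of non-missing observations
  n_non_missing = function(x, ...) sum(!is.na(x))
)

# Check the functions on a plain vector before using them on real data
x <- c(1.5, NA, 2.0, 3.5)
feat_spread$spread(x)
feat_spread$n_non_missing(x)

# Pass the list to features in the same way as feat_three
if (requireNamespace("brolgar", quietly = TRUE)) {
  library(brolgar)
  print(wages %>% features(ln_wages, feat_spread))
}
```

Testing each function on a small vector first makes it easier to spot problems such as missing value handling before running it across every key.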

Inside brolgar, the features are created with the following syntax:

feat_five_num <- function(x, ...) {
  list(
    min = b_min(x, ...),
    q25 = b_q25(x, ...),
    med = b_median(x, ...),
    q75 = b_q75(x, ...),
    max = b_max(x, ...)
  )
}

Here the functions prefixed with b_ are wrappers around the corresponding base functions that default to na.rm = TRUE. The quantile wrappers additionally use type = 8 and names = FALSE.
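As an illustration of that description, wrappers of this kind could be written roughly as follows. This is a sketch under stated assumptions: my_min and my_q25 are hypothetical stand-ins, and the actual internal definitions in brolgar may differ in detail.

```r
# Hypothetical stand-ins for the b_ helpers described above;
# the real brolgar definitions may differ.

# Minimum that drops missing values by default
my_min <- function(x, ...) {
  min(x, na.rm = TRUE, ...)
}

# 25th percentile using quantile type 8, returned without a name
my_q25 <- function(x, ...) {
  stats::quantile(x, probs = 0.25, type = 8,
                  names = FALSE, na.rm = TRUE, ...)
}

x <- c(2, 4, NA, 6, 8)
my_min(x)   # the missing value is ignored
my_q25(x)   # a single unnamed number
```

Setting names = FALSE matters here because a named result such as "25%" would otherwise leak into the column names of the feature output.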

Accessing sets of features

If you want to run many or all of the features from a package on your data, you can collect them with feature_set. For example:

library(fabletools)
feat_brolgar <- feature_set(pkgs = "brolgar")
length(feat_brolgar)

You could then run these like so:

wages %>%
  features(ln_wages, feat_brolgar)

For more information, see ?fabletools::feature_set

Registering a feature in a package

If you create features in your own package and want to make them accessible with feature_set, do the following.

Functions can be registered via fabletools::register_feature(). To register features in a package, I create a file called zzz.R, and use the .onLoad(...) function to set this up on loading the package:

.onLoad <- function(...) {
  fabletools::register_feature(feat_three_num, c("summary"))
  # ... and as many as you want here!
}



brolgar documentation built on June 22, 2024, 11:24 a.m.