knitr::opts_chunk$set( collapse = TRUE, comment = "##", fig.width = 8, fig.height = 5.5, out.width = "90%" ) library(tidyverse) library(viridisLite) theme_set(theme_minimal() + theme(legend.position = "bottom")) options( ggplot2.continuous.colour = "viridis", ggplot2.continuous.fill = "viridis" ) scale_colour_discrete <- scale_colour_viridis_d scale_fill_discrete <- scale_fill_viridis_d library("tidyfun") # try(devtools::load_all("~/fda/tidyfun")) # try(devtools::load_all("~/Work/fda/tidyfun")) pal_5 <- viridis(7)[-(1:2)] set.seed(1221)
tidyfun
The goal of tidyfun
is to provide accessible and well-documented software that makes functional data analysis in R
easy. In this vignette, we explore some aspects of data manipulation that are possible using tidyfun
, emphasizing compatibility with the tidyverse
.
Other vignettes have examined the tfd
& tfb
data types, and how to convert common formats for functional data (e.g. matrices, long- and wide-format data frames, fda
objects) in these new data types. Because our goal is "tidy" data manipulation for functional data analysis, the result of data conversion processes has been a data frame in which a column contains the functional data of interest. This vignette starts from that point.
Throughout, we make use of some visualization tools -- these are explained in more detail in the visualization vignette.
The datasets used in this vignette are the tidyfun::chf_df
and tidyfun::dti_df
dataset. The first contains minute-by-minute observations of log activity counts (stored as a tfd
vector called activity
) over seven days for each of 47 subjects with congestive heart failure. In addition to id
and activity
, we observe several covariates.
data(chf_df)
chf_df
A quick plot of the first 5 curves:
chf_df |> slice(1:5) |> ggplot(aes(y = activity)) + geom_spaghetti(alpha = 0.1)
The tidyfun::dti_df
contains fractional anisotropy (FA) tract profiles for the corpus callosum (cca) and the right corticospinal tract (rcst), along with several covariates.
data(dti_df)
dti_df
A quick plot of the cca
tract profiles is below.
dti_df |> ggplot(aes(y = cca)) + geom_spaghetti(alpha = 0.05)
tidyverse
functionsDataframes using tidyfun
to store functional observations can be manipulated using tools from dplyr
, including select
and filter
:
chf_df |> select(id, day, activity) |> filter(day == "Mon") |> ggplot(aes(y = activity)) + geom_spaghetti(alpha = 0.05)
Operations using group_by
and summarize
also work -- let's look at some
daily averages of these activity profiles:
chf_df |> group_by(day) |> summarize(mean_act = mean(activity)) |> ggplot(aes(y = mean_act, color = day)) + geom_spaghetti()
One can mutate
functional observations -- here we exponentiate the log activity counts to obtain original recordings:
chf_df |> slice(1:5) |> mutate(exp_act = exp(activity)) |> ggplot(aes(y = exp_act)) + geom_spaghetti(alpha = 0.2)
Functions for data manipulation from tidyr
are also supported. We illustrate by using pivot_wider
to create new tfd
-columns containing the activity profiles for each day of the week:
chf_df |> select(id, day, activity) |> pivot_wider( names_from = day, values_from = activity )
(Note that this has made the data less "tidy" and is therefore not generally recommended, but may be useful in some cases).
It's also possible to join datasets based on non-functional keys. To illustrate, we'll first create a pair of datasets:
monday_df <- chf_df |> filter(day == "Mon") |> select(id, monday_act = activity) friday_df <- chf_df |> filter(day == "Fri") |> select(id, friday_act = activity)
These can be joined using the id
variable as a key (and then tidied using pivot_longer
):
monday_df |> left_join(friday_df, by = "id") |> pivot_longer(monday_act:friday_act, names_to = "day", values_to = "activity")
Similar tidying can be done for the DTI data -- let's look at average RCST tract values for gender and case status:
dti_df |> group_by(case, sex) |> summarize(mean_rcst = mean(rcst, na.rm = TRUE)) |> ggplot(aes(y = mean_rcst, color = case)) + geom_spaghetti(linewidth = 2) + facet_grid(~sex)
tidyfun
functionsSome dplyr
functions are useful in conjunction with new functions in tidyfun
. For example, one might use filter
with tf_anywhere
to filter based on the values of observed functions:
like_to_move_it_move_it <- chf_df |> filter(tf_anywhere(activity, value > 9)) glimpse(like_to_move_it_move_it) like_to_move_it_move_it |> ggplot(aes(y = activity)) + geom_spaghetti(aes(colour = id))
A second example of this functionality in the DTI data is below.
dti_df |> filter(tf_anywhere(cca, value < 0.26)) |> ggplot(aes(y = cca)) + geom_spaghetti()
The existing mutate
function can be combined with several tidyfun
functions, including tf_smooth
, tf_zoom
, and tf_deriv
.
One can smooth existing observations using tf_smooth
:
chf_df |> filter(id == 1) |> mutate(smooth_act = tf_smooth(activity)) |> ggplot(aes(y = smooth_act)) + geom_spaghetti()
This can be combined with previous steps, like group_by
and summarize
, to build intution through descriptive plots and summaries:
chf_df |> group_by(day) |> summarize(mean_act = mean(activity)) |> mutate(smooth_mean = tf_smooth(mean_act)) |> ggplot(aes(y = mean_act, color = day)) + geom_spaghetti(alpha = 0.2) + geom_spaghetti(aes(y = smooth_mean), linewidth = 2)
One can also extract observations over a subset of the full domain using tf_zoom
:
chf_df |> filter(id == 1) |> mutate(daytime_act = tf_zoom(activity, 360, 1200)) |> ggplot(aes(y = daytime_act)) + geom_spaghetti(alpha = 0.2)
We can also convert from tfd
to tfb
inside a mutate
statement as part of a data processing pipeline:
dti_df <- dti_df |> mutate(cca_tfb = tfb(cca, k = 25))
It's also possible to compute derivatives as part of a processing pipeline:
dti_df |> slice(1:10) |> mutate( cca_raw_deriv = tf_derive(cca), cca_tfb_deriv = tf_derive(cca_tfb) ) |> ggplot() + geom_spaghetti(aes(y = cca_raw_deriv), alpha = 0.3, linewidth = 0.3, col = "blue") + geom_spaghetti(aes(y = cca_tfb_deriv), alpha = 0.3, linewidth = 0.3, col = "red") + ylab("d/dt f(t)")
data.table
tidyfun
functional data objects work within data.table
as well.
However, there is one specific known caveat when calculating the mean of a tf
vector within a data.table
context:
By default, data.table
will use optimized routines for summary statistics, which may cause mean
to dispatch to data.table
's own optimized implementation,
leading to unexpected behavior for tf
vectors.
To ensure the correct mean calculation for tf
vectors, there are two solutions:
options(data.table.optimize = 0)
, which disables optimization, allowing the mean function to correctly dispatch for tf
vectors.tf:::mean.tf(activity)
in your code to directly specify the correct tf
-mean calculation.This caveat applies specifically to mean()
and is the only known issue.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.