knitr::opts_chunk$set(
  eval = TRUE,
  collapse = TRUE,
  comment = "#>",
  dpi = 150,
  fig.show = 'hold',
  fig.width = 5
)

options(scipen = 9999)
The R package distantia 2.0 introduces Time Series Lists as the objects that organize time series for dissimilarity analyses, and provides a complete toolset to manage them. This article describes Time Series Lists in detail and showcases the most common data handling procedures enabled by the new functions included in the package.
In this new version of distantia, groups of time series are organized as named lists of zoo objects. These lists are named Time Series Lists (TSL) within the package, and are designed to facilitate the parallelization of dissimilarity analyses. TSL is deliberately not a class: the idea is to keep these lists as simple as possible, granting users the autonomy to create or modify them as needed.
The R package zoo provides an S3 class of the same name designed to handle observations ordered by an index. It supports various index classes, such as Date, POSIXct, or even custom numeric or character indices, and handles regular and irregular time series equally well. Other advantages of using zoo objects include seamless integration with base R methods, and built-in tools for alignment, merging, and subsetting.
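For example, two zoo objects with partially overlapping indices can be aligned, merged, and subset with plain base R syntax:

library(zoo)

#two series with different Date indices
a <- zoo::zoo(x = 1:5, order.by = as.Date("2024-01-01") + 0:4)
b <- zoo::zoo(x = 10:13, order.by = as.Date("2024-01-03") + 0:3)

#merge() aligns both series by their shared index
ab <- merge(a, b)

#window() subsets a time range
window(ab, start = as.Date("2024-01-03"), end = as.Date("2024-01-05"))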
Let's take a look at a little zoo object.
library(distantia)

z <- distantia::zoo_simulate(
  name = "my_zoo",
  cols = 3,
  rows = 10,
  time_range = c(
    "2024-01-01",
    "2024-12-31"
  ),
  na_fraction = 0.1,
  irregular = TRUE,
  seed = 1
)

zoo_plot(x = z)
Zoo objects have two main components: a data matrix with the time series observations, and an index representing time or sample order. The data matrix is extracted with zoo::coredata().
zoo::coredata(z)

class(zoo::coredata(z))
The core data of a zoo object can also be a vector when the time series is univariate.
x <- zoo::zoo(x = runif(10))

is.vector(zoo::coredata(x))
However, this is frowned upon in distantia, and these vectors should always be converted to matrices.
x <- distantia::zoo_vector_to_matrix(x = x)

is.vector(zoo::coredata(x))

is.matrix(zoo::coredata(x))
The index of zoo time series is extracted with zoo::index().
zoo::index(z)

class(zoo::index(z))
The classes for zoo indices explicitly supported in distantia are Date, POSIXct, and numeric. The function distantia::zoo_time() helps summarize the time features of a zoo object, including the time class.
distantia::zoo_time(x = z)
Additionally, in distantia all zoo objects are expected to have the attribute name.
attributes(z)$name
This attribute is not part of the zoo class; it is used to facilitate plotting operations and is managed internally by the tsl_...() functions. There are several functions in distantia to manage the names of zoo objects.
#reset zoo name
z <- distantia::zoo_name_set(
  x = z,
  name = "My_Zoo"
)

#get zoo name
distantia::zoo_name_get(x = z)

#clean zoo name
z <- distantia::zoo_name_clean(
  x = z,
  lowercase = TRUE
)

distantia::zoo_name_get(x = z)
This package comes with several functions to manipulate zoo objects:

- zoo_name_set(), zoo_name_get() and zoo_name_clean(): handle the attribute "name".
- zoo_time(): details of the zoo index.
- zoo_aggregate(): time aggregation of zoo objects.
- zoo_resample(): interpolation or extrapolation to a different time, or from irregular to regular time.
- zoo_smooth(): rolling-window smoothing (see the sketch below).
- zoo_permute(): restricted permutation.
- zoo_vector_to_matrix() and zoo_to_tsl(): internal functions to facilitate handling zoo objects within TSLs.

rm(x, z)
TSLs are named lists of zoo time series. The example below shows how to build a TSL from scratch with zoo objects. This is not the most common or most comfortable approach, though, so please visit the section Creating Time Series Lists to find out how to convert your data to TSL easily.
#create simple tsl
my_tsl <- list(
  A = distantia::zoo_simulate(
    cols = 4,
    na_fraction = 0.2
  ),
  B = distantia::zoo_simulate()
)

names(my_tsl)

class(my_tsl)

#names of the zoo objects
lapply(X = my_tsl, FUN = distantia::zoo_name_get)

#class of the objects within the list
lapply(X = my_tsl, FUN = class)
TSLs ready for dissimilarity analyses must follow several rules to ensure these analyses run without issues:

- The list must have names, and so must the zoo objects inside it.
- Each zoo object must have a "name" attribute matching its name in the list.
- The core data of every zoo object must be of class matrix, with named columns.
- The zoo objects must not contain NA, Inf, or NaN cases.

I understand these are way too many rules, but the functions tsl_diagnose() and tsl_repair() are there to help you forget about them. When applying tsl_diagnose() to my_tsl we can see it has several issues.
distantia::tsl_diagnose(tsl = my_tsl)
From there, we can either follow the suggestions, or apply tsl_repair() directly, as done below.
my_tsl <- distantia::tsl_repair(tsl = my_tsl)
This function identifies the issues raised by tsl_diagnose() and repairs them when possible. If tsl_diagnose() is run again, it should stay silent if everything is OK.
distantia::tsl_diagnose( tsl = my_tsl )
From this point, our TSL is ready to go!
distantia::tsl_plot( tsl = my_tsl, guide = FALSE )
rm(my_tsl)
The function tsl_initialize() (with the alias tsl_init()) is designed to help transform several data structures to Time Series Lists.
Long and tidy data frames are convenient structures to store multivariate time series of a reasonable size. For example, the data frame fagus_dynamics shown below has the column "name" identifying separate time series, the column "time" with observation dates, and three numeric columns with environmental observations.
head(fagus_dynamics)
Transforming this data frame to TSL is quite straightforward:
tsl <- distantia::tsl_initialize(
  x = fagus_dynamics,
  name_column = "name",
  time_column = "time"
)

#even shorter!
tsl <- distantia::tsl_init(
  x = fagus_dynamics,
  name = "name",
  time = "time"
)

distantia::tsl_plot(
  tsl = tsl
)
Once manipulated and/or analyzed, a TSL can be converted back to a data frame with tsl_to_df().
df <- distantia::tsl_to_df(tsl = tsl)

head(df)
rm(df, tsl)
evi_wide <- stats::reshape(
  data = distantia::fagus_dynamics[, c(
    "name",
    "time",
    "evi"
  )],
  timevar = "name",
  idvar = "time",
  direction = "wide",
  sep = "_"
)
A wide data frame is a useful structure to store univariate time series observed in different places at the same times.
head(evi_wide)
When no name_column is provided, tsl_initialize() assumes that each column is a univariate time series.
tsl <- distantia::tsl_initialize(
  x = evi_wide,
  time_column = "time"
)

tsl_plot(
  tsl = tsl,
  guide = FALSE
)
In this case, the column name of each univariate zoo object will be "x".
distantia::tsl_colnames_get(tsl = tsl)
This name can be reset as needed with tsl_colnames_set().
tsl <- distantia::tsl_colnames_set(
  tsl = tsl,
  names = "evi"
)

distantia::tsl_colnames_get(tsl = tsl)
This TSL can be converted to a data frame too, but this time the result is a long data frame.
df <- distantia::tsl_to_df(tsl = tsl)

head(df)
rm(df, evi_wide, tsl)
A list of numeric vectors can also be converted to TSL. In this case, the zoo index is a sequence of integers.
tsl <- distantia::tsl_initialize(
  x = list(
    a = runif(10),
    b = runif(10)
  )
)

distantia::tsl_plot(
  tsl = tsl,
  guide = FALSE
)
The same thing can be done with matrices as well.
tsl <- distantia::tsl_initialize(
  x = list(
    a = matrix(data = runif(100), ncol = 2, nrow = 50),
    b = matrix(data = runif(100), ncol = 2, nrow = 50)
  )
)

distantia::tsl_plot(
  tsl = tsl,
  guide = FALSE
)
rm(tsl)
The functions in distantia to handle TSLs are named after the pattern tsl_...(). Some of these functions are designed to explore and better understand your TSLs, while others are designed to manipulate and transform them to facilitate dissimilarity analyses.
This section offers a complete overview of these functions and their applications.
The functions to manage TSLs are a bunch of lapply in a trench coat. Most use future.apply::future_lapply(), a parallelized version of lapply from the future.apply package, and a few combine foreach::foreach() with the %dofuture% operator from doFuture in shameless parallelized loops.
As such, they all support a parallelization backend provided by the future package, as shown below. However, bear in mind that parallelization is only worth it for very large datasets.
library(future)
library(parallelly)

future::plan(
  future::multisession,
  workers = parallelly::availableCores() - 1
)
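Once the analyses are done, parallelization can be disabled by switching back to a sequential plan:

#disable parallelization
future::plan(future::sequential)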
They also support progress bars via the progressr package. However, this option, commented out in the code below, does not work in Rmarkdown.
#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)
This section showcases all the tools available in distantia to better understand our Time Series Lists. Logically, the section starts with distantia::tsl_simulate(), a function that has nothing to do with that purpose.
tsl <- distantia::tsl_simulate(
  n = 4,
  cols = 5,
  rows = 1000,
  time_range = c("2010-01-01", "2020-01-01"),
  data_range = c(0, 1),
  seasons = 10,
  na_fraction = 0.1,
  irregular = TRUE
)

tsl_plot(
  tsl = tsl,
  guide = FALSE
)
The simplest tools to explore TSLs focus on basic features, like extracting TSL names and dimensions.
#time series names
distantia::tsl_names_get(tsl = tsl)

#column names
distantia::tsl_colnames_get(tsl = tsl)

#number of columns
distantia::tsl_ncol(tsl = tsl)

#number of rows
distantia::tsl_nrow(tsl = tsl)
The function tsl_time() summarizes the time features of all time series in a TSL.
distantia::tsl_time(tsl = tsl)
The function tsl_stats(), on the other hand, computes stats by time series and variable to summarize the TSL values.
df_stats <- distantia::tsl_stats(
  tsl = tsl,
  lags = 1 #temporal autocorrelation lag
)

df_stats
The dissimilarity analyses implemented in version 2.0 of distantia do not support NA cases in time series lists. There are two alternative workflows to handle them.
The first one uses tsl_count_NA() and tsl_handle_NA(). The former converts Inf and NaN to NA and counts the NA cases in each time series, while the latter either omits or imputes NA cases via zoo::na.approx().
#count NA
distantia::tsl_count_NA(tsl = tsl)

#impute NA cases
tsl_notNA <- distantia::tsl_handle_NA(
  tsl = tsl,
  na_action = "impute"
)

#re-count
distantia::tsl_count_NA(tsl = tsl_notNA)
The second workflow involves the functions tsl_diagnose() and tsl_repair(), and is more general because it addresses other potential issues at once.
#diagnose issues with NA values
distantia::tsl_diagnose(tsl = tsl)

#impute NA cases
tsl <- tsl_repair(tsl = tsl)

#re-diagnose to check result
distantia::tsl_diagnose(tsl = tsl)
The function tsl_subset() helps focus on particular regions of a time series list. Additionally, by default this function returns the numeric columns that are shared across all time series in the TSL.
tsl_new <- distantia::tsl_subset(
  tsl = tsl,
  names = c("A", "C", "D"),
  colnames = c("a", "b"),
  time = c("2014-01-01", "2018-01-01"),
  numeric_cols = TRUE,
  shared_cols = TRUE
)

distantia::tsl_plot(
  tsl = tsl_new
)
The function tsl_burst() transforms a multivariate TSL into a univariate TSL by creating a new zoo object from each column of the original zoo objects. This function helps apply dissimilarity analyses between individual variables of multivariate time series.
#burst multivariate time series to univariate
tsl_univariate <- distantia::tsl_burst(
  tsl = tsl_new
)

#check new time series names
distantia::tsl_names_get(tsl = tsl_univariate)

#check new column names
distantia::tsl_colnames_get(tsl = tsl_univariate)

#plot univariate time series
distantia::tsl_plot(
  tsl = tsl_univariate,
  guide = FALSE
)
rm(tsl, tsl_new, tsl_notNA, tsl_univariate, df_stats)
Aggregation reduces time series frequency (a.k.a. downsampling) by summarizing multiple data points into a single value over a specified time interval. It reduces the number of samples, smooths out noise, can transform irregular time series into regular ones, and can generate entirely new time series depending on the aggregation statistic. On the other hand, it obscures fine-grained detail and alters statistical properties such as variance and temporal autocorrelation.
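The toy example below illustrates the concept with the aggregate() method provided by the zoo package itself, downsampling a monthly series to quarterly means:

library(zoo)

#toy monthly series for 2024
z_monthly <- zoo::zoo(
  x = 1:12,
  order.by = seq(as.Date("2024-01-01"), by = "month", length.out = 12)
)

#map each date to the first day of its quarter and average
aggregate(
  z_monthly,
  by = as.Date(cut(zoo::index(z_monthly), "quarter")),
  FUN = mean
)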
In distantia, this operation is supported by the function tsl_aggregate(), with two arguments:

- new_time: time vector or keyword defining the time intervals to aggregate over.
- f: aggregation function summarizing the observations over each aggregation interval.

The code below illustrates this function's usage to compute quarterly temperature and precipitation indicators from the monthly observations in the dataset honeycomb_climate.
tsl <- distantia::tsl_init(
  x = distantia::honeycomb_climate,
  name = "cell",
  time = "time"
) |>
  tsl_subset(
    names = 1:5 #subset first five elements
  )

distantia::tsl_plot(
  tsl = tsl
)
The easiest way to define aggregation intervals in distantia is to use a keyword. The function tsl_time() returns the supported keywords for a given TSL.
df_time <- distantia::tsl_time(
  tsl = tsl,
  keywords = "aggregate"
)

df_time$keywords |>
  unlist() |>
  unique()
Let's focus on "quarters" for the rest of the example.
interval <- "quarters"
The code below subsets the column "temperature" in tsl, aggregates it by computing the minimum, maximum, and mean per quarter, and adds a suffix to each new aggregated column.
#subset temperature column
tsl_temperature <- distantia::tsl_subset(
  tsl = tsl,
  colnames = "temperature"
)

#compute stats: minimum, maximum, and mean
tsl_temperature_min <- distantia::tsl_aggregate(
  tsl = tsl_temperature,
  new_time = interval,
  f = min
) |>
  distantia::tsl_colnames_suffix(
    suffix = "_min" #set suffix for aggregated column
  )

tsl_temperature_max <- distantia::tsl_aggregate(
  tsl = tsl_temperature,
  new_time = interval,
  f = max
) |>
  distantia::tsl_colnames_suffix(
    suffix = "_max"
  )

tsl_temperature_mean <- distantia::tsl_aggregate(
  tsl = tsl_temperature,
  new_time = interval,
  f = mean
) |>
  distantia::tsl_colnames_suffix(
    suffix = "_mean"
  )
Similar steps can be followed to process the variable "precipitation".
#subset precipitation column
tsl_precipitation <- distantia::tsl_subset(
  tsl = tsl,
  colnames = "precipitation"
)

#compute stats: sum, maximum, and minimum
tsl_precipitation_sum <- distantia::tsl_aggregate(
  tsl = tsl_precipitation,
  new_time = interval,
  f = sum
) |>
  distantia::tsl_colnames_suffix(
    suffix = "_sum" #set suffix for aggregated column
  )

tsl_precipitation_max <- distantia::tsl_aggregate(
  tsl = tsl_precipitation,
  new_time = interval,
  f = max
) |>
  distantia::tsl_colnames_suffix(
    suffix = "_max"
  )

tsl_precipitation_min <- distantia::tsl_aggregate(
  tsl = tsl_precipitation,
  new_time = interval,
  f = min
) |>
  distantia::tsl_colnames_suffix(
    suffix = "_min"
  )
Finally, we join all new TSLs together to examine the result.
tsl_climate_stats <- distantia::tsl_join(
  tsl_temperature_min,
  tsl_temperature_max,
  tsl_temperature_mean,
  tsl_precipitation_sum,
  tsl_precipitation_max,
  tsl_precipitation_min
)

distantia::tsl_plot(
  tsl = tsl_climate_stats,
  ylim = "relative"
)
rm(
tsl_temperature_min,
tsl_temperature_max,
tsl_temperature_mean,
tsl_precipitation_sum,
tsl_precipitation_max,
tsl_precipitation_min,
tsl_climate_stats,
tsl_precipitation,
tsl_temperature,
interval
)
Resampling is a model-based method to change the frequency of a time series via interpolation. It is useful to align time series that are irregular or have different resolutions. Bear in mind, however, that resampling to a frequency much higher than the original one will result in interpolation artifacts and distorted time series.
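As a minimal illustration of the idea, the code below uses zoo::na.approx(), the same interpolation function tsl_resample() relies on by default, to interpolate an irregular series onto a regular weekly grid:

library(zoo)

#irregular toy series
z_irregular <- zoo::zoo(
  x = c(1, 3, 2, 5),
  order.by = as.Date(c("2024-01-01", "2024-01-04", "2024-01-13", "2024-01-20"))
)

#linear interpolation on a regular weekly grid
new_time <- seq(as.Date("2024-01-01"), as.Date("2024-01-20"), by = "week")

zoo::na.approx(object = z_irregular, xout = new_time)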
The code below creates two irregular time series with different numbers of rows.
tsl <- distantia::tsl_init(
  x = list(
    A = zoo_simulate(rows = 100, seasons = 10),
    B = zoo_simulate(rows = 50, seasons = 10)
  )
)

distantia::tsl_plot(
  tsl = tsl,
  guide = FALSE
)
The time features of this TSL show clear differences in frequency between these time series.
distantia::tsl_time(tsl = tsl)[ , c("name", "rows", "resolution", "begin", "end") ]
By default, if the argument new_time is omitted, a linear model (computed via zoo::na.approx()) is used to resample the TSL to its average resolution over the intersection of all its time ranges.
tsl_resampled <- distantia::tsl_resample(
  tsl = tsl
)

distantia::tsl_time(tsl = tsl_resampled)[
  , c("name", "rows", "resolution", "begin", "end")
]

distantia::tsl_plot(
  tsl = tsl_resampled,
  guide = FALSE
)
Otherwise, a keyword can be used to define the resampling frequency. Again, we can find the valid keywords using tsl_time(), this time replacing "aggregate" with "resample" in the argument keywords.
df <- distantia::tsl_time(
  tsl = tsl,
  keywords = "resample"
)

df$keywords |>
  unlist() |>
  unique()
tsl_resampled <- distantia::tsl_resample(
  tsl = tsl,
  new_time = "weeks",
  method = "loess"
)

tsl_time(tsl = tsl_resampled)[
  , c("name", "rows", "resolution", "begin", "end")
]

distantia::tsl_plot(
  tsl = tsl_resampled,
  guide = FALSE
)
rm(df, df_time, tsl, tsl_resampled)
Time series smoothing is used to mitigate noise in high-frequency time series, or to highlight general trends rather than fine-grained details.
The function tsl_smooth() implements two smoothing methods: rolling-window and exponential.
To check how these methods work we first need a long and noisy TSL.
tsl <- distantia::tsl_simulate(
  rows = 1000,
  irregular = FALSE
)

distantia::tsl_plot(
  tsl = tsl,
  guide = FALSE
)
This method computes a statistic over a fixed-width window of consecutive cases and replaces each central value with the computed statistic. It should not be applied to highly irregular time series, as it ignores time distance.
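To make the mechanics concrete, here is a minimal base R sketch of a centered rolling mean. It illustrates the idea, and is not the package's internal implementation.

#centered rolling mean, for illustration only
rolling_mean <- function(x, window = 3) {
  half <- floor(window / 2)
  sapply(seq_along(x), function(i) {
    #window indices, truncated at the series edges
    neighbors <- max(1, i - half):min(length(x), i + half)
    mean(x[neighbors])
  })
}

rolling_mean(x = c(1, 2, 10, 2, 1), window = 3)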
The code below applies this method with two different window sizes.
#smoothing with window 100
tsl_smooth_100 <- tsl_smooth(
  tsl = tsl,
  window = 100,
  f = mean
)

distantia::tsl_plot(
  tsl = tsl_smooth_100,
  guide = FALSE
)

#smoothing with window 10
tsl_smooth_10 <- tsl_smooth(
  tsl = tsl,
  window = 10,
  f = mean
)

distantia::tsl_plot(
  tsl = tsl_smooth_10,
  guide = FALSE
)
The window size determines the smoothing scale, and with that, the type of trend highlighted in the smoothing results.
This method generates each new value as the weighted average of the current value and past smoothed values. The weight is defined by the argument alpha in tsl_smooth().
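In recurrence form, each smoothed value is s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]. The minimal base R sketch below illustrates the idea; it is not the package's internal implementation.

#exponential smoothing, for illustration only
exponential_smooth <- function(x, alpha = 0.2) {
  s <- numeric(length(x))
  s[1] <- x[1]
  for (t in 2:length(x)) {
    s[t] <- alpha * x[t] + (1 - alpha) * s[t - 1]
  }
  s
}

exponential_smooth(x = c(1, 2, 10, 2, 1), alpha = 0.2)

The code below applies tsl_smooth() with two different alpha values.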
#smoothing with alpha 0.2
tsl_smooth_exp_0.2 <- tsl_smooth(
  tsl = tsl,
  alpha = 0.2
)

distantia::tsl_plot(
  tsl = tsl_smooth_exp_0.2,
  guide = FALSE
)

#smoothing with alpha 0.8
tsl_smooth_exp_0.8 <- tsl_smooth(
  tsl = tsl,
  alpha = 0.8
)

distantia::tsl_plot(
  tsl = tsl_smooth_exp_0.8,
  guide = FALSE
)
Alpha values closer to zero produce smoother results, as the plots above show.
rm(
tsl,
tsl_smooth_10,
tsl_smooth_100,
tsl_smooth_exp_0.2,
tsl_smooth_exp_0.8
)
The function tsl_transform() applies a function f to transform the values of a TSL. The names of the available f functions can be listed with f_list().
distantia::f_list()
Scaling and centering multivariate time series is essential in dynamic time warping to ensure all variables contribute equally, regardless of their range or units.
For example, the dataset fagus_dynamics has variables in different units.
tsl <- distantia::tsl_init(
  x = distantia::fagus_dynamics,
  name = "name",
  time = "time"
)

distantia::tsl_plot(
  tsl = tsl
)
Due to the differences in magnitude between variables, a dynamic time warping analysis will focus disproportionately on rainfall, biasing the results.
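A toy computation illustrates the problem: with unscaled variables, the Euclidean distance between two observations is driven almost entirely by the variable with the largest magnitude.

#toy observations with unscaled variables
a <- c(temperature = 10, rainfall = 1200)
b <- c(temperature = 20, rainfall = 1100)

#rainfall dominates the distance, even though
#temperature doubled between observations
sqrt(sum((a - b)^2))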
To solve this issue, the package distantia implements two flavors of scaling and/or centering:

- f_scale_local: scales and/or centers each time series with its own mean and standard deviation.
- f_scale_global: scales and/or centers all time series with the mean and standard deviation of the whole TSL.
#local scaling
tsl_local_scaling <- distantia::tsl_transform(
  tsl = tsl,
  f = distantia::f_scale_local
)

#global scaling
tsl_global_scaling <- distantia::tsl_transform(
  tsl = tsl,
  f = distantia::f_scale_global
)
The stats of both operations show that the global scaling preserves variable offsets between locations, while the local scaling yields a mean of 0 and a standard deviation of 1 across all variables.
stats_cols <- c("name", "variable", "mean", "sd") #stats of local scaling distantia::tsl_stats( tsl = tsl_local_scaling )[, stats_cols] #stats of global scaling distantia::tsl_stats( tsl = tsl_global_scaling )[, stats_cols]
The functions f_rescale_local and f_rescale_global work under the same principle to rescale variable ranges. By default, these functions rescale all time series between 0 and 1, but the arguments new_min and new_max can receive numeric vectors to set different ranges for different variables.
tsl_local_rescaling <- tsl_transform(
  tsl = tsl,
  f = f_rescale_local,
  new_min = 0, #same for all variables
  new_max = c(0, 100, 10)
)

tsl_global_rescaling <- tsl_transform(
  tsl = tsl,
  f = f_rescale_global,
  new_min = 0,
  new_max = c(0, 100, 10)
)
The stats below show the results of the local and global rescaling.
stats_cols <- c("name", "variable", "min", "max") distantia::tsl_stats( tsl = tsl_local_rescaling )[, stats_cols] distantia::tsl_stats( tsl = tsl_global_rescaling )[, stats_cols]
rm(
tsl,
tsl_global_scaling,
tsl_local_scaling,
tsl_global_rescaling,
tsl_local_rescaling,
stats_cols
)
Removing trends in time series before applying dynamic time warping helps prevent inflated dissimilarity scores caused by non-stationary components. Detrending also ensures that the alignment generated by dynamic time warping focuses on meaningful shape features rather than being distorted by long-term trends.
The transformation function f_trend_linear transforms time series into their linear trend, which helps identify whether a time series needs detrending or not. The example below applies this function to the cities_temperature dataset.
#loading cities_temperature as tsl
tsl <- distantia::tsl_init(
  x = distantia::cities_temperature,
  name = "name",
  time = "time"
)

#computing linear trends
tsl_trend <- distantia::tsl_transform(
  tsl = tsl,
  f = distantia::f_trend_linear
)

#plotting linear trends
tsl_plot(
  tsl = tsl_trend,
  columns = 2,
  guide = FALSE,
  line_color = "red4",
  line_width = 1.5
)
We can now compute the stats of these linear trends to identify the cities with the steepest long-term temperature change.
#compute stats of linear trends
df_stats <- distantia::tsl_stats(
  tsl = tsl_trend,
  lags = 0
)

#arrange from higher to lower range
df_stats[
  order(df_stats$range, decreasing = TRUE),
  c("name", "range")
]
Now that we have identified the issue, the function f_detrend_linear helps remove these temporal trends from the data.
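Conceptually, linear detrending fits a straight line to each variable and subtracts it from the data, which is presumably what f_detrend_linear does for every variable in the TSL. A minimal base R sketch of the idea:

#linear detrending sketch: subtract the fitted linear trend
#from a trended toy series
y <- cumsum(rnorm(100)) + seq(0, 10, length.out = 100)
trend <- fitted(lm(y ~ seq_along(y)))
y_detrended <- y - trend

Applying the real function to the whole TSL: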
tsl_detrended <- distantia::tsl_transform( tsl = tsl, f = distantia::f_detrend_linear )
The code below applies f_trend_linear again to demonstrate that the linear detrending worked as expected.
tsl_trend <- distantia::tsl_transform(
  tsl = tsl_detrended,
  f = distantia::f_trend_linear
)

df_stats <- distantia::tsl_stats(
  tsl = tsl_trend,
  lags = 0
)

df_stats[, c("name", "range")]
rm(
tsl,
tsl_trend,
tsl_detrended,
df_stats
)
The distantia package offers several transformations for compositional data:

- f_percent: percentages.
- f_proportion: proportions.
- f_proportion_sqrt: square root of proportions.
- f_hellinger: transforms to proportions and applies the Hellinger transformation.
- f_clr: centered log-ratio.
- f_log: logarithm transformation.
- f_binary: transforms continuous data into discrete events.

The example below shows the application of several of these transformations to the counts dataset eemian_pollen.
tsl_counts <- distantia::tsl_init(
  x = distantia::eemian_pollen,
  name = "name",
  time = "time"
) |>
  distantia::tsl_subset(
    names = c(1, 2)
  )

distantia::tsl_plot(
  tsl = tsl_counts,
  guide_columns = 4
)
Transformation to percentages with f_percent.
tsl_percent <- distantia::tsl_transform(
  tsl = tsl_counts,
  f = distantia::f_percent
)

distantia::tsl_plot(
  tsl = tsl_percent,
  guide_columns = 4
)
The function f_binary helps transform continuous data into discrete events. The code below converts the percentages above into presence/absence using 5% as the threshold. Notice that only the variable "Carpinus" is shown in the plot to facilitate visualization.
tsl_binary <- distantia::tsl_transform(
  tsl = tsl_percent,
  f = distantia::f_binary,
  threshold = 5
)

distantia::tsl_plot(
  tsl = distantia::tsl_subset(
    tsl = tsl_binary,
    colnames = "Carpinus"
  )
)
Transformation to the square root of proportions with f_proportion_sqrt.
tsl_prop_sqrt <- distantia::tsl_transform(
  tsl = tsl_counts,
  f = distantia::f_proportion_sqrt
)

distantia::tsl_plot(
  tsl = tsl_prop_sqrt,
  guide_columns = 4
)
The function f_hellinger transforms the taxa counts to proportions and then applies the Hellinger transformation.
tsl_hellinger <- distantia::tsl_transform(
  tsl = tsl_counts,
  f = distantia::f_hellinger
)

tsl_plot(
  tsl = tsl_hellinger,
  guide_columns = 4
)
Centered log-ratio computed with f_clr.
tsl_clr <- distantia::tsl_transform(
  tsl = tsl_counts,
  f = distantia::f_clr
)

tsl_plot(
  tsl = tsl_clr,
  guide_columns = 4
)