knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 7, fig.width = 7, warning = FALSE, fig.align = "center" )
library(theft)
theft enables the standardised calculation of time-series features from multiple existing feature sets, as well as any user-supplied features.
To explore package functionality, we are going to use a dataset that comes standard with theft called simData. This dataset contains a collection of randomly generated time series for six different types of processes. The dataset can be accessed via:
theft::simData
The data has the following structure:
head(simData)
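As a rough illustration of this long ("tidy") layout, the sketch below builds a small toy data frame in base R with the same column roles that are passed to calculate_features later in this vignette (id, timepoint, values and process). The object toy_data and its contents are made up for illustration and are not part of theft; the exact column types in simData may differ.

```r
# Toy long-format time-series data (hypothetical; not part of theft)
set.seed(123)
n_points <- 50

toy_data <- rbind(
  data.frame(
    id = "noise_1",                    # unique identifier of the series
    timepoint = seq_len(n_points),     # time index within the series
    values = rnorm(n_points),          # observed values
    process = "Gaussian noise"         # group label for the series
  ),
  data.frame(
    id = "walk_1",
    timepoint = seq_len(n_points),
    values = cumsum(rnorm(n_points)),
    process = "Random walk"
  )
)

head(toy_data)
```

Each row holds one observation of one time series, so a dataset with many series is simply a longer table rather than a wider one.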
The core function, which automates the calculation of all requested features at once, is calculate_features. You can choose which subset of features to calculate with the feature_set argument. The choices are currently "catch22", "feasts", "Kats", "tsfeatures", "tsfresh", and/or "TSFEL".
Note that Kats, tsfresh and TSFEL are Python packages. The R package reticulate is used to call Python code that uses these packages and applies it within the broader tidy data philosophy embodied by theft. At present, depending on the input time series, theft provides access to more than 1200 features.
However, as discussed in the functionality demonstrations below, you can also supply your own list of features! But more on that later...
Before using theft with the Kats, tsfresh or TSFEL feature sets (the R-based sets run without any extra setup), you should have a working Python 3.9 installation. After installing theft, run the function install_python_pkgs(venv), where the venv argument is the name of the virtual environment you want to create.
For example, to install the Python libraries into the default virtual environment folder used by reticulate, you would run the following after installing theft (here I am just creating a new virtual environment called "theft-package"; you can call it whatever you like!):
install_python_pkgs("theft-package")
You can then run the following to activate the virtual environment:
init_theft("theft-package")
You are now ready to commit theft using all six potential factory feature sets!
However, you do not necessarily have to use these convenience functions. If you have another method for pointing R to the correct Python (such as reticulate or findpython), you can use that in your workflow instead; just make sure you install Kats, tsfresh or TSFEL as required.
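If you prefer to manage Python through reticulate directly, a minimal sketch might look like the following. It relies only on the reticulate functions virtualenv_exists and use_virtualenv. The wrapper use_theft_env is a hypothetical name introduced here, not a theft function, and it assumes the virtual environment already exists with the required Python packages installed.

```r
# Hypothetical helper (not part of theft): point reticulate at an existing
# virtual environment, returning TRUE on success and FALSE otherwise
use_theft_env <- function(venv = "theft-package") {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    message("reticulate is not installed; skipping Python setup")
    return(invisible(FALSE))
  }
  if (!reticulate::virtualenv_exists(venv)) {
    message("Virtual environment '", venv, "' not found; create it first")
    return(invisible(FALSE))
  }
  # required = TRUE makes reticulate error rather than silently fall back
  # to a different Python installation
  reticulate::use_virtualenv(venv, required = TRUE)
  invisible(TRUE)
}

# Usage (run once per session, before computing Python-based features):
# use_theft_env("theft-package")
```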
NOTE 1: You only need to call init_theft or your other solution once per session.
NOTE 2: If you have issues installing Kats with install_python_pkgs, try install_python_pkgs("theft-package", standard_kats = FALSE).
You are then ready to use the rest of the package's functionality, beginning with the extraction of time-series features. Here is an example with the catch22 set:
feature_matrix <- calculate_features(data = simData, id_var = "id", time_var = "timepoint", values_var = "values", group_var = "process", feature_set = "catch22", seed = 123)
head(feature_matrix)
Note that for the catch22 set you can set the additional catch24 argument to calculate the mean and standard deviation in addition to the standard 22 features:
feature_matrix <- calculate_features(data = simData, id_var = "id", time_var = "timepoint", values_var = "values", group_var = "process", feature_set = "catch22", catch24 = TRUE, seed = 123)
NOTE: If using the tsfresh feature set, you might want to consider the tsfresh_cleanup argument to calculate_features. This argument defaults to FALSE and specifies whether or not to use the in-built tsfresh relevant feature filter.
You can also supply your own named list of functions to compute as time-series features. Below is an example with mean and standard deviation. The list must be named, because theft uses the list element names to label the time-series features internally. If you don't want to use any of the existing feature sets in theft and only want to calculate the features you supply to the features argument, just set feature_set = NULL.
feature_matrix2 <- calculate_features(data = simData, group_var = "process", feature_set = NULL, features = list("mean" = mean, "sd" = sd))
head(feature_matrix2)
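To see why the names matter, here is a minimal base R sketch of the general idea: each function in a named list is applied to each time series, and the list names become the feature labels. This is only an illustration under simplifying assumptions, not how theft is implemented; the objects series and toy_features and the output column names are made up for the example.

```r
# Named list of feature functions; the names become the feature labels
my_features <- list("mean" = mean, "sd" = sd)

# Two toy series keyed by a unique identifier (hypothetical data)
set.seed(123)
series <- list(a = rnorm(100), b = cumsum(rnorm(100)))

# Apply every function to every series and stack the results in long format
rows <- list()
for (id in names(series)) {
  for (feat in names(my_features)) {
    rows[[length(rows) + 1]] <- data.frame(
      id = id,
      feature = feat,
      value = my_features[[feat]](series[[id]])
    )
  }
}
toy_features <- do.call(rbind, rows)
toy_features
```

With an unnamed list, names(my_features) would be NULL and there would be nothing to label the results with, which is why a named list is required.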
For a detailed comparison of the six feature sets, see this paper^[T. Henderson and B. D. Fulcher, "An Empirical Evaluation of Time-Series Feature Sets," 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.].
As theft is based on the foundations laid by hctsa, there is also functionality for reading in hctsa-formatted Matlab files and automatically processing them into tidy dataframes ready for feature extraction in theft. The process_hctsa_file function takes a string filepath to the Matlab file and does all the work for you, returning a dataframe with naming conventions consistent with other theft functionality. As per hctsa specifications for Input File Format 1, this file should have 3 variables with the following exact names: timeSeriesData, labels, and keywords. The filepath can be a local drive path or a URL.
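Because the three variable names must match exactly, it can help to check them before processing a file. The base R sketch below shows one way such a check could look, using a plain named list as a stand-in for the contents of a Matlab file. The helper check_hctsa_vars and the toy_mat object are hypothetical and are not part of theft or hctsa.

```r
# Hypothetical helper (not part of theft): check that the variables loaded
# from a Matlab file include the three names required by Input File Format 1
check_hctsa_vars <- function(x) {
  required <- c("timeSeriesData", "labels", "keywords")
  missing_vars <- setdiff(required, names(x))
  if (length(missing_vars) > 0) {
    stop("Missing hctsa variables: ", paste(missing_vars, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in for the contents of a Matlab file (made-up values)
toy_mat <- list(
  timeSeriesData = list(rnorm(10), rnorm(10)),
  labels = c("series_1", "series_2"),
  keywords = c("noise", "noise")
)

check_hctsa_vars(toy_mat)
```

The check is case-sensitive, so a variable called Labels or timeseriesdata would be reported as missing.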
Please see the companion package theftdlc ('theft downloadable content') for a large suite of functions for analysing and visualising the extracted features.