Feature extraction is a crucial step for tackling machine learning
problems. Many machine learning problems start with complex (often
timestamped) raw data with many grouped variables (e.g. heart rate
measurements of many patients, GPS data for analyzing the movements of
many participants of a study, etc.). Often, this raw data cannot be
used directly by machine learning algorithms. User-defined features
must be extracted for this purpose. Examples could be the heart rate
variability of a patient, or the maximum distance traveled by a
participant of a GPS study. Since there are many different machine
learning applications and therefore many inherently different raw
datasets and features which need to be calculated, we do not supply any
automated features. fxtract assists you in the feature extraction
process by helping with the data wrangling needed, but still allows you
to extract your own defined features.
The user only needs to define functions which take a dataset as input
and return a named vector (or list) with the desired features. The
whole data wrangling (calculating the features for each ID and
collecting the results in one final dataframe) is handled by fxtract.
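To make the division of labor concrete, here is a minimal base-R sketch of the wrangling described above, written by hand for the built-in iris data. It is not how fxtract is implemented. The names sepal_features and manual_results are made up for this illustration.

```r
# Feature function: takes the data of ONE group and returns a named vector.
sepal_features <- function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length))
}

# Hand-written wrangling: split by ID, apply the function to each group,
# then bind the per-group results into one final data frame.
groups <- split(iris, iris$Species)
rows <- lapply(groups, function(d) as.data.frame(as.list(sepal_features(d))))
manual_results <- cbind(Species = names(groups), do.call(rbind, rows))
rownames(manual_results) <- NULL
manual_results
```

Only the feature function is specific to the problem; the split, apply and bind steps are the part a package like fxtract is meant to take over.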
This package works with very large datasets and many different IDs. The
main functionality is written in R6. Parallelization is available via
future.
See the tutorial on how to use this package.
For the release version use:
install.packages("fxtract")
For the development version use devtools:
devtools::install_github("QuayAu/fxtract")
Why not use dplyr or other packages?
At first glance it looks like we just rewrote the summarize()
functionality of dplyr. Similar functionality is also covered by
the aggregate() function from the base stats package. For small
datasets and few (easy to calculate) features, using fxtract may
indeed be a little overkill (and slower too).
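For comparison, the small-data case can be sketched with aggregate() alone. This example uses the built-in iris data and only base R; the variable name agg is made up for this illustration.

```r
# Per-species mean and standard deviation of Sepal.Length with base R.
# The result has one row per species; the Sepal.Length column is a
# matrix with columns "mean" and "sd".
agg <- aggregate(Sepal.Length ~ Species, data = iris,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))
agg
```

For a single, simple feature like this one, the base-R call is hard to beat for brevity.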
However, this package was especially designed for projects with large
datasets, many IDs, and many different feature functions. fxtract
streamlines the process of loading datasets and adding feature
functions. Once your dataset (with all IDs) becomes too big for memory,
or if some feature functions fail on some IDs, using our package can
save you many lines of code.
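As a rough illustration of the extra code that failing feature functions require when the wrangling is done by hand, the following base-R sketch wraps each per-ID call in tryCatch(). It does not show fxtract's own error handling. The function fragile_feature and its simulated failure are hypothetical.

```r
# Hypothetical feature function that fails for one ID (simulated error).
fragile_feature <- function(data) {
  if (data$Species[1] == "versicolor") stop("simulated failure")
  c(max_petal_width = max(data$Petal.Width))
}

# Manual handling: catch the error per ID and record NA instead,
# so that one failing ID does not abort the whole calculation.
safe_results <- lapply(split(iris, iris$Species), function(d) {
  tryCatch(fragile_feature(d),
           error = function(e) c(max_petal_width = NA_real_))
})
res <- do.call(rbind, safe_results)
res
```

This bookkeeping has to be repeated for every feature function and grows further once the data no longer fits into memory at once.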
library(fxtract)
# user-defined function:
fun = function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length))
}
# R6 object:
xtractor = Xtractor$new("xtractor")
xtractor$add_data(iris, group_by = "Species")
xtractor$add_feature(fun)
xtractor$calc_features()
xtractor$results
##       Species mean_sepal_length sd_sepal_length
## 1:     setosa             5.006       0.3524897
## 2: versicolor             5.936       0.5161711
## 3:  virginica             6.588       0.6358796
Xtractor
:future
-package.