Although we can preprocess the dataset inside the feature functions (e.g. clustering GPS points), it can be useful to preprocess it once up front, especially if the preprocessing takes a long time and would otherwise be repeated for every feature function. To make sure that the preprocessing is done for each ID of the grouping variable individually, it is safer to use the method shown here.
In this example we cluster the GPS points and add a new column with the cluster assignments.
For simplicity, we only use the GPS data of this dataset:
library(fxtract)
library(dplyr)
gps_data = studentlife_small %>%
  select(userId, latitude, longitude) %>%
  filter(!is.na(latitude))
head(gps_data)
unlink("fxtract_files", recursive = TRUE)
library(fxtract)
xtractor = Xtractor$new("xtractor")
xtractor$add_data(gps_data, group_by = "userId")
We need to define a function which takes a dataframe as input and returns the preprocessed dataframe as output. The method $preprocess_data will then read the RDS files for each ID of the grouping variable, apply the function to each dataframe individually and save the results as RDS files again. Parallelization is available via the future package.
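library(fpc)
fun = function(data) {
  lat = data$latitude
  lon = data$longitude
  # density-based clustering of the GPS points of a single ID
  clust = dbscan(cbind(lat, lon), eps = 1.5, MinPts = 3)
  # one cluster label per row (fpc codes noise points as 0)
  data$cluster = clust$cluster
  return(data)
}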
xtractor$preprocess_data(fun = fun)
The data has successfully been preprocessed:
head(xtractor$get_data())
unlink("fxtract_files", recursive = TRUE)
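To illustrate what "for each ID individually" means, here is a minimal base R sketch of the same split, apply and combine idea. It is not how fxtract implements $preprocess_data (which works on RDS files on disk, as described above); the data frame toy and the function add_centered are hypothetical and use made-up values.

```r
# Toy data: two hypothetical users with a few GPS fixes each (made-up values).
toy = data.frame(
  userId = c(1, 1, 1, 2, 2),
  latitude = c(48.10, 48.11, 48.90, 52.50, 52.51),
  longitude = c(11.50, 11.51, 11.90, 13.40, 13.41)
)

# Hypothetical preprocessing function: dataframe in, dataframe out.
# It adds the distance of each point to the mean position of the
# dataframe it receives (plain coordinate units, for illustration only).
add_centered = function(data) {
  data$dist_to_center = sqrt(
    (data$latitude - mean(data$latitude))^2 +
      (data$longitude - mean(data$longitude))^2
  )
  data
}

# Split by the grouping variable, apply the function to each piece,
# then bind the pieces back together.
pieces = split(toy, toy$userId)
processed = do.call(rbind, lapply(pieces, add_centered))
rownames(processed) = NULL
processed
```

Because add_centered only ever sees the rows of one user, the mean position is computed per user and not over the whole dataset. This is the property that preprocessing each ID separately is meant to guarantee.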