library(knitr)
knitr::opts_knit$set(style = "max-width: 1000px", width = 1000)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(width = 20)
SUMSarizer is an R library that can be used to help analyze Stove Use Monitor (SUMS) data by detecting cooking events.
Install the sumsarizer package from GitHub.
library(devtools)
install_github("geocene/sumsarizer")
Load all the necessary libraries and set up some temporary directories to hold data.
library(sumsarizer)
library(tools)
library(data.table)

# tmp_path <- tempdir()
tmp_path <- "~/tmp"
example_data_path <- file.path(tmp_path, "example_data")
We have provided some example SUMS files for download. Download them from an AWS S3 bucket using the download_example_data() function. These files are a subset of the iButton DS1922E files collected for the study discussed in "Measuring and Increasing Adoption Rates of Cookstoves in a Humanitarian Crisis" by Wilson et al., 2016.
download_example_data(example_data_path)
Choose an example file for further exposition. The import_sums() function can import data from iButton, Wellzion, Lascar, and kSUMS data loggers.
example_sums_file <- file.path(example_data_path, "raw_sums_files", "alfashir1_B12.csv")
one_sums <- import_sums(example_sums_file)
Detector functions apply a true or false label (represented by an integer 1 or 0) to timestamp and value pairs. Later, these labels can be aggregated into runs that define events.
The threshold detector labels points based solely on a threshold and a direction: it detects events by comparing a threshold to the value of the data using >, <, >=, or <=. The default values for threshold_detector() are threshold=75 and direction=">".
one_sums_thresholded <- apply_detector(one_sums, threshold_detector, threshold=75, direction=">")
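To sanity-check the result, you can tabulate the labels. This is a minimal sketch; the column name label is an assumption about how apply_detector() stores its output, so check names(one_sums_thresholded) for the actual name.

# Count labeled points; `label` is an assumed column name -- verify with
# names(one_sums_thresholded) before relying on it
table(one_sums_thresholded$label)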
The threshold function is not a very good algorithm for detecting cooking events, but it can be a very good algorithm for detecting broken sensors. For example, many thermocouple data loggers will report large negative numbers when the thermocouple is missing or damaged; detecting values below -200C can be a good way to identify these kinds of damaged SUMS.
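For example, a sketch that flags likely sensor damage using only the documented threshold and direction arguments:

# Label points where the logger reported an implausibly low temperature,
# which often indicates a missing or damaged thermocouple
one_sums_damaged <- apply_detector(one_sums, threshold_detector, threshold=-200, direction="<")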
FireFinder is Geocene's simplified deterministic algorithm for detecting cooking events. FireFinder considers many features of the data, including absolute temperature, slope, running quantiles, and gaps in data, when labeling points true or false. Although FireFinder has many steps, we limit the arguments of the firefinder_detector() to just primary_threshold, min_event_sec, and min_break_sec.
Roughly speaking, the primary_threshold can be thought of as the value above which cooking is likely to be happening, and below which cooking is unlikely to be happening. However, FireFinder may sometimes determine that points above primary_threshold are not cooking and points below primary_threshold are indeed cooking. The default value of primary_threshold is 75C.
To remove short events and short gaps between events, you can use the min_event_sec and min_break_sec arguments. The min_event_sec is the minimum number of seconds for an event to be considered an event (and not just an erroneous blip). The min_break_sec is the minimum break between two events for those events to be considered separate events; if the break between two events is shorter than min_break_sec, the two events will be merged into one event. The default value for min_event_sec is 300 seconds (5 minutes), and the default value for min_break_sec is 1800 seconds (30 minutes).
one_sums_firefinder <- apply_detector(one_sums, firefinder_detector)
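If the defaults do not fit your cooking patterns, you can pass these arguments explicitly. For example, a sketch that ignores events shorter than ten minutes and merges events separated by less than an hour:

# Require events of at least 600 s; merge events separated by < 3600 s
one_sums_firefinder_long <- apply_detector(one_sums, firefinder_detector, min_event_sec=600, min_break_sec=3600)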
After you have detected events, you will want to review the results of your detectors. The simplest way to view the results for a detector is to list its events using the list_events() function.
events <- list_events(one_sums_firefinder)
We can see the first few events here:
kable(head(events))
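The exact columns returned by list_events() can be inspected directly before you build anything on top of them:

# How many events were detected, and what fields describe each one?
nrow(events)
names(events)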
We have included some helpful plotting functions to help you visualize your SUMS data. The plot_sums() function takes the processed data from a single SUMS file and plots those data week-by-week with events highlighted in red. This function should help you evaluate the performance of the detection.
plot_sums(one_sums_firefinder)
In the case above, the default values for firefinder_detector() did not do a very good job calculating events for this file. Many cooking events were missed. We can adjust the arguments for firefinder_detector() to try to get better results. Specifically, we can adjust the primary_threshold:
one_sums_firefinder_refined <- apply_detector(one_sums, firefinder_detector, primary_threshold=45)
Lowering the threshold from the default of 75C to 45C will substantially increase the sensitivity of FireFinder.
plot_sums(one_sums_firefinder_refined)
If the simple FireFinder model will not work for your dataset, it is possible to train a custom machine learning model using the sumsarizer package. To create a custom-trained model, you will need to create a labeled training set. TRAINSET is an online app that makes it easy to label time series data and create a labeled training set. Documentation for how to use TRAINSET is on the TRAINSET website.
To get your SUMS data to TRAINSET for labeling, you will need to export your data in the TRAINSET format. The SUMSarizer package has a function, raw_sums_to_trainset(), to export data to TRAINSET. This function takes a directory full of SUMS files and turns them into another directory full of TRAINSET-compatible files.
Note: you do not need to label all of your data to create a training set (if you did, you would already have your results!). We recommend labeling about 5% of your data or 25 files, whichever is larger. Make sure you pick a wide variety of very different files to label; if you only label the easy files, your learner will not perform well. The machine learner needs to be trained how to label both easy and difficult files!
raw_sums_path <- file.path(example_data_path, "raw_sums_files")
trainset_path <- file.path(example_data_path, "trainset_files")
raw_sums_to_trainset(raw_sums_path, trainset_path)
Once you have labeled a subset of your data in TRAINSET, you will need to import the labeled data back into R to train your model:
labeled_path <- file.path(example_data_path, "labeled_files")
labeled_data <- import_folder(labeled_path)
SUMSarizer uses Super Learner 3 (sl3) to create custom ensemble models. See the sl3 introductory materials for more information. By default, we use a single XGBoost model. If you would like to use a more complex and powerful ensemble model, please see the sl3 documentation and pass the model object as the sl3_learner argument to the learn_labels() function.
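As a rough sketch, a custom ensemble might be built from sl3 learner objects like the following. The Lrnr_glm, Lrnr_ranger, Stack, and Lrnr_sl classes come from sl3, but confirm the construction against the current sl3 documentation before using it.

library(sl3)

# A hypothetical custom ensemble: a super learner stacking a GLM and a
# random forest (all classes below are from the sl3 package)
custom_learner <- Lrnr_sl$new(learners = Stack$new(Lrnr_glm$new(), Lrnr_ranger$new()))

# Pass the custom learner via the documented sl3_learner argument
model_obj_custom <- learn_labels(labeled_data, sl3_learner = custom_learner)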
To train a model, just pass the labeled dataset to the learn_labels() function. This will return a trained model object which you can use to analyze your data.
model_obj <- learn_labels(labeled_data)
If you would like to use your trained model in the future, you can save it. Note, however, that the model is only somewhat portable; a single change to your R configuration can cause your model to break. You can always train a new (identical) model using learn_labels().
model_file <- file.path(tmp_path, "sumsarizer_model_fit.rdata")
save(model_obj, file=model_file)
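In a later session, restore the trained model before applying it; base R's load() recreates the saved model_obj object in your workspace:

# Restore the model_obj object saved above
load(model_file)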
In the sumsarizer package, trained machine learning models are used to detect events in SUMS data in the exact same way that the more explicit threshold_detector, constant_detector, and firefinder_detector work. The machine learning detector function is called sl3_model_detector(), and it takes a model object, model_obj, as an argument. If you spend the time to do the training, these custom-trained models can have great results:
one_sums_ml <- apply_detector(one_sums, sl3_model_detector, model_obj)
plot_sums(one_sums_ml)
You can tune the "sensitivity" of your custom machine learning algorithm by adjusting the threshold at which a point in time is labeled true or false. This is because the machine learning algorithm does not, itself, return a vector of booleans. Instead, it returns a vector of probabilities between 0 and 1. By default, sl3_model_detector thresholds probabilities at 0.5. In other words, probabilities above 0.5 are considered to be label=TRUE and those below 0.5 label=FALSE. However, if you want to make your model more or less sensitive, you can adjust the threshold at which an event will be considered to be happening; lower thresholds result in higher sensitivity.
one_sums_ml_sensitive <- apply_detector(one_sums, sl3_model_detector, model_obj, threshold=0.05)
plot_sums(one_sums_ml_sensitive)
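Conversely, raising the threshold makes the detector more conservative, labeling only high-confidence points as events:

# Require a 0.9 probability before a point is labeled as part of an event
one_sums_ml_conservative <- apply_detector(one_sums, sl3_model_detector, model_obj, threshold=0.9)
plot_sums(one_sums_ml_conservative)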
Up to this point, most of these examples have focused on a single SUMS file. However, you will probably need to process many tens or hundreds of files. To do this, just give the sumsarizer package a path (raw_sums_path) to the directory of SUMS files. Then use import_folder() to import the whole directory. You can use apply_detector() on the group of SUMS files imported by import_folder(). Then, create event-wise summaries with list_events(), and finally create file-wise summaries using event_summaries().
TODO: we should probably rename list_events and event_summaries to something more intuitive.
raw_sums_path <- file.path(example_data_path, "raw_sums_files")
all_sums <- import_folder(raw_sums_path)
all_sums <- apply_detector(all_sums, sl3_model_detector, model_obj)
all_events <- list_events(all_sums)
summaries <- event_summaries(all_events)
kable(head(summaries))
It is almost always necessary to compare analytics results to different metadata variables; for example, to compare average cooking times across stove types, fuel types, villages, etc. To join your metadata with your analytics results, make a metadata file with a format similar to metadata.csv. The filename column should match the filenames you imported using SUMSarizer, but the other metadata variables, like stove_type, can be defined by you. Joining metadata with time series or events is easy:
TODO: make some nice charts and tables showing off the power of metadata variables
metadata <- read.csv(file.path(example_data_path, "metadata.csv"))
all_sums_with_metadata <- merge(all_sums, metadata, by="filename")
all_events_with_metadata <- merge(all_events, metadata, by="filename")
summaries_with_metadata <- merge(summaries, metadata, by="filename")
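Once the metadata are joined, standard R tools can summarize results by group. Here is a sketch that assumes summaries contains a per-file column named total_cooking_mins; that name is hypothetical, so check names(summaries) for the real columns:

# Average a hypothetical per-file summary column by stove type; replace
# total_cooking_mins with an actual column from names(summaries)
aggregate(total_cooking_mins ~ stove_type, data = summaries_with_metadata, FUN = mean)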