knitr::opts_chunk$set( warning=FALSE, collapse = TRUE, comment = "#>" )
Wanja Rast, Version x.x, 15 October 2020
The purpose of this guide is to show an exemplary workflow from raw data to a data set that is ready to be used by any machine learning framework. The functions and their applications are not explained in full detail in this manual; all functions, however, have detailed descriptions in their help pages (accessible via ?).
I wrote this guide with acceleration data produced by eobs loggers (e-obs GmbH) in mind. The functions might still work with data from other manufacturers, but this has not been tested.
Acceleration data has been successfully used in a number of species to predict their behaviour (Nathan et al. 2012, Bidder et al. 2014, Fehlmann et al. 2017). These studies usually combine the measurement of acceleration data by animal-borne loggers with simultaneous behaviour observations. Machine learning models are then trained on these data to predict the behaviour of animals that were not observed (Rast et al. 2019). This package is a collection of functions that, on the one hand, provide convenient handling of acceleration data and, on the other hand, offer a standard framework for approaching this type of study.
The package can be installed from its GitHub repository. Set dependencies to TRUE to install those as well, since accelerateR uses a number of other packages.
require(devtools)
devtools::install_github("wanjarast/accelerateR" , dependencies = TRUE)
You can load the package like any other R package.
library(accelerateR)
To access this vignette you can use:
vignette("accelerateR")
This first step assumes that you have a local copy of the raw acceleration data. All eobs files should be of the "txt" file type. The rounding argument is used to set the seconds in the timestamp to a specific value. In some cases the acceleration measurement does not start at the programmed time but a few seconds later. By setting rounding to 10, all seconds will be rounded down to the closest full 10-second figure, which in most cases should fix all delayed timestamps. You could also round down to the full minute by setting this to 60. Setting it to 1 will result in no rounding. In case you have eobs data you can set the eobs argument to TRUE. This assumes that the file has the default eobs formatting. Here it is important that the date format is day-month-year (eobs default) and that the time has the same value for all measurements of the same burst (select the option without milliseconds when processing the logger.bin file). In case you do not have an eobs file, set the eobs argument to FALSE; in this case you have to do all subsequent formatting by hand. You can import the txt file into your R session with "tag_import". When you are using an R project and have the txt files in the project folder there is no need to set a path; otherwise set the path to the folder where your files are stored.
data <- tag_import(tag_id = "test" , rounding = 10 , eobs = TRUE , data_path = "C:/Users/rast/Desktop")
In case you downloaded the acceleration data from Movebank, you can convert it into the same structure as the output of "tag_import" with the "move_trans" function. If the file is in the R project folder you only need to specify the file name; in any other case you need to specify the full path. You also need to set the number of measurements taken in each burst with the burst argument.
data2 <- move_trans(file = "C:/Users/rast/Desktop/test_movebank.csv" , burst = 264)
In the following I will continue with the data object from "tag_import". The first step after data import should be a check for data integrity. In some cases single measurements are missing from some bursts. You can check if this is the case with "count_test". You need to specify how many measurements you expect for each burst. In case measurements are missing, the output of this function is a data.frame that lists the timestamps of all bursts with missing measurements.
count_test(data = data , burstcount = 110)
You can either remove those bursts using the output of "count_test" or set the remove argument of count_test to TRUE. In the latter case the output will be the complete data set without the bursts with missing measurements. There will be an output in the console indicating how many bursts were removed. If you choose this option you can no longer check which timestamps were affected, so in case there are a lot of missing measurements it would be beneficial to check by hand which bursts are affected.
cleaned_data <- count_test(data = data , burstcount = 110 , remove = TRUE)
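If you prefer the manual route, a minimal sketch could look like this. It assumes that the data.frame returned by "count_test" stores the affected timestamps in a column named timestamp; check the actual output (or ?count_test) before relying on that name.

missing_bursts <- count_test(data = data , burstcount = 110)
# NOTE: "timestamp" as the column name of the output is an assumption.
cleaned_data <- data[!data$timestamp %in% missing_bursts$timestamp , ]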
There are many ways the behaviour observation data can be organised. Some might have a file for each day and others might have only one combined file with all observations for a single individual. One issue that has to be considered is that eobs loggers record time in the GPS "timezone", which is not in sync with the regular radio clock that you would use during the behaviour observation. To synchronise acceleration and behaviour data an additional step is necessary. For this reason I recommend saving daily behaviour observations in one file per day.
A strategy for synchronising acceleration and behaviour data that has worked quite well so far is to calculate the standard deviation for each burst and then look for changes in the observed behaviour. The best indicators are changes from an active behaviour to a passive behaviour or vice versa. Such a change should also be visible in the standard deviation, which is usually small for passive behaviours and high for active behaviours. The timestamps of the behaviour observations should then be adjusted to match the changes in standard deviation. To calculate the standard deviation the function "sum_data" can be used. This function will be explained in more detail later. Here we just take the data set we cleaned before, indicate the column names of the timestamp and the x, y and z axes, and choose only "sd" as the statistic to calculate.
sd_sync <- sum_data(data = cleaned_data , time = "timestamp" , x = "x" , y = "y" , z = "z" , stats = "sd")
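To spot the activity changes it can help to plot the per-burst standard deviation over time. The sketch below assumes that the output of "sum_data" contains a timestamp column and one standard deviation column per axis; the name sdx used here is hypothetical, so check names(sd_sync) for the actual column names.

# Plot the per-burst standard deviation of the x axis over time to spot
# transitions between active and passive behaviour.
# NOTE: "sdx" is a hypothetical column name; check names(sd_sync).
plot(sd_sync$timestamp , sd_sync$sdx , type = "l" ,
     xlab = "time" , ylab = "standard deviation (x axis)")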
The "sd_sync" file can be exported or just from within R. After the behaviour observation files were synchronised with the acceleration data they can be imported into R. there are two options to to this: You can simply read in all behaviour observation files from one individual and combine them into one data.frame using "bev_import". With the "id" agrument you set the individual whose data you want to import. At this stage you are required to name all behaviour files of this individual as follows:
In case the files are not in the R project folder you need to specify the path to those files.
behaviour <- bev_import(id = "animal" , date = "date" , time = "time" , behaviour = "Bev" , rounding = 10 , time_format = dmy_hms , data_path = "C:/Users/rast/Desktop")
In case you do not need the behaviour labels in a separate object you can directly add them to the acceleration data. Use the same import function, set the match argument to TRUE and provide the name of the acceleration data object with acc_data. For this to work properly both acceleration and behaviour data have to have the same timestamps, which should be handled by the function itself.
Importing the behaviour data this way will remove the date and time columns from the acceleration data object. It will also remove all acceleration measurements that do not have a corresponding behaviour label.
complete_raw_data <- bev_import(id = "animal" , date = "date" , time = "time" , behaviour = "Bev" , rounding = 10 , time_format = dmy_hms , data_path = "C:/Users/rast/Desktop" , match = T , acc_data = cleaned_data)
I found that the latter option is the best way to store these data. As they are labelled raw data, you can later come back and do a different type of analysis than you did originally, which would be hard to impossible if you did not have the raw data with labels. So save these as either a csv or rds file.
write.csv(complete_raw_data , "complete_raw_data.csv" , row.names = FALSE)
saveRDS(complete_raw_data , "complete_raw_data.rds")
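In a later session you can restore the labelled raw data directly from the rds file:

complete_raw_data <- readRDS("complete_raw_data.rds")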
Most commonly, behaviour prediction is not performed on raw data. A series of summary statistics is calculated from each burst, which are then used as predictors for the machine learning algorithm (Kröschel et al. 2017, Tatler et al. 2018, Rast et al. 2019). You can calculate the summary statistics of your choice with the "sum_data" function. As before, when you only calculated the standard deviation, you need to provide some information about your raw acceleration data set. With the "stats" argument you select the summary statistics to calculate. You can find a full list of all available summary statistics in the help file (?sum_data).
Some problems can occur at this stage: The coefficient of variation, inverse coefficient of variation, skewness and kurtosis (not shown here) include a division by either the mean or the standard deviation. In theory the standard deviation can be zero when the animal does not move at all. This could be a rare occurrence or typical for the species. In any case, dividing by zero is not possible and will lead to errors in the calculation of the coefficient of variation, skewness and kurtosis. If these are single occurrences, the affected timestamps could be removed to keep those summary statistics. If, however, this is typical for the study species at hand, you could use the coefficient of variation instead of the inverse version (this would only fail if the mean becomes zero). In this case you cannot use skewness and kurtosis.
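A minimal sketch for finding and removing such zero-variance bursts before calculating the summary statistics, using only base R and the column names from the examples above:

# Standard deviation per burst and axis; column names follow the examples above.
burst_sd <- aggregate(cbind(x , y , z) ~ timestamp , data = complete_raw_data , FUN = sd)
# Timestamps of bursts where at least one axis shows no variation at all.
zero_sd <- burst_sd$timestamp[burst_sd$x == 0 | burst_sd$y == 0 | burst_sd$z == 0]
# Remove those bursts before calculating CV, skewness or kurtosis.
complete_raw_data <- complete_raw_data[!complete_raw_data$timestamp %in% zero_sd , ]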
In case your raw acceleration data set has behaviour labels, you can add them to the summary statistics data set by providing the column name of the behaviour in the original data set.
acc_predictors <- sum_data(data = complete_raw_data , time = "timestamp" , x = "x" , y = "y" , z = "z" , stats = c("mean","sd","ICV","Skewness","Pitch","ODBA") , behaviour = "Bev")
To finally train the machine learning model and validate its performance you need to separate the full data set into subsets. Chollet and Allaire (2018) propose creating three exclusive subsets of the full data set. Exclusive here means that no data can be part of more than one subset.
You can use the function "machine_split" to create all three data sets. The function needs to know which rows of data belong together so they will not be separated. In the context of acceleration data and behaviour recognition this would be the timestamp. This is not an issue when used on the "acc_predictors" data set, since here all data that belong to the same timestamp are in one row. The function, however, could also be used on the raw acceleration data; there it would be important not to split up data from the same timestamp.
The size of all subsets can be freely chosen. However, the subset on which the model is trained should be the largest one. You can set the size of the training data set with the "train_size" argument. This sets the proportion of rows taken from the full data set, so this value must be between 0 and 1. The argument "val_size" determines the size of the validation data set. The third subset will be the test data set; its size will be 1 - train_size - val_size. In case you only want to work with the training and validation sets you need to set those sizes so they add up to 1.
The function will also respect the initial proportions of all behaviour classes and keep these proportions in all subsets. This matters when the number of observations differs between behaviour classes.
You run this function without saving the output into an object. The function will automatically create the three subsets in the global environment.
machine_split(data = acc_predictors , group = "timestamp" , behaviour = "Bev" , train_size = 0.6 , val_size = 0.20)
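With these settings, 60 % of the rows go into the training set, 20 % into the validation set and the remaining 20 % into the test set. In the examples below the training and validation subsets are referred to as train_data and val_data; check your global environment for the exact object names machine_split created.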
There are a number of different machine learning algorithms that have already been successfully used for behaviour recognition (Tatler et al. 2018). The syntax for all of them is quite similar, so I will provide an example here only for the support vector machine (SVM) from the package "e1071".
In a balanced data set all behaviour classes were observed the same number of times. Some algorithms offer the option to also use imbalanced data sets, which entails that you need to provide the algorithm with the proportion of each behaviour class. This will force the algorithm to treat errors in behaviour classes with fewer observations more seriously.
To use an unbalanced data set you need to create an object with the class weights first. For the SVM the class with the highest count has to be set to one. All behaviour classes with fewer observations will have a weight greater than one. You can calculate the weight vector like this:
behaviour_weights <- c(max(table(train_data$Bev)) / table(train_data$Bev))
This step is optional and not necessary if you use a balanced data set.
You can then use the data set to train any machine learning algorithm of your choice. Here is the example for the SVM. Most machine learning algorithms have a similar syntax at this point. You can either train them using the formula style (Behaviour ~ predictor1 + predictor2 + ...) or provide two separate objects like I do here. x should be the data frame holding all predictor values. It is important to exclude all columns that do not contain predictors; in this case I excluded the first and second column as they contain the behaviour and the timestamp. y should be a vector containing the behaviour labels in the same order as the predictor data set. In this case this is not an issue since behaviour and predictors are stored in the same data frame, but you could also use behaviour labels from a separate R object. Type and kernel are SVM-specific parameters. I got the best results using these settings, but others are available; refer to the ?svm help file to check for more options. You can provide the svm with the weights of the behaviour classes that we calculated before via the class.weights argument. Note that this is optional but might increase the performance of the model.
library(e1071)
model_svm <- svm(x = train_data[ , -c(1,2)] , y = train_data$Bev , type = "C-classification" , kernel = "radial" , class.weights = behaviour_weights)
To validate the performance of the model you need to predict the behaviour of data the model was not trained on. For this purpose you created the validation data set. This step is identical for most machine learning algorithms. You need to specify the model you trained in the step before with the "object" argument. The "newdata" argument is supposed to be the validation data set. If you rescaled the training data, you should apply the same rescaling to the validation data, since the model was trained on rescaled data. You also need to exclude the columns for the behaviour and the timestamp. As an output you will get a vector of the predicted behaviour classes.
You can then use the "pre_metrics" function to compare the predicted behaviour classes to the known behaviour classes from the observation. The output of the "pre_metrics" function is a list with two elements. The first element is a confusion matrix, which can be used to check for confusion between different behaviour classes. The second element contains the metrics. Of all the metrics that could be calculated I find recall (the share of a class's true observations that were correctly predicted) and precision (the share of predictions of a class that were correct) the most useful. You can also see how well the model deals with specific behaviour classes in case you are particularly interested in certain behaviours.
val_predict <- predict(object = model_svm , newdata = val_data[ , -c(1,2)])
pre_metrics(predicted = val_predict , expected = val_data$Bev)
From the confusion matrix in this example you can see that some behaviours are rarely confused with others, while for some there is more confusion. Eating and sniffing are confused quite often. In this case you might consider combining them into one behaviour class called foraging, as both might be food related. This can be an option when the behaviour classes are ecologically close to each other, which might not always be the case. Alternatively you could consider dropping one of the classes, especially if it is a rare behaviour. A third option would be to tune your model: you could change the algorithm-specific parameters, include new predictors or exclude some predictors, and check with the validation data set again whether the performance improves. The latter option would be the preferred one, but sometimes behaviours are just too similar in terms of their acceleration patterns for machine learning to distinguish them.
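A minimal sketch of the first option, merging the two confused classes in the labelled raw data and rebuilding the predictors afterwards. The label strings "eating" and "sniffing" are assumptions, so check the actual levels in your data first.

# NOTE: the label strings are hypothetical; check unique(complete_raw_data$Bev).
complete_raw_data$Bev <- as.character(complete_raw_data$Bev)  # avoid factor-level issues
complete_raw_data$Bev[complete_raw_data$Bev %in% c("eating" , "sniffing")] <- "foraging"
# Recalculate the summary statistics, then redo the split and the training.
acc_predictors <- sum_data(data = complete_raw_data , time = "timestamp" , x = "x" , y = "y" , z = "z" , stats = c("mean","sd","ICV","Skewness","Pitch","ODBA") , behaviour = "Bev")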