knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(bee.count) #library(xts) library(printr) data <- bee.data original <- bee.data[!is.na(TOTAL_COUNT)]
par(mar = c(0,0,0,0)) plot(ts(original$TOTAL_COUNT[1:140]), main = NULL, ylab = NULL, xlab = NULL)
Our dataset was produced by an AI using digital particle image velocimetry or DPIV [Willert and Gharib, 1991] to count the level of bee activity outside a hive marked R_4_5. These csv files are part of the output of Dr. Sarbajit Mukherjee during his dissertation at Utah State University. We have received permission of Dr. Kulyukin to use these files for time series analysis. The counts correspond to a 28 second period recorded approximately every 15 minutes throughout the day. The monitors shut-down at night and went off line several times throughout the season leaving significant coverage gaps.
This library contains an R data file named bee.data. It is a collection of bee motion counts calculated by DPIV and paired with the weather conditions organized by timestamp.
Long Description of bee.data
help(bee.data)
The R_4_5 monitor worked reasonable well during the 2017 season. However, the data collected from is was heavily right skewed and missing multiple values.
par(mfrow = c(1, 2)) hist(data[!is.na(TOTAL_COUNT), TOTAL_COUNT], breaks = seq(0, 13500, 500), main = 'Distribution of Bee Counts', xlab = 'Bee Count') plot(ts(data$TOTAL_COUNT[1:200]), main = "Data Accounting for Missing Times", ylab = 'Count', xlab = 'Time (15 Min. Intervals)')
Time series data sets are most often indexed by time and store a single value for each index. A distinctive feature of time series is that the mean and variance may often be misleading as there is a significant auto correlation between a point and the preceding or following points. This makes linear models unsatisfactory and instead the data scientist must use the more poorly understood techniques of time series analysis.
The purpose of the collected data set and associated tools is to provide a frame-work for the data scientist to better understand the processing involved in preparing a data set for time series analysis and interpreting the results.
The base_impute function replaces missing values to a value indicated to by the programmer.
Long Description of base_impute
help(base_impute)
The sparse_ar function computes an autoregressive integrated moving average without the user selecting all the needed parameters of the stats package arima function.
Long Description of spare_ar
help(sparse_ar)
The sparse_arma function computes an autoregressive moving average while limiting the variables needed from the user.
Long Description of sparse_arma
help(sparse_arma)
The sparse_ma function computes an moving average for the user.
Long Description of sparse_ma
help(sparse_ma)
After the bee data is read in, the data must be inputed to be usable. After consideration, The hours between midnight and 6:00 am were decided to be imputed as a zero activity level, and other missing values to be the median for that day of the month.
data <- base_impute(data, impute_col = 'TOTAL_COUNT', group_by = c('DAY', 'MONTH')) missing_idxs <- which(is.na(data$TOTAL_COUNT[1:200])) data[HOUR < 7 & is.na(TOTAL_COUNT), IMPUTED_VALS := 0]
plot(ts(data$IMPUTED_VALS[1:200]), ylab = 'Total Bee Count', main = "First 200 Observations with Imputed Values (Red)", xlab = 'Time (15 Min. Intervals)') lines(missing_idxs, data$IMPUTED_VALS[missing_idxs], col = 'red', type = 'p', pch = 20)
Next we take the data and separate it into training and testing groups. In this case, we chose the first three days of October to be our test group.
# The training set is June - September train <- data[1:9658] train_count <- ts(train$IMPUTED_VALS) train_count_96 <- diff(train_count, lag = 96) # The test set is the first three days of October test <- data[9659:9894] test_count <- ts(test$IMPUTED_VALS) test_count_96 <- diff(test_count, lag = 96)
First we try to make predictions on the moving average approach. From the figure, we can see that there is more to this series fits a trend line.
# p-val: 1 sps_ma <- sparse_ma(train_count_96) # p-val: 0.9998 sps_ar <- sparse_ar(train_count_96) # p-val: 0.6397 sps_arma <- sparse_arma(train_count_96) # Fit the MA model and make future predictions sps_ma_fit <- ts(train_count_96 - sps_ma$residuals) sps_ma_pred <- predict(sps_ma, n.ahead = 30, se.fit = TRUE) sps_ma_lb <- ts(sps_ma_pred$pred - 1.96*sps_ma_pred$se, start = 9693) sps_ma_ub <- ts(sps_ma_pred$pred + 1.96*sps_ma_pred$se, start = 9693) plot_vals <- ts(rep(0, 191), start = 9532) plot_vals[1:161] <- sps_ma_fit[9402:9562] + mean(train_count_96) plot_vals[162:191] <- sps_ma_pred$pred + mean(train_count_96) ts.plot(ts(train_count_96[9402:9562], start = 9532), plot_vals, sps_ma_lb, sps_ma_ub, gpars = list(col = c('black', 'red', 'blue', 'blue'), main = 'Sarse MA(25) Predicted Values (Red) with 95% CI'))
We can also use the ar model to predict the future values.
# Fit the AR model and make future sps_ar_fit <- ts(train_count_96 - sps_ar$residuals) sps_ar_pred <- predict(sps_ar, n.ahead = 30, se.fit = TRUE) sps_ar_lb <- ts(sps_ar_pred$pred - 1.96*sps_ar_pred$se, start = 9693) sps_ar_ub <- ts(sps_ar_pred$pred + 1.96*sps_ar_pred$se, start = 9693) plot_vals_ar <- ts(rep(0, 191), start = 9532) plot_vals_ar[1:161] <- sps_ar_fit[9402:9562] + mean(train_count_96) plot_vals_ar[162:191] <- sps_ar_pred$pred + mean(train_count_96) ts.plot(ts(train_count_96[9402:9562], start = 9532), plot_vals, sps_ar_lb, sps_ar_ub, gpars = list(col = c('black', 'red', 'blue', 'blue'), main = 'Sarse AR(15) Predicted Values (Red) with 95% CI'))
Lastly, the moving average with lag 25 can be used to make some close predictions of our final day of testing.
ma_orig <- diffinv(plot_vals, lag = 96, differences = 1, xi = train_count[9402:9497]) ts.plot(ts(train_count[9402:9562], start = 9532), ma_orig, gpars = list(col = c('black', 'red'), main = "MA(25) Model on Original Count Scale (Tail End)", xlab = 'Time (15 Min. Intervals)', xlim = c(9630, 9680)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.