bee.count

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(bee.count)
#library(xts)
library(printr)
data <- bee.data
original <- bee.data[!is.na(TOTAL_COUNT)]
par(mar = c(0,0,0,0))
plot(ts(original$TOTAL_COUNT[1:140]), main = NULL,
     ylab = NULL,
     xlab = NULL)

Data Set

Our dataset was produced by an AI using digital particle image velocimetry or DPIV [Willert and Gharib, 1991] to count the level of bee activity outside a hive marked R_4_5. These csv files are part of the output of Dr. Sarbajit Mukherjee during his dissertation at Utah State University. We have received permission of Dr. Kulyukin to use these files for time series analysis. The counts correspond to a 28 second period recorded approximately every 15 minutes throughout the day. The monitors shut-down at night and went off line several times throughout the season leaving significant coverage gaps.

bee.data

This library contains an R data file named bee.data. It is a collection of bee motion counts calculated by DPIV and paired with the weather conditions organized by timestamp.

Long Description of bee.data

help(bee.data)


The R_4_5 monitor worked reasonable well during the 2017 season. However, the data collected from is was heavily right skewed and missing multiple values.

par(mfrow = c(1, 2))
hist(data[!is.na(TOTAL_COUNT), TOTAL_COUNT], breaks = seq(0, 13500, 500),
     main = 'Distribution of Bee Counts', xlab = 'Bee Count')
plot(ts(data$TOTAL_COUNT[1:200]), main = "Data Accounting for Missing Times",
     ylab = 'Count',
     xlab = 'Time (15 Min. Intervals)')

Time series

Time series data sets are most often indexed by time and store a single value for each index. A distinctive feature of time series is that the mean and variance may often be misleading as there is a significant auto correlation between a point and the preceding or following points. This makes linear models unsatisfactory and instead the data scientist must use the more poorly understood techniques of time series analysis.


Purpose

The purpose of the collected data set and associated tools is to provide a frame-work for the data scientist to better understand the processing involved in preparing a data set for time series analysis and interpreting the results.


Functions

base_impute

The base_impute function replaces missing values to a value indicated to by the programmer.

Long Description of base_impute

help(base_impute)

sparse_ar

The sparse_ar function computes an autoregressive integrated moving average without the user selecting all the needed parameters of the stats package arima function.

Long Description of spare_ar

help(sparse_ar)

sparse_arma

The sparse_arma function computes an autoregressive moving average while limiting the variables needed from the user.

Long Description of sparse_arma

help(sparse_arma)

sparse_ma

The sparse_ma function computes an moving average for the user.

Long Description of sparse_ma

help(sparse_ma)


Tutorial

After the bee data is read in, the data must be inputed to be usable. After consideration, The hours between midnight and 6:00 am were decided to be imputed as a zero activity level, and other missing values to be the median for that day of the month.

data <- base_impute(data, impute_col = 'TOTAL_COUNT',
                    group_by = c('DAY', 'MONTH'))

missing_idxs <- which(is.na(data$TOTAL_COUNT[1:200]))

data[HOUR < 7 & is.na(TOTAL_COUNT), IMPUTED_VALS := 0]
plot(ts(data$IMPUTED_VALS[1:200]), ylab = 'Total Bee Count',
     main = "First 200 Observations with Imputed Values (Red)",
     xlab = 'Time (15 Min. Intervals)')
lines(missing_idxs, data$IMPUTED_VALS[missing_idxs], col = 'red',
      type = 'p', pch = 20)

Next we take the data and separate it into training and testing groups. In this case, we chose the first three days of October to be our test group.

# The training set is June - September
train <- data[1:9658]
train_count <- ts(train$IMPUTED_VALS)
train_count_96 <- diff(train_count, lag = 96)

# The test set is the first three days of October
test  <- data[9659:9894]
test_count <- ts(test$IMPUTED_VALS)
test_count_96 <- diff(test_count, lag = 96)

First we try to make predictions on the moving average approach. From the figure, we can see that there is more to this series fits a trend line.

# p-val: 1
sps_ma   <- sparse_ma(train_count_96)

# p-val: 0.9998
sps_ar   <- sparse_ar(train_count_96)

# p-val: 0.6397
sps_arma <- sparse_arma(train_count_96)

# Fit the MA model and make future predictions
sps_ma_fit  <- ts(train_count_96 - sps_ma$residuals)
sps_ma_pred <- predict(sps_ma, n.ahead = 30, se.fit = TRUE)
sps_ma_lb   <- ts(sps_ma_pred$pred - 1.96*sps_ma_pred$se, start = 9693)
sps_ma_ub   <- ts(sps_ma_pred$pred + 1.96*sps_ma_pred$se, start = 9693)

plot_vals <- ts(rep(0, 191), start = 9532)
plot_vals[1:161] <- sps_ma_fit[9402:9562] + mean(train_count_96)
plot_vals[162:191] <- sps_ma_pred$pred + mean(train_count_96)

ts.plot(ts(train_count_96[9402:9562], start = 9532), plot_vals,
        sps_ma_lb, sps_ma_ub,
        gpars = list(col = c('black', 'red', 'blue', 'blue'),
                     main = 'Sarse MA(25) Predicted Values (Red) with 95% CI'))

We can also use the ar model to predict the future values.

# Fit the AR model and make future
sps_ar_fit  <- ts(train_count_96 - sps_ar$residuals)
sps_ar_pred <- predict(sps_ar, n.ahead = 30, se.fit = TRUE)
sps_ar_lb   <- ts(sps_ar_pred$pred - 1.96*sps_ar_pred$se, start = 9693)
sps_ar_ub   <- ts(sps_ar_pred$pred + 1.96*sps_ar_pred$se, start = 9693)

plot_vals_ar <- ts(rep(0, 191), start = 9532)
plot_vals_ar[1:161] <- sps_ar_fit[9402:9562] + mean(train_count_96)
plot_vals_ar[162:191] <- sps_ar_pred$pred + mean(train_count_96)

ts.plot(ts(train_count_96[9402:9562], start = 9532), plot_vals,
        sps_ar_lb, sps_ar_ub,
        gpars = list(col = c('black', 'red', 'blue', 'blue'),
                     main = 'Sarse AR(15) Predicted Values (Red) with 95% CI'))

Lastly, the moving average with lag 25 can be used to make some close predictions of our final day of testing.

ma_orig <- diffinv(plot_vals, lag = 96, differences = 1,
                   xi = train_count[9402:9497])

ts.plot(ts(train_count[9402:9562], start = 9532), ma_orig,
        gpars = list(col = c('black', 'red'),
                     main = "MA(25) Model on Original Count Scale (Tail End)",
                     xlab = 'Time (15 Min. Intervals)',
                     xlim = c(9630, 9680)))


chaosreader/stat_6550_group_project documentation built on Dec. 19, 2021, 3:01 p.m.