knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  fig.width = 5,
  fig.height = 5
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(pbsEDM)

Introduction

Here we provide reproducible details of our results in the manuscript that predicted values of lagged variables can give unrealistic negative predictions of the original unlagged populations, and demonstrate a simple proposed solution. We also show how correlations based on predictions of lagged variables can differ somewhat from those based on the predictions of the actual populations (which is what we are interested in).

Again we use the simulated population time series from the manuscript, with population size denoted $N_t$ and first-differenced values denoted $Y_t$ at $t = 1, 2, 3, ..., 99, 100$. We extract them from the saved tibble NY_lags_example_3:

input <-  dplyr::select(NY_lags_example_3, t, N_t, Y_t) %>%
  dplyr::rename(Time = t)
input

Calculations using pbsEDM (as done in the aspect 2 vignette)

See the analyse_simple_time_series vignette for general details on using the code. Here do the simplex calculation for $E = 2$:

pbsEDM_res_full <- pbsEDM::pbsEDM(input,
                                  lags = list(Y_t = c(0:1)))   # Gives full calculations

pbsEDM_pred_Y <- pbsEDM_res_full$X_forecast[-length(pbsEDM_res_full$X_forecast)]
                                           # Just the predictions, excluding
                                           #  t=101, so for t = 1:100

Explaining how negative population predictions arise

Use the $Y_t$ calculations and then convert to $N_t$. Equation (1) in the manuscript defines $Y_t$ as:

$Y_t = N_{t+1} - N_t$.

So having estimated $\hat{Y}t$ we can rearrange that to estimate $\hat{N}{t+1}$ as:

$\hat{N}_{t+1} = N_t + \hat{Y}_t$

or

$\hat{N}t = N{t-1} + \hat{Y}_{t-1}$,

to give a vector of estimated $\hat{N}_t$ values:

pbsEDM_pred_N = c(NA,
                  input$N_t + pbsEDM_pred_Y)
pbsEDM_pred_N

The first three NA values demonstrate that we cannot predict $\hat{N}_1$ (because we do not know $N_0$), or $\hat{N}_2$ and $\hat{N}_3$ (because we cannot predict $\hat{Y}_1$ or $\hat{Y}_2$).

Of note are the several negative predictions of $\hat{N}_t$, which are unrealistic because our data represent populations (and negative zooplankton, for example, do not exist. The indices of these give the predicted time values (i.e. $t^* + 1$):

which(pbsEDM_pred_N < 0)

Of particular interest is the negative prediction for $\hat{N}_{101}$ of r pbsEDM_pred_N[101], since the aim of the analysis is generally to forecast next year's population from all currently available data.

Simple solution

The simplest solution this is to just replace the negative predictions $\hat{N}_t$ with the minimum observed value of $N_t$.

N_t_min <- min(input$N_t)

pbsEDM_pred_N_replace <- pbsEDM_pred_N
pbsEDM_pred_N_replace[pbsEDM_pred_N_replace < 0] = N_t_min

Graphically this is shown by plotting all the original $(N_t, \hat{N}_t)$ points as open black circles, then the set with the negative $\hat{N}_t$ values replaced as red filled circles (so that 11 points can be seen to change; the 1:1 line is added in black):

plot(c(input$N_t, NA),
     pbsEDM_pred_N,
     xlab = "Observed N_t",
     ylab = "Predicted N_t",
     main = "Red: change -ve predictions to min +ve observation")

points(c(input$N_t, NA),
       pbsEDM_pred_N_replace,
       pch = 20,
       col = "red")
abline(a = 0, b = 1)
# abline(h = mean_pred_N)
# abline(h = mean_pred_N_replace,
#        col = "red")
# abline(v = mean(input$N_t),
#        col = "grey")

# lm_fit  <- lm(pbsEDM_pred_N ~ c(input$N_t, NA))
# abline(lm_fit)
# lm_fit_replace  <- lm(pbsEDM_pred_N_replace ~ c(input$N_t, NA))
# abline(lm_fit_replace, col = "red")

A good fit for the first-differenced values ($Y_t$) does not necessarily translate to a good fit to the original population sizes ($N_t$)

The above simple fix removes unrealistic negative predictions. However, the figure raises a fundamental issue, in that the predicted values $\hat{N}_t$ do not seem to match the original data $N_t$ too well.

The correlation coefficient of $\hat{N}_t$ and $N_t$ before correcting the negative values is:

pbsEDM_rho_N <- cor(c(input$N_t, NA),
                    pbsEDM_pred_N,
                    use = "pairwise.complete.obs")
pbsEDM_rho_N

and after correcting the negative values it is

pbsEDM_rho_N_replace <- cor(c(input$N_t, NA),
                            pbsEDM_pred_N_replace,
                            use = "pairwise.complete.obs")
pbsEDM_rho_N_replace    # cor(input$N_t[4:100], pbsEDM_pred_N_replace[4:100]) gives same

The latter $\rho$ with the negatives corrected is (unintuitively) very slightly lower than the original value. But the more important point is that a $\rho$ of 0.28 is much lower than that based on $Y_t$ and $\hat{Y}_t$, which is

pbsEDM_rho_Y <- pbsEDM_res_full$results$X_rho
pbsEDM_rho_Y

We highlight it as something for practioners to be aware of.

Plotting $\hat{Y}_t$ against $Y_t$ (these are the $E=2$ points in the manuscript's Figure~1(f):

plot(input$Y_t,
     pbsEDM_pred_Y,
     xlab = "Observed Y",
     ylab = "Predicted Y")
abline(a=0, b=1)

shows a good correspondence, as given by the correlation coefficient of r pbsEDM_rho_Y.

Thus, a good correspondence based on the first-differenced $Y_t$ values does not necessarily lead to good correspondence in the $N_t$ values, which is what we are interested in for practical situations. Hence our recommendation that practitioners also check the fits of the predictions based on non-first-differenced values ($N_t$) and not just first-differenced values ($Y_t$).

Repeating analyses with $E=3$

The above was all for the simplest embedding dimension of $E=2$. We repeat the calculations here for $E=3$, which is the optimal embedding dimension (based on the maximum $\rho$ value), and so should be used for forecasting. The first-differencing calculations remain the same, the higher embedding dimension just allows more lags (and more neighbours) to be used in the making predictions.

pbsEDM_res_full_E_3 <- pbsEDM::pbsEDM(input,
                                      lags = list(Y_t = c(0:2)))   # Gives full calculations

pbsEDM_pred_Y_E_3 <- pbsEDM_res_full_E_3$X_forecast[-length(pbsEDM_res_full$X_forecast)]
                                           # Just the predictions, excluding
                                           #  t=101, so for t = 1:100
pbsEDM_pred_N_E_3 = c(NA,
                  input$N_t + pbsEDM_pred_Y_E_3)
pbsEDM_pred_N_E_3

Interestingly, there are fewer negative predictions of $\hat{N}_t$ for $E=3$ than the 11 we had for $E=2$. The indices of these give the predicted time values (i.e. $t^* + 1$):

which(pbsEDM_pred_N_E_3 < 0)

Though there is still a negative prediction for $\hat{N}_{101}$ of r pbsEDM_pred_N_E_3[101].

Now replace the negative predictions $\hat{N}_t$ with the minimum observed value of $N_t$.

N_t_min <- min(input$N_t)

pbsEDM_pred_N_E_3_replace <- pbsEDM_pred_N_E_3
pbsEDM_pred_N_E_3_replace[pbsEDM_pred_N_E_3_replace < 0] = N_t_min

And repeat the above figure:

plot(c(input$N_t, NA),
     pbsEDM_pred_N_E_3,
     xlab = "Observed N_t",
     ylab = "Predicted N_t",
     main = "Red: change -ve predictions to min +ve observation")

points(c(input$N_t, NA),
       pbsEDM_pred_N_E_3_replace,
       pch = 20,
       col = "red")
abline(a = 0, b = 1)

The above figure does look better (close to the 1:1 line) than for $E=2$. The correlation coefficient of $\hat{N}_t$ and $N_t$ before correcting the negative values is:

pbsEDM_rho_N_E_3 <- cor(c(input$N_t, NA),
                        pbsEDM_pred_N_E_3,
                        use = "pairwise.complete.obs")
pbsEDM_rho_N_E_3

and after correcting the negative values it is

pbsEDM_rho_N_E_3_replace <- cor(c(input$N_t, NA),
                                pbsEDM_pred_N_E_3_replace,
                                use = "pairwise.complete.obs")
pbsEDM_rho_N_E_3_replace

which again is slightly lower than the original value. But the $\rho$ has improved somewhat from that for $E=2$, but ist still not as high as based on $Y_t$ and $\hat{Y}_t$, which is

pbsEDM_rho_Y_E_3 <- pbsEDM_res_full_E_3$results$X_rho
pbsEDM_rho_Y_E_3

As above, plotting $\hat{Y}_t$ against $Y_t$ (these are the $E=3$ points in the manuscript's Figure~1(f):

plot(input$Y_t,
     pbsEDM_pred_Y_E_3,
     xlab = "Observed Y",
     ylab = "Predicted Y")
abline(a=0, b=1)

shows a good correspondence, as given by the correlation coefficient of r pbsEDM_rho_Y_E_3.

Test that the results have not changed

Here we run some tests to check that if any of the above results from pbsEDM have changed from when this vignette was written (some results are saved in NY_lags_example_3). The tests will give errors if they fail (and any such failures should be investigated and this vignette updated).

Check the pbsEDM $Y_t$ predictions for $E=2$ have not changed:

testthat::expect_equal(NY_lags_example_3$my.pred,
                       pbsEDM_pred_Y)             # If no error then they still match


luke-a-rogers/pbsedm documentation built on June 3, 2024, 5:20 a.m.