knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7.1,
  fig.height = 6,
  fig.retina = 2
)
load_all() # library(pbsEDM)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)

Jump to the Where we are at section. Some earlier code probably won't work now, as I've been continuously updating functions. This is not a vignette, more figuring things out as I go along, so it's unlikely to get fully rebuilt.

Adapting Luke's original mve.Rmd to dig into his multiview embedding functions. I was going to use the same default time series as in analyse_simple_time_series.Rmd, but that is univariate, so stick with his larkin one.

Andy has updated several functions and appended _for_sve() to their names.

Hao and Sugihara (2016) define, for, say, the Hastings and Powell x-y-z food-chain model:
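(The definition itself didn't make it in here; paraphrasing the setup loosely from memory, so treat this as a rough sketch rather than their exact statement: for a system of $n$ observed variables ($n = 3$ for $x$-$y$-$z$), each allowed up to $l$ lags, there are $nl$ candidate coordinates and so on the order of $m = \binom{nl}{E}$ possible $E$-dimensional views; forecasts from the top $\sqrt{m}$ views, ranked by in-sample forecast skill, are then averaged to give the multiview forecast.)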

Not sure how $E$ gets prescribed, though we could loop over it. In multiview_embedding() it looks like it will just do all potential combinations (and any $E$) based on the lags argument.

larkin
plot(larkin$time,
     larkin$recruits,
     type = "o",
     ylim = c(0, 0.2))

This is a different format to what we have in salmon_sim(), so switching to ours.

simulated <- EDMsimulate::salmon_sim()
simulated

Here's what we have in EDMsimulate::sim_and_fit() and ..._realisations(), so adapt those (not evaluating here). We are going to use R_t primarily, not the default R_prime_t; we just want to be able to switch:

# sim_and_fit():
sim_and_fit <- function(salmon_sim_args = list(),
                        pbsEDM_args = list(
                          lags = list(R_switch = 0,
                                      S_t = 0:3)),
                        R_switch = "R_prime_t"){

  stopifnot(R_switch %in% c("R_t", "R_prime_t"))
  stopifnot(names(pbsEDM_args$lags)[1] == "R_switch")
  names(pbsEDM_args$lags)[1] <- R_switch

  simulated <- do.call(salmon_sim,
                       salmon_sim_args)

  simulated_use <- simulated
  simulated_use[nrow(simulated_use), R_switch] = NA    # Ensure no
                                        # knowledge of it for pbsEDM(), so it
                                        # doesn't affect the rho values (else it does)

  fit <- do.call(pbsEDM::pbsEDM,
                 c(list(N = simulated_use),
                   pbsEDM_args))

  to_return <- list(simulated = simulated,
                    fit = fit)
  # concatenated list, with simulated tibble first then list of all results from pbsEDM
  return(to_return) # for now, probably want to scale down or have option to
                    # just return basic results for when doing many simulations
}

# sim_and_fit_realisations(), part of:
    if(edm_fit){
      fit_edm <- do.call(pbsEDM::pbsEDM,
                         c(list(N = all_sims[[m]]),
                           pbsEDM_args))

      testthat::expect_equal(dplyr::pull(all_sims[[m]], R_switch),
                             fit_edm$N_observed[-(T+1)])  # Extra check

      fit_edm_full_series[m, ]  <- t(c(m,
                                       fit_edm$N_forecast))

      res_realisations[m, "R_switch_T_edm_fit"] = fit_edm$N_forecast[T] # TODO
                                        # double check what to do when pbsedm
                                        # arguments change
      res_realisations[m, "E"] = fit_edm$results$E  # Though will need specific
                                                    # lags also kept track of or specified
      res_realisations[m, "N_rho"] = fit_edm$results$N_rho
      res_realisations[m, "N_rmse"] = fit_edm$results$N_rmse
      res_realisations[m, "X_rho"] = fit_edm$results$X_rho
      res_realisations[m, "X_rmse"] = fit_edm$results$X_rmse
    }

First just try a simple call to single_view_embedding, and based on the top function above, use

lags_use <- list(# R_t = 0,  # not for single_view I think
                 S_t = 0:3)

Here's what Luke had for Larkin; then try that on ours:

# Larkin
if(TRUE){
  f0 <- pbsEDM::single_view_embedding(
    data = larkin,
    response = "recruits",
    lags = list(spawners = 0:8),
    index = 60,
    buffer = 10,
    window = integer(0),
    metric = "rmse",
    beyond = FALSE
  )

  f0 %>% as.data.frame()
}

Now try that for one of our examples, then dig into code with browser() to trace what is being done. Then change arguments.

index_val <- 60
buffer_val <- 10
sve_on_simulated <- single_view_embedding(data = simulated,
                                          response = "R_t",
                                          lags = lags_use,
                                          index = index_val,   # not sure what this is
                                          buffer = buffer_val,  # number of forecasts
                                            # prior to index, not sure exactly
                                            # what that means
                                          window = integer(0),  # forecast
                                            # metric moving window width, aha -
                                            # our issue with before and after I
                                            # think maybe
                                          metric = "rmse",  # presumably the
                                            # metric to test something
                                          beyond = FALSE)
sve_on_simulated %>% as.data.frame()
summary(sve_on_simulated)

Looks like it's only using information up to that point in time, going by the points column. I don't think we need that (i.e. we can use all the data), as we're going to test performance on the $T+1$ point. Deconstruct the function to understand the methods (see later for a simpler example):

ssr <- state_space_reconstruction(data = simulated,
                                  response = "R_t",
                                  lags = lags_use)   # deconstructing later,
                                        # realised it doesn't do first-differencing
head(simulated)
head(ssr)

Returns centred and scaled values. Have gone through function.

distances <- state_space_distances(ssr,
                                   index = index_val, # time index of the first value
                                        # to forecast - but why would you do
                                        # this? Ignore a transient, or a library
                                        # thing? TODO
                                   buffer = buffer_val)  # number of values to forecast
                                        # before index. Not sure why, vaguely
                                        # remember a discussion about it. TODO

# From single_view_embedding() for this (changing X to ssr, X_distance to distances)
  # - Rows in X are points in the SSR
  # - Each row in X_distance corresponds to a focal point in the SSR
  # - Each column in X_distance corresponds to a potential neighbour in the SSR
  # - Elements of X_distance correspond to distances to neighbours
  # - NA elements indicate disallowed neighbours for a given focal point

as_tibble(distances)
summary(distances) # First three and last columns always NA's. Rest have at
                   # least 49.
tail(distances)

# Next:
  # Compute centred and scaled forecasts ---------------------------------------

  # - Create neighbour index matrix
  # - Create neighbour matrices
  # - Project neighbour matrices
  # - Compute ssr_forecast vector
ssr_forecasts <- state_space_forecasts(ssr,
                                       distances,
                                       beyond = FALSE)
ssr_forecasts
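For intuition, here is a generic simplex-style version of the steps listed above, for a single focal point, assuming the usual EDM exponential weighting of the $E + 1$ nearest neighbours. state_space_forecasts() may differ in its neighbour rules and weighting, so this is only a sketch; simplex_project_one() is a made-up helper and focal = 60 is an arbitrary choice.

# Sketch only (not the package code): project one focal point forward by
# averaging where its allowed nearest neighbours went one step later,
# weighted by exp(-d / d_min) as in standard simplex projection.
simplex_project_one <- function(ssr, distances, focal, n_neighbours) {
  d <- distances[focal, ]                     # distances to potential neighbours (NA = disallowed)
  nbrs <- order(d)[seq_len(n_neighbours)]     # indices of the nearest allowed neighbours
  w <- exp(-d[nbrs] / d[nbrs[1]])             # weights decay with distance relative to the nearest
  colSums(w * ssr[nbrs + 1, , drop = FALSE]) / sum(w)   # weighted average of the neighbours' next points
}

simplex_project_one(ssr, distances, focal = 60, n_neighbours = ncol(ssr) + 1)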


observed <- c(dplyr::pull(simulated,
                          "R_t"),
              NA)[seq_along(ssr_forecasts)]  # Not sure what NA does, gets
                                        # ignored anyway.
observed   # vector of observations

forecast <- untransform_forecasts(observed,
                                  ssr_forecasts)   # scale back to
                                                   # non-normalised

rows <- seq_along(forecast)

# This is what single_view_embedding() returns, set is being defined here:
sve_on_simulated_manually <- tibble(set = rep(0:1,
                                              c(index_val - 1,
                                                nrow(simulated) - index_val + 2))[rows],
                                    time = seq_len(nrow(simulated) + 1L)[rows],
                                    points = c(0,
                                               as.vector(apply(distances,
                                                               1,
                                                               function (x) sum(!is.na(x)))))[rows],
                                    dim = rep(ncol(ssr),
                                              nrow(simulated) + 1L)[rows],
                                    observed = observed,
                                    forecast = forecast,
                                    forecast_metrics(observed, forecast, integer(0),
                                                     "rmse"),   # 0 gives all NA's, but
                                                       # integer(0) does not;
                                                       # strange. Not quite sure
                                                       # what these are.
                                    superset_columns(simulated, lags_use, NULL))

superset_columns(simulated, lags_use, NULL)   # so creates extra columns, all 1's
                                        # though, not sure of the point.

expect_equal(sve_on_simulated,
             sve_on_simulated_manually)
# Error: `sve_on_simulated` not equal to `sve_on_simulated_manually`.
# Names: 4 string mismatches
# Length mismatch: comparison on first 12 components
# Component "mre": Modes: numeric, logical
# Component "mre": target is numeric, current is logical
# Component "rmse": Modes: numeric, logical
# Component "rmse": target is numeric, current is logical

Now deconstruct multiview_embedding

Use this as a template when we get to multiview_embedding(). First, Luke's Larkin call:

# Larkin
if(FALSE){
  f01 <- pbsEDM::multiview_embedding(
    data = larkin,
    response = "recruits",
    lags = list(spawners = 0:8),
    index = 60,
    buffer = 10,
    window = integer(0),
    metric = "rmse",
    beyond = FALSE,
    weight = NULL,
    n_weight = 1,
    cores = NULL)

  f01$ranks    # 51,000 rows!
  f01$summary
  f01$hindsight
  f01$results
  f01$forecast
}

Now to do on our simulated values, using similar call:

lags_use_multi <- list(R_t = 0,
                       S_t = 0:3)
# TODO Think that may not make sense, as Luke always has a response variable as
# well. So you shouldn't use R_t for both, presumably.
mve_on_simulated <- multiview_embedding(data = simulated,
                        response = "R_t",
                        lags = lags_use_multi,   # when I had this as lags_use,
                        # first six non-NA forecast values were
                        # same as for lags_use_multi, but not the rest
                        index = index_val,
                        buffer = buffer_val,
                        window = integer(0),
                        metric = "rmse",
                        beyond = FALSE,
                        weight = NULL,
                        n_weight = 1,
                        cores = NULL)

mve_on_simulated

Deconstruct multiview_embedding() to understand the steps.

lags_use
subset_lags <- create_subset_lags(lags_use)
subset_lags

So that gives every combination, including some without any 0-lags. That was just univariate, so try the multivariate one:

subset_lags_multi <- create_subset_lags(lags_use_multi)
# Order is not intuitive, but looks to cover them all (took a while to figure
# out formatting)
subset_lags_multi
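A quick check of the counts, assuming create_subset_lags() returns every non-empty subset of the individual lagged coordinates (so $2^N - 1$ of them):

length(subset_lags)         # 4 candidate lags, so expect 2^4 - 1 = 15
length(subset_lags_multi)   # 5 candidate lags, so expect 2^5 - 1 = 31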

Now do the calcs:

mve_forecasts <- lapply(
  subset_lags_multi,
  FUN = single_view_embedding,
  data = simulated,
  response = "R_t",
  index = index_val,
  buffer = buffer_val,
  window = integer(0),
  metric = "rmse",
  beyond = FALSE,
  superset = lags_use_multi)

weighted <- weight_single_view_embeddings(mve_forecasts,
                                          "rmse",
                                          weight = NULL,
                                          n_weight = 1)
weighted

mve_on_simulated_manually <- structure(
  list(
    data = simulated,
    observed = c(dplyr::pull(simulated, "R_t"), NA),
    forecast = c(rep(NA_real_, index_val - 1), weighted$results$forecast),
    response = "R_t",
    lags = lags_use_multi,
    index = index_val,
    buffer = buffer_val,
    window = integer(0),
    metric = "rmse",
    beyond = FALSE,
    n_weight = 1,
    raw_forecasts = mve_forecasts,
    ranks = weighted$ranks,
    summary = weighted$summary,
    hindsight = weighted$hindsight,
    results = weighted$results),
  class = "multiview_embedding")

expect_equal(mve_on_simulated,
             mve_on_simulated_manually)

tail(mve_on_simulated$results)

Next -- I don't think this actually gives the best forecast for the next step, though it does return the values for each view.

Check that a multiview that includes a lag of 0 of the response variable should not be included -- or rather, realise that of course it should!

So the first combination of lags is (currently) just R_t with a lag of 0, which is the response variable, so presumably the results come out really well, or are just a bunch of NA's. No -- I was thinking about this wrongly. See the notes in #16 and below.

subset_lags_multi[[1]]
mve_on_simulated$raw_forecasts[[1]] %>% as.data.frame()

Actually, it shouldn't be perfect, should it, as we're excluding the $t^*$ value. Hence it's using the next nearest neighbour in 1-d R_t space, which is obviously not itself. This is all fine then -- the 1-d space is just likely a poor representation of the dynamics. So we can include a lag of 0.
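A small illustration of that point, using plain Euclidean distance in the 1-d R_t coordinate rather than the package's own distance functions (r_t and dist_1d are just illustrative names):

r_t <- simulated$R_t
dist_1d <- abs(outer(r_t, r_t, "-"))   # pairwise distances in 1-d R_t space
diag(dist_1d) <- NA                    # exclude each focal point itself
which.min(dist_1d[10, ])               # nearest neighbour of t* = 10 is some other time index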

Carrying on

So, for single_view_embedding() figure out what the following do:

  1. index
  2. buffer
  3. window
  4. beyond (default is FALSE and hasn't been changed)
  5. superset (default is NULL and hasn't been changed)

So:

  1. Change the index value; it gets used in state_space_distances() in:
# Exclude focal points in the training set -----------------------------------

distances[seq_len(index - buffer - 1L), ] <- NA_real_

We want to do leave-one-out instead, so maybe we just don't need this at all? Have done this in state_space_distances_for_sve() and then state_space_forecasts_for_sve(); the following is figuring out how to do those and then doing them.

Also want to change

# Exclude focal point and future neighbours ----------------------------------

distances[upper.tri(distances, diag = TRUE)] <- NA_real_

to just exclude the focal point -- we are okay with having future neighbours, as is the usual situation in EDM.
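A minimal sketch of that change, assuming plain Euclidean distances (the real implementation is state_space_distances_for_sve(), and distances_loo is just an illustrative name): keep the full symmetric distance matrix and only disallow each focal point as its own neighbour, rather than blanking the training rows and the whole upper triangle.

distances_loo <- as.matrix(dist(ssr))   # symmetric matrix of pairwise distances
diag(distances_loo) <- NA_real_         # leave-one-out: a point cannot be its own neighbour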

Go through each stage of state_space_distances(), checking what happens as we go along. Above we had

# distances <- state_space_distances(ssr,
#                                   index = index_val,
#                                   buffer = buffer_val)

distances_1 <- state_space_distances(ssr,
                                     index = 60, # 0 gives error
                                     buffer = 1) # 0 gives error, not sure what
                                     # this really means yet

Do a smaller one to understand:

# Above we defined
# ssr <- state_space_reconstruction(data = simulated,
#                                  response = "R_t",
#                                  lags = lags_use)  # = S_t of 0, 1, 2, 3

ssr_small <- ssr[1:10, ]
ssr_small
distances_small <- state_space_distances(ssr_small,
                                         index = 2, # 0 gives error
                                         buffer = 1) # 0 gives error, not sure what this means yet
distances_small

# Now adapted to not have index or buffer, and return symmetric matrix not lower
# triangular (plus extra NA's)

distances_small_for_sve <- state_space_distances_for_sve(ssr_small)
distances_small_for_sve

So, the first three points of ssr_small do not exist because of the lags. Carrying on:

ssr_forecasts_small <- state_space_forecasts(ssr_small,
                                             distances_small,
                                             beyond = FALSE)
ssr_forecasts_small

ssr_forecasts_small_for_sve <- state_space_forecasts_for_sve(ssr_small,
                                                             distances_small_for_sve)
ssr_forecasts_small_for_sve
# Maybe change name here
sve_on_simulated_2 <- single_view_embedding(data = simulated,
                                            response = "R_t",
                                            lags = lags_use,
                                            index = index_val,   # not sure what this is
                                            buffer = buffer_val,  # number of forecasts
                                            # prior to index, not sure exactly
                                            # what that means
                                            window = integer(0),  # forecast
                                            # metric moving window width, aha -
                                            # our issue with before and after I
                                            # think maybe
                                            metric = "rmse",  # presumably the
                                            # metric to test something
                                            beyond = FALSE)
sve_on_simulated_2 %>% as.data.frame()
summary(sve_on_simulated_2)

Figuring out ssr - need first differencing also

Realised that state_space_reconstruction() doesn't do what we say in the Appendix of the first manuscript. So deconstructing it and writing state_space_reconstruction_for_sve().
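A minimal sketch of the intended transform for one column, assuming first differences to the next time step followed by centring and scaling (state_space_reconstruction_for_sve() is the real implementation; first_diff_scaled() is a made-up helper):

first_diff_scaled <- function(x) {
  d <- c(diff(x), NA)       # difference to the next time step; undefined for the final time
  as.vector(scale(d))       # centre and scale
}
head(first_diff_scaled(simulated$S_t))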

Do the first 15 rows of simulated, starting fresh here:

simulated_small_2 <- simulated[1:15, ]

# Above had
ssr <- state_space_reconstruction(data = simulated,
                                  response = "R_t",
                                  lags = lags_use)   # deconstructing later,
                                        # realised it doesn't do first-differencing

lags_use_multi

ssr_small_2_orig_fun <- state_space_reconstruction(data = simulated_small_2,
                                                   response = "R_t",
                                                   lags = lags_use_multi)

# Now the new function that also does first-differencing; am digging into it to
# check I understand, then including first-differencing early on in this new one:
ssr_small_2 <- state_space_reconstruction_for_sve(data = simulated_small_2,
                                                   response = "R_t",
                                                   lags = lags_use_multi)
# Okay, in the function it puts the response variable first as "R_t_s" (I added
# s for scaled), but if
# we've prescribed a lag of 0 for "R_t" you get an identical column but labelled
# "R_t_s_0". BUT, not all state spaces will include a zero lag of "R_t". Stick
# with the response column for now - I think it may be slightly inefficient
# (could be calculated once for all ssr's, rather than for each ssr), but we do
# need the scaled version calculated (can also check on the non-scaled predictions).

# Example of ssr without repeated first column:
state_space_reconstruction_for_sve(data = simulated_small_2,
                                   response = "R_t",
                                   lags = list(R_t = 1,
                                               S_t = 0:3)) %>% head()

head(ssr_small_2)

Where we were at - now see Where we are at further down

So single_view_embedding_for_sve() [cumbersome name, but consistent for now with what I've been doing] is now being created as I go along, based on single_view_embedding(). This is to help write the function.

Starting fresh (so can just run from here):

set.seed(42)
# Have realised the correlation issue between S_t and R_t so modifying code
# and rerunning from here.
h_simulated <- 0.1095 + sample(1:180) * 0.001 # has mean of 0.2
simulated_4 <- EDMsimulate::salmon_sim(h = h_simulated)

lags_use_multi <- list(R_t = 0,
                       S_t = 0:3)

# TODO maybe: not sure why this failed a while ago, might be the use of random numbers
#  expect_equal(EDMsimulate::salmon_sim()[1:15, ], EDMsimulate::salmon_sim(T = 15))

# Redoing later with simulation with changing harvest rate. And change error in
# `state_space_reconstruction_for_sve.R` to something more useful.

simulated_small_4 <- simulated_4[1:15, ]
# TODO need different lags to not return an NA
ssr_new <- state_space_reconstruction_for_sve(data = simulated_small_4,
                                              lags = lags_use_multi)
ssr_new     # Has the correct NA's in top three rows and final row

response_s <- state_space_reconstruction_for_sve(data = simulated_small_4,
                                                 response_only = TRUE,
                                                 lags = list(R_t = 0))

response_s  # Just what we're considering as the response variable, properly
            #  labelled; don't have a first-differenced value for T (as no
            #  T+1). This value is what we want to forecast.

# Calculate the distances between all points
distances_new <- state_space_distances_for_sve(ssr_new)
distances_new
# Note that column 14 is not filled because it projects to the undefined row 15,
# so no point in having it as a neighbour. Row 14 is filled because 14 can be a
# valid t_star. TODO this doesn't happen for later example with only S_t = 1 lag
# and R_t as response.

prediction_indices <- state_space_forecasts_for_sve(ssr_new,
                                                    distances_new,
                                                    max_lag = max(unlist(lags_use_multi,
                                                                         use.names = FALSE)))
# Those are the indices to use to make the prediction of where each t_star will
# go for t_star + 1.

# Take off the final one because we are shifting (predicting t_star+1 from the
# prediction indices for each t_star); need NA at beginning.
response_s_predicted <- c(NA,
                          response_s[prediction_indices[-length(prediction_indices)]])

response_s_predicted

rho_response_s <- cor(response_s,
                      response_s_predicted,
                      use = "pairwise.complete.obs") %>% as.vector()

rho_response_s

Can also calculate $\rho$ on the untransformed original response variable.

Now untransform the scaling and first differencing.
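Before the real call, a toy sketch of the back-transform idea, with illustrative names only (the actual function is untransform_predictions()): unscale the predicted first difference and add it to the known observation at time $t$ to get the prediction for $t+1$. If the "prediction" is the true scaled difference, the inversion recovers the observed value exactly:

N  <- c(3.2, 4.1, 3.8, 5.0, 4.4)     # toy observed response
dN <- diff(N)                        # first differences
Y  <- as.vector(scale(dN))           # centred and scaled differences (what the SSR sees)
Y_hat <- Y[3]                        # pretend this is the predicted scaled difference at t = 3
N[3] + Y_hat * sd(dN) + mean(dN)     # prediction for t = 4; equals N[4] here by construction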

N_observed <- simulated_small_4$"R_t"
N_predicted <- untransform_predictions(N_observed = N_observed,
                                       Y_predicted = response_s_predicted,
                                       max_lag = max(unlist(lags_use_multi,
                                                            use.names = FALSE)))
rho_orig <- cor(N_observed, N_predicted[-length(N_predicted)], use =
                                                                 "pairwise.complete.obs")
rho_orig
# Low, but only a few points to get things working
# Leave the rho and other calculations to outside of the function, though do
# need the rho for the state space.

Then put that into single_view_embedding_for_sve(), work out what should be returned, and then call the proper thing from the relevant mve function.

sve_predicted_for_sve <- single_view_embedding_for_sve(
  data = simulated_small_4,
  response = "R_t",
  lags = lags_use_multi)

sve_predicted_for_sve

rho_orig_from_function <- cor(sve_predicted_for_sve$R_t,
                              sve_predicted_for_sve$R_t_predicted,
                              use = "pairwise.complete.obs")
expect_equal(rho_orig, rho_orig_from_function)

rho_response_s_from_function <- cor(sve_predicted_for_sve$R_t_s,
                                    sve_predicted_for_sve$R_t_s_predicted,
                                    use = "pairwise.complete.obs")
rho_response_s_from_function
expect_equal(rho_response_s, rho_response_s_from_function)

Now to link that from multiview_embedding()

Keeping the function name multiview_embedding() as that will be called from elsewhere (keeping my other new ones as ..._for_sve() though). Saved Luke's as ..._original().

# Doing this to work out what failed
simulated <- EDMsimulate::salmon_sim()   # HERE
sve_this_failed <- single_view_embedding_for_sve(
  data = simulated[1:15, ],    # Did fail, but figured out below and no longer does.
  response = "R_t",
  lags = list(S_t = 1))
# Originally failed with:
# Error in untransform_predictions(N_observed = response_observed, Y_predicted = response_s_predicted),: !any(is.na(Y_predicted[(min_t_Y_predicted):length(Y_predicted)])) is not TRUE

# Originally hadn't had NA's in final column of distances, or final
# row. Thought it was meant to. Deconstructing, it just needs them in final column:
ssr_5 <- state_space_reconstruction_for_sve(data = simulated[1:15, ],
                                            lags = list(S_t = 1))
ssr_5
# So just the S_t_s_1 column, which is correct.
distances_5 <- state_space_distances_for_sve(ssr_5)

distances_5  # Did not have NA's in the final column of distances, but have now
             # updated state_space_distances_for_sve() to put NA's in the final
             # column -- the final time point cannot be a nearest neighbour,
             # because we do not know where it goes (its projection does not
             # exist; I previously just used the condition that where-it-goes
             # has NA). The final point itself is well defined (since there is
             # no lag 0, all its values are included), but it just can't be a
             # neighbour.

sve_this_worked <- single_view_embedding_for_sve(
  data = simulated_small_4,
  response = "R_t",
  lags = list(S_t = 1))
sve_this_worked             # Gives negative forecast though

# And this to work on correlation issue above
sve_use_multi <- single_view_embedding_for_sve(
  data = simulated_small_4,
  response = "R_t",
  lags = lags_use_multi)
sve_use_multi
# All R_t_predicted come out negative !?

# Try the longer time series, only effectively using around 13 candidate
# neighbours in that one.
sve_simulated_4 <- single_view_embedding_for_sve(
  data = simulated_4,
  response = "R_t",
  lags = lags_use_multi)
sve_simulated_4
# Can see line 28 to 29 has a jump in R_t_observed of 14.2, and R_t_s[28] is
# 4.23 (considers jump from 28 to 29, then rescaled). R_t_s[29] is -3.43 as it
# comes down again. But can see that t* = 35 gives a big jump, but then doesn't
# come back down again. Aha - since we're not using the realistic lags yet I
# think, only R_t so no detection of R_t lags (a big jump always followed by a
# big crash). (bottom right arm of Figure 1(d) in manuscript 1). But we're not
# allowing, in this single view embedding, such lags of R_t. Presumably this
# will just come out as a very poor ssr and get dumped.
lags_St2 <- list(S_t = 2)
# This fails (during multiview below) because I didn't disallow all the nearest
# neighbours at the end; I was only thinking about no lag 0, not no lag 0 or lag
# 1. So need to consider the minimum lag in state_space_distances_for_sve().
sve_simulated_St2 <- single_view_embedding_for_sve(
  data = simulated_4,
  response = "R_t",
  lags = lags_St2)

# Redid this one and it ran okay, numbers look off though.
# And this for testing also
sve_simulated_lags_many <- single_view_embedding_for_sve(
  data = simulated_4,
  response = "R_t",
  lags = list(R_t = 0:4,
              S_t = 0:3))
This fails -- see the browser() in the function. It fails for S_t with a lag of 2; must be the next one I think, as the one above runs -- need to figure this out. DID THIS: needed to repeat the fix I just did, i.e. disallow the max_lag-ish candidate nearest neighbours at the end that have undefined projections. See state_space_distances_for_sve() -- just tidy up the example.
mve_small <- multiview_embedding(data = simulated_small_4,
                                 response = "R_t",
                                 lags = lags_use_multi)

# Earlier example was for simulated_4 not simulated_small_4; expect latter is
# just running out of lags kind of thing. Do explicitly:
# This now does not have 15 in prediction_indices, having fixed. Not obvious why
# the error arose with untransform_predictions().
sve_simulated_small_St2 <- single_view_embedding_for_sve(
  data = simulated_small_4,
  response = "R_t",
  lags = lags_St2)
sve_simulated_small_St2

That runs, but I thought it had something wrong, as R_t_predicted should only take values of R_t -- but that's only true if not doing first-differencing and scaling, I think. diff(sve_simulated_small_St2$R_t_predicted) correctly has repeated values (i.e. the first differences are repeated). Had thought max_lag in untransform_predictions() was wrong, and should maybe depend on the response only, but pretty sure it is okay. Given the results are so squiffy, there must be something wrong in untransform_predictions(). BUT, this is only a very short time series, so it is likely to not be good. Retry with the longer one:

sve_simulated_St2_again <- single_view_embedding_for_sve(
  data = simulated_4,
  response = "R_t",
  lags = lags_St2)
sve_simulated_St2_again

Runs okay; the numbers are not great, but I don't expect them to be, as we're only using a lag of 2 for S_t. Retry the multiview again:

mve_small_again <- multiview_embedding(data = simulated_small_4,
                                       response = "R_t",
                                       lags = lags_use_multi)

That runs until it errors, but seems to loop through everything. BUT, the results shouldn't be too great, given the short data set and short lags.

How about the full shebang:

set.seed(42)
h_simulated <- 0.1095 + sample(1:180) * 0.001 # has mean of 0.2
simulated_4 <- EDMsimulate::salmon_sim(h = h_simulated)

lags_use_multi <- list(R_t = 0,
                       S_t = 0:3)

lags_lots <- list(R_t = 0:4,
                  S_t = 0:8)

mve_full_shebang <- multiview_embedding(data = simulated_4,
                                       response = "R_t",
                                       lags = lags_lots)
# Currently returns:
#   return(list(rho_each_subset = rho_each_subset,
#              rho_s_each_subset = rho_s_each_subset))
# Doing quickly:
sqrt_length <- round(sqrt(length(mve_full_shebang[[1]])))  # approx 2^(N/2) I think

rho_each_subset <- mve_full_shebang[[1]]
top_rho <- sort(rho_each_subset, decreasing = TRUE)[1:sqrt_length]
top_rho

rho_s_each_subset <- mve_full_shebang[[2]]
top_rho_s <- sort(rho_s_each_subset, decreasing = TRUE)[1:sqrt_length]
top_rho_s

That gave a top top_rho value of 0.597 (and top_rho_s of 0.592, I think) when I was incorrectly using response_predicted in untransform_predictions() instead of response_observed (i.e. we should add the first differences to the known observation, which we know (!), rather than to the predicted value, which can compound errors). So re-running gives:

After correcting, the top_rho max is only 0.36 (i.e. lower?!), while the top top_rho_s is 0.67. Need to step back and do this for a single view.
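The distinction in the correction above, as a toy sketch (N_obs and d_pred are made-up names and values): anchoring each prediction on the known observation, versus cumulating the predicted differences, which compounds errors.

N_obs  <- c(10, 12, 11, 13, 12)        # toy observed series
d_pred <- c(1.5, -0.8, 2.2, -1.1)      # toy predicted first differences for t -> t + 1

N_obs[1:4] + d_pred                    # corrected: add each difference to the observed value at t
N_obs[1] + cumsum(d_pred)              # previous version: each error feeds into the next prediction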

Redo a single view example

sve_7 <- single_view_embedding_for_sve(
  data = simulated_4,
  response = "R_t",
  lags = lags_lots)
sve_7_rho <- cor(sve_7$R_t,
                 sve_7$R_t_predicted,
                 use = "pairwise.complete.obs")
sve_7_rho
# 0.173301 before changing untransform_predictions() to set negative predictions
# to the minimum observed.
# Once done that the value is the same??? Should be a bit better you'd
# think. Changed to a high value (200) instead of minimum and it did change rho
# to negative, so code is changing properly. Changes to 0.17277 when I make it
# 10* minimum, so code is working properly, just that change surprisingly made
# no difference (for this example).

Try small multiview again:

mve_small_again_2 <- multiview_embedding(data = simulated_small_4,
                                         response = "R_t",
                                         lags = lags_use_multi)
mve_small_again_2

Where we are at

Can start from here.

It's all working okay. Try the shebang one again with the longer simulated data and more lags. Though the actual top rho values shouldn't change from the shebang run above; the values are calculated within the function now.

set.seed(42)
h_simulated <- 0.1095 + sample(1:180) * 0.001 # has mean of 0.2
simulated_4 <- EDMsimulate::salmon_sim(h = h_simulated)

lags_use_multi <- list(R_t = 0,
                       S_t = 0:3)

lags_lots <- list(R_t = 0:4,
                  S_t = 0:8)

mve_full_shebang_2 <- multiview_embedding(data = simulated_4,
                                          response = "R_t",
                                          lags = lags_lots)  # Do lags_use_multi
                                            # for shorter and quicker one
mve_full_shebang_2$rho_prediction_from_mve
mve_full_shebang_2$rho_each_top_subset

Wow, fascinating. The overall $\rho$ for mve, averaging over the top subsets, is higher than every $\rho$ from every subset! So the averaging is indeed paying off, and improving results. Other results are:

mve_full_shebang_2$lags_of_top_subsets  # These are in order of highest rho
mve_full_shebang_2$response_predicted_from_mve   # Overall mve prediction, including
                                            # for the forecast T+1.
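And the claim above as a one-line check (expect TRUE, given the note above):

mve_full_shebang_2$rho_prediction_from_mve > max(mve_full_shebang_2$rho_each_top_subset)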

Can now move on to including this in sim_and_fit_realisations() in EDMsimulate.

Still having issues in EDMsimulate (see issue 17).

set.seed(42)
simulated_5 <- EDMsimulate::salmon_sim()  # Back to default h_t, not noisy,
                                        # to see if that's an issue. Shouldn't
                                        # be. BUT is!
mve_args_manual = list(lags = list(R_t = 0:1, S_t = 0:3))
R_switch_manual <- "R_t"
fit_mve_manual_5 <- do.call(pbsEDM::multiview_embedding,
                     c(list(data = simulated_5,
                            response = R_switch_manual),
                       mve_args_manual))

Aha -- it doesn't work with h_t constant (NULL to give the default) in salmon_sim(), but does for simulated_4 with h_t varying. HERE HERE -- see the fit_models help; have been fixing things and made new tests.

TODO Maybe -- could do with a function to put the list of (top) lags into a simple table. Maybe. Also a function for the size of subset_lags; it just needs to be based on the total number of lags, and we will need it in a function (as a check) anyway. Isn't it $2^N - 1$ ish? Yes -- each lag is either in or out.
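And a quick check of that count, assuming create_subset_lags() returns every non-empty subset:

n_lags_total <- length(unlist(lags_lots))   # 5 lags of R_t plus 9 of S_t = 14
2^n_lags_total - 1                          # 16383
length(create_subset_lags(lags_lots))       # should match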

Chat with Carrie:

I think Andy and Carrie have not been on the same page about what is being forecast -- will chat Tuesday and resolve. TODO

We will use R_t with lags of 1, 2, and 3 (Andy now thinks we do want lag 0 also, see Issue 16). Check our thinking of $t^*$ is the same -- discuss, as it may not be. S_t gets lags 1 to 7: when we need the forecast for R_t we don't yet know S_t, because that is calculated at the end of the year. The number of combinations of each dimension (the 5-dimensional ones, half the max, being the most numerous), their total, and the square root of the total are:

choose(10, 1:10)
choose(10, 1:10) %>% sum()
choose(10, 1:10) %>% sum() %>% sqrt()

So that's manageable.

Should calculate $\rho$ only on what we care about, i.e. R_t.


