# In boettiger-lab/mdplearning: Bayesian Learning Algorithms for Markov Decision Processes

library("mdplearning")
library("appl")
library("ggplot2")
library("tidyr")
library("dplyr")

## these functions should be added to package properly, with proper testing
source("mdp_learning.R")

##
knitr::opts_chunk$set(cache=TRUE)  Initialize a simple POMDP model for fisheries: set.seed(1234) states <- 0:50 actions <- states observed_states <- states reward_fn <- function(x, a) pmin(x, a) sigma_g <- 0.28 sigma_m <- 0.28 discount <- 0.95 Tmax <- 20  We consider parameter uncertainty by examining the Ricker model over a range of r values, a range of K values, and a range of growth noise sigma_g values. We also consider the Allen model over a range of Allee threshold C values: rs <- seq(0, 2, length.out = 11) r_models <- lapply(rs, function(r) function(x, h){ K <- 45 s <- pmax(x - h, 0) s * exp(r * (1 - s / K) ) }) Ks <- seq(5, 45, length.out = 11) K_models <- lapply(Ks, function(K) function(x, h){ r <- 1 s <- pmax(x - h, 0) s * exp(r * (1 - s / K) ) }) Cs <- seq(0,25, length.out = 11) C_models <- lapply(Cs, function(C) function(x, h){ r <- 1 K <- 45 s <- pmax(x - h, 0) s * exp(r * (1 - s / K) * (s - C) / K ) })  We compute the transition matrices for each, as before: get_matrices <- function(models){ matrices <- lapply(models, function(model) appl::fisheries_matrices(states = states, actions = actions, observed_states = states, reward_fn = reward_fn, f = model, sigma_g = sigma_g, sigma_m = sigma_m)) lapply(matrices, [[, "transition") } r_matrices <- get_matrices(r_models) K_matrices <- get_matrices(K_models) C_matrices <- get_matrices(C_models) sigmas <- seq(0.01, 0.5, length.out = 11) matrices <- lapply(sigmas, function(sigma_g) appl::fisheries_matrices(states = states, actions = actions, observed_states = states, reward_fn = reward_fn, f = r_models[], sigma_g = sigma_g, sigma_m = sigma_m)) sigma_matrices <- lapply(matrices, [[, "transition") utility <- matrices[]$reward


## Planning for Uncertainty

First we consider the optimal solution where we simply average over the uncertainty. We consider the policy resulting from a uniform uncertainty over $r \in [r min(rs), r max(rs)]$ compared to fixing r at a low ($r = r rs$), average ($r = r rs$), and high values ($r = r rs$) from that range.

Likewise, we consider $K \in [r min(Ks), r max(Ks)]$ compared to fixing C at a low ($K = r Ks$), average ($K = r Ks$), and high values ($K = r Ks$)

For $C$ parameter in the Allen model, we consider $C \in [r min(Cs), r max(Cs)]$ compared to fixing C at a low ($C = r Cs$), average ($C = r Cs$), and high values ($C = r Cs$)

For $\sigma$ parameter in the Ricker model, we consider $\sigma \in [r min(sigmas), r max(sigmas)]$ compared to fixing sigma at a low ($\sigma = r sigmas$), average ($\sigma = r sigmas$), and high values ($\sigma = r sigmas$)

compare_policies <- function(transition){

qt_unif <- rep(1, length=length(transition)) / length(transition)
unif <- mdp_compute_policy(transition, utility, discount, qt_unif, Tmax)

qt_high <- numeric(length(transition))
qt_high <- 1
high <- mdp_compute_policy(transition, utility, discount, qt_high, Tmax)

qt_mean <- numeric(length(transition))
qt_mean <- 1
mean <- mdp_compute_policy(transition, utility, discount, qt_mean, Tmax)

qt_low <- numeric(length(transition))
qt_low <- 1
low <- mdp_compute_policy(transition, utility, discount, qt_low, Tmax)

data.frame(states, low = low$policy, mean = mean$policy, high = high$policy, unif = unif$policy)

}

policies <- purrr::map2_df(list(r_matrices, K_matrices, C_matrices, sigma_matrices), list("r", "K", "C", "sigma"),
function(matrices,label) data.frame(model = label, compare_policies(matrices))
)

policies %>%
gather(prior, policy, -states, -model) %>%
ggplot(aes(states, states - actions[policy], col=prior, lty=prior)) +
geom_line() + ylab("escapement") + facet_wrap(~model)


### Value of Perfect Information

From this, we can also define the value of perfect information for these scenarios. We define the value of information as a function of the starting state, x0. For each possible starting state, we compute the value of each of the policies above under the assumption that reality is either fixed at the high parameter value or the low parameter value of the range. For example, $r = r rs$, and $r = r rs$. We examine the value resulting from the assumption of a uniform prior normalized by the value resulting from perfect knowledge of the parameter, assuming the paramater is the higher value (unif_high) or the lower (unif_low).
We also compare the value resulting from using the wrong value of the parameter without accounting for uncertainty at all, i.e. assuming $r = r rs$ when it is $r = r rs$ (referred to as low_high) and vice versa (high_low).

We compute the value of a fixed policy using the identical dynamic programming around Bellman recursion as in finding the optimal solution; only the given policy is used in computing the value of each step rather than necessarily selecting the action that would maximize that value. This allows us to compute the value of a policy derived under one set of assumptions (transition matrix) when applied under a different set.

Alternately we could have computed the value of information by averaging the value of replicate forward simulations, which is more noisy but has the added benefit of illustrating the distribution of outcomes. We will revisit that approach later.

value_of_information <- function(x){
policies <- x[]
transition <- x[]
model <- x[]
low_low = mdp_value_of_policy(policies$low, transition[], utility, Tmax = 200) high_high = mdp_value_of_policy(policies$high, transition[], utility, Tmax = 200)

unif_low = mdp_value_of_policy(policies$unif, transition[], utility, Tmax = 200) / low_low high_low = mdp_value_of_policy(policies$high, transition[], utility, Tmax = 200)  / low_low

unif_high = mdp_value_of_policy(policies$unif, transition[], utility, Tmax = 200) / high_high low_high = mdp_value_of_policy(policies$low, transition[], utility, Tmax = 200)  / high_high

data.frame(state = policies\$states, unif_low = unif_low, unif_high = unif_high, low_high = low_high, high_low = high_low, model = model)
}

relative_value <- list(list(filter(policies, model=="r"), r_matrices, "r"),
list(filter(policies, model=="K"), K_matrices, "K"),
list(filter(policies, model=="C"), C_matrices, "C"),
list(filter(policies, model=="sigma"), sigma_matrices, "sigma")) %>%
purrr::map_df(value_of_information)

relative_value  %>%
gather(scenario, value, -state, -model) %>%
ggplot(aes(state, value, col = scenario)) +
geom_point() +
facet_wrap(~model)


Several patterns are clear. Since no policy is better than the optimal policy for the scenario, these relative values are always all be less than 1, as expected. Second, accounting for uncertainty should always be better than ignoring it and using the wrong value: that is, unif_high is always better than low_high, and unif_low always better than high_low. Note that scenarios with different underlying models cannot easily be directly compared since they are normalized to different optimums -- there is no general reason to expect that unif_low is larger or smaller than unif_high.

Beyond these expected patterns are the more interesting ones. For instance, the value of sigma results in only small differences in policy, such that even this large uncertainty in sigma (unif) permits almost as much value as knowing sigma precisely. This holds regardless of whether the actual shocks are large (unif_high) or small (unif_low), both curves are consistently near 1. In fact, even using the wrong value of sigma directly results in only a small loss of value. In this context, learning sigma matters little to the overall value.

For the parameter r, there is a noticable cost to using the wrong value and ignoring the parameter uncertainty. (The relative cost is greater when underestimating r, green being below red). Accounting for uncertainty largely resolves this, with the values of both unif_low and unif_high being near 1, which suggests there is little room to improve further by learning, even if learning was very efficient.

With parameter K in the Ricker model, there is greater value to learning; particularly if K tends to be towards the low end of what is expected (unif_low), where accounting for uniform uncertainty results in values around only 50% of perfect information. Even if the true K is large, uniform uncertainty gives only 75% of the value of perfect information, indicating plenty of room for improvement.

Under each of the above scenarios in the Ricker model, the value is realtively independent of the starting state. The slight increase for higher starting states simply reflects the value of the first harvest that brings the population down towards in constant escapement target. Under the Allen model, things are different due to the presense of the tipping point (Allee threshold). Lower starting stock sizes all account for greater and greater probability of collapse (or below the threshold, probability of escape from collapse). Due to the stochastic growth this does not appear as a sharp transition. Assuming the population starts at a size large enough to avoid accidental collapse at the start, we see a pattern similar but somewhat less pronounced than we did with K in the Ricker model. If the true Allee threshold is only at very low stocks, then accounting for uncertainty can result in a nearly optimal value, and perfect knowledge is of little additional utility. For thresholds at more precarious, higher stock levels, information is more valuable (particularly if the starting stock is not too small).

still to be done

## Comparing to learning

still to be done, but see mdp_learning()

## topics considered, to be considered in mdp voi analysis

• Uncertainty Planning

• Illustrate how to integrate over model uncertainty, compare to ignoring this

• Note that comparing fixing the uncertain parameter to prior mean (i.e. ignoring uncertainty) against the case of the proper integration over model uncertainty provides a more objective comparison of the role of accounting for uncertainty than simply comparing the uncertain case to fixing the uncertain parameter arbitrarily.

• Compute policy difference and value-of-perfect-information (VOPI) under various scenarios (e.g. select differences in parameters/models that result in very different optimal escapements). (e.g. presumably VOPI in noise parameters may not be so much? Is knowing measurement noise or growth noise more valuable?)

• Does including a model with a tipping point result in more or less cautious behavior?

• Uncertainty Planning & Learning

• Compare passive learning to no learning when VOPI is large.

• How many models can be considered, and how to select which to include? (e.g. those with larger VOPI differences)
• Compare passive learning to post-hoc inference? Theoretically identical? Does/can passive learning out-perform direct post-hoc model estimation?

• POMDP cases: Consequences of assuming perfect observability

• Compare applying MDP with model uncertainty when reality is POMDP with model uncertainty. Compare policies predicted by each.
• Compare applying MDP (passive) learning with POMDP (passive) learning. Compare rate of learning & outcomes.

boettiger-lab/mdplearning documentation built on May 13, 2019, 8:23 a.m.