Introductory examples in approximate dynamic programming

```r
require(multipleuncertainty)
require(pdgControl)
require(reshape2)
require(ggplot2)
require(data.table)
```

```r
opts_knit$set(upload.fun = socialR::flickr.url)
opts_chunk$set(dev.args=list(bg="transparent"), 
               cache=TRUE, comment=NA, message=FALSE, 
               warning=FALSE, tidy=FALSE)

opts_chunk$set(fig.path = paste("figure/", 
                                gsub(".Rmd", "", 
                                     knitr:::knit_concord$get('infile')),
                                "-", sep=""), 
               cache.path = paste(gsub(".Rmd", "", 
                                       knitr:::knit_concord$get('infile') ), 
                                "/", sep="")
              )
require(ggplot2)
theme_notebook <- 
  theme_grey() + 
  theme(plot.background = element_rect(fill = "transparent", colour = NA),
        panel.grid.minor = element_line(colour = "transparent"))
theme_set(theme_notebook)
```

Based on Powell 2006, page 97. I try to conform to that notation throughout.

Setup

First we define the Beverton-Holt stock-recruitment relationship as a function of stock size `x`, harvest `h`, and parameters `p`:

```r
f <- function(x, h, p){
  A <- p[1]
  B <- p[2]
  s <- pmax(x - h, 0)   # escapement: stock remaining after harvest
  A * s / (1 + B * s)   # Beverton-Holt recruitment
}
p <- pars <- c(1.5, 0.5)
K <- (p[1] - 1) / p[2]  # carrying capacity of the deterministic model
sigma_g <- 0.2          # growth noise
```
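
As a quick, purely illustrative check, the unharvested carrying capacity $K = (A-1)/B$ should be a fixed point of the deterministic map:

```r
# Illustrative check: with no harvest, K = (A - 1)/B is a fixed point of f.
K            # (1.5 - 1)/0.5 = 1
f(K, 0, p)   # returns 1, i.e. the population stays at K
```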

We begin with a simulation method $X_{t+1} = f(X_t, Z_t)$.  For illustration, let us consider $f(X_t, Z_t) = Z_t \frac{a X_t}{1 + b X_t}$ with a = `r pars[1]` and b = `r pars[2]`, matching the function defined above.  We define a state space $S$


```r
S <- seq(0, 1, length=11)
```

as a uniform grid of `r length(S)` points from `r min(S)` to `r max(S)`.
We also need a value function on the state space, $C_t(S_t)$. For simplicity, we set the price of harvest at unity and the cost of harvesting at zero, so that $C_t(S_t, x_t) = \min(x_t, S_t)$.
($C_t$ is sometimes denoted $\mathbb{\Pi}$).
We also need an action space $\chi_t$ of possible harvest values.
Again for simplicity we assume that harvest can be set to any possible state size, $\chi_t \equiv S_t$:

```r
chi <- S
T <- 10
N <- 10
```

The approximate dynamic programming algorithm will perform a finite number $N$ = `r N` of iterations over a window of time $T$ = `r T` in our example. The algorithm can then be described as follows:

Algorithm (1)

```r
V <- numeric(length(S))         # initial guess: zero value at every state
S_0 <- 0.5                      # arbitrary starting state
sigma <- 0.2
omega_n <- rlnorm(T, 0, sigma)  # random shocks along the sample path
```

$$V_t(S_t) = \max_{x_t \in \chi_t} \left(C(S_t, x_t) + \gamma \sum_{s^{\prime} \in \mathcal{S}} \mathbb{P}(s^{\prime} | S_t^n, x_t) V_{t+1}^{n-1}(s^{\prime}) \right)$$

That is, choose action $x_t$ that maximizes the value of the next step.

Let's start with $t=0$, $n=1$, and fix an $x_0$ from the set $\chi$ (allowing the action space to be the same in each period, we can omit the subscript on $\chi$). We first compute $C(S_0, x_0)$.

$S_0 = S_0^1$, which we fixed in step 0b arbitrarily at `r S_0`.

The profits/costs $C(S_t, x_t)$ are the value derived by action (harvest) $x_t$ at state (stock) $S_t$. Assuming a fixed price and no costs to harvesting, this is just whichever number is smaller (since we cannot harvest more than the available stock):

```r
C <- function(S, X) pmin(S, X)
```

(where we have used `pmin`, R's vectorized form of the `min` function).
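
A quick illustration of that vectorization (the numbers are arbitrary):

```r
# pmin() recycles the scalar stock against a vector of candidate harvests:
C(0.5, c(0.2, 0.6, 1.0))
# [1] 0.2 0.5 0.5
```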

This forward dynamic programming will still rely on the one-step transition matrix, $\mathbb{P}$.

Let's get the transition matrices for this problem, assuming log-normal noise:

```r
sdp_matrix <- determine_SDP_matrix(f, p, x_grid=S, h_grid=chi, sigma_g)
```

This gives a list of matrices, one for each harvest (action) $x_t$.
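
To make that structure concrete, here is a minimal sketch of how one such matrix could be built by hand for a single harvest level, assuming multiplicative log-normal noise around the deterministic map. This is only an illustration of the idea; `determine_SDP_matrix()` from the `pdgControl` package handles the actual construction.

```r
# Sketch only: one-step transition probabilities for a fixed harvest h,
# assuming X_{t+1} = Z_t * f(X_t, h) with Z_t ~ lognormal(0, sigma_g).
one_step_matrix <- function(f, p, S, h, sigma_g){
  t(sapply(S, function(x){
    mu <- f(x, h, p)                    # deterministic next state
    if(mu == 0)                         # stock gone: absorbed at the lowest state
      return(c(1, rep(0, length(S) - 1)))
    w <- dlnorm(S / mu, 0, sigma_g)     # weight of landing at each grid point
    w / sum(w)                          # normalize each row to a distribution
  }))
}
P_example <- one_step_matrix(f, p, S, h = chi[1], sigma_g)
rowSums(P_example)         # each row sums to one
rowSums(sdp_matrix[[1]])   # the package version should have the same property
```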

Then we consider a fixed $S_t^n$ and a fixed $x_t$, and take the sum of $\mathbb{P}(s^{\prime} | S_t^n, x_t)$ over the $s^{\prime}$. This means we take the $x_t$ element of the list and then sum over the distribution of future states given the current state $S_t^n$, i.e. a row of that matrix, `sdp_matrix[[x]][s,]`, which we (vector) multiply by $V_{t+1}^{n-1}(s^{\prime})$.

This value $V$ is of course unknown, other than our initial guess $V_{t}^0$.
As we step through the iterations $V_t^1$, $V_t^2$, $V_t^3$, etc., this should converge to something meaningful.

Note that the index along $S$ corresponding to $S_t^n$ is given by

```r
s <- which.min(abs(S - S_0))
```

So our maximization across $x$ just involves:

```r
values <- 
  sapply(1:length(chi), function(x)
    C(S[s], chi[x]) + sdp_matrix[[x]][s,] %*% V)

max_x <- which.max(values)
v_hat <- max(values)
```

Trivially, this is just the harvest level that maximizes $C$ so far (which is simply harvesting all of $S_0$), since $\bar V^0_t$ begins at zero:

```r
chi[max_x]
```

$$V_t^n(S_t) = \begin{cases} \hat{v}_t^n & S_t = S_t^n \\ \bar{V}_t^{n-1}(S_t) & \textrm{otherwise} \end{cases}$$

That is, we use our maximum value for the case of the state we just considered, $S_t = S_t^n$, and otherwise leave $V_t$ unchanged. Our new $V$ is thus:

```r
V[s] <- v_hat
```

We compute the next state using our `max_x` for $x^n_t$, our random samples, and the transition function:

```r
S_1 <- omega_n[1] * f(S_0, chi[max_x], p)
```

Assembling the solution amounts to chaining these steps together through time and then repeating the whole pass over many iterations $n$.
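
Putting the pieces above together, a single forward pass over $t = 1, \dots, T$ might look like the following sketch, reusing the objects already defined (the full version, repeated over many iterations $n$, is developed in the next section):

```r
# Sketch: one forward pass, chaining together the single-step pieces above.
S_current <- S_0
for(t in 1:T){
  s <- which.min(abs(S - S_current))               # grid index of current state
  values <- sapply(1:length(chi), function(x)
    C(S[s], chi[x]) + sdp_matrix[[x]][s,] %*% V)   # one-step lookahead values
  max_x <- which.max(values)                       # greedy action
  V[s] <- max(values)                              # update value at visited state
  S_current <- omega_n[t] * f(S_current, chi[max_x], p)  # step along random path
}
```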

As a recursive algorithm

Things we want to keep track of: time, iterations, and the state/action space.

```r
N <- 5000      # iterations
M <- 20        # grid size
Tmax <- 5      # time horizon

gamma <- 0.95  # discount factor
# f and p as defined above

sigma_g <- 0.5 # larger variation in random draws helps
chi <- seq(0, 1, length.out = M)
S <- seq(0, 1, length.out = M)

sdp_matrix <- determine_SDP_matrix(f, p, x_grid=S, h_grid=chi, sigma_g)

V <- matrix(1, M, Tmax)  # optimistic initial values ("A*" strategy)
# Fails to explore when initialized at matrix(0, M, Tmax).
# Alternatives to consider: V <- matrix(rep(chi, Tmax), nrow=M), or V[,1] <- chi;
# it fails to explore if it doesn't have at least some non-zero values.

C <- function(S, X) pmin(S, X)
S_0 <- 0.5
alpha <- 1     # learning rate
```


```r
for(n in 1:N){

  omega_n <- rlnorm(Tmax, 0, sigma_g)  # random shocks for this sample path
  S_current <- S_0  # runif(1,0,1) # explores faster when this is random

  for(t in 1:Tmax){
    # index of the state we're considering
    s <- which.min(abs(S - S_current))

    # Find the action maximizing the one-step lookahead value
    values <- sapply(1:length(chi), function(x)
      C(S[s], chi[x]) + gamma * sdp_matrix[[x]][s,] %*% V[,t])
    hat <- c(x_nt = which.max(values), v_nt = max(values))

    # Update the value of the visited state s as a mixture of new and previous value
    V[s, t] <- (1 - alpha) * V[s, t] + alpha * hat["v_nt"]

    # Advance the state in time along the random path
    S_current <- omega_n[t] * f(S_current, chi[hat["x_nt"]], p)

  }
}
```

For comparison: the SDP solution

```r
opt <- find_dp_optim(sdp_matrix, S, chi, 70, 0.5, C, 1-gamma, reward=0)
opt$V
```
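
A rough visual comparison can help here. This is only a sketch: it assumes `opt$V` can be coerced to one value per grid point on `S` (check `str(opt$V)` for its actual shape in your version of `pdgControl`), and it uses the `reshape2` and `ggplot2` packages loaded at the top.

```r
# Sketch: compare the ADP approximation (first time step) with the SDP values,
# assuming opt$V yields one value per state on the grid S.
df <- data.frame(S   = S,
                 ADP = V[, 1],
                 SDP = as.numeric(opt$V)[seq_along(S)])
df <- reshape2::melt(df, id.vars = "S")
ggplot(df, aes(S, value, colour = variable)) + geom_line()
```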

Problems arising from the discretization

Note that after the first iteration, $n=1$, the value matrix $V$ is no longer all zeros. There is a single state, $S = S_0 =$ `r S_0`, at which we have value. That value is lost if we set the harvest $x$ too high, since we then know we will not end up in that state; this is where the incentive to consider future value comes from. Unfortunately, the value exists only if we hit that state exactly; all other states are still assumed to have zero value.
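
An illustrative way to see this, assuming the `matrix(0, M, Tmax)` initialization mentioned in the comments above and stopping after a single iteration, is to ask which grid points have acquired any value at all:

```r
# Sketch: which states carry nonzero value after the first pass?
# (Assumes V was initialized at zero and only one iteration has been run.)
visited <- which(rowSums(V) > 0)
data.frame(index = visited, state = S[visited])
```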

Additional problems

We no longer have the loop-over-all-states problem, but we face several new or remaining issues:

  1. We still require the use of the one-step transition matrix, with the equally troublesome sum over all states $\sum_{s^{\prime}\in S} \mathbb{P}(s^{\prime} | S_t^n, x_t)$. We will fix this by approximating the transitions in step 2b using random draws as well.

  2. We only update the values of states we visit. We still need a way to estimate the value of states we have not visited.

  3. Worse, we might not visit states that seem bad relative to states we have visited. This is particularly atrocious in this example. Since we initialize the value of all states at 0, the algorithm prefers to harvest all stock from the current state rather than risk a transition into a state starting at 0. There is no convergence guarantee that we will ever escape this cycle of avoiding states we have not seen. We can of course alter the initial guess of the value, and we could alter the starting condition to better explore (a minimal sketch of such a tweak follows this list).
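
One way to act on the last point, which the comments in the code already hint at (`runif(1, 0, 1)` for the starting state), is to randomize where each pass starts and, optionally, to take an occasional random action. The latter is the familiar epsilon-greedy rule, which is not used in the algorithm above; a minimal sketch:

```r
# Sketch: epsilon-greedy action selection (an exploration tweak, not part of
# the algorithm above). `values` is the vector of one-step lookahead values
# computed inside the loop.
choose_action <- function(values, epsilon = 0.1){
  if(runif(1) < epsilon)
    sample(seq_along(values), 1)   # occasionally pick a random action
  else
    which.max(values)              # otherwise act greedily
}
# Inside the loop, one would also randomize the start: S_current <- runif(1, 0, 1)
```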

Using non-stochastic transition information only, step 2b can be written as:

Taking $x_0$ as the smallest harvest, $\min(\chi)$ = `r min(chi)`, and evaluating $C(S_0,X_0) = \min(S_0, X_0)$ gives us `r min(S_0, min(chi))`, rather trivially.
The next terms depend on the value $\tilde V^0_1(s^{\prime})$ for all $s^{\prime} \in S$, which we have no idea about. Fortunately we have assumed a value for each of these in step 0a.

We must also come up with some values for the probability $\mathbb{P}(s^{\prime} | S_1^0, x_1)$ for each state, given our current state $S_1^0$ and considered action $x_1$. This is more straightforward, since it is determined by our one-step transition function (without simulation; recall that the single-step transition is given exactly).

To do so, we evaluate the argument for each value in our action space, $x_t \in \chi_t$:

```r
s <- S_0
C <- function(S, X) pmin(S, X)
arg <- sapply(chi, function(x) C(s, x) + f(S, x, p) %*% V)
x_nt <- which.max(arg)
v_nt <- max(arg)
V[x_nt] <- v_nt
```
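
A hedged alternative reading of the same step: with a purely deterministic transition, the expectation over $s^{\prime}$ collapses to the single next state reached from $S_0$ under action $x$, whose grid index we can look up directly. This sketch assumes `V` is the length-`length(S)` value vector from the single-step example above, and reuses the discount factor `gamma`.

```r
# Sketch: deterministic one-step lookahead (no transition matrix needed).
arg_det <- sapply(chi, function(x){
  s_next <- which.min(abs(S - f(S_0, x, p)))  # grid index of the deterministic next state
  C(S_0, x) + gamma * V[s_next]
})
chi[which.max(arg_det)]
```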

