mdp_planning: MDP planning


Description

Simulate a Markov decision process (MDP) under a given policy.

Usage

mdp_planning(transition, reward, discount, model_prior = NULL, x0,
  Tmax = 20, observation = NULL, a0 = 1, policy, ...)

Arguments

transition

list of transition matrices, one per model

reward

the utility matrix U(x,a) of being at state x and taking action a

discount

the discount factor (1 is no discounting)

model_prior

the prior belief over models, a numeric vector with one entry per transition matrix. Uniform by default

x0

initial state

Tmax

length of time to simulate

observation

NULL by default, in which case observations are simulated as perfect (the true state is observed exactly); otherwise, an observation model used to simulate imperfect observations of the state

a0

the action taken before the simulation starts; irrelevant unless actions influence observations and observation is not NULL

policy

a vector of length n_obs, whose i-th entry is the index of the optimal action to take when the system is in (observed) state i (see the sketch after this list).

...

additional arguments to mdp_compute_policy
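
As a minimal sketch of how the policy argument is interpreted (toy values, not taken from the package): the vector is indexed by the observed state, and the stored entry is the index of the action to take.

## Toy illustration: a policy over 3 observable states and 2 possible actions
policy <- c(2, 1, 1)      # hypothetical values, for illustration only
observed_state <- 1
policy[observed_state]    # action index taken when state 1 is observed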

Value

a data frame with the state, action, and value at each time step of the simulation

Examples

source(system.file("examples/K_models.R", package = "mdplearning"))  # provides `models` (and, presumably, `discount`)
transition <- lapply(models, `[[`, "transition")  # one transition matrix per model
reward <- models[[1]]$reward                      # reward matrix U(x, a) of the first model

## Compute a policy under a uniform prior over the two models,
## then simulate it under the first model with perfect observations
df <- mdp_compute_policy(transition, reward, discount, model_prior = c(0.5, 0.5))
out <- mdp_planning(transition[[1]], reward, discount, x0 = 10,
               Tmax = 20, policy = df$policy)

## Simulate MDP strategy under observation uncertainty
out <- mdp_planning(transition[[1]], reward, discount, x0 = 10,
               Tmax = 20, policy = df$policy,
               observation = models[[1]]$observation)
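
As a follow-up, the simulated trajectory can be inspected directly. A minimal sketch, assuming the returned data frame uses the column names state, action, and value described under Value (names may differ in the installed version):

head(out)                       # first few rows of the simulated trajectory
plot(out$state, type = "l",     # state over the Tmax simulated time steps
     xlab = "time step", ylab = "state")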
