solve_MDP (R Documentation)

Description

A simple implementation of value iteration and modified policy iteration.
Usage

solve_MDP(
  model,
  horizon = NULL,
  discount = NULL,
  terminal_values = NULL,
  method = "value",
  eps = 0.01,
  max_iterations = 1000,
  k_backups = 10,
  verbose = FALSE
)

q_values_MDP(model, U = NULL)

random_MDP_policy(model, prob = NULL)

approx_MDP_policy_evaluation(pi, model, U = NULL, k_backups = 10)
Arguments

model
a POMDP problem specification created with POMDP().

horizon
an integer with the number of epochs for problems with a finite planning horizon. If set to Inf, the algorithm continues running until it converges to the infinite-horizon solution.

discount
discount factor in range [0, 1]. If NULL, the discount factor specified in the model is used.

terminal_values
a vector with terminal utilities for each state. If NULL, a vector of all zeros is used.

method
string; one of the following solution methods: "value" (value iteration) or "policy" (modified policy iteration). A sketch of the Bellman backup both methods are built on follows this list.

eps
maximum error allowed in the utility of any state (i.e., the maximum policy loss).

max_iterations
maximum number of iterations allowed to converge. If the maximum is reached, the non-converged solution is returned with a warning.

k_backups
number of look-ahead steps used for approximate policy evaluation by method "policy".

verbose
logical; if set to TRUE, the function provides the output of the solver in the R console.

U
a vector with state utilities (the expected sum of discounted rewards from that point on).

prob
a probability vector for the actions.

pi
a policy as a data.frame with columns state and action.
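Both methods are built on the Bellman backup. The following is a minimal illustrative sketch, not part of the package API, of one value-iteration sweep; the arrays P (transition probabilities, state x action x state), R (expected rewards, state x action), the discount factor gamma, and the utility vector U are all assumed to be given:

## Illustrative sketch only: one sweep of value iteration over a generic MDP.
## P[s, a, s2] = Pr(s2 | s, a); R[s, a] = expected immediate reward.
value_iteration_sweep <- function(P, R, gamma, U) {
  n_states  <- dim(P)[1]
  n_actions <- dim(P)[2]
  U_new <- numeric(n_states)
  for (s in seq_len(n_states)) {
    # Q(s, a) = R(s, a) + gamma * sum_s2 P(s2 | s, a) * U(s2)
    Q <- sapply(seq_len(n_actions), function(a)
      R[s, a] + gamma * sum(P[s, a, ] * U))
    U_new[s] <- max(Q)  # greedy backup over actions
  }
  U_new
}

Sweeps are repeated until the largest change in U drops below a bound derived from eps, or until max_iterations is reached.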
Value

solve_MDP() returns an object of class POMDP, which is a list with the model specification (model) and the solution (solution).

The solution is a list with the elements:

policy
a list representing the policy graph. The list has only one element for converged solutions.

converged
did the algorithm converge? NA for finite-horizon problems.

delta
the final delta (infinite-horizon only).

iterations
the number of iterations until convergence (infinite-horizon only).
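As a sketch of how the returned solution can be inspected (using the Maze example from the Examples section below; the element names are those described above):

library("pomdp")
data(Maze)
sol <- solve_MDP(Maze)
sol$solution$converged    # TRUE/FALSE; NA for finite-horizon problems
sol$solution$iterations   # iterations until convergence (infinite-horizon only)
sol$solution$policy[[1]]  # the single policy element of a converged solution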
q_values_MDP() returns a state-by-action matrix specifying the Q-function, i.e., the utility value of executing each action in each state.
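For example, a greedy policy can be read off this matrix by choosing, for each state, the action with the largest Q-value. A short sketch, assuming the actions label the columns of the matrix and using maze_solved from the Examples below:

Q <- q_values_MDP(maze_solved)                 # state-by-action matrix
greedy <- colnames(Q)[apply(Q, 1, which.max)]  # best action per state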
random_MDP_policy() returns a data.frame with columns state and action that defines a policy.

approx_MDP_policy_evaluation() is used by the modified policy iteration algorithm and returns an approximate utility vector U estimated by evaluating policy pi.
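Conceptually, the approximation performs k_backups utility backups with the actions fixed by pi. A minimal illustrative sketch, again with the generic arrays P and R assumed above and with pi_a[s] giving the index of the action that pi chooses in state s:

## Illustrative sketch only: k-step approximate evaluation of a fixed policy.
approx_policy_eval_sketch <- function(P, R, gamma, pi_a, k_backups = 10, U = NULL) {
  n_states <- dim(P)[1]
  if (is.null(U)) U <- numeric(n_states)  # default: all-zero utilities
  for (k in seq_len(k_backups)) {
    # U(s) <- R(s, pi(s)) + gamma * sum_s2 P(s2 | s, pi(s)) * U(s2)
    U <- sapply(seq_len(n_states), function(s)
      R[s, pi_a[s]] + gamma * sum(P[s, pi_a[s], ] * U))
  }
  U
}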
Author(s)

Michael Hahsler
See Also

Other solver: solve_POMDP(), solve_SARSOP()

Other MDP: MDP(), POMDP_accessors, simulate_MDP(), transition_graph()
Examples

data(Maze)
Maze
# use value iteration
maze_solved <- solve_MDP(Maze, method = "value")
policy(maze_solved)
# value function (utility function U)
plot_value_function(maze_solved)
# Q-function (a state-by-action matrix)
q_values_MDP(maze_solved)
# use modified policy iteration
maze_solved <- solve_MDP(Maze, method = "policy")
policy(maze_solved)
# finite horizon
maze_solved <- solve_MDP(Maze, method = "value", horizon = 3)
policy(maze_solved)
# create a random policy where action n is very likely and approximate
# the value function. We change the discount factor to .9 for this.
Maze_discounted <- Maze
Maze_discounted$discount <- .9
pi <- random_MDP_policy(Maze_discounted, prob = c(n = .7, e = .1, s = .1, w = 0.1))
pi
# compare the utility function for the random policy with the utility
# function for the optimal policy found by the solver.
maze_solved <- solve_MDP(Maze)
approx_MDP_policy_evaluation(pi, Maze, k_backups = 100)
approx_MDP_policy_evaluation(policy(maze_solved)[[1]], Maze, k_backups = 100)
# Note that the solver already calculates the utility function and returns it with the policy
policy(maze_solved)