simulate_MDP R Documentation
Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified probability distribution. Actions are chosen following an epsilon-greedy policy and the state is updated accordingly.
simulate_MDP(
model,
n = 100,
start = NULL,
horizon = NULL,
return_states = FALSE,
epsilon = NULL,
engine = "cpp",
verbose = FALSE,
...
)
model: an MDP model.
n: number of trajectories.
start: probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".
horizon: number of epochs for the simulation. If NULL, then the horizon of the model is used.
return_states: logical; return the visited states.
epsilon: the probability of random actions for using an epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models (see the sketch after this argument list).
engine: "cpp" or "r" to use the faster C++ implementation or the native R implementation.
verbose: logical; report the used parameters.
...: further arguments are ignored.
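The following is a minimal sketch (not part of the original examples) illustrating the effect of epsilon, assuming the Maze example model that ships with the package: epsilon = 0 follows the computed policy, while epsilon = 1 chooses actions uniformly at random.
data(Maze)
sol <- solve_MDP(Maze, discount = 1)
# epsilon = 0 is the default for solved models: follow the computed policy
sim_greedy <- simulate_MDP(sol, n = 100, horizon = 10)
# epsilon = 1: choose actions uniformly at random
sim_random <- simulate_MDP(sol, n = 100, horizon = 10, epsilon = 1)
c(greedy = sim_greedy$avg_reward, random = sim_random$avg_reward)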
A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').
Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()).
Note that small simulations are slower when run in parallel. Therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
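As a hedged sketch (assuming the doParallel package is installed and a solved model sol as created in the examples below), a backend can be registered like this before running a large simulation:
library(doParallel)
registerDoParallel(cores = 2)   # register two workers for foreach
# a large simulation (n * horizon >= 100,000) can now run in parallel
sim_par <- simulate_MDP(sol, n = 10000, horizon = 100)
stopImplicitCluster()           # release the workers when done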
A list with elements:
avg_reward: the average discounted reward.
reward: the reward for each trajectory.
action_cnt: action counts.
state_cnt: state counts.
states: a vector with state ids (in the final epoch or all visited states); rows represent trajectories. Attributes containing action counts and rewards for each trajectory may be available.
Michael Hahsler
Other MDP: MDP(), POMDP_accessors, solve_MDP(), transition_graph()
data(Maze)
# solve the MDP using no discounting
sol <- solve_MDP(Maze, discount = 1)
sol
policy(sol)
# U in the policy is an estimate of the utility of being in a state when using the optimal policy.
## Example 1: simulate 100 trajectories of length 10
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim
# Calculate proportion of actions used
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)
# reward distribution
hist(sim$reward)
## Example 2: simulate starting always in state s_1 and return all visited states
sim <- simulate_MDP(sol, n = 100, start = "s_1", horizon = 10, return_states = TRUE)
sim$avg_reward
# how often was each state visited?
table(sim$states)
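As a small follow-up sketch, the visit counts from Example 2 can be turned into proportions using base R:
# proportion of epochs spent in each state (builds on sim from Example 2)
round(prop.table(table(sim$states)), 2)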