simulate_MDP: Simulate Trajectories in an MDP

View source: R/simulate_MDP.R

simulate_MDP    R Documentation

Simulate Trajectories in an MDP

Description

Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified probability distribution over the states. In each epoch, an action is chosen following an epsilon-greedy policy and the state is then updated using the model's transition probabilities.

Usage

simulate_MDP(
  model,
  n = 100,
  start = NULL,
  horizon = NULL,
  return_states = FALSE,
  epsilon = NULL,
  engine = "cpp",
  verbose = FALSE,
  ...
)

Arguments

model

an MDP model.

n

number of trajectories.

start

probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".

horizon

number of epochs for the simulation. If NULL then the horizon for the model is used.

return_states

logical; if TRUE, the visited states are returned.

epsilon

the probability of choosing a random action for the epsilon-greedy policy. The default is 0 for solved models and 1 for unsolved models.

engine

'cpp' or 'r' to perform the simulation using either the faster C++ implementation or the native R implementation, which supports sparse matrices.

verbose

report the parameters used for the simulation.

...

further arguments are ignored.

Details

A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').

Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()). Note that small simulations are slower when run in parallel; therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
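
For example, parallel execution could be set up as sketched below (a rough illustration, assuming the doParallel and foreach packages are installed; the model and solver call mirror the Examples section):

library("pomdp")
library("doParallel")
data(Maze)
sol <- solve_MDP(Maze, discount = 1)

registerDoParallel(cores = 2)
# keep n * horizon at or above 100,000 so the C++ engine actually uses the workers
sim <- simulate_MDP(sol, n = 10000, horizon = 10, verbose = TRUE)
# the native R engine (with sparse matrix support) can be requested explicitly
sim_r <- simulate_MDP(sol, n = 100, horizon = 10, engine = "r")
foreach::registerDoSEQ()  # switch back to sequential execution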

Value

A list with elements:

  • avg_reward: The average discounted reward.

  • reward: Reward for each trajectory.

  • action_cnt: Action counts.

  • state_cnt: State counts.

  • states: the ids of the visited states (in the final epoch or, with return_states = TRUE, for all epochs). Rows represent trajectories, and attributes containing action counts and rewards for each trajectory may be available.

Author(s)

Michael Hahsler

See Also

Other MDP: MDP(), POMDP_accessors, solve_MDP(), transition_graph()

Examples

data(Maze)

# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol
policy(sol)
# U in the policy is an estimate of the utility of being in a state when following the optimal policy.

## Example 1: simulate 100 trajectories (visited states are not returned)
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim

# Calculate proportion of actions used
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)

# reward distribution
hist(sim$reward)
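
## (sketch) epsilon controls the fraction of random actions. For the solved model
## above, epsilon defaults to 0 (greedy); epsilon = 0.1 below is only an illustration
## that forces 10% exploratory actions. The object name sim_explore is arbitrary.
sim_explore <- simulate_MDP(sol, n = 100, horizon = 10, epsilon = 0.1)
sim_explore$avg_reward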

## Example 2: simulate starting always in state s_1 and return all visited states
sim <- simulate_MDP(sol, n = 100, start = "s_1", horizon = 10, return_states = TRUE)
sim$avg_reward

# how often was each state visited?
table(sim$states)
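
## (sketch) start can also be given as a probability distribution over the states.
## The uniform vector below is only an illustration; it assumes the state names are
## available as Maze$states (as for pomdp model lists) and are given in model order.
start_dist <- rep(1 / length(Maze$states), length(Maze$states))
sim <- simulate_MDP(sol, n = 100, start = start_dist, horizon = 10)
sim$avg_reward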
