R/POMDP.R

#' Define a POMDP Problem
#'
#' Defines all the elements of a POMDP problem including the discount rate, the
#' set of states, the set of actions, the set of observations, the transition
#' probabilities, the observation probabilities, and rewards.
#'
#' We use the following notation. The POMDP is a 7-tuple:
#'
#' \eqn{(S, A, T, R, \Omega, O, \gamma)}.
#'
#' \eqn{S} is the set of states; \eqn{A}
#' is the set of actions; \eqn{T} are the conditional transition probabilities
#' between states; \eqn{R} is the reward function; \eqn{\Omega} is the set of
#' observations; \eqn{O} are the conditional observation probabilities; and
#' \eqn{\gamma} is the discount factor. We will use lower case letters to
#' represent a member of a set, e.g., \eqn{s} is a specific state. To refer to
#' the size of a set we will use cardinality, e.g., the number of actions is
#' \eqn{|A|}.
#'
#' Note that the observation model is in the literature
#' often also denoted by the letter \eqn{Z}.
#'
#' **Names used for mathematical symbols in code**
#'
#' * \eqn{S, s, s'}: `'states', 'start.state', 'end.state'`
#' * \eqn{A, a}: `'actions', 'action'`
#' * \eqn{\Omega, o}: `'observations', 'observation'`
#'
#' State names, actions and observations can be specified as strings or index numbers
#' (e.g., `start.state` can be specified as the index of the state in `states`).
#' For the specification as data.frames below, `NA` can be used to mean
#' any  `start.state`, `end.state`, `action` or `observation`. Note that some POMDP solvers and the POMDP
#' file format use `'*'` for this purpose.
#'
#' The specifications below map to the format used by pomdp-solve
#' (see \url{http://www.pomdp.org}).
#'
#' **Specification of transition probabilities: \eqn{T(s' | s, a)}**
#'
#' The probability to transition to state \eqn{s'} given the current state \eqn{s}
#' and action \eqn{a}. The transition probabilities can be
#' specified in the following ways:
#'
#' * A data.frame with columns exactly like the arguments of `T_()`.
#'   You can use `rbind()` with helper function `T_()` to create this data
#'   frame. Probabilities can be specified multiple times and the definition that
#'   appears last in the data.frame will take effect (see the example below).
#'
#' * A named list of matrices, one for each action. Each matrix is square with
#'   rows representing start states \eqn{s} and columns representing end states \eqn{s'}.
#'   Instead of a matrix, also the strings `'identity'` or `'uniform'` can be specified.
#'
#' * A function with the same arguments as `T_()` (but no default values)
#'   that returns the transition probability.
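#'
#' As a sketch, the data.frame form for the Tiger problem defined in the
#' Examples section below could be built with `T_()` (later rows overwrite
#' earlier ones):
#'
#' ```
#' rbind(
#'   T_("listen",     probability = 0),               # start with all 'listen' entries at 0
#'   T_("listen", "tiger-left",  "tiger-left",  1),   # then set the diagonal (identity)
#'   T_("listen", "tiger-right", "tiger-right", 1),
#'   T_("open-left",  probability = 0.5),             # uniform over the two states
#'   T_("open-right", probability = 0.5)
#' )
#' ```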
#'
#' **Specification of observation probabilities: \eqn{O(o | a, s')}**
#'
#' The POMDP specifies the probability for each observation \eqn{o} given an
#' action \eqn{a} and that the system transitioned to the end state
#' \eqn{s'}. These probabilities can be specified in the
#' following ways:
#'
#' * A data frame with columns named exactly like the arguments of `O_()`.
#'   You can use `rbind()`
#'   with helper function `O_()` to create this data frame. Probabilities can be
#'   specified multiple times and the definition that
#'   appears last in the data.frame will take effect (see the example below).
#'
#' * A named list of matrices, one for each action. Each matrix has
#'   rows representing end states \eqn{s'} and columns representing an observation \eqn{o}.
#'   Instead of a matrix, also the string `'uniform'` can be specified.
#'
#' * A function with the same arguments as `O_()` (but no default values)
#'   that returns the observation probability.
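#'
#' As a sketch, the observation probabilities of the Tiger problem defined in
#' the Examples section below could be given as a data frame built with `O_()`:
#'
#' ```
#' rbind(
#'   O_("listen", "tiger-left",  "tiger-left",  0.85),  # hear the tiger behind the correct door
#'   O_("listen", "tiger-left",  "tiger-right", 0.15),  # ... with a 15% error
#'   O_("listen", "tiger-right", "tiger-right", 0.85),
#'   O_("listen", "tiger-right", "tiger-left",  0.15),
#'   O_("open-left",  probability = 0.5),               # opening a door gives no information
#'   O_("open-right", probability = 0.5)
#' )
#' ```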
#'
#' **Specification of the reward function: \eqn{R(a, s, s', o)}**
#'
#' The reward function can be specified in the following
#' ways:
#'
#' * A data frame with columns named exactly like the arguments of `R_()`.
#'   You can use `rbind()`
#'   with helper function `R_()` to create this data frame. Rewards can be specified
#'   multiple times and the definition that
#'   appears last in the data.frame will take effect.
#'
#' * A list of lists. The list levels are `'action'` and `'start.state'`. The list elements
#'   are matrices with
#'   rows representing end states \eqn{s'} and columns representing an observation \eqn{o}.
#'
#' * A function with the same arguments as `R_()` (but no default values)
#'   that returns the reward.
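#'
#' As a sketch, the list-of-lists form of the Tiger problem reward from the
#' Examples section below would be (each matrix has rows for end states
#' \eqn{s'} and columns for observations \eqn{o}):
#'
#' ```
#' list(
#'   "listen" = list(
#'     "tiger-left"  = matrix(  -1, nrow = 2, ncol = 2),
#'     "tiger-right" = matrix(  -1, nrow = 2, ncol = 2)
#'   ),
#'   "open-left" = list(
#'     "tiger-left"  = matrix(-100, nrow = 2, ncol = 2),
#'     "tiger-right" = matrix(  10, nrow = 2, ncol = 2)
#'   ),
#'   "open-right" = list(
#'     "tiger-left"  = matrix(  10, nrow = 2, ncol = 2),
#'     "tiger-right" = matrix(-100, nrow = 2, ncol = 2)
#'   )
#' )
#' ```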
#'   
#' To avoid overflow problems with rewards, reward values should stay well within the 
#' range of 
#' `[-1e10, +1e10]`. `-Inf` can be used as the reward for unavailable actions and
#' will be translated into a large negative reward for solvers that only support
#' finite reward values. 
#'
#' **Start Belief**
#'
#' The initial belief state of the agent is a distribution over the states. It is used to calculate the
#' total expected cumulative reward printed with the solved model. The function [reward()] can be
#' used to calculate rewards for any belief.
#'
#' Some methods use this belief to decide which belief states to explore (e.g.,
#' the finite grid method).
#'
#' Options to specify the start belief state are:
#'
#' * A probability distribution over the states. That is, a vector
#'   of \eqn{|S|} probabilities that add up to \eqn{1}.
#'
#' * The string `"uniform"` for a uniform
#'   distribution over all states.
#' * An integer in the range \eqn{1} to \eqn{|S|} to specify the index of a single starting state.
#' * A string specifying the name of a single starting state.
#'
#' The default initial belief is a uniform
#' distribution over all states.
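#'
#' For the two-state Tiger problem defined in the Examples section below, the
#' following `start` specifications illustrate the options above (the first two
#' and the last two are equivalent):
#'
#' ```
#' start = c(0.5, 0.5)     # explicit belief distribution over the states
#' start = "uniform"       # uniform distribution (the default)
#' start = 2               # the single state "tiger-right", given by index
#' start = "tiger-right"   # the same single state, given by name
#' ```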
#'
#' **Convergence**
#'
#' An infinite-horizon POMDP needs to converge to provide a valid value
#' function and policy.
#'
#' A finite-horizon POMDP may also converge to the infinite-horizon solution
#' if the horizon is long enough.
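#'
#' Convergence can be checked with `is_converged_POMDP()`. A minimal sketch,
#' assuming the Tiger problem from the Examples section below has been solved
#' with the default settings of [solve_POMDP()]:
#'
#' ```
#' sol <- solve_POMDP(Tiger)
#' is_converged_POMDP(sol)   # TRUE if the solution has converged
#' ```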
#'
#'
#' **Time-dependent POMDPs**
#'
#' Time dependence of transition probabilities, observation probabilities and
#' reward structure can be modeled by considering a set of **episodes**
#' representing **epochs** with the same settings. The length of each episode is
#' specified as a vector for `horizon`, where the length is the number of
#' episodes and each value is the length of the episode in epochs. Transition
#' probabilities, observation probabilities and/or reward structure can contain
#' a list with the values for each episode. The helper function `epoch_to_episode()` converts
#' an epoch to the episode it belongs to.
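#'
#' As a sketch, with `horizon = c(3, 3)` as in the Scared Tiger example below,
#' epochs 1 to 3 belong to episode 1 and epochs 4 to 6 to episode 2:
#'
#' ```
#' epoch_to_episode(Tiger_time_dependent, 2)   # returns 1
#' epoch_to_episode(Tiger_time_dependent, 4)   # returns 2
#' ```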
#'
#' @family POMDP
#' @family POMDP_examples
#'
#' @param states a character vector specifying the names of the states. Note that
#' state names have to start with a letter.
#' @param actions a character vector specifying the names of the available actions.
#' Note that action names have to start with a letter.
#' @param observations a character vector specifying the names of the
#' observations. Note that observation names have to start with a letter.
#' @param transition_prob Specifies action-dependent transition probabilities
#' between states.  See Details section.
#' @param observation_prob Specifies the probability that an action/state
#' combination produces an observation.  See Details section.
#' @param reward Specifies the rewards structure dependent on action, states
#' and observations.  See Details section.
#' @param discount numeric; discount factor between 0 and 1.
#' @param horizon numeric; number of epochs. `Inf` specifies an infinite
#' horizon.
#' @param terminal_values a vector with the terminal values for each state or a
#' matrix specifying the terminal rewards via a terminal value function (e.g.,
#' the alpha component produced by `solve_POMDP()`).  A single 0 specifies that all
#' terminal values are zero.
#' @param start Specifies the initial belief state of the agent. A vector with the
#' probability for each state is supplied. Also the string `'uniform'`
#' (default) can be used.  The belief is used to calculate the total expected cumulative
#' reward. It is also used by some solvers. See Details section for more
#' information.
#' @param info A list with additional information.
#' @param name a string to identify the POMDP problem.
#' @param action,start.state,end.state,observation,probability,value Values
#' used in the helper functions `O_()`, `R_()`, and `T_()` to
#' create an entry for `observation_prob`, `reward`, or
#' `transition_prob` above, respectively. The default value `NA`
#' matches any action/state/observation.
#'
#' @return The function returns an object of class POMDP which is a list of the model specification.
#' [solve_POMDP()] reads the object and adds a list element named
#' `'solution'`.
#' @author Hossein Kamalzadeh, Michael Hahsler
#' @references
#' pomdp-solve website: \url{http://www.pomdp.org}
#' @examples
#' ## Defining the Tiger Problem (it is also available via data(Tiger), see ? Tiger)
#'
#' Tiger <- POMDP(
#'   name = "Tiger Problem",
#'   discount = 0.75,
#'   states = c("tiger-left" , "tiger-right"),
#'   actions = c("listen", "open-left", "open-right"),
#'   observations = c("tiger-left", "tiger-right"),
#'   start = "uniform",
#'
#'   transition_prob = list(
#'     "listen" =     "identity",
#'     "open-left" =  "uniform",
#'     "open-right" = "uniform"
#'   ),
#'
#'   observation_prob = list(
#'     "listen" = rbind(c(0.85, 0.15),
#'                      c(0.15, 0.85)),
#'     "open-left" =  "uniform",
#'     "open-right" = "uniform"
#'   ),
#'
#'   # the reward helper expects: action, start.state, end.state, observation, value
#'   # missing arguments default to NA which matches any value (often denoted as * in POMDPs).
#'   reward = rbind(
#'     R_("listen",                    v =   -1),
#'     R_("open-left",  "tiger-left",  v = -100),
#'     R_("open-left",  "tiger-right", v =   10),
#'     R_("open-right", "tiger-left",  v =   10),
#'     R_("open-right", "tiger-right", v = -100)
#'   )
#' )
#'
#' Tiger
#'
#' ### Defining the Tiger problem using functions
#'
#' trans_f <- function(action, start.state, end.state) {
#'   if(action == 'listen')
#'     if(end.state == start.state) return(1)
#'     else return(0)
#'
#'   return(1/2) ### all other actions have a uniform distribution
#' }
#'
#' obs_f <- function(action, end.state, observation) {
#'   if(action == 'listen')
#'     if(end.state == observation) return(0.85)
#'     else return(0.15)
#'
#'   return(1/2)
#' }
#'
#' rew_f <- function(action, start.state, end.state, observation) {
#'   if(action == 'listen') return(-1)
#'   if(action == 'open-left' && start.state == 'tiger-left') return(-100)
#'   if(action == 'open-left' && start.state == 'tiger-right') return(10)
#'   if(action == 'open-right' && start.state == 'tiger-left') return(10)
#'   if(action == 'open-right' && start.state == 'tiger-right') return(-100)
#'   stop('Not possible')
#' }
#'
#' Tiger_func <- POMDP(
#'   name = "Tiger Problem",
#'   discount = 0.75,
#'   states = c("tiger-left" , "tiger-right"),
#'   actions = c("listen", "open-left", "open-right"),
#'   observations = c("tiger-left", "tiger-right"),
#'   start = "uniform",
#'   transition_prob = trans_f,
#'   observation_prob = obs_f,
#'   reward = rew_f
#' )
#'
#' Tiger_func
#'
#' # Defining a Time-dependent version of the Tiger Problem called Scared Tiger
#'
#' # The tiger reacts normally for 3 epochs (goes randomly to one
#' # of the two doors when a door was opened). After 3 epochs he gets
#' # scared and when a door is opened then he always goes to the other door.
#'
#' # specify the horizon for each of the two different episodes
#' Tiger_time_dependent <- Tiger
#' Tiger_time_dependent$name <- "Scared Tiger Problem"
#' Tiger_time_dependent$horizon <- c(normal_tiger = 3, scared_tiger = 3)
#' Tiger_time_dependent$transition_prob <- list(
#'   normal_tiger = list(
#'     "listen" = "identity",
#'     "open-left" = "uniform",
#'     "open-right" = "uniform"),
#'   scared_tiger = list(
#'     "listen" = "identity",
#'     "open-left" = rbind(c(0, 1), c(0, 1)),
#'     "open-right" = rbind(c(1, 0), c(1, 0))
#'   )
#' )
#' @export
POMDP <- function(states,
                  actions,
                  observations,
                  transition_prob,
                  observation_prob,
                  reward,
                  discount = .9,
                  horizon = Inf,
                  terminal_values = NULL,
                  start = "uniform",
                  info = NULL,
                  name = NA) {
  ### unsolved pomdp model
  x <- list(
    name = name,
    discount = discount,
    horizon = horizon,
    states = states,
    actions = actions,
    observations = observations,
    transition_prob = transition_prob,
    observation_prob = observation_prob,
    reward = reward,
    start = start,
    terminal_values = terminal_values,
    info = info
  )
  
  class(x) <- c("POMDP", "list")
  x <- check_and_fix_MDP(x)
  
  x
}

#' @export
print.POMDP <- function(x, ...) {
  writeLines(paste(paste(class(x), collapse = ", "),
                   "-",
                   x$name))
  
  if (!is.null(x$discount))
    writeLines(sprintf("  Discount factor: %s",
                       paste(x$discount, collapse = "+")))
  
  if (!is.null(x$horizon))
    writeLines(sprintf("  Horizon: %s epochs",
                       paste(x$horizon, collapse = " + ")))
  
  writeLines(sprintf(
    "  Size: %d states / %d actions / %d obs.",
    length(x$states),
    length(x$actions),
    length(x$observations)
  ))
  
  writeLines(paste0("  Start: ", shorten(paste(
    x$start, collapse = ", "
  ), n = -10L)))
  
  if (is_solved_POMDP(x))
    writeLines(c(
      "  Solved:",
      sprintf("    Method: %s",
              sQuote(x$solution$method)),
      sprintf("    Solution converged: %s",
              x$solution$converged),
      sprintf("    # of alpha vectors: %i",
              sum(sapply(
                x$solution$alpha, nrow
              ))),
      sprintf(
        "    Total expected reward: %f",
        x$solution$total_expected_reward
      ),
      ""
    ))
  else
    writeLines(c("  Solved: FALSE\n"))
  
  
  writeLines(strwrap(
    paste("List components:", paste(sQuote(names(
      x
    )), collapse = ", "), "\n"),
    indent = 2,
    exdent = 4
  ))
}

# check if x is a solved POMDP
#' @rdname POMDP
#' @param x a POMDP.
#' @param stop logical; stop with an error.
#' @export
is_solved_POMDP <- function(x, stop = FALSE, message = "") {
  if (!inherits(x, "POMDP"))
    stop("x needs to be a POMDP object!")
  
  solved <- !is.null(x$solution)
  if (stop && !solved)
    stop("x needs to be a solved POMDP. Use solve_POMDP() first.",
         message,
         call. = FALSE)
  
  solved
}

#' @rdname POMDP
#' @export
is_timedependent_POMDP <- function(x)
  ! is.null(x$horizon) && length(x$horizon) > 1L


# is a field time-dependent? For time-dependence we have a list of
# matrices/data.frames or for observation_prob we have a list of a list
.is_timedependent_field <- function(x, field) {
  field <-
    match.arg(field, c("transition_prob", "observation_prob", "reward"))
  m <- x[[field]]
  if (is.null(m))
    stop("Field ", field, " does not exist.")
  
  # it is a list. time dependent is a list (episodes) of lists
  if (!is.list(m) || is.data.frame(m))
    return(FALSE)
  if (!is.list(m[[1]]))
    return(FALSE)
  
  # time dependent reward is a list of lists of lists
  if (field == "reward" && !is.list(m[[1]][[1]]))
    return(FALSE)
  
  if (length(m) != length(x$horizon))
    stop(
      "Inconsistent POMDP specification. Field ",
      field,
      " does not contain data for the appropriate number of episodes."
    )
  
  TRUE
}

#' @rdname POMDP
#' @param epoch integer; an epoch that should be converted to the 
#'              corresponding episode in a time-dependent POMDP.
#' @export
epoch_to_episode <- function(x, epoch) {
  if (is.null(epoch))
    return(1L)
  
  episode <- which(epoch <= cumsum(x$horizon))[1]
  if (is.na(episode))
    stop("Epoch does not exist")
  
  episode
}

#' @rdname POMDP
#' @param message an error message to be displayed.
#' @export
is_converged_POMDP <- function(x, stop = FALSE, message = "") {
  is_solved_POMDP(x, stop = stop)
  
  converged <- x$solution$converged && length(x$solution$pg) == 1L
  
  if (stop && !converged)
    stop("POMDP solution has not converged. ", message, call. = FALSE)
  
  converged
}

# get pg and alpha for an epoch
.get_pg_index <- function(model, epoch) {
  #is_solved_POMDP(model, stop = TRUE)
  
  epoch <- as.integer(epoch)
  if (epoch < 1L)
    stop("Epoch has to be >= 1")
  
  ### (converged) infinite horizon POMDPs. We ignore epoch.
  if (length(model$solution$pg) == 1L)
    return(1L)
  
  ### regular epoch for finite/infinite horizon case
  if (epoch > length(model$solution$pg))
    stop("POMDP model has only solutions for ",
         length(model$solution$pg),
         " epochs!")
  
  return(epoch)
}

.get_pg <-
  function(model, epoch)
    model$solution$pg[[.get_pg_index(model, epoch)]]

.get_alpha <-
  function(model, epoch)
    model$solution$alpha[[.get_pg_index(model, epoch)]]

#' @rdname POMDP
#' @export
O_ <-
  function(action = NA,
           end.state = NA,
           observation = NA,
           probability)
    data.frame(
      action = action,
      end.state = end.state,
      observation = observation,
      probability = as.numeric(probability),
      stringsAsFactors = FALSE
    )

#' @rdname POMDP
#' @export
T_ <-
  function(action = NA,
           start.state = NA,
           end.state = NA,
           probability)
    data.frame(
      action = action,
      start.state = start.state,
      end.state = end.state,
      probability = as.numeric(probability),
      stringsAsFactors = FALSE
    )

#' @rdname POMDP
#' @export
R_ <-
  function(action = NA,
           start.state = NA,
           end.state = NA,
           observation = NA,
           value)
    data.frame(
      action = action,
      start.state = start.state,
      end.state = end.state,
      observation = observation,
      value = as.numeric(value),
      stringsAsFactors = FALSE
    )
