R/document_funcs.R

#' @title Core Functions
#' @name funcs
#' @description
#'
#'  The Markov Decision Process (MDP) underlying Reinforcement Learning can be
#'    decomposed into six fundamental components. By modifying the six
#'    functions that implement them, a wide range of distinct Reinforcement
#'    Learning models can be created. Users only need to grasp the basic
#'    Markov Decision Process and then tailor these six functions to construct
#'    a custom reinforcement learning model.
#'
#' @section Class:
#' \code{funcs [List]}
#'
#' @section Details:
#' \itemize{
#'    \item Action Select
#'    \itemize{
#'        \item Step 1: Agent uses \code{bias_func}
#'              to apply a bias term to the value of each option.
#'        \item Step 2: Agent uses \code{expl_func}
#'              to decide whether to make a purely random exploratory choice.
#'        \item Step 3: Agent uses \code{prob_func}
#'              to compute the selection probability for each action.
#'    }
#'    \item Value Update
#'    \itemize{
#'        \item Step 4: Agent uses \code{util_func}
#'              to translate the objective reward into subjective utility.
#'        \item Step 5: Agent uses \code{dcay_func}
#'              to regress the values of unchosen options toward a baseline.
#'        \item Step 6: Agent uses \code{lrng_func}
#'              to update the value of the chosen option.
#'    }
#' }
#'
#' @section Learning Rate (\eqn{\alpha}):
#'
#'  The inner \code{lrng_func} determines the learning rate (\eqn{\alpha}).
#'    This function governs how the model selects \eqn{\alpha}; for instance,
#'    different learning rates can be set for different circumstances. Rather
#'    than 'learning' in a general sense, the learning rate determines whether
#'    the agent updates its expected values (Q-values) with an aggressive or a
#'    conservative step size under different conditions.
#'
#'  \deqn{Q_{new} = Q_{old} + \alpha \cdot (R - Q_{old})}
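#'
#'  For illustration, this update can be written as a minimal R sketch with
#'    hypothetical values (this sketch is not the body of
#'    \code{multiRL::func_alpha}):
#'
#' \preformatted{ # delta-rule update for the chosen option
#'  Q_old <- 0.5   # current expected value of the chosen option
#'  R     <- 1     # reward received on this trial
#'  alpha <- 0.3   # learning rate
#'  Q_new <- Q_old + alpha * (R - Q_old)
#'  Q_new          # 0.65
#' }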
#'
#' @section Probability Function (\eqn{\beta}):
#'
#'  The inner \code{prob_func} is defined by the inverse temperature
#'    parameter (\eqn{\beta}) and the \code{lapse} parameter.
#'
#'    The inverse temperature parameter governs the randomness of choice.
#'    As \eqn{\beta} approaches 0, the agent chooses between the available
#'    actions completely at random.
#'    As \eqn{\beta} increases, the choice depends more strongly on the
#'    expected values (\eqn{Q_{t}}), meaning actions with higher expected
#'    values have a proportionally higher probability of being chosen.
#'
#'    Note: This formula includes a normalization of the \eqn{Q_{t}} values.
#'
#'  \deqn{
#'    P_{t}(a) =
#'    \frac{
#'      \exp\left( \beta \cdot \left( Q_t(a) - \max_{j} Q_t(a_j) \right) \right)
#'    }{
#'      \sum_{i=1}^{k}
#'      \exp\left( \beta \cdot \left( Q_t(a_i) - \max_{j} Q_t(a_j) \right) \right)
#'    }
#'  }
#'
#'  The function below, which incorporates the constant lapse rate, corrects
#'    the standard soft-max rule so that the probability of any action never
#'    becomes exactly 0
#'    (Wilson and Collins, 2019 \doi{10.7554/eLife.49547}).
#'    When the lapse parameter is set to 0.01, every action has at least a 1\%
#'    probability of being executed. If the number of available actions is
#'    very large (e.g., greater than 100), the lapse parameter should be set
#'    to a much smaller value.
#'
#'  \deqn{
#'    P_{t}(a) = (1 - \text{lapse} \cdot N_{\text{shown}}) \cdot P_{t}(a) + \text{lapse}
#'  }
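#'
#'  As a minimal sketch with hypothetical values (not the body of
#'    \code{multiRL::func_beta}), the normalized soft-max and the lapse
#'    correction can be computed as follows:
#'
#' \preformatted{ # soft-max with max-normalization and a lapse correction
#'  Q     <- c(0.2, 0.5, 0.8)  # hypothetical Q-values for three actions
#'  beta  <- 3                 # inverse temperature
#'  lapse <- 0.01              # minimum choice probability per action
#'
#'  p <- exp(beta * (Q - max(Q)))             # subtract max(Q) for stability
#'  p <- p / sum(p)                           # standard soft-max
#'  p <- (1 - lapse * length(Q)) * p + lapse  # no action drops below lapse
#'  sum(p)                                    # probabilities still sum to 1
#' }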
#'
#'  When multiple cognitive processes (e.g., RL and WM) coexist within an MDP,
#'    the \code{prob_func} integrates the Q-tables from both systems by
#'    weighting the action probabilities generated by each.
#'
#' @section Utility Function (\eqn{\gamma}):
#'
#'  The inner \code{util_func} is defined by the utility exponent parameter
#'    (\eqn{\gamma}). Its purpose is to account for the fact that the same
#'    objective reward may hold a different subjective value (utility) for
#'    different subjects.
#'
#'    Note: The built-in function is formulated according to Stevens' power law.
#'
#'  \deqn{U(R) = {R}^{\gamma}}
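#'
#'  A one-line sketch with hypothetical values (not necessarily identical to
#'    \code{multiRL::func_gamma}):
#'
#' \preformatted{ # Stevens' power-law utility
#'  R     <- c(0, 1, 10, 100)  # objective rewards
#'  gamma <- 0.8               # exponents below 1 compress large rewards
#'  U     <- R^gamma
#' }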
#'
#' @section Bias Function (\eqn{\delta}):
#'
#'  The inner \code{bias_func} is defined by the parameter (\eqn{\delta}).
#'    This function signifies that the expected value of an action is not
#'    solely determined by the received reward, but is also influenced by the
#'    number of times the action has been executed. Specifically, an action
#'    that has been executed fewer times receives a larger exploration bias
#'    (Sutton and Barto,
#'    \href{http://incompleteideas.net/book/the-book-2nd.html}{2018}).
#'    This mechanism promotes exploration and ensures that the agent executes
#'    every action at least once.
#'
#'  \deqn{
#'    \text{Bias} = \delta \cdot \sqrt{\frac{\log(N + e)}{N + 10^{-10}}}
#'  }
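#'
#'  A minimal sketch of this exploration bonus with hypothetical values (not
#'    the body of \code{multiRL::func_delta}):
#'
#' \preformatted{ # upper-confidence-bound style bonus
#'  N     <- c(0, 1, 5, 20)  # times each action has been chosen
#'  delta <- 0.5             # bias weight
#'  bias  <- delta * sqrt(log(N + exp(1)) / (N + 1e-10))
#'  bias   # largest for the never-chosen action, shrinking with experience
#' }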
#'
#'    There are also other types of biases, such as stickiness to the same
#'    key—a tendency to perseverate on the option corresponding to the
#'    previously pressed key.
#'
#' @section Exploration Function (\eqn{\epsilon}):
#'
#'  The inner \code{expl_func} is defined by the parameter (\eqn{\epsilon})
#'    and the constant \code{threshold}. This function controls the
#'    probability with which the agent engages in exploration (i.e., making a
#'    random choice) versus exploitation (i.e., making a value-based choice).
#'
#'  \eqn{\epsilon-first}: The agent chooses randomly for a fixed number of
#'    initial trials. Once the trial index \eqn{i} exceeds the threshold, the
#'    agent chooses exclusively based on value.
#'
#'  \deqn{
#'  P(x=1) =
#'  \begin{cases}
#'    1, & i \le \text{threshold}  \\
#'    0, & i > \text{threshold}
#'  \end{cases}
#'  }
#'
#'  \eqn{\epsilon-greedy}: The agent performs a random choice with probability
#'    \eqn{\epsilon} and makes a value-based choice with probability
#'    \eqn{1-\epsilon}.
#'
#'  \deqn{
#'  P(x) =
#'  \begin{cases}
#'    \epsilon, & x=1  \\
#'    1-\epsilon, & x=0
#'  \end{cases}
#'  }
#'
#'  \eqn{\epsilon-decreasing}: The probability of making a random choice
#'    gradually decreases as the number of trials increases throughout the
#'    experiment.
#'
#'  \deqn{
#'  P(x) =
#'  \begin{cases}
#'    \frac{1}{1+\epsilon \cdot i}, & x=1  \\
#'    \frac{\epsilon \cdot i}{1+\epsilon \cdot i}, & x=0
#'  \end{cases}
#'  }
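#'
#'  As a minimal sketch of the \eqn{\epsilon-greedy} case with hypothetical
#'    values (not the body of \code{multiRL::func_epsilon}):
#'
#' \preformatted{ # explore with probability epsilon, exploit otherwise
#'  Q       <- c(0.2, 0.5, 0.8)    # hypothetical Q-values
#'  epsilon <- 0.1
#'  explore <- runif(1) < epsilon  # TRUE with probability epsilon
#'  choice  <- if (explore) sample(seq_along(Q), 1) else which.max(Q)
#' }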
#'
#' @section Decay Rate (\eqn{\zeta}):
#'
#'  The inner \code{dcay_func} is defined by the decay rate parameter
#'    (\eqn{\zeta}) and the constant \code{bonus}. This function indicates
#'    that at the end of each trial, not only is the value of the chosen
#'    option updated according to the learning rate, but the values of the
#'    unchosen options also change.
#'
#'  Due to the limitations of working memory capacity, the values of the
#'    unchosen options are hypothesized to decay back toward their initial
#'    value at a rate determined by the decay rate parameter (\eqn{\zeta})
#'    (Collins and Frank, 2012 \doi{10.1111/j.1460-9568.2011.07980.x}).
#'
#'  \deqn{W_{new} = W_{old} + \zeta \cdot (W_{0} - W_{old})}
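#'
#'  A minimal sketch of this decay step with hypothetical values (not the
#'    body of \code{multiRL::func_zeta}):
#'
#' \preformatted{ # unchosen options drift back toward their initial value
#'  W_old <- c(0.9, 0.4, 0.7)  # current values of the unchosen options
#'  W_0   <- 0.5               # baseline (initial) value
#'  zeta  <- 0.2               # decay rate
#'  W_new <- W_old + zeta * (W_0 - W_old)
#' }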
#'
#'  Furthermore, Hitchcock, Kim, and Frank (2025) \doi{10.1037/xge0001817}
#'    suggest that if the feedback for the chosen option provides information
#'    relevant to the unchosen options, this decay may be enhanced or
#'    mitigated by the constant \code{bonus}.
#'
#' @section Example:
#' \preformatted{ # inner functions
#'  funcs = list(
#'    # Learning Rate
#'    lrng_func = multiRL::func_alpha,
#'    # Probability Function (Soft-Max + Lapse Rate)
#'    prob_func = multiRL::func_beta,
#'    # Utility Function (Stevens' Power Law)
#'    util_func = multiRL::func_gamma,
#'    # Bias Function (Upper-Confidence-Bound)
#'    bias_func = multiRL::func_delta,
#'    # Exploration Function (Epsilon-First, Greedy, Decreasing)
#'    expl_func = multiRL::func_epsilon,
#'    # Decay Rate
#'    dcay_func = multiRL::func_zeta
#'  )
#' }
#'
#' @references
#' Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning:
#' An Introduction (2nd ed.). MIT Press.
#'
#' Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning
#' is working memory, not reinforcement learning? A behavioral, computational,
#' and neurogenetic analysis. \emph{European Journal of Neuroscience, 35}(7),
#' 1024-1035.
#' \doi{10.1111/j.1460-9568.2011.07980.x}
#'
#' Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the
#' computational modeling of behavioral data. \emph{eLife, 8}, e49547.
#' \doi{10.7554/eLife.49547}
#'
#' Hitchcock, P. F., Kim, J., & Frank, M. J. (2025). How working memory
#' and reinforcement learning interact when avoiding punishment and pursuing
#' reward concurrently. \emph{Journal of Experimental Psychology: General}.
#' \doi{10.1037/xge0001817}
#'
NULL
