funcs: Core Functions

funcs {multiRL}  R Documentation

Core Functions

Description

The Markov Decision Process (MDP) underlying Reinforcement Learning can be decomposed into six fundamental components. By modifying these six functions, a vast number of distinct Reinforcement Learning models can be created. Users only need to grasp the basic Markov Decision Process and then tailor these six functions to construct a custom reinforcement learning model.

Class

funcs [List]

Details

  • Action Select

    • Step 1: Agent uses bias_func to apply a bias term to the value of each option.

    • Step 2: Agent uses expl_func to decide whether to make a purely random exploratory choice.

    • Step 3: Agent uses prob_func to compute the selection probability for each action.

  • Value Update

    • Step 4: Agent uses util_func to translate the objective reward into subjective utility.

    • Step 5: Agent uses dcay_func to regress the values of unchosen options toward a baseline.

    • Step 6: Agent uses lrng_func to update the value of the chosen option.
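
The six steps above can be sketched as a single trial of a minimal loop. This is an illustrative sketch using the built-in formulas described below, with scalar Q-values and made-up parameter values; none of the names here are the package's actual API:

```r
set.seed(1)
Q <- c(A = 0.5, B = 0.5)   # current expected values
N <- c(A = 3, B = 0)       # how often each action has been chosen
alpha <- 0.3; beta <- 5; gamma <- 0.8; delta <- 0.1; epsilon <- 0.05; zeta <- 0.2

# Step 1: bias_func - add an exploration bonus for rarely chosen actions
V <- Q + delta * sqrt(log(N + exp(1)) / (N + 1e-10))
# Step 2: expl_func - epsilon-greedy coin flip
explore <- runif(1) < epsilon
# Step 3: prob_func - soft-max (with max-subtraction) over the biased values
p <- exp(beta * (V - max(V))); p <- p / sum(p)
a <- if (explore) sample(names(Q), 1) else sample(names(Q), 1, prob = p)
# Step 4: util_func - subjective utility of the objective reward
reward <- 1
U <- reward^gamma
# Step 5: dcay_func - unchosen values decay toward the initial value Q0
Q0 <- 0.5
other <- setdiff(names(Q), a)
Q[other] <- Q[other] + zeta * (Q0 - Q[other])
# Step 6: lrng_func - delta-rule update of the chosen value
Q[a] <- Q[a] + alpha * (U - Q[a])
N[a] <- N[a] + 1
```

With these values the never-tried action B receives a huge bias in Step 1, so it is chosen, moves toward the reward in Step 6, while A decays toward its baseline in Step 5.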

Learning Rate (\alpha)

The inner lrng_func determines the learning rate (\alpha), i.e., it governs how the model selects \alpha. For instance, you can set different learning rates for different circumstances. Rather than 'learning' in a general sense, the learning rate determines whether the agent updates its expected values (Q-values) with an aggressive or a conservative step size under different conditions.

Q_{new} = Q_{old} + \alpha \cdot (R - Q_{old})
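
For instance, the update rule above can use different step sizes depending on the sign of the prediction error. The dual-rate variant below is a common modeling choice sketched purely for illustration; the function name, arguments, and defaults are assumptions, not the package's built-in rule:

```r
# Delta-rule update with separate rates for gains and losses (illustrative)
update_q <- function(q_old, reward, alpha_pos = 0.4, alpha_neg = 0.1) {
  pe <- reward - q_old                       # prediction error (R - Q_old)
  alpha <- if (pe >= 0) alpha_pos else alpha_neg
  q_old + alpha * pe
}
update_q(0.5, 1)   # positive error: aggressive step, 0.5 + 0.4*0.5 = 0.7
update_q(0.5, 0)   # negative error: conservative step, 0.5 - 0.1*0.5 = 0.45
```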

Probability Function (\beta)

The inner prob_func is defined by the inverse temperature parameter (\beta) and the lapse parameter.

The inverse temperature parameter governs the randomness of choice. As \beta approaches 0, the agent chooses among the actions completely at random. As \beta increases, the choice depends more strongly on the expected values (Q_{t}), so actions with higher expected values are chosen with proportionally higher probability.

Note: This formula subtracts the maximum of the Q_{t} values before exponentiation, a standard normalization for numerical stability that leaves the resulting probabilities unchanged.

P_{t}(a) = \frac{ \exp\left( \beta \cdot \left( Q_t(a) - \max_{j} Q_t(a_j) \right) \right) }{ \sum_{i=1}^{k} \exp\left( \beta \cdot \left( Q_t(a_i) - \max_{j} Q_t(a_j) \right) \right) }

The function below, which incorporates a constant lapse rate, is a correction to the standard soft-max rule. It prevents the probability of any action from becoming exactly 0 (Wilson and Collins, 2019 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.7554/eLife.49547")}). When the lapse parameter is set to 0.01, every action retains at least a 1% probability of being executed. If the number of available actions is very large (e.g., greater than 100), set the lapse parameter to a much smaller value, since 1 - lapse \cdot N_{shown} must remain positive for the result to be a valid probability.

P_{t}(a) = (1 - lapse \cdot N_{shown}) \cdot P_{t}(a) + lapse
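
A minimal R sketch of this lapse-adjusted soft-max (the function name and defaults are illustrative, not the package API):

```r
# Soft-max with max-subtraction and a lapse floor (illustrative)
choice_prob <- function(q, beta, lapse = 0.01) {
  p <- exp(beta * (q - max(q)))   # max-subtraction: numerically stable
  p <- p / sum(p)
  (1 - lapse * length(q)) * p + lapse  # every action keeps >= `lapse` probability
}
p <- choice_prob(c(0.2, 0.8, 0.5), beta = 3)
sum(p)   # still sums to 1
min(p)   # never below the lapse floor of 0.01
```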

When multiple cognitive processes (e.g., RL and WM) coexist within an MDP, the prob_func integrates the Q-tables from both systems by weighting the action probabilities generated by each.

Utility Function (\gamma)

The inner util_func is defined by the utility exponent parameter (\gamma). Its purpose is to account for the fact that the same objective reward may hold a different subjective value (utility) for different subjects.

Note: The built-in function is formulated according to Stevens' power law.

U(R) = {R}^{\gamma}
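
Because R^{\gamma} is undefined (NaN in R) for negative rewards with a fractional exponent, implementations commonly apply the exponent to the magnitude and restore the sign. The sketch below assumes that convention; it is illustrative, not necessarily the package's exact built-in:

```r
# Stevens' power-law utility; for losses, apply the exponent to the
# magnitude and restore the sign (a common convention, assumed here)
utility <- function(r, gamma = 0.8) sign(r) * abs(r)^gamma
utility(4, gamma = 0.5)    # compressed gain: 2
utility(-4, gamma = 0.5)   # compressed loss: -2
```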

Bias Function (\delta)

The inner bias_func is defined by the parameter (\delta). It reflects the fact that the expected value of an action is not solely determined by the received reward, but is also influenced by the number of times the action has been executed: an action that has been executed fewer times receives a larger exploration bias (Sutton and Barto, 2018). This mechanism promotes exploration and ensures that the agent executes every action at least once.

\text{Bias} = \delta \cdot \sqrt{\frac{\log(N + e)}{N + 10^{-10}}}
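
A sketch of this bonus term (illustrative names; the 10^{-10} in the denominator keeps an untried action's bonus finite but very large, so such an action is almost forced to be sampled once):

```r
# Exploration bonus that shrinks as an action accumulates choices (illustrative)
expl_bias <- function(n, delta = 0.1) {
  delta * sqrt(log(n + exp(1)) / (n + 1e-10))
}
expl_bias(0)    # enormous bonus: an untried action dominates the biased values
expl_bias(50)   # small bonus once the action is well sampled
```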

There are also other types of biases, such as stickiness to the same key—a tendency to perseverate on the option corresponding to the previously pressed key.

Exploration Function (\epsilon)

The inner expl_func is defined by the parameter (\epsilon) and a constant threshold. It controls the probability with which the agent engages in exploration (making a random choice) versus exploitation (making a value-based choice).

\epsilon-first: The agent chooses randomly for a fixed number of initial trials. Once the trial index i exceeds the threshold, the agent chooses exclusively based on value.

P(x) = \begin{cases} \mathbf{1}(i \le \text{threshold}), & x=1 \\ \mathbf{1}(i > \text{threshold}), & x=0 \end{cases}

where x=1 denotes an exploratory (random) choice, i is the trial index, and \mathbf{1}(\cdot) is the indicator function.

\epsilon-greedy: The agent performs a random choice with probability \epsilon and makes a value-based choice with probability 1-\epsilon.

P(x) = \begin{cases} \epsilon, & x=1 \\ 1-\epsilon, & x=0 \end{cases}

\epsilon-decreasing: The probability of making a random choice gradually decreases as the number of trials increases throughout the experiment.

P(x) = \begin{cases} \frac{1}{1+\epsilon \cdot i}, & x=1 \\ \frac{\epsilon \cdot i}{1+\epsilon \cdot i}, & x=0 \end{cases}
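
The three schedules can be sketched as the probability of an exploratory choice on trial i (an illustrative helper, not the package API):

```r
# Probability of a random (exploratory) choice on trial i, three schedules
p_explore <- function(i, rule, eps = 0.1, threshold = 20) {
  switch(rule,
    first      = as.numeric(i <= threshold),  # all-random early, all-greedy later
    greedy     = eps,                         # constant exploration rate
    decreasing = 1 / (1 + eps * i)            # exploration fades with experience
  )
}
p_explore(5,  "first")        # 1: still inside the forced-random phase
p_explore(50, "first")        # 0: past the threshold, value-based only
p_explore(50, "greedy")       # 0.1 on every trial
p_explore(50, "decreasing")   # 1/(1 + 0.1*50) = 1/6
```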

Decay Rate (\zeta)

The inner dcay_func is defined by the decay rate parameter (\zeta) and a constant bonus. It indicates that at the end of each trial, not only is the value of the chosen option updated according to the learning rate, but the values of the unchosen options also change.

Owing to the limitations of working memory capacity, the values of the unchosen options are hypothesized to decay back toward their initial value at a rate determined by the decay rate parameter (\zeta) (Collins and Frank, 2012 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1111/j.1460-9568.2011.07980.x")}).

W_{new} = W_{old} + \zeta \cdot (W_{0} - W_{old})
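
A sketch of the decay step (illustrative names; each application shrinks the distance to W_{0} geometrically, by a factor of 1 - \zeta per trial):

```r
# Unchosen option values drift back toward their initial value w0 (illustrative)
decay <- function(w, zeta = 0.2, w0 = 0.5) w + zeta * (w0 - w)
w <- 0.9
for (t in 1:5) w <- decay(w)   # five trials without choosing this option
w                              # 0.5 + 0.4 * 0.8^5 = 0.631072
```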

Furthermore, Hitchcock, Kim, and Frank (2025) \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1037/xge0001817")} suggest that if the feedback for the chosen option provides information relevant to the unchosen options, this decay may be enhanced or mitigated by the constant bonus.

Example

 # inner functions
 funcs <- list(
   # Learning Rate
   lrng_func = multiRL::func_alpha,
   # Probability Function (Soft-Max + Lapse Rate)
   prob_func = multiRL::func_beta,
   # Utility Function (Stevens' Power Law)
   util_func = multiRL::func_gamma,
   # Bias Function (Upper-Confidence-Bound)
   bias_func = multiRL::func_delta,
   # Exploration Function (Epsilon-First, Greedy, Decreasing)
   expl_func = multiRL::func_epsilon,
   # Decay Rate
   dcay_func = multiRL::func_zeta
 )

References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1111/j.1460-9568.2011.07980.x")}

Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. Elife, 8, e49547. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.7554/eLife.49547")}

Hitchcock, P. F., Kim, J., & Frank, M. J. (2025). How working memory and reinforcement learning interact when avoiding punishment and pursuing reward concurrently. Journal of Experimental Psychology: General. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1037/xge0001817")}


multiRL documentation built on March 31, 2026, 5:06 p.m.