| funcs | R Documentation |
The Markov Decision Process (MDP) underlying Reinforcement Learning can be decomposed into six fundamental components. By modifying these six functions, an immense number of distinct Reinforcement Learning models can be created. Users only need to grasp the basic Markov Decision Process and then tailor these six functions to construct a unique reinforcement learning model.
funcs [List]
Action Selection
Step 1: Agent uses bias_func
to apply a bias term to the value of each option.
Step 2: Agent uses expl_func
to decide whether to make a purely random exploratory choice.
Step 3: Agent uses prob_func
to compute the selection probability for each action.
Value Update
Step 4: Agent uses util_func
to translate the objective reward into subjective utility.
Step 5: Agent uses dcay_func
to regress the values of unchosen options toward a baseline.
Step 6: Agent uses lrng_func
to update the value of the chosen option.
\alpha) Inner lrng_func is the function that determines the learning rate
(\alpha). This function governs how the model selects
\alpha; for instance, you can set different learning rates for
different circumstances. Rather than 'learning' in a general sense, the
learning rate determines whether the agent updates its expected values
(Q-values) with an aggressive or a conservative step size under different
conditions.
Q_{new} = Q_{old} + \alpha \cdot (R - Q_{old})
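For instance, the sketch below implements the update rule above with separate learning rates for positive and negative prediction errors. The function name and arguments are illustrative only, not the actual interface of multiRL::func_alpha.
# Hypothetical sketch: condition-dependent learning rate.
# A larger alpha is applied after positive prediction errors than after negative ones.
my_lrng_func <- function(Q_old, R, alpha_pos = 0.3, alpha_neg = 0.1) {
  PE <- R - Q_old                        # prediction error
  alpha <- if (PE >= 0) alpha_pos else alpha_neg
  Q_old + alpha * PE                     # Q_new = Q_old + alpha * (R - Q_old)
}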
\beta) Inner prob_func is the function defined by the inverse temperature
parameter (\beta) and the lapse parameter.
The inverse temperature parameter governs the randomness of choice.
As \beta approaches 0, the agent chooses among the available
actions completely at random.
As \beta increases, the choice depends more strongly on the
expected values (Q_{t}), meaning actions with higher expected values
have a proportionally higher probability of being chosen.
Note: This formula includes a normalization of the (Q_{t}) values.
P_{t}(a) = \frac{\exp\left( \beta \cdot \left( Q_t(a) - \max_{j} Q_t(a_j) \right) \right)}{\sum_{i=1}^{k} \exp\left( \beta \cdot \left( Q_t(a_i) - \max_{j} Q_t(a_j) \right) \right)}
The formula below, which incorporates the constant lapse rate, is a correction to the standard soft-max rule, designed to prevent the probability of any action from becoming exactly 0 (Wilson and Collins, 2019 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.7554/eLife.49547")}). When the lapse parameter is set to 0.01, every action retains at least a 1% probability of being executed. If the number of available actions is very large (e.g., greater than 100), it is more appropriate to set the lapse parameter to a much smaller value.
P_{t}(a) = (1 - \text{lapse} \cdot N_{\text{shown}}) \cdot P_{t}(a) + \text{lapse}
When multiple cognitive processes (e.g., RL and WM) coexist within an MDP,
the prob_func integrates the Q-tables from both systems by
weighting the action probabilities generated by each.
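As a rough illustration, the sketch below implements the normalized soft-max with a constant lapse rate for a single Q-table. The function name and arguments are illustrative only, not the exact interface of multiRL::func_beta.
# Hypothetical sketch: soft-max over a numeric vector of Q-values plus a lapse rate.
my_prob_func <- function(Q, beta = 3, lapse = 0.01) {
  z <- beta * (Q - max(Q))              # normalization keeps exp() from overflowing
  p <- exp(z) / sum(exp(z))             # standard soft-max probabilities
  (1 - lapse * length(Q)) * p + lapse   # every action keeps at least `lapse` probability
}
# e.g., my_prob_func(c(0.2, 0.8), beta = 3) still sums to 1.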
\gamma) Inner util_func is defined by the utility exponent parameter
(\gamma). Its purpose is to account for the fact that the same objective
reward may hold a different subjective value (utility) for different
subjects.
Note: The built-in function is formulated according to Stevens' power law.
U(R) = {R}^{\gamma}
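A one-line sketch of this power law is given below (the name is illustrative, not the exact interface of multiRL::func_gamma); with \gamma < 1 large rewards are compressed, and with \gamma > 1 they are amplified.
# Hypothetical sketch of Stevens' power law: U(R) = R^gamma.
# Assumes non-negative rewards (R^gamma is NaN in R for negative R and fractional gamma).
my_util_func <- function(R, gamma = 0.8) R^gamma
# e.g., my_util_func(100, gamma = 0.5) returns 10: the objective reward is compressed.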
\delta) Inner bias_func is the function defined by the parameter
(\delta). This function signifies that the expected value of an
action is not solely determined by the received reward, but is also
influenced by the number of times the action has been executed.
Specifically, an action that has been executed fewer times receives a
larger exploration bias (Sutton and Barto, 2018).
This mechanism encourages exploration and ensures that the agent executes
every action at least once.
\text{Bias} = \delta \cdot \sqrt{\frac{\log(N + e)}{N + 10^{-10}}}
There are also other types of biases, such as stickiness to the same key—a tendency to perseverate on the option corresponding to the previously pressed key.
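The sketch below adds the count-based bonus above to each Q-value, assuming N is the vector of execution counts per action. The name and arguments are illustrative only, not the actual interface of multiRL::func_delta.
# Hypothetical sketch: upper-confidence-bound style exploration bonus.
my_bias_func <- function(Q, N, delta = 1) {
  bias <- delta * sqrt(log(N + exp(1)) / (N + 1e-10))  # very large while N = 0
  Q + bias   # rarely executed actions get a boost, so each is sampled at least once
}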
\epsilon) Inner expl_func is the function defined by the parameter
(\epsilon) and the constant threshold. This function
controls the probability with which the agent engages in exploration
(i.e., making a random choice) versus exploitation (i.e., making a
value-based choice). In the formulas below, x = 1 denotes a random choice,
x = 0 denotes a value-based choice, and i is the trial number.
\epsilon-first: The agent chooses randomly for a fixed number of
initial trials. Once the trial number exceeds the threshold, the agent
chooses exclusively based on value.
P(x = 1) =
\begin{cases}
1, & i \le \text{threshold} \\
0, & i > \text{threshold}
\end{cases}
\epsilon-greedy: The agent performs a random choice with probability
\epsilon and makes a value-based choice with probability
1-\epsilon.
P(x) =
\begin{cases}
\epsilon, & x=1 \\
1-\epsilon, & x=0
\end{cases}
\epsilon-decreasing: The probability of making a random choice
gradually decreases as the number of trials increases throughout the
experiment.
P(x) =
\begin{cases}
\frac{1}{1+\epsilon \cdot i}, & x=1 \\
\frac{\epsilon \cdot i}{1+\epsilon \cdot i}, & x=0
\end{cases}
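The sketch below implements all three policies, returning TRUE when the agent should explore (x = 1) on trial i. The name and arguments are illustrative only, not the actual interface of multiRL::func_epsilon.
# Hypothetical sketch: decide between exploration (TRUE) and exploitation (FALSE).
my_expl_func <- function(i, epsilon = 0.1, threshold = 20,
                         type = c("greedy", "first", "decreasing")) {
  type <- match.arg(type)
  p_explore <- switch(type,
    first      = as.numeric(i <= threshold),  # explore on the first `threshold` trials
    greedy     = epsilon,                     # constant exploration probability
    decreasing = 1 / (1 + epsilon * i)        # exploration fades as trials accumulate
  )
  runif(1) < p_explore
}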
\zeta) Inner dcay_func is the function defined by the decay rate parameter
(\zeta) and the constant bonus. This function indicates that
at the end of each trial, not only is the value of the chosen option
updated according to the learning rate, but the values of the unchosen
options also change.
Owing to the limitations of working memory capacity, the values of the
unchosen options are hypothesized to decay back toward their initial
value at a rate determined by the decay rate parameter (\zeta)
(Collins and Frank, 2012 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1111/j.1460-9568.2011.07980.x")}).
W_{new} = W_{old} + \zeta \cdot (W_{0} - W_{old})
Furthermore, Hitchcock, Kim, and Frank (2025) \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1037/xge0001817")} suggest that if the feedback for the chosen option provides information relevant to the unchosen options, this decay rate may be enhanced or mitigated by the constant bonus.
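The sketch below applies the decay rule above to the unchosen options only, assuming W is the vector of option values and W0 their common initial value. The name and arguments are illustrative only, not the actual interface of multiRL::func_zeta.
# Hypothetical sketch: unchosen values regress toward their initial value W0.
my_dcay_func <- function(W, chosen, W0 = 0, zeta = 0.2) {
  unchosen <- setdiff(seq_along(W), chosen)
  W[unchosen] <- W[unchosen] + zeta * (W0 - W[unchosen])  # W_new = W_old + zeta * (W0 - W_old)
  W
}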
# inner functions
funcs = list(
  # Learning Rate
  lrng_func = multiRL::func_alpha,
  # Probability Function (Soft-Max + Lapse Rate)
  prob_func = multiRL::func_beta,
  # Utility Function (Stevens' Power Law)
  util_func = multiRL::func_gamma,
  # Bias Function (Upper-Confidence-Bound)
  bias_func = multiRL::func_delta,
  # Exploration Function (Epsilon-First, Greedy, Decreasing)
  expl_func = multiRL::func_epsilon,
  # Decay Rate
  dcay_func = multiRL::func_zeta
)
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1111/j.1460-9568.2011.07980.x")}
Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.7554/eLife.49547")}
Hitchcock, P. F., Kim, J., & Frank, M. J. (2025). How working memory and reinforcement learning interact when avoiding punishment and pursuing reward concurrently. Journal of Experimental Psychology: General. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1037/xge0001817")}