#' @title
#' Step 1: Building the reinforcement learning model
#'
#' @description
#' This function is designed to construct and customize reinforcement
#' learning models.
#'
#' Items for model construction:
#' \itemize{
#' \item \strong{Data Input and Specification:} You must provide the raw
#' dataset for analysis. Crucially, you need to tell the
#' \code{\link[binaryRL]{run_m}} function which columns in your dataset
#' correspond to its arguments (for the expected format, see the example
#' datasets
#' \code{\link[binaryRL]{Mason_2024_G1}} and
#' \code{\link[binaryRL]{Mason_2024_G2}}).
#' Because the task is a game, it is critical that your dataset includes
#' the rewards of both the option the participant chose and the unchosen
#' option.
#' \item \strong{Customizable RL Models:} This function allows you
#' to define and adjust the number of free parameters to create
#' various reinforcement learning models.
#' \itemize{
#' \item \emph{Value Function:}
#' \itemize{
#' \item \emph{Learning Rate:}
#' By adjusting the number of \code{eta} values, you can construct basic
#' reinforcement learning models such as Temporal Difference (TD)
#' and Risk Sensitive Temporal Difference (RSTD).
#' You can also directly adjust \code{\link[binaryRL]{func_eta}} to
#' define your own custom learning rate function.
#' \item \emph{Utility Function:} You can directly adjust the form
#' of \code{\link[binaryRL]{func_gamma}} to incorporate the
#' principles of Kahneman's Prospect Theory. Currently,
#' the built-in \code{\link[binaryRL]{func_gamma}} only takes the
#' form of a power function, consistent with Stevens' Power Law.
#' }
#' \item \emph{Exploration–Exploitation Trade-off:}
#' \itemize{
#' \item \emph{Initial Values:} This involves setting the
#' initial expected value for each option when it hasn't been
#' chosen yet. A higher initial value encourages exploration.
#' \item \emph{Epsilon:} Adjusting the \code{threshold},
#' \code{epsilon} and \code{lambda} parameters can lead to
#' exploration strategies such as epsilon-first, epsilon-greedy,
#' or epsilon-decreasing.
#' \item \emph{Upper-Confidence-Bound:} The parameter \code{pi}
#' controls the degree of exploration by scaling the uncertainty
#' bonus given to less-explored options.
#' \item \emph{Soft-Max:} The inverse temperature parameter
#' \code{tau} controls the agent's sensitivity to value differences.
#' A higher value of tau places greater weight on value differences,
#' leading to more exploitation; a smaller value of tau indicates a
#' greater tendency towards exploration.
#' }
#' }
#'
#' \item \strong{Objective Function Format for Optimization:} Once your
#' model is defined with \code{\link[binaryRL]{run_m}}, it must be wrapped
#' as an objective function that accepts \code{params} as input and returns
#' a loss value (typically \code{logL}). This format ensures compatibility
#' with the optimization algorithms used by \code{\link[binaryRL]{fit_p}}
#' to estimate the optimal parameters. For examples of the standard
#' objective-function format, see
#' \code{\link[binaryRL]{TD}},
#' \code{\link[binaryRL]{RSTD}},
#' \code{\link[binaryRL]{Utility}}; a minimal sketch is also given below
#' this list.
#' }
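#'
#' A minimal sketch of such an objective function (using the bundled
#' \code{\link[binaryRL]{Mason_2024_G2}} dataset; the element name \code{ll}
#' used to pull the log-likelihood out of the fitted object is illustrative
#' only, see \code{\link[binaryRL]{TD}} for the exact form used by the
#' package):
#'
#' \preformatted{
#' TD_obj <- function(params) {
#'   res <- binaryRL::run_m(
#'     mode = "fit",
#'     data = binaryRL::Mason_2024_G2,
#'     id = 18,
#'     eta = params[1],
#'     tau = params[2],
#'     n_params = 2,
#'     n_trials = 360
#'   )
#'   # return the loss value; "ll" is an illustrative element name,
#'   # see binaryRL::TD for the exact objective used by the package
#'   logL <- res$ll
#'   return(logL)
#' }
#' }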
#'
#' For more information, please refer to the homepage of this package:
#' \url{https://yuki-961004.github.io/binaryRL/}
#'
#' @param name [string]
#'
#' The name of your RL model
#'
#' @param mode [string]
#'
#' This parameter controls the function's operational mode. It has three
#' possible values, each typically associated with a specific function:
#' \itemize{
#' \item \code{"simulate"}: Should be used when working with
#' \code{\link[binaryRL]{rcv_d}}.
#' \item \code{"fit"}: Should be used when working with
#' \code{\link[binaryRL]{fit_p}}.
#' \item \code{"replay"}: Should be used when working with
#' \code{\link[binaryRL]{rpl_e}}.
#' }
#' In most cases, you won't need to modify this parameter directly, as suitable
#' default values are set for different contexts.
#'
#' @param policy [string]
#'
#' Specifies the learning policy to be used.
#' This determines how the model updates action values based on observed or
#' simulated choices. It can be either \code{"off"} or \code{"on"}.
#'
#' \itemize{
#' \item {
#' \strong{Off-Policy (Q-learning): }
#' This is the most common approach for modeling
#' reinforcement learning in Two-Alternative Forced Choice (TAFC) tasks.
#' In this mode, the model's goal is to learn the underlying value of
#' each option by observing the human participant's behavior. It achieves
#' this by consistently updating the value of the option that the
#' human actually chose. The focus is on understanding the value
#' representation that likely drove the participant's decisions.
#' }
#' \item {
#' \strong{On-Policy (SARSA): }
#' In this mode, the target policy and the behavior policy are identical.
#' The model first computes the selection probability for each option based
#' on their current values. Critically, it then uses these probabilities to
#' sample its own action. The value update is then performed on the action
#' that the model itself selected. This approach focuses more on directly
#' mimicking the stochastic choice patterns of the agent, rather than just
#' learning the underlying values from a fixed sequence of actions.
#' }
#' }
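#'
#' A conceptual sketch of the difference (this is not the package's internal
#' code; the variable names are purely illustrative):
#'
#' \preformatted{
#' # off-policy: update the value of the option the participant actually chose
#' chosen <- sub_choice[t]
#'
#' # on-policy: sample an action from the model's own soft-max probabilities
#' # and update the value of that option instead
#' chosen <- sample(c(left, right), size = 1, prob = c(P_L, P_R))
#' }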
#'
#' @param data [data.frame]
#'
#' The data must include the following mandatory columns:
#' \itemize{
#' \item \code{sub} "Subject"
#' \item \code{time_line} "Block" "Trial"
#' \item \code{L_choice} "L_choice"
#' \item \code{R_choice} "R_choice"
#' \item \code{L_reward} "L_reward"
#' \item \code{R_reward} "R_reward"
#' \item \code{sub_choose} "Sub_Choose"
#' }
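#'
#' A minimal illustration of the expected structure (the column names follow
#' the defaults of this function; the values are made up):
#'
#' \preformatted{
#' data.frame(
#'   Subject    = 1,
#'   Block      = 1,
#'   Trial      = 1:2,
#'   L_choice   = c("A", "C"),
#'   R_choice   = c("B", "D"),
#'   L_reward   = c(36, 40),
#'   R_reward   = c(27, 0),
#'   Sub_Choose = c("A", "D")
#' )
#' }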
#'
#' @param id [string]
#'
#' Which subject is going to be analyzed. The value should correspond to an
#' entry in the "sub" column, which must contain the subject IDs.
#'
#' e.g. \code{id = 18}
#'
#' @param n_params [integer]
#'
#' The number of free parameters in your model.
#'
#' @param n_trials [integer]
#'
#' The total number of trials in your experiment.
#'
#' @param gamma [NumericVector]
#'
#' \strong{Note}: This should not be confused with the discount rate parameter
#' (also named gamma) found in Temporal Difference (TD) models.
#' The Rescorla-Wagner model does not include a discount rate.
#' Here, \code{gamma} is used as a free parameter to shape the
#' utility function.
#'
#' \itemize{
#' \item \strong{Stevens' Power Law}:
#' Utility is modeled as:
#' \deqn{U(R) = {R}^{\gamma}}
#'
#' \item \strong{Kahneman's Prospect Theory}:
#' This exponent is applied differently based on the sign of the reward:
#' \deqn{U(R) = \begin{cases}
#' R^{\gamma_{1}}, & R > 0 \\
#' \beta \cdot R^{\gamma_{2}}, & R < 0
#' \end{cases}}
#' }
#'
#' default: \code{gamma = 1}
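#'
#' A small numeric illustration of the power-law utility (the parameter value
#' is arbitrary):
#'
#' \preformatted{
#' gamma <- 0.7
#' R <- 10
#' R^gamma
#' #> [1] 5.011872
#' }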
#'
#' @param eta [NumericVector]
#'
#' Parameters used in the Learning Rate Function,
#' \code{rate_func},
#' representing the rate at which the subject updates the difference
#' (prediction error) between the reward and the expected value in the
#' subject's mind.
#'
#' The structure of \code{eta} depends on the model type:
#' \itemize{
#' \item For the \strong{Temporal Difference (TD) model},
#' where a single learning rate is used throughout the experiment
#' \deqn{V_{new} = V_{old} + \eta \cdot (R - V_{old})}
#'
#' \item For the \strong{Risk-Sensitive Temporal Difference (RSTD) model},
#' where two different learning rates are used depending on whether the
#' reward is lower or higher than the expected value:
#' \deqn{V_{new} = V_{old} + \eta_{+} \cdot (R - V_{old}), R > V_{old}}
#' \deqn{V_{new} = V_{old} + \eta_{-} \cdot (R - V_{old}), R < V_{old}}
#' }
#'
#' TD: \code{eta = 0.3}
#'
#' RSTD: \code{eta = c(0.3, 0.7)}
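#'
#' As an illustration (the parameter values are arbitrary and the bundled
#' \code{\link[binaryRL]{Mason_2024_G2}} dataset is used), the three built-in
#' model families differ only in how \code{eta} and \code{gamma} are set:
#'
#' \preformatted{
#' # TD: a single learning rate
#' td <- binaryRL::run_m(
#'   mode = "replay", data = binaryRL::Mason_2024_G2, id = 18,
#'   eta = 0.3, tau = 0.5, n_params = 2, n_trials = 360
#' )
#'
#' # RSTD: two learning rates, one for each sign of the prediction error
#' rstd <- binaryRL::run_m(
#'   mode = "replay", data = binaryRL::Mason_2024_G2, id = 18,
#'   eta = c(0.3, 0.7), tau = 0.5, n_params = 3, n_trials = 360
#' )
#'
#' # Utility: a single learning rate plus a free utility exponent
#' utility <- binaryRL::run_m(
#'   mode = "replay", data = binaryRL::Mason_2024_G2, id = 18,
#'   gamma = 0.7, eta = 0.3, tau = 0.5, n_params = 3, n_trials = 360
#' )
#' }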
#'
#' @param initial_value [double]
#'
#' Subject's initial expected value for each stimulus's reward. If this value
#' is left unset (\code{initial_value = NA}), the subject will use the reward
#' received after the first trial as the initial value for that stimulus. In
#' other words, the learning rate for the first trial is 100%.
#'
#' default: \code{initial_value = NA_real_}
#'
#' @param threshold [integer]
#'
#' Controls the initial exploration phase in the \strong{epsilon-first} strategy.
#' This is the number of early trials where the subject makes purely random
#' choices, as they haven't yet learned the options' values. For example,
#' \code{threshold = 20} means random choices for the first 20 trials.
#' For \strong{epsilon-greedy} or \strong{epsilon-decreasing} strategies,
#' \code{threshold} should be kept at its default value.
#'
#' \deqn{P(x=1) = \begin{cases}
#' 1, & \text{trial} \le \text{threshold} \text{ (random choosing)} \\
#' 0, & \text{trial} > \text{threshold} \text{ (value-based choosing)}
#' \end{cases}}
#'
#' default: \code{threshold = 1}
#'
#' epsilon-first: \code{threshold = 20, epsilon = NA, lambda = NA}
#'
#' @param epsilon [NumericVector]
#'
#' A parameter used in the \strong{epsilon-greedy} exploration strategy. It
#' defines the probability of making a completely random choice, as opposed
#' to choosing based on the relative values of the left and right options.
#' For example, if \code{epsilon = 0.1}, the subject has a 10% chance of random
#' choice and a 90% chance of value-based choice. This parameter is only
#' relevant when \code{threshold} is at its default value (1) and
#' \code{lambda} is not set.
#'
#' \deqn{P(x) = \begin{cases}
#' \epsilon, & x=1 \text{ (random choosing)} \\
#' 1-\epsilon, & x=0 \text{ (value-based choosing)}
#' \end{cases}}
#'
#' epsilon-greedy: \code{threshold = 1, epsilon = 0.1, lambda = NA}
#'
#' @param lambda [NumericVector]
#'
#' A numeric value that controls the decay rate of exploration probability
#' in the \strong{epsilon-decreasing} strategy. A higher \code{lambda} value
#' means the probability of random choice will decrease more rapidly
#' as the number of trials increases.
#'
#' \deqn{P(x) = \begin{cases}
#' \frac{1}{1+\lambda \cdot trial}, & x=1 \text{ (random choosing)} \\
#' \frac{\lambda \cdot trial}{1+\lambda \cdot trial}, & x=0 \text{ (value-based choosing)}
#' \end{cases}}
#'
#' epsilon-decreasing: \code{threshold = 1, epsilon = NA, lambda = 0.5}
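#'
#' A small numeric illustration of the decay (with \code{lambda = 0.5}):
#'
#' \preformatted{
#' lambda <- 0.5
#' trial <- 1:10
#' p_random <- 1 / (1 + lambda * trial)
#' round(p_random, 3)
#' #> [1] 0.667 0.500 0.400 0.333 0.286 0.250 0.222 0.200 0.182 0.167
#' }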
#'
#' @param pi [NumericVector]
#'
#' Parameter used in the Upper-Confidence-Bound (UCB) action-selection
#' formula, \code{bias_func}. It controls the degree of
#' exploration by scaling the uncertainty bonus given to less-explored options.
#' A larger value of \code{pi} (denoted as \code{c} in Sutton and Barto (2018))
#' increases the influence of this bonus, leading to more exploration of
#' actions with uncertain estimated values. Conversely, a smaller \code{pi}
#' results in less exploration.
#'
#' \deqn{
#' A_t = \arg \max_{a} \left[ V_t(a) + \pi \sqrt{\frac{\ln(t)}{N_t(a)}} \right]
#' }
#'
#' default: \code{pi = NA}
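#'
#' A small numeric illustration of the uncertainty bonus (the values are
#' arbitrary; \code{N} is how often each option has been chosen by trial
#' \code{t}):
#'
#' \preformatted{
#' pi <- 0.1
#' t <- 100
#' N <- c(80, 20)
#' # the less-explored option (N = 20) receives the larger bonus
#' bonus <- pi * sqrt(log(t) / N)
#' round(bonus, 3)
#' #> [1] 0.024 0.048
#' }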
#'
#' @param tau [NumericVector]
#'
#' Parameters used in the Soft-Max Function, \code{prob_func},
#' representing the sensitivity of the subject to the value difference when
#' making decisions. It determines the probability of selecting the left option
#' versus the right option based on their values. A larger value of \code{tau}
#' indicates greater sensitivity to the value difference between the options.
#' In other words, even a small difference in value will make the subject more
#' likely to choose the higher-value option.
#'
#' \deqn{P_L = \frac{1}{1+e^{-(V_L-V_R) \cdot \tau}}; P_R = \frac{1}{1+e^{-(V_R-V_L) \cdot \tau}}}
#'
#' default: \code{tau = NA}
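#'
#' A small numeric illustration (the values are arbitrary): the larger
#' \code{tau}, the more deterministically the higher-valued option is chosen.
#'
#' \preformatted{
#' V_L <- 5
#' V_R <- 3
#' tau <- c(0.5, 1, 5)
#' P_L <- 1 / (1 + exp(-(V_L - V_R) * tau))
#' round(P_L, 3)
#' #> [1] 0.731 0.881 1.000
#' }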
#'
#' @param lapse [double]
#'
#' A numeric value between 0 and 1, representing the lapse rate.
#'
#' You can interpret this parameter as the probability of the agent "slipping"
#' or making a random choice, irrespective of the learned action values. This
#' accounts for moments of inattention or motor errors. In this sense, it
#' represents the minimum probability with which any given option will be
#' selected. It is a free parameter that acknowledges that individuals do not
#' always make decisions with full concentration throughout an experiment.
#'
#' From a modeling perspective, the lapse rate is crucial for preventing the
#' log-likelihood calculation from returning \code{-Inf}. This issue arises
#' when the model assigns a probability of zero to an action that the
#' participant actually chose (\code{log(0)} is undefined). By ensuring every
#' option has a non-zero minimum probability, the \code{lapse} parameter makes
#' the fitting process more stable and robust against noise in the data.
#'
#' \deqn{
#' P_{final} = (1 - lapse) \cdot P_{softmax} + \frac{lapse}{N_{choices}}
#' }
#'
#' default: \code{lapse = 0.02}
#'
#' This ensures each option has a minimum selection probability of 1 percent
#' in TAFC tasks.
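#'
#' A small numeric illustration in a TAFC task (the soft-max probabilities
#' are arbitrary):
#'
#' \preformatted{
#' lapse <- 0.02
#' P_softmax <- c(0.99, 0.01)
#' (1 - lapse) * P_softmax + lapse / 2
#' #> [1] 0.9802 0.0198
#' }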
#'
#' @param alpha [NumericVector]
#'
#' Extra parameters that may be used in functions.
#'
#' @param beta [NumericVector]
#'
#' Extra parameters that may be used in functions.
#'
#' @param priors [list]
#'
#' A list specifying the prior distributions for the model parameters.
#' This argument is mandatory when using \code{estimate = "MAP"}.
#'
#' default: \code{priors = NULL}
#'
#' @param util_func [Function]
#'
#' Utility Function; see \code{\link[binaryRL]{func_gamma}}.
#'
#' @param rate_func [Function]
#'
#' Learning Rate Function; see \code{\link[binaryRL]{func_eta}}.
#'
#' @param expl_func [Function]
#'
#' Exploration Strategy Function; see \code{\link[binaryRL]{func_epsilon}}.
#'
#' @param bias_func [Function]
#'
#' Upper-Confidence-Bound Function; see \code{\link[binaryRL]{func_pi}}.
#'
#' @param prob_func [Function]
#'
#' Soft-Max Function; see \code{\link[binaryRL]{func_tau}}.
#'
#' @param loss_func [Function]
#'
#' Loss Function; see \code{\link[binaryRL]{func_logl}}.
#'
#' @param sub [string]
#'
#' Column name of the subject ID.
#'
#' e.g. \code{sub = "Subject"}
#'
#' @param time_line [CharacterVector]
#'
#' A character vector specifying the columns that define the temporal
#' sequence of the experiment. This argument describes how the experiment is
#' structured, for example organized into blocks ("Block"), with breaks in
#' between, and multiple trials ("Trial") within each block.
#'
#' default: \code{time_line = c("Block", "Trial")}
#'
#' @param L_choice [string]
#'
#' Column name of the left choice.
#'
#' default: \code{L_choice = "L_choice"}
#'
#' @param R_choice [string]
#'
#' Column name of the right choice.
#'
#' default: \code{R_choice = "R_choice"}
#'
#' @param L_reward [string]
#'
#' Column name of the reward of the left choice.
#'
#' default: \code{L_reward = "L_reward"}
#'
#' @param R_reward [string]
#'
#' Column name of the reward of the right choice.
#'
#' default: \code{R_reward = "R_reward"}
#'
#' @param sub_choose [string]
#'
#' Column name of choices made by the subject.
#'
#' default: \code{sub_choose = "Sub_Choose"}
#'
#' @param rob_choose [string]
#'
#' Column name of the choices made by the model; you can usually ignore this.
#'
#' default: \code{rob_choose = "Rob_Choose"}
#'
#' @param raw_cols [CharacterVector]
#'
#' Defaults to \code{NULL}. If left as \code{NULL}, it will directly capture
#' all column names from the raw data.
#'
#' @param var1 [string]
#'
#' Column name of extra variable 1. If your model uses more than just reward
#' and expected value, and you need other information, such as whether the
#' choice frame is Gain or Loss, then you can input the 'Frame' column as
#' var1 into the model.
#'
#' default: \code{var1 = NA_character_}
#'
#' @param var2 [string]
#'
#' Column name of extra variable 2. If one additional variable, var1, does not
#' meet your needs, you can add another additional variable, var2, into your
#' model.
#'
#' default: \code{var2 = NA_character_}
#'
#' @param seed [integer]
#'
#' Random seed. This ensures that the results are
#' reproducible and remain the same each time the function is run.
#'
#' default: \code{seed = 123}
#'
#' @param digits_1 [integer]
#'
#' The number of decimal places to retain for columns related to the value function.
#'
#' default: \code{digits_1 = 2}
#'
#' @param digits_2 [integer]
#'
#' The number of decimal places to retain for columns related to the selection function.
#'
#' default: \code{digits_2 = 5}
#'
#' @param engine [string]
#'
#' - \code{"r"}: Use the pure R version of the code.
#'
#' - \code{"cpp"}: Use the Rcpp-optimized version.
#'
#' default: \code{engine = "cpp"}
#'
#' @returns
#' A list of class \code{binaryRL} containing the results of the model fitting.
#'
#' @examples
#' data <- binaryRL::Mason_2024_G2
#'
#' binaryRL.res <- binaryRL::run_m(
#' mode = "replay",
#' data = data,
#' id = 18,
#' eta = c(0.321, 0.765),
#' tau = 0.5,
#' n_params = 3,
#' n_trials = 360
#' )
#'
#' summary(binaryRL.res)
#'
run_m <- function(
name = NA,
mode = c("simulate", "fit", "replay"),
policy = c("on", "off"),
data,
id,
n_params,
n_trials,
gamma = 1,
eta,
initial_value = NA_real_,
threshold = 1,
epsilon = NA,
lambda = NA,
pi = NA,
tau = NA,
lapse = 0.02,
alpha = NA,
beta = NA,
priors = NULL,
util_func = func_gamma,
rate_func = func_eta,
expl_func = func_epsilon,
bias_func = func_pi,
prob_func = func_tau,
loss_func = func_logl,
sub = "Subject",
time_line = c("Block", "Trial"),
L_choice = "L_choice",
R_choice = "R_choice",
L_reward = "L_reward",
R_reward = "R_reward",
sub_choose = "Sub_Choose",
rob_choose = "Rob_Choose",
raw_cols = NULL,
var1 = NA_character_,
var2 = NA_character_,
seed = 123,
digits_1 = 2,
digits_2 = 5,
engine = "cpp"
){
# Only the "fit" mode needs to distinguish on-policy from off-policy;
# "simulate" and "replay" both generate synthetic data from the given
# parameters, so they are always on-policy.
if (mode != "fit") {
policy <- "on"
}
# Infer a default model name from the free parameters when none is given
if (is.na(name)) {
if (length(eta) == 1 && all(gamma == 1)) {
name <- "TD"
} else if (length(eta) == 2 && all(gamma == 1)) {
name <- "RSTD"
} else if (length(eta) == 1 && !all(gamma == 1)) {
name <- "Utility"
} else {
name <- "Customed"
}
}
# Column names of the raw dataset
if (is.null(raw_cols)) {
raw_cols <- colnames(data)
}
# Keep only the rows belonging to the selected subject
data <- data[data[[sub]] == id, ]
step1 <- unique_choice(
data = data,
L_choice = L_choice,
R_choice = R_choice
)
step2 <- arrange_data(
data = step1[["data"]],
time_line = time_line
)
step3 <- add_NA(
data = step2
)
step4 <- set_initial_value(
data = step3,
options = step1[["options"]],
initial_value = initial_value
)
if (engine == "cpp") {
decision_making <- decision_making_cpp
}
else {
decision_making <- decision_making_r
}
step5 <- decision_making(
mode = mode,
policy = policy,
data = step4,
options = step1[["options"]],
seed = seed,
sub_choose = sub_choose, rob_choose = rob_choose,
L_choice = L_choice, R_choice = R_choice,
L_reward = L_reward, R_reward = R_reward,
var1 = var1, var2 = var2,
initial_value = initial_value,
threshold = threshold,
lapse = lapse,
alpha = alpha,
beta = beta,
gamma = gamma,
eta = eta,
epsilon = epsilon,
lambda = lambda,
pi = pi,
tau = tau,
util_func = util_func,
rate_func = rate_func,
expl_func = expl_func,
bias_func = bias_func,
prob_func = prob_func
)
step6 <- model_fit(
data = step5,
loss_func = loss_func,
alpha = alpha,
beta = beta,
var1 = var1,
var2 = var2,
L_choice = L_choice,
R_choice = R_choice,
sub_choose = sub_choose
)
step7 <- digits(
data = step6,
options = step1[["options"]],
digits_1 = digits_1,
digits_2 = digits_2
)
step8 <- output(
mode = mode,
policy = policy,
data = step7,
name = name,
n_params = n_params,
n_trials = n_trials,
initial_value = initial_value,
threshold = threshold,
lapse = lapse,
alpha = alpha,
beta = beta,
gamma = gamma,
eta = eta,
epsilon = epsilon,
lambda = lambda,
pi = pi,
tau = tau,
priors = priors
)
step9 <- mode(
data = step8,
mode = mode,
sub_choose = sub_choose,
rob_choose = rob_choose,
raw_cols = raw_cols
)
final <- step9
return(final)
}