#' @title Policy of Agent
#' @name policy
#' @description
#'
#' The use of the term "policy" here is debatable; its core meaning is
#' whether the model itself selects actions based on the probabilities it
#' estimates.
#'
#' @section Class:
#' \code{policy [Character]}
#'
#' @section Detail:
#' \itemize{
#' \item "On-Policy": The agent converts the expected value of
#' each action into a probability distribution using the soft-max
#' function. It then uses \code{sample()} to randomly select an
#' action to execute according to these estimated probabilities.
#' Under this mechanism, actions with higher expected values have a
#' greater likelihood of being selected. Once an action is performed,
#' the feedback received (reward or penalty) is used to update the
#' expected value of that action, which in turn influences the
#' probability of choosing different actions in the future.
#'
#' \item "Off-Policy": The agent directly replicates human
#' behavior. Consequently, in most cases, this ensures that the
#' rewards obtained by the agent in each trial are identical to those
#' obtained by the human. This also results in the value update
#' trajectories for different actions being exactly the same as the
#' trajectories experienced by the human. In this scenario, a previous
#' choice does not influence subsequent value updates. Because all
#' actions are copied from the human, the trajectory of value updates
#' will not diverge due to differences in individual samples.
#' Essentially, in this specific case, the \code{sample()} step does
#' not exist.
#' }
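#'
#' @section Example:
#' A minimal sketch of the on-policy selection step described above.
#' The variable names are illustrative, not part of any package API:
#' \preformatted{
#'   values <- c(left = 0.2, right = 0.8)       # expected value of each action
#'   probs  <- exp(values) / sum(exp(values))   # soft-max over expected values
#'   action <- sample(names(values), size = 1, prob = probs)
#'   # Off-policy skips this sampling step: the human's recorded
#'   # choice is used directly as the executed action.
#' }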
#'
#' @section Metaphor:
#' \itemize{
#' \item "On-Policy": The agent completes an examination paper independently
#' and then checks its answers against the ground truth to see if they
#' are correct. If it makes a mistake, it re-attempts the task
#' (adjusting the input parameters). This process repeats until its
#' answers are sufficiently close to the standard answers, or until
#' the degree of similarity can no longer be improved. In other words,
#' the agent has found the optimal parameters within the given model
#' to imitate human behavior as closely as possible.
#' \item "Off-Policy": The agent sees the standard answers to the exam
#' directly. It does not personally complete any of the papers;
#' instead, it acts as an observer trying to understand the underlying
#' logic behind the standard answers. Even if there are a few
#' answers that the agent cannot understand at all, it will
#' ignore these outliers in order to maximize its overall accuracy.
#' }
#'
NULL