ttt_qlearn: Q-Learning for Training Tic-Tac-Toe AI


View source: R/ttt_qlearn.R

Description

Train a tic-tac-toe AI through Q-learning

Usage

ttt_qlearn(player, N = 1000L, epsilon = 0.1, alpha = 0.8, gamma = 0.99,
  simulate = TRUE, sim_every = 250L, N_sim = 1000L, verbose = TRUE)

Arguments

player

AI player to train

N

number of episodes, i.e., training games

epsilon

fraction of random exploration moves

alpha

learning rate

gamma

discount factor

simulate

if TRUE, conduct simulations during training

sim_every

conduct simulation after this many training games

N_sim

number of simulation games

verbose

if TRUE, progress reports are shown

Details

This function implements Q-learning to train a tic-tac-toe AI player. It is designed to train one AI player, which plays against itself to update its value and policy functions.

The employed algorithm is Q-learning with epsilon greedy. For each state s, the player updates its value evaluation by

V(s) = (1-α) V(s) + α γ max_s' V(s')

if it is the first player's turn. If it is the other player's turn, max is replaced by min. Here s' spans all states reachable from s. The policy function is updated analogously: the policy at s becomes the action leading to the s' that maximizes (or minimizes) V(s'). The parameter α controls the learning rate, and γ is the discount factor (an earlier win is worth more than a later one).
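
As an illustration, here is a minimal, self-contained R sketch of this update over a value table. This is not the package's internal code; update_value, V, and the state labels are hypothetical.

update_value <- function(V, state, next_states, alpha = 0.8, gamma = 0.99,
                         first_player_turn = TRUE) {
  # values of all states reachable from `state`
  candidates <- unlist(V[next_states], use.names = FALSE)
  # the first player maximizes, the second player minimizes
  best <- if (first_player_turn) max(candidates) else min(candidates)
  V[[state]] <- (1 - alpha) * V[[state]] + alpha * gamma * best
  V
}

V <- list(s0 = 0, s1 = 10, s2 = -5)
V <- update_value(V, "s0", c("s1", "s2"))  # V[["s0"]] becomes 0.8 * 0.99 * 10 = 7.92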

Then the player chooses its next action by the ε-greedy method: it follows its policy with probability 1-ε and chooses a random action with probability ε. ε thus controls the fraction of explorative moves.
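
A small sketch of the ε-greedy choice (again illustrative, not the package's internals; choose_action and its arguments are hypothetical):

choose_action <- function(policy_action, legal_actions, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(legal_actions, 1L)  # explorative: random legal move
  } else {
    policy_action              # exploitative: follow the current policy
  }
}

choose_action(policy_action = 5L, legal_actions = c(1L, 3L, 5L, 9L))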

At the end of a game, the player sets the value of the final state to 100 (if the first player wins), -100 (if the second player wins), or 0 (if the game is a draw).
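
These terminal values could be written as, for example (illustrative only; the result labels are hypothetical):

terminal_value <- function(result) {
  switch(result, first = 100, second = -100, draw = 0)
}
terminal_value("draw")  # 0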

This learning process is repeated for N training games. When simulate is TRUE, a simulation is conducted after every sim_every training games. This is useful for observing the progress of training; in general, as the AI gets smarter, games end in a draw more often.
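
For example, the documented arguments can be combined to watch progress during training by simulating 1000 games after every 100 training games:

p   <- ttt_ai()
res <- ttt_qlearn(p, N = 500, simulate = TRUE, sim_every = 100, N_sim = 1000)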

See Sutton and Barto (1998) for more about Q-learning.

Value

data.frame of simulation outcomes, if simulations were conducted

References

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press (1998).

Examples

p <- ttt_ai()
o <- ttt_qlearn(p, N = 200)
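
To inspect the simulation results returned above (a data.frame whose exact columns depend on the package version), base R works:

str(o)   # overview of the recorded simulation outcomes
head(o)  # first few rows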
