Title: Applied Neuro-Dynamic Programming in the Game of Chess
1. Applied Neuro-Dynamic Programming in the Game of Chess
2. Dynamic Programming (DP)
- Family of algorithms applied to problems where decisions are made in stages and a reward or cost is received at each stage that is additive over time
- Optimal control method
- Example
  - Traveling Salesman Problem
3. Bellman's Equation
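The equation on this slide did not survive the export. A standard form of Bellman's equation for the additive-reward problems described on the previous slide, given here as a hedged reconstruction rather than the slide's own notation, is

    J(s) = \max_{a \in A(s)} \Big[ g(s,a) + \sum_{s'} p(s' \mid s,a)\, J(s') \Big]

where A(s) is the set of admissible decisions in state s, g(s,a) is the stage reward, and p(s'|s,a) are the transition probabilities (a discount factor may multiply the second term).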
4. Key Aspects of DP
- Problem must be structured into overlapping sub-problems
- Storage and retrieval of intermediate results is necessary (tabular method)
- State space must be manageable
- Objective is to calculate numerically the state value function, J(s), and optimize the right-hand side of Bellman's equation so that the optimal decision can be made for any given state (see the sketch below)
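As a concrete illustration of the tabular method, the following is a minimal value-iteration sketch on a toy problem. The states, actions, rewards, and discount factor are invented for illustration and do not come from the slides.

    # Minimal tabular DP (value iteration) sketch on a toy problem.
    # transitions[state][action] = (reward, next_state)  -- illustrative only.
    transitions = {
        "A": {"left": (0.0, "A"), "right": (1.0, "B")},
        "B": {"left": (0.0, "A"), "right": (5.0, "C")},
        "C": {"left": (0.0, "C"), "right": (0.0, "C")},
    }
    gamma = 0.9                          # assumed discount factor
    J = {s: 0.0 for s in transitions}    # tabular storage of J(s)

    for _ in range(100):                 # iterate until (approximately) converged
        for s, actions in transitions.items():
            # Optimize the right-hand side of Bellman's equation for state s.
            J[s] = max(r + gamma * J[s2] for (r, s2) in actions.values())

    # The optimal decision for any state can then be read off the table.
    policy = {
        s: max(actions, key=lambda a: actions[a][0] + gamma * J[actions[a][1]])
        for s, actions in transitions.items()
    }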
5. Neuro-Dynamic Programming (NDP)
- Family of algorithms applied to DP-like problems with either a very large state-space or an unknown environmental model
- Sub-optimal control method
- Example
  - Backgammon (TD-Gammon)
6. Key Aspects of NDP
- Rather than calculating the optimal state value function, J(s), the objective is to calculate the approximate state value function J(s,w)
- Neural networks are used to represent J(s,w)
- Reinforcement learning is used to improve the decision-making policy
- Can be an on-line or off-line learning approach
- The Q-factors of the state-action value function, Q(s,a), could be calculated or approximated (Q(s,a,w)) instead of J(s,w)
7. The Game of Chess
- Played on an 8x8 board with 6 types of pieces per side (8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen and 1 king), each with its own rules of movement
- The two sides (black and white) alternate turns
- Goal is to capture the opposing side's king
[Figure: the initial position]
8. The Game of Chess
- Very complex, with approximately 10^40 states and 10^120 possible games
- Has clearly defined rules and is easy to simulate, making it an ideal problem for exploring and testing the ideas in NDP
- Despite recent successes in computer chess there is still much room for improvement, particularly in learning methodologies
9. The Problem
- Given any legal initial position, choose the move leading to the largest long-term reward
10. Bellman's Equation
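This slide's equation is not preserved in the text either. In the two-player, alternating-move setting of chess, a plausible reconstruction (not necessarily the author's exact notation) takes the minimax form

    J(s) = \max_{a \in A(s)} J\big(f(s,a)\big) \quad (\text{white to move}), \qquad
    J(s) = \min_{a \in A(s)} J\big(f(s,a)\big) \quad (\text{black to move}),

where f(s,a) is the position reached from s by move a and J at terminal positions equals the game outcome (the "last stage reward").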
11. A Theoretical Solution
- Solved with a direct implementation of the DP algorithm (a simple recursive implementation of Bellman's equation, e.g. the Minimax algorithm with last-stage reward evaluation)
- Results in an optimal solution, J(s)
- Computationally intractable (would take roughly 10^35 MB of memory and 10^17 centuries of calculation)
12. A Practical Solution
- Solved with a limited look-ahead version of the Minimax algorithm with approximated last-stage reward evaluation
- Results in a sub-optimal solution, J(s,w)
- Useful because an arbitrary amount of time or look-ahead can be allocated to the computation of the solution
13. The Minimax Algorithm
14. The Minimax Algorithm
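The diagrams on slides 13 and 14 are not preserved. The following is a minimal sketch of depth-limited Minimax with last-stage reward evaluation; legal_moves(pos), make_move(pos, m), is_terminal(pos), and evaluate(pos) stand for a hypothetical chess-position interface and are not defined on the slides.

    # Depth-limited Minimax with last-stage (leaf) evaluation.
    # legal_moves, make_move, is_terminal and evaluate are assumed helpers.
    def minimax(pos, depth, maximizing):
        if depth == 0 or is_terminal(pos):
            return evaluate(pos)              # last-stage reward evaluation
        if maximizing:
            return max(minimax(make_move(pos, m), depth - 1, False)
                       for m in legal_moves(pos))
        return min(minimax(make_move(pos, m), depth - 1, True)
                   for m in legal_moves(pos))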
15. Alpha-Beta Minimax
- By adding lower (alpha) and upper (beta) bounds on the possible range of scores a branch can return, based on scores from previously analyzed branches, complete branches can be removed from the look-ahead without being expanded (see the sketch below)
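A minimal alpha-beta sketch over the same hypothetical position interface as above; the bounds let whole branches be skipped once a cutoff is proven.

    def alphabeta(pos, depth, alpha, beta, maximizing):
        if depth == 0 or is_terminal(pos):
            return evaluate(pos)
        if maximizing:
            value = float("-inf")
            for m in legal_moves(pos):
                value = max(value, alphabeta(make_move(pos, m), depth - 1,
                                             alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:      # beta cutoff: opponent avoids this branch
                    break
            return value
        value = float("inf")
        for m in legal_moves(pos):
            value = min(value, alphabeta(make_move(pos, m), depth - 1,
                                         alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:          # alpha cutoff
                break
        return value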
16. Alpha-Beta Minimax with Move Ordering
- Works best when moves at each node are tried in a reasonably good order
- Use iterative deepening look-ahead
  - Rather than analyzing a position at an arbitrary Minimax depth of n, analyze iteratively and incrementally at depths 1, 2, 3, ..., n
  - Then try the best move from the previous iteration first in the next iteration
  - Counter-intuitive, but very good in practice! See the sketch below.
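A minimal iterative-deepening sketch, building on the alphabeta sketch above; the root is assumed (for illustration) to be the maximizing side.

    def iterative_deepening(pos, max_depth):
        best_move, best_score = None, float("-inf")
        for depth in range(1, max_depth + 1):
            moves = list(legal_moves(pos))
            if best_move in moves:             # previous iteration's best move first
                moves.remove(best_move)
                moves.insert(0, best_move)
            best_score = float("-inf")
            for m in moves:
                score = alphabeta(make_move(pos, m), depth - 1,
                                  float("-inf"), float("inf"), False)
                if score > best_score:
                    best_score, best_move = score, m
        return best_move, best_score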
17. Alpha-Beta Minimax with Move Ordering
- MVV/LVA (Most Valuable Victim, Least Valuable Attacker)
  - First sort all capture moves based on the value of the capturing piece and the value of the captured piece, then try them in that order
- Killer Moves
  - Next try moves that have caused an alpha or beta cutoff at the current depth in a previous iteration of iterative deepening
- History Moves (History Heuristic)
  - Finally try the rest of the moves based on their historical results over the entire course of the iterative-deepening Minimax algorithm, in order of their Q-factors (sort of); see the sketch below
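A minimal move-ordering sketch combining the three heuristics above. is_capture(m), victim(m), attacker(m), piece_value(p), and the killers/history tables are hypothetical names used only for illustration.

    def order_moves(moves, depth, killers, history):
        captures = [m for m in moves if is_capture(m)]
        quiets = [m for m in moves if not is_capture(m)]
        # MVV/LVA: most valuable victim first, least valuable attacker first.
        captures.sort(key=lambda m: (piece_value(victim(m)),
                                     -piece_value(attacker(m))), reverse=True)
        # Killer moves: quiet moves that caused a cutoff at this depth earlier.
        killer_moves = [m for m in quiets if m in killers.get(depth, [])]
        rest = [m for m in quiets if m not in killer_moves]
        # History heuristic: order the rest by accumulated cutoff statistics.
        rest.sort(key=lambda m: history.get(m, 0), reverse=True)
        return captures + killer_moves + rest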
18. Hash Tables
- Minimax alone is not a DP algorithm because it does not reuse previously computed results
- The Minimax algorithm frequently re-expands and recalculates the values of chess positions
- Zobrist hashing is an efficient method of storing scores of previously analyzed positions in a table for reuse (see the sketch below)
- Combined with hash tables, Minimax becomes a DP algorithm!
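A minimal Zobrist-hashing sketch. The board representation (a dict from square index to a piece code such as "wP") is an illustrative assumption; a full implementation would also hash side to move, castling rights, and en passant.

    import random

    random.seed(0)
    PIECES = ["wP", "wN", "wB", "wR", "wQ", "wK",
              "bP", "bN", "bB", "bR", "bQ", "bK"]
    # One random 64-bit key per (piece, square) pair.
    ZOBRIST = {(p, sq): random.getrandbits(64) for p in PIECES for sq in range(64)}

    def zobrist_hash(board):
        h = 0
        for sq, piece in board.items():
            h ^= ZOBRIST[(piece, sq)]
        return h

    # Transposition (hash) table keyed by the Zobrist hash; entries would
    # typically store a score, the search depth, and a bound type.
    transposition_table = {}

    def store(board, depth, score, bound):
        transposition_table[zobrist_hash(board)] = (depth, score, bound)

    def probe(board):
        return transposition_table.get(zobrist_hash(board))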
19. Minimal Window Alpha-Beta Minimax
- NegaScout/PVS (Principal Variation Search)
  - Expands the decision tree with the full alpha-beta window for the first move at each depth of recursion; subsequent expansions are performed with (alpha, alpha+1) bounds
  - Works best when moves are ordered well in an iterative deepening framework
- MTD(f) (Memory-enhanced Test Driver)
  - Very sophisticated; can be thought of as a binary search through the decision-tree space, continuously probing the state-space with an alpha-beta window of width 1 and adjusting the bounds accordingly
  - DP algorithm by design; requires a hash table
  - Works best with a good first guess f and well-ordered moves (see the sketch below)
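A minimal MTD(f) sketch; alphabeta_tt is assumed to be a transposition-table-backed alpha-beta search (as on the Hash Tables slide) and is not defined here.

    def mtdf(pos, first_guess, depth):
        g = first_guess
        lower, upper = float("-inf"), float("inf")
        while lower < upper:
            beta = g + 1 if g == lower else g      # zero-width (null) window
            g = alphabeta_tt(pos, depth, beta - 1, beta)
            if g < beta:
                upper = g      # probe failed low: g is a new upper bound
            else:
                lower = g      # probe failed high: g is a new lower bound
        return g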
20. Other Minimax Enhancements
- Quiescence Search
  - At leaf positions, run the Minimax search to conclusion while generating only capture moves at each position
  - Avoids an n-ply look-ahead terminating in the middle of a capture sequence and misevaluating the leaf position
  - Results in increased accuracy of the position evaluation, J(s,w) (see the sketch below)
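A minimal quiescence-search sketch in the same minimax style as the earlier sketches; capture_moves(pos) is another hypothetical interface function, and the "stand pat" score is the static evaluation J(s,w).

    def quiescence(pos, alpha, beta, maximizing):
        stand_pat = evaluate(pos)            # static evaluation, J(s,w)
        if maximizing:
            if stand_pat >= beta:
                return stand_pat
            alpha = max(alpha, stand_pat)
            for m in capture_moves(pos):
                alpha = max(alpha, quiescence(make_move(pos, m), alpha, beta, False))
                if alpha >= beta:
                    break
            return alpha
        if stand_pat <= alpha:
            return stand_pat
        beta = min(beta, stand_pat)
        for m in capture_moves(pos):
            beta = min(beta, quiescence(make_move(pos, m), alpha, beta, True))
            if alpha >= beta:
                break
        return beta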
21. Other Minimax Enhancements
- Null-Move Forward Pruning
  - In certain positions in the decision tree, let the current player pass the move to the other player and run the Minimax algorithm at a reduced look-ahead; if the score returned is still greater than the upper bound, it is assumed that if the current player had actually moved the resulting Minimax score would also be greater than the upper bound, so take the beta cutoff immediately (see the sketch below)
  - Results in an excellent reduction of the nodes expanded in the decision tree
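A minimal null-move test, written from the maximizing side's point of view (the minimizing side would mirror the test against alpha). pass_move(pos), which hands the move to the opponent, and the depth reduction R are assumptions; R = 2 is a common choice but is not from the slides.

    R = 2    # depth reduction for the null-move search (assumed value)

    def null_move_cutoff(pos, depth, beta, maximizing):
        """True if this node may take an immediate beta cutoff."""
        if not maximizing or depth <= R + 1:
            return False
        # Let the current player pass, then search with a reduced look-ahead.
        score = alphabeta(pass_move(pos), depth - 1 - R, beta - 1, beta, False)
        # If passing still scores above beta, a real move is assumed to as well.
        return score >= beta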
22. Other Minimax Enhancements
- Selective Extensions
  - At interesting positions in the decision tree, extend the look-ahead by additional stages
- Futility Pruning
  - Based on the alpha-beta values at leaf nodes, it can sometimes be reasonably assumed that even if the quiescence look-ahead were run it would still return a result lower than alpha, so take an alpha cutoff immediately (see the sketch below)
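A minimal futility test at a leaf node (maximizing side). FUTILITY_MARGIN is an assumed tuning constant, roughly a minor-piece value in centipawns, not a value from the slides.

    FUTILITY_MARGIN = 300    # assumed margin, in centipawns

    def futility_cutoff(pos, alpha):
        """True if the quiescence look-ahead can be skipped at this leaf:
        even the static score plus an optimistic margin cannot reach alpha."""
        return evaluate(pos) + FUTILITY_MARGIN <= alpha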
23. Evaluating a Position
- The approximate state (position) value function, J(s,w), can be approximated with a smoother feature value function J(f(s),w), where f(s) is the function that maps states into feature vectors
- This process is called feature extraction (see the sketch below)
- Could also calculate the approximate state-feature value function J(s,f(s),w)
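A minimal feature-extraction sketch. The particular features and helper functions (material, mobility, doubled_pawns, king_safety) are illustrative assumptions, and the linear J(f(s),w) shown for contrast is simpler than the MLP evaluators described on the following slides.

    def extract_features(pos):
        """f(s): map a position into a small numeric feature vector."""
        return [
            material(pos, "white") - material(pos, "black"),
            mobility(pos, "white") - mobility(pos, "black"),
            doubled_pawns(pos, "black") - doubled_pawns(pos, "white"),
            king_safety(pos, "white") - king_safety(pos, "black"),
        ]

    def J_linear(pos, w):
        """A linear J(f(s), w) for illustration only."""
        return sum(wi * fi for wi, fi in zip(w, extract_features(pos)))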
24. Evaluating a Position
- Most chess systems use only approximate DP when implementing the decision-making policy; that is, the weight vector w of J(·,w) is predefined and constant
- In a true NDP implementation the weight vector w is adjusted through reinforcements to improve the decision-making policy
25. Evaluating a Position
26. General Positional Evaluation Architecture
- White Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to white
  - One output indicating the favorability (+/-) of the white positional structure
- Black Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to black
  - One output indicating the favorability (+/-) of the black positional structure
- Final output is the difference between the two network outputs (see the sketch below)
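A minimal NumPy sketch of the two-network difference architecture. The layer sizes, tanh activation, and 32-dimensional input vectors are illustrative assumptions, not the author's values.

    import numpy as np

    def mlp(x, W1, b1, w2, b2):
        """One-hidden-layer, fully connected MLP with a single scalar output."""
        h = np.tanh(W1 @ x + b1)
        return float(w2 @ h + b2)

    def positional_score(x_white, x_black, white_params, black_params):
        """Final output = white network output minus black network output."""
        return mlp(x_white, *white_params) - mlp(x_black, *black_params)

    # Illustrative usage with random weights and 32-dimensional inputs.
    rng = np.random.default_rng(0)

    def make_params(n_in, n_hidden):
        return (rng.standard_normal((n_hidden, n_in)) * 0.1, np.zeros(n_hidden),
                rng.standard_normal(n_hidden) * 0.1, 0.0)

    score = positional_score(rng.standard_normal(32), rng.standard_normal(32),
                             make_params(32, 16), make_params(32, 16))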
27. Material Balance Evaluation Architecture
- Two simple linear tabular evaluators, one for white and one for black
28. Pawn Structure Evaluation Architecture
- White Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to white
  - One output indicating the favorability (+/-) of the white positional structure
- Black Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to black
  - One output indicating the favorability (+/-) of the black positional structure
- Final output is the difference between the two network outputs
29. Overall Approximation Architecture
- Evaluation is partitioned into 3 phases of the game: opening, middle, and end
- Positional evaluator consists of 9 neural network evaluators and 3 tabular evaluators
30. The Learning Algorithm
- Reinforcement learning method
- Temporal difference learning
  - Use the difference of two time-successive approximations of the position value to adjust the weights of the neural networks
  - Value of the final position is a value suitably representative of the outcome of the game
31. The Learning Algorithm
- TD(λ)
  - Algorithm that applies the temporal-difference error correction to decisions arbitrarily far back in time, discounted by a factor of λ at each stage
  - λ must be in the interval [0,1] (see the update equation below)
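The update rule itself is not shown on the slide; a standard form of the TD(λ) weight update consistent with the description above (with \tilde{J}(s_t, w) the network's estimate of the value of the position at time t, \alpha the learning rate, and d_t the temporal-difference error) is

    \Delta w_t = \alpha \, d_t \sum_{k=1}^{t} \lambda^{\,t-k} \, \nabla_w \tilde{J}(s_k, w),
    \qquad d_t = \tilde{J}(s_{t+1}, w) - \tilde{J}(s_t, w),

with the value of the final position replaced by a value representative of the game outcome, as stated on slide 30.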
32. The Learning Algorithm
- Presentation of training samples is provided by the TD(λ) algorithm
- Weights for all networks are adjusted according to the backpropagation algorithm
[Equations not preserved: neuron j local field; neuron j output]
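The two equations referenced above did not survive the export; the standard definitions they refer to (with y_i the outputs feeding neuron j, w_{ji} the connecting weights including a bias term, and \varphi the activation function) are

    v_j = \sum_i w_{ji}\, y_i, \qquad y_j = \varphi(v_j).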
33. Self-Play Training vs. On-Line Play Training
- In self-play simulation the system plays itself to train the position-evaluator neural networks
- The move-selection policy should randomly select non-greedy actions a small percentage of the time (see the sketch below)
- System can be fully trained before deployment
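A minimal epsilon-greedy selection sketch for self-play; epsilon and value_of (the searched or approximated value of the position after a move) are assumptions, as is the position interface used in the earlier sketches.

    import random

    def select_move(pos, value_of, epsilon=0.05):
        moves = list(legal_moves(pos))
        if random.random() < epsilon:
            return random.choice(moves)    # occasional non-greedy, exploratory move
        return max(moves, key=lambda m: value_of(make_move(pos, m)))  # greedy move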
34. Self-Play Training vs. On-Line Play Training
- In on-line play the system plays other opponents to train the position-evaluator neural networks
- Requires no randomization of the decision-making policy, since the opponent provides sufficient exploration of the state-space
- System will be untrained initially at deployment
35. Results
36. Conclusion