Title: Applied Neuro-Dynamic Programming in the Game of Chess
1. Applied Neuro-Dynamic Programming in the Game of Chess
2. Dynamic Programming (DP)
- Family of algorithms applied to problems where decisions are made in stages and a reward or cost is received at each stage that is additive over time
- Optimal control method
- Example
  - Traveling Salesman Problem
3. Bellman's Equation
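The equation on this slide did not survive the export. A standard form of Bellman's equation for the additive-reward problems described on the previous slide, given here as a hedged reconstruction rather than the slide's own notation, is

    J(s) = \max_{a \in A(s)} \Big[ g(s,a) + \sum_{s'} p(s' \mid s,a)\, J(s') \Big]

where A(s) is the set of admissible decisions in state s, g(s,a) is the stage reward, and p(s'|s,a) are the transition probabilities (a discount factor may multiply the second term).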
4. Key Aspects of DP
- Problem must be structured into overlapping sub-problems
- Storage and retrieval of intermediate results is necessary (tabular method)
- State space must be manageable
- Objective is to calculate numerically the state value function, J(s), and optimize the right-hand side of Bellman's equation so that the optimal decision can be made for any given state (see the sketch below)
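As a concrete illustration of the tabular method, the following is a minimal value-iteration sketch on a toy problem. The states, actions, rewards, and discount factor are invented for illustration and do not come from the slides.

    # Minimal tabular DP (value iteration) sketch on a toy problem.
    # transitions[state][action] = (reward, next_state)  -- illustrative only.
    transitions = {
        "A": {"left": (0.0, "A"), "right": (1.0, "B")},
        "B": {"left": (0.0, "A"), "right": (5.0, "C")},
        "C": {"left": (0.0, "C"), "right": (0.0, "C")},
    }
    gamma = 0.9                          # assumed discount factor
    J = {s: 0.0 for s in transitions}    # tabular storage of J(s)

    for _ in range(100):                 # iterate until (approximately) converged
        for s, actions in transitions.items():
            # Optimize the right-hand side of Bellman's equation for state s.
            J[s] = max(r + gamma * J[s2] for (r, s2) in actions.values())

    # The optimal decision for any state can then be read off the table.
    policy = {
        s: max(actions, key=lambda a: actions[a][0] + gamma * J[actions[a][1]])
        for s, actions in transitions.items()
    }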
5. Neuro-Dynamic Programming (NDP)
- Family of algorithms applied to DP-like problems with either a very large state-space or an unknown environmental model
- Sub-optimal control method
- Example
  - Backgammon (TD-Gammon)
6. Key Aspects of NDP
- Rather than calculating the optimal state value function, J(s), the objective is to calculate the approximate state value function J(s,w)
- Neural networks are used to represent J(s,w)
- Reinforcement learning is used to improve the decision-making policy
- Can be an on-line or off-line learning approach
- The Q-factors of the state-action value function, Q(s,a), could be calculated or approximated (Q(s,a,w)) instead of J(s,w)
7. The Game of Chess
- Played on an 8x8 board with 6 types of pieces per side (8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen and 1 king), each with its own rules of movement
- The two sides (black and white) alternate turns
- Goal is to capture the opposing side's king
[Figure: the initial position]
8. The Game of Chess
- Very complex, with approximately 10^40 states and 10^120 possible games
- Has clearly defined rules and is easy to simulate, making it an ideal problem for exploring and testing the ideas in NDP
- Despite recent successes in computer chess there is still much room for improvement, particularly in learning methodologies
9. The Problem
- Given any legal initial position, choose the move leading to the largest long-term reward
10. Bellman's Equation
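This slide's equation is not preserved in the text either. In the two-player, alternating-move setting of chess, a plausible reconstruction (not necessarily the author's exact notation) takes the minimax form

    J(s) = \max_{a \in A(s)} J\big(f(s,a)\big) \quad (\text{white to move}), \qquad
    J(s) = \min_{a \in A(s)} J\big(f(s,a)\big) \quad (\text{black to move}),

where f(s,a) is the position reached from s by move a and J at terminal positions equals the game outcome (the "last stage reward").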
11. A Theoretical Solution
- Solved with a direct implementation of the DP algorithm (a simple recursive implementation of Bellman's equation, e.g. the Minimax algorithm with last-stage reward evaluation)
- Results in an optimal solution, J(s)
- Computationally intractable (would take roughly 10^35 MB of memory and 10^17 centuries of calculation)
12. A Practical Solution
- Solved with a limited look-ahead version of the Minimax algorithm with approximated last-stage reward evaluation
- Results in a sub-optimal solution, J(s,w)
- Useful because an arbitrary amount of time or look-ahead can be allocated to the computation of the solution
13. The Minimax Algorithm
14. The Minimax Algorithm
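The diagrams on slides 13 and 14 are not preserved. The following is a minimal sketch of depth-limited Minimax with last-stage reward evaluation; legal_moves(pos), make_move(pos, m), is_terminal(pos), and evaluate(pos) stand for a hypothetical chess-position interface and are not defined on the slides.

    # Depth-limited Minimax with last-stage (leaf) evaluation.
    # legal_moves, make_move, is_terminal and evaluate are assumed helpers.
    def minimax(pos, depth, maximizing):
        if depth == 0 or is_terminal(pos):
            return evaluate(pos)              # last-stage reward evaluation
        if maximizing:
            return max(minimax(make_move(pos, m), depth - 1, False)
                       for m in legal_moves(pos))
        return min(minimax(make_move(pos, m), depth - 1, True)
                   for m in legal_moves(pos))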
15. Alpha-Beta Minimax
- By adding lower (alpha) and upper (beta) bounds on the possible range of scores a branch can return, based on scores from previously analyzed branches, complete branches can be removed from the look-ahead without being expanded (see the sketch below)
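A minimal alpha-beta sketch over the same hypothetical position interface as above; the bounds let whole branches be skipped once a cutoff is proven.

    def alphabeta(pos, depth, alpha, beta, maximizing):
        if depth == 0 or is_terminal(pos):
            return evaluate(pos)
        if maximizing:
            value = float("-inf")
            for m in legal_moves(pos):
                value = max(value, alphabeta(make_move(pos, m), depth - 1,
                                             alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:      # beta cutoff: opponent avoids this branch
                    break
            return value
        value = float("inf")
        for m in legal_moves(pos):
            value = min(value, alphabeta(make_move(pos, m), depth - 1,
                                         alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:          # alpha cutoff
                break
        return value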
16. Alpha-Beta Minimax with Move Ordering
- Works best when moves at each node are tried in a reasonably good order
- Use iterative deepening look-ahead
  - Rather than analyzing a position at an arbitrary Minimax depth of n, analyze iteratively and incrementally at depths 1, 2, 3, ..., n
  - Then try the best move from the previous iteration first in the next iteration
  - Counter-intuitive, but very good in practice! See the sketch below.
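A minimal iterative-deepening sketch, building on the alphabeta sketch above; the root is assumed (for illustration) to be the maximizing side.

    def iterative_deepening(pos, max_depth):
        best_move, best_score = None, float("-inf")
        for depth in range(1, max_depth + 1):
            moves = list(legal_moves(pos))
            if best_move in moves:             # previous iteration's best move first
                moves.remove(best_move)
                moves.insert(0, best_move)
            best_score = float("-inf")
            for m in moves:
                score = alphabeta(make_move(pos, m), depth - 1,
                                  float("-inf"), float("inf"), False)
                if score > best_score:
                    best_score, best_move = score, m
        return best_move, best_score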
17. Alpha-Beta Minimax with Move Ordering
- MVV/LVA (Most Valuable Victim, Least Valuable Attacker)
  - First sort all capture moves based on the value of the capturing piece and the value of the captured piece, then try them in that order
- Killer Moves
  - Next try moves that have caused an alpha or beta cutoff at the current depth in a previous iteration of iterative deepening
- History Moves (History Heuristic)
  - Finally try the rest of the moves based on their historical results over the entire course of the iterative-deepening Minimax algorithm, in order of their Q-factors (sort of); see the sketch below
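A minimal move-ordering sketch combining the three heuristics above. is_capture(m), victim(m), attacker(m), piece_value(p), and the killers/history tables are hypothetical names used only for illustration.

    def order_moves(moves, depth, killers, history):
        captures = [m for m in moves if is_capture(m)]
        quiets = [m for m in moves if not is_capture(m)]
        # MVV/LVA: most valuable victim first, least valuable attacker first.
        captures.sort(key=lambda m: (piece_value(victim(m)),
                                     -piece_value(attacker(m))), reverse=True)
        # Killer moves: quiet moves that caused a cutoff at this depth earlier.
        killer_moves = [m for m in quiets if m in killers.get(depth, [])]
        rest = [m for m in quiets if m not in killer_moves]
        # History heuristic: order the rest by accumulated cutoff statistics.
        rest.sort(key=lambda m: history.get(m, 0), reverse=True)
        return captures + killer_moves + rest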
18. Hash Tables
- Minimax alone is not a DP algorithm because it does not reuse previously computed results
- The Minimax algorithm frequently re-expands and recalculates the values of chess positions
- Zobrist hashing is an efficient method of storing scores of previously analyzed positions in a table for reuse (see the sketch below)
- Combined with hash tables, Minimax becomes a DP algorithm!
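A minimal Zobrist-hashing sketch. The board representation (a dict from square index to a piece code such as "wP") is an illustrative assumption; a full implementation would also hash side to move, castling rights, and en passant.

    import random

    random.seed(0)
    PIECES = ["wP", "wN", "wB", "wR", "wQ", "wK",
              "bP", "bN", "bB", "bR", "bQ", "bK"]
    # One random 64-bit key per (piece, square) pair.
    ZOBRIST = {(p, sq): random.getrandbits(64) for p in PIECES for sq in range(64)}

    def zobrist_hash(board):
        h = 0
        for sq, piece in board.items():
            h ^= ZOBRIST[(piece, sq)]
        return h

    # Transposition (hash) table keyed by the Zobrist hash; entries would
    # typically store a score, the search depth, and a bound type.
    transposition_table = {}

    def store(board, depth, score, bound):
        transposition_table[zobrist_hash(board)] = (depth, score, bound)

    def probe(board):
        return transposition_table.get(zobrist_hash(board))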
19. Minimal Window Alpha-Beta Minimax
- NegaScout/PVS (Principal Variation Search)
  - Expands the decision tree with the full alpha-beta window for the first move at each depth of recursion; subsequent expansions are performed with (alpha, alpha+1) bounds
  - Works best when moves are ordered well in an iterative deepening framework
- MTD(f) (Memory-enhanced Test Driver)
  - Very sophisticated; can be thought of as a binary search through the decision-tree space, continuously probing the state-space with an alpha-beta window of width 1 and adjusting the bounds accordingly
  - DP algorithm by design; requires a hash table
  - Works best with a good first guess f and well-ordered moves (see the sketch below)
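A minimal MTD(f) sketch; alphabeta_tt is assumed to be a transposition-table-backed alpha-beta search (as on the Hash Tables slide) and is not defined here.

    def mtdf(pos, first_guess, depth):
        g = first_guess
        lower, upper = float("-inf"), float("inf")
        while lower < upper:
            beta = g + 1 if g == lower else g      # zero-width (null) window
            g = alphabeta_tt(pos, depth, beta - 1, beta)
            if g < beta:
                upper = g      # probe failed low: g is a new upper bound
            else:
                lower = g      # probe failed high: g is a new lower bound
        return g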
20. Other Minimax Enhancements
- Quiescence Search
  - At leaf positions, run the Minimax search to conclusion while generating only capture moves at each position
  - Avoids an n-ply look-ahead terminating in the middle of a capture sequence and misevaluating the leaf position
  - Results in increased accuracy of the position evaluation, J(s,w) (see the sketch below)
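A minimal quiescence-search sketch in the same minimax style as the earlier sketches; capture_moves(pos) is another hypothetical interface function, and the "stand pat" score is the static evaluation J(s,w).

    def quiescence(pos, alpha, beta, maximizing):
        stand_pat = evaluate(pos)            # static evaluation, J(s,w)
        if maximizing:
            if stand_pat >= beta:
                return stand_pat
            alpha = max(alpha, stand_pat)
            for m in capture_moves(pos):
                alpha = max(alpha, quiescence(make_move(pos, m), alpha, beta, False))
                if alpha >= beta:
                    break
            return alpha
        if stand_pat <= alpha:
            return stand_pat
        beta = min(beta, stand_pat)
        for m in capture_moves(pos):
            beta = min(beta, quiescence(make_move(pos, m), alpha, beta, True))
            if alpha >= beta:
                break
        return beta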
21. Other Minimax Enhancements
- Null-Move Forward Pruning
  - In certain positions in the decision tree, let the current player pass the move to the other player and run the Minimax algorithm at a reduced look-ahead; if the score returned is still greater than the upper bound, it is assumed that if the current player had actually moved the resulting Minimax score would also be greater than the upper bound, so take the beta cutoff immediately (see the sketch below)
  - Results in an excellent reduction of the nodes expanded in the decision tree
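A minimal null-move test, written from the maximizing side's point of view (the minimizing side would mirror the test against alpha). pass_move(pos), which hands the move to the opponent, and the depth reduction R are assumptions; R = 2 is a common choice but is not from the slides.

    R = 2    # depth reduction for the null-move search (assumed value)

    def null_move_cutoff(pos, depth, beta, maximizing):
        """True if this node may take an immediate beta cutoff."""
        if not maximizing or depth <= R + 1:
            return False
        # Let the current player pass, then search with a reduced look-ahead.
        score = alphabeta(pass_move(pos), depth - 1 - R, beta - 1, beta, False)
        # If passing still scores above beta, a real move is assumed to as well.
        return score >= beta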
22. Other Minimax Enhancements
- Selective Extensions
  - At interesting positions in the decision tree, extend the look-ahead by additional stages
- Futility Pruning
  - Based on the alpha-beta values at leaf nodes, it can sometimes be reasonably assumed that even if the quiescence look-ahead were run it would still return a result lower than alpha, so take an alpha cutoff immediately (see the sketch below)
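A minimal futility test at a leaf node (maximizing side). FUTILITY_MARGIN is an assumed tuning constant, roughly a minor-piece value in centipawns, not a value from the slides.

    FUTILITY_MARGIN = 300    # assumed margin, in centipawns

    def futility_cutoff(pos, alpha):
        """True if the quiescence look-ahead can be skipped at this leaf:
        even the static score plus an optimistic margin cannot reach alpha."""
        return evaluate(pos) + FUTILITY_MARGIN <= alpha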
23. Evaluating a Position
- The approximate state (position) value function, J(s,w), can be approximated with a smoother feature value function J(f(s),w), where f(s) is the function that maps states into feature vectors
- This process is called feature extraction (see the sketch below)
- Could also calculate the approximate state-feature value function J(s,f(s),w)
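A minimal feature-extraction sketch. The particular features and helper functions (material, mobility, doubled_pawns, king_safety) are illustrative assumptions, and the linear J(f(s),w) shown for contrast is simpler than the MLP evaluators described on the following slides.

    def extract_features(pos):
        """f(s): map a position into a small numeric feature vector."""
        return [
            material(pos, "white") - material(pos, "black"),
            mobility(pos, "white") - mobility(pos, "black"),
            doubled_pawns(pos, "black") - doubled_pawns(pos, "white"),
            king_safety(pos, "white") - king_safety(pos, "black"),
        ]

    def J_linear(pos, w):
        """A linear J(f(s), w) for illustration only."""
        return sum(wi * fi for wi, fi in zip(w, extract_features(pos)))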
24. Evaluating a Position
- Most chess systems use only approximate DP when implementing the decision-making policy; that is, the weight vector w of J(·,w) is predefined and constant
- In a true NDP implementation the weight vector w is adjusted through reinforcements to improve the decision-making policy
25. Evaluating a Position
26. General Positional Evaluation Architecture
- White Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to white
  - One output indicating the favorability (+/-) of the white positional structure
- Black Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to black
  - One output indicating the favorability (+/-) of the black positional structure
- Final output is the difference between the two network outputs (see the sketch below)
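A minimal NumPy sketch of the two-network difference architecture. The layer sizes, tanh activation, and 32-dimensional input vectors are illustrative assumptions, not the author's values.

    import numpy as np

    def mlp(x, W1, b1, w2, b2):
        """One-hidden-layer, fully connected MLP with a single scalar output."""
        h = np.tanh(W1 @ x + b1)
        return float(w2 @ h + b2)

    def positional_score(x_white, x_black, white_params, black_params):
        """Final output = white network output minus black network output."""
        return mlp(x_white, *white_params) - mlp(x_black, *black_params)

    # Illustrative usage with random weights and 32-dimensional inputs.
    rng = np.random.default_rng(0)

    def make_params(n_in, n_hidden):
        return (rng.standard_normal((n_hidden, n_in)) * 0.1, np.zeros(n_hidden),
                rng.standard_normal(n_hidden) * 0.1, 0.0)

    score = positional_score(rng.standard_normal(32), rng.standard_normal(32),
                             make_params(32, 16), make_params(32, 16))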
27. Material Balance Evaluation Architecture
- Two simple linear tabular evaluators, one for white and one for black
28. Pawn Structure Evaluation Architecture
- White Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to white
  - One output indicating the favorability (+/-) of the white positional structure
- Black Approximator
  - Fully connected MLP neural network
  - Inputs of state and feature vectors specific to black
  - One output indicating the favorability (+/-) of the black positional structure
- Final output is the difference between the two network outputs
29. Overall Approximation Architecture
- Evaluation is partitioned into 3 phases of the game: opening, middle, and end
- Positional evaluator consists of 9 neural network evaluators and 3 tabular evaluators
30. The Learning Algorithm
- Reinforcement learning method
- Temporal difference learning
  - Use the difference of two time-successive approximations of the position value to adjust the weights of the neural networks
  - Value of the final position is a value suitably representative of the outcome of the game
31. The Learning Algorithm
- TD(λ)
  - Algorithm that applies the temporal-difference error correction to decisions arbitrarily far back in time, discounted by a factor of λ at each stage
  - λ must be in the interval [0,1] (see the update equation below)
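The update rule itself is not shown on the slide; a standard form of the TD(λ) weight update consistent with the description above (with \tilde{J}(s_t, w) the network's estimate of the value of the position at time t, \alpha the learning rate, and d_t the temporal-difference error) is

    \Delta w_t = \alpha \, d_t \sum_{k=1}^{t} \lambda^{\,t-k} \, \nabla_w \tilde{J}(s_k, w),
    \qquad d_t = \tilde{J}(s_{t+1}, w) - \tilde{J}(s_t, w),

with the value of the final position replaced by a value representative of the game outcome, as stated on slide 30.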
32. The Learning Algorithm
- Presentation of training samples is provided by the TD(λ) algorithm
- Weights for all networks are adjusted according to the backpropagation algorithm
[Equations not preserved: neuron j local field; neuron j output]
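The two equations referenced above did not survive the export; the standard definitions they refer to (with y_i the outputs feeding neuron j, w_{ji} the connecting weights including a bias term, and \varphi the activation function) are

    v_j = \sum_i w_{ji}\, y_i, \qquad y_j = \varphi(v_j).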
33. Self-Play Training vs. On-Line Play Training
- In self-play simulation the system plays itself to train the position-evaluator neural networks
- The move-selection policy should randomly select non-greedy actions a small percentage of the time (see the sketch below)
- System can be fully trained before deployment
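A minimal epsilon-greedy selection sketch for self-play; epsilon and value_of (the searched or approximated value of the position after a move) are assumptions, as is the position interface used in the earlier sketches.

    import random

    def select_move(pos, value_of, epsilon=0.05):
        moves = list(legal_moves(pos))
        if random.random() < epsilon:
            return random.choice(moves)    # occasional non-greedy, exploratory move
        return max(moves, key=lambda m: value_of(make_move(pos, m)))  # greedy move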
34. Self-Play Training vs. On-Line Play Training
- In on-line play the system plays other opponents to train the position-evaluator neural networks
- Requires no randomization of the decision-making policy, since the opponent provides sufficient exploration of the state-space
- System will be untrained initially at deployment
35. Results
36. Conclusion