1 Likely-Admissible Sub-symbolic Heuristics
26-08-2004 Valencia
- Marco Ernandes
- Cognitive Science PhD Student
- Email ernandes_at_dii.unisi.it
- Web www.dii.unisi.it/ernandes
- Marco Gori
- Professor of Computer Science
- Email marco_at_dii.unisi.it
- Web www.dii.unisi.it/marco
2 Heuristic Search
- Search algorithms
- A*, IDA*, BS*, ...
- Heuristic information
- h(n) → typically an estimate of the distance from node n to the goal
- Heuristic usage policy
- How to combine h(n) and g(n) to obtain f(n) (sketch below)
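A minimal sketch of how f(n) = g(n) + h(n) drives a best-first search such as A*; the data structures and function names here are illustrative assumptions, not the implementation used in the talk.

```python
import heapq, itertools

def a_star(start, goal_test, successors, h):
    """Generic A* sketch: nodes are expanded in order of f(n) = g(n) + h(n)."""
    tie = itertools.count()                                # heap tie-breaker
    frontier = [(h(start), next(tie), 0, start, [start])]  # (f, tie, g, state, path)
    best_g = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue                                       # stale queue entry
        if goal_test(state):
            return path                                    # optimal if h is admissible
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):         # cheaper path to nxt found
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), next(tie), g2, nxt, path + [nxt]))
    return None
```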
3 Optimal Search for NP Problems
- 2 approaches
- Rigid admissibility
- requires optimistic heuristics
- ALWAYS retrieves optimal solutions: C = C*
- Relaxed admissibility
- ε-admissible search (e.g. WA*)
- retrieves solutions with bounded costs: C ≤ (1+ε)·C*
- the problem is no longer NP-complete
4 Two families of heuristics
- Online heuristics
- The h(n) value is computed during search, when a node is visited.
- An AI classic: Manhattan Distance (see the sketch below).
- Memory-based heuristics
- Offline phase: resolution of all possible subproblems and storage of all the results.
- Online phase: decomposition of a node into subproblems and database querying.
- Successfully used for rigid admissibility.
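For the sliding-tile puzzles used throughout the talk, the classic online heuristic is Manhattan Distance. A minimal sketch, assuming states are tuples where tile t belongs on square t and 0 is the blank (the layout convention is an assumption, not the authors' code):

```python
def manhattan(state, n):
    """Sum of the horizontal + vertical distances of every tile (blank excluded)
    from its goal square, for an n x n sliding-tile puzzle."""
    dist = 0
    for square, tile in enumerate(state):
        if tile == 0:                      # the blank tile is not counted
            continue
        dist += abs(square // n - tile // n) + abs(square % n - tile % n)
    return dist

# 8-puzzle example: two tiles are one step away from home -> hM = 2
print(manhattan((1, 2, 0, 3, 4, 5, 6, 7, 8), 3))
```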
5 Online heuristic research
- How to improve Manhattan estimations?
- Working on its main bias: locality.
- Manhattan considers each piece of the problem as completely independent from the rest.
- Hence it has no way to determine how tiles influence each other: hM ≤ h* → GAP.
- Manhattan does not consider the influence of the blank tile.
(Figure: an example position with hM = 3 but h* = 11, due to tile conflicts.)
6 Online heuristic research
- How to improve Manhattan estimations?
- 1) Manhattan Correction (Hansson et al., 1992)
- The idea is to increment the estimation with ad hoc techniques, maintaining admissibility.
- 2) ABSOLVER approach (Prieditis, 1989)
- Automatically inventing admissible heuristics through constraint elimination.
- 3) Higher-Order Heuristics (Korf, 1996)
- Generalizing Manhattan by considering subproblems of a configuration instead of single elements.
7 Manhattan Corrections
- Hansson et al., 1992:
- Linear Conflicts
- Corner Tiles
- Last Moves
- Ernandes, 2003:
- Non-Linear Conflicts
- Corner Deduction
- First Moves
- Conflict Deduction (Ernandes, 2003): the combination of all the techniques above.
8 Examples
- Linear Conflicts: computes conflicts on the same row/column.
- Corner Tiles: computes conflicts thanks to corner properties.
- Last Moves: computes the last two moves needed to complete the puzzle.
- Non-Linear Conflicts: computes conflicts on different rows/columns (two types).
- Corner Deduction: as Corner Tiles, but with correct tiles on the diagonal.
9 Conflict Deduction
- It is more convenient to implement the various techniques separately.
- We cannot simply add all the corrections together → inadmissibility!
- If one tile is involved in more than one conflict, it counts only once.
- To maximize the estimation we use, for each tile, the technique that gives the highest contribution (see the sketch below).
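A sketch of the combination rule just described: each tile contributes at most once, taking the technique that gives it the highest correction. The technique names and their per-tile outputs below are hypothetical placeholders, not the actual implementations.

```python
def conflict_deduction(corrections):
    """Combine the corrections of several techniques as described above:
    for every tile keep only the largest contribution, then sum.
    `corrections` maps technique name -> {tile: extra_moves beyond Manhattan}."""
    best = {}
    for per_tile in corrections.values():
        for tile, extra in per_tile.items():
            best[tile] = max(best.get(tile, 0), extra)
    return sum(best.values())            # value to be added on top of hM

# hypothetical per-tile outputs of the individual techniques for one state
example = {
    "linear_conflicts": {5: 2, 7: 2},
    "corner_tiles":     {7: 2},
    "last_moves":       {12: 2},
}
print(conflict_deduction(example))       # tiles 5, 7, 12 -> 2 + 2 + 2 = 6
```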
10 Higher-Order Heuristics
- Ad hoc techniques generate strongly problem-dependent heuristics.
- They are not sufficient to attack bigger problems such as the 24-puzzle.
- Manhattan has to be generalized in another way, considering the distance-to-goal of several elements (tiles) taken together.
- First example → Pairwise Distances: instead of computing the distance of 1 independent tile, we use couples of tiles.
11 Higher-Order Heuristics: problems
- We can strengthen the Pairwise Distance by computing it for all possible tile couples and then seeking the combination that maximizes the estimation → a Maximum Weighted Matching Problem.
- PD remains poorly informed. We would need triples of tiles, but the Matching Problem becomes NP-complete (Korf, 1996).
- Hence the only Higher-Order Heuristic that can be used efficiently online is the Pairwise Distance, which is too poor → less informed than Conflict Deduction!
12 From Higher-Order Heuristics to Memory-based heuristics
- Higher-Order Heuristics could ignore the maximization problem and consider pre-designed tile groups (and increase their size).
- Solving subproblems of 3 or more tiles (patterns) is too expensive during search: we need to do this offline.
13 Disjoint Pattern Databases (Korf & Taylor, 2002)
- Additive version of Pattern Databases (Culberson & Schaeffer, 1996) where patterns are considered independently.
- Manhattan is the simplest Disjoint Pattern DB: 1 tile = 1 pattern. DPDBs, unlike PDBs, always dominate Manhattan (see the lookup sketch below).
- On the 15-puzzle they perform 75 times faster than non-additive PDBs, and their DB generation is much easier because distances can be computed backwards by disarranging the patterns.
- Different DPDBs can be combined taking the argmax: global speedup over Manhattan ≈2000, space reduction ≈1/3000.
(Figure: two example disjoint tile partitions, DPDB 1 and DPDB 2.)
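A minimal sketch of the additive lookup a DPDB performs at search time: each disjoint group of tiles indexes a precomputed table and the stored costs are summed. The table layout and pattern choice here are assumptions, not the authors' data structures.

```python
def dpdb_heuristic(state, patterns, tables):
    """Additive disjoint-pattern lookup.
    `patterns` is a list of disjoint tile groups, e.g. [(1, 2, 3), (4, 5, 6), ...];
    `tables[i]` maps the tuple of current positions of pattern i's tiles to the
    number of moves needed (counting only those tiles) to bring them home."""
    position = {tile: square for square, tile in enumerate(state)}  # tile -> square
    total = 0
    for pattern, table in zip(patterns, tables):
        key = tuple(position[tile] for tile in pattern)
        total += table[key]      # additivity relies on the patterns being disjoint
    return total
```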
14 DPDBs and the 24-puzzle
- This technique solved the 24-puzzle between 1.1 and 21 times faster than classic Higher-Order Heuristics (avg. 2 days).
- But in many cases using more nodes!
- This technique evidently does not scale with the problem dimension.
- Maintaining the same time complexity for the 35-puzzle would require increasing the number of DB entries from 10^13 to 10^28.
15 Criticizing the classic approach
- We believe that it is more sensible to investigate the combination of online heuristics + relaxed admissibility.
- A) Because rigid admissibility gives no chance to face problems of greater dimensions.
- Online admissible heuristics → NP-hard in time
- Memory-based admissible heuristics → NP-hard in space
- B) Because admissibility is a sufficient condition for optimality, not a necessary one!
16 Admissible overestimations
- Some overestimations obviously don't affect optimality:
- Constant overestimations
- Overestimations outside the optimal path
- Optimal-path overestimations coupled with overestimations in sibling sub-branches
- In some domains other overestimations are admissible:
- Uniform-cost problems: h < h* + c (move games)
- Orthogonal single-piece-move problems: h < h* + 2c (atomic Manhattan-space problems → like the sliding-tile puzzle)
- Simple experiment with the 8-puzzle and A* (see the sketch below):
- Use the heuristic h = hM + s, with s variable.
- If 0 < s < 2, search is optimal, but becomes less efficient as s → 2.
- If s ≥ 2, search can be suboptimal, and regains space efficiency.
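A tiny sketch of the heuristic used in the experiment above: the Manhattan estimate shifted by a constant s (`manhattan` is the earlier sketch; the wrapper itself is an assumption):

```python
def shifted_manhattan(n, s):
    """h(state) = hM(state) + s: a constant overestimation of Manhattan.
    For the sliding-tile puzzle, 0 < s < 2 should still preserve optimality."""
    return lambda state: manhattan(state, n) + s
```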
17 Likely-Admissible Search
- We relax the optimality requirement in a probabilistic sense (not qualitatively, as ε-admissible search does).
- Why is it a better approach than ε-admissibility?
- It allows retrieving TRULY OPTIMAL solutions.
- It still allows changing the nature of the search complexity.
- It allows studying the complexity while stressing p asymptotically to 1.
- Because search can rely on any heuristic, unlike ε-admissible search, which works only on already-proven-admissible ones.
- Because we can better combine search with statistical machine learning techniques: using universal approximators we can automatically generate heuristics.
18 Likely-Admissible Search: a statistical framework
- Any given non-admissible heuristic can be used. The only requisite is a previous statistical analysis of its overestimation frequencies.
- We denote by P(h) the probability that heuristic h underestimates h* for any given state x ∈ X.
- We denote by p_h the probability of optimally solving a problem using h and A*.
- A main goal of the framework is to obtain p_h from P(h): WE WANT TO ESTIMATE OPTIMALITY FROM ADMISSIBILITY.
19 Likely-Admissible Search: trivial case, single heuristic
- The overestimations over the optimal path p affect optimality; hence, given the solution depth d:
- (eq. 1)
- Considering the admissible overestimation theorem, in the sliding-tile puzzle domain:
- (eq. 2) (a reconstruction of both equations is sketched below)
20 Likely-Admissible Search: effect of the Admissible Overestimations Theorem
- Underestimating h* + 2 is MUCH EASIER than underestimating h*!
- The best heuristic generated for the 8-puzzle overestimated h* in 28.4% of cases, but h* + 2 in only 1.9%!
21 Likely-Admissible Search: multiple heuristics
- To enrich the heuristic information we can generate many heuristics and use them simultaneously.
- With j different heuristics we can take, each time, the smallest evaluation, in order to stress admissibility.
- Thus:
- (eq. 3)
- (eq. 3b) (reconstruction sketched below)
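Again the formulas are missing; a plausible reconstruction of (eq. 3) and (eq. 3b), assuming the j underestimation events are independent (exactly the assumption criticized on slide 24): the minimum of the j evaluations underestimates whenever at least one of them does.

```latex
% (eq. 3): j heuristics, take h(x) = min_j h_j(x)
p_H \;=\; \Bigl( 1 - \prod_{j}\bigl(1 - P(h_j)\bigr) \Bigr)^{d}

% (eq. 3b): sliding-tile version with the h^* + 2 margin
p_H \;=\; \Bigl( 1 - \prod_{j}\bigl(1 - P(h_{j,2})\bigr) \Bigr)^{d}
```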
22 Likely-Admissible Search: multiple heuristics
- A common problem: we desire an optimality p; how many heuristics do we have to use to obtain p?
- We will assume for simplicity that all j heuristics have the same given P(h2). Hence:
- (eq. 4) (reconstruction sketched below)
- j grows logarithmically with this term, which grows both with d and p_H, because d > 1 and p_H < 1.
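A reconstruction of (eq. 4) that is consistent with the worked numbers on the next slide (equal P(h2) for every heuristic):

```latex
% (eq. 4): heuristics needed to reach a desired optimality p_H at depth d
j \;=\; \Bigl\lceil \log_{\,1 - P(h_2)}\bigl(1 - p_H^{\,1/d}\bigr) \Bigr\rceil
```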
23 Likely-Admissible Search: some examples
- 8-puzzle: how many heuristics?
- d ≈ 22
- Desired optimality 99.9% → p_H = 0.999
- Given heuristics have P(h2) = 0.95
- log_0.05(1 - 0.999^(1/22)) → log_0.05(0.0000455) → ⌈3.33⌉ → 4
- 15-puzzle: how many heuristics?
- d ≈ 53
- Same desired optimality
- Given heuristics have P(h2) = 0.93
- log_0.07(1 - 0.999^(1/53)) → log_0.07(0.0000189) → ⌈4.1⌉ → 5
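A quick numeric check of the two examples, using the reconstructed form of (eq. 4) above (an assumption, not the authors' code):

```python
import math

def heuristics_needed(d, p_H, p_h2):
    """j = ceil( log_{1 - P(h2)} (1 - p_H^(1/d)) ) -- reconstructed eq. 4."""
    return math.ceil(math.log(1.0 - p_H ** (1.0 / d)) / math.log(1.0 - p_h2))

print(heuristics_needed(22, 0.999, 0.95))   # 8-puzzle example  -> 4
print(heuristics_needed(53, 0.999, 0.93))   # 15-puzzle example -> 5
```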
24 Likely-Admissible Search: main problems
- Equations 3 and 3b assume
- INDEPENDENT PROBABILITY DISTRIBUTION: the overestimation probabilities of the competing heuristics h_j(x) have independent distributions over X.
- Equation 2 assumes
- CONSTANT PROBABILITY: the underestimation probability P(h2) is constant for all x, independently of h*(n).
- All these assumptions are very strong.
- We observed experimentally that ANN heuristics map X with similar overestimation probabilities.
- We observed that the avg. error grows with h*, and thus P(h2) grows with it too.
25 Likely-Admissible Search: prediction capability
- Eq. 3 is not usable, since it requires total independence.
- Optimality growth seems more or less linear (not exponential) with the number of heuristics. It improves considerably with learning over different datasets.
- The trivial equation 2 gives a probabilistic lower bound of the effective search optimality:
- Extremely precise if the estimation is over 80%.
- Imprecise (but always pessimistic) for low predictions.
- Optimistic predictions are very rare and depend on the CONSTANT PROBABILITY assumption.
- Predictions are much more accurate than ε-admissible search predictions.
26 Likely-Admissible Search: optimality prediction, 8-puzzle (plot)
27 Likely-Admissible Search: optimality prediction, 15-puzzle (plot)
28 Sub-symbolic heuristics
We used standard MLP networks.
(Figure: MLP network with output h(n).)
29 Sub-symbolic heuristics: are sub-symbolic heuristics online?
- We believe so, even though there is an offline learning phase, for 2 reasons:
- 1. Nodes visited during search are generally UNSEEN.
- Exactly as humans often do with learned heuristics: we don't recover a heuristic value from a database, we compute it by employing the inner rules that the heuristic provides.
- 2. The learned heuristic should be dimension-independent: learning over small problems could be used for bigger problems (e.g. 8-puzzle → 15-puzzle). This is not possible with memory-based heuristics.
30 Sub-symbolic heuristics: outputs & targets
- Two options:
- A) 1 linear output neuron
- B) n 0/1 output neurons
- A is much better.
- Two possible targets:
- A) direct target function → o(x) = h*(x)
- B) gap target → o(x) = h*(x) - hM(x)
- (which takes advantage of Manhattan too)
- Experiments: B improves over A only in bigger problems such as the 15-puzzle.
31 Sub-symbolic heuristics: input coding
- A) Unit x(k, t) is high if square k is occupied by value t → N^2 inputs.
Example (A): 000000100 001000000 000010000
- B) Row/column coding: in the block for square k, the units for the row and column of value t are high if k is occupied by t → 2·N^(3/2) inputs.
Example (B): 001 100 100 001 010 010 100 010
- C) For each square, compute the horizontal and vertical distances of its value from the goal → 2N inputs (see the sketch below).
Example (C): -2 0 0 1 -1 1 1 1 0 1 0 0
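A sketch of coding C, the most compact of the three (2N inputs); the goal convention (tile t belongs on square t, 0 = blank) is an assumption, not the authors' code.

```python
def coding_c(state, n):
    """Coding C: for every square, the signed horizontal and vertical offsets
    of the value sitting there from its goal square -> 2N inputs (N = n*n)."""
    inputs = []
    for square, value in enumerate(state):
        inputs.append(value % n - square % n)    # horizontal offset
        inputs.append(value // n - square // n)  # vertical offset
    return inputs

print(coding_c((1, 2, 0, 3, 4, 5, 6, 7, 8), 3))  # 18 inputs for the 8-puzzle
```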
32 Sub-symbolic heuristics: learning algorithm
- Backpropagation with a new error function, instead of the classic error on output o_d vs. target t_d over example d.
- We introduce a coefficient of asymmetry w in order to stress admissibility (see the sketch below):
- E_d = (1-w)·(o_d - t_d) if (o_d - t_d) < 0
- E_d = (1+w)·(o_d - t_d) if (o_d - t_d) > 0, with 0 < w < 1
- The modified backprop minimizes E(W) = ½ Σ_{d∈D} r_d·(o_d - t_d)², with r_d = (1+w) or r_d = (1-w).
- We used a dynamically decreasing w, in order to stress underestimations when learning is simple and to ease this constraint later on. Momentum α = 0.8 helped smoothness.
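A minimal numpy sketch of the asymmetric error above and of the error signal it backpropagates; the dynamic schedule for w is omitted and the function names are placeholders.

```python
import numpy as np

def asymmetric_error(o, t, w):
    """E = 1/2 * sum( r * (o - t)^2 ), with r = 1+w for overestimations (o > t)
    and r = 1-w for underestimations (o < t), 0 < w < 1."""
    diff = o - t
    r = np.where(diff > 0, 1.0 + w, 1.0 - w)
    return 0.5 * np.sum(r * diff ** 2)

def asymmetric_error_signal(o, t, w):
    """dE/do = r * (o - t): overestimations are penalized more strongly."""
    diff = o - t
    r = np.where(diff > 0, 1.0 + w, 1.0 - w)
    return r * diff
```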
33 Sub-symbolic heuristics: asymmetric regression
(Figure: symmetric vs. asymmetric error curves.)
- This is a general idea for backpropagation learning.
- It can suit any regression problem where overestimations harm more than underestimations (or vice versa).
- Heuristic machine learning is an ideal application field.
34 Sub-symbolic heuristics: dataset generation
- Examples are previously optimally-solved configurations.
- Few examples are sufficient for good learning: a few hundred already give faster search than Manhattan.
- Experimental ideal: 8-puzzle set → 10000 examples, 15-puzzle → 25000 (about 1/(500×10^6) of the problem space!).
- IMPORTANT: these examples have to be representative of the cases found in search trees, not of random cases! (See the 15-puzzle search-tree distribution.)
- Hence, the avg. h* should stay around d/2. Over 60% of the 15-puzzle examples have d < 30, ≈80% have d < 45. Dataset generation is much easier than expected and it is fully parallelizable.
- Generating two 25000-example 15-puzzle datasets took 100 hours, half the time of learning.
35 Sub-symbolic heuristics: modifying estimations a posteriori
- Using trunc() → mandatory for IDA*.
- Adapting the value to Manhattan's parity (see the sketch below):
- increases IDA* efficiency by 30%;
- does not improve admissibility, due to the admissible overestimations theorem.
- Shifting to Manhattan in search endings.
- Maintaining dominance over Manhattan.
- Arbitrary estimation reduction.
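A sketch of the first corrections listed above (truncation and parity adaptation, keeping dominance over Manhattan); in the sliding-tile puzzle h* always has the same parity as hM, but the rounding direction chosen here is an assumption.

```python
import math

def adapt_estimate(h_nn, h_manhattan):
    """Truncate the network output, shift it to Manhattan's parity and
    never let it drop below Manhattan itself."""
    h = math.trunc(h_nn)
    if (h - h_manhattan) % 2 != 0:   # wrong parity: move up to the next valid value
        h += 1
    return max(h, h_manhattan)

print(adapt_estimate(22.7, 21))      # -> 23 (same parity as hM = 21)
```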
36 Experimental Results: 8-puzzle using A*, single heuristics
(Chart: average solution length of Manhattan, Conflict Deduction and single-ANN heuristics, with asymmetric learning and/or a-posteriori techniques; the plotted values range from 21.97 to 22.91.)
Test set: 2000 random configurations
37 Experimental Results: 8-puzzle using A* and multiple heuristics
Test set: 2000 random configurations
38 Experimental Results: 15-puzzle using IDA* and multiple heuristics
Test set: 700 random configurations
(avg. d = 52.7; nodes visited with Manhattan ≈ 3.7 × 10^8)
39 Experimental Results: some comparisons
Try the demo at http://www.dii.unisi.it/ernandes/samloyd/
- Compared to ε-admissible search:
- WIDA* with w = 1.25 and h = Conflict Deduction: predicted d = 66, factual d = 54.49, nodes visited 42374.
- IDA* with 1 ANN: factual d = 54.45, nodes 24711.
- Compared to Manhattan:
- IDA* with 1 ANN (optimality ≈ 30%): 1/1000 execution time, 1/15000 nodes visited.
- IDA* with 2 ANN (opt. ≈ 50%): 1/500 time, 1/13000 nodes.
- IDA* with 4 ANN-1 (opt. ≈ 90%): 1/70 time, 1/2800 nodes.
- Compared to DPDBs:
- IDA* with 1 ANN: between -17% and +13% nodes visited, between 1.4 and 3.5 times slower.
40 Conclusions
- We defined a new framework of relaxed-admissible search: likely-admissible search.
- This statistical framework is more appealing than ε-admissibility:
- it relaxes the quantity of the solutions, not the quality;
- it works with any non-admissible heuristic;
- it can exploit statistical learning techniques.
- Likely-admissible sub-symbolic heuristics:
- their performance on the 15-puzzle can challenge DPDB heuristics;
- they represent a way to speed up solving, avoid memory abuse and still retrieve optimal solutions.
41 Further Work
- 1. Generalization of the input coding. Two goals:
- A) reduce the dimension of the input representation;
- B) allow learning over different problem dimensions.
- An idea: using graphs and recurrent ANNs to generate heuristics.
- 2. Auto-feed learning
- The system should be able to generate its own dataset automatically during learning, increasing complexity gradually.
- 3. Network specialization
- Train and apply heuristics only over a certain domain of complexity (e.g. guided by Manhattan Distance) during search.
42 Likely-Admissible Sub-symbolic Heuristics
26-08-2004 Valencia
- THANK YOU FOR YOUR ATTENTION