1
Likely-Admissible Sub-symbolic Heuristics
26-08-2004 Valencia
  • Marco Ernandes
  • Cognitive Science PhD Student
  • Email: ernandes@dii.unisi.it
  • Web: www.dii.unisi.it/ernandes
  • Marco Gori
  • Professor of Computer Science
  • Email: marco@dii.unisi.it
  • Web: www.dii.unisi.it/marco

2
Heuristic Search
  • Search algorithms
  • A*, IDA*, BS*, …
  • Heuristic information
  • h(n) → typically an estimate of the distance from node n to the goal
  • Heuristic usage policy
  • How to combine h(n) and g(n) to obtain f(n) (see the sketch below)
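As a minimal illustration of the usage policy f(n) = g(n) + h(n), here is a generic A* loop in Python; the start state, goal test, expand and h callables are placeholders for this sketch, not taken from the slides.

    import heapq
    from itertools import count

    def a_star(start, is_goal, expand, h):
        # Frontier ordered by f(n) = g(n) + h(n); the counter breaks ties on equal f.
        tie = count()
        frontier = [(h(start), next(tie), 0, start, [start])]
        best_g = {start: 0}
        while frontier:
            f, _, g, state, path = heapq.heappop(frontier)
            if is_goal(state):
                return path                      # optimal if h never overestimates
            for succ, cost in expand(state):     # expand(state) -> [(successor, step_cost), ...]
                new_g = g + cost
                if new_g < best_g.get(succ, float("inf")):
                    best_g[succ] = new_g
                    heapq.heappush(frontier, (new_g + h(succ), next(tie), new_g, succ, path + [succ]))
        return None                              # no solution found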

3
Optimal Search for NP Problems
  • 2 approaches:
  • Rigid admissibility
  • requires optimistic heuristics
  • ALWAYS retrieves optimal solutions: C = C*
  • Relaxed admissibility
  • ε-admissible search (e.g. WA*)
  • retrieves solutions with bounded costs: C ≤ (1 + ε)C*
  • the problem is no longer NP-complete

4
Two families of heuristics
  • Online heuristics
  • The h(n) value is computed during search, when a node is visited.
  • An AI classic: the Manhattan Distance.
  • Memory-based heuristics
  • Offline phase: resolution of all possible subproblems and storage of all the results.
  • Online phase: decomposition of a node into subproblems and database querying.
  • Successfully used for rigid admissibility.

5
Online heuristic research
  • How to improve Manhattan estimations?
  • By working on its main bias: locality.
  • Manhattan considers each piece of the problem as completely independent from the rest.
  • Hence it has no way to determine how tiles influence each other: hM = h* − GAP
  • Manhattan does not consider the influence of the blank tile.

(Example figure: a configuration with hM = 3 but h* = 11, due to tile conflicts.)
6
Online heuristic research
  • How to improve Manhattan estimations?
  • 1) Manhattan Corrections (Hansson et al., 1992)
  • The idea is to increment the estimation with ad hoc techniques, maintaining admissibility.
  • 2) The ABSOLVER approach (Prieditis, 1989)
  • Automatically inventing admissible heuristics through constraint elimination.
  • 3) Higher-Order Heuristics (Korf, 1996)
  • Generalizing Manhattan by considering subproblems of a configuration rather than single elements.

7
Manhattan Corrections
  • Linear Conflicts
  • Corner Tiles
  • Last Moves
  • Non-Linear Conflicts
  • Corner Deduction
  • First Moves

(References: Hansson et al., 1992; Ernandes, 2003. The combination of all six techniques is Conflict Deduction, Ernandes, 2003.)
8
Examples
Linear Conflicts: computes conflicts on the same row/column.
Corner Tiles: computes conflicts thanks to corner properties.
Last Moves: computes the last two moves needed to complete the puzzle.
Non-Linear Conflicts: computes conflicts on a different row/column (two types).
Corner Deduction: like Corner Tiles but with correct tiles on the diagonal.
9
Conflict Deduction
  • It is more convenient to implement the various techniques separately.
  • We cannot simply add all the corrections together: inadmissibility!
  • If one tile is involved in more than one conflict, it counts only once.
  • To maximize the estimation we use, for each tile, the technique that gives the highest contribution (see the sketch below).
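A minimal sketch of this per-tile combination rule, assuming hypothetical helpers that report, for each tile, the extra moves a single technique attributes to it (none of these names appear in the slides):

    def conflict_deduction(state, tiles, manhattan, corrections):
        # manhattan(state): the plain Manhattan Distance of the configuration.
        # corrections: functions f(state, tile) -> extra moves that one technique
        # (linear conflicts, corner tiles, last moves, ...) attributes to 'tile'.
        extra = 0
        for tile in tiles:
            # A tile counts only once: keep its single best correction.
            extra += max(f(state, tile) for f in corrections)
        return manhattan(state) + extra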

10
Higher-Order Heuristics
  • Ad hoc techniques generate strongly problem-dependent heuristics.
  • They are not sufficient to attack bigger problems such as the 24-puzzle.
  • Manhattan has to be generalized differently, by considering the distance-to-goal of several elements (tiles) grouped together.
  • First example → Pairwise Distances: instead of computing the distance of one independent tile, we use pairs of tiles.

11
Higher-Order Heuristics: problems
  • We can strengthen the Pairwise Distance by computing it for all possible tile pairs and then seeking the combination that maximizes the estimation: a Maximum Weighted Matching Problem.
  • PD remains poorly informed. We would need triples of tiles, but then the matching problem becomes NP-complete (Korf, 1996).
  • Hence the only Higher-Order Heuristic that can be used efficiently online is the Pairwise Distance, which is too poor → less informed than Conflict Deduction!

12
From Higher-Order Heuristics to Memory-based Heuristics
  • Higher-Order Heuristics could ignore the maximization problem and consider pre-designed tile groups (and increase their size).
  • Solving subproblems of 3 or more tiles (patterns) is too expensive during search: we need to do this offline.

13
Disjoint Pattern Databases (Korf & Taylor, 2002)
  • Additive version of Pattern Databases (Culberson & Schaeffer, 1996) where patterns are considered independently.
  • Manhattan is the simplest Disjoint Pattern DB: 1 tile = 1 pattern. DPDBs, unlike PDBs, always dominate Manhattan.
  • On the 15-puzzle they perform 75 times faster than non-additive PDBs, and their DB generation is much easier because distances can be computed backwards by disarranging the patterns.
  • Different DPDBs can be combined by taking the argmax (see the sketch below): global speedup over Manhattan 2000, space reduction 13000.

(Figure: two disjoint pattern databases, DPDB 1 and DPDB 2.)
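A schematic sketch of how additive pattern costs are summed within one DPDB and how several DPDBs are combined by a max; the data layout (tuple states, dict lookup tables) is an assumption for illustration:

    def project(state, pattern_tiles):
        # Locations of the pattern tiles only; all other tiles are ignored.
        return tuple(state.index(t) for t in pattern_tiles)

    def dpdb_heuristic(state, dpdbs):
        # Each DPDB is a list of (pattern_tiles, cost_table) pairs whose tile sets
        # partition the board; within one DB the pattern costs are additive.
        # Different DPDBs are combined by taking the maximum estimate.
        return max(sum(table[project(state, tiles)] for tiles, table in db)
                   for db in dpdbs)
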
14
DPDBs and the 24-puzzle
  • This technique solved the 24-puzzle between 1,1 and 21 times faster than classic Higher-Order Heuristics (avg. 2 days).
  • But in many cases using more nodes!
  • This technique evidently does not scale with problem dimension.
  • Maintaining the same time complexity for the 35-puzzle would mean increasing the number of DB entries from 10^13 to 10^28.

15
Criticizing the classic approach
  • We believe that it is more sensible to investigate the combination of online heuristics and relaxed admissibility.
  • A) Because rigid admissibility does not give any chance to face problems of greater dimensions:
  • Online admissible heuristics → NP-hard in time
  • Memory-based admissible heuristics → NP-hard in space
  • B) Because admissibility is a sufficient condition for optimality, not a necessary one!

16
Admissible overestimations
  • Some overestimations obviously don't affect optimality:
  • Constant overestimations
  • Overestimations outside the optimal path
  • Optimal-path overestimations coupled with overestimations in sibling sub-branches
  • In some domains other overestimations are admissible:
  • Uniform-cost problems: h < h* + c (move games)
  • Orthogonal single-piece-move problems: h < h* + 2c (atomic Manhattan-space problems → like the sliding-tile puzzle)
  • Simple experiment with the 8-puzzle and A*:
  • Use the heuristic h = hM + s with s variable
  • If s > 0 and s < 2 search is optimal, but becomes increasingly inefficient as s → 2.
  • If s ≥ 2 search can be suboptimal, and regains space efficiency.

17
Likely-Admissible Search
  • We relax the optimality requirement in a probabilistic sense (not qualitatively, as ε-admissible search does).
  • Why is it a better approach than ε-admissibility?
  • It allows us to retrieve TRULY OPTIMAL solutions.
  • It still allows us to change the nature of the search complexity.
  • It allows us to study the complexity while pushing p asymptotically to 1.
  • Because search can rely on any heuristic, unlike ε-admissible search, which works only on already-proven-admissible ones.
  • Because we can better combine search with statistical machine learning techniques: using universal approximators we can automatically generate heuristics.

18
Likely-Admissible Search: A statistical framework
  • Any given non-admissible heuristic can be used. The only requisite is a prior statistical analysis of its overestimation frequencies.
  • We denote by P(h) the probability that heuristic h underestimates h* for any given state x ∈ X.
  • We denote by p_h the probability of optimally solving a problem using h and A*.
  • A main goal of the framework is to obtain p_h from P(h): WE WANT TO ESTIMATE OPTIMALITY FROM ADMISSIBILITY.

19
Likely-Admissible Search: Trivial case, single heuristic
  • The overestimations along the optimal path p affect optimality; hence, given solution depth d:
  • (eq. 1)  p_h ≥ P(h)^d
  • Considering the admissible overestimations theorem, in the sliding-tile puzzle domain:
  • (eq. 2)  p_h ≥ P(h < h* + 2)^d

20
Likely-Admissible Search: Effect of the Admissible Overestimations Theorem
  • Underestimating h* + 2 is MUCH EASIER than underestimating h*!
  • The best heuristic generated for the 8-puzzle overestimated h* in 28,4% of cases, but h* + 2 in only 1,9%!

21
Likely-Admissible Search: Multiple Heuristics
  • To enrich the heuristic information we can generate many heuristics and use them simultaneously.
  • With j different heuristics we can take, each time, the smallest evaluation, in order to stress admissibility.
  • Thus:
  • (eq. 3)  P(h_min) = 1 − ∏_(i=1..j) (1 − P(h_i))
  • (eq. 3b)  p_h ≥ [1 − ∏_(i=1..j) (1 − P(h_i))]^d

22
Likely-Admissible Search: Multiple Heuristics
  • A common problem: we desire an optimality p_H; how many heuristics do we have to use to obtain it?
  • We will consider, for simplicity, that all j heuristics have the same given P(h < h* + 2). Hence:
  • (eq. 4)  j = ⌈ log_(1−P) (1 − p_H^(1/d)) ⌉

j grows logarithmically with this term, which grows with both d and p_H, because d > 1 and p_H < 1.
23
Likely-Admissible Search: Some Examples
  • 8-puzzle: how many heuristics?
  • d ≈ 22
  • Desired optimality 99,9% → p_H = 0,999
  • Given heuristics with P(h < h* + 2) = 0,95
  • log_0,05 (1 − 0,999^(1/22)) = log_0,05 (0,0000455) ≈ 3,33 → ⌈3,33⌉ = 4
  • 15-puzzle: how many heuristics?
  • d ≈ 53
  • Same desired optimality
  • Given heuristics with P(h < h* + 2) = 0,93
  • log_0,07 (1 − 0,999^(1/53)) = log_0,07 (0,0000189) ≈ 4,1 → ⌈4,1⌉ = 5
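The two worked examples can be checked numerically; a small sketch assuming eq. 4 has the form j = ⌈log_(1−P)(1 − p_H^(1/d))⌉, which reproduces the figures above:

    from math import ceil, log

    def heuristics_needed(p_H, d, P):
        # eq. 4: smallest j such that [1 - (1 - P)^j]^d >= p_H,
        # where P is the per-heuristic probability of underestimating h* + 2.
        return ceil(log(1 - p_H ** (1.0 / d), 1 - P))

    print(heuristics_needed(0.999, 22, 0.95))   # 8-puzzle example  -> 4
    print(heuristics_needed(0.999, 53, 0.93))   # 15-puzzle example -> 5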

24
Likely-Admissible Search: Main Problems
  • Equations 3 and 3b assume:
  • INDEPENDENT PROBABILITY DISTRIBUTION: the overestimation probabilities of the competing heuristics h_j(x) have independent distributions over X.
  • Equation 2 assumes:
  • CONSTANT PROBABILITY: the underestimation probability P(h < h* + 2) is constant for all x, independently of h*(n).
  • All these assumptions are very strong:
  • We observed experimentally that ANN heuristics map X with similar overestimation probabilities.
  • We observed that the avg. error grows with h*, and thus P(h < h* + 2) is not constant either.

25
Likely-Admissible Search: Prediction capability
  • Eq. 3 is not usable, since it requires total independence.
  • Optimality growth seems more or less linear (not exponential) with the number of heuristics. It improves appreciably with learning over different datasets.
  • The trivial equation 2 gives a probabilistic lower bound on the effective search optimality:
  • Extremely precise if the estimation is over 80%.
  • Imprecise (but always pessimistic) for low predictions.
  • Optimistic predictions are very rare and depend on the CONSTANT PROBABILITY assumption.
  • Predictions are much more accurate than ε-admissible search predictions.

26
Likely-Admissible Search: Optimality prediction, 8-puzzle
27
Likely-Admissible Search: Optimality prediction, 15-puzzle
28
Sub-symbolic heuristics
We used standard MLP networks that output the estimate h(n).
(Figure: network architecture.)
29
Sub-symbolic heuristics: Are sub-symbolic heuristics online?
  • We believe so, even though there is an offline learning phase, for 2 reasons:
  • 1. Nodes visited during search are generally UNSEEN.
  • Exactly as humans often do with learned heuristics: we don't recover a heuristic value from a database, we compute it by applying the inner rules that the heuristic provides.
  • 2. The learned heuristic should be dimension-independent: learning over small problems could be used for bigger problems (e.g. 8-puzzle → 15-puzzle). This is not possible with memory-based heuristics.

30
Sub-symbolic heuristics: Outputs & Targets
  • Two options:
  • A) 1 linear output neuron
  • B) n 0/1 output neurons
  • A is much better.
  • Two possible targets:
  • A) direct target function → o(x) = h*(x)
  • B) gap target → o(x) = h*(x) − hM(x)
  • (which takes advantage of Manhattan too)
  • Experiments: B improves over A only in bigger problems such as the 15-puzzle.

31
Sub-symbolic heuristics: Input coding
  • A) one unit per (square k, value t) pair, high if square k is occupied by value t → N² inputs
    e.g. 000000100 001000000 000010000 …
  • B) row/column units for block k of value t are high if k is occupied by value t → 2N^(3/2) inputs
    e.g. 001 100 100 001 010 010 100 010 …
  • C) for each square, compute its horizontal and vertical distances → 2N inputs (see the sketch below)
    e.g. -2 0 0 1 -1 1 1 1 0 1 0 0 …
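As an illustration, a small sketch of what coding C might look like on the 8-puzzle, assuming signed row/column offsets of each square's tile from its goal position (the exact convention of the original coding is not recoverable here):

    def coding_c(board, goal, side=3):
        # board, goal: tuples of length side*side giving the tile in each square
        # (0 = blank). Returns 2N inputs: for every square, the signed vertical
        # and horizontal offset of its tile from that tile's goal position.
        goal_pos = {tile: (i // side, i % side) for i, tile in enumerate(goal)}
        inputs = []
        for i, tile in enumerate(board):
            gr, gc = goal_pos[tile]
            inputs.append(i // side - gr)    # vertical offset
            inputs.append(i % side - gc)     # horizontal offset
        return inputs
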
32
Sub-symbolic heuristics: Learning Algorithm
  • Backpropagation with a new error function, instead of the classic error E_d = o_d − t_d over example d.
  • We introduce a coefficient of asymmetry w in order to stress admissibility (see the sketch below):
  • E_d = (1 − w)(o_d − t_d)  if (o_d − t_d) < 0
  • E_d = (1 + w)(o_d − t_d)  if (o_d − t_d) > 0,  with 0 < w < 1
  • The modified backprop minimizes:
  • E(W) = ½ Σ_(d∈D) r_d (o_d − t_d)²,  with r_d = (1 + w) or r_d = (1 − w)
  • We used a dynamically decreasing w, in order to stress underestimations when learning is easy and to ease it later. Momentum α = 0,8 helped smoothness.
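A compact NumPy sketch of the asymmetric error and of its gradient with respect to the network outputs o (targets t, asymmetry coefficient w); illustrative only:

    import numpy as np

    def asymmetric_error(o, t, w):
        # Overestimations (o > t) are weighted by (1 + w), underestimations by (1 - w).
        r = np.where(o - t > 0, 1.0 + w, 1.0 - w)
        return 0.5 * np.sum(r * (o - t) ** 2)

    def asymmetric_error_grad(o, t, w):
        # dE/do: the error signal backpropagated through the network.
        r = np.where(o - t > 0, 1.0 + w, 1.0 - w)
        return r * (o - t)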

33
Sub-symbolic heuristics: Asymmetric Regression
(Figure: symmetric vs. asymmetric error functions.)
  • This is a general idea for backpropagation learning.
  • It can suit any regression problem where overestimations harm more than underestimations (or the contrary).
  • Heuristic machine learning is an ideal application field.

34
Sub-symbolic heuristics: Dataset Generation
  • Examples are previously optimally-solved configurations.
  • Few examples are sufficient for good learning: a few hundred already give faster search than Manhattan.
  • Experimental ideal: 8-puzzle set → 10000 examples, 15-puzzle → 25000 (1/(500×10^6) of the problem space!).
  • IMPORTANT: these examples have to be representative of the cases present in search trees, not of random cases! (see the 15-puzzle search-tree distribution)
  • Hence, avg. h should stay around d/2. Over 60% of 15-puzzle examples have d < 30, ≈ 80% have d < 45. Dataset generation is much easier than expected and is fully parallelizable.
  • Generating two 25000-example 15-puzzle datasets took 100 hours, half the learning time.

35
Sub-symbolic heuristics: Modifying estimations a posteriori
  • Using trunc() → mandatory for IDA* (see the sketch below).
  • Adapting the value to Manhattan's parity:
  • Increases IDA* efficiency by 30%.
  • Does not improve admissibility, due to the admissible overestimations theorem.
  • Shifting to Manhattan in search endings.
  • Maintaining dominance over Manhattan.
  • Arbitrary estimation reduction.
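A minimal sketch of the first adjustments above (truncation, parity matching, dominance over Manhattan); the rounding direction is an assumption, not taken from the slides:

    from math import trunc

    def adjust_estimate(h_net, h_manhattan):
        # Truncate the real-valued network output: IDA* needs integer thresholds.
        h = trunc(h_net)
        # In the sliding-tile puzzle h* always has the same parity as the Manhattan
        # Distance, so shift a wrong-parity value to the next matching one.
        if (h - h_manhattan) % 2 != 0:
            h += 1
        # Keep dominance over Manhattan.
        return max(h, h_manhattan)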

36
Experimental Results: 8-puzzle using A* and single heuristics
(Table: results for Manhattan, Conflict Deduction, 1 ANN, 1 ANN with asymmetric learning, and 1 ANN with a posteriori techniques; values range from 21,97 to 22,91.)
Test set: 2000 random configurations
37
Experimental Results: 8-puzzle using A* and multiple heuristics
Test set: 2000 random configurations
38
Experimental Results: 15-puzzle using IDA* and multiple heuristics
Test set: 700 random configurations (avg. d = 52,7; nodes expanded with Manhattan ≈ 3,7 × 10^8)
39
Experimental Results: Some comparisons
Try the demo at http://www.dii.unisi.it/ernandes/samloyd/
  • Compared to ε-admissible search:
  • WIDA* with w = 1,25 and h = Conflict Deduction: predicted d ≤ 66, actual d = 54,49, nodes visited 42374
  • IDA* with 1 ANN: actual d = 54,45, nodes 24711
  • Compared to Manhattan:
  • IDA* with 1 ANN (optimality ≈ 30%): 1/1000 execution time, 1/15000 nodes visited
  • IDA* with 2 ANN (opt. ≈ 50%): 1/500 time, 1/13000 nodes
  • IDA* with 4 ANN-1 (opt. ≈ 90%): 1/70 time, 1/2800 nodes
  • Compared to DPDBs:
  • IDA* with 1 ANN: between −17% and +13% nodes visited, between 1,4 and 3,5 times slower

40
Conclusions
  • We defined a new framework of relaxed-admissible search: likely-admissible search.
  • This statistical framework is more appealing than ε-admissibility:
  • it relaxes the quantity of the solutions, not the quality
  • it works with any non-admissible heuristic
  • it can exploit statistical learning techniques
  • Likely-admissible sub-symbolic heuristics:
  • can challenge DPDB heuristics in performance on the 15-puzzle
  • represent a way to speed up solving, avoid memory abuse and still retrieve optimal solutions.

41
Further Work
  • 1) Generalization of the input coding. Two goals:
  • A) reduce the dimension of the input representation
  • B) allow learning over different problem dimensions
  • An idea: using graphs and recurrent ANNs to generate heuristics.
  • 2) Auto-feed learning
  • The system should be able to generate its own dataset automatically during learning, increasing complexity gradually.
  • 3) Network specialization
  • Train and apply heuristics only over a certain domain of complexity (e.g. guided by the Manhattan Distance) during search.

42
Likely-Admissible Sub-symbolic Heuristics
26-08-2004 Valencia
  • THANK YOU FOR YOUR ATTENTION