1. KI2 MDP / POMDP
Kunstmatige Intelligentie / RuG
2. Decision Processes
- Agent
- Perceives environment (S_t) flawlessly
- Chooses action (a)
- Which alters the state of the world (S_{t+1})
3. Finite state machine
[State-transition diagram over three behaviours and three input signals]
- Behaviours: A1 idle around, A2 follow object, A3 keep distance
- Signals: see BALL, see obstacle, no signals
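As an illustration, here is a minimal Python sketch of such a finite state machine. The transition table is an assumed reading of the diagram (each signal simply selects the matching behaviour, regardless of the current one), so treat it as illustrative rather than as the exact machine on the slide.

```python
# Illustrative finite state machine for the robot behaviours above.
# The transition table is an assumption; the slide's diagram may wire the
# signals differently per state.
TRANSITIONS = {
    "see BALL": "A2 follow object",
    "see obstacle": "A3 keep distance",
    "no signals": "A1 idle around",
}

def step(state: str, signal: str) -> str:
    """Return the next behaviour given the current one and the input signal."""
    return TRANSITIONS.get(signal, state)   # unknown signal: stay in current behaviour

state = "A1 idle around"
for signal in ["no signals", "see BALL", "see obstacle"]:
    state = step(state, signal)
    print(signal, "->", state)
```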
4. Stochastic Decision Processes
- Agent
- Perceives environment (S_t) flawlessly
- Chooses action (a) according to P(a|S)
- Which alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t, a)
5. Markov Decision Processes
- Agent
- Perceives environment (S_t) flawlessly
- Chooses action (a) according to P(a|S)
- Which alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t, a)
- → If there are no longer-term dependencies: a 1st-order Markov process
6. Assumptions
- The observation of S_t is noise-free; all required information is observable
- Actions a are selected with probability P(a|S) (random generator)
- The consequences of a in (S_{t+1}) occur stochastically with probability P(S_{t+1}|S_t, a)
7. A policy
[Grid-world figure: cells with rewards +1 and −1, and a START cell]
8. A policy
[Grid-world figure: cells with rewards +1 and −1, and a START cell]
9. MDP
- States
- Actions
- Transitions between states
- P(a_i|s_k): the policy, i.e. which actions a one decides on given the possible circumstances s
10. Policy π
- argmax_{a_i} P(a_i|s_k)
- How can an agent learn this?
- Cost minimization
- Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911)
- → Reinforcements R(a,S)
- → Structure of the world: T = P(S_{t+1}|S_t)
11. Reinforcements
- Given a history of states, actions and the resulting reinforcements, an agent can learn to estimate the value of an action.
- How? The sum of the reinforcements R? Their average?
- → exponential weighting
- the first step determines all later ones (learning from the past)
- an immediate reward is more useful (reckoning over the future)
- impatience / mortality
12. Assigning Utility (Value) to Sequences
Discounted Rewards
- V(s_0, s_1, s_2, ...) = R(s_0) + γ·R(s_1) + γ²·R(s_2) + ...
- where 0 < γ < 1
- where R is the reinforcement value, s refers to a state, and γ is the discount factor
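A short sketch of this discounted sum in Python; the reward sequence and γ = 0.9 below are made up purely for illustration.

```python
# Discounted value of a reward sequence: V = R(s0) + γ·R(s1) + γ²·R(s2) + ...
def discounted_value(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_value([1.0, 0.0, 0.0, 1.0], gamma=0.9))  # 1 + 0.9**3 = 1.729
```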
13. Assigning Utility to States
- Can we say V(s) = R(s)?
- The utility of a state is the expected utility of all states that will follow it, when policy π is used
- Transition probability T(s, a, s′)
NO!!!
14. Assigning Utility to States
- Can we say V(s) = R(s)?
- V_π(s) is specific to each policy π
- V_π(s) = E( Σ_t γ^t·R(s_t) | π, s_0 = s )
- V(s) = V_π*(s)
- V(s) = R(s) + γ·max_a Σ_{s′} T(s, a, s′)·V(s′)
- the Bellman equation
- If we solve the function V(s) for each state, we will have solved for the optimal π for the given MDP
15. Value Iteration Algorithm
- We have to solve |S| simultaneous Bellman equations
- We can't solve them directly, so we use an iterative approach:
- 1. Begin with arbitrary initial values V_0
- 2. For each s, calculate V(s) from R(s) and V_0
- 3. Use these new utility values to update V_0
- 4. Repeat steps 2-3 until V_0 converges
- This equilibrium is a unique solution! (see RN, p. 621, for the proof)
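A sketch of this loop in Python, assuming the MDP is given as a reward table R[s] and a transition table T[s][a] = list of (next state, probability); these names and the dictionary layout are illustrative, not from the slides.

```python
def value_iteration(states, actions, R, T, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}                          # 1. arbitrary initial values V0
    while True:
        V_new = {}
        for s in states:                                  # 2. V(s) from R(s) and V0
            V_new[s] = R[s] + gamma * max(
                sum(p * V[s2] for s2, p in T[s][a]) for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new                                         # 3. the new utilities become V0
        if delta < eps:                                   # 4. repeat until convergence
            return V
```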
16. Search space
- |T| = |S| × |A| × |S|
- Explicit enumeration of all combinations is often not feasible (cf. chess, Go)
- Chunking within T
- Problematic if S is real-valued
17. MDP → POMDP
- The MDP world may be stochastic and Markovian, but
- the observation of that world itself is reliable; no assumptions need to be made about it.
- Most real problems involve
- noise in the observation itself
- incompleteness of the information
18. MDP → POMDP
- Most real problems involve
- noise in the observation itself
- incompleteness of the information
- In these cases the agent must be able to develop a system of beliefs on the basis of series of partial observations.
19. Partially Observable Markov Decision Processes (POMDPs)
- A POMDP has
- States S
- Actions A
- Probabilistic transitions
- Immediate Rewards on actions
- A discount factor
- Observations Z
- Observation probabilities (reliabilities)
- An initial belief b0
20. A POMDP example: The Tiger Problem
21. The Tiger Problem
- Description
- 2 states: Tiger_Left, Tiger_Right
- 3 actions: Listen, Open_Left, Open_Right
- 2 observations: Hear_Left, Hear_Right
22. The Tiger Problem
- Rewards:
- −1 for the Listen action
- −100 for Open(x) in the Tiger-at-x state
- +10 for Open(x) in the Tiger-not-at-x state
23. The Tiger Problem
- Furthermore:
- The Listen action does not change the state
- The Open(x) action reveals the tiger behind door x with 50% chance, and resets the trial
- The Listen action gives the correct information 85% of the time: p(hear_left | Listen, tiger_left) = 0.85
- p(hear_right | Listen, tiger_left) = 0.15
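Written out as a table in Python (the tiger_right entries follow from the same "85% correct" statement and are assumed to be symmetric):

```python
# P(observation | Listen, true state) for the Tiger problem
P_OBS = {
    ("hear_left",  "tiger_left"):  0.85,
    ("hear_right", "tiger_left"):  0.15,
    ("hear_left",  "tiger_right"): 0.15,   # assumed symmetric
    ("hear_right", "tiger_right"): 0.85,
}
```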
24. The Tiger Problem
- Question: what policy gives the highest return in rewards?
- Actions depend on beliefs!
- If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5·(−100 + 10) = −45
- Beliefs are updated with observations (which may be wrong)
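The same expected-reward calculation in a couple of lines of Python, using the reward values from slide 22:

```python
belief = {"tiger_left": 0.5, "tiger_right": 0.5}
# Expected reward of Open_Left: -100 if the tiger is behind it, +10 otherwise.
R_open_left = belief["tiger_left"] * -100 + belief["tiger_right"] * 10
print(R_open_left)   # -45.0, compared to -1 for Listen
```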
25. The Tiger Problem, horizon t = 1
26. The Tiger Problem, horizon t = 2
27. The Tiger Problem, horizon t = ∞
- Optimal policy:
- listen a few times
- choose a door
- next trial
- listen1: Tiger=left (p=0.85), listen2: Tiger=left (p=0.96), listen3: ... (binomial)
- Good news: the optimal policy can be learned if actions are followed by rewards!
28. The Tiger Problem, belief updates on Listen
- P(Tiger|Listen,State)_{t+1} = P(Tiger|Listen,State)_t · P(Listen) / ( P(Tiger|Listen,State)_t · P(Listen) + (1 − P(Tiger|Listen,State)_t) · (1 − P(Listen)) ), with P(Listen) = 0.85, the probability that listening gives the correct information
- Example:
- initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=left (p=0.9945), listen4: ...
(Note the underlying binomial distribution)
29. The Tiger Problem, belief updates on Listen
- P(Tiger|Listen,State)_{t+1} = P(Tiger|Listen,State)_t · P(Listen) / ( P(Tiger|Listen,State)_t · P(Listen) + (1 − P(Tiger|Listen,State)_t) · (1 − P(Listen)) )
- Example 2, noise in the observation:
- initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=right (p=0.8500), belief drops... listen4: Tiger=left (p=0.9698), and recovers... listen5: ...
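The update above, written as a small Python function; replaying the noisy observation sequence of example 2 reproduces the probabilities on the slide.

```python
def update_belief(b_left, hear, p_correct=0.85):
    """Bayes update of P(tiger_left) after one Listen with observation 'left' or 'right'."""
    like_left  = p_correct if hear == "left" else 1 - p_correct   # P(hear | tiger_left)
    like_right = 1 - p_correct if hear == "left" else p_correct   # P(hear | tiger_right)
    return b_left * like_left / (b_left * like_left + (1 - b_left) * like_right)

b = 0.5
for hear in ["left", "left", "right", "left"]:      # example 2: one noisy observation
    b = update_belief(b, hear)
    print(f"hear_{hear}: P(tiger_left) = {b:.4f}")  # 0.8500, 0.9698, 0.8500, 0.9698
```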
30. Solving a POMDP
- To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward
31. The belief state
- Instead of maintaining the complete action/observation history, we maintain a belief state b.
- The belief is a probability distribution over the states. dim(b) = |S| − 1
32. The belief space
Here is a representation of the belief space when we have two states (s0, s1)
33. The belief space
Here is a representation of the belief state when we have three states (s0, s1, s2)
34. The belief space
Here is a representation of the belief state when we have four states (s0, s1, s2, s3)
35. The belief space
- The belief space is continuous, but we only visit a countable number of belief points.
36. The Bayesian update
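As a sketch of the Bayesian update this slide refers to, the standard POMDP form is b′(s′) ∝ O(z | s′, a) · Σ_s T(s, a, s′) · b(s), which might look as follows in Python (the dictionaries T, O and the state set are assumed to be given; the names are illustrative):

```python
def bayes_update(b, a, z, states, T, O):
    """One POMDP belief update: condition the predicted belief on observation z."""
    unnorm = {
        s2: O[(z, s2, a)] * sum(T[(s, a, s2)] * b[s] for s in states)
        for s2 in states
    }
    norm = sum(unnorm.values())          # this is P(z | b, a)
    return {s2: v / norm for s2, v in unnorm.items()}
```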
37. Value Function in POMDPs
- We will compute the value function over the belief space.
- Hard: the belief space is continuous!!
- But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
- We can represent any finite-horizon solution by a finite set of alpha-vectors.
- V(b) = max_a Σ_s α_a(s)·b(s)
38. Alpha-Vectors
- They are a set of hyperplanes which define the value function over the belief space. At each belief point the value function is equal to the hyperplane with the highest value.
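Evaluating such a PWLC value function is just a max over dot products. The two alpha-vectors below are illustrative; with the horizon-1 rewards of slide 44 they reproduce the value 1.125 at b = (0.25, 0.75).

```python
def value(b, alpha_vectors):
    """V(b) = max over alpha-vectors of the dot product alpha · b."""
    return max(sum(a_i * b_i for a_i, b_i in zip(alpha, b)) for alpha in alpha_vectors)

alphas = [(1.0, 0.0), (0.0, 1.5)]     # one hyperplane per action (cf. slide 44)
print(value((0.25, 0.75), alphas))    # 1.125
```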
39. Belief Transform
- Assumptions:
- finite set of actions
- finite set of observations
- Next belief state: T(cbf, a, z), where
- cbf = current belief state, a = action, z = observation
- Finite number of possible next belief states
40. PO-MDP into continuous CO-MDP
- The process is Markovian: the next belief state depends on
- the current belief state
- the current action
- the observation
- A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
41. Problem
- Using VI in a continuous state space.
- There is no nice tabular representation as before.
42. PWLC
- Restrictions on the form of the solutions to the continuous-space CO-MDP:
- The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
- The value of a belief point is simply the dot product of the two vectors (the belief and an alpha-vector).
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
43. Steps in Value Iteration (VI)
- Represent the value function for each horizon as a set of vectors.
- This overcomes the problem of representing a value function over a continuous space.
- Find the vector that has the largest dot product with the belief state.
44. PO-MDP Value Iteration Example
- Assumptions:
- two states (s1, s2)
- two actions (a1, a2)
- three observations
- Example: the horizon length is 1.
- b = (0.25, 0.75)
- Immediate rewards: a1 gives 1 in s1 and 0 in s2; a2 gives 0 in s1 and 1.5 in s2
- V(a1, b) = 0.25×1 + 0.75×0 = 0.25
- V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
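The same horizon-1 computation in Python, with the immediate rewards read off from the arithmetic on the slide:

```python
b = (0.25, 0.75)
R = {"a1": (1.0, 0.0), "a2": (0.0, 1.5)}   # immediate reward per (s1, s2)
V = {a: b[0] * r[0] + b[1] * r[1] for a, r in R.items()}
print(V)                     # {'a1': 0.25, 'a2': 1.125}
print(max(V, key=V.get))     # a2 is the better horizon-1 action at this belief
```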
45. PO-MDP Value Iteration Example
- The value of a belief state for horizon length 2, given b, a1, z1:
- the immediate reward plus the value of the next action.
- Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
46. PO-MDP Value Iteration Example
- Find the value for all belief points given this fixed action and observation.
- The transformed value function is also PWLC.
47. PO-MDP Value Iteration Example
- How do we compute the value of a belief state given only the action?
- The horizon-2 value of the belief state, given that:
- the values for each observation are z1: 0.7, z2: 0.8, z3: 1.2
- P(z1 | b, a1) = 0.6, P(z2 | b, a1) = 0.25, P(z3 | b, a1) = 0.15
- 0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835
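The weighted sum on the slide, spelled out in Python (the pairing of probabilities and values follows the slide's own products):

```python
# P(z | b, a1) times the best value reachable after observing z, summed over z
terms = [(0.6, 0.8), (0.25, 0.7), (0.15, 1.2)]
print(sum(p * v for p, v in terms))   # 0.835
```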
48. Transformed Value Functions
- Each of these transformed functions partitions the belief space differently.
- The best next action to perform depends upon the initial belief state and the observation.
49. Best Value For Belief States
- The value of every single belief point is the sum of:
- the immediate reward, and
- the line segments from the S() functions for each observation's future strategy.
- Since adding lines gives you lines, it is linear.
50. Best Strategy for any Belief Point
- All the useful future strategies are easy to pick out.
51. Value Function and Partition
- For the specific action a1, the value function and corresponding partitions
52. Value Function and Partition
- For the specific action a2, the value function and corresponding partitions
53. Which Action to Choose?
- Put the value functions for each action together to see where each action gives the highest value.
54. Compact Horizon-2 Value Function
55. POMDP Model
- Control dynamics for a POMDP
56. Active Learning
- In an active learning problem the learner has the ability to influence its training data.
- The learner asks for whatever is most useful given its current knowledge.
- Methods to find the most useful query have been shown by Cohn et al. (1995).
57. Active Learning (Cohn et al., 1995)
- Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
- They showed how this can be done exactly
- for a mixture-of-Gaussians model
- for locally weighted regression
58. Active Perception
- Automatic gesture recognition
- not full-image pattern recognition
- gaze-based image analysis: fixations
- saves computing time in image processing
- requires computing time for POMDP action selection (pan/tilt/zoom of the camera)