1. KI2 MDP / POMDP
Kunstmatige Intelligentie / RuG
2. Decision Processes
- Agent
- Perceives environment (S_t) flawlessly
- Chooses action (a)
- Which alters the state of the world (S_{t+1})
3. Finite state machine
[State-transition diagram over three behaviours and three input signals]
- Behaviours: A1 idle around, A2 follow object, A3 keep distance
- Signals: see BALL, see obstacle, no signals
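As an illustration, here is a minimal Python sketch of such a finite state machine. The transition table is an assumed reading of the diagram (each signal simply selects the matching behaviour, regardless of the current one), so treat it as illustrative rather than as the exact machine on the slide.

```python
# Illustrative finite state machine for the robot behaviours above.
# The transition table is an assumption; the slide's diagram may wire the
# signals differently per state.
TRANSITIONS = {
    "see BALL": "A2 follow object",
    "see obstacle": "A3 keep distance",
    "no signals": "A1 idle around",
}

def step(state: str, signal: str) -> str:
    """Return the next behaviour given the current one and the input signal."""
    return TRANSITIONS.get(signal, state)   # unknown signal: stay in current behaviour

state = "A1 idle around"
for signal in ["no signals", "see BALL", "see obstacle"]:
    state = step(state, signal)
    print(signal, "->", state)
```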
4. Stochastic Decision Processes
- Agent
- Perceives environment (S_t) flawlessly
- Chooses action (a) according to P(a|S)
- Which alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t, a)
5. Markov Decision Processes
- Agent
- Perceives environment (S_t) flawlessly
- Chooses action (a) according to P(a|S)
- Which alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t, a)
- → If there are no longer-term dependencies: a 1st-order Markov process
6. Assumptions
- The observation of S_t is noise-free; all required information is observable
- Actions a are selected with probability P(a|S) (random generator)
- The consequences of a in (S_{t+1}) occur stochastically with probability P(S_{t+1}|S_t, a)
7. A policy
[Grid-world figure: cells with rewards +1 and −1, and a START cell]
8. A policy
[Grid-world figure: cells with rewards +1 and −1, and a START cell]
9. MDP
- States
- Actions
- Transitions between states
- P(a_i|s_k): the policy, i.e. which actions a one decides on given the possible circumstances s
10. Policy π
- argmax_{a_i} P(a_i|s_k)
- How can an agent learn this?
- Cost minimization
- Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911)
- → Reinforcements R(a,S)
- → Structure of the world: T = P(S_{t+1}|S_t)
11. Reinforcements
- Given a history of states, actions and the resulting reinforcements, an agent can learn to estimate the value of an action.
- How? The sum of the reinforcements R? Their average?
- → exponential weighting
- the first step determines all later ones (learning from the past)
- an immediate reward is more useful (reckoning over the future)
- impatience / mortality
12. Assigning Utility (Value) to Sequences
Discounted Rewards
- V(s_0, s_1, s_2, ...) = R(s_0) + γ·R(s_1) + γ²·R(s_2) + ...
- where 0 < γ < 1
- where R is the reinforcement value, s refers to a state, and γ is the discount factor
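A short sketch of this discounted sum in Python; the reward sequence and γ = 0.9 below are made up purely for illustration.

```python
# Discounted value of a reward sequence: V = R(s0) + γ·R(s1) + γ²·R(s2) + ...
def discounted_value(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_value([1.0, 0.0, 0.0, 1.0], gamma=0.9))  # 1 + 0.9**3 = 1.729
```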
13. Assigning Utility to States
- Can we say V(s) = R(s)?
- The utility of a state is the expected utility of all states that will follow it, when policy π is used
- Transition probability T(s, a, s′)
NO!!!
14. Assigning Utility to States
- Can we say V(s) = R(s)?
- V_π(s) is specific to each policy π
- V_π(s) = E( Σ_t γ^t·R(s_t) | π, s_0 = s )
- V(s) = V_π*(s)
- V(s) = R(s) + γ·max_a Σ_{s′} T(s, a, s′)·V(s′)
- the Bellman equation
- If we solve the function V(s) for each state, we will have solved for the optimal π for the given MDP
15. Value Iteration Algorithm
- We have to solve |S| simultaneous Bellman equations
- We can't solve them directly, so we use an iterative approach:
- 1. Begin with arbitrary initial values V_0
- 2. For each s, calculate V(s) from R(s) and V_0
- 3. Use these new utility values to update V_0
- 4. Repeat steps 2-3 until V_0 converges
- This equilibrium is a unique solution! (see RN, p. 621, for the proof)
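A sketch of this loop in Python, assuming the MDP is given as a reward table R[s] and a transition table T[s][a] = list of (next state, probability); these names and the dictionary layout are illustrative, not from the slides.

```python
def value_iteration(states, actions, R, T, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}                          # 1. arbitrary initial values V0
    while True:
        V_new = {}
        for s in states:                                  # 2. V(s) from R(s) and V0
            V_new[s] = R[s] + gamma * max(
                sum(p * V[s2] for s2, p in T[s][a]) for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new                                         # 3. the new utilities become V0
        if delta < eps:                                   # 4. repeat until convergence
            return V
```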
16. Search space
- |T| = |S| × |A| × |S|
- Explicit enumeration of all combinations is often not feasible (cf. chess, Go)
- Chunking within T
- Problematic if S is real-valued
17. MDP → POMDP
- The MDP world may be stochastic and Markovian, but
- the observation of that world itself is reliable; no assumptions need to be made about it.
- Most real problems involve
- noise in the observation itself
- incompleteness of the information
18. MDP → POMDP
- Most real problems involve
- noise in the observation itself
- incompleteness of the information
- In these cases the agent must be able to develop a system of beliefs on the basis of series of partial observations.
19. Partially Observable Markov Decision Processes (POMDPs)
- A POMDP has
- States S
- Actions A
- Probabilistic transitions
- Immediate Rewards on actions
- A discount factor
- Observations Z
- Observation probabilities (reliabilities)
- An initial belief b0
20. A POMDP example: The Tiger Problem
21. The Tiger Problem
- Description
- 2 states: Tiger_Left, Tiger_Right
- 3 actions: Listen, Open_Left, Open_Right
- 2 observations: Hear_Left, Hear_Right
22. The Tiger Problem
- Rewards:
- −1 for the Listen action
- −100 for Open(x) in the Tiger-at-x state
- +10 for Open(x) in the Tiger-not-at-x state
23. The Tiger Problem
- Furthermore:
- The Listen action does not change the state
- The Open(x) action reveals the tiger behind door x with 50% chance, and resets the trial
- The Listen action gives the correct information 85% of the time: p(hear_left | Listen, tiger_left) = 0.85
- p(hear_right | Listen, tiger_left) = 0.15
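Written out as a table in Python (the tiger_right entries follow from the same "85% correct" statement and are assumed to be symmetric):

```python
# P(observation | Listen, true state) for the Tiger problem
P_OBS = {
    ("hear_left",  "tiger_left"):  0.85,
    ("hear_right", "tiger_left"):  0.15,
    ("hear_left",  "tiger_right"): 0.15,   # assumed symmetric
    ("hear_right", "tiger_right"): 0.85,
}
```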
24. The Tiger Problem
- Question: what policy gives the highest return in rewards?
- Actions depend on beliefs!
- If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5·(−100 + 10) = −45
- Beliefs are updated with observations (which may be wrong)
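The same expected-reward calculation in a couple of lines of Python, using the reward values from slide 22:

```python
belief = {"tiger_left": 0.5, "tiger_right": 0.5}
# Expected reward of Open_Left: -100 if the tiger is behind it, +10 otherwise.
R_open_left = belief["tiger_left"] * -100 + belief["tiger_right"] * 10
print(R_open_left)   # -45.0, compared to -1 for Listen
```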
25. The Tiger Problem, horizon t = 1
26. The Tiger Problem, horizon t = 2
27. The Tiger Problem, horizon t = ∞
- Optimal policy:
- listen a few times
- choose a door
- next trial
- listen1: Tiger=left (p=0.85), listen2: Tiger=left (p=0.96), listen3: ... (binomial)
- Good news: the optimal policy can be learned if actions are followed by rewards!
28. The Tiger Problem, belief updates on Listen
- P(Tiger|Listen,State)_{t+1} = P(Tiger|Listen,State)_t · P(Listen) / ( P(Tiger|Listen,State)_t · P(Listen) + (1 − P(Tiger|Listen,State)_t) · (1 − P(Listen)) ), with P(Listen) = 0.85, the probability that listening gives the correct information
- Example:
- initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=left (p=0.9945), listen4: ...
(Note the underlying binomial distribution)
29. The Tiger Problem, belief updates on Listen
- P(Tiger|Listen,State)_{t+1} = P(Tiger|Listen,State)_t · P(Listen) / ( P(Tiger|Listen,State)_t · P(Listen) + (1 − P(Tiger|Listen,State)_t) · (1 − P(Listen)) )
- Example 2, noise in the observation:
- initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=right (p=0.8500), belief drops... listen4: Tiger=left (p=0.9698), and recovers... listen5: ...
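The update above, written as a small Python function; replaying the noisy observation sequence of example 2 reproduces the probabilities on the slide.

```python
def update_belief(b_left, hear, p_correct=0.85):
    """Bayes update of P(tiger_left) after one Listen with observation 'left' or 'right'."""
    like_left  = p_correct if hear == "left" else 1 - p_correct   # P(hear | tiger_left)
    like_right = 1 - p_correct if hear == "left" else p_correct   # P(hear | tiger_right)
    return b_left * like_left / (b_left * like_left + (1 - b_left) * like_right)

b = 0.5
for hear in ["left", "left", "right", "left"]:      # example 2: one noisy observation
    b = update_belief(b, hear)
    print(f"hear_{hear}: P(tiger_left) = {b:.4f}")  # 0.8500, 0.9698, 0.8500, 0.9698
```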
30. Solving a POMDP
- To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward
31. The belief state
- Instead of maintaining the complete action/observation history, we maintain a belief state b.
- The belief is a probability distribution over the states. dim(b) = |S| − 1
32. The belief space
Here is a representation of the belief space when we have two states (s0, s1)
33. The belief space
Here is a representation of the belief state when we have three states (s0, s1, s2)
34. The belief space
Here is a representation of the belief state when we have four states (s0, s1, s2, s3)
35. The belief space
- The belief space is continuous, but we only visit a countable number of belief points.
36. The Bayesian update
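As a sketch of the Bayesian update this slide refers to, the standard POMDP form is b′(s′) ∝ O(z | s′, a) · Σ_s T(s, a, s′) · b(s), which might look as follows in Python (the dictionaries T, O and the state set are assumed to be given; the names are illustrative):

```python
def bayes_update(b, a, z, states, T, O):
    """One POMDP belief update: condition the predicted belief on observation z."""
    unnorm = {
        s2: O[(z, s2, a)] * sum(T[(s, a, s2)] * b[s] for s in states)
        for s2 in states
    }
    norm = sum(unnorm.values())          # this is P(z | b, a)
    return {s2: v / norm for s2, v in unnorm.items()}
```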
37. Value Function in POMDPs
- We will compute the value function over the belief space.
- Hard: the belief space is continuous!!
- But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
- We can represent any finite-horizon solution by a finite set of alpha-vectors.
- V(b) = max_a Σ_s α_a(s)·b(s)
38. Alpha-Vectors
- They are a set of hyperplanes which define the value function over the belief space. At each belief point the value function is equal to the hyperplane with the highest value.
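Evaluating such a PWLC value function is just a max over dot products. The two alpha-vectors below are illustrative; with the horizon-1 rewards of slide 44 they reproduce the value 1.125 at b = (0.25, 0.75).

```python
def value(b, alpha_vectors):
    """V(b) = max over alpha-vectors of the dot product alpha · b."""
    return max(sum(a_i * b_i for a_i, b_i in zip(alpha, b)) for alpha in alpha_vectors)

alphas = [(1.0, 0.0), (0.0, 1.5)]     # one hyperplane per action (cf. slide 44)
print(value((0.25, 0.75), alphas))    # 1.125
```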
39. Belief Transform
- Assumptions:
- finite set of actions
- finite set of observations
- Next belief state: T(cbf, a, z), where
- cbf = current belief state, a = action, z = observation
- Finite number of possible next belief states
40. PO-MDP into continuous CO-MDP
- The process is Markovian: the next belief state depends on
- the current belief state
- the current action
- the observation
- A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
41. Problem
- Using VI in a continuous state space.
- There is no nice tabular representation as before.
42. PWLC
- Restrictions on the form of the solutions to the continuous-space CO-MDP:
- The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
- The value of a belief point is simply the dot product of the two vectors (the belief and an alpha-vector).
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
43. Steps in Value Iteration (VI)
- Represent the value function for each horizon as a set of vectors.
- This overcomes the problem of representing a value function over a continuous space.
- Find the vector that has the largest dot product with the belief state.
44. PO-MDP Value Iteration Example
- Assumptions:
- two states (s1, s2)
- two actions (a1, a2)
- three observations
- Example: the horizon length is 1.
- b = (0.25, 0.75)
- Immediate rewards: a1 gives 1 in s1 and 0 in s2; a2 gives 0 in s1 and 1.5 in s2
- V(a1, b) = 0.25×1 + 0.75×0 = 0.25
- V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
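The same horizon-1 computation in Python, with the immediate rewards read off from the arithmetic on the slide:

```python
b = (0.25, 0.75)
R = {"a1": (1.0, 0.0), "a2": (0.0, 1.5)}   # immediate reward per (s1, s2)
V = {a: b[0] * r[0] + b[1] * r[1] for a, r in R.items()}
print(V)                     # {'a1': 0.25, 'a2': 1.125}
print(max(V, key=V.get))     # a2 is the better horizon-1 action at this belief
```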
45. PO-MDP Value Iteration Example
- The value of a belief state for horizon length 2, given b, a1, z1:
- the immediate reward plus the value of the next action.
- Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
46. PO-MDP Value Iteration Example
- Find the value for all belief points given this fixed action and observation.
- The transformed value function is also PWLC.
47. PO-MDP Value Iteration Example
- How do we compute the value of a belief state given only the action?
- The horizon-2 value of the belief state, given that:
- the values for each observation are z1: 0.7, z2: 0.8, z3: 1.2
- P(z1 | b, a1) = 0.6, P(z2 | b, a1) = 0.25, P(z3 | b, a1) = 0.15
- 0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835
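The weighted sum on the slide, spelled out in Python (the pairing of probabilities and values follows the slide's own products):

```python
# P(z | b, a1) times the best value reachable after observing z, summed over z
terms = [(0.6, 0.8), (0.25, 0.7), (0.15, 1.2)]
print(sum(p * v for p, v in terms))   # 0.835
```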
48. Transformed Value Functions
- Each of these transformed functions partitions the belief space differently.
- The best next action to perform depends upon the initial belief state and the observation.
49. Best Value For Belief States
- The value of every single belief point is the sum of:
- the immediate reward, and
- the line segments from the S() functions for each observation's future strategy.
- Since adding lines gives you lines, it is linear.
50. Best Strategy for any Belief Point
- All the useful future strategies are easy to pick out.
51. Value Function and Partition
- For the specific action a1, the value function and corresponding partitions
52. Value Function and Partition
- For the specific action a2, the value function and corresponding partitions
53. Which Action to Choose?
- Put the value functions for each action together to see where each action gives the highest value.
54. Compact Horizon-2 Value Function
55. POMDP Model
- Control dynamics for a POMDP
56. Active Learning
- In an active learning problem the learner has the ability to influence its training data.
- The learner asks for whatever is most useful given its current knowledge.
- Methods to find the most useful query have been shown by Cohn et al. (1995).
57. Active Learning (Cohn et al., 1995)
- Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
- They showed how this can be done exactly
- for a mixture-of-Gaussians model
- for locally weighted regression
58. Active Perception
- Automatic gesture recognition
- not full-image pattern recognition
- gaze-based image analysis: fixations
- saves computing time in image processing
- requires computing time for POMDP action selection (pan/tilt/zoom of the camera)