Multimedia search: From Lab to Web

Transcript and Presenter's Notes

Title: Multimedia search: From Lab to Web


1
KI2 MDP / POMDP
Kunstmatige Intelligentie / RuG
2
Decision Processes
  • Agent
  • Perceives environment (S_t) flawlessly
  • Chooses action (a)
  • Which alters the state of the world (S_t+1)

3
finite state machine
[Figure: finite state machine for a simple agent. States: A1 "idle around", A2 "follow object", A3 "keep distance"; transitions are triggered by the signals "see BALL", "see obstacle", and "no signals".]
4
Stochastic Decision Processes
  • Agent
  • Perceives environment (S_t) flawlessly
  • Chooses action (a) according to P(a|S)
  • Which alters the state of the world (S_t+1) according to P(S_t+1 | S_t, a)

5
Markov Decision Processes
  • Agent
  • Perceives environment (S_t) flawlessly
  • Chooses action (a) according to P(a|S)
  • Which alters the state of the world (S_t+1) according to P(S_t+1 | S_t, a)

→ If there are no longer-term dependencies: a 1st-order Markov process
6
Assumptions
  • The observation of S_t is noise-free; all required information is observable
  • Actions a are selected with probability P(a|S) (random generator)
  • The consequences of a on S_t+1 occur stochastically with probability P(S_t+1 | S_t, a)

7
A policy
[Figure: grid world with terminal rewards +1 and -1 and a START state; arrows show a policy.]
8
A policy
[Figure: grid world with terminal rewards +1 and -1 and a START state; arrows show a policy.]
9
MDP
  • States
  • Actions
  • Transitions between states
  • P(a_i | s_k): the policy, i.e. which actions a the agent decides on given the possible circumstances s

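As an illustration only (not from the slides), such an MDP can be written down as a small set of tables; the state and action names below are made up:

```python
# A minimal MDP sketch (illustrative; states/actions are hypothetical).
# T[(s, a)] lists (next_state, probability) pairs, R[s] is the immediate reward.
mdp = {
    "states": ["s0", "s1"],
    "actions": ["a0", "a1"],
    "T": {
        ("s0", "a0"): [("s0", 0.9), ("s1", 0.1)],
        ("s0", "a1"): [("s1", 1.0)],
        ("s1", "a0"): [("s1", 1.0)],
        ("s1", "a1"): [("s0", 0.5), ("s1", 0.5)],
    },
    "R": {"s0": 0.0, "s1": 1.0},
}

# A stochastic policy P(a | s)
policy = {"s0": {"a0": 0.2, "a1": 0.8}, "s1": {"a0": 1.0, "a1": 0.0}}
```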
10
policy π
  • argmax_a_i P(a_i | s_k)
  • How can an agent learn this?
  • Cost minimization
  • Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911)
  • → Reinforcements R(a, S)
  • → Structure of the world T = P(S_t+1 | S_t)

11
Reinforcements
  • Given a history of states, actions and the resulting reinforcements, an agent can learn to estimate the value of an action.
  • How? The sum of the reinforcements R? Their average?
  • → exponential weighting
  • the first step determines all later ones (learning from the past)
  • an immediate reward is more useful (reasoning about the future)
  • impatience / mortality

12
Assigning Utility (Value) to Sequences
Discounted Rewards
  • V(s0, s1, s2) = R(s0) + γ·R(s1) + γ²·R(s2)
  • where 0 < γ ≤ 1
  • where R is the reinforcement value, s refers to a state and γ is the discount factor

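A quick sketch (not from the slides) of this discounted sum for a sequence of rewards, assuming γ = 0.9:

```python
def discounted_return(rewards, gamma=0.9):
    """V(s0, s1, ...) = sum over t of gamma^t * R(s_t)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards collected along a trajectory s0, s1, s2
print(round(discounted_return([0.0, 0.0, 1.0]), 2))  # 0.9^2 * 1.0 = 0.81
```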
13
Assigning Utility to States
  • Can we say V(s) = R(s)?
  • the utility of a state is the expected utility of all states that will follow it, when policy π is used
  • Transition probability T(s, a, s')

NO!!!
14
Assigning Utility to States
  • Can we say V(s) = R(s)?
  • V^π(s) is specific to each policy π
  • V^π(s) = E( Σ_t γ^t R(s_t) | π, s_0 = s )
  • V(s) = V^π(s) under the optimal policy π
  • V(s) = R(s) + γ · max_a Σ_s' T(s, a, s') · V(s')
  • Bellman equation
  • If we solve the function V(s) for each state we will have solved the optimal π for the given MDP

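A minimal sketch (not part of the slides) of a single Bellman backup, using a hypothetical two-state, two-action transition table:

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical toy MDP: T[(s, a)] lists (next_state, probability) pairs
R = {"s0": 0.0, "s1": 1.0}
T = {("s0", "a0"): [("s0", 0.9), ("s1", 0.1)],
     ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("s1", 1.0)],
     ("s1", "a1"): [("s0", 0.5), ("s1", 0.5)]}

def bellman_backup(s, V):
    """V(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * V(s')."""
    return R[s] + GAMMA * max(
        sum(p * V[s2] for s2, p in T[(s, a)]) for a in ("a0", "a1"))
```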
15
Value Iteration Algorithm
  • We have to solve |S| simultaneous Bellman equations
  • Can't solve directly, so use an iterative approach:
  • 1. Begin with arbitrary initial values V0
  • 2. For each s, calculate V(s) from R(s) and V0
  • 3. Use these new utility values to update V0
  • 4. Repeat steps 2-3 until V0 converges
  • This equilibrium is a unique solution! (see RN, page 621, for a proof)

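A sketch of this value-iteration loop in code, continuing the hypothetical toy MDP from the previous sketch (illustrative, not the course's implementation):

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]
R = {"s0": 0.0, "s1": 1.0}
T = {("s0", "a0"): [("s0", 0.9), ("s1", 0.1)],
     ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("s1", 1.0)],
     ("s1", "a1"): [("s0", 0.5), ("s1", 0.5)]}

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in STATES}                      # 1. arbitrary initial values
    while True:
        V_new = {s: R[s] + GAMMA * max(               # 2. Bellman update per state
                     sum(p * V[s2] for s2, p in T[(s, a)]) for a in ACTIONS)
                 for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new                              # 4. stop once V has converged
        V = V_new                                     # 3. reuse the new values
```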
16
Search space
  • T: S × A × S
  • Explicit enumeration of the combinations is often not feasible (cf. chess, Go)
  • Chunking within T
  • Problem if S is real-valued

17
MDP → POMDP
  • The MDP world is admittedly stochastic and Markovian, but
  • the observation of that world itself is reliable; no assumptions need to be made.
  • Most real problems involve
  • noise in the observation itself
  • incompleteness of the information

18
MDP → POMDP
  • Most real problems involve
  • noise in the observation itself
  • incompleteness of the information
  • In these cases the agent must be able to develop a system of Beliefs on the basis of series of partial observations.

19
Partially Observable Markov Decision Processes
(POMDPs)
  • A POMDP has
  • States S
  • Actions A
  • Probabilistic transitions
  • Immediate Rewards on actions
  • A discount factor
  • Observations Z
  • Observation probabilities (reliabilities)
  • An initial belief b0

20
A POMDP example: The Tiger Problem
21
The Tiger Problem
  • Description:
  • 2 states: Tiger_Left, Tiger_Right
  • 3 actions: Listen, Open_Left, Open_Right
  • 2 observations: Hear_Left, Hear_Right

22
The Tiger Problem
  • Rewards are:
  • -1 for the Listen action
  • -100 for Open(x) in the Tiger-at-x state
  • +10 for Open(x) in the Tiger-not-at-x state

23
The Tiger Problem
  • Furthermore:
  • The Listen action does not change the state
  • The Open(x) action reveals the tiger behind door x with 50% chance, and resets the trial.
  • The Listen action gives the correct information 85% of the time: p(hear_left | Listen, tiger_left) = 0.85
  • p(hear_right | Listen, tiger_left) = 0.15

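The model above is small enough to write out in full; the following sketch (illustrative, not from the slides) encodes the states, actions, rewards and observation probabilities just listed:

```python
# Tiger POMDP as plain Python (a sketch; the identifiers are ad hoc).
STATES = ["tiger_left", "tiger_right"]
ACTIONS = ["listen", "open_left", "open_right"]
OBSERVATIONS = ["hear_left", "hear_right"]

def reward(state, action):
    """-1 for listening, -100 for opening the tiger's door, +10 otherwise."""
    if action == "listen":
        return -1.0
    opened = "tiger_left" if action == "open_left" else "tiger_right"
    return -100.0 if opened == state else 10.0

def obs_prob(obs, state, action="listen"):
    """Listening reports the correct side 85% of the time."""
    correct = (obs == "hear_left") == (state == "tiger_left")
    return 0.85 if correct else 0.15
```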
24
The Tiger Problem
  • Question: what policy gives the highest return in rewards?
  • Actions depend on beliefs!
  • If the belief is 50/50 L/R, the expected reward of opening a door will be R = 0.5 × (-100 + 10) = -45
  • Beliefs are updated with observations (which may be wrong)

25
The Tiger Problem, horizon t=1
  • Optimal policy

26
The Tiger Problem, horizon t=2
  • Optimal policy

27
The Tiger Problem, horizon t=∞
  • Optimal policy:
  • listen a few times
  • choose a door
  • next trial
  • listen1: Tiger=left (p=0.85), listen2: Tiger=left (p=0.96), listen3: ... (binomial)
  • Good news: the optimal policy can be learned if actions are followed by rewards!

28
The Tiger Problem, belief updates on Listen
  • P(Tiger|Listen,State)_t+1 = P(Tiger|Listen,State)_t × P(Listen) / ( P(Tiger|Listen,State)_t × P(Listen) + (1 − P(Tiger|Listen,State)_t) × (1 − P(Listen)) )
  • Example:
  • initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=left (p=0.9945), listen4: ... (Note: underlying binomial distribution)

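A small sketch of this update (not from the slides); with a listen accuracy of 0.85 it reproduces the sequence 0.8500, 0.9698, 0.9945:

```python
P_LISTEN = 0.85  # probability that a Listen reports the correct side

def update_belief(b, heard_left=True):
    """Bayes update of b = P(Tiger=left) after one Listen observation."""
    p = P_LISTEN if heard_left else 1.0 - P_LISTEN
    return b * p / (b * p + (1.0 - b) * (1.0 - p))

b = 0.5
for _ in range(3):
    b = update_belief(b, heard_left=True)
    print(round(b, 4))   # 0.85, 0.9698, 0.9945
```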
29
The Tiger Problem, belief updates on Listen
  • P(Tiger|Listen,State)_t+1 = P(Tiger|Listen,State)_t × P(Listen) / ( P(Tiger|Listen,State)_t × P(Listen) + (1 − P(Tiger|Listen,State)_t) × (1 − P(Listen)) )
  • Example 2, noise in the observation:
  • initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=right (p=0.8500), belief drops... listen4: Tiger=left (p=0.9698), and recovers... listen5: ...

30
Solving a POMDP
  • To solve a POMDP is to find, for any
    action/observation history, the action that
    maximizes the expected discounted reward

31
The belief state
  • Instead of maintaining the complete
    action/observation history, we maintain a belief
    state b.
  • The belief is a probability distribution over the states. Dim(b) = |S| − 1

32
The belief space
Here is a representation of the belief space when
we have two states (s0,s1)
33
The belief space
Here is a representation of the belief state when
we have three states (s0,s1,s2)
34
The belief space
Here is a representation of the belief state when
we have four states (s0,s1,s2,s3)
35
The belief space
  • The belief space is continuous but we only visit
    a countable number of belief points.

36
The Bayesian update
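As a sketch of the Bayesian belief update (my formulation, not copied from the slide): the new belief after taking action a and observing z is proportional to O(z | s', a) · Σ_s T(s, a, s') · b(s). T and O below are assumed callables for the transition and observation models:

```python
def belief_update(b, a, z, T, O, states):
    """b'(s') is proportional to O(z | s', a) * sum_s T(s, a, s') * b(s)."""
    new_b = {s2: O(z, s2, a) * sum(T(s, a, s2) * b[s] for s in states)
             for s2 in states}
    norm = sum(new_b.values())            # equals P(z | b, a)
    return {s2: v / norm for s2, v in new_b.items()}
```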
37
Value Function in POMDPs
  • We will compute the value function over the belief space.
  • Hard: the belief space is continuous!
  • But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
  • We can represent any finite-horizon solution by a finite set of alpha-vectors.
  • V(b) = max_α Σ_s α(s) · b(s)

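A sketch of evaluating such a PWLC value function from a set of alpha-vectors (the vectors below are made-up illustrations):

```python
def value(belief, alpha_vectors):
    """V(b) = max over alpha-vectors of the dot product alpha · b."""
    return max(sum(a_s * b_s for a_s, b_s in zip(alpha, belief))
               for alpha in alpha_vectors)

# Two states, three hypothetical alpha-vectors
alphas = [(1.0, 0.0), (0.0, 1.5), (0.6, 0.6)]
print(value((0.25, 0.75), alphas))   # 1.125, achieved by the vector (0.0, 1.5)
```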
38
Alpha-Vectors
  • They are a set of hyperplanes which define the value function over the belief space. At each belief point the value function is equal to the hyperplane with the highest value.

39
Belief Transform
  • Assumptions:
  • Finite set of actions
  • Finite set of observations
  • Next belief state = T(cbf, a, z), where cbf = current belief state, a = action, z = observation
  • Finite number of possible next belief states

40
PO-MDP into continuous CO-MDP
  • The process is Markovian: the next belief state depends only on
  • the current belief state
  • the current action
  • the observation
  • A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem where the continuous space is the belief space.

41
Problem
  • Using VI in a continuous state space.
  • No nice tabular representation as before.

42
PWLC
  • Restrictions on the form of the solutions to the continuous-space CO-MDP:
  • The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
  • The value of a belief point is simply the dot product of the two vectors (the belief and an alpha-vector).

GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function
43
Steps in Value Iteration (VI)
  • Represent the value function for each horizon as a set of vectors.
  • This overcomes the problem of representing a value function over a continuous space.
  • Find the vector that has the largest dot product with the belief state.

44
PO-MDP Value Iteration Example
  • Assumptions:
  • Two states
  • Two actions
  • Three observations
  • Example: horizon length is 1.

b = (0.25, 0.75)

Immediate rewards:
        s1    s2
  a1     1     0
  a2     0    1.5

V(a1, b) = 0.25×1 + 0.75×0 = 0.25
V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
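As a quick sketch of this horizon-1 computation (numbers taken from the slide):

```python
b = (0.25, 0.75)                 # belief over (s1, s2)
R = {"a1": (1.0, 0.0),           # immediate reward of a1 in s1, s2
     "a2": (0.0, 1.5)}           # immediate reward of a2 in s1, s2

V = {a: b[0] * r[0] + b[1] * r[1] for a, r in R.items()}
print(V)                         # {'a1': 0.25, 'a2': 1.125} -> best action is a2
```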
45
PO-MDP Value Iteration Example
  • The value of a belief state for horizon length 2, given b, a1, z1, is
  • the immediate reward of the action plus the value of the next action.
  • Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.

46
PO-MDP Value Iteration Example
  • Find the value for all the belief points given this fixed action and observation.
  • The transformed value function is also PWLC.

47
PO-MDP Value Iteration Example
  • How do we compute the value of a belief state given only the action?
  • The horizon-2 value of the belief state, given that:
  • the values for each observation are z1: 0.7, z2: 0.8, z3: 1.2
  • P(z1 | b, a1) = 0.6, P(z2 | b, a1) = 0.25, P(z3 | b, a1) = 0.15
  • 0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835

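As a sketch, the same weighted sum in code, pairing probabilities and values the way the slide's arithmetic line does:

```python
p_obs       = [0.6, 0.25, 0.15]   # P(z1|b,a1), P(z2|b,a1), P(z3|b,a1)
best_future = [0.8, 0.7, 1.2]     # best value achievable after each observation

expected = sum(p * v for p, v in zip(p_obs, best_future))
print(round(expected, 3))         # 0.835
```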
48
Transformed Value Functions
  • Each of these transformed functions partitions
    the belief space differently.
  • Best next action to perform depends upon the
    initial belief state and observation.

49
Best Value For Belief States
  • The value of every single belief point is the sum of:
  • the immediate reward
  • the line segments from the S() functions for each observation's future strategy
  • Since adding lines gives you lines, it is linear.

50
Best Strategy for any Belief Points
  • All the useful future strategies are easy to pick
    out

51
Value Function and Partition
  • For the specific action a1, the value function
    and corresponding partitions

52
Value Function and Partition
  • For the specific action a2, the value function
    and corresponding partitions

53
Which Action to Choose?
  • put the value functions for each action together
    to see where each action gives the highest value.

54
Compact Horizon 2 Value Function
55
POMDP Model
  • Control dynamics for a POMDP

56
Active Learning
  • In an Active Learning problem the learner has the ability to influence its training data.
  • The learner asks for whatever is most useful given its current knowledge.
  • Methods to find the most useful query have been shown by Cohn et al. (1995)

57
Active Learning (Cohn et al., 1995)
  • Their method, used for function-approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
  • They showed how this could be done exactly:
  • for a mixture of Gaussians model
  • for locally weighted regression

58
Active Perception
  • Automatic gesture recognition
  • not full-image pattern recognition
  • gaze-based image analysis: fixations
  • saves computing time in image processing
  • requires computing time for POMDP action selection (pan/tilt/zoom of the camera)