Connection between MC/HMM and MDP/POMDP

1
11/19
  • Connection between MC/HMM and MDP/POMDP
  • Utility in terms of the value of the vantage
    point

2
Choose between Two Lotteries
  • Lottery A
  • 80% chance of $4K
  • Lottery B
  • 100% chance of $3K
  • Lottery C
  • 20% chance of $4K
  • Lottery D
  • 25% chance of $3K

People are risk-averse with high-probability
events but are willing to take risks with
unlikely payoffs (see 16.3 in R&N)
3
Choose between Two Lotteries
  • Lottery A
  • 80% chance of $4K
  • Lottery B
  • 100% chance of $3K
  • Lottery C
  • 20% chance of $4K
  • Lottery D
  • 25% chance of $3K

Standard notation for a lottery: [p, A; (1-p), B]
-- with prob. p you get prize A, and with prob. (1-p) you
get prize B
People are risk-averse with high-probability
events but are willing to take risks with
unlikely payoffs (see 16.3 in R&N)
4
Money ≠ Utility
The previous slide on two lotteries shows that
not only is money not utility, but the
money-to-utility conversion can be inconsistent
5
Expected Monetary Value and Certainty Amount
  • Consider a lottery: if the coin comes up heads you
    get $1000 and if it is tails you get $0
  • The EMV of the lottery is $500
  • I have the option of taking part in this lottery
  • I want to see how much money I need to give you
    up-front so you will give up the option.
  • Apparently, on average, people seem to want
    $400 to give up on this lottery (obviously, this
    is an average; your mileage may vary)
  • This is called the certainty amount
  • The difference between the certainty amount and the EMV
    is called the insurance premium (see the arithmetic sketch below)
  • To see why it makes sense, suppose the lottery
    was: with prob. 0.001 you lose your house to fire,
    and with prob. 0.999 nothing happens
  • You take insurance in essence to avoid taking
    part in this lottery.

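A tiny sketch of this arithmetic, using the figures quoted above (the $400 certainty amount is the slide's rough average, not a fixed constant):

```python
# Expected monetary value of the coin-flip lottery: heads -> $1000, tails -> $0.
outcomes = [(0.5, 1000), (0.5, 0)]
emv = sum(p * x for p, x in outcomes)            # 0.5*1000 + 0.5*0 = 500

# Certainty amount: the up-front cash the (average) person accepts
# instead of playing; the slide quotes roughly $400.
certainty_amount = 400

insurance_premium = emv - certainty_amount       # 500 - 400 = 100
print(emv, certainty_amount, insurance_premium)  # 500 400 100
```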
6
What is a solution to an MDP ?
  • The solution should tell the optimal action to do
    in each state (called a Policy)
  • Policy is a function from states to actions (but
    see the finite horizon case below)
  • Not a sequence of actions anymore
  • Needed because of the non-deterministic actions
  • If there are |S| states and |A| actions that we
    can do at each state, then there are |A|^|S|
    policies
  • How do we get the best policy?
  • Pick the policy that gives the maximal expected
    reward
  • For each policy π:
  • Simulate the policy (take actions suggested by
    the policy) to get behavior traces
  • Evaluate the behavior traces
  • Take the average value of the behavior traces
    (a simulation sketch follows the note below).

We will concentrate on infinite horizon
problems (infinite horizon doesn't
necessarily mean that all behavior
traces are infinite. They could be finite
and end in a sink state)
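A minimal sketch of the "simulate and average" evaluation described above. The interface is an assumption, not from the slides: `policy` maps states to actions, `transition(s, a)` samples a next state, and `R(s)` returns the state reward; the discounting anticipates the later slides.

```python
def sample_return(policy, transition, R, s0, gamma=0.9, horizon=200):
    """Simulate one behavior trace from s0 following `policy` and
    accumulate its discounted reward."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):            # truncate the (possibly infinite) trace
        total += discount * R(s)
        s = transition(s, policy[s])    # sampled, non-deterministic next state
        discount *= gamma
    return total

def evaluate_policy_mc(policy, transition, R, s0, n_traces=1000):
    """Average the values of many behavior traces."""
    return sum(sample_return(policy, transition, R, s0)
               for _ in range(n_traces)) / n_traces
```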
7
Horizon & Policy
If you are twenty and not a liberal, you are
heartless If you are sixty and not a
conservative, you are mindless

--Churchill
  • We said policy is a function from states to
    actions.. but we sort of lied.
  • The best policy is non-stationary, i.e., it depends on
    how long the agent has to live, which is
    called the horizon
  • More generally, a policy is a mapping from
    <state, time-to-death> → <action>
  • So, if we have a horizon of k, then we will have
    k policies
  • If the horizon is infinite, then the policies must
    all be the same.. (So the infinite horizon case is
    easy!)

8
Horizon & Policy
If you are twenty and not a liberal, you are
heartless If you are sixty and not a
conservative, you are mindless

--Churchill
  • How long should behavior traces be?
  • Each trace is no longer than k (Finite Horizon
    case)
  • Policy will be horizon-dependent (the optimal action
    depends not just on what state you are in, but
    how far your horizon is)
  • E.g., financial portfolio advice for yuppies vs.
    retirees.
  • No limit on the size of the trace (Infinite
    horizon case)
  • Policy is not horizon-dependent

We will concentrate on infinite horizon
problems (infinite horizon doesn't
necessarily mean that all behavior
traces are infinite. They could be finite
and end in a sink state)
9
How to handle unbounded state sequences?
  • If we don't have a horizon, then we can have
    potentially infinitely long state sequences.
    Three ways to handle them:
  • Use the discounted reward model (the ith state in the
    sequence contributes only γ^i R(s_i))
  • Assume that the policy is proper (i.e., each
    sequence terminates in an absorbing state with
    non-zero probability).
  • Consider average reward per step

10
How to evaluate a policy?
  • Step 1: Define the utility of a sequence of states in
    terms of their rewards
  • Assume stationarity of preferences
  • If you prefer future f1 to f2 starting tomorrow,
    you should prefer them the same way even if they
    start today
  • Then there are only two reasonable ways to define the utility
    of a sequence of states:
  • U(s1, s2, ..., sn) = Σ_i R(s_i)
  • U(s1, s2, ..., sn) = Σ_i γ^i R(s_i)   (0 < γ < 1)
  • Maximum utility is bounded from above by R_max/(1 - γ)
    (the bound is checked below)
  • Step 2: The utility of a policy π is the expected
    utility of the behaviors exhibited by an agent
    following it: E[ Σ_{t=0..∞} γ^t R(s_t) | π ]
  • Step 3: The optimal policy π* is the one that
    maximizes the expectation: argmax_π E[ Σ_{t=0..∞} γ^t
    R(s_t) | π ]
  • Since there are only |A|^|S| different policies, you
    can evaluate them all in finite time (Ha ha..)

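A quick check of the R_max/(1 - γ) bound quoted above, assuming |R(s)| ≤ R_max for every state and 0 < γ < 1:

```latex
U(s_0, s_1, \ldots) \;=\; \sum_{t \ge 0} \gamma^{t} R(s_t)
\;\le\; \sum_{t \ge 0} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}.
```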
11
Utility of a State
  • The (long-term) utility of a state s with respect
    to a policy π is the expected value of all
    state sequences starting with s:
  • U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]
  • The true utility of a state s is just its utility
    w.r.t. the optimal policy: U*(s) = U^{π*}(s)
  • Thus, U* and π* are closely related:
  • π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
  • as are the utilities of neighboring states:
  • U*(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U*(s')

Bellman Eqn
12
[Figure: the expected-value computation, annotated "(Value)" and
"(sequence of states = behavior)", with the backup repeated over states]
How about the deterministic case? U*(s_i) is the
shortest path to the goal?
13
Bellman Equations as a basis for computing
optimal policy
  • Qn: Is there a simpler way than having to
    evaluate |A|^|S| policies?
  • Yes
  • The Optimal Value and Optimal Policy are related
    by the Bellman Equations:
  • U*(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U*(s')
  • π*(s) = argmax_a Σ_{s'} M^a_{ss'} U*(s')
  • The equations can be solved exactly through
  • value iteration (iteratively compute U* and then
    compute π*); a sketch follows
  • policy iteration (iterate over policies)
  • Or solved approximately through real-time dynamic
    programming

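A minimal value-iteration sketch, under an assumed encoding that is not from the slides: `R` is a length-|S| reward vector and `M[a][s][s']` holds the transition probabilities. The termination test is the max-norm check discussed a few slides later.

```python
import numpy as np

def value_iteration(R, M, gamma=0.9, eps=1e-4):
    """R: |S| rewards; M: |A| x |S| x |S| transition probabilities M[a][s][s']."""
    nA, nS, _ = M.shape
    U = np.zeros(nS)
    while True:
        # Bellman backup: Q[a][s] = R(s) + gamma * sum_s' M[a][s][s'] U(s')
        Q = R + gamma * (M @ U)
        U_new = Q.max(axis=0)
        if np.max(np.abs(U_new - U)) < eps:    # max-norm convergence test
            return U_new, Q.argmax(axis=0)     # utilities and a greedy policy
        U = U_new
```

Returning the greedy policy alongside the utilities reflects the bullet above: compute U* first, then read π* off it.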
14
[Figure: grid-world MDP with transition probabilities 0.8, 0.1, 0.1]
U(i) = R(i) + γ max_a Σ_j M^a_{ij} U(j)

15
Value Iteration Demo
  • http://www.cs.ubc.ca/spider/poole/demos/mdp/vi.html
  • Things to note
  • The way the values change (states far from
    absorbing states may first reduce and then
    increase their values)
  • The convergence speed difference between Policy
    and value

16
Updates can be done synchronously OR
asynchronously -- convergence is guaranteed
as long as each state is updated
infinitely often
Why are values coming down first? Why are some
states reaching their optimal value faster?
[Figure: grid-world MDP with transition probabilities 0.8, 0.1, 0.1]
17
Terminating Value Iteration
  • The basic idea is to terminate the value
    iteration when the values have converged (i.e.,
    are not changing much from iteration to iteration)
  • Set a threshold ε and stop when the change across
    two consecutive iterations is less than ε
  • There is a minor problem since the value is a vector
  • We can bound the maximum change that is allowed
    in any of the dimensions between two successive
    iterations by ε
  • The max norm ||.|| of a vector is the maximal (absolute)
    value among all its dimensions. We are basically
    terminating when ||U_i - U_{i+1}|| < ε

18
Policies converge earlier than values
  • There are a finite number of policies but an infinite
    number of value functions.
  • So entire regions of the value-vector space are mapped
    to a specific policy
  • So policies may be converging faster than
    values. Search in the space of policies!
  • Given a utility vector U_i we can compute the
    greedy policy π_{U_i} (sketched after the figure below)
  • The policy loss of π_{U_i} is ||U^{π_{U_i}} - U*||
  • (the max-norm difference of two vectors is the
    maximum amount by which they differ on any
    dimension)

[Figure: the value space for an MDP with 2 states and 2 actions, with axes
V(S1) and V(S2), partitioned into regions P1-P4 (one per greedy policy);
U* lies inside one of the regions]
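A short sketch of extracting the greedy policy π_{U_i} from an intermediate utility vector and measuring its policy loss; `M` follows the earlier assumed encoding, and `U_star` is presumed available (e.g., from value iteration run to convergence):

```python
import numpy as np

def greedy_policy(U, M):
    """pi_U(s) = argmax_a sum_s' M[a][s][s'] U(s')."""
    return (M @ U).argmax(axis=0)

def policy_loss(U_pi, U_star):
    """Max-norm gap between the greedy policy's utilities and the optimum."""
    return np.max(np.abs(U_pi - U_star))
```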
19
For a fixed policy, these are n linear equations with n unknowns.
We can either solve the linear equations exactly,
or solve them approximately by running the
value iteration update a few times (the update won't
have the max operation); see the sketch below
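A sketch of solving those n linear equations exactly for a fixed policy π, i.e., (I - γ M^π) U = R, using the same assumed array encoding as the earlier sketches:

```python
import numpy as np

def evaluate_policy_exact(policy, R, M, gamma=0.9):
    """Solve U(s) = R(s) + gamma * sum_s' M[pi(s)][s][s'] U(s') for all s."""
    nS = len(R)
    M_pi = M[policy, np.arange(nS)]               # row s is M[pi(s)][s][:]
    return np.linalg.solve(np.eye(nS) - gamma * M_pi, R)
```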
20
ThanksandGiving
11/21
It's the mark of a truly educated man to be
deeply moved by statistics. -Oscar Wilde
  • Suppose you randomly reshuffled the world, and
    you have 100 people on your street (randomly
    sampled from the entire world).
  • On your street, there will be 5 people from the US.
    Suppose they are a family. This family
  • Will own 2 of the 8 cars on the entire street
  • Will own 60% of the wealth of the whole street
  • Of the 100 people on the street, you (and you
    alone) will have had a college education
  • and of your neighbors
  • Nearly half (50%) of your neighbors would suffer
    from malnutrition.
  • About 13 of the people would be chronically
    hungry.
  • One in 12 of the children on your street would
    die of some mostly preventable disease by the age
    of 5 from measles, malaria, or diarrhea. One in
    12.
  • If we came face to face with these inequities
    every day, I believe we would already be doing
    something more about them.

  • --William H. Gates (5/2003)

  • (On Bill Moyers NOW program)

http://www.pbs.org/now/transcript/transcript_gates.html
21
Bellman equations when actions have costs
  • The model discussed in class ignores action costs
    and only thinks of state rewards
  • C(s,a) is the cost of doing action a in state s
  • Assume costs are just negative rewards..
  • The Bellman equation then becomes
  • U(s) = R(s) + max_a [ -C(s,a) + γ Σ_{s'} M^a_{ss'} U(s') ]
  • Notice that the only difference is that -C(s,a)
    is now inside the maximization (a backup sketch follows)
  • With this model, we can talk about partial
    satisfaction planning problems where
  • actions have costs, goals have utilities, and the
    optimal plan may not satisfy all goals.

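A one-line sketch of the modified backup, with a hypothetical |A| x |S| cost array `C` and the same assumed encoding for `R` and `M` as in the earlier sketches:

```python
import numpy as np

def backup_with_costs(U, R, M, C, gamma=0.9):
    """U'(s) = R(s) + max_a [ -C(s,a) + gamma * sum_s' M[a][s][s'] U(s') ]."""
    return R + (-C + gamma * (M @ U)).max(axis=0)
```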
22
Incomplete observability(the dreaded POMDPs)
  • To model partial observability, all we need to do
    is to look at the MDP in the space of belief states
    (belief states are fully observable even when
    world states are not)
  • Policy maps belief states to actions
  • In practice, this causes (humongous) problems
  • The space of belief states is continuous (even
    if the underlying world is discrete and finite).
    GET IT? GET IT??
  • Even approximate policies are hard to find
    (PSPACE-hard).
  • Problems with a few dozen world states are hard to
    solve currently
  • Depth-limited exploration (such as that done in
    adversarial games) is the only option

[Figure: a belief state (s1: 0.3, s2: 0.4, s4: 0.3) evolving after 5 LEFTs,
5 UPs, and 5 RIGHTs]
This figure basically shows that belief states
change as we take actions (a belief-update sketch follows)
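A sketch of a standard Bayes-filter belief update after an action and an observation; the transition array `T[a][s][s']` and observation array `O[a][s'][o]` are assumptions (the slide itself only illustrates the action part):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Prediction: b'(s') = sum_s T[a][s][s'] b(s);
    correction: weight by P(o | s', a) and renormalize."""
    predicted = b @ T[a]                # action moves probability mass around
    corrected = predicted * O[a][:, o]  # observation likelihood per next state
    return corrected / corrected.sum()
```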
23
Real Time Dynamic Programming
  • Value and Policy iteration are the bed-rock
    methods for solving MDPs. Both give optimality
    guarantees
  • Both of them tend to be very inefficient for
    large (several-thousand-state) MDPs (polynomial
    in |S|)
  • Many ideas are used to improve the efficiency
    while giving up optimality guarantees
  • E.g., consider only the part of the policy for the more
    likely states (envelope extension method)
  • Interleave search and execution (Real Time
    Dynamic Programming)
  • Do limited-depth analysis based on reachability
    to find the value of a state (and thereby the
    best action you should be doing, which is the
    action that is sending you the best value)
  • The values of the leaf nodes are set to be their
    immediate rewards
  • Alternatively, some admissible estimate of the
    value function (h)
  • If all the leaf nodes are terminal nodes, then
    the backed-up value will be the true optimal value.
    Otherwise, it is an approximation

RTDP
For leaf nodes, one can use R(s) or some heuristic
value h(s); a sketch of the RTDP loop follows
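A rough sketch of the interleaved backup-and-execute loop described above, using the same assumed arrays as the earlier sketches plus a hypothetical `sample_next(s, a)` simulator; unvisited states keep whatever estimate U was initialized with (R or an admissible h):

```python
import numpy as np

def rtdp_step(s, U, R, M, gamma=0.9):
    """One RTDP backup at state s: compute Q from the current U,
    overwrite U[s] with the backed-up value, return the greedy action."""
    q = R[s] + gamma * (M[:, s, :] @ U)    # one Q-value per action
    a = int(np.argmax(q))
    U[s] = q[a]                            # backed-up value replaces the old estimate
    return a

def rtdp(s0, U, R, M, sample_next, n_trials=100, max_steps=50):
    """Interleave backup and execution over repeated trials from s0."""
    for _ in range(n_trials):
        s = s0
        for _ in range(max_steps):
            a = rtdp_step(s, U, R, M)
            s = sample_next(s, a)          # execute in the (simulated) world
    return U
```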
24
MDPs and Deterministic Search
  • Problem-solving agent search corresponds to what
    special case of MDP?
  • Actions are deterministic; goal states are all
    equally valued, and are all sink states.
  • Is it worth solving the problem using MDPs?
  • The construction of the optimal policy is overkill
  • The policy, in effect, gives us the optimal path
    from every state to the goal state(s)
  • The value function, or its approximations, on the
    other hand, are useful. How?
  • As heuristics for the problem-solving agent's
    search
  • This shows an interesting connection between
    dynamic programming and state-search paradigms
  • DP solves many related problems on the way to
    solving the one problem we want
  • State search tries to solve just the problem we
    want
  • We can use DP to find heuristics to run state
    search..

25
RTA* (RTDP with deterministic actions and leaves
evaluated by f(.))
[Figure: RTA* lookahead tree rooted at S with nodes n, m, k and goal G,
annotated with g, h, f values (g=1, h=2, f=3 for n and m; g=2, h=3, f=5
for k) and an f = infinity entry]
RTA* is a special case of RTDP -- it is useful
for acting in deterministic, dynamic worlds
-- while RTDP is useful for acting in stochastic,
dynamic worlds
-- Grow the tree to depth d -- apply f-evaluation
to the leaf nodes -- propagate f-values up to the
parent nodes: f(parent) = min(f(children))
26
What if you see this as a game?
If you are a perpetual optimist then V2 =
max(V3, V4)
Min-Max!
If you have deterministic actions then RTDP
becomes RTA* (if you use h(.) to evaluate leaves)
27
Incomplete observability(the dreaded POMDPs)
  • To model partial observability, all we need to do
    is to look at the MDP in the space of belief states
    (belief states are fully observable even when
    world states are not)
  • Policy maps belief states to actions
  • In practice, this causes (humongous) problems
  • The space of belief states is continuous (even
    if the underlying world is discrete and finite).
    GET IT? GET IT??
  • Even approximate policies are hard to find
    (PSPACE-hard).
  • Problems with a few dozen world states are hard to
    solve currently
  • Depth-limited exploration (such as that done in
    adversarial games) is the only option

[Figure: a belief state (s1: 0.3, s2: 0.4, s4: 0.3) evolving after 5 LEFTs
and 5 UPs]
This figure basically shows that belief states
change as we take actions
28
Claude Shannon (finite look-ahead)
Chaturanga, India (550 AD) (Proto-Chess)
Von Neumann (Min-Max theorem)
Donald Knuth (α-β analysis)
John McCarthy (α-β pruning)
29
What if you see this as a game?
If you are a perpetual optimist then V2 =
max(V3, V4)
Review
Min-Max!
30
Game Playing (Adversarial Search)
  • Perfect play
  • Do minmax on the complete game tree
  • Alpha-Beta pruning (a neat idea that is the bane
    of many a CSE471 student)
  • Resource limits
  • Do limited depth lookahead
  • Apply evaluation functions at the leaf nodes
  • Do minmax
  • Miscellaneous
  • Games of Chance
  • Status of computer games..

31
Fun to try and find analogies between this and
environment properties
32
(No Transcript)
33
[Figure: alpha-beta example tree with node bounds < 2, < 2, < 5, < 14, a
marked cutoff, and leaf values 2, 14, 5, 2]
  • Whenever a node gets its true value, its
    parent's bound gets updated
  • When all children of a node have been evaluated
    (or a cutoff occurs below that node), the
    current bound of that node is its true value
  • Two types of cutoffs:
  • If a min node n has bound < k, and a max ancestor
    of n, say m, has a bound > j, then a cutoff occurs
    as long as j > k
  • If a max node n has bound > k, and a min ancestor
    of n, say m, has a bound < j, then a cutoff occurs
    as long as j < k
  • (a compact alpha-beta sketch follows)

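A compact sketch of depth-limited minimax with alpha-beta cutoffs implementing the two rules above; the `children(node)` and `evaluate(node)` hooks are assumptions about how the game is encoded:

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Minimax with alpha-beta cutoffs; alpha/beta carry the ancestors' bounds."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)              # static evaluation at the leaves
    if maximizing:
        value = float('-inf')
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False,
                                         children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:              # a min ancestor already guarantees <= beta
                break                      # cutoff: this node cannot affect the root
        return value
    else:
        value = float('inf')
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True,
                                         children, evaluate))
            beta = min(beta, value)
            if alpha >= beta:              # a max ancestor already guarantees >= alpha
                break
        return value
```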
34
11/26
  • Agenda
  • Adversarial Search (30min)
  • Learning Inductive Learning (45min)

35
Another alpha-beta example
Project 2 assigned
36
(order nodes in terms of their static eval
values)
Click for an animation of Alpha-beta search in
action on Tic-Tac-Toe
37
(No Transcript)
38
Searching Tic Tac Toe using Minmax
A game is considered Solved if it can be shown
that the MAX player has a winning (or at least
non-losing) strategy. This means that the
backed-up value in the full min-max tree is +ve
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
Evaluation Functions TicTacToe
If win for Max: +infinity. If loss for Max:
-infinity. If draw for Max: 0. Else:
(# of rows/cols/diags open for Max) -
(# of rows/cols/diags open for Min)
(a sketch of this function follows)
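A sketch of that evaluation function, under an assumed encoding of the board as a 3x3 grid holding 'X' (Max), 'O' (Min), or None:

```python
# All 8 winning lines: 3 rows, 3 columns, 2 diagonals.
LINES = ([[(r, c) for c in range(3)] for r in range(3)] +
         [[(r, c) for r in range(3)] for c in range(3)] +
         [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]])

def evaluate(board):
    """+inf / -inf / 0 for terminal positions, else the open-line difference."""
    def vals(line):
        return [board[r][c] for r, c in line]
    for line in LINES:
        if vals(line) == ['X'] * 3:
            return float('inf')            # win for Max
        if vals(line) == ['O'] * 3:
            return float('-inf')           # loss for Max
    if all(board[r][c] is not None for r in range(3) for c in range(3)):
        return 0                           # draw: board full, no winner
    open_max = sum(1 for line in LINES if 'O' not in vals(line))
    open_min = sum(1 for line in LINES if 'X' not in vals(line))
    return open_max - open_min
```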
43
(No Transcript)
44
(No Transcript)
45
What depth should we go to? -- The deeper the
better (but why?) Should we go to uniform
depth? -- Go deeper in branches where
the game is in a flux (backed-up
values are changing fast)
This is called Quiescence search. Can we avoid the horizon
effect?
46
Why is deeper better?
  • Possible reasons
  • Taking mins/maxes of the evaluation values of the
    leaf nodes improves their collective accuracy
  • Going deeper makes the agent notice traps thus
    significantly improving the evaluation accuracy
  • All evaluation functions first check for
    termination states before computing the
    non-terminal evaluation

47
(just as human weight lifters refuse to compete
against cranes)
48
End of Gametrees
49
(so is MDP policy)
50
(No Transcript)
51
(No Transcript)
52
Multi-player Games
Everyone maximizes their own utility -- How does
this compare to 2-player games? (Max's
utility is the negative of Min's)
53
Expecti-Max
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)