What if you didn - PowerPoint PPT Presentation

About This Presentation

Title:

What if you didn

Description:

What if you see this as a game? RTA* (RTDP with deterministic actions and leaves evaluated by f(.)) Game Playing (Adversarial Search) ... – PowerPoint PPT presentation

Slides: 45

Provided by: rao58

Learn more at: https://rakaposhi.eas.asu.edu

Transcript and Presenter's Notes

1
What if you didn't have any hard goals? And got
rewards continually? And had stochastic actions?

MDPs as Utility-based problem solving agents

2
Repeat
Can generalize to have action costs C(a,s)
If the transition matrix M_ij is not known a priori, then we
have a reinforcement learning scenario.
3
Repeat
U is the maximal expected utility (value)
assuming optimal policy
4
Policies change with rewards..
Repeat
5
Repeat
(Value)
(sequence of states behavior)
How about the deterministic case? U(si) is the
shortest path to the goal?
6
MDPs and Deterministic Search

Problem solving agent search corresponds to what
special case of MDP?
Actions are deterministic. Goal states are all
equally valued, and are all sink states.
Is it worth solving the problem using MDPs?
The construction of an optimal policy is overkill.
(The policy, in effect, gives us the optimal path
from every state to the goal state(s).)
The value function, or its approximations, on the
other hand, are useful. How?
As heuristics for the problem solving agent's
search.
This shows an interesting connection between the
dynamic programming and state search paradigms:
DP solves many related problems on the way to
solving the one problem we want.
State search tries to solve just the problem we
want.
We can use DP to find heuristics to run state
search.

7
SSPP: Stochastic Shortest Path Problem. An MDP with
Init and Goal states
Not discussed (MDP variation closest to A*)

MDPs don't have a notion of an initial and
goal state. (Process orientation instead of
task orientation)
Goals are sort of modeled by reward functions
Allows pretty expressive goals (in theory)
Normal MDP algorithms don't use initial state
information (since the policy is supposed to cover
the entire search space anyway).
Could consider envelope extension methods
Compute a deterministic plan (which gives the
policy for some of the states). Extend the policy
to other states that are likely to happen during
execution
RTDP methods

SSPPs are a special case of MDPs where
(a) the initial state is given
(b) there are absorbing goal states
(c) actions have costs. Goal states have zero
costs.
A proper policy for an SSPP is a policy which is
guaranteed to ultimately put the agent in one of
the absorbing states
For SSPPs, it would be worth finding a partial
policy that only covers the relevant states
(states that are reachable from init and goal
states on any optimal policy)
Value/Policy Iteration don't consider the notion
of relevance
Consider heuristic state search algorithms
Heuristic can be seen as the estimate of the
value of a state.

8
Why are they called Markov decision processes?

Markov property means that the state contains all the
information (to decide the reward or the
transition)
Reward of a state Sn is independent of the path
used to get to Sn
Effect of doing an action A in state Sn doesn't
depend on the way we reached state Sn
(As a consequence of the above) Maximal expected
utility of a state S doesn't depend on the path
used to get to S
Markov properties are assumed (to make life
simple)
It is possible to have non-Markovian rewards
(e.g., you will get a reward in state Si only if
you came to Si through Sj)
E.g., if you picked up a coupon before going to
the theater, then you will get a reward
It is possible to convert non-Markovian rewards
into Markovian ones, but it leads to a blow-up in
the state space. In the theater example above,
add coupon as part of the state (it becomes an
additional state variable, increasing the state
space two-fold).
It is also possible to have non-Markovian
effects, especially if you have partial
observability
E.g., suppose there are two states of the world
where the agent can get banana smell

Added based on class discussion
9
What does a solution to an MDP look like?

The solution should tell the optimal action to do
in each state (called a Policy)
Policy is a function from states to actions (but
see the finite horizon case below)
Not a sequence of actions anymore
Needed because of the non-deterministic actions
If there are |S| states and |A| actions that we
can do at each state, then there are |A|^|S|
policies
How do we get the best policy?
Pick the policy that gives the maximal expected
reward
For each policy p
Simulate the policy (take actions suggested by
the policy) to get behavior traces
Evaluate the behavior traces
Take the average value of the behavior traces.
How long should behavior traces be?
Each trace is no longer than k (Finite Horizon
case)
Policy will be horizon-dependent (optimal action
depends not just on what state you are in, but
how far away your horizon is)
E.g., financial portfolio advice for yuppies vs.
retirees.
No limit on the size of the trace (Infinite
horizon case)
Policy is not horizon dependent
Question: Is there a simpler way than having to
evaluate |A|^|S| policies?
Yes

We will concentrate on infinite horizon
problems (infinite horizon doesn't
necessarily mean that all behavior
traces are infinite. They could be finite
and end in a sink state)
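The brute-force recipe above (enumerate every policy, simulate it, average the traces) can be sketched in a few lines. This is only an illustration under stated assumptions: the two-state MDP, its rewards, the 0.8 success probability, and all function names are hypothetical and not from the lecture.

```python
import itertools
import random

STATES = ["a", "b"]
ACTIONS = ["go", "stay"]

def step(state, action, rng):
    """Hypothetical toy dynamics: 'go' advances a -> b -> goal with
    probability 0.8. Every non-goal step costs 1; reaching goal pays 10."""
    if action == "go" and rng.random() < 0.8:
        if state == "a":
            return "b", -1.0, False
        return "goal", 10.0, True  # goal is a sink state, so the trace ends
    return state, -1.0, False

def enumerate_policies(states, actions):
    """Yield all |A|^|S| deterministic policies as dicts state -> action."""
    for choice in itertools.product(actions, repeat=len(states)):
        yield dict(zip(states, choice))

def average_return(policy, start, horizon, n_traces, rng):
    """Average total reward over simulated behavior traces of length <= horizon."""
    total = 0.0
    for _ in range(n_traces):
        state = start
        for _ in range(horizon):
            state, reward, done = step(state, policy[state], rng)
            total += reward
            if done:
                break
    return total / n_traces

rng = random.Random(0)
policies = list(enumerate_policies(STATES, ACTIONS))
scores = [average_return(p, "a", horizon=5, n_traces=200, rng=rng) for p in policies]
best_policy = policies[scores.index(max(scores))]
```

With 2 states and 2 actions there are only 4 policies to try, but the count grows as |A|^|S|, which is why the slide asks for a simpler way.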
12
[Figure: transition model with outcome probabilities 0.8, 0.1, 0.1]
13
Bellman equations when actions have costs

The model discussed in class ignores action costs
and only thinks of state rewards
More generally, the reward/cost depends on the
state as well as the action
R(s,a) is the reward/cost of doing action a in
state s
The Bellman equation then becomes
$U(s) = \max_a \left[ R(s,a) + \text{expected utility of doing } a \right]$
Notice that the only difference is that R(.,.) is
now inside the maximization
With this model, we can talk about partial
satisfaction planning problems where
actions have costs, goals have utilities, and the
optimal plan may not satisfy all goals.
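Spelling out the "expected utility of doing a" term gives the form below. Treat it as a sketch: the discount factor $\gamma$ and the notation $M^{a}_{ss'}$ for the probability of reaching $s'$ from $s$ under action $a$ are assumptions, since the slide leaves that term in words. The second line is the state-reward-only model for comparison, where the reward sits outside the maximization.

```latex
U(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} M^{a}_{ss'} \, U(s') \Big]

U(s) = R(s) + \gamma \max_{a} \sum_{s'} M^{a}_{ss'} \, U(s')
```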

Not discussed
14
Updates can be done synchronously OR
asynchronously. Convergence is guaranteed
as long as each state is updated
infinitely often
Why are values coming down first? Why are some
states reaching optimal value faster?
15
Terminating Value Iteration

The basic idea is to terminate the value
iteration when the values have converged (i.e.,
are not changing much from iteration to iteration)
Set a threshold ε and stop when the change across
two consecutive iterations is less than ε
There is a minor problem since the value is a vector
We can bound the maximum change that is allowed
in any of the dimensions between two successive
iterations by ε
The max norm ||.|| of a vector is the maximal absolute value
among all its dimensions. We are basically
terminating when $\|U_i - U_{i+1}\| < \epsilon$
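A minimal sketch of value iteration with the max-norm stopping test described above. The three-state chain MDP, the state rewards, the 0.8/0.2 outcome probabilities, and the discount factor 0.5 are all assumptions made up for illustration.

```python
GAMMA = 0.5  # assumed discount factor

# Hypothetical chain MDP: T[s][a] is a list of (next_state, probability).
T = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("goal", 0.8), ("s1", 0.2)], "stay": [("s1", 1.0)]},
    "goal": {},  # sink state: no actions available
}
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}  # state rewards

def bellman_backup(U, s):
    """One Bellman update for state s, using the current estimate U."""
    if not T[s]:
        return R[s]
    best = max(sum(p * U[s2] for s2, p in outcomes) for outcomes in T[s].values())
    return R[s] + GAMMA * best

def value_iteration(eps):
    """Synchronous value iteration; stops when the max-norm change is below eps."""
    U = {s: 0.0 for s in T}
    iterations = 0
    while True:
        U_next = {s: bellman_backup(U, s) for s in T}
        iterations += 1
        change = max(abs(U_next[s] - U[s]) for s in T)  # max norm of the difference
        U = U_next
        if change < eps:
            return U, iterations

U, n_iter = value_iteration(eps=1e-6)
```

For this toy chain the fixed point can be checked by hand: U(goal) = 1, U(s1) = 4/9, and U(s0) = 16/81.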

16
Policies converge earlier than values

There are a finite number of policies but an infinite
number of value functions.
So entire regions of the value vector space are mapped
to a specific policy
So policies may be converging faster than
values. Search in the space of policies
Given a utility vector U_i we can compute the
greedy policy π_{U_i}
The policy loss of π_{U_i} is $\|U^{\pi_{U_i}} - U^*\|$
(the max norm difference of two vectors is the
maximum amount by which they differ on any
dimension)
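To see a whole region of value vectors mapping to one policy, here is a small sketch that extracts the greedy policy from two different utility vectors. The chain MDP and both utility vectors are hypothetical; `converged_U` holds hand-computed values for this toy model under an assumed discount of 0.5.

```python
# Hypothetical chain MDP: T[s][a] is a list of (next_state, probability).
T = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("goal", 0.8), ("s1", 0.2)], "stay": [("s1", 1.0)]},
    "goal": {},  # sink state: no actions available
}

def expected_utility(U, outcomes):
    """Expected utility of the successor states of one action."""
    return sum(p * U[s2] for s2, p in outcomes)

def greedy_policy(U):
    """For each non-sink state, pick the action with the best expected utility."""
    return {
        s: max(acts, key=lambda a: expected_utility(U, acts[a]))
        for s, acts in T.items()
        if acts
    }

def max_norm_diff(U1, U2):
    """Maximum amount by which two utility vectors differ on any state."""
    return max(abs(U1[s] - U2[s]) for s in U1)

rough_U = {"s0": 0.0, "s1": 0.1, "goal": 1.0}          # far from converged
converged_U = {"s0": 16 / 81, "s1": 4 / 9, "goal": 1.0}  # hand-computed fixed point

same_policy = greedy_policy(rough_U) == greedy_policy(converged_U)
value_gap = max_norm_diff(rough_U, converged_U)
```

In this example the rough vector is still off by about 0.34 in max norm, yet its greedy policy already matches the one from the converged values.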

[Figure: value space of an MDP with 2 states and 2 actions. Axes V(S1) and V(S2); labels P1 to P4; the point U is marked.]
17
n linear equations with n unknowns.
We can either solve the linear equations exactly,
or solve them approximately by running
value iteration a few times (the update won't
have the max operation)
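The approximate route can be sketched as repeated updates for a fixed policy, with no max over actions. Everything concrete here (the chain MDP, rewards, discount 0.5, the policy, and the number of sweeps) is an assumption for illustration.

```python
GAMMA = 0.5  # assumed discount factor

# Hypothetical chain MDP: T[s][a] is a list of (next_state, probability).
T = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("goal", 0.8), ("s1", 0.2)], "stay": [("s1", 1.0)]},
    "goal": {},  # sink state: no actions available
}
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}
POLICY = {"s0": "go", "s1": "go"}  # the fixed policy being evaluated

def evaluate_policy(policy, sweeps):
    """Iterative policy evaluation: the action is fixed by the policy,
    so each update is a plain expectation with no max operation."""
    U = {s: 0.0 for s in T}
    for _ in range(sweeps):
        U = {
            s: (R[s] + GAMMA * sum(p * U[s2] for s2, p in T[s][policy[s]]))
            if s in policy
            else R[s]
            for s in T
        }
    return U

U_pi = evaluate_policy(POLICY, sweeps=50)
```

Solving the same three linear equations by hand gives U(s1) = 4/9 and U(s0) = 16/81, which the iteration approaches.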
18
Other ways of solving MDPs

Value and Policy iteration are the bedrock
methods for solving MDPs. Both give optimality
guarantees
Both of them tend to be very inefficient for
large (several thousand state) MDPs
Many ideas are used to improve the efficiency
while giving up optimality guarantees
E.g., consider the part of the policy for more
likely states (envelope extension method)
Interleave search and execution (Real Time
Dynamic Programming)
Do limited-depth analysis based on reachability
to find the value of a state (and thereby the
best action you should be doing, which is the
action that is sending you to the best value)
The values of the leaf nodes are set to be their
immediate rewards
If all the leaf nodes are terminal nodes, then
the backed up value will be the true optimal value.
Otherwise, it is an approximation
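The limited-depth analysis can be sketched as a depth-bounded expectimax whose leaves are scored by immediate reward. The chain MDP, the rewards, and the discount 0.5 are hypothetical; this is the lookahead step only, not a full RTDP implementation.

```python
GAMMA = 0.5  # assumed discount factor

# Hypothetical chain MDP: T[s][a] is a list of (next_state, probability).
T = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("goal", 0.8), ("s1", 0.2)], "stay": [("s1", 1.0)]},
    "goal": {},  # terminal state
}
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}

def lookahead_value(s, depth):
    """Backed-up value of s from a tree of the given depth.
    Leaves (depth 0 or terminal) are valued at their immediate reward."""
    if depth == 0 or not T[s]:
        return R[s]
    return R[s] + GAMMA * max(
        sum(p * lookahead_value(s2, depth - 1) for s2, p in outcomes)
        for outcomes in T[s].values()
    )

def best_action(s, depth):
    """Action whose successors have the best expected backed-up value."""
    return max(
        T[s],
        key=lambda a: sum(p * lookahead_value(s2, depth - 1) for s2, p in T[s][a]),
    )

v_s1 = lookahead_value("s1", 1)
v_s0 = lookahead_value("s0", 2)
a_s0 = best_action("s0", 2)
```

From s0 a depth-1 tree cannot see the goal at all, so both actions look equally bad; depth 2 is the first depth at which "go" stands out.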

RTDP
19
What if you see this as a game?
If you are a perpetual optimist then V2 =
max(V3,V4)
Min-Max!
If you have deterministic actions then RTDP
becomes RTA* (if you use h(.) to evaluate leaves)
21
Incomplete observability (the dreaded POMDPs)

To model partial observability, all we need to do
is to look at the MDP in the space of belief states
(belief states are fully observable even when
world states are not)
Policy maps belief states to actions
In practice, this causes (humongous) problems
The space of belief states is continuous (even
if the underlying world is discrete and finite).
GET IT? GET IT??
Even approximate policies are hard to find
(PSPACE-hard).
Problems with a few dozen world states are hard to
solve currently
Depth-limited exploration (such as that done in
adversarial games) is the only option

Belief state: s1 = 0.3, s2 = 0.4, s4 = 0.3
5 LEFTs
5 UPs
This figure basically shows that belief states
change as we take actions
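How a belief state shifts under repeated actions can be sketched with a simple prediction step. The four-cell corridor, the 0.8 move probability, and the function names are assumptions, not the grid in the figure; there is no observation update here, only the effect of acting.

```python
MOVES = {"LEFT": -1, "RIGHT": +1}
N_CELLS = 4  # hypothetical corridor with cells 0..3

def transition(s, action):
    """Assumed model: the move succeeds with probability 0.8, otherwise the
    agent stays put. Walls at both ends block movement."""
    target = min(max(s + MOVES[action], 0), N_CELLS - 1)
    if target == s:
        return [(s, 1.0)]
    return [(target, 0.8), (s, 0.2)]

def predict(belief, action):
    """Push a belief state (dict state -> probability) through one action."""
    new_belief = {}
    for s, b in belief.items():
        for s2, p in transition(s, action):
            new_belief[s2] = new_belief.get(s2, 0.0) + b * p
    return new_belief

belief = {1: 0.3, 2: 0.4, 3: 0.3}
after_one = predict(belief, "LEFT")

after_five = belief
for _ in range(5):
    after_five = predict(after_five, "LEFT")
```

After five LEFTs most of the probability mass has piled up against the left wall, which is the kind of belief change the figure illustrates.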
23
Claude Shannon (finite look-ahead)
Chaturanga, India (550 AD) (Proto-Chess)
Von Neumann (Min-Max theorem)
9/28
Donald Knuth (α-β analysis)
John McCarthy (α-β pruning)
24
Agenda

Loose ends from MDP
Horizon in MDP
And making rewards finite over infinite horizons
RTA* (is RTDP with deterministic actions)
Min-max is RTDP with min-max instead of
expectimax
And today's main topic
It's all fun and GAMES

Steaming in Tempe
25
Announcements etc.

Homework 2 returned ?
(!! Our TA doesn't sleep)
Average 33/60
Max 56/60
Solutions online
Homework 3 socket opened ?
Project 1 due today
Extra credit portion will be accepted until
Thursday with late penalty
Any steam to be let off?
Today's class
It's all fun and GAMES

Steaming in Tempe
28
What if you see this as a game?
If you are a perpetual optimist then V2 =
max(V3,V4)
Review
Min-Max!
29
RTA* (RTDP with deterministic actions and leaves
evaluated by f(.))
[Figure: search tree from S with successors n, m, k and goal G; node annotations G=1 H=2 F=3, G=1 H=2 F=3, G=2 H=3 F=5, and infty]
RTA* is a special case of RTDP
--It is useful for acting in deterministic, dynamic worlds
--While RTDP is useful for acting in stochastic,
dynamic worlds
--Grow the tree to depth d
--Apply f-evaluation for the leaf nodes
--Propagate f-values up to the parent nodes:
f(parent) = min(f(children))
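The three steps (grow to depth d, evaluate leaves with f = g + h, back up the minimum) can be sketched as below. The graph, edge costs, and heuristic values are made up for illustration and are not the tree in the figure. The sketch covers only the lookahead and move choice; the bookkeeping RTA* does to update stored heuristic values as it moves is left out.

```python
import math

# Hypothetical deterministic graph: SUCC[node] is a list of (child, edge_cost).
SUCC = {
    "S": [("n", 1), ("m", 1), ("k", 1)],
    "n": [("G", 5)],
    "m": [("x", 1)],
    "x": [("G", 1)],
    "k": [],  # dead end
    "G": [],  # goal
}
# Assumed heuristic estimates h(.) of the remaining cost to the goal.
H = {"S": 2, "n": 2, "m": 2, "x": 1, "k": math.inf, "G": 0}

def backed_up_f(node, g, depth):
    """f-value of node after growing the tree `depth` more levels.
    Leaves get f = g + h; inner nodes get the minimum f of their children."""
    if depth == 0 or not SUCC[node]:
        return g + H[node]
    return min(backed_up_f(child, g + cost, depth - 1) for child, cost in SUCC[node])

def choose_move(node, depth):
    """Pick the successor with the smallest backed-up f-value."""
    best_edge = min(SUCC[node], key=lambda edge: backed_up_f(edge[0], edge[1], depth - 1))
    return best_edge[0]

shallow_choice = choose_move("S", 1)
deeper_choice = choose_move("S", 2)
```

In this toy graph n and m tie at depth 1 (both have f = 3), while depth 2 reveals that the route through n costs 6, so the deeper search picks m.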
30
Game Playing (Adversarial Search)

Perfect play
Do minmax on the complete game tree
Resource limits
Do limited depth lookahead
Apply evaluation functions at the leaf nodes
Do minmax
Alpha-Beta pruning (a neat idea that is the bane
of many a CSE471 student)
Miscellaneous
Games of Chance
Status of computer games..

31
Fun to try and find analogies between this and
environment properties
32
(just as human weight lifters refuse to compete
against cranes)
34
Searching Tic Tac Toe using Minmax
38
Evaluation Functions: Tic-Tac-Toe
If win for Max: +infty
If lose for Max: -infty
If draw for Max: 0
Else: (number of rows/cols/diags open for Max) -
(number of rows/cols/diags open for Min)
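A small sketch of this evaluation function. The board encoding (a list of 9 cells holding "X", "O", or None, with X as Max) and the helper names are assumptions; a line counts as open for a player when the opponent has no mark on it.

```python
import math

# The 8 winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winner(board):
    """Return 'X' or 'O' if some line is fully owned by that player, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def open_lines(board, player):
    """Number of lines that contain no mark of the opponent."""
    return sum(1 for line in LINES if all(board[i] in (None, player) for i in line))

def evaluate(board, max_player="X", min_player="O"):
    """Termination checks come first, then the open-lines difference."""
    w = winner(board)
    if w == max_player:
        return math.inf
    if w == min_player:
        return -math.inf
    if all(cell is not None for cell in board):
        return 0  # draw
    return open_lines(board, max_player) - open_lines(board, min_player)

empty = [None] * 9
center_x = [None, None, None, None, "X", None, None, None, None]
```

On the empty board both sides have all 8 lines open, so the score is 0; after X takes the center, O loses the 4 lines through it and the score becomes 4.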
41
What depth should we go to?
--Deeper the better (but why?)
Should we go to uniform depth?
--Go deeper in branches where the game is in flux
(backed up values are changing fast)
Called Quiescence
Can we avoid the horizon effect?
42
Why is deeper better?

Possible reasons
Taking mins/maxes of the evaluation values of the
leaf nodes improves their collective accuracy
Going deeper makes the agent notice traps thus
significantly improving the evaluation accuracy
All evaluation functions first check for
termination states before computing the
non-terminal evaluation

43
(so is MDP policy)
44
[Figure: alpha-beta example tree; node bounds < 2, < 2, < 5, < 14, leaf values 2, 14, 5, 2, and one branch marked Cut]

Whenever a node gets its true value, its
parent's bound gets updated
When all children of a node have been evaluated
(or a cutoff occurs below that node), the
current bound of that node is its true value
Two types of cutoffs
If a min node n has bound < k, and a max ancestor
of n, say m, has a bound > j, then a cutoff occurs
as long as j > k
If a max node n has bound > k, and a min ancestor
of n, say m, has a bound < j, then a cutoff occurs
as long as j < k
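The two cutoff rules can be seen in a compact alpha-beta sketch. The game tree here is an assumed example encoded as nested lists (inner lists are nodes, numbers are leaf evaluations), not the tree from the figure. It uses the common `alpha >= beta` test, which also cuts when the two bounds are equal.

```python
import math

def alphabeta(node, is_max, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of `node` with alpha-beta cutoffs.
    alpha: best value a max ancestor can already guarantee (lower bound).
    beta:  best value a min ancestor can already guarantee (upper bound).
    Evaluated leaves are appended to `visited` when a list is supplied."""
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    if is_max:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:  # a min ancestor will never let play reach here
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:  # a max ancestor already has something at least as good
            break
    return value

# Hypothetical two-ply tree: a max root over three min nodes.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
seen = []
root_value = alphabeta(tree, is_max=True, visited=seen)
```

In the second min node, the first leaf (2) already pushes its bound below the root's bound of 3, so leaves 4 and 6 are never evaluated.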

46
An eye for an eye only ends up making the whole
world blind. -Mohandas Karamchand Gandhi,
born October 2nd, 1869.
Lecture of October 2nd, 2003
47
Another alpha-beta example
Project 2 assigned
48
(order nodes in terms of their static eval
values)
Click for an animation of Alpha-beta search in
action on Tic-Tac-Toe
51
Multi-player Games
Everyone maximizes their utility
--How does this compare to 2-player games? (Max's
utility is the negative of Min's)
52
Expecti-Max