Title: Representations for Decision Making Under Uncertainty
1. Representations for Decision Making Under Uncertainty
- Michael L. Littman (Rutgers Univ.)
- mlittman_at_cs.rutgers.edu
2. Example Travel Domain
- 6/23 1530 at(home), email(goal 6/26 0900 at(Cornell))
- 6/23 1530 wait(6/25 1830)
- 6/25 1830 go(mycar, home, EWR)
- 6/25 2030 go(plane, EWR, SYR)
- 6/25 2130 go(alexcar, SYR, alexhouse)
- 6/25 2230 wait(6/26 0800)
- 6/26 0800 go(alexcar2, alexhouse, Cornell)
- 6/26 0900 at(Cornell)
3. Example Travel Domain
- 6/23 1530 at(home), email(goal 6/26 0900 at(Cornell))
- 6/23 1530 wait(6/25 1830), email(6/25 2130 at(plane, EWR))
- 6/25 1830 go(mycar, home, EWR)
- 6/25 2030 go(plane, EWR, SYR)
- 6/25 2130 go(alexcar, SYR, alexhouse)
- 6/25 2230 wait(6/26 0800)
- 6/26 0800 go(alexcar2, alexhouse, Cornell)
- 6/26 0900 at(Cornell)
Revised plan (after the delay notification):
- 6/25 1830 wait(6/25 1930)
- 6/25 1930 go(mycar, home, EWR)
- 6/25 2130 go(plane, EWR, SYR)
- 6/25 2230 go(alexcar, SYR, alexhouse)
- 6/25 2330 wait(6/26 0800)
- 6/26 0800 go(alexcar2, alexhouse, Cornell)
- 6/26 0900 at(Cornell)
4. Example Travel Domain
- 6/23 1530 at(home), email(goal 6/26 0900 at(Cornell))
- 6/23 1530 wait(6/25 1830), email(6/25 2130 at(plane, EWR))
- 6/25 1830 wait(6/25 1930)
- 6/25 1930 go(mycar, home, EWR)
- 6/25 2130 go(plane, EWR, SYR), action failed
- 6/25 2230 go(alexcar, SYR, alexhouse)
- 6/25 2330 wait(6/26 0800)
- 6/26 0800 go(alexcar2, alexhouse, Cornell)
- 6/26 0900 at(Cornell)
Recovery plan (after the failed flight):
- 6/25 2200 go(mycar, EWR, home)
- 6/25 2230 wait(6/26 0400)
- 6/26 0400 go(mycar, home, Cornell)
- 6/26 0900 at(Cornell)
5. Things to Notice
- Real-world tasks are uncertain.
- Need to think ahead, react to surprises.
- I'm tired.
- Level of description is carefully chosen and requires significant engineering:
  - which details to suppress
  - model transition probabilities, costs
  - varies with task
Gap
6. Logical Representation
- STRIPS + belief nets (Littman & Younes 03).
- go(vehicle, from, to) (sketched in code below)
  - if (at(from) and at(vehicle, from))
    - 0.7: not(at(from)), not(at(vehicle, from)), at(to), at(vehicle, to)
    - 0.3: failure
  - else: failure
- First planning competition at ICAPS-04.
- Progress in algorithm design.
- Domain engineering is a major challenge.
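A minimal sketch of how this probabilistic go operator could be encoded, using plain Python sets of fluent tuples rather than the actual STRIPS/belief-net syntax of Littman & Younes 03; the fluent encoding and function names are illustrative assumptions.

```python
import random

def go(state, vehicle, frm, to):
    """Illustrative probabilistic STRIPS-style operator go(vehicle, from, to).
    'state' is a set of fluent tuples; this encoding is an assumption,
    not the Littman & Younes 03 representation itself."""
    # Precondition: the agent and the vehicle are both at 'frm'.
    if ("at", frm) in state and ("at", vehicle, frm) in state:
        if random.random() < 0.7:   # probability 0.7: the move succeeds
            return (state - {("at", frm), ("at", vehicle, frm)}
                    | {("at", to), ("at", vehicle, to)}), "success"
        return state, "failure"     # probability 0.3: the action fails
    return state, "failure"         # precondition unmet: failure

# Example: attempt to drive mycar from home to EWR.
start = {("at", "home"), ("at", "mycar", "home")}
result, outcome = go(start, "mycar", "home", "EWR")
```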
7. Predictive State Representation
- To be learnable, a representation should be based on measurable quantities.
- Can use action-conditional predictions: "If I put my hand in my pocket, will I find my keys there?" (e.g., Littman, Sutton & Singh 02)
- Some early examples are being learned.
8. Instance-based Representation
- Uses experience directly to serve as a knowledge representation. Non-parametric.
- Has been applied to continuous and partially observable environments, using the subset of nearby instances.
- Also memory-based (Moore & Atkeson 93; Santamaría et al. 97; Forbes & Andre 00; Smart & Kaelbling 00; McCallum 95)
[Figure: stored instances plotted in joint-angle space]
9. Optimal Disk Repair Policy
- Disk online? (Littman, Fenson, Howard, Hirsh & Nguyen 03)
  - yes: Restart software
    - success: Done (Software bug, Transient disk error)
    - failure: Replace disk (Partial disk failure)
  - no: Other disk online?
    - yes: Operator replug any loose cables
      - success: Done (Cable disconnected)
      - failure: Operator replug all cables
        - success: Done (Cables crossed)
        - failure: Replace disk (Permanent disk failure)
    - no: Replace SCSI controller (SCSI controller failure)
10. Experience-based Knowledge Formation
- Objective: System creates and maintains knowledge.
- Current:
  - humans design the representation (meaningless)
  - raw data
- Key Challenges:
  - need a representation grounded in subjective experience; self-supervised learning
  - may require sensors and actions
11. Reinforcement Learning + Rich Sensors
- Objective: Need lots of sensors for understanding; use them to guide behavior.
- Current:
  - completely observable domains
  - nearly unobservable domains
- Key Challenges:
  - how to represent visual, auditory features?
  - how to deal with continuous parameter values?
  - temporal dependencies? (instance-based RL?)
12. Robust Analogical Reasoning
- Objective: Use grounded information flexibly in reasoning; learn more efficiently.
- Current:
  - analogies over fragile knowledge representations
  - ground-level reasoning with no analogies
- Key Challenges:
  - how to learn what can substitute for what?
  - how to do fast matching?
  - an active replacement for KR?
13. The (RL) Problem
- Decision maker interacts with environment.
- x1, a1, r1, x2, a2, r2, x3, a3, r3, ...
- Objective: maximize long-term total reward (interaction loop sketched below).
[Figure: agent-environment loop; the agent sends actions to the environment and receives observations and reward]
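As a concrete reading of the x, a, r sequence above, here is a minimal sketch of the interaction loop; env and agent are hypothetical objects with the small interface shown, not part of the talk.

```python
def run_episode(env, agent, horizon=100):
    """Sketch of the RL interaction loop x1, a1, r1, x2, a2, r2, ...
    'env' and 'agent' are assumed objects exposing the methods used below."""
    total_reward = 0.0
    x = env.reset()                     # initial observation x_1
    for t in range(horizon):
        a = agent.act(x)                # choose action a_t from observation x_t
        x_next, r = env.step(a)         # environment returns x_{t+1} and reward r_t
        agent.update(x, a, r, x_next)   # learn from the transition
        total_reward += r               # objective: long-term total reward
        x = x_next
    return total_reward
```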
14. Emerging Applications
- Monitoring and repair of LANs
- Network management
- Power grid restoration
- Robotics (RoboCup, legged league)
- Stock trading
- Active vision
15. Issue of State
- In most RL research, xt is Markov (and finite).
- This is quite helpful (Sutton & Barto 98):
  - MDP model, efficient planning algorithms.
  - Depend on value function V(x).
  - Create policy π(x).
- Polytime learning of a near-optimal policy (e.g., Kearns & Singh 98).
- Needs many exposures to each state.
16. Some Harder Cases
- Continuous environments
  - Sensors report continuous values.
  - Sure, it's Markov, but states don't recur.
- Non-Markovian environments
  - A.k.a. partially observable environments.
  - Sensors provide incomplete information.
  - Not clear what to use as states.
17. Planning with States
- What's all the fuss about states?
- Just received observation xt. What to do?
- Given our experience, which action will lead to the maximum reward?
- Estimate rewards and transitions, then solve
  - Q(s,a) = R(s,a) + Σs' T(s,a,s') maxa' Q(s',a')
- to choose optimal actions (value-iteration sketch below).
- (Smooth estimates, reward exploration.)
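A sketch of solving the estimated model by value iteration; the array shapes and the discount factor gamma (not shown on the slide) are assumptions.

```python
import numpy as np

def q_from_model(R, T, gamma=0.95, iters=200):
    """Value iteration on an estimated model:
    Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a').
    R has shape (S, A); T has shape (S, A, S'); gamma is an assumed discount."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        V = Q.max(axis=1)         # V(s') = max_a' Q(s', a')
        Q = R + gamma * T.dot(V)  # dot contracts over s', giving shape (S, A)
    return Q

# Greedy policy from the solved values: pi(s) = argmax_a Q(s, a)
# policy = q_from_model(R_hat, T_hat).argmax(axis=1)
```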
18. Proposal for State 1: Window
- History window (length k)
- State at t is xt-k+1, at-k+1, ..., xt-2, at-2, xt-1, at-1, xt (sketched below).
- Often appropriate, at least approximately.
- Can't capture many simple environments.
- Often scales badly.
- Learning is quite direct (observable).
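A minimal sketch of building the length-k window state from an alternating observation/action history; the flat-list encoding of the history is an assumption.

```python
from collections import deque

def window_state(history, k):
    """History-window state at time t: the last k observations and the k-1
    actions between them, (x_{t-k+1}, a_{t-k+1}, ..., x_{t-1}, a_{t-1}, x_t).
    'history' is assumed to be the flat alternating list [x1, a1, x2, a2, ..., xt]."""
    return tuple(history[-(2 * k - 1):])

# Incremental version: a bounded deque drops old entries automatically (k = 3).
window = deque(maxlen=2 * 3 - 1)
for item in ["x1", "a1", "x2", "a2", "x3"]:
    window.append(item)
state = tuple(window)   # ('x1', 'a1', 'x2', 'a2', 'x3')
```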
19. Proposal for State 2: POMDP
- Partially observable Markov decision process
- Assert hidden states that control the dynamics.
- Track a distribution over hidden states (belief update sketched below).
- (Belief) states are continuous, so tabular Q can't be used.
- Sophisticated algorithms have been devised.
- Captures more complex environments.
- Hard to learn: hidden is hidden.
- (Monahan 82; Sondik 71; Cassandra et al. 97; Chrisman 92)
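A sketch of the standard belief-state update for tracking the distribution over hidden states; the matrix layout (T[a] an S x S transition matrix, O[a] an S x observations likelihood matrix) is an assumption.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """POMDP belief tracking: b'(s') is proportional to
    O[a][s', o] * sum_s T[a][s, s'] * b(s).
    b is a length-S probability vector; T[a] is S x S; O[a] is S x num_obs."""
    predicted = b @ T[a]               # predicted next-state distribution
    unnorm = predicted * O[a][:, o]    # weight by likelihood of observation o
    return unnorm / unnorm.sum()       # renormalize to a proper belief state
```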
20. Proposal for State 3: PSR
- Predictive state representation
- States defined by predictions of tests.
- Tests: sequences of future history windows.
- Test outcomes are independently verifiable.
- A linear update can express POMDPs (sketched below).
- Prediction vectors are also continuous states.
- Learning algorithms are under development.
- Not clear which tests to use in the representation.
- (Littman, Sutton & Singh 02; Singh et al. 03)
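A sketch of the linear prediction-vector update in the spirit of Littman, Sutton & Singh 02; the dictionary layout of the update parameters M and m is an assumed encoding.

```python
import numpy as np

def psr_update(p, a, o, M, m):
    """Linear PSR update: p holds the predicted success probabilities of the
    core tests given the current history.  After doing action a and seeing
    observation o, condition the predictions on that one-step outcome.
    M[(a, o)]: matrix giving predictions of the extended tests (a, o) + core test.
    m[(a, o)]: vector giving the prediction of the one-step test (a, o)."""
    numerator = p @ M[(a, o)]       # predictions of the extended tests
    denominator = p @ m[(a, o)]     # probability of seeing o after doing a
    return numerator / denominator  # updated prediction vector
```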
21. Instance-based + Continuous
- Basic idea: for continuous, Markov x's.
- Store instances (one per time step t) in a database.
- Assign a value to each (TD style).
- Approximate the value of xt using nearby instances (sketched below).
- Also memory-based (Moore & Atkeson 93; Santamaría et al. 97; Forbes & Andre 00; Smart & Kaelbling 00)
[Figure: stored instances plotted in joint-angle space]
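A minimal sketch of the nearby-instance value estimate; the inverse-distance weighting and the tie-breaking constant are assumptions.

```python
import numpy as np

def knn_value(x, instances, values, k=5):
    """Instance-based value estimate for a continuous Markov state x:
    average the stored (TD-style) values of the k nearest stored instances,
    weighting closer instances more heavily (weighting scheme assumed)."""
    X = np.asarray(instances)                  # shape (n, d): one row per stored state
    dists = np.linalg.norm(X - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest instances
    weights = 1.0 / (dists[nearest] + 1e-6)    # small constant avoids division by zero
    return np.average(np.asarray(values)[nearest], weights=weights)
```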
22. Instance Subsets as State
- Instance-based approaches are generally
  - viewed as value-function approximation,
  - viewing the state space as continuous (vs. tabular Q).
- Idea: may be fruitful to view this as a state representation!
- View the instant at time t as matching a subset of previously stored instances.
- Plan using subsets of instances as states.
23. Issues of Instance Subsets
- Tractable: discrete, directly observable states.
- Lots of subsets.
- Most are not reachable.
- Typically (sub?)linear in the number of instances.
- Still big; may need careful forgetting.
- Similarity function often designed by hand.
- Like clustering, but can be overlapping.
- Often wildly overestimates values.
- May encourage exploration if we replan.
24. Instance-based + Non-Markovian
- What constitutes an instance?
- Nearest sequence memory (McCallum 95)
  - The instant t is the entire history up to t.
  - Neighbors are the top-k suffix matches (sketched below).
  - Unlike a history window, it is context sensitive.
  - Superior to a POMDP-learning approach.
  - Scaling issues, but promising.
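A sketch of the suffix-matching neighbor selection in the spirit of nearest sequence memory (McCallum 95); the flat history encoding and scoring details are assumptions.

```python
def suffix_match_length(history, t, s):
    """Length of the common suffix of the histories ending at times t and s."""
    n = 0
    while t - n >= 0 and s - n >= 0 and history[t - n] == history[s - n]:
        n += 1
    return n

def nearest_sequences(history, t, k=3):
    """The 'instance' at time t is the whole history up to t; its neighbors
    are the k earlier time steps whose history suffixes match t's the longest.
    'history' is assumed to be a list of (observation, action, reward) steps."""
    scores = [(suffix_match_length(history, t, s), s) for s in range(t)]
    scores.sort(reverse=True)
    return [s for _, s in scores[:k]]
```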
25. Recap
- RL when continuous or non-Markovian.
- Exploring instance-based state representations:
  - continuous: mountain car
  - non-Markovian: fault remediation
- Focus for the remainder of the talk:
  - Problem area
  - Formal model
  - Algorithmic progress
  - Mini demo
26. Networking Application
- Browsing the web, can't reach Yahoo!
- What's wrong?
- Make it better!
27. Modeling Assumptions
- Hard problem, so we make simplifying assumptions:
  - Fault remediation episodes are independent.
  - High-level actions provide information or repair.
  - Otherwise, no fault-mode changes.
  - Want the minimum-cost repair.
28. Cost-sensitive Fault Remediation
- Cost-sensitive diagnosis
  - no state transitions, actions are informational only
  - episode ends when a diagnosis is made
- Cost-sensitive fault remediation
  - still no state change, repair is all or nothing
  - episode continues until the objective is achieved
- Episodic partially observable MDP
  - actions change state and provide information
  - episode continues until the objective is achieved
29. Conceptual Model
- Cost-sensitive fault remediation:
  - Set of underlying fault modes
  - Actions observe attributes, with costs
  - Actions remediate fault modes, with costs, repairing some and failing on others
- (Greiner et al. 96; Turney 00; Guo 02; Zubek & Dietterich 02)
30. Example Observables
31. Example Remedial Actions, Faults
32. Cost-sensitive Classification
- Decisions: what to observe, guess the class
- All classes are remedial.
- Misclassification cost is on par with observation cost:
  - too cheap: pick the most likely class
  - too expensive: observe everything
- (Turney)
33. CSFR Formal Definition
- Set of fault modes C, observables T, remediation actions R (data-structure sketch below)
- Fault probability distribution p(c) over c
- Cost function j(t,c) for observing t in mode c
- Observation model b(t,c): the probability of observing 1 using t in mode c
- Remediation costs m(r,c) for taking r in mode c
- Goal function G(r,c) returns 1 if r repairs c
- For every c there is some r such that G(r,c) = 1.
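One way to hold this definition in code, sketched as a plain data structure; the field names simply mirror the slide's notation and are not from an existing implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class CSFRModel:
    """Sketch of a CSFR: fault modes C, observables T, remediations R,
    plus the cost, observation, and goal functions from the definition."""
    faults: List[str]                           # fault modes C
    observables: List[str]                      # observables T
    repairs: List[str]                          # remediation actions R
    prior: Dict[str, float]                     # p(c): fault probability distribution
    obs_cost: Dict[Tuple[str, str], float]      # j(t, c): cost of observing t in mode c
    obs_model: Dict[Tuple[str, str], float]     # b(t, c): Pr(observe 1 using t in c)
    rep_cost: Dict[Tuple[str, str], float]      # m(r, c): cost of taking r in mode c
    goal: Dict[Tuple[str, str], int]            # G(r, c) = 1 iff r repairs c

    def well_formed(self) -> bool:
        # Every fault mode c must have at least one repair r with G(r, c) = 1.
        return all(any(self.goal.get((r, c), 0) == 1 for r in self.repairs)
                   for c in self.faults)
```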
34. Disk Diagnosis Example
- Failure modes:
  - Permanent disk failure
  - Partial disk failure
  - Transient disk error
  - Software bug
  - Cable disconnected
  - Cables crossed
  - SCSI controller failure
- Tests and remedial actions, with costs:
  - Disk online? (1)
  - Disk ok? (600)
  - All cables plugged in? (60)
  - Other disk online? (1)
  - Restart software (30)
  - Operator replug any loose cables (70)
  - Operator replug all cables (420)
  - Replace disk (900)
  - Replace SCSI controller (900)
35. Disk Diagnosis Example
- Observations:
  - Disk Online?, Read?, Write?, Scan Disk?, Create File?
- Remediation:
  - Reboot, Replace (Copied) Disk, Remove Unwanted Files
36. Disk Diagnosis Faults
- Faults, with MTTF and probability:
  - Disk Hang: MTTF 1 yr, prob 0.0400
  - Disk Timeout: MTTF 5 wks, prob 0.4340
  - Read Failure (Bad Sector), Caught: MTTF 35 yrs, prob 0.0001
  - Read Failure (Bad Sector), Not Caught: prob 0.0010
  - Write Failure (Bad Sector), Caught: MTTF 35 yrs, prob 0.0001
  - Write Failure (Bad Sector), Not Caught: prob 0.0010
  - Disk Full: MTTF 4 wks, prob 0.5230
37. Optimal Policy
- Disk Online?
  - yes: Read?
    - yes: Create File?
      - yes: Replace Disk (Read Failure (Bad Sector) Not Caught, Write Failure (Bad Sector) Caught, or Write Failure (Bad Sector) Not Caught)
      - no: Remove Unwanted Files (Disk Full)
    - no: Reboot (Disk Timeout)
      - failure: Replace Disk (Read Failure (Bad Sector) Caught)
  - no: Reboot (Disk Hang)
38. Optimal Disk Repair Policy
- Disk online?
  - yes: Restart software
    - success: Done (Software bug, Transient disk error)
    - failure: Replace disk (Partial disk failure)
  - no: Other disk online?
    - yes: Operator replug any loose cables
      - success: Done (Cable disconnected)
      - failure: Operator replug all cables
        - success: Done (Cables crossed)
        - failure: Replace disk (Permanent disk failure)
    - no: Replace SCSI controller (SCSI controller failure)
39. Things to Notice
- Different from people (gearing up for tests).
- Some actions repair multiple faults.
- Some remedial actions fail.
- Each fault appears only once.
- Closed world assumption.
40. Learning in CSFRs
- Can plan in a CSFR model, but where does the model come from?
- Fault modes are unknown.
- Episodes end with a successful repair.
- Necessarily incomplete information.
- Instance-based approach:
  - Instances are the episodes' actions and results.
  - The instant matches a subset of instances.
  - Use the matches to estimate a model, then plan.
41. A DP Planning Algorithm
- A state is a subset s of the set of episodes; o is an observable, r a remediation.
- Recursively compute (sketched below):
  - Cost of optimal repair:
    V(s) = min( min_o Q(s,o), min_r Q(s,r) )
  - Cost of repair starting with observation o:
    Q(s,o) = c(s,o) + Pr(o=1|s) V(s|o=1) + Pr(o=0|s) V(s|o=0)
  - Cost of repair starting with remediation r:
    Q(s,r) = c(s,r) + Pr(r fails|s) V(s|r fails)
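A sketch of the recursion; the helper methods on 'model' (costs, outcome probabilities, and the conditioned subsets) are assumed, and the pruning rules of the next slide are what keep the subsets shrinking so the recursion terminates.

```python
def plan_cost(episodes, model, memo=None):
    """Recursive DP over instance subsets: returns V(s) for the subset s of
    stored episodes.  All 'model' helper methods below are assumptions."""
    memo = {} if memo is None else memo
    key = frozenset(episodes)
    if key in memo:
        return memo[key]
    candidates = []
    # Q(s, o): pay the observation cost, then branch on the outcome.
    for o in model.informative_observables(episodes):
        p1 = model.prob_obs(episodes, o)                 # Pr(o = 1 | s)
        cost = model.obs_cost(episodes, o)
        if p1 > 0:
            cost += p1 * plan_cost(model.restrict(episodes, o, 1), model, memo)
        if p1 < 1:
            cost += (1 - p1) * plan_cost(model.restrict(episodes, o, 0), model, memo)
        candidates.append(cost)
    # Q(s, r): pay the repair cost; only the failure branch continues.
    for r in model.useful_repairs(episodes):
        p_fail = model.prob_repair_fails(episodes, r)    # 1 - Pr(r repairs | s)
        cost = model.rep_cost(episodes, r)
        if p_fail > 0:
            cost += p_fail * plan_cost(model.failed_subset(episodes, r), model, memo)
        candidates.append(cost)
    value = min(candidates)                              # V(s)
    memo[key] = value
    return value
```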
42. Algorithm Analysis
- Two important details:
  - Choose only observables that provide information or have repaired faults.
  - If no informative actions are left, choose a remediation.
- Then, the state-space size is bounded by the number of subsets of true fault modes.
43. Algorithmic Limitations
- Optimal if there are few faults and (nearly) deterministic observables, easily explored.
- Moving ahead:
  - need smart exploration
  - better scaling with many (related) faults
  - handle noisy, continuous observables
  - deal with some side effects
44. Practical Considerations
- CSFR is a big improvement over classification.
- Fault modes are analogous to classes, but automatically identified: autonomous.
- But how do we know if a repair succeeded?
- Primary complaint
- Monitors
- Triggers common to both (start with info).
45. Networking Demo Actions
- Observables
- Is the network medium physically connected?
- Is the active interface wireless?
- Is active interface DHCP-Enabled?
- Ping my IP?
- Ping localhost?
- Ping Gateway?
- DNS lookup
- My IP setting looks valid
- My netmask setting looks valid
- My DNS setting looks valid
- My gateway setting looks valid
- Can I Reach PnP?
- Can PnP reach DNS?
Remediation actions:
- Plug your network cable back in (or restore your wireless connection)
- Renew DHCP lease
- Fix IP setting
- Check your router's physical connection to your ISP
- Check your router's physical connection to your LAN
- Contact ISP and report that their DNS server appears to be down
- Contact ISP and report that you cannot reach their DNS server
- Contact ISP and report that you cannot reach pnphome.com
46. Network Demo Notes
- My gateway setting looks valid.
- My netmask setting looks valid.
- Renew DHCP lease.
47. Conclusions
- Instance-based advantages:
  - efficient in experience (the expensive commodity)
  - no catastrophic forgetting (rare events!)
  - memory is cheap
- Instance-based state representation advantages:
  - model-based planning and exploration are explicit
  - state representation with minimal analysis
  - incrementally modifiable (discover structure)
  - reduces to a previously researched problem!