Title: Bayesian Sparse Sampling for On-line Reward Optimization
Slide 1: Bayesian Sparse Sampling for On-line Reward Optimization
Presented at the Value of Information Seminar at NIPS 2005.
Based on an earlier ICML 2005 paper by Dale Schuurmans et al.
Slide 2: Background Perspective
- Be Bayesian about reinforcement learning
- Ideal representation of uncertainty for action selection
- Computational barriers
Why are Bayesian approaches not prevalent in RL?
Slide 3: Exploration vs. Exploitation
- Bayes decision theory
- Value of information is measured by the ultimate return in reward
- Choose actions to maximize expected value
- The exploration/exploitation tradeoff is handled implicitly, as a side effect
Slide 4: Bayesian Approach
conceptually clean but computationally disastrous
versus
conceptually disastrous but computationally clean
Slide 6: Overview
- Efficient lookahead search for Bayesian RL
- Sparser sparse sampling
- Controllable computational cost
- Higher-quality action selection than current methods
Methods compared:
- Greedy
- Epsilon-greedy
- Boltzmann (Luce 1959)
- Thompson sampling (Thompson 1933)
- Bayes optimal (Hee 1978)
- Interval estimation (Lai 1987; Kaelbling 1994)
- Myopic value of perfect information (Dearden, Friedman, Andre 1999)
- Standard sparse sampling (Kearns, Mansour, Ng 2001)
- Péret and Garcia (Péret, Garcia 2004)
Slide 7: Sequential Decision Making
Requires a model P(r, s' | s, a).
How to make an optimal decision?
[Figure: expectimax lookahead tree. The root state s has value V(s), obtained by a MAX over actions a of Q(s, a); each Q(s, a) is an expectation over rewards r and successor states s', whose values V(s') are expanded the same way at the next level.]
This is the finite-horizon, finite-action, finite-reward case. In the general case the values are characterized by fixed-point equations (written out below).
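For reference, the fixed-point (Bellman optimality) equations can be written as follows; the discount factor \gamma is an assumption added here for the general infinite-horizon case and is not stated on the slide:

    V(s) = \max_a Q(s, a)
    Q(s, a) = \mathbb{E}\left[\, r + \gamma V(s') \mid s, a \,\right]
            = \sum_{r, s'} P(r, s' \mid s, a)\, \bigl( r + \gamma V(s') \bigr)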
Slide 8: Reinforcement Learning
Do not have the model P(r, s' | s, a).
[Figure: the same expectimax lookahead tree as on Slide 7; the expectations over r and s' can no longer be computed because the model is unknown.]
Slide 10: Bayesian Reinforcement Learning
- Prior P(θ) over models P(r, s' | s, a, θ)
- Belief state b = P(θ)
- This gives a meta-level MDP:
  - meta-level state: (s, b)
  - actions: a
  - outcome of decision a in (s, b): r, s', b'
  - meta-level model: P(r, s', b' | s, b, a)
- Choose actions to maximize long-term reward
[Figure: meta-level state (s, b), action a, outcome (r, s', b').]
We do have a model for the meta-level transitions! It is based on posterior updates and expectations over base-level MDPs (a concrete sketch follows below).
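As a concrete illustration of the meta-level transition (not from the slides), here is a minimal Python sketch for the Bernoulli-bandit special case, where there is no base-level state, the belief b is a set of independent Beta distributions, and the posterior update is conjugate. The helper names are hypothetical.

    import random

    # Belief b: dict arm -> (alpha, beta) pseudo-counts of a Beta distribution.

    def sample_model(belief):
        """Draw one base-level model (arm success probabilities) from the belief."""
        return {arm: random.betavariate(a, b) for arm, (a, b) in belief.items()}

    def meta_transition(belief, arm, theta):
        """One meta-level transition: sample a reward under model theta,
        then return the reward and the updated belief b'."""
        reward = 1 if random.random() < theta[arm] else 0
        a, b = belief[arm]
        new_belief = dict(belief)
        new_belief[arm] = (a + reward, b + 1 - reward)   # conjugate Beta update
        return reward, new_belief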
Slide 11: Bayesian RL Decision Making
How to make an optimal decision?
Bayes-optimal action selection: solve the planning problem in the meta-level MDP to obtain optimal Q and V values.
[Figure: lookahead tree over meta-level states. V(s, b) is a MAX over actions a of Q(s, b, a); each Q(s, b, a) is an expectation over rewards r and successor meta-level states (s', b').]
Problem: the meta-level MDP is much larger than the base-level MDP. Impractical.
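In the notation of the earlier fixed-point equations, the Bayes-optimal meta-level values satisfy (undiscounted finite-horizon form assumed here, matching the earlier slides):

    V(s, b) = \max_a Q(s, b, a)
    Q(s, b, a) = \sum_{r, s'} P(r, s' \mid s, b, a)\, \bigl( r + V(s', b') \bigr)

where b' is the posterior obtained from b after observing (s, a, r, s').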
Slide 12: Bayesian RL Decision Making
Current approximation strategies (the Thompson approach is sketched in code below):
[Figure: the belief state (s, b) is collapsed to a single base-level MDP, whose lookahead tree (MAX over actions, expectation over r and s') is then solved.]
- Greedy approach: current b → mean base-level MDP model → point estimates for Q, V → choose the greedy action. But this does not consider uncertainty.
- Thompson approach: current b → sample a base-level MDP model → point estimates for Q, V → choose the greedy action for the sample (so each action is chosen with the probability that it is the max-Q action). Exploration is based on uncertainty, but the approach is still myopic.
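A minimal sketch of the Thompson approach for the Bernoulli-bandit case, reusing the hypothetical belief representation above:

    import random

    def thompson_action(belief):
        """Thompson action selection: draw one model from the posterior
        (one Beta sample per arm), then act greedily for that draw."""
        theta = {arm: random.betavariate(a, b) for arm, (a, b) in belief.items()}
        return max(theta, key=theta.get)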
Slide 13: Their Approach
- Try to better approximate Bayes-optimal action selection by performing lookahead
- Adapt sparse sampling (Kearns, Mansour, Ng)
- Make some practical improvements
Slide 14: Sparse Sampling (Kearns, Mansour, Ng 2001)
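For context, here is a minimal sketch of standard sparse sampling as described by Kearns, Mansour, and Ng (2001): estimate each Q(s, a) by drawing C sampled successors per action from a generative model and recursing to a fixed depth H. The simulate(s, a) interface and parameter names are assumptions for illustration.

    def sparse_sampling_q(simulate, actions, s, depth, width, gamma=1.0):
        """Estimate Q(s, a) for every action by recursive sparse sampling.
        simulate(s, a) -> (r, s_next) is an assumed generative model;
        depth is the remaining horizon H, width is the per-action sample count C."""
        if depth == 0:
            return {a: 0.0 for a in actions}
        q = {}
        for a in actions:
            total = 0.0
            for _ in range(width):
                r, s_next = simulate(s, a)
                q_next = sparse_sampling_q(simulate, actions, s_next,
                                           depth - 1, width, gamma)
                total += r + gamma * max(q_next.values())
            q[a] = total / width
        return q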
Slide 15: Bayesian Sparse Sampling, Observation 1
- Action value estimates are not equally important
- Need better Q-value estimates for some actions, but not all
- Preferentially expand the tree under actions that might be optimal
Biased tree growth: use Thompson sampling to select which actions to expand.
Slide 16: Bayesian Sparse Sampling, Observation 2
- Correct leaf value estimates to the same depth
- Use the mean-MDP Q-value multiplied by the remaining depth
[Figure: leaves created at depths t = 1, t = 2, and t = 3 are all corrected to the same effective horizon N = 3.]
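One way to write the correction described above (my notation, an assumption based only on the verbal rule on the slide): a leaf reached at depth t under belief b is assigned

    \hat V_{\text{leaf}}(s, b) \approx (N - t)\, \max_a \bar Q_{\bar\theta(b)}(s, a)

where \bar\theta(b) is the mean model under b and \bar Q is its per-step action value, so every leaf estimate corresponds to the same effective horizon N.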
Slide 17: Bayesian Sparse Sampling, Tree-growing procedure
- Repeat until the tree size limit is reached:
  - Descend the sparse tree from the root
  - Thompson-sample actions at each decision node: (1) sample a model from the current belief, (2) solve its action values, (3) select the optimal action
  - Sample the outcome
  - Stop the descent once a new node has been added
- Then execute the chosen action at the root and observe the reward (arriving at a new meta-level state s', b')
[Figure: tree of decision nodes (s, b) and chance nodes (s, b, a).]
Control computation by controlling the tree size. (A simplified code sketch of this loop follows below.)
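A simplified sketch of this procedure for the Bernoulli-bandit case, building on the hypothetical helpers above. This illustrates the loop as described on the slide, not the authors' implementation; the node bookkeeping, the descent cap, and the leaf correction are my assumptions.

    import random

    # Beliefs: dict arm -> (alpha, beta); frozen into sorted tuples to key tree nodes.

    def freeze(belief):
        return tuple(sorted(belief.items()))

    def grow_tree(root_belief, horizon, max_nodes):
        """Grow a sparse lookahead tree by repeated Thompson-guided descents.
        tree maps a frozen belief to {action: [(reward, child_belief), ...]}."""
        tree = {freeze(root_belief): {}}
        for _ in range(10 * max_nodes):            # cap the number of descents
            if len(tree) >= max_nodes:
                break
            belief, depth = dict(root_belief), 0
            while depth < horizon:
                theta = {a: random.betavariate(al, be) for a, (al, be) in belief.items()}
                action = max(theta, key=theta.get)                    # Thompson-sampled action
                reward = 1 if random.random() < theta[action] else 0  # sampled outcome
                al, be = belief[action]
                child = dict(belief)
                child[action] = (al + reward, be + 1 - reward)        # posterior update
                tree[freeze(belief)].setdefault(action, []).append((reward, child))
                is_new = freeze(child) not in tree
                if is_new:
                    tree[freeze(child)] = {}
                belief, depth = child, depth + 1
                if is_new:
                    break                          # stop this descent once a node is added
        return tree

    def node_value(tree, belief, depth, horizon):
        """Back up values: MAX over sampled actions, average over sampled outcomes.
        Unexpanded leaves use the mean-MDP value times the remaining depth (Observation 2)."""
        if depth >= horizon:
            return 0.0
        children = tree.get(freeze(belief), {})
        if not children:
            mean_best = max(al / (al + be) for al, be in belief.values())
            return (horizon - depth) * mean_best
        return max(
            sum(r + node_value(tree, b2, depth + 1, horizon) for r, b2 in samples) / len(samples)
            for samples in children.values()
        )

    def choose_action(belief, horizon=3, max_nodes=50):
        """Grow the tree, then pick the root action with the best backed-up Q estimate."""
        tree = grow_tree(belief, horizon, max_nodes)
        root = tree[freeze(belief)]
        if not root:
            return random.choice(list(belief))
        return max(root, key=lambda a: sum(r + node_value(tree, b2, 1, horizon)
                                           for r, b2 in root[a]) / len(root[a]))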
Slide 18: Simple experiments
- 5 Bernoulli bandits
- Beta priors
- Sampled model from prior
- Run action selection strategies
- Repeat 3000 times
- Average accumulated reward per step
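A minimal sketch of this evaluation loop, assuming uniform Beta(1, 1) priors and the hypothetical action-selection routines above; the number of steps per run is an assumption, since the slide gives only the number of repetitions and the metric.

    import random

    def run_experiment(select_action, n_arms=5, n_steps=20, n_runs=3000):
        """Average accumulated reward per step, over runs whose true
        Bernoulli-bandit model is sampled from the Beta(1, 1) prior."""
        total = 0.0
        for _ in range(n_runs):
            truth = {a: random.betavariate(1, 1) for a in range(n_arms)}  # model ~ prior
            belief = {a: (1, 1) for a in range(n_arms)}                   # matching prior
            for _ in range(n_steps):
                arm = select_action(belief)                               # strategy under test
                reward = 1 if random.random() < truth[arm] else 0
                al, be = belief[arm]
                belief[arm] = (al + reward, be + 1 - reward)
                total += reward
        return total / (n_runs * n_steps)

    # e.g. run_experiment(thompson_action) or run_experiment(choose_action)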
Slide 19: (no transcript)
Slide 20: (no transcript)
Slide 21: That's it