Title: Multi-armed Bandit Problems with Dependent Arms
1. Multi-armed Bandit Problems with Dependent Arms
- Sandeep Pandey (spandey_at_cs.cmu.edu)
- Deepayan Chakrabarti (deepay_at_yahoo-inc.com)
- Deepak Agarwal (dagarwal_at_yahoo-inc.com)
2. Background: Bandits
[Figure: bandit arms]
- Pull arms sequentially so as to maximize the total expected reward
  - Show ads on a webpage to maximize clicks
  - Product recommendation to maximize sales
3. Dependent Arms
- Reward probabilities µi are generally assumed to be independent of each other
- What if they are dependent?
- E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
  - Skiing, snowboarding: µ1 ≈ 0.3
  - Skiing, snowshoes: µ2 ≈ 0.28
  - Get Vonage!: µ3 ≈ 10^-6
  - Snowshoe rental: µ4 ≈ 0.31
4. Dependent Arms
- Reward probabilities µi are generally assumed to be independent of each other
- What if they are dependent?
- E.g., ads on similar topics, using similar text/phrases, should have similar rewards
- A click on one ad → other similar ads may generate clicks as well
- Can we increase the total reward using this dependency?
5. Cluster Model of Dependence
[Figure: arms grouped into Cluster 1 and Cluster 2]
- Reward probabilities: µi ~ f(p), where p is a per-cluster parameter (p1 for cluster 1, p2 for cluster 2)
- Successes: si ~ Bin(ni, µi) (simulated in the sketch below)
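For concreteness, here is a minimal simulation of this generative model. It assumes f(p) is a Beta distribution whose mean is the cluster parameter p; that choice of f, the cluster parameters, the concentration, and the pull counts are all illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster parameters p: each sets the mean of f(p) (values are illustrative).
cluster_params = {"cluster 1": 0.30, "cluster 2": 0.05}
arms_per_cluster = 3
concentration = 50.0  # assumed sharpness of f

arms = []
for cname, p in cluster_params.items():
    for _ in range(arms_per_cluster):
        # mu_i ~ f(p): here f is taken to be a Beta with mean p (an assumption of this sketch)
        mu = rng.beta(concentration * p, concentration * (1 - p))
        arms.append((cname, mu))

# Successes s_i ~ Bin(n_i, mu_i), for an arbitrary number of pulls n_i
n_i = 100
for cname, mu in arms:
    s_i = rng.binomial(n_i, mu)
    print(f"{cname}: mu = {mu:.3f}, successes = {s_i}/{n_i}")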
6. Cluster Model of Dependence
- Cluster 1: µi ~ f(p1); Cluster 2: µi ~ f(p2)
- Total reward
  - Discounted: Σ_{t=0}^{∞} α^t · E[R(t)], where α is a discounting factor
  - Undiscounted: Σ_{t=0}^{T} E[R(t)]
7. Discounted Reward
[Figure: MDP for cluster 1 (arms 1 and 2) over state (x1, x2); MDP for cluster 2 (arms 3 and 4) over state (x3, x4)]
The optimal policy can be computed using per-cluster MDPs only.
- Optimal Policy
  - Compute an (index, arm) pair for each cluster
  - Pick the cluster with the largest index, and pull the corresponding arm
8. Discounted Reward
[Figure: MDP for cluster 1 (arms 1 and 2) over state (x1, x2); MDP for cluster 2 (arms 3 and 4) over state (x3, x4)]
The optimal policy can be computed using per-cluster MDPs only.
- Optimal Policy (see the sketch below)
  - Compute an (index, arm) pair for each cluster
  - Pick the cluster with the largest index, and pull the corresponding arm
- Reduces the problem to smaller state spaces
- Reduces to the Gittins index [Gittins, 1979] for independent bandits
- Approximation bounds on the index for k-step lookahead
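As a rough illustration of the index structure above, here is a sketch of a k-step-lookahead index for one cluster: it treats each arm with an independent Beta(1,1) posterior (deliberately ignoring the within-cluster coupling of the cluster model, so this is a simplification, not the paper's exact index) and returns an (index, arm) pair per cluster. All names and constants (cluster_value, choose_cluster_and_arm, ALPHA, k) are hypothetical.

from functools import lru_cache

ALPHA = 0.9  # illustrative discount factor


@lru_cache(maxsize=None)
def cluster_value(state, k):
    """k-step-lookahead discounted value of a cluster, plus the best first arm to pull.

    state: tuple of (successes, failures) pairs, one per arm in the cluster.
    """
    if k == 0:
        return 0.0, None
    best_val, best_arm = float("-inf"), None
    for a, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)  # posterior mean of the arm under a Beta(1,1) prior
        succ = state[:a] + ((s + 1, f),) + state[a + 1:]
        fail = state[:a] + ((s, f + 1),) + state[a + 1:]
        val = (p * (1.0 + ALPHA * cluster_value(succ, k - 1)[0])
               + (1 - p) * ALPHA * cluster_value(fail, k - 1)[0])
        if val > best_val:
            best_val, best_arm = val, a
    return best_val, best_arm


def choose_cluster_and_arm(clusters, k=3):
    """Compute an (index, arm) pair for each cluster, then play the arm of the best cluster."""
    pairs = [cluster_value(tuple(c), k) for c in clusters]
    best_cluster = max(range(len(clusters)), key=lambda j: pairs[j][0])
    return best_cluster, pairs[best_cluster][1]


# Example: two clusters, each described by per-arm (successes, failures) counts so far
clusters = [[(3, 1), (0, 0)], [(1, 4), (2, 2)]]
print(choose_cluster_and_arm(clusters))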
9. Cluster Model of Dependence
- Cluster 1: µi ~ f(p1); Cluster 2: µi ~ f(p2)
- Total reward
  - Discounted: Σ_{t=0}^{∞} α^t · E[R(t)], where α is a discounting factor
  - Undiscounted: Σ_{t=0}^{T} E[R(t)]
10. Undiscounted Reward
[Figure: arms grouped into "cluster arm 1" and "cluster arm 2"]
All arms in a cluster are similar → they can be grouped into one hypothetical "cluster arm".
11. Undiscounted Reward
- Two-Level Policy (see the sketch below)
- In each iteration:
  - Pick a cluster arm using a traditional bandit policy
  - Pick an arm within that cluster using a traditional bandit policy
[Figure: cluster arm 1, cluster arm 2]
Each cluster arm must have some estimated reward probability.
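A minimal sketch of the Two-Level Policy, assuming UCB1 as the "traditional bandit policy" at both levels and the MEAN estimator (pooled successes over pooled pulls) for the cluster arms; the Bernoulli reward simulator, the constants, and the function names are hypothetical.

import math
import random


def ucb1_pick(pulls, successes, t):
    """UCB1: pull any unpulled index first, otherwise maximize mean + exploration bonus."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    return max(range(len(pulls)),
               key=lambda i: successes[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))


def two_level_policy(true_mus, horizon=10000, seed=0):
    """true_mus: one list of (hypothetical) true reward probabilities per cluster."""
    rng = random.Random(seed)
    K = len(true_mus)
    c_pulls, c_succ = [0] * K, [0] * K              # cluster-arm counts (MEAN estimator)
    a_pulls = [[0] * len(c) for c in true_mus]      # per-arm counts within each cluster
    a_succ = [[0] * len(c) for c in true_mus]
    total_reward = 0
    for t in range(1, horizon + 1):
        c = ucb1_pick(c_pulls, c_succ, t)                      # level 1: pick a cluster arm
        i = ucb1_pick(a_pulls[c], a_succ[c], c_pulls[c] + 1)   # level 2: pick an arm inside it
        r = 1 if rng.random() < true_mus[c][i] else 0          # simulated Bernoulli reward
        c_pulls[c] += 1; c_succ[c] += r
        a_pulls[c][i] += 1; a_succ[c][i] += r
        total_reward += r
    return total_reward


# Illustrative clusters (not the paper's data)
print(two_level_policy([[0.30, 0.28], [1e-6, 0.31]]))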
12. Issues
- What is the reward probability of a cluster arm?
- How do cluster characteristics affect performance?
13. Reward probability of a cluster arm
- What is the reward probability r of a cluster arm?
- MEAN: r = Σ si / Σ ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis 2006, Pandey 2007]
  - Initially, r ≈ µavg, the average µ of the arms in the cluster
  - Finally, r → µmax, the maximum µ among the arms in the cluster
- Drift in the reward probability of the cluster arm
14. Reward probability drift causes problems
[Figure: Cluster 1 and Cluster 2 (the opt cluster), which contains the best (optimal) arm with reward probability µopt]
- Drift → non-optimal clusters might temporarily look better → the optimal arm is explored only O(log T) times
15. Reward probability of a cluster arm
- What is the reward probability r of a cluster arm?
- MEAN: r = Σ si / Σ ni
- MAX: r = max_i E[µi]
- PMAX: r = E[max_i µi]
  (where i ranges over all arms in the cluster)
- Both MAX and PMAX aim to estimate µmax and thus reduce drift (see the sketch below)
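The three estimators can be written down directly from the observed per-arm counts (si, ni). The sketch below assumes a Beta(1,1) prior per arm, so E[µi] is a posterior mean and PMAX = E[max µi] is approximated by Monte Carlo sampling from the posteriors; the prior and the sampling approach are assumptions of this sketch, not necessarily the paper's computation.

import numpy as np


def cluster_arm_estimates(successes, pulls, n_samples=10000, seed=0):
    """Return (MEAN, MAX, PMAX) estimates of a cluster arm's reward probability.

    successes, pulls: per-arm counts s_i and n_i over all arms in the cluster.
    """
    s = np.asarray(successes, dtype=float)
    n = np.asarray(pulls, dtype=float)

    # MEAN: pooled success rate, r = sum(s_i) / sum(n_i)
    mean_est = s.sum() / max(n.sum(), 1.0)

    # Per-arm Beta(1,1) posterior: Beta(s_i + 1, n_i - s_i + 1)
    a, b = s + 1.0, (n - s) + 1.0

    # MAX: r = max_i E[mu_i]
    max_est = np.max(a / (a + b))

    # PMAX: r = E[max_i mu_i], approximated by Monte Carlo over the posteriors
    rng = np.random.default_rng(seed)
    samples = rng.beta(a, b, size=(n_samples, len(a)))  # one draw of every mu_i per row
    pmax_est = samples.max(axis=1).mean()

    return mean_est, max_est, pmax_est


# Example with made-up counts for a 3-arm cluster
print(cluster_arm_estimates([30, 5, 1], [100, 20, 10]))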
16. Reward probability of a cluster arm
- MEAN: r = Σ si / Σ ni
- MAX: r = max_i E[µi]
- PMAX: r = E[max_i µi]
- Both MAX and PMAX aim to estimate µmax and thus reduce drift

  Scheme   Bias in estimating µmax   Variance of estimator
  MEAN     High                      Low
  PMAX     Unbiased                  High
17. Comparison of schemes
- 10 clusters, 11.3 arms/cluster
MAX performs best.
18. Issues
- What is the reward probability of a cluster arm?
- How do cluster characteristics affect performance?
19. Effects of cluster characteristics
- We analytically study the effects of cluster characteristics on the crossover-time
- Crossover-time: the time when the expected reward probability of the optimal cluster arm becomes the highest among all cluster arms
20. Effects of cluster characteristics
- The crossover-time Tc for MEAN depends on (see the sketch below):
  - Cluster separation δ = µopt − max µ outside the opt cluster: δ increases → Tc decreases
  - Cluster size Aopt: Aopt increases → Tc increases
  - Cohesiveness of the opt cluster, 1 − avg(µopt − µi): cohesiveness increases → Tc decreases
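Given the true reward probabilities of the arms, the three characteristics follow directly from the definitions above; a small sketch with made-up values (the function name and the numbers are illustrative):

def cluster_characteristics(opt_cluster_mus, other_clusters_mus):
    """Separation, size, and cohesiveness of the optimal cluster, per the definitions above."""
    mu_opt = max(opt_cluster_mus)
    # Separation: mu_opt minus the largest mu outside the optimal cluster
    separation = mu_opt - max(mu for c in other_clusters_mus for mu in c)
    # Size: number of arms in the optimal cluster
    size = len(opt_cluster_mus)
    # Cohesiveness: 1 - average gap between mu_opt and the arms of the optimal cluster
    cohesiveness = 1.0 - sum(mu_opt - mu for mu in opt_cluster_mus) / size
    return separation, size, cohesiveness


# Illustrative values
print(cluster_characteristics([0.31, 0.28, 0.30], [[0.20, 0.05], [1e-6]]))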
21. Experiments (effect of separation)
Separation δ increases → Tc decreases → higher reward
22. Experiments (effect of size)
Aopt increases → Tc increases → lower reward
23. Experiments (effect of cohesiveness)
Cohesiveness increases → Tc decreases → higher reward
24. Related Work
- Typical multi-armed bandit problems
  - Do not consider dependencies
  - Very few arms
- Bandits with side information
  - Cannot handle dependencies among arms
- Active learning
  - Emphasis on the number of examples required to achieve a given prediction accuracy
25. Conclusions
- We analyze bandits where dependencies are encapsulated within clusters
- Discounted reward → the optimal policy is an index scheme on the clusters
- Undiscounted reward
  - Two-Level Policy with MEAN, MAX, and PMAX
  - Analysis of the effect of cluster characteristics on performance, for MEAN
26. Discounted Reward
[Figure: belief-state MDP over all four arms, with state (x1, x2, x3, x4)]
- Create a belief-state MDP (see the sketch below)
- Each state contains the estimated reward probabilities for all arms
- Solve for the optimal policy
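For contrast with the per-cluster decomposition, a small sketch of the naive joint belief state: the state is the vector of per-arm counts over all arms, and pulling an arm branches into a success state and a failure state. The Beta(1,1) prior and the names are assumptions for illustration; the point is that the reachable joint state space grows exponentially with the horizon, which is what the per-cluster indices avoid.

def successors(state, arm):
    """Belief-state transition for pulling `arm`: [(prob, next_state)] for success and failure.

    state: tuple of (successes, failures) pairs over ALL arms (the joint belief state).
    """
    s, f = state[arm]
    p = (s + 1) / (s + f + 2)  # posterior mean under an assumed Beta(1,1) prior
    success = state[:arm] + ((s + 1, f),) + state[arm + 1:]
    failure = state[:arm] + ((s, f + 1),) + state[arm + 1:]
    return [(p, success), (1 - p, failure)]


# Joint state over 4 arms (cf. the figure's x1..x4)
state = ((0, 0), (0, 0), (0, 0), (0, 0))
print(successors(state, 0))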
27. Background: Bandits
[Figure: bandit arms]
Regret = optimal payoff − actual payoff
28. Reward probability of a cluster arm
- What is the reward probability of a cluster arm?
- Eventually, every cluster arm must converge to the most rewarding arm (µmax) within that cluster, since a bandit policy is used within each cluster
- However, drift causes problems
29. Experiments
- Simulation based on one week's worth of data from a large-scale ad-matching application
- 10 clusters, with 11.3 arms/cluster on average
30. Comparison of schemes
- 10 clusters, 11.3 arms/cluster
- Cluster separation δ = 0.08
- Cluster size Aopt = 31
- Cohesiveness = 0.75
MAX performs best.
31. Reward probability drift causes problems
[Figure: Cluster 1 and Cluster 2 (the opt cluster), which contains the best (optimal) arm with reward probability µopt]
- Intuitively, to reduce regret, we must
  - quickly converge to the optimal cluster arm,
  - and then to the best arm within that cluster