Title: Multi-armed Bandit Problems with Dependent Arms
1. Multi-armed Bandit Problems with Dependent Arms
- Sandeep Pandey (spandey_at_cs.cmu.edu)
- Deepayan Chakrabarti (deepay_at_yahoo-inc.com)
- Deepak Agarwal (dagarwal_at_yahoo-inc.com)
2. Background: Bandits
[Figure: bandit arms]
- Pull arms sequentially so as to maximize the total expected reward
  - Show ads on a webpage to maximize clicks
  - Product recommendation to maximize sales
3. Dependent Arms
- Reward probabilities µi are generally assumed to be independent of each other
- What if they are dependent?
- E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
  - Skiing, snowboarding: µ1 ≈ 0.3
  - Skiing, snowshoes: µ2 ≈ 0.28
  - Get Vonage!: µ3 ≈ 10^-6
  - Snowshoe rental: µ4 ≈ 0.31
4. Dependent Arms
- Reward probabilities µi are generally assumed to be independent of each other
- What if they are dependent?
- E.g., ads on similar topics, using similar text/phrases, should have similar rewards
- A click on one ad → other similar ads may generate clicks as well
- Can we increase the total reward using this dependency?
5. Cluster Model of Dependence
[Figure: arms grouped into Cluster 1 and Cluster 2]
- Reward probabilities: µi ~ f(p), where p is a per-cluster parameter (p1 for cluster 1, p2 for cluster 2)
- Successes: si ~ Bin(ni, µi) (simulated in the sketch below)
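For concreteness, here is a minimal simulation of this generative model. It assumes f(p) is a Beta distribution whose mean is the cluster parameter p; that choice of f, the cluster parameters, the concentration, and the pull counts are all illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster parameters p: each sets the mean of f(p) (values are illustrative).
cluster_params = {"cluster 1": 0.30, "cluster 2": 0.05}
arms_per_cluster = 3
concentration = 50.0  # assumed sharpness of f

arms = []
for cname, p in cluster_params.items():
    for _ in range(arms_per_cluster):
        # mu_i ~ f(p): here f is taken to be a Beta with mean p (an assumption of this sketch)
        mu = rng.beta(concentration * p, concentration * (1 - p))
        arms.append((cname, mu))

# Successes s_i ~ Bin(n_i, mu_i), for an arbitrary number of pulls n_i
n_i = 100
for cname, mu in arms:
    s_i = rng.binomial(n_i, mu)
    print(f"{cname}: mu = {mu:.3f}, successes = {s_i}/{n_i}")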
6. Cluster Model of Dependence
- Cluster 1: µi ~ f(p1); Cluster 2: µi ~ f(p2)
- Total reward
  - Discounted: Σ_{t=0}^{∞} α^t · E[R(t)], where α is a discounting factor
  - Undiscounted: Σ_{t=0}^{T} E[R(t)]
7. Discounted Reward
[Figure: MDP for cluster 1 (arms 1 and 2) over state (x1, x2); MDP for cluster 2 (arms 3 and 4) over state (x3, x4)]
The optimal policy can be computed using per-cluster MDPs only.
- Optimal Policy
  - Compute an (index, arm) pair for each cluster
  - Pick the cluster with the largest index, and pull the corresponding arm
8. Discounted Reward
[Figure: MDP for cluster 1 (arms 1 and 2) over state (x1, x2); MDP for cluster 2 (arms 3 and 4) over state (x3, x4)]
The optimal policy can be computed using per-cluster MDPs only.
- Optimal Policy (see the sketch below)
  - Compute an (index, arm) pair for each cluster
  - Pick the cluster with the largest index, and pull the corresponding arm
- Reduces the problem to smaller state spaces
- Reduces to the Gittins index [Gittins, 1979] for independent bandits
- Approximation bounds on the index for k-step lookahead
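As a rough illustration of the index structure above, here is a sketch of a k-step-lookahead index for one cluster: it treats each arm with an independent Beta(1,1) posterior (deliberately ignoring the within-cluster coupling of the cluster model, so this is a simplification, not the paper's exact index) and returns an (index, arm) pair per cluster. All names and constants (cluster_value, choose_cluster_and_arm, ALPHA, k) are hypothetical.

from functools import lru_cache

ALPHA = 0.9  # illustrative discount factor


@lru_cache(maxsize=None)
def cluster_value(state, k):
    """k-step-lookahead discounted value of a cluster, plus the best first arm to pull.

    state: tuple of (successes, failures) pairs, one per arm in the cluster.
    """
    if k == 0:
        return 0.0, None
    best_val, best_arm = float("-inf"), None
    for a, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)  # posterior mean of the arm under a Beta(1,1) prior
        succ = state[:a] + ((s + 1, f),) + state[a + 1:]
        fail = state[:a] + ((s, f + 1),) + state[a + 1:]
        val = (p * (1.0 + ALPHA * cluster_value(succ, k - 1)[0])
               + (1 - p) * ALPHA * cluster_value(fail, k - 1)[0])
        if val > best_val:
            best_val, best_arm = val, a
    return best_val, best_arm


def choose_cluster_and_arm(clusters, k=3):
    """Compute an (index, arm) pair for each cluster, then play the arm of the best cluster."""
    pairs = [cluster_value(tuple(c), k) for c in clusters]
    best_cluster = max(range(len(clusters)), key=lambda j: pairs[j][0])
    return best_cluster, pairs[best_cluster][1]


# Example: two clusters, each described by per-arm (successes, failures) counts so far
clusters = [[(3, 1), (0, 0)], [(1, 4), (2, 2)]]
print(choose_cluster_and_arm(clusters))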
9. Cluster Model of Dependence
- Cluster 1: µi ~ f(p1); Cluster 2: µi ~ f(p2)
- Total reward
  - Discounted: Σ_{t=0}^{∞} α^t · E[R(t)], where α is a discounting factor
  - Undiscounted: Σ_{t=0}^{T} E[R(t)]
10. Undiscounted Reward
[Figure: arms grouped into "cluster arm 1" and "cluster arm 2"]
All arms in a cluster are similar → they can be grouped into one hypothetical "cluster arm".
11. Undiscounted Reward
- Two-Level Policy (see the sketch below)
- In each iteration:
  - Pick a cluster arm using a traditional bandit policy
  - Pick an arm within that cluster using a traditional bandit policy
[Figure: cluster arm 1, cluster arm 2]
Each cluster arm must have some estimated reward probability.
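A minimal sketch of the Two-Level Policy, assuming UCB1 as the "traditional bandit policy" at both levels and the MEAN estimator (pooled successes over pooled pulls) for the cluster arms; the Bernoulli reward simulator, the constants, and the function names are hypothetical.

import math
import random


def ucb1_pick(pulls, successes, t):
    """UCB1: pull any unpulled index first, otherwise maximize mean + exploration bonus."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    return max(range(len(pulls)),
               key=lambda i: successes[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))


def two_level_policy(true_mus, horizon=10000, seed=0):
    """true_mus: one list of (hypothetical) true reward probabilities per cluster."""
    rng = random.Random(seed)
    K = len(true_mus)
    c_pulls, c_succ = [0] * K, [0] * K              # cluster-arm counts (MEAN estimator)
    a_pulls = [[0] * len(c) for c in true_mus]      # per-arm counts within each cluster
    a_succ = [[0] * len(c) for c in true_mus]
    total_reward = 0
    for t in range(1, horizon + 1):
        c = ucb1_pick(c_pulls, c_succ, t)                      # level 1: pick a cluster arm
        i = ucb1_pick(a_pulls[c], a_succ[c], c_pulls[c] + 1)   # level 2: pick an arm inside it
        r = 1 if rng.random() < true_mus[c][i] else 0          # simulated Bernoulli reward
        c_pulls[c] += 1; c_succ[c] += r
        a_pulls[c][i] += 1; a_succ[c][i] += r
        total_reward += r
    return total_reward


# Illustrative clusters (not the paper's data)
print(two_level_policy([[0.30, 0.28], [1e-6, 0.31]]))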
12. Issues
- What is the reward probability of a cluster arm?
- How do cluster characteristics affect performance?
13. Reward probability of a cluster arm
- What is the reward probability r of a cluster arm?
- MEAN: r = Σ si / Σ ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis 2006, Pandey 2007]
  - Initially, r ≈ µavg, the average µ of the arms in the cluster
  - Finally, r → µmax, the maximum µ among the arms in the cluster
- Drift in the reward probability of the cluster arm
14. Reward probability drift causes problems
[Figure: Cluster 1 and Cluster 2 (the opt cluster), which contains the best (optimal) arm with reward probability µopt]
- Drift → non-optimal clusters might temporarily look better → the optimal arm is explored only O(log T) times
15. Reward probability of a cluster arm
- What is the reward probability r of a cluster arm?
- MEAN: r = Σ si / Σ ni
- MAX: r = max_i E[µi]
- PMAX: r = E[max_i µi]
  (where i ranges over all arms in the cluster)
- Both MAX and PMAX aim to estimate µmax and thus reduce drift (see the sketch below)
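The three estimators can be written down directly from the observed per-arm counts (si, ni). The sketch below assumes a Beta(1,1) prior per arm, so E[µi] is a posterior mean and PMAX = E[max µi] is approximated by Monte Carlo sampling from the posteriors; the prior and the sampling approach are assumptions of this sketch, not necessarily the paper's computation.

import numpy as np


def cluster_arm_estimates(successes, pulls, n_samples=10000, seed=0):
    """Return (MEAN, MAX, PMAX) estimates of a cluster arm's reward probability.

    successes, pulls: per-arm counts s_i and n_i over all arms in the cluster.
    """
    s = np.asarray(successes, dtype=float)
    n = np.asarray(pulls, dtype=float)

    # MEAN: pooled success rate, r = sum(s_i) / sum(n_i)
    mean_est = s.sum() / max(n.sum(), 1.0)

    # Per-arm Beta(1,1) posterior: Beta(s_i + 1, n_i - s_i + 1)
    a, b = s + 1.0, (n - s) + 1.0

    # MAX: r = max_i E[mu_i]
    max_est = np.max(a / (a + b))

    # PMAX: r = E[max_i mu_i], approximated by Monte Carlo over the posteriors
    rng = np.random.default_rng(seed)
    samples = rng.beta(a, b, size=(n_samples, len(a)))  # one draw of every mu_i per row
    pmax_est = samples.max(axis=1).mean()

    return mean_est, max_est, pmax_est


# Example with made-up counts for a 3-arm cluster
print(cluster_arm_estimates([30, 5, 1], [100, 20, 10]))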
16. Reward probability of a cluster arm
- MEAN: r = Σ si / Σ ni
- MAX: r = max_i E[µi]
- PMAX: r = E[max_i µi]
- Both MAX and PMAX aim to estimate µmax and thus reduce drift

  Scheme   Bias in estimating µmax   Variance of estimator
  MEAN     High                      Low
  PMAX     Unbiased                  High
17. Comparison of schemes
- 10 clusters, 11.3 arms/cluster
MAX performs best.
18. Issues
- What is the reward probability of a cluster arm?
- How do cluster characteristics affect performance?
19. Effects of cluster characteristics
- We analytically study the effects of cluster characteristics on the crossover-time
- Crossover-time: the time when the expected reward probability of the optimal cluster arm becomes the highest among all cluster arms
20. Effects of cluster characteristics
- The crossover-time Tc for MEAN depends on (see the sketch below):
  - Cluster separation δ = µopt − max µ outside the opt cluster: δ increases → Tc decreases
  - Cluster size Aopt: Aopt increases → Tc increases
  - Cohesiveness of the opt cluster, 1 − avg(µopt − µi): cohesiveness increases → Tc decreases
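Given the true reward probabilities of the arms, the three characteristics follow directly from the definitions above; a small sketch with made-up values (the function name and the numbers are illustrative):

def cluster_characteristics(opt_cluster_mus, other_clusters_mus):
    """Separation, size, and cohesiveness of the optimal cluster, per the definitions above."""
    mu_opt = max(opt_cluster_mus)
    # Separation: mu_opt minus the largest mu outside the optimal cluster
    separation = mu_opt - max(mu for c in other_clusters_mus for mu in c)
    # Size: number of arms in the optimal cluster
    size = len(opt_cluster_mus)
    # Cohesiveness: 1 - average gap between mu_opt and the arms of the optimal cluster
    cohesiveness = 1.0 - sum(mu_opt - mu for mu in opt_cluster_mus) / size
    return separation, size, cohesiveness


# Illustrative values
print(cluster_characteristics([0.31, 0.28, 0.30], [[0.20, 0.05], [1e-6]]))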
21. Experiments (effect of separation)
Separation δ increases → Tc decreases → higher reward
22. Experiments (effect of size)
Aopt increases → Tc increases → lower reward
23. Experiments (effect of cohesiveness)
Cohesiveness increases → Tc decreases → higher reward
24. Related Work
- Typical multi-armed bandit problems
  - Do not consider dependencies
  - Very few arms
- Bandits with side information
  - Cannot handle dependencies among arms
- Active learning
  - Emphasis on the number of examples required to achieve a given prediction accuracy
25. Conclusions
- We analyze bandits where dependencies are encapsulated within clusters
- Discounted reward → the optimal policy is an index scheme on the clusters
- Undiscounted reward
  - Two-Level Policy with MEAN, MAX, and PMAX
  - Analysis of the effect of cluster characteristics on performance, for MEAN
26. Discounted Reward
[Figure: belief-state MDP over all four arms, with state (x1, x2, x3, x4)]
- Create a belief-state MDP (see the sketch below)
- Each state contains the estimated reward probabilities for all arms
- Solve for the optimal policy
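For contrast with the per-cluster decomposition, a small sketch of the naive joint belief state: the state is the vector of per-arm counts over all arms, and pulling an arm branches into a success state and a failure state. The Beta(1,1) prior and the names are assumptions for illustration; the point is that the reachable joint state space grows exponentially with the horizon, which is what the per-cluster indices avoid.

def successors(state, arm):
    """Belief-state transition for pulling `arm`: [(prob, next_state)] for success and failure.

    state: tuple of (successes, failures) pairs over ALL arms (the joint belief state).
    """
    s, f = state[arm]
    p = (s + 1) / (s + f + 2)  # posterior mean under an assumed Beta(1,1) prior
    success = state[:arm] + ((s + 1, f),) + state[arm + 1:]
    failure = state[:arm] + ((s, f + 1),) + state[arm + 1:]
    return [(p, success), (1 - p, failure)]


# Joint state over 4 arms (cf. the figure's x1..x4)
state = ((0, 0), (0, 0), (0, 0), (0, 0))
print(successors(state, 0))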
27. Background: Bandits
[Figure: bandit arms]
Regret = optimal payoff − actual payoff
28. Reward probability of a cluster arm
- What is the reward probability of a cluster arm?
- Eventually, every cluster arm must converge to the most rewarding arm (µmax) within that cluster, since a bandit policy is used within each cluster
- However, drift causes problems
29. Experiments
- Simulation based on one week's worth of data from a large-scale ad-matching application
- 10 clusters, with 11.3 arms/cluster on average
30. Comparison of schemes
- 10 clusters, 11.3 arms/cluster
- Cluster separation δ = 0.08
- Cluster size Aopt = 31
- Cohesiveness = 0.75
MAX performs best.
31. Reward probability drift causes problems
[Figure: Cluster 1 and Cluster 2 (the opt cluster), which contains the best (optimal) arm with reward probability µopt]
- Intuitively, to reduce regret, we must
  - quickly converge to the optimal cluster arm,
  - and then to the best arm within that cluster