Title: Cobot: A Social Reinforcement Learning Agent
1. Cobot: A Social Reinforcement Learning Agent
- Charles Lee Isbell, Jr.
- Christian R. Shelton
- Michael Kearns
- Satinder Singh
- Peter Stone
- Presented by Josh Waxman
2. Applications of RL
- Control
- Game playing
- Optimization
- Recently:
- Human-computer interaction
- Previous systems encounter humans one at a time
- E.g. spoken dialog systems
- Challenges
- Data sparsity
- Inevitable violations of Markov property
- Irreproducibility of experiments (happening in a MOO)
- Variability in users' understanding of Cobot's workings
- Drift of users' desires; inconsistency of reward
- Choosing an appropriate state space
3. LambdaMOO
- MUD: Multi-User Dungeon
- A class of online worlds with roots in text-based multiplayer role-playing games
- Virtual world, often created by its participants
- Users choose characters to represent them
- Mechanisms of social interaction reinforce the illusion that the user is present in the virtual space
- MOO: MUD, Object-Oriented; a MUD that uses an object-oriented programming language to manipulate objects in the virtual world
- A complex, open-ended, multi-user chat environment, populated by a community of human users with rich and often enduring social relationships
4. LambdaMOO (2)
- Interconnected rooms
- Rooms contain users and objects that can move between them
- Each room has a chat channel (people in a room can talk to each other)
- Each room (and object) has a text description that gives it a look and feel
5. Verbs and Speech in LambdaMOO
- Users can talk, and also have a series of verbs allowing a rich set of actions and expression of emotional states:
- Buster is overwhelmed by all these deadlines.
- Buster begins to slowly tear his hair out, one strand at a time.
- HFh comforts Buster. (standard verb: comfort)
- HFh [to Buster]: Remember, the mighty oak was once a nut like you.
- Buster [to HFh]: Right, but his personal growth was assured. Thanks anyway, though.
- Buster feels better now.
- The emote lines above are verbs; the [to ...] lines are speech
6. LambdaMOO (3)
- Rooms created by users
- Descriptions
- Control access by other users
- Can create objects
- 4,836 active user accounts
- 118,154 objects
- Oldest continuously operated MUD
- Founded in 1990
- Good environment for AI experiments, including
learning
7. Cobot
- Cobot is an RL-based agent for LambdaMOO
- Long-term goal: to build an agent who can learn
to perform useful, interesting and entertaining
actions in LambdaMOO on the basis of user
feedback.
8. Cobot (2)
- Originally a Social Statistics Agent
- How frequently, and in what ways, users interact
- Provided these statistics as a service
- Rudimentary chatting capabilities
- Reactive: did not initiate interaction
- Very popular with LambdaMOO users
9. Cobot (3)
- Modifications
- Not just reactive, but proactive
- Takes actions on its own initiative
- Propose conversation topics
- Introduce users
- Word play
- Hope that it will eventually take unprompted actions that are meaningful, useful, or amusing to users
10. Reinforcement Learning
- In RL, decision making by agents in an uncertain environment is often modeled as an MDP
- Markov Decision Process: appropriate if the environment has the Markov property, i.e. the agent need only look at the current state to make a decision
- At time t, the agent senses the environment and chooses an action a from A, the set of actions available in state s
- The action causes a change in the environment, and the agent receives a scalar reward from the environment
11. Reinforcement Learning (2)
- Goal: maximize expected reward over some time horizon
- A policy π is a mapping from a state s and an action a to the probability π(s, a) of taking action a from state s
- π* denotes the optimal policy
- A value function is a function of states (V) or state-action pairs (Q) that tells how good it is to be in a specific state, where goodness is defined in terms of expected future return
- Q^π(s, a), the action-value function for policy π, is the expected return when taking action a from state s and afterwards following policy π
12. Reinforcement Learning (3)
- π* denotes the optimal policy, whose value function Q* is greater than or equal to that of any other policy for all states s and actions a
- Q*: the optimal action-value function
- Most RL algorithms use the agent's experience in its environment to approximate π*, by learning Q*
- The learned value function is used to choose actions stochastically, so that in each state, actions with higher value are chosen with higher probability
- Many RL algorithms use function approximators (parametric representations of complex value functions) both to map state-action features to their values and to map states to distributions over actions (i.e., the policy)
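In standard notation (a compact restatement of the definitions above, with discount factor γ; the softmax in the last line is one common way to derive a stochastic policy from Q-values and is an illustrative assumption, not something the slides specify):

```latex
% Action-value function for policy \pi: expected discounted return
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]

% Optimal action-value function and optimal (greedy) policy
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)

% One common way to choose actions stochastically from learned Q-values (softmax)
\pi(s,a) = \frac{\exp\big(Q(s,a)/\tau\big)}{\sum_{b} \exp\big(Q(s,b)/\tau\big)}
```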
13. Linear Function Approximator
- Used a linear function approximator: for each state feature, maintain a vector of real-valued weights indexed by the possible actions
- A positive weight on a feature increases the probability of taking that action; a negative weight decreases it
- (Slide diagram: "State feature 1" and "State feature 2" each feed a weight for every action, Action 1 through Action 9; a code sketch follows below)
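A minimal sketch of such an approximator, assuming softmax action selection and a simple gradient-style update (the class name, temperature, and learning rate below are illustrative, not details from the slides):

```python
import numpy as np

class LinearActionValue:
    """Per-feature weight vectors indexed by action, as described on the slide."""

    def __init__(self, num_features, num_actions):
        # One row of action weights per state feature; all weights start at 0.
        self.weights = np.zeros((num_features, num_actions))

    def action_values(self, features):
        # features: real-valued vector of state features (e.g. mood, rates, bias).
        return features @ self.weights          # one value per action

    def policy(self, features, temperature=1.0):
        # Softmax turns values into a distribution: higher weight -> higher probability.
        q = self.action_values(features) / temperature
        p = np.exp(q - q.max())
        return p / p.sum()

    def update(self, features, action, reward, lr=0.1):
        # Simple gradient step: reward raises the taken action's weights on active features.
        self.weights[:, action] += lr * reward * features

# Illustrative usage with 2 state features and 9 actions, as in the slide diagram.
fa = LinearActionValue(num_features=2, num_actions=9)
state = np.array([1.0, 0.0])
action = np.random.choice(9, p=fa.policy(state))
fa.update(state, action, reward=+1.0)
```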
14. Reactive Actions (Social Statistics)
- Tells the questioner various facts about herself, including the verbs she likes to use most, and the verbs most often directed at her
- who are your playmates: Lists those who interact with Cobot the most
- who loves me: Tells the questioner those with whom she interacts the most
- who acts like me: Tells the questioner users who perform actions similar to hers
- who does not act like me: Tells the questioner users who do not perform actions similar to hers
- relate me to user: Indicates how the questioner interacts with user, providing a ranking on each other's list of playmates, how many playmates they have in common, and how similarly they act
- tell me my verb-buddies: Tells the questioner the users she uses verb on and who use verb on her
- who is spammy: Lists those users who generate and are the targets of the most verbs
- http://cobot.research.att.com/papers/cobot.pdf
15. Example
- HFh [to cobot]: relate me to Buster
- cobot whispers, "Here are your relationships with Buster."
- "You like to use - (62), poke (7), hug (3), eye (3), nod (2), hi5, h5, zap, comfort, and grin on each other."
- "Buster is ranked 14 on your list of playmates. You are ranked 1 on Buster's list."
- "Your socializing overlap is 75.4% and your playmate overlap is 33.7%."
- "Your actions have a similarity of 95.9%, but ignoring common speech verbs it's 58.3%."
- "Others act on you with a similarity of 96.6%, but ignoring common speech verbs it's 81.9%."
16.
- Technicolor Guest [to cobot]: What's your religion?
- cobot [to Technicolor Guest]: technology unless you REALLY believe in that religion yourself and find that it arouses a deep, strong, genuine response in many other people.
-
- cobot [to DragonBoi]: The line of conflict should be drawn between the mass of the people and the power-holding elite of industrial society.
- DragonBoi eyes cobot warily.
- cobot [to DragonBoi]: You have something cute in your eye.
- DragonBoi [to HFh]: your bot is turning into a communist.
- DragonBoi [to cobot]: so what are you gonna do about it?
- DragonBoi [to HFh]: did you see what it told me? you are turning it into a communist!
- Wikked [to cobot]: are you a communist?
- DragonBoi [to cobot]: are you now, or ever been, a member of the communist party?
- cobot mumbles something in binary.
- DragonBoi [to cobot]: so you're taking the fifth, are you? well this is a special commission, that won't work here!
17. 9 Proactive Actions (back to chart)
- Null Action: Choose to remain silent for this time period
- Topic Starters (4): Introduce a conversational topic. Cobot declares that he wants to discuss sports or politics, or he utters a sentence from either the sports section or the political section of the Boston Globe
- Roll Call (2): Initiate a roll call, a common word-play routine in LambdaMOO. For example, someone may declare that she is tired of Monica Lewinsky by announcing "TIRED OF LEWINSKY ROLL CALL." Each user feeling the same will agree with the roll call. Cobot initiates a roll call by taking a recent utterance and extracting either a single noun or a verb phrase. These are treated as two separate RL actions
- Social Commentary: Make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly." These statements are based on Cobot's statistics from recent activity. Several different utterances are possible, but they are treated as a single action for RL purposes
- Introductions: Introduce two users who have not yet interacted with one another in front of Cobot
18. Actions (2)
- These actions were chosen to fit in with what goes on in LambdaMOO, so as not to irritate users
- Most common routines:
- Conversation
- Wordplay
- Emoting
- Effectively an infinite range of actions, since utterances are based on recent conversation (ROLL CALL) or taken from the Boston Globe online
19. Reinforcement Learning
- At set time intervals, Cobot chooses an action according to a distribution based on the Q-values in the current state
- Rewards and punishments received between time t and t+1 apply to the action taken at time t
- Possible erroneous reward/punishment if a user actually rewarded a reactive rather than a proactive action: noise in the training process (see the sketch below)
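A sketch of that interval loop, reusing the LinearActionValue class from the slide-13 sketch (the class name, interval handling, and update rule are illustrative assumptions):

```python
import numpy as np

class ProactiveLoop:
    """Illustrative timed decision loop for Cobot's proactive actions."""

    def __init__(self, fa):
        self.fa = fa        # a LinearActionValue instance (see the slide-13 sketch)
        self.last = None    # (state_features, action) chosen in the previous interval

    def step(self, state_features, pending_rewards):
        # Feedback received since the last proactive action is credited to that action,
        # even if the user was really reacting to a reactive action (training noise).
        if self.last is not None:
            prev_state, prev_action = self.last
            self.fa.update(prev_state, prev_action, reward=sum(pending_rewards))
        # Choose the next proactive action from the distribution over Q-values.
        probs = self.fa.policy(state_features)
        action = np.random.choice(len(probs), p=probs)
        self.last = (state_features, action)
        return action
```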
20. Feedback Actions
- Explicit
- reward and punish verbs
- give numeric training signal to Cobot
- immediate feedback for the current state and action
- backed up to previous states and actions
- Implicit
- standard LambdaMOO verbs
- e.g. hug and spank, kiss, spit,
- numerically weaker than explicit
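A sketch of how such feedback verbs might be turned into scalar training signals (the verbs listed follow the slide, but the magnitudes are illustrative assumptions; the slides only say implicit feedback is numerically weaker than explicit):

```python
# Hypothetical reward magnitudes; Cobot's actual values are not given on the slides.
EXPLICIT = {"reward": +1.0, "punish": -1.0}            # explicit training verbs
IMPLICIT = {"hug": +0.5, "kiss": +0.5,                 # standard LambdaMOO verbs,
            "spank": -0.5, "spit": -0.5}               # numerically weaker signal

def feedback_to_reward(verb: str) -> float:
    """Map a verb directed at Cobot to a scalar training signal."""
    if verb in EXPLICIT:
        return EXPLICIT[verb]
    return IMPLICIT.get(verb, 0.0)                     # other verbs carry no reward
```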
21. Train for individual user or community?
- Design Choice
- Train for entire community
- Or each individual user
- Combine value functions for those present
- Thus, it is like several RL processes running in parallel, each process with a different state space
- Why?
- If Cobot just stored which users are present as another state feature, he would have to learn on his own that this feature is of primary importance
- Learning should be fast and significant: if users don't get feedback that they influenced Cobot's behavior, they will be discouraged
- Curse of dimensionality: the size of the state space increases exponentially with the number of state features. We don't want to represent the presence/absence of 250 users; we want to maintain a small state space and speed up learning
- Certain users interact much more often with Cobot than others; we don't want their input to dwarf the impact of others
22. State space for a generic user
- Social Summary Vector (4):
- rate at which the user produces events
- rate of events produced by others and directed at the user
- how many of the other users present are among the user's playmates
- how many of the other users present have the user as one of their playmates
- (Playmates: the top 10 users one interacts with)
- Mood Vector: recent use of eight groups of common words
- e.g. grin and smile fall in a single group
- Rates Vector: rate at which events are produced by the users present, including Cobot
- Current Room: which room Cobot is currently in
- Roll Call Vector:
- Has the saved roll call text been used by Cobot before?
- Has someone done a roll call since the last time Cobot did a roll call?
- Has there been a roll call since the last time Cobot grabbed text?
- Bias: a feature that is always on, meaning the user is present (a sketch of the resulting feature vector follows below)
23.
- The state space for a single user is too complex to model with a table-based representation
- A linear function approximator is used for each user
- The policies of the users present are mixed (see the sketch below)
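One simple way to realize that mixing, assuming each user has their own LinearActionValue weights as in the earlier sketches (the averaging scheme below is an illustrative assumption; the slides do not say exactly how the per-user functions are combined):

```python
import numpy as np

def mixed_policy(per_user_models, present_users, features, num_actions, temperature=1.0):
    """Combine the per-user value functions of everyone present into one action distribution."""
    if not present_users:
        return np.ones(num_actions) / num_actions      # nobody present: act uniformly
    # Average each present user's action values for the current state features.
    q = sum(per_user_models[u].action_values(features) for u in present_users)
    q = q / (len(present_users) * temperature)
    p = np.exp(q - q.max())                            # softmax over the mixed values
    return p / p.sum()
```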
24. Experimental Procedure
- Cobot has been in LambdaMOO since September 1999
- The RL version of Cobot was launched in May 2000
- Cobot is a real working system with real human users, and the experiment was conducted in that context
- RL functionality was launched in the Living Room
- Cobot logged RL-related data from May 10 to October 10, 2000
- States visited, actions taken, rewards from each user, parameters of the value function, etc.
- 63,123 RL actions taken (not counting reactive actions)
- 3,171 reward and punishment events
- From 254 users
25. Findings
- Inappropriateness of average reward
- Successful RL would show an increase in average reward over time
26.
- Not because users are more dissatisfied as Cobot learns
- Humans are fickle; preferences change over time (indeed, novelty is highly valued in LambdaMOO)
- popular, exciting → irritating
- Trying to hit (learn) a moving target
- So perhaps average reward shouldn't be the primary measure of performance
- Users with fixed preferences:
- tend to give less feedback (reward/punishment) as Cobot learns their preferences accurately ("good enough")
- didn't mention that users get bored
- In typical RL, reward and punishment are given consistently
- M and S, two dedicated users; other measures explored later
27. Users M and S
28. Findings
- Small set of dedicated parents
- 254 users
- 218 gave fewer than 20 reward/punishment events
- 15 gave more than 50
- Many had a passing interest; a few were willing to invest significant time to teach their preferences to Cobot
- M: 594; S: 69
29. Findings
- Some parents have strong opinions
- For the majority of users, the policy learned was close to a uniform distribution
- Policies are dependent on state, but for most users this dependence was weak, hence the near-uniform distribution
- Most users did not provide enough feedback, and may not have been consistent and strong in the feedback they did provide
- For a small group, Cobot did learn a non-uniform policy
- M's and S's policies are relatively independent of state; other users' are not as dramatic, but still non-uniform
- This makes sense: if a user does not like sports, it does not matter what room they are in, or what the other users are doing
- M likes Roll Call: Cobot selects it with probability 0.99. S likes Social Commentary: Cobot selects it with probability 0.38 (S interacted less, at 69 feedback events)
30. Findings
- Cobot learns matching policies
- Policy for user M reflects empirical pattern of
rewards over time
31. (Chart of policy and rewards for user M)
- Action 6 is roll call (see the earlier chart; recall that M likes Roll Call)
- Blue bars: average reward given by user M for each action (note: relative values)
- Yellow bars: policy learned for user M
- Red bars: empirical frequency at which each action was taken
32. Findings
- Cobot responds to dedicated parents
- Those users who train him have a strong impact: Cobot's policy shifts towards M's preferences when M is present. Of course! No one else trained him, so this is where reward/punishment will have the most impact. This is worth stating only because so few users actually trained him
- Some preferences depend on state
- We can deduce which features are relevant to a given user (see the sketch below)
- By construction, the bias feature is independent of state (it is always on)
- (All weights are initialized to 0, so only nonzero features contribute. A feature is relevant if its weight vector is far from both the bias feature's weight vector and the all-zero vector.)
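A sketch of that relevance test over the per-feature weight vectors from the earlier approximator (the distance measure and threshold below are illustrative assumptions):

```python
import numpy as np

def relevant_features(weights, bias_index, threshold=0.1):
    """Return indices of state features whose action-weight vectors sit far from
    both the always-on bias feature's vector and the all-zero vector."""
    bias_w = weights[bias_index]
    relevant = []
    for i, w in enumerate(weights):
        if i == bias_index:
            continue
        far_from_bias = np.linalg.norm(w - bias_w) > threshold
        far_from_zero = np.linalg.norm(w) > threshold
        if far_from_bias and far_from_zero:
            relevant.append(i)
    return relevant

# Illustrative usage with the LinearActionValue sketch (one weight row per feature,
# assuming the bias feature is the last row):
# relevant = relevant_features(fa.weights, bias_index=fa.weights.shape[0] - 1)
```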
33. Findings: some do in fact rely on state
34. Conclusions
- Reported on efforts to apply RL in a complex human online social environment (a MOO) where many of the standard assumptions (stationary rewards, Markovian behavior, appropriateness of average reward) are clearly violated
- We feel that the results obtained with Cobot so far are compelling, and offer promise for the application of RL in such open-ended social settings
- Cobot continues to take RL actions and receive rewards and punishments from LambdaMOO users, and we plan to continue and embellish this work as part of our overall efforts on Cobot