Title: Cobot: A Social Reinforcement Learning Agent
1. Cobot: A Social Reinforcement Learning Agent
- Charles Lee Isbell, Jr.
- Christian R. Shelton
- Michael Kearns
- Satinder Singh
- Peter Stone
- Presented by Josh Waxman
2. Applications of RL
- Control
- Game playing
- Optimization
- Recently:
- Human-computer interaction
- Previous systems encounter humans one at a time
- E.g. spoken dialog systems
- Challenges
- Data sparsity
- Inevitable violations of Markov property
- Irreproducibility of experiments (happening in a MOO)
- Variability in users' understanding of Cobot's workings
- Drift of users' desires; inconsistency of reward
- Choosing an appropriate state space
3. LambdaMOO
- MUD: Multi-User Dungeon
- A class of online worlds with roots in text-based multiplayer role-playing games
- Virtual world, often created by its participants
- Users choose characters to represent them
- Mechanisms of social interaction reinforce the illusion that the user is present in the virtual space
- MOO: MUD, Object-Oriented; a MUD that uses an object-oriented programming language to manipulate objects in the virtual world
- A complex, open-ended, multi-user chat environment, populated by a community of human users with rich and often enduring social relationships
4. LambdaMOO (2)
- Interconnected rooms
- Rooms contain users and objects that can move between them
- Each room has a chat channel (people in a room can talk to each other)
- Each room (and object) has a text description that gives it a look and feel
5. Verbs and Speech in LambdaMOO
- Users can talk, and also have a series of verbs allowing a rich set of actions and expression of emotional states:
- Buster is overwhelmed by all these deadlines.
- Buster begins to slowly tear his hair out, one strand at a time.
- HFh comforts Buster. (standard verb: comfort)
- HFh [to Buster]: Remember, the mighty oak was once a nut like you.
- Buster [to HFh]: Right, but his personal growth was assured. Thanks anyway, though.
- Buster feels better now.
- The emote lines above are verbs; the [to ...] lines are speech
6. LambdaMOO (3)
- Rooms created by users
- Descriptions
- Control access by other users
- Can create objects
- 4,836 active user accounts
- 118,154 objects
- Oldest continuously operated MUD
- Founded in 1990
- Good environment for AI experiments, including
learning
7. Cobot
- Cobot is an RL-based agent for LambdaMOO
- Long-term goal: to build an agent who can learn
to perform useful, interesting and entertaining
actions in LambdaMOO on the basis of user
feedback.
8. Cobot (2)
- Originally a Social Statistics Agent
- How frequently, and in what ways, users interact
- Provided these statistics as a service
- Rudimentary chatting capabilities
- Reactive: did not initiate interaction
- Very popular with LambdaMOO users
9. Cobot (3)
- Modifications
- Not just reactive, but proactive
- Takes actions on its own initiative
- Propose conversation topics
- Introduce users
- Word play
- Hope that it will eventually take unprompted actions that are meaningful, useful, or amusing to users
10. Reinforcement Learning
- In RL, decision making by agents in an uncertain environment is often modeled as an MDP
- Markov Decision Process: appropriate if the environment has the Markov property, i.e. the agent need only look at the current state to make a decision
- At time t, the agent senses the environment and chooses an action a from A, the set of actions available in state s
- The action causes a change in the environment, and the agent receives a scalar reward from the environment
11. Reinforcement Learning (2)
- Goal: maximize expected reward over some time horizon
- A policy π is a mapping from a state s and an action a to the probability π(s, a) of taking action a from state s
- π* denotes the optimal policy
- A value function is a function of states (V) or state-action pairs (Q) that tells how good it is to be in a specific state, where goodness is defined in terms of expected future return
- Q^π(s, a), the action-value function for policy π, is the expected return when taking action a from state s and afterwards following policy π
12. Reinforcement Learning (3)
- π* denotes the optimal policy, whose value function Q* is greater than or equal to that of any other policy for all states s and actions a
- Q*: the optimal action-value function
- Most RL algorithms use the agent's experience in its environment to approximate π*, by learning Q*
- The learned value function is used to choose actions stochastically, so that in each state, actions with higher value are chosen with higher probability
- Many RL algorithms use function approximators (parametric representations of complex value functions) both to map state-action features to their values and to map states to distributions over actions (i.e., the policy)
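In standard notation (a compact restatement of the definitions above, with discount factor γ; the softmax in the last line is one common way to derive a stochastic policy from Q-values and is an illustrative assumption, not something the slides specify):

```latex
% Action-value function for policy \pi: expected discounted return
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]

% Optimal action-value function and optimal (greedy) policy
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)

% One common way to choose actions stochastically from learned Q-values (softmax)
\pi(s,a) = \frac{\exp\big(Q(s,a)/\tau\big)}{\sum_{b} \exp\big(Q(s,b)/\tau\big)}
```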
13. Linear Function Approximator
- Used a linear function approximator: for each state feature, maintain a vector of real-valued weights indexed by the possible actions
- A positive weight on a feature increases the probability of taking that action; a negative weight decreases it
- (Slide diagram: "State feature 1" and "State feature 2" each feed a weight for every action, Action 1 through Action 9; a code sketch follows below)
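A minimal sketch of such an approximator, assuming softmax action selection and a simple gradient-style update (the class name, temperature, and learning rate below are illustrative, not details from the slides):

```python
import numpy as np

class LinearActionValue:
    """Per-feature weight vectors indexed by action, as described on the slide."""

    def __init__(self, num_features, num_actions):
        # One row of action weights per state feature; all weights start at 0.
        self.weights = np.zeros((num_features, num_actions))

    def action_values(self, features):
        # features: real-valued vector of state features (e.g. mood, rates, bias).
        return features @ self.weights          # one value per action

    def policy(self, features, temperature=1.0):
        # Softmax turns values into a distribution: higher weight -> higher probability.
        q = self.action_values(features) / temperature
        p = np.exp(q - q.max())
        return p / p.sum()

    def update(self, features, action, reward, lr=0.1):
        # Simple gradient step: reward raises the taken action's weights on active features.
        self.weights[:, action] += lr * reward * features

# Illustrative usage with 2 state features and 9 actions, as in the slide diagram.
fa = LinearActionValue(num_features=2, num_actions=9)
state = np.array([1.0, 0.0])
action = np.random.choice(9, p=fa.policy(state))
fa.update(state, action, reward=+1.0)
```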
14. Reactive Actions (Social Statistics)
- Tells the questioner various facts about herself, including the verbs she likes to use most, and the verbs most often directed at her
- who are your playmates: Lists those who interact with Cobot the most
- who loves me: Tells the questioner those with whom she interacts the most
- who acts like me: Tells the questioner users who perform actions similar to hers
- who does not act like me: Tells the questioner users who do not perform actions similar to hers
- relate me to user: Indicates how the questioner interacts with user, providing a ranking on each other's list of playmates, how many playmates they have in common, and how similarly they act
- tell me my verb-buddies: Tells the questioner the users she uses verb on and who use verb on her
- who is spammy: Lists those users who generate and are the targets of the most verbs
- http://cobot.research.att.com/papers/cobot.pdf
15. Example
- HFh [to cobot]: relate me to Buster
- cobot whispers, "Here are your relationships with Buster."
- "You like to use - (62), poke (7), hug (3), eye (3), nod (2), hi5, h5, zap, comfort, and grin on each other."
- "Buster is ranked 14 on your list of playmates. You are ranked 1 on Buster's list."
- "Your socializing overlap is 75.4% and your playmate overlap is 33.7%."
- "Your actions have a similarity of 95.9%, but ignoring common speech verbs it's 58.3%."
- "Others act on you with a similarity of 96.6%, but ignoring common speech verbs it's 81.9%."
16.
- Technicolor Guest [to cobot]: What's your religion?
- cobot [to Technicolor Guest]: technology unless you REALLY believe in that religion yourself and find that it arouses a deep, strong, genuine response in many other people.
-
- cobot [to DragonBoi]: The line of conflict should be drawn between the mass of the people and the power-holding elite of industrial society.
- DragonBoi eyes cobot warily.
- cobot [to DragonBoi]: You have something cute in your eye.
- DragonBoi [to HFh]: your bot is turning into a communist.
- DragonBoi [to cobot]: so what are you gonna do about it?
- DragonBoi [to HFh]: did you see what it told me? you are turning it into a communist!
- Wikked [to cobot]: are you a communist?
- DragonBoi [to cobot]: are you now, or ever been, a member of the communist party?
- cobot mumbles something in binary.
- DragonBoi [to cobot]: so you're taking the fifth, are you? well this is a special commission, that won't work here!
17. 9 Proactive Actions (back to chart)
- Null Action: Choose to remain silent for this time period
- Topic Starters (4): Introduce a conversational topic. Cobot declares that he wants to discuss sports or politics, or he utters a sentence from either the sports section or the political section of the Boston Globe
- Roll Call (2): Initiate a roll call, a common word-play routine in LambdaMOO. For example, someone may declare that she is tired of Monica Lewinsky by announcing "TIRED OF LEWINSKY ROLL CALL." Each user feeling the same will agree with the roll call. Cobot initiates a roll call by taking a recent utterance and extracting either a single noun or a verb phrase. These are treated as two separate RL actions
- Social Commentary: Make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly." These statements are based on Cobot's statistics from recent activity. Several different utterances are possible, but they are treated as a single action for RL purposes
- Introductions: Introduce two users who have not yet interacted with one another in front of Cobot
18. Actions (2)
- These actions were chosen to fit in with what goes on in LambdaMOO, so as not to irritate users
- Most common routines:
- Conversation
- Wordplay
- Emoting
- Effectively an infinite range of actions, since utterances are based on recent conversation (ROLL CALL) or taken from the Boston Globe online
19. Reinforcement Learning
- At set time intervals, Cobot chooses an action according to a distribution based on the Q-values in the current state
- Rewards and punishments received between time t and t+1 apply to the action taken at time t
- Possible erroneous reward/punishment if a user actually rewarded a reactive rather than a proactive action: noise in the training process (see the sketch below)
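A sketch of that interval loop, reusing the LinearActionValue class from the slide-13 sketch (the class name, interval handling, and update rule are illustrative assumptions):

```python
import numpy as np

class ProactiveLoop:
    """Illustrative timed decision loop for Cobot's proactive actions."""

    def __init__(self, fa):
        self.fa = fa        # a LinearActionValue instance (see the slide-13 sketch)
        self.last = None    # (state_features, action) chosen in the previous interval

    def step(self, state_features, pending_rewards):
        # Feedback received since the last proactive action is credited to that action,
        # even if the user was really reacting to a reactive action (training noise).
        if self.last is not None:
            prev_state, prev_action = self.last
            self.fa.update(prev_state, prev_action, reward=sum(pending_rewards))
        # Choose the next proactive action from the distribution over Q-values.
        probs = self.fa.policy(state_features)
        action = np.random.choice(len(probs), p=probs)
        self.last = (state_features, action)
        return action
```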
20. Feedback Actions
- Explicit
- reward and punish verbs
- give numeric training signal to Cobot
- immediate feedback for the current state and action
- backed up to previous states and actions
- Implicit
- standard LambdaMOO verbs
- e.g. hug and spank, kiss, spit,
- numerically weaker than explicit
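A sketch of how such feedback verbs might be turned into scalar training signals (the verbs listed follow the slide, but the magnitudes are illustrative assumptions; the slides only say implicit feedback is numerically weaker than explicit):

```python
# Hypothetical reward magnitudes; Cobot's actual values are not given on the slides.
EXPLICIT = {"reward": +1.0, "punish": -1.0}            # explicit training verbs
IMPLICIT = {"hug": +0.5, "kiss": +0.5,                 # standard LambdaMOO verbs,
            "spank": -0.5, "spit": -0.5}               # numerically weaker signal

def feedback_to_reward(verb: str) -> float:
    """Map a verb directed at Cobot to a scalar training signal."""
    if verb in EXPLICIT:
        return EXPLICIT[verb]
    return IMPLICIT.get(verb, 0.0)                     # other verbs carry no reward
```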
21. Train for individual user or community?
- Design Choice
- Train for entire community
- Or each individual user
- Combine value functions for those present
- Thus, it is like several RL processes running in parallel, each process with a different state space
- Why?
- If Cobot just stored which users are present as another state feature, he would have to learn on his own that this feature is of primary importance
- Learning should be fast and significant: if users don't get feedback that they influenced Cobot's behavior, they will be discouraged
- Curse of dimensionality: the size of the state space increases exponentially with the number of state features. We don't want to represent the presence/absence of 250 users; we want to maintain a small state space and speed up learning
- Certain users interact much more often with Cobot than others; we don't want their input to dwarf the impact of others
22. State space for a generic user
- Social Summary Vector (4):
- rate at which the user produces events
- rate of events produced by others and directed at the user
- how many of the other users present are among the user's playmates
- how many of the other users present have the user as one of their playmates
- (Playmates: the top 10 users one interacts with)
- Mood Vector: recent use of eight groups of common words
- e.g. grin and smile fall in a single group
- Rates Vector: rate at which events are produced by the users present, including Cobot
- Current Room: which room Cobot is currently in
- Roll Call Vector:
- Has the saved roll call text been used by Cobot before?
- Has someone done a roll call since the last time Cobot did a roll call?
- Has there been a roll call since the last time Cobot grabbed text?
- Bias: a feature that is always on, meaning the user is present (a sketch of the resulting feature vector follows below)
23.
- The state space for a single user is too complex to model with a table-based representation
- A linear function approximator is used for each user
- The policies of the users present are mixed (see the sketch below)
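One simple way to realize that mixing, assuming each user has their own LinearActionValue weights as in the earlier sketches (the averaging scheme below is an illustrative assumption; the slides do not say exactly how the per-user functions are combined):

```python
import numpy as np

def mixed_policy(per_user_models, present_users, features, num_actions, temperature=1.0):
    """Combine the per-user value functions of everyone present into one action distribution."""
    if not present_users:
        return np.ones(num_actions) / num_actions      # nobody present: act uniformly
    # Average each present user's action values for the current state features.
    q = sum(per_user_models[u].action_values(features) for u in present_users)
    q = q / (len(present_users) * temperature)
    p = np.exp(q - q.max())                            # softmax over the mixed values
    return p / p.sum()
```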
24. Experimental Procedure
- Cobot has been in LambdaMOO since September 1999
- The RL version of Cobot was launched in May 2000
- Cobot is a real working system with real human users, and the experiment was conducted in that context
- RL functionality was launched in the Living Room
- Cobot logged RL-related data from May 10 to October 10, 2000
- States visited, actions taken, rewards from each user, parameters of the value function, etc.
- 63,123 RL actions taken (not counting reactive actions)
- 3,171 reward and punishment events
- From 254 users
25. Findings
- Inappropriateness of average reward
- Successful RL would show an increase in average reward over time
26.
- Not because users are more dissatisfied as Cobot learns
- Humans are fickle; preferences change over time (indeed, novelty is highly valued in LambdaMOO)
- popular, exciting → irritating
- Trying to hit (learn) a moving target
- So perhaps average reward shouldn't be the primary measure of performance
- Users with fixed preferences:
- tend to give less feedback (reward/punishment) as Cobot learns their preferences accurately ("good enough")
- didn't mention that users get bored
- In typical RL, reward and punishment are given consistently
- M and S, two dedicated users; other measures explored later
27. Users M and S
28. Findings
- Small set of dedicated parents
- 254 users
- 218 gave fewer than 20 reward/punishment events
- 15 gave more than 50
- Many had a passing interest; a few were willing to invest significant time to teach their preferences to Cobot
- M: 594; S: 69
29. Findings
- Some parents have strong opinions
- For the majority of users, the policy learned was close to a uniform distribution
- Policies are dependent on state, but for most users this dependence was weak, hence the near-uniform distribution
- Most users did not provide enough feedback, and may not have been consistent and strong in the feedback they did provide
- For a small group, Cobot did learn a non-uniform policy
- M's and S's policies are relatively independent of state; other users' are not as dramatic, but still non-uniform
- This makes sense: if a user does not like sports, it does not matter what room they are in, or what the other users are doing
- M likes Roll Call: Cobot selects it with probability 0.99. S likes Social Commentary: Cobot selects it with probability 0.38 (S interacted less, at 69 feedback events)
30. Findings
- Cobot learns matching policies
- Policy for user M reflects empirical pattern of
rewards over time
31. (Chart of policy and rewards for user M)
- Action 6 is roll call (see the earlier chart; recall that M likes Roll Call)
- Blue bars: average reward given by user M for each action (note: relative values)
- Yellow bars: policy learned for user M
- Red bars: empirical frequency at which each action was taken
32. Findings
- Cobot responds to dedicated parents
- Those users who train him have a strong impact: Cobot's policy shifts towards M's preferences when M is present. Of course! No one else trained him, so this is where reward/punishment will have the most impact. This is worth stating only because so few users actually trained him
- Some preferences depend on state
- We can deduce which features are relevant to a given user (see the sketch below)
- By construction, the bias feature is independent of state (it is always on)
- (All weights are initialized to 0, so only nonzero features contribute. A feature is relevant if its weight vector is far from both the bias feature's weight vector and the all-zero vector.)
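A sketch of that relevance test over the per-feature weight vectors from the earlier approximator (the distance measure and threshold below are illustrative assumptions):

```python
import numpy as np

def relevant_features(weights, bias_index, threshold=0.1):
    """Return indices of state features whose action-weight vectors sit far from
    both the always-on bias feature's vector and the all-zero vector."""
    bias_w = weights[bias_index]
    relevant = []
    for i, w in enumerate(weights):
        if i == bias_index:
            continue
        far_from_bias = np.linalg.norm(w - bias_w) > threshold
        far_from_zero = np.linalg.norm(w) > threshold
        if far_from_bias and far_from_zero:
            relevant.append(i)
    return relevant

# Illustrative usage with the LinearActionValue sketch (one weight row per feature,
# assuming the bias feature is the last row):
# relevant = relevant_features(fa.weights, bias_index=fa.weights.shape[0] - 1)
```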
33. Findings: some do in fact rely on state
34. Conclusions
- Reported on efforts to apply RL in a complex human online social environment (a MOO) where many of the standard assumptions (stationary rewards, Markovian behavior, appropriateness of average reward) are clearly violated
- We feel that the results obtained with Cobot so far are compelling, and offer promise for the application of RL in such open-ended social settings
- Cobot continues to take RL actions and receive rewards and punishments from LambdaMOO users, and we plan to continue and embellish this work as part of our overall efforts on Cobot