Cobot: A Social Reinforcement Learning Agent (presentation transcript)
1
Cobot: A Social Reinforcement Learning Agent
  • Charles Lee Isbell, Jr.
  • Christian R. Shelton
  • Michael Kearns
  • Satinder Singh
  • Peter Stone
  • Presented by Josh Waxman

2
Applications of RL
  • Control
  • Game playing
  • Optimization
  • Recently:
  • Human-computer interaction
  • Previous systems encounter humans one at a time
  • E.g. spoken dialog systems
  • Challenges
  • Data sparsity
  • Inevitable violations of the Markov property
  • Irreproducibility of experiments (happening in a MOO)
  • Variability in users' understanding of Cobot's workings
  • Drift of users' desires; inconsistency of reward
  • Choosing an appropriate state space

3
LambdaMOO
  • MUD: Multi-User Dungeon
  • A class of online worlds with roots in text-based multiplayer role-playing games
  • A virtual world, often created by its participants
  • Users choose characters to represent them
  • Mechanisms of social interaction reinforce the illusion that the user is present in the virtual space
  • MOO: Multi-User Object Oriented, a MUD that uses an object-oriented programming language to manipulate objects in the virtual world
  • A complex, open-ended, multi-user chat environment, populated by a community of human users with rich and often enduring social relationships

4
LambdaMOO (2)
  • Interconnected rooms
  • Rooms contain users and objects, both of which can move between rooms
  • Each room has a chat channel (people in a room can talk to each other)
  • Each room (and object) has a text description that gives it a look and feel

5
Verbs and Speech in LambdaMOO
  • Users can talk, and also have a series of verbs allowing a rich set of actions and the expression of emotional states
  • Buster is overwhelmed by all these deadlines.
  • Buster begins to slowly tear his hair out, one strand at a time.
  • HFh comforts Buster. (standard verb: comfort)
  • HFh [to Buster]: Remember, the mighty oak was once a nut like you.
  • Buster [to HFh]: Right, but his personal growth was assured. Thanks anyway, though.
  • Buster feels better now.

6
LambdaMOO (3)
  • Rooms are created by users, who write their descriptions and control access by other users
  • Users can also create objects
  • 4,836 active user accounts
  • 118,154 objects
  • Oldest continuously operated MUD; founded in 1990
  • A good environment for AI experiments, including learning

7
Cobot
  • Cobot is an RL-based agent for LambdaMOO
  • Long-term goal: to build an agent that can learn to perform useful, interesting, and entertaining actions in LambdaMOO on the basis of user feedback

8
Cobot (2)
  • Originally a social-statistics agent
  • Tracked how frequently, and in what ways, users interact
  • Provided these statistics as a service
  • Rudimentary chatting capabilities
  • Reactive: did not initiate interaction
  • Very popular with LambdaMOO users

9
Cobot (3)
  • Modifications
  • Not just reactive, but proactive
  • Takes actions on its own initiative
  • Proposes conversation topics
  • Introduces users
  • Word play
  • Hope: Cobot will eventually take unprompted actions that are meaningful, useful, or amusing to users

10
Reinforcement Learning
  • In RL, decision making by agents in an uncertain environment is often modeled as a Markov Decision Process (MDP)
  • The environment has the Markov property if the agent need only look at the current state to make a decision
  • At time t, the agent senses the environment and chooses an action a from A, the set of actions available in state s
  • The action causes a change in the environment, and the agent receives a scalar reward from the environment

11
Reinforcement Learning (2)
  • Goal: maximize expected reward over some time horizon
  • A policy π is a mapping from state-action pairs to the probability of taking action a from state s
  • π*: the optimal policy
  • A value function is a function of states (V) or state-action pairs (Q) that tells how good it is to be in a specific state, where goodness is defined in terms of expected future return
  • Q^π(s, a), the action-value function for policy π, is the expected return when taking action a from state s and afterwards following policy π (both definitions are restated below)
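A compact restatement in standard notation (a sketch; the discount factor γ is implicit in "expected future return" and is not named on the slide):

  \pi(s, a) = \Pr(a_t = a \mid s_t = s), \qquad
  Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a,\ \pi \right]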

12
Reinforcement Learning (3)
  • π* denotes the optimal policy, whose value function Q* is greater than or equal to that of any other policy for all states s and actions a
  • Q*: the optimal action-value function
  • Most RL algorithms approximate π* from the agent's experience in its environment, by learning Q*
  • The learned value function is used to choose actions stochastically, so that in each state, actions with higher value are chosen with higher probability
  • Many RL algorithms use function approximators (parametric representations of complex value functions) both to map state-action features to their values and to map states to distributions over actions (i.e., the policy)
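As a generic illustration of learning Q from experience (the slides do not say which algorithm Cobot actually uses, so this textbook tabular Q-learning step is only a sketch):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One textbook Q-learning step: move Q(s, a) toward the observed
    reward plus the discounted value of the best next action."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Toy usage: two states, two actions, all values initially zero.
Q = {s: {a: 0.0 for a in ("a0", "a1")} for s in ("s0", "s1")}
q_update(Q, "s0", "a0", r=1.0, s_next="s1")
print(Q["s0"]["a0"])  # 0.1
```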

13
Linear Function Approximator
  • Used a linear function approximator: for each state feature, maintain a vector of real-valued weights indexed by the possible actions
  • A positive weight on a feature increases the probability of taking that action; a negative weight decreases it
  (Slide diagram: state features 1 and 2 each point to a weight vector indexed by Actions 1 through 9.)
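A minimal sketch of this structure, assuming a softmax over the linear scores (the slides say only that higher-valued actions are chosen with higher probability; all names here are illustrative):

```python
import numpy as np

NUM_ACTIONS = 9  # Cobot's nine proactive RL actions

class LinearPolicy:
    def __init__(self, num_features):
        # weights[f] is the real-valued weight vector for state feature f,
        # indexed by the possible actions
        self.weights = np.zeros((num_features, NUM_ACTIONS))

    def action_distribution(self, features):
        # Linear score per action: sum_f features[f] * weights[f, a].
        # A positive weight on an active feature raises that action's
        # probability; a negative weight lowers it.
        scores = features @ self.weights
        exp = np.exp(scores - scores.max())  # shift for numerical stability
        return exp / exp.sum()

policy = LinearPolicy(num_features=20)
probs = policy.action_distribution(np.random.rand(20))
action = np.random.choice(NUM_ACTIONS, p=probs)
```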
14
Reactive Actions (Social Statistics)
  • "tell me about me": Tells the questioner various facts about herself, including the verbs she likes to use most and the verbs most often directed at her.
  • "who are your playmates": Lists those who interact with Cobot the most.
  • "who loves me": Tells the questioner those with whom she interacts the most.
  • "who acts like me": Tells the questioner which users perform actions similar to hers.
  • "who does not act like me": Tells the questioner which users do not perform actions similar to hers.
  • "relate me to user": Indicates how the questioner interacts with user, providing their rankings on each other's lists of playmates, how many playmates they have in common, and how similarly they act.
  • "tell me my verb-buddies": Tells the questioner the users she uses verb on and who use verb on her.
  • "who is spammy": Lists those users who generate, and are the targets of, the most verbs.
  • http://cobot.research.att.com/papers/cobot.pdf
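For intuition, statistics like these can be computed from a simple log of (actor, verb, target) events; this is a hypothetical sketch, not Cobot's actual implementation:

```python
from collections import Counter

# Hypothetical event log: (actor, verb, target)
events = [("HFh", "hug", "Buster"), ("Buster", "poke", "HFh"),
          ("HFh", "comfort", "Buster"), ("cobot", "grin", "HFh")]

def who_loves_me(user, top_n=10):
    """Those who direct the most verbs at `user` (cf. 'who loves me')."""
    counts = Counter(actor for actor, _, target in events if target == user)
    return [name for name, _ in counts.most_common(top_n)]

def playmates(user, top_n=10):
    """Top users `user` interacts with, in either direction."""
    counts = Counter()
    for actor, _, target in events:
        if actor == user:
            counts[target] += 1
        elif target == user:
            counts[actor] += 1
    return [name for name, _ in counts.most_common(top_n)]

print(playmates("Buster"))  # ['HFh']
```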

15
Example
  • HFh [to cobot]: relate me to Buster
  • cobot whispers, "Here are your relationships with Buster."
  • "You like to use - (62), poke (7), hug (3), eye (3), nod (2), hi5, h5, zap, comfort, and grin on each other."
  • "Buster is ranked 14 on your list of playmates. You are ranked 1 on Buster's list."
  • "Your socializing overlap is 75.4% and your playmate overlap is 33.7%."
  • "Your actions have a similarity of 95.9%, but ignoring common speech verbs it's 58.3%."
  • "Others act on you with a similarity of 96.6%, but ignoring common speech verbs it's 81.9%."

16
  • Technicolor Guest [to cobot]: What's your religion?
  • cobot [to Technicolor Guest]: technology unless you REALLY believe in that religion yourself and find that it arouses a deep, strong, genuine response in many other people.
  • cobot [to DragonBoi]: The line of conflict should be drawn between the mass of the people and the power-holding elite of industrial society.
  • DragonBoi eyes cobot warily.
  • cobot [to DragonBoi]: You have something cute in your eye.
  • DragonBoi [to HFh]: you're bot is turning into a communist.
  • DragonBoi [to cobot]: so what are you gonna do about it?
  • DragonBoi [to HFh]: did you see what it told me? you are turning it into a communist!
  • Wikked [to cobot]: are you a communist?
  • DragonBoi [to cobot]: are you now, or ever been, a member of the communist party?
  • cobot mumbles something in binary.
  • DragonBoi [to cobot]: so you're taking the fifth are you? well this is a special commission, that won't work here!

17
9 Proactive Actions (back to chart)
  • Null Action: Choose to remain silent for this time period.
  • Topic Starters (4): Introduce a conversational topic. Cobot declares that he wants to discuss sports or politics, or he utters a sentence from either the sports section or the political section of the Boston Globe.
  • Roll Call (2): Initiate a roll call, a common word-play routine in LambdaMOO. For example, someone tired of Monica Lewinsky may announce "TIRED OF LEWINSKY ROLL CALL," and each user feeling the same will agree with the roll call. Cobot initiates a roll call by taking a recent utterance and extracting either a single noun or a verb phrase. These are treated as two separate RL actions.
  • Social Commentary: Make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly." These statements are based on Cobot's statistics from recent activity. Several different utterances are possible, but they are treated as a single action for RL purposes.
  • Introductions: Introduce two users who have not yet interacted with one another in front of Cobot.
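The resulting nine-action RL menu, written out explicitly (identifier names are illustrative; the grouping follows this slide):

```python
from enum import Enum

class CobotAction(Enum):
    """The nine proactive RL actions described on this slide."""
    NULL = 0                    # remain silent this time period
    TOPIC_SPORTS = 1            # declare a desire to discuss sports
    TOPIC_POLITICS = 2          # declare a desire to discuss politics
    QUOTE_SPORTS = 3            # utter a sentence from the Globe sports section
    QUOTE_POLITICS = 4          # utter a sentence from the Globe politics section
    ROLL_CALL_NOUN = 5          # roll call built from an extracted noun
    ROLL_CALL_VERB_PHRASE = 6   # roll call built from an extracted verb phrase
    SOCIAL_COMMENTARY = 7       # comment on the Living Room's social state
    INTRODUCTION = 8            # introduce two users who haven't interacted
```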

18
Actions (2)
  • These actions were chosen to fit in with what goes on in LambdaMOO, so as not to irritate users.
  • Most common routines:
  • Conversation
  • Wordplay
  • Emoting
  • Effectively an infinite range of utterances, since they are based on an utterance from recent conversation (roll call) or on the Boston Globe online

19
Reinforcement Learning
  • At set time intervals, Cobot chooses an action according to a distribution based on the Q-values in the current state.
  • Rewards and punishments received between time t and t+1 apply to the action taken at time t.
  • Possible erroneous reward/punishment if a user actually rewarded a reactive rather than a proactive action: noise in the training process

20
Feedback Actions
  • Explicit
  • reward and punish verbs
  • give a numeric training signal to Cobot
  • immediate feedback applies to the current state and action
  • backed up to previous states and actions
  • Implicit
  • standard LambdaMOO verbs
  • e.g. hug and spank, kiss, spit
  • numerically weaker than explicit
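A sketch of how such feedback might be turned into a scalar training signal; the slides give no numeric values, so the magnitudes below are assumptions (explicit merely stronger than implicit):

```python
# Illustrative magnitudes only: explicit feedback outweighs implicit.
EXPLICIT = {"reward": +1.0, "punish": -1.0}
IMPLICIT = {"hug": +0.5, "kiss": +0.5, "spank": -0.5, "spit": -0.5}

def feedback_value(verb):
    """Scalar reward for one feedback verb; unknown verbs count as zero."""
    return EXPLICIT.get(verb, IMPLICIT.get(verb, 0.0))

# Rewards arriving between t and t+1 are credited to the action taken
# at t (and, per the previous slide, backed up to earlier states/actions).
pending_reward = sum(feedback_value(v) for v in ["hug", "reward", "spit"])
print(pending_reward)  # 1.0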

21
Train for individual user or community?
  • Design choice
  • Train for the entire community
  • Or for each individual user
  • Combining the value functions of those present
  • Thus, like several RL processes running in parallel, each process with a different state space
  • Why?
  • If we just stored which users are present as another state feature, Cobot would have to learn this feature's primacy on its own
  • Learning should be fast and significant. If users don't get feedback that they influenced Cobot's behavior, they will be discouraged
  • Curse of dimensionality: the size of the state space increases exponentially with the number of state features. We don't want to represent the presence/absence of 250 users; maintaining a small state space speeds up learning
  • Certain users interact much more often with Cobot than others. We don't want their input to dwarf the impact of others.

22
State space for generic user
  • Social Summary Vector (4)
  • rate at which the user produces events
  • rate of events produced by others and directed at the user
  • which other users present are among the user's playmates
  • which other users present count the user among their playmates
  • (A playmate is one of the top ten users one interacts with)
  • Mood Vector: recent use of eight groups of common words
  • e.g. grin and smile form a single group
  • Rates Vector: rate at which events are produced by the users present, including Cobot
  • Current Room: which room Cobot is currently in
  • Roll Call Vector
  • Has the saved roll-call text been used by Cobot before?
  • Has someone done a roll call since the last time Cobot did one?
  • Has there been a roll call since the last time Cobot grabbed text?
  • Bias: always on; simply means the user is present
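A hypothetical assembly of this per-user feature vector (all shapes and names here are assumptions; the slide gives only the feature groups):

```python
import numpy as np

def user_state_vector(social_summary, mood, rates, room_one_hot, roll_call_flags):
    """Concatenate the slide's feature groups for one user, plus the
    always-on bias feature."""
    bias = [1.0]  # 'always on': simply means the user is present
    return np.concatenate([social_summary,   # 4 social-summary values
                           mood,             # 8 word-group usage rates
                           rates,            # event rates of users present
                           room_one_hot,     # current-room indicator
                           roll_call_flags,  # 3 roll-call booleans
                           bias])

s = user_state_vector(np.zeros(4), np.zeros(8), np.zeros(5),
                      np.zeros(3), np.zeros(3))
```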

23
  • The state space for a single user is too complex to model with a table-based representation
  • A linear function approximator is used for each user
  • The policies of the users present are mixed (one plausible scheme is sketched below)
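The slides do not spell out how the per-user policies are combined; uniformly averaging the action distributions of the users present is one simple scheme, sketched here:

```python
import numpy as np

def mixed_distribution(per_user_probs):
    """Uniform mixture of the action distributions of the users present.
    Averaging distributions is an assumed choice, not the paper's
    documented mechanism."""
    return np.mean(per_user_probs, axis=0)

# Example: one near-indifferent user plus one roll-call fan.
probs = mixed_distribution([np.full(9, 1 / 9), np.eye(9)[5]])
action = np.random.choice(9, p=probs)
```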

24
Experimental Procedure
  • Cobot has been in LambdaMOO since September 1999
  • RL Cobot since May 2000
  • Cobot is a real working system with real human users, and the experiment was conducted in this context
  • RL functionality was launched in the Living Room
  • Cobot logged RL-related data from May 10 to October 10, 2000
  • States visited, actions taken, rewards from each user, parameters of the value function, etc.
  • 63,123 RL actions taken (not counting reactive actions)
  • 3,171 reward and punishment events
  • From 254 users

25
Findings
  • Inappropriateness of average reward
  • Successful RL would show an increase in average reward over time (not observed here)

26
  • Not because users grow more dissatisfied as Cobot learns
  • Humans are fickle; preferences change over time (indeed, novelty is highly valued in LambdaMOO)
  • popular, exciting → irritating
  • Trying to hit (learn) a moving target
  • So perhaps average reward shouldn't be the primary measure of performance
  • Users with fixed preferences tend to give less reward/punishment feedback as Cobot learns their preferences accurately ("good enough")
  • (Not mentioned: users also get bored)
  • Typical RL assumes reward and punishment are given consistently
  • M and S: dedicated users. Other measures are explored later

27
Users M and S
28
Findings
  • Small set of dedicated "parents"
  • 254 users
  • 218 gave fewer than 20 feedback events
  • 15 gave more than 50
  • Many had a passing interest; a few were willing to invest significant time to teach their preferences to Cobot
  • M: 594 feedback events; S: 69

29
Findings
  • Some parents have strong opinions
  • For the majority of users, the policy learned was close to a uniform distribution
  • Policies can depend on state, but for most users this dependence was weak, hence the near-uniform distribution
  • Most users did not provide enough feedback, and may not have been consistent and strong in the feedback they provided
  • A small group did teach Cobot a non-uniform policy
  • M's and S's policies are relatively independent of state; other users' policies are less dramatic, but still non-uniform
  • This makes sense: if a user does not like sports, it does not matter what room she is in or what the other users are doing
  • M likes roll calls: Cobot selects them with probability 0.99. S likes social commentary: Cobot selects it with probability 0.38 (S interacted less, at 69 feedback events)

30
Findings
  • Cobot learns matching policies
  • The policy for user M reflects the empirical pattern of M's rewards over time

31
(Chart: for each action, blue bars show the average reward given by user M (relative values), yellow bars show the policy learned for user M, and red bars show the empirical frequency at which the action was taken. Action 6 is roll call; recall that M likes roll calls.)
32
Findings
  • Cobot responds to dedicated parents
  • Those users who train him have a strong impact: the policy shifts towards M's preferences when M is present. Of course! No one else trained him, so this is where reward/punishment has the most effect. Worth noting only because so few users actually trained him.
  • Some preferences depend on state
  • We can deduce which features are relevant to a given user
  • By construction, the bias feature is independent of state (always on)
  • All weights are initialized to 0, so only nonzero features contribute. A feature is deemed relevant if its weight vector is far from both the bias feature's weight vector and the all-zero vector (a sketch of this test follows)
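A sketch of that relevance test (the distance threshold is an assumption; the slide gives only the criterion):

```python
import numpy as np

def relevant_features(weights, bias_index, thresh=0.1):
    """weights[f] is feature f's per-action weight vector. A feature is
    deemed relevant if it is far (Euclidean) from both the all-zero
    vector and the bias feature's vector; `thresh` is an assumed cutoff."""
    bias_w = weights[bias_index]
    return [f for f in range(len(weights))
            if f != bias_index
            and np.linalg.norm(weights[f]) > thresh
            and np.linalg.norm(weights[f] - bias_w) > thresh]

print(relevant_features(np.vstack([np.zeros(9), np.ones(9), np.zeros(9)]),
                        bias_index=2))  # [1]
```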

33
Findings: some policies do in fact depend on state
34
Conclusions
  • Reported on efforts to apply RL in a complex human online social environment (a MOO), where many of the standard assumptions (stationary rewards, Markovian behavior, appropriateness of average reward) are clearly violated.
  • We feel that the results obtained with Cobot so
    far are compelling, and offer promise for the
    application of RL in such open-ended social
    settings.
  • Cobot continues to take RL actions and receive
    rewards and punishments from LambdaMOO users, and
    we plan to continue and embellish this work as
    part of our overall efforts on Cobot.