Title: Machine Learning for Agents and Multi-Agent Systems
1 Machine Learning for Agents and Multi-Agent Systems
- Daniel Kudenko and Dimitar Kazakov
- Department of Computer Science
- University of York, UK
ECAI-02, Lyon, July 2002
2 Outline
- Principles of Machine Learning (ML)
- ML for Single Agents
- ML for Multi-Agent Systems
- Specialisation and Role Learning
- Focus Topic 1: Learning of Co-ordination
- Evolution, Individual Learning and Language
- Focus Topic 2: Evolution of Kinship-Driven Altruism
3 Why Learning Agents?
- Designers cannot foresee all situations that the agent will encounter.
- To display full autonomy, agents need to learn from and adapt to novel environments.
- Learning is a crucial part of intelligence.
4 Evolution and Individual Learning in MAS
5 What is Machine Learning?
- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [Mitchell 97]
- Example: T = play tennis, E = playing matches, P = score
6 ML: Another View
- ML can be seen as the task of
- taking a set of observations represented in a given object/data language, and
- representing (the information in) that set in another language, called the concept/hypothesis language.
- A side effect of this step: the ability to deal with unseen observations.
7 Object and Concept Language
- Object Language: points (x, y) labelled +/-.
- Concept Language: any ellipse (5 parameters: x1, y1, x2, y2, l1+l2).
[Figure: an ellipse with foci (x1, y1) and (x2, y2); l1 and l2 are the distances from a boundary point to the two foci; positive examples fall inside, negative examples outside.]
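To make the object/concept language distinction concrete, here is a minimal Python sketch (not from the tutorial): observations are labelled points in the object language, while a hypothesis in the concept language is an ellipse given by its two foci and the distance sum l1+l2. All coordinates, labels and parameter values are illustrative assumptions.

import math

# A hypothesis in the concept language: an ellipse given by its two foci
# (x1, y1), (x2, y2) and the sum of distances d = l1 + l2 to any boundary point.
def make_ellipse(x1, y1, x2, y2, d):
    def covers(x, y):
        l1 = math.hypot(x - x1, y - y1)
        l2 = math.hypot(x - x2, y - y2)
        return l1 + l2 <= d          # inside (or on) the ellipse -> positive
    return covers

# Observations in the object language: (x, y, label) with label '+' or '-'.
observations = [(0.0, 0.0, '+'), (1.0, 0.5, '+'), (4.0, 4.0, '-')]

h = make_ellipse(-1.0, 0.0, 1.0, 0.0, 4.0)   # one candidate hypothesis
errors = sum(1 for x, y, lab in observations
             if ('+' if h(x, y) else '-') != lab)
print("training errors:", errors)            # -> 0 for this toy data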
8 Machine Learning Biases
- The concept/hypothesis language specifies the language bias, which limits the set of all concepts/hypotheses that can be expressed/considered/learned.
- The preference bias allows us to decide between two hypotheses (even if they both classify the training data equally).
- The search bias defines the order in which hypotheses will be considered.
- Important if one does not search the whole hypothesis space.
9 Concept Language and Eager vs. Lazy Learning
- Eager learning: commit to the hypothesis computed after training.
- Lazy learning: store all encountered examples and perform classification based on this database (e.g. nearest neighbour; sketched below).
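A minimal sketch of a lazy learner, assuming a 1-nearest-neighbour classifier and made-up data: training only stores the examples, and all computation is deferred to classification time.

import math

# Lazy learning: "training" just stores the examples (attribute vector, label).
training = [([1.0, 1.0], 'a'), ([1.2, 0.8], 'a'), ([5.0, 5.1], 'b')]

def classify(x):
    # scan the whole stored database for every query -- cheap to "learn",
    # potentially slow to recall
    _, label = min(training, key=lambda ex: math.dist(x, ex[0]))
    return label

print(classify([0.9, 1.1]))   # -> 'a'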
10 Concept Language and Black- vs. White-Box Learning
- Black-Box Learning: the interpretation of the learning result is unclear to a user.
- White-Box Learning: creates (symbolic) structures that are comprehensible.
11 Concept Language and Background Knowledge
- Examples of concept language:
- a set of real or idealised examples expressed in the object language that represent each of the concepts learned (Nearest Neighbour)
- attribute-value pairs (propositional logic)
- relational concepts (first-order logic)
- One can extend the concept language with user-defined concepts or background knowledge.
12 Background Knowledge (2)
- Characteristic for Inductive Logic Programming (ILP).
- The use of certain BK predicates may be a necessary condition for learning the right hypothesis.
- Redundant or irrelevant BK slows down the learning.
13 Choice of Background Knowledge (the anthropologist's view)
- "In an ideal world one should start from a complete model of the background knowledge of the target population. In practice, even with the most intensive anthropological studies, such a model is impossible to achieve. We do not even know what it is that we know ourselves. The best that can be achieved is a study of the directly relevant background knowledge, though it is only when a solution is identified that one can know what is or is not relevant."
- The Critical Villager, Eric Dudley
14 Preference Bias, Search Bias: Version Space
- Version space: the subset of hypotheses that have zero training error.
[Figure: the version space, bounded above by the most general and below by the most specific concept consistent with the training data.]
15 More Preference Biases
- Consider the new representation of your data as made of a theory T and a description D needed to reconstruct the original data from T.
- Ockham's razor: "Don't multiply the number of entities without a reason."
- In ML, this means: the simpler the theory, the better.
- Minimal Description Length (Rissanen 89): choose the T for which the binary representation of T and D combined is the shortest possible (see the formula below).
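Written as a formula (a standard statement of the MDL principle, not taken from the slide):
  T* = argmin_T [ L(T) + L(D | T) ]
where L(T) is the number of bits needed to encode the theory and L(D | T) the number of bits needed to encode the data given the theory.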
16 Positive Only Learning
- A way of dealing with domains where no negative examples are available.
- Learn the concept of non-self-destructive actions.
- The trivial definition "anything belongs to the target concept" looks all right!
- Trick: generate random examples and treat them as negative (sketched below).
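A rough sketch of the random-negatives trick; the toy domain (small integers), the target concept and the scoring rule are invented for illustration.

import random

# Score a candidate hypothesis by the positives it covers, penalising coverage
# of randomly generated examples that are treated as negatives.
positives = [2, 4, 6, 8, 10]                             # observed members of the target concept
randoms   = [random.randint(0, 20) for _ in range(100)]  # random examples treated as negatives

def score(hypothesis):
    covered_pos = sum(hypothesis(x) for x in positives)
    covered_rnd = sum(hypothesis(x) for x in randoms)
    return covered_pos - covered_rnd * len(positives) / len(randoms)

trivial = lambda x: True                 # "anything belongs to the concept"
even    = lambda x: x % 2 == 0
print(score(trivial), score(even))       # the trivial definition typically scores worse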
17 Active Learning
- Learner decides which training data to receive (i.e. generates training examples and uses an oracle to classify them). (Thompson et al. 1999)
- Closed Loop ML: learner suggests a hypothesis and verifies it experimentally. If the hypothesis is rejected, the collected data gives rise to a new hypothesis. (Bryant and Muggleton 2000)
18 Machine Learning vs. Learning Agents
[Diagram: a spectrum from learning as the only goal to learning as one of many goals: Classic Machine Learning → Active Learning → Closed Loop Machine Learning → Learning Agent(s).]
19 Integrating Machine Learning into the Agent Architecture
- Time constraints on learning
- Synchronisation between agents' actions
- Learning and recall
- Timing analysis of theories learned
20 Time Constraints on Learning
- Machine Learning alone:
- predictive accuracy matters, time doesn't (just a price to pay)
- ML in Agents:
- soft deadlines: resources must be shared with other activities (perception, planning, control)
- hard deadlines imposed by the environment: "Make up your mind now!"
21 Doing Eager vs. Lazy Learning under Time Pressure
- Eager Learning:
- theories are typically more compact
- and faster to use
- takes more time to learn: do it when the agent is idle
- Lazy Learning:
- knowledge acquired at (almost) no cost
- may be much slower when a test example comes
22 Any-Time Learning
- Consider two types of algorithms:
- Algorithms where running a prescribed number of steps guarantees finding a solution
- can use worst-case complexity analysis to find an upper bound on the execution time
- Any-time algorithms:
- a longer run may result in a better solution
- don't know an optimal solution when they see one
- example: Genetic Algorithms
- policies: halt learning to meet hard deadlines or when cost outweighs the expected improvements in accuracy
23 Time Constraints on Learning in Simulated Environments
- Consider various cases:
- unlimited time for learning
- upper bound on time for learning
- learning in real time
- Gradually tightening the constraints makes integration easier.
- Not limited to simulations: real-world problems have a similar setting, e.g. various types of auctions.
24 Learning and Recall
- Agent must strike a balance between:
- Learning, which updates the model of the world
- Recall, which applies the existing model of the world to other tasks
25 Learning and Recall (2)
[Diagram: a cycle of updating sensory information, recalling the current model of the world to choose and carry out an action, and learning a new model of the world.]
- In theory, the two can run in parallel.
- In practice, they must share limited resources.
26 Learning and Recall (3)
- Possible strategies:
- parallel learning and recall at all times
- mutually exclusive learning and recall
- after incremental, eager learning, examples are discarded
- or kept if batch or lazy learning is used
- cheap on-the-fly learning (preprocessing), plus off-line, computationally expensive learning
- reduce raw information, change the object language
- analogy with human learning and the role of sleep
27 Timing Analysis of Theories Learned: Example
- (Kazakov, PhD Thesis)
- Beware of phase transition-like behaviour:
- left: simple theory with low coverage succeeds or quickly fails → high speed
- middle: medium coverage, fragmentary theory, lots of backtracking → low speed
- right: general theory with high coverage, less backtracking → high speed
28 Types of Learning Task
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
29 Reinforcement Learning
- Agent learns from environmental feedback indicating the benefit of states.
- No explicit teacher required.
- Learning target: the optimal policy (i.e. a state-action mapping).
- Optimality measure: e.g. cumulative discounted reward.
30 Q Learning
- Reinforcement learning algorithm.
- Most popular agent learning technique.
- Values (discounted cumulative rewards) of state-action pairs are stored in a Q-table.
- The optimal policy is easily derived from the Q-table.
31 Q Learning
Value of a state: the discounted cumulative reward
  V^π(s_t) = Σ_{i ≥ 0} γ^i r(s_{t+i}, a_{t+i})
where 0 ≤ γ < 1 is a discount factor (γ = 0 means that only the immediate reward is considered), and r(s_{t+i}, a_{t+i}) is the reward determined by performing the actions specified by policy π.
  Q(s, a) = r(s, a) + γ V*(δ(s, a))
Optimal policy:
  π*(s) = argmax_a Q(s, a)
32 Example
[Figure: a start state s0 with two routes, a short one with an immediate reward of 100 and a longer one yielding rewards 50, 0 and then 100.]
33 Example (cont.)
- V_short(s0) = 100
- V_long(s0) = 50 + γ·0 + γ²·100
- γ = 0.8: V_long(s0) = 114 > 100, so π* = long
- γ = 0.5: V_long(s0) = 75 < 100, so π* = short
34 Q Learning
Initialize all Q(s, a) to 0.
In some state s, choose some action a. Let s' be the resulting state and r the reward received.
Update Q (sketched below):
  Q(s, a) ← r + γ max_{a'} Q(s', a')
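A minimal tabular Q-learning sketch of the update above. The four-state chain environment, the purely random exploration and the constants are illustrative assumptions, not part of the tutorial.

import random
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)                      # Q[(state, action)], initialised to 0

def step(s, a):
    # toy chain environment: states 0..3, action 0 = left, 1 = right,
    # reward 100 on reaching the goal state 3
    s2 = max(0, s - 1) if a == 0 else min(3, s + 1)
    r = 100 if s2 == 3 else 0
    return s2, r

for episode in range(200):
    s = 0
    while s != 3:
        a = random.choice([0, 1])           # pure exploration for simplicity
        s2, r = step(s, a)
        Q[(s, a)] = r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
        s = s2

policy = {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in range(3)}
print(policy)                               # -> {0: 1, 1: 1, 2: 1}: always move right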
35 Q Learning
- Guaranteed convergence towards the optimum (state-action pairs have to be visited infinitely often).
- The exploration strategy can speed up convergence (more on this later).
- Basic Q learning does not generalize: replace the state-action table with a function approximator (e.g. a neural net) in order to handle unseen states.
36 Learning in Multi-Agent Systems: Important Issues
- Classification
- Social Awareness
- Communication
- Role Learning
- Distributed Learning
- Focus: Learning of Coordination
37 A Brief History
[Diagram: disembodied ML (Machine Learning) → single-agent learning (single-agent system) → multiple single-agent learners (multiple single-agent system) → social multi-agent learners (social multi-agent system); the latter three fall under Agents.]
38 Types of Multi-Agent Learning [Weiss & Dillenbourg 99]
- Multiplied Learning: no interference in the learning process by other agents (except for the exchange of training data or outputs).
- Divided Learning: division of the learning task on a functional level.
- Interacting Learning: cooperation beyond the pure exchange of data.
39 Social Awareness
- Awareness of the existence of other agents and (eventually) knowledge about their behavior.
- Not necessary to achieve near-optimal MAS behavior: rock sample collection [Steels 89].
- Can it degrade performance?
40 Levels of Social Awareness [Vidal & Durfee 97]
- 0-level agent: no knowledge about the existence of other agents.
- 1-level agent: recognizes that other agents exist; models other agents as 0-level.
- 2-level agent: has some knowledge about the behavior of other agents; models other agents as 1-level agents.
- k-level agent: models other agents as (k-1)-level.
41 Social Awareness and Q Learning
- 0-level agents already learn implicitly about other agents.
- [Mundhe & Sen 00]: study of two Q-learning agents up to level 2.
- Two 1-level agents display the slowest and least effective learning (worse than two 0-level agents).
42 Agent Models and Q Learning
- Q: S × A^n → R, where n is the number of agents.
- If the other agents' actions are not observable, an assumption about their actions is needed.
- Pessimistic assumption: given an agent's action choice, the other agents will minimize the reward.
- Optimistic assumption: the other agents will maximize the reward (both assumptions are sketched below).
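A small sketch of how an agent might project a joint-action value table onto its own actions under the two assumptions; the state, action names and values are illustrative.

# Projecting a joint-action value table Q[state][(a_self, a_other)] onto the
# agent's own actions under the pessimistic or optimistic assumption.
Q = {'s0': {('a1', 'b1'): 5, ('a1', 'b2'): 0,
            ('a2', 'b1'): 2, ('a2', 'b2'): 3}}

def value(state, my_action, assume='optimistic'):
    payoffs = [q for (mine, _), q in Q[state].items() if mine == my_action]
    return max(payoffs) if assume == 'optimistic' else min(payoffs)

for a in ('a1', 'a2'):
    print(a, value('s0', a, 'optimistic'), value('s0', a, 'pessimistic'))
# optimistic:  a1 -> 5, a2 -> 3   (choose a1)
# pessimistic: a1 -> 0, a2 -> 2   (choose a2: overly cautious)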
43 Agent Models and Q Learning
- The pessimistic assumption leads to overly cautious behavior.
- The optimistic assumption guarantees convergence towards the optimum [Lauer & Riedmiller 00].
- If knowledge of the other agents' behavior is available, the Q-value update can be based on a probabilistic computation [Claus & Boutilier 98], but with no guarantee of optimality.
44 Q Learning and Communication [Tan 93]
- Types of communication
- Sharing sensation
- Sharing or merging policies
- Sharing episodes
- Results
- Communication generally helps
- Extra sensory information may hurt
45 Role Learning
- It is often useful for agents to specialize in specific roles for joint tasks.
- Pre-defined roles reduce flexibility; the optimal distribution is often not easy to define and may be expensive.
- How to learn roles?
- [Prasad et al. 96]: learn the optimal distribution of pre-defined roles.
46 Q Learning of Roles
- [Crites & Barto 98], elevator domain: regular Q learning, no specialization achieved (but highly efficient behavior).
- [Ono & Fukumoto 96], hunter-prey domain: specialization achieved with the "greatest mass" merging strategy.
47 Q Learning of Roles [Balch 99]
- Two main types of reward function: local and global.
- Global reward supports specialization.
- Local reward supports the emergence of homogeneous behaviors.
- Some domains benefit from learning team heterogeneity (e.g. robotic soccer), others do not (e.g. multi-robot foraging).
- Heterogeneity measure: social entropy.
48 Distributed Learning
- Motivation: agents learning a global hypothesis from local observations.
- Application of MAS techniques to (inductive) learning.
- Applications: distributed data mining [Provost & Kolluri 99], robotic soccer.
49 Distributed Data Mining
- [Provost & Hennessy 96]: individual learners see only a subset of all training examples and compute a set of local rules based on these.
- Local rules are evaluated by the other learners based on their data.
- Only rules with a good evaluation are carried over to the global hypothesis (see the sketch below).
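A sketch in the spirit of the scheme just described: each learner induces candidate rules from its local subset, the other learners evaluate them on their own data, and only well-evaluated rules enter the global hypothesis. The data, candidate rules and the 0.9 threshold are illustrative assumptions.

local_data = [
    [(2, True), (4, True), (5, False)],      # learner 1's examples: (x, label)
    [(6, True), (7, False), (8, True)],      # learner 2
    [(3, False), (10, True), (12, True)],    # learner 3
]

candidate_rules = {'even(x)': lambda x: x % 2 == 0,
                   'x > 4':   lambda x: x > 4}

def accuracy(rule, data):
    return sum(rule(x) == label for x, label in data) / len(data)

global_hypothesis = []
for name, rule in candidate_rules.items():
    scores = [accuracy(rule, d) for d in local_data]     # every learner evaluates the rule
    if sum(scores) / len(scores) >= 0.9:
        global_hypothesis.append(name)

print(global_hypothesis)                                  # -> ['even(x)']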
50 Learning to Coordinate
- Good coordination is crucial for good MAS performance.
- Example: a soccer team.
- Pre-defined coordination protocols are often difficult to define in advance.
- Needed: learning of coordination.
- Focus: Q-learning of coordination.
51 Soccer Formation
52 Soccer Formation Control
- Formation control is a coordination problem.
- Good formations and set-plays seem to be a strong factor in winning teams.
- To date: pre-defined.
- Can (near-)optimal formations be (reinforcement) learned?
53 A Sub-Problem
- Given: n agents at random positions, and a formation having n positions.
- Wanted: a set of n policies that transforms the initial state into the desired formation.
- Specifically: Q learning of these policies.
54 A Further Simplification
- MAS policy: a decision procedure for who takes which position.
- No two agents should choose the same formation position.
- The problem reduces to reinforcement learning of coordination in cooperative games.
55 Cooperative Games
- Players perform actions simultaneously.
- Afterwards, all players receive the same reward based on the joint action.
Example payoff matrix (Player 1 chooses the row, Player 2 the column):
              Player 2
              A1   A2
  Player 1 A1  5    3
           A2  2    0
56 Cooperative Games and Formations
- Consider a 2-player formation with 2 positions: left, right.
- Corresponding cooperative game (see the payoff sketch below):
              Player 2
              left  right
  Player 1 left   0     5
           right  5     0
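The same two-position formation game written out as a payoff table in Python, as a small sanity check of the mapping from formation positions to joint rewards; the dictionary representation is an illustrative choice.

# Both agents receive 5 when they pick different positions, 0 otherwise.
payoff = {('left', 'left'): 0, ('left', 'right'): 5,
          ('right', 'left'): 5, ('right', 'right'): 0}

def reward(joint_action):
    return payoff[joint_action]         # the same reward for both players

print(reward(('left', 'right')))        # -> 5: a valid formation
print(reward(('left', 'left')))         # -> 0: both chose the same position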
57 Learning in Cooperative Games
- To date, the focus has been on Q-learning.
- Is communication/observation amongst agents necessary?
- Does this requirement change with increasing difficulty of the cooperative game?
58 Convergence
- Single-agent Q-learning: guaranteed convergence (to the optimum).
- Multi-agent Q-learning: more assumptions needed.
- Crucial in MAS: the action selection strategy.
59 Q Learning Revisited
- Modified Q update function:
  Q(a) ← Q(a) + α (r - Q(a))
- Boltzmann action selection strategy:
  P(a) = e^{EV(a)/T} / Σ_b e^{EV(b)/T}
60 Boltzmann Exploration
- Usually EV(a) = Q(a).
- Trade-off between exploration and exploitation.
- A higher temperature T results in more emphasis on exploration.
- The temperature T should be high at first and lowered with time, e.g. T(t) = e^{-st} (sketched below).
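A sketch of Boltzmann action selection with an exponentially decaying temperature. The decay rate s, the initial temperature and the temperature floor are illustrative assumptions rather than values from the tutorial.

import math, random

def boltzmann_choice(ev, T):
    # ev: dict action -> estimated value; samples an action with probability
    # proportional to exp(EV(a)/T)
    weights = {a: math.exp(v / T) for a, v in ev.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a

def temperature(t, s=0.01, t_max=50.0):
    # high at first, lowered with time; the floor avoids division blow-ups
    return max(t_max * math.exp(-s * t), 0.1)

ev = {'a1': 5.0, 'a2': 3.0}
print(boltzmann_choice(ev, temperature(0)))     # early: near-uniform exploration
print(boltzmann_choice(ev, temperature(1000)))  # late: nearly greedy exploitation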
61 Q Learning of Coordination
- [Singh et al. 2000]: convergence to some joint action can be ensured with specific temperature properties.
- Convergence to the optimal joint action for simple cases:
              Player 2
              A1   A2
  Player 1 A1  5    3
           A2  2    0
62 Difficult Cooperative Games
- Climbing Game [Claus & Boutilier 98]:
              Player 2
              a     b    c
  Player 1 a  11   -30   0
           b -30     7   6
           c   0     0   5
63 Climbing Game
- Multiplied Q learning with Boltzmann exploration converges to the suboptimal (c, c).
- [Claus & Boutilier 98]: Joint Action Learners (JAL).
- Agents observe each other's actions and build a probabilistic model, according to which the next action is chosen.
- Agents get to (b, b) but are stuck there.
64 Climbing Game (cont.)
- Optimistic assumption [Lauer & Riedmiller 00]: never reduce Q-values due to penalties (sketched below).
- Converges quickly to the optimal (a, a).
- However, it does not converge on the stochastic version of the climbing game.
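A sketch of the optimistic update in the spirit of [Lauer & Riedmiller 00] for a single-state repeated game; the observed reward sequence is invented for illustration.

# Each agent keeps Q over its OWN actions only and never lowers an estimate,
# so penalties caused by the other agent's exploration are ignored.
# (Single-state game, so no discounted successor term is needed.)
def optimistic_update(Q, action, reward):
    Q[action] = max(Q[action], reward)
    return Q

Q1 = {'a': 0, 'b': 0, 'c': 0}
# rewards observed while the other agent explores in the climbing game
for action, reward in [('a', -30), ('a', 11), ('b', 7), ('c', 5), ('a', -30)]:
    optimistic_update(Q1, action, reward)

print(Q1)   # -> {'a': 11, 'b': 7, 'c': 5}: the greedy choice is the optimal 'a'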
65 Stochastic Climbing Game
Each cell shows the two possible payoffs:
              Player 2
              a        b       c
  Player 1 a  12/10    0/-60   0/-60
           b   0/-60  14/0     8/4
           c   5/-5    5/-5    7/3
66 FMQ Heuristic
- [Kapetanakis & Kudenko 02]:
  EV(a) = Q(a) + c · freq(maxR(a)) · maxR(a)
- EV(a) carries information on how frequently an action produces its maximum corresponding reward (see the sketch below).
- Converges to the optimal (a, a) for the climbing game and the partially stochastic climbing game.
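A sketch of how the FMQ value might be computed from simple per-action counters; the counter values, Q estimates and the weight c are invented for illustration.

# The value fed into Boltzmann selection is boosted by how often an action has
# produced its maximum observed reward.
def fmq_ev(action, Q, max_reward, max_count, total_count, c=10.0):
    freq = max_count[action] / max(total_count[action], 1)
    return Q[action] + c * freq * max_reward[action]

Q           = {'a': -5.0, 'b': 6.0, 'c': 5.0}   # ordinary Q estimates
max_reward  = {'a': 11,   'b': 7,   'c': 5}     # best reward seen per action
max_count   = {'a': 7,    'b': 5,   'c': 10}    # times that best reward occurred
total_count = {'a': 10,   'b': 10,  'c': 10}    # times the action was tried

for a in 'abc':
    print(a, fmq_ev(a, Q, max_reward, max_count, total_count))
# 'a' gets the highest EV despite its poor average, steering both agents
# towards the optimal joint action (a, a).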
67 Partially Stochastic Climbing Game
              Player 2
              a        b       c
  Player 1 a  11       0/-60   0/-60
           b   0/-60  14/0     8/4
           c   5/-5    5/-5    7/3
68 Difficult Cooperative Games
- Penalty Game [Claus & Boutilier 98]:
              Player 2
              a    b    c
  Player 1 a  10   0    k
           b   0   2    0
           c   k   0   10
69 Penalty Game
- JAL: convergence to the optimal (a, a) or (c, c) only for small penalties k (k > -20).
- Both the optimistic assumption and FMQ converge to either optimum also for large penalties (up to 100).
70 Learning of Coordination: More Questions
- Scaling up of Q-learning approaches?
- Agents with state [Boutilier 99].
- Large numbers of actions/agents?
- Learning of formations from non-explicit rewards?
71 Learning of Coordination: Conclusions
- Idealized and simple cases have been studied and solved.
- Mutual communication/observation may not be needed.
- Beyond Q learning: evolutionary approaches [Quinn 01].
72 Learning and Natural Selection
- In learning, search is trivial; choosing the right bias is hard.
- But the choice of learning bias is always external to the learner!
- To find the best-suited bias, one could combine arbitrary choices of bias with evolution and natural selection of the fittest individuals.
73 Darwinian vs. Lamarckian Evolution
- Darwinian evolution: nothing learned by the individual is encoded in the genes and passed on to the offspring.
- The Baldwin effect: learning abilities (good biases) are selected because they give the individual a better chance in a dynamic environment.
- What is passed on to the offspring is useful, but very general.
74 Darwinian vs. Lamarckian Evolution (2)
- Lamarckian evolution: individual experience acquired in life can be inherited.
- Not the case in nature.
- That doesn't mean we can't use it.
- The inherited concepts may be too specific and not of general importance.
75 Learning and Language
- Language uses concepts which are:
- specific enough to be useful to most/all speakers of that language
- general enough to correspond to shared experience (otherwise, how would one know what the other is talking about!)
- The concepts of a language serve as a learning bias which is inherited not in genes but through education.
76 Communication and Learning
- Language:
- helps one learn (in addition to inherited biases)
- allows one to communicate knowledge.
- Distinguish between:
- Knowledge: things that one can explain to another by means of a language.
- Skills: the rest; they require individual learning and cannot be communicated.
77 Communication and Learning
- In language learning, "forgetting" examples may be harmful (van den Bosch et al.).
- "An expert is someone who does not think anymore; he knows." (Frank Lloyd Wright)
- It may be difficult to communicate what one has learned because of:
- limited bandwidth (for lazy learning)
- the absence of appropriate concepts in the language (for black-box learning)
78 Communication and Learning
- In a society of communicating agents, less accurate white-box learning may be better than more accurate but expensive learning that cannot be communicated, since the reduced performance could be outweighed by the much lower cost of learning.
79 Stochastic Simulation of Inherited Kinship-Driven Altruism
Heather Turner and Dimitar Kazakov
- Assess the rôle of a hypothetical inherited feature (gene) promoting altruism between relatives as a factor for survival, in the context of a simulated MAS employing natural selection.
- Studies the link between evolution and co-operation.
80 Altruism
- Definition: a selfless behaviour/action that will provide benefit to another at no gain, or to the detriment, of the actor.
- Kinship-driven altruism: altruistic behaviour directed to relatives.
81 Natural Selection of Inherited Behaviour
- Classical Darwinism:
- survival of the fittest individuals
- fitness is the ability to reproduce
- actions hindering fitness disappear
- Neo-Darwinism (Dawkins, Hamilton):
- genes rather than individuals are selected
- inclusive fitness of all copies of a gene
- actions increasing inclusive fitness are promoted
82 Altruism and Natural Selection
- Classical Darwinism: altruism hinders one's fitness, hence it should be demoted by natural selection.
- Neo-Darwinism: altruistic acts can positively influence the inclusive fitness of a gene in the population, e.g. kinship-driven altruism.
83 Kinship-Driven Altruism: Example
[Figure: two scenarios, Case A and Case B, comparing the genetic cost of an individual's self-sacrifice for its relatives.]
84 Kinship-Driven Altruism: Example
- A self-sacrifice is justified as it removes one rather than 1.5 copies of each gene.
- Hamilton (1964) presents a detailed analytical model and supports it with evidence from nature.
85 Kinship-Driven Altruism
- Extent of help = f(degree of kinship).
- Altruistic behaviour based on a well-chosen f could increase the inclusive fitness of all genes of its carriers, i.e. would help propagate them.
- If altruism were an inherited feature (gene), it could be propagated itself for the same reason.
86 One Day in the Life of an Agent
- Age, and maybe die.
- Mate.
- Hunt.
- Help a relative.
- Death, finding a mate, food or a relative are modelled as stochastic processes.
87 Altruistic Behaviour
- If you meet someone poorer than you:
- use your sharing function to decide how much you would give an identical twin,
- then reduce the amount according to the perceived degree of kinship (expected average percentage of shared genes), e.g. by half for a child.
88 Experiments: Degrees of Freedom
- Type of sharing function
- Model of the degree of kinship
- Initial ratio between selfish and altruistic individuals
- (Hunting and mating gambling policies are always subject to evolution.)
89 Type of Sharing Function
- Communism
- Progressive taxation with a non-taxable allowance
- Poll tax: pay the same amount pt, even if that kills you
- The parameters q, a, pt are inherited and subject to natural selection (see the sketch below).
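A sketch of the three sharing functions combined with the kinship scaling from the "Altruistic Behaviour" slide. The exact functional forms and the meaning of the inherited parameters q, a, pt are assumptions based on their names, not the forms used in the original simulation.

def communism(my_energy, their_energy):
    # equalise energy with an identical twin (assumed form)
    return max((my_energy - their_energy) / 2.0, 0.0)

def progressive_taxation(my_energy, their_energy, q=0.3, a=20.0):
    # give a fraction q of whatever exceeds the non-taxable allowance a (assumed form)
    return max((my_energy - a) * q, 0.0)

def poll_tax(my_energy, their_energy, pt=5.0):
    # pay the same fixed amount, even if that kills you
    return pt

def donation(sharing, my_energy, their_energy, kinship):
    # amount for an identical twin, scaled by the expected shared-gene fraction
    return sharing(my_energy, their_energy) * kinship

print(donation(progressive_taxation, 100.0, 10.0, 0.5))   # to a child: (100-20)*0.3*0.5 = 12.0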
90 Modelling the Degree of Kinship
- Royalty: the entire family tree is known; in practice, just two generations back and forth (e.g. 100% for self or an identical twin, 50% for a parent or full sibling, 25% for a grandparent or half sibling).
- Prediction: based on the similarity of visible inherited features.
- Unknown: indiscriminate, optimistic (share with everyone as if a child).
91 Results: Population Size
[Charts: population size over time for each combination of kinship model (Royalty, Prediction, Unknown) and sharing function (Communism, Progressive Taxation, Poll Tax).]
92 Results: Percentage of Altruists
[Charts: percentage of altruists over time for each combination of kinship model (Royalty, Prediction, Unknown) and sharing function (Communism, Progressive Taxation, Poll Tax).]
93 Results: Initial Percentage of Altruists
- Royalty model, progressive taxation, initial levels of altruists: 0%, 25%, 50%, 75%, 100%.
- All converge to the same ratio of altruists in the population.
94 Conclusions
- Perfect knowledge of the degree of kinship, or a sharing function based on progressive taxation, promotes altruism.
- Progressive taxation supports a more altruistic population than communism (in the limit) when knowledge of kinship is uncertain.
95 Contribution
- Replicates the natural phenomenon of kinship-driven altruism in a simulated MAS.
- Implements a model of natural selection different from the one commonly used in GA and MAS, and closer to nature.
96 Bibliography
- [Alonso et al. 02] Alonso, E., Kudenko, D. and Kazakov, D. (eds.). Proceedings of the Second Symposium on Adaptive Agents and Multi-Agent Systems. Imperial College, London, 2002. ISBN 1902956280.
- [Alonso & Kudenko 01] Alonso, E. and Kudenko, D. (eds.). Proceedings of the Symposium on Adaptive Agents and Multi-Agent Systems. University of York, UK, 2001.
- [Baldwin 1896] Baldwin, J.M. A new factor in evolution. The American Naturalist 30, 1896.
- [Bryant & Muggleton 00] Bryant, C.H. and Muggleton, S. Closed loop machine learning. Technical Report YCS 330, Department of Computer Science, University of York, Heslington, York, UK, 2000.
- [Boutilier 99] Boutilier, C. Sequential Optimality and Coordination in Multiagent Systems. IJCAI 99.
- [Claus & Boutilier 98] Claus, C. and Boutilier, C. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. AAAI 98.
- [Hamilton 64] Hamilton, W.D. The genetical evolution of social behaviour (I and II). Journal of Theoretical Biology, 1964.
- [Lauer & Riedmiller 00] Lauer, M. and Riedmiller, M. An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems. Proceedings of the 17th International Conference on Machine Learning, 2000.
- [Mitchell 97] Mitchell, T. Machine Learning. McGraw Hill, 1997.
- [Mundhe & Sen 00] Mundhe, M. and Sen, S. Evaluating Concurrent Reinforcement Learners. Proceedings of the Fourth International Conference on Multiagent Systems, IEEE Press, 2000.
- [Quinn 01] Quinn, M. Evolving communication without dedicated communication channels. ECAL '01, Springer LNCS 2159.
- [Rissanen 89] Rissanen, J. Stochastic Complexity in Statistical Enquiry. World Scientific Publishing Co, Singapore, 1989.
- [Thompson et al. 99] Thompson, C., Califf, M.E. and Mooney, R. Active learning for natural language parsing and information extraction. Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
- [Vidal & Durfee 97] Vidal, J.M. and Durfee, E. Agents Learning about Agents: A Framework and Analysis. Working Notes of the AAAI-97 Workshop on Multiagent Learning, 1997.
- [Weiss & Dillenbourg 99] Weiss, G. and Dillenbourg, P. What is "Multi" in Multi-Agent Learning? In P. Dillenbourg (ed.), Collaborative Learning: Cognitive and Computational Approaches. Pergamon Press, 1999.