Title: Machine Learning for Agents and Multi-Agent Systems
1 Machine Learning for Agents and Multi-Agent Systems
- Daniel Kudenko and Dimitar Kazakov
- Department of Computer Science
- University of York, UK
ECAI-02, Lyon, July 2002
2 Outline
- Principles of Machine Learning (ML)
- ML for Single Agents
- ML for Multi-Agent Systems
- Specialisation and Role Learning
- Focus Topic 1: Learning of Co-ordination
- Evolution, Individual Learning and Language
- Focus Topic 2: Evolution of Kinship-Driven Altruism
3 Why Learning Agents?
- Designers cannot foresee all situations that the agent will encounter.
- To display full autonomy, agents need to learn from and adapt to novel environments.
- Learning is a crucial part of intelligence.
4 Evolution and Individual Learning in MAS
5 What is Machine Learning?
- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [Mitchell 97]
- Example: T = play tennis, E = playing matches, P = score
6 ML: Another View
- ML can be seen as the task of
- taking a set of observations represented in a given object/data language, and
- representing (the information in) that set in another language, called the concept/hypothesis language.
- A side effect of this step: the ability to deal with unseen observations.
7 Object and Concept Language
- Object Language: points (x, y) labelled +/-.
- Concept Language: any ellipse (5 parameters: x1, y1, x2, y2, l1+l2).
[Figure: an ellipse with foci (x1, y1) and (x2, y2); l1 and l2 are the distances from a boundary point to the two foci; positive examples fall inside, negative examples outside.]
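To make the object/concept language distinction concrete, here is a minimal Python sketch (not from the tutorial): observations are labelled points in the object language, while a hypothesis in the concept language is an ellipse given by its two foci and the distance sum l1+l2. All coordinates, labels and parameter values are illustrative assumptions.

import math

# A hypothesis in the concept language: an ellipse given by its two foci
# (x1, y1), (x2, y2) and the sum of distances d = l1 + l2 to any boundary point.
def make_ellipse(x1, y1, x2, y2, d):
    def covers(x, y):
        l1 = math.hypot(x - x1, y - y1)
        l2 = math.hypot(x - x2, y - y2)
        return l1 + l2 <= d          # inside (or on) the ellipse -> positive
    return covers

# Observations in the object language: (x, y, label) with label '+' or '-'.
observations = [(0.0, 0.0, '+'), (1.0, 0.5, '+'), (4.0, 4.0, '-')]

h = make_ellipse(-1.0, 0.0, 1.0, 0.0, 4.0)   # one candidate hypothesis
errors = sum(1 for x, y, lab in observations
             if ('+' if h(x, y) else '-') != lab)
print("training errors:", errors)            # -> 0 for this toy data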
8 Machine Learning Biases
- The concept/hypothesis language specifies the language bias, which limits the set of all concepts/hypotheses that can be expressed/considered/learned.
- The preference bias allows us to decide between two hypotheses (even if they both classify the training data equally).
- The search bias defines the order in which hypotheses will be considered.
- Important if one does not search the whole hypothesis space.
9 Concept Language and Eager vs. Lazy Learning
- Eager learning: commit to the hypothesis computed after training.
- Lazy learning: store all encountered examples and perform classification based on this database (e.g. nearest neighbour; sketched below).
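A minimal sketch of a lazy learner, assuming a 1-nearest-neighbour classifier and made-up data: training only stores the examples, and all computation is deferred to classification time.

import math

# Lazy learning: "training" just stores the examples (attribute vector, label).
training = [([1.0, 1.0], 'a'), ([1.2, 0.8], 'a'), ([5.0, 5.1], 'b')]

def classify(x):
    # scan the whole stored database for every query -- cheap to "learn",
    # potentially slow to recall
    _, label = min(training, key=lambda ex: math.dist(x, ex[0]))
    return label

print(classify([0.9, 1.1]))   # -> 'a'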
10 Concept Language and Black- vs. White-Box Learning
- Black-Box Learning: the interpretation of the learning result is unclear to a user.
- White-Box Learning: creates (symbolic) structures that are comprehensible.
11 Concept Language and Background Knowledge
- Examples of concept language:
- a set of real or idealised examples expressed in the object language that represent each of the concepts learned (Nearest Neighbour)
- attribute-value pairs (propositional logic)
- relational concepts (first-order logic)
- One can extend the concept language with user-defined concepts or background knowledge.
12 Background Knowledge (2)
- Characteristic for Inductive Logic Programming (ILP).
- The use of certain BK predicates may be a necessary condition for learning the right hypothesis.
- Redundant or irrelevant BK slows down the learning.
13 Choice of Background Knowledge (the anthropologist's view)
- "In an ideal world one should start from a complete model of the background knowledge of the target population. In practice, even with the most intensive anthropological studies, such a model is impossible to achieve. We do not even know what it is that we know ourselves. The best that can be achieved is a study of the directly relevant background knowledge, though it is only when a solution is identified that one can know what is or is not relevant."
- The Critical Villager, Eric Dudley
14 Preference Bias, Search Bias: Version Space
- Version space: the subset of hypotheses that have zero training error.
[Figure: the version space, bounded above by the most general and below by the most specific concept consistent with the training data.]
15 More Preference Biases
- Consider the new representation of your data as made of a theory T and a description D needed to reconstruct the original data from T.
- Ockham's razor: "Don't multiply the number of entities without a reason."
- In ML, this means: the simpler the theory, the better.
- Minimal Description Length (Rissanen 89): choose the T for which the binary representation of T and D combined is the shortest possible (see the formula below).
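Written as a formula (a standard statement of the MDL principle, not taken from the slide):
  T* = argmin_T [ L(T) + L(D | T) ]
where L(T) is the number of bits needed to encode the theory and L(D | T) the number of bits needed to encode the data given the theory.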
16 Positive Only Learning
- A way of dealing with domains where no negative examples are available.
- Learn the concept of non-self-destructive actions.
- The trivial definition "anything belongs to the target concept" looks all right!
- Trick: generate random examples and treat them as negative (sketched below).
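A rough sketch of the random-negatives trick; the toy domain (small integers), the target concept and the scoring rule are invented for illustration.

import random

# Score a candidate hypothesis by the positives it covers, penalising coverage
# of randomly generated examples that are treated as negatives.
positives = [2, 4, 6, 8, 10]                             # observed members of the target concept
randoms   = [random.randint(0, 20) for _ in range(100)]  # random examples treated as negatives

def score(hypothesis):
    covered_pos = sum(hypothesis(x) for x in positives)
    covered_rnd = sum(hypothesis(x) for x in randoms)
    return covered_pos - covered_rnd * len(positives) / len(randoms)

trivial = lambda x: True                 # "anything belongs to the concept"
even    = lambda x: x % 2 == 0
print(score(trivial), score(even))       # the trivial definition typically scores worse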
17 Active Learning
- Learner decides which training data to receive (i.e. generates training examples and uses an oracle to classify them). (Thompson et al. 1999)
- Closed Loop ML: learner suggests a hypothesis and verifies it experimentally. If the hypothesis is rejected, the collected data gives rise to a new hypothesis. (Bryant and Muggleton 2000)
18 Machine Learning vs. Learning Agents
[Diagram: a spectrum from learning as the only goal to learning as one of many goals: Classic Machine Learning → Active Learning → Closed Loop Machine Learning → Learning Agent(s).]
19 Integrating Machine Learning into the Agent Architecture
- Time constraints on learning
- Synchronisation between agents' actions
- Learning and recall
- Timing analysis of theories learned
20 Time Constraints on Learning
- Machine Learning alone:
- predictive accuracy matters, time doesn't (just a price to pay)
- ML in Agents:
- soft deadlines: resources must be shared with other activities (perception, planning, control)
- hard deadlines imposed by the environment: "Make up your mind now!"
21 Doing Eager vs. Lazy Learning under Time Pressure
- Eager Learning:
- theories are typically more compact
- and faster to use
- takes more time to learn: do it when the agent is idle
- Lazy Learning:
- knowledge acquired at (almost) no cost
- may be much slower when a test example comes
22 Any-Time Learning
- Consider two types of algorithms:
- Algorithms where running a prescribed number of steps guarantees finding a solution
- can use worst-case complexity analysis to find an upper bound on the execution time
- Any-time algorithms:
- a longer run may result in a better solution
- don't know an optimal solution when they see one
- example: Genetic Algorithms
- policies: halt learning to meet hard deadlines or when cost outweighs the expected improvements in accuracy
23 Time Constraints on Learning in Simulated Environments
- Consider various cases:
- unlimited time for learning
- upper bound on time for learning
- learning in real time
- Gradually tightening the constraints makes integration easier.
- Not limited to simulations: real-world problems have a similar setting, e.g. various types of auctions.
24 Learning and Recall
- Agent must strike a balance between:
- Learning, which updates the model of the world
- Recall, which applies the existing model of the world to other tasks
25 Learning and Recall (2)
[Diagram: a cycle of updating sensory information, recalling the current model of the world to choose and carry out an action, and learning a new model of the world.]
- In theory, the two can run in parallel.
- In practice, they must share limited resources.
26 Learning and Recall (3)
- Possible strategies:
- parallel learning and recall at all times
- mutually exclusive learning and recall
- after incremental, eager learning, examples are discarded
- or kept if batch or lazy learning is used
- cheap on-the-fly learning (preprocessing), plus off-line, computationally expensive learning
- reduce raw information, change the object language
- analogy with human learning and the role of sleep
27 Timing Analysis of Theories Learned: Example
- (Kazakov, PhD Thesis)
- Beware of phase transition-like behaviour:
- left: simple theory with low coverage succeeds or quickly fails → high speed
- middle: medium coverage, fragmentary theory, lots of backtracking → low speed
- right: general theory with high coverage, less backtracking → high speed
28 Types of Learning Task
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
29 Reinforcement Learning
- Agent learns from environmental feedback indicating the benefit of states.
- No explicit teacher required.
- Learning target: the optimal policy (i.e. a state-action mapping).
- Optimality measure: e.g. cumulative discounted reward.
30 Q Learning
- Reinforcement learning algorithm.
- Most popular agent learning technique.
- Values (discounted cumulative rewards) of state-action pairs are stored in a Q-table.
- The optimal policy is easily derived from the Q-table.
31 Q Learning
Value of a state: the discounted cumulative reward
  V^π(s_t) = Σ_{i ≥ 0} γ^i r(s_{t+i}, a_{t+i})
where 0 ≤ γ < 1 is a discount factor (γ = 0 means that only the immediate reward is considered), and r(s_{t+i}, a_{t+i}) is the reward determined by performing the actions specified by policy π.
  Q(s, a) = r(s, a) + γ V*(δ(s, a))
Optimal policy:
  π*(s) = argmax_a Q(s, a)
32 Example
[Figure: a start state s0 with two routes, a short one with an immediate reward of 100 and a longer one yielding rewards 50, 0 and then 100.]
33 Example (cont.)
- V_short(s0) = 100
- V_long(s0) = 50 + γ·0 + γ²·100
- γ = 0.8: V_long(s0) = 114 > 100, so π* = long
- γ = 0.5: V_long(s0) = 75 < 100, so π* = short
34 Q Learning
Initialize all Q(s, a) to 0.
In some state s, choose some action a. Let s' be the resulting state and r the reward received.
Update Q (sketched below):
  Q(s, a) ← r + γ max_{a'} Q(s', a')
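A minimal tabular Q-learning sketch of the update above. The four-state chain environment, the purely random exploration and the constants are illustrative assumptions, not part of the tutorial.

import random
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)                      # Q[(state, action)], initialised to 0

def step(s, a):
    # toy chain environment: states 0..3, action 0 = left, 1 = right,
    # reward 100 on reaching the goal state 3
    s2 = max(0, s - 1) if a == 0 else min(3, s + 1)
    r = 100 if s2 == 3 else 0
    return s2, r

for episode in range(200):
    s = 0
    while s != 3:
        a = random.choice([0, 1])           # pure exploration for simplicity
        s2, r = step(s, a)
        Q[(s, a)] = r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
        s = s2

policy = {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in range(3)}
print(policy)                               # -> {0: 1, 1: 1, 2: 1}: always move right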
35 Q Learning
- Guaranteed convergence towards the optimum (state-action pairs have to be visited infinitely often).
- The exploration strategy can speed up convergence (more on this later).
- Basic Q learning does not generalize: replace the state-action table with a function approximator (e.g. a neural net) in order to handle unseen states.
36 Learning in Multi-Agent Systems: Important Issues
- Classification
- Social Awareness
- Communication
- Role Learning
- Distributed Learning
- Focus: Learning of Coordination
37 A Brief History
[Diagram: disembodied ML (Machine Learning) → single-agent learning (single-agent system) → multiple single-agent learners (multiple single-agent system) → social multi-agent learners (social multi-agent system); the latter three fall under Agents.]
38 Types of Multi-Agent Learning [Weiss & Dillenbourg 99]
- Multiplied Learning: no interference in the learning process by other agents (except for the exchange of training data or outputs).
- Divided Learning: division of the learning task on a functional level.
- Interacting Learning: cooperation beyond the pure exchange of data.
39 Social Awareness
- Awareness of the existence of other agents and (eventually) knowledge about their behavior.
- Not necessary to achieve near-optimal MAS behavior: rock sample collection [Steels 89].
- Can it degrade performance?
40 Levels of Social Awareness [Vidal & Durfee 97]
- 0-level agent: no knowledge about the existence of other agents.
- 1-level agent: recognizes that other agents exist; models other agents as 0-level.
- 2-level agent: has some knowledge about the behavior of other agents; models other agents as 1-level agents.
- k-level agent: models other agents as (k-1)-level.
41 Social Awareness and Q Learning
- 0-level agents already learn implicitly about other agents.
- [Mundhe & Sen 00]: study of two Q-learning agents up to level 2.
- Two 1-level agents display the slowest and least effective learning (worse than two 0-level agents).
42 Agent Models and Q Learning
- Q: S × A^n → R, where n is the number of agents.
- If the other agents' actions are not observable, an assumption about their actions is needed.
- Pessimistic assumption: given an agent's action choice, the other agents will minimize the reward.
- Optimistic assumption: the other agents will maximize the reward (both assumptions are sketched below).
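A small sketch of how an agent might project a joint-action value table onto its own actions under the two assumptions; the state, action names and values are illustrative.

# Projecting a joint-action value table Q[state][(a_self, a_other)] onto the
# agent's own actions under the pessimistic or optimistic assumption.
Q = {'s0': {('a1', 'b1'): 5, ('a1', 'b2'): 0,
            ('a2', 'b1'): 2, ('a2', 'b2'): 3}}

def value(state, my_action, assume='optimistic'):
    payoffs = [q for (mine, _), q in Q[state].items() if mine == my_action]
    return max(payoffs) if assume == 'optimistic' else min(payoffs)

for a in ('a1', 'a2'):
    print(a, value('s0', a, 'optimistic'), value('s0', a, 'pessimistic'))
# optimistic:  a1 -> 5, a2 -> 3   (choose a1)
# pessimistic: a1 -> 0, a2 -> 2   (choose a2: overly cautious)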
43 Agent Models and Q Learning
- The pessimistic assumption leads to overly cautious behavior.
- The optimistic assumption guarantees convergence towards the optimum [Lauer & Riedmiller 00].
- If knowledge of the other agents' behavior is available, the Q-value update can be based on a probabilistic computation [Claus & Boutilier 98], but with no guarantee of optimality.
44 Q Learning and Communication [Tan 93]
- Types of communication
- Sharing sensation
- Sharing or merging policies
- Sharing episodes
- Results
- Communication generally helps
- Extra sensory information may hurt
45 Role Learning
- It is often useful for agents to specialize in specific roles for joint tasks.
- Pre-defined roles reduce flexibility; the optimal distribution is often not easy to define and may be expensive.
- How to learn roles?
- [Prasad et al. 96]: learn the optimal distribution of pre-defined roles.
46 Q Learning of Roles
- [Crites & Barto 98], elevator domain: regular Q learning, no specialization achieved (but highly efficient behavior).
- [Ono & Fukumoto 96], hunter-prey domain: specialization achieved with the "greatest mass" merging strategy.
47 Q Learning of Roles [Balch 99]
- Two main types of reward function: local and global.
- Global reward supports specialization.
- Local reward supports the emergence of homogeneous behaviors.
- Some domains benefit from learning team heterogeneity (e.g. robotic soccer), others do not (e.g. multi-robot foraging).
- Heterogeneity measure: social entropy.
48 Distributed Learning
- Motivation: agents learning a global hypothesis from local observations.
- Application of MAS techniques to (inductive) learning.
- Applications: distributed data mining [Provost & Kolluri 99], robotic soccer.
49 Distributed Data Mining
- [Provost & Hennessy 96]: individual learners see only a subset of all training examples and compute a set of local rules based on these.
- Local rules are evaluated by the other learners based on their data.
- Only rules with a good evaluation are carried over to the global hypothesis (see the sketch below).
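A sketch in the spirit of the scheme just described: each learner induces candidate rules from its local subset, the other learners evaluate them on their own data, and only well-evaluated rules enter the global hypothesis. The data, candidate rules and the 0.9 threshold are illustrative assumptions.

local_data = [
    [(2, True), (4, True), (5, False)],      # learner 1's examples: (x, label)
    [(6, True), (7, False), (8, True)],      # learner 2
    [(3, False), (10, True), (12, True)],    # learner 3
]

candidate_rules = {'even(x)': lambda x: x % 2 == 0,
                   'x > 4':   lambda x: x > 4}

def accuracy(rule, data):
    return sum(rule(x) == label for x, label in data) / len(data)

global_hypothesis = []
for name, rule in candidate_rules.items():
    scores = [accuracy(rule, d) for d in local_data]     # every learner evaluates the rule
    if sum(scores) / len(scores) >= 0.9:
        global_hypothesis.append(name)

print(global_hypothesis)                                  # -> ['even(x)']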
50 Learning to Coordinate
- Good coordination is crucial for good MAS performance.
- Example: a soccer team.
- Pre-defined coordination protocols are often difficult to define in advance.
- Needed: learning of coordination.
- Focus: Q-learning of coordination.
51 Soccer Formation
52 Soccer Formation Control
- Formation control is a coordination problem.
- Good formations and set-plays seem to be a strong factor in winning teams.
- To date: pre-defined.
- Can (near-)optimal formations be (reinforcement) learned?
53 A Sub-Problem
- Given: n agents at random positions, and a formation having n positions.
- Wanted: a set of n policies that transforms the initial state into the desired formation.
- Specifically: Q learning of these policies.
54 A Further Simplification
- MAS policy: a decision procedure for who takes which position.
- No two agents should choose the same formation position.
- The problem reduces to reinforcement learning of coordination in cooperative games.
55 Cooperative Games
- Players perform actions simultaneously.
- Afterwards, all players receive the same reward based on the joint action.
Example payoff matrix (Player 1 chooses the row, Player 2 the column):
              Player 2
              A1   A2
  Player 1 A1  5    3
           A2  2    0
56 Cooperative Games and Formations
- Consider a 2-player formation with 2 positions: left, right.
- Corresponding cooperative game (see the payoff sketch below):
              Player 2
              left  right
  Player 1 left   0     5
           right  5     0
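The same two-position formation game written out as a payoff table in Python, as a small sanity check of the mapping from formation positions to joint rewards; the dictionary representation is an illustrative choice.

# Both agents receive 5 when they pick different positions, 0 otherwise.
payoff = {('left', 'left'): 0, ('left', 'right'): 5,
          ('right', 'left'): 5, ('right', 'right'): 0}

def reward(joint_action):
    return payoff[joint_action]         # the same reward for both players

print(reward(('left', 'right')))        # -> 5: a valid formation
print(reward(('left', 'left')))         # -> 0: both chose the same position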
57 Learning in Cooperative Games
- To date, the focus has been on Q-learning.
- Is communication/observation amongst agents necessary?
- Does this requirement change with increasing difficulty of the cooperative game?
58 Convergence
- Single-agent Q-learning: guaranteed convergence (to the optimum).
- Multi-agent Q-learning: more assumptions needed.
- Crucial in MAS: the action selection strategy.
59 Q Learning Revisited
- Modified Q update function:
  Q(a) ← Q(a) + α (r - Q(a))
- Boltzmann action selection strategy:
  P(a) = e^{EV(a)/T} / Σ_b e^{EV(b)/T}
60 Boltzmann Exploration
- Usually EV(a) = Q(a).
- Trade-off between exploration and exploitation.
- A higher temperature T results in more emphasis on exploration.
- The temperature T should be high at first and lowered with time, e.g. T(t) = e^{-st} (sketched below).
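A sketch of Boltzmann action selection with an exponentially decaying temperature. The decay rate s, the initial temperature and the temperature floor are illustrative assumptions rather than values from the tutorial.

import math, random

def boltzmann_choice(ev, T):
    # ev: dict action -> estimated value; samples an action with probability
    # proportional to exp(EV(a)/T)
    weights = {a: math.exp(v / T) for a, v in ev.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a

def temperature(t, s=0.01, t_max=50.0):
    # high at first, lowered with time; the floor avoids division blow-ups
    return max(t_max * math.exp(-s * t), 0.1)

ev = {'a1': 5.0, 'a2': 3.0}
print(boltzmann_choice(ev, temperature(0)))     # early: near-uniform exploration
print(boltzmann_choice(ev, temperature(1000)))  # late: nearly greedy exploitation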
61 Q Learning of Coordination
- [Singh et al. 2000]: convergence to some joint action can be ensured with specific temperature properties.
- Convergence to the optimal joint action for simple cases:
              Player 2
              A1   A2
  Player 1 A1  5    3
           A2  2    0
62 Difficult Cooperative Games
- Climbing Game [Claus & Boutilier 98]:
              Player 2
              a     b    c
  Player 1 a  11   -30   0
           b -30     7   6
           c   0     0   5
63 Climbing Game
- Multiplied Q learning with Boltzmann exploration converges to the suboptimal (c, c).
- [Claus & Boutilier 98]: Joint Action Learners (JAL).
- Agents observe each other's actions and build a probabilistic model, according to which the next action is chosen.
- Agents get to (b, b) but are stuck there.
64 Climbing Game (cont.)
- Optimistic assumption [Lauer & Riedmiller 00]: never reduce Q-values due to penalties (sketched below).
- Converges quickly to the optimal (a, a).
- However, it does not converge on the stochastic version of the climbing game.
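A sketch of the optimistic update in the spirit of [Lauer & Riedmiller 00] for a single-state repeated game; the observed reward sequence is invented for illustration.

# Each agent keeps Q over its OWN actions only and never lowers an estimate,
# so penalties caused by the other agent's exploration are ignored.
# (Single-state game, so no discounted successor term is needed.)
def optimistic_update(Q, action, reward):
    Q[action] = max(Q[action], reward)
    return Q

Q1 = {'a': 0, 'b': 0, 'c': 0}
# rewards observed while the other agent explores in the climbing game
for action, reward in [('a', -30), ('a', 11), ('b', 7), ('c', 5), ('a', -30)]:
    optimistic_update(Q1, action, reward)

print(Q1)   # -> {'a': 11, 'b': 7, 'c': 5}: the greedy choice is the optimal 'a'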
65 Stochastic Climbing Game
Each cell shows the two possible payoffs:
              Player 2
              a        b       c
  Player 1 a  12/10    0/-60   0/-60
           b   0/-60  14/0     8/4
           c   5/-5    5/-5    7/3
66 FMQ Heuristic
- [Kapetanakis & Kudenko 02]:
  EV(a) = Q(a) + c · freq(maxR(a)) · maxR(a)
- EV(a) carries information on how frequently an action produces its maximum corresponding reward (see the sketch below).
- Converges to the optimal (a, a) for the climbing game and the partially stochastic climbing game.
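A sketch of how the FMQ value might be computed from simple per-action counters; the counter values, Q estimates and the weight c are invented for illustration.

# The value fed into Boltzmann selection is boosted by how often an action has
# produced its maximum observed reward.
def fmq_ev(action, Q, max_reward, max_count, total_count, c=10.0):
    freq = max_count[action] / max(total_count[action], 1)
    return Q[action] + c * freq * max_reward[action]

Q           = {'a': -5.0, 'b': 6.0, 'c': 5.0}   # ordinary Q estimates
max_reward  = {'a': 11,   'b': 7,   'c': 5}     # best reward seen per action
max_count   = {'a': 7,    'b': 5,   'c': 10}    # times that best reward occurred
total_count = {'a': 10,   'b': 10,  'c': 10}    # times the action was tried

for a in 'abc':
    print(a, fmq_ev(a, Q, max_reward, max_count, total_count))
# 'a' gets the highest EV despite its poor average, steering both agents
# towards the optimal joint action (a, a).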
67 Partially Stochastic Climbing Game
              Player 2
              a        b       c
  Player 1 a  11       0/-60   0/-60
           b   0/-60  14/0     8/4
           c   5/-5    5/-5    7/3
68 Difficult Cooperative Games
- Penalty Game [Claus & Boutilier 98]:
              Player 2
              a    b    c
  Player 1 a  10   0    k
           b   0   2    0
           c   k   0   10
69 Penalty Game
- JAL: convergence to the optimal (a, a) or (c, c) only for small penalties k (k > -20).
- Both the optimistic assumption and FMQ converge to either optimum also for large penalties (up to 100).
70 Learning of Coordination: More Questions
- Scaling up of Q-learning approaches?
- Agents with state [Boutilier 99].
- Large numbers of actions/agents?
- Learning of formations from non-explicit rewards?
71 Learning of Coordination: Conclusions
- Idealized and simple cases have been studied and solved.
- Mutual communication/observation may not be needed.
- Beyond Q learning: evolutionary approaches [Quinn 01].
72 Learning and Natural Selection
- In learning, search is trivial; choosing the right bias is hard.
- But the choice of learning bias is always external to the learner!
- To find the best-suited bias, one could combine arbitrary choices of bias with evolution and natural selection of the fittest individuals.
73 Darwinian vs. Lamarckian Evolution
- Darwinian evolution: nothing learned by the individual is encoded in the genes and passed on to the offspring.
- The Baldwin effect: learning abilities (good biases) are selected because they give the individual a better chance in a dynamic environment.
- What is passed on to the offspring is useful, but very general.
74 Darwinian vs. Lamarckian Evolution (2)
- Lamarckian evolution: individual experience acquired in life can be inherited.
- Not the case in nature.
- That doesn't mean we can't use it.
- The inherited concepts may be too specific and not of general importance.
75 Learning and Language
- Language uses concepts which are:
- specific enough to be useful to most/all speakers of that language
- general enough to correspond to shared experience (otherwise, how would one know what the other is talking about!)
- The concepts of a language serve as a learning bias which is inherited not in genes but through education.
76 Communication and Learning
- Language:
- helps one learn (in addition to inherited biases)
- allows one to communicate knowledge.
- Distinguish between:
- Knowledge: things that one can explain to another by means of a language.
- Skills: the rest; they require individual learning and cannot be communicated.
77 Communication and Learning
- In language learning, "forgetting" examples may be harmful (van den Bosch et al.).
- "An expert is someone who does not think anymore; he knows." (Frank Lloyd Wright)
- It may be difficult to communicate what one has learned because of:
- limited bandwidth (for lazy learning)
- the absence of appropriate concepts in the language (for black-box learning)
78 Communication and Learning
- In a society of communicating agents, less accurate white-box learning may be better than more accurate but expensive learning that cannot be communicated, since the reduced performance could be outweighed by the much lower cost of learning.
79 Stochastic Simulation of Inherited Kinship-Driven Altruism
Heather Turner and Dimitar Kazakov
- Assess the rôle of a hypothetical inherited feature (gene) promoting altruism between relatives as a factor for survival, in the context of a simulated MAS employing natural selection.
- Studies the link between evolution and co-operation.
80 Altruism
- Definition: a selfless behaviour/action that will provide benefit to another at no gain, or to the detriment, of the actor.
- Kinship-driven altruism: altruistic behaviour directed to relatives.
81 Natural Selection of Inherited Behaviour
- Classical Darwinism:
- survival of the fittest individuals
- fitness is the ability to reproduce
- actions hindering fitness disappear
- Neo-Darwinism (Dawkins, Hamilton):
- genes rather than individuals are selected
- inclusive fitness of all copies of a gene
- actions increasing inclusive fitness are promoted
82 Altruism and Natural Selection
- Classical Darwinism: altruism hinders one's fitness, hence it should be demoted by natural selection.
- Neo-Darwinism: altruistic acts can positively influence the inclusive fitness of a gene in the population, e.g. kinship-driven altruism.
83 Kinship-Driven Altruism: Example
[Figure: two scenarios, Case A and Case B, comparing the genetic cost of an individual's self-sacrifice for its relatives.]
84 Kinship-Driven Altruism: Example
- A self-sacrifice is justified as it removes one rather than 1.5 copies of each gene.
- Hamilton (1964) presents a detailed analytical model and supports it with evidence from nature.
85 Kinship-Driven Altruism
- Extent of help = f(degree of kinship).
- Altruistic behaviour based on a well-chosen f could increase the inclusive fitness of all genes of its carriers, i.e. would help propagate them.
- If altruism were an inherited feature (gene), it could be propagated itself for the same reason.
86 One Day in the Life of an Agent
- Age, and maybe die.
- Mate.
- Hunt.
- Help a relative.
- Death, finding a mate, food or a relative are modelled as stochastic processes.
87 Altruistic Behaviour
- If you meet someone poorer than you:
- use your sharing function to decide how much you would give an identical twin,
- then reduce the amount according to the perceived degree of kinship (expected average percentage of shared genes), e.g. by half for a child.
88 Experiments: Degrees of Freedom
- Type of sharing function
- Model of the degree of kinship
- Initial ratio between selfish and altruistic individuals
- (Hunting and mating gambling policies are always subject to evolution.)
89 Type of Sharing Function
- Communism
- Progressive taxation with a non-taxable allowance
- Poll tax: pay the same amount pt, even if that kills you
- The parameters q, a, pt are inherited and subject to natural selection (see the sketch below).
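A sketch of the three sharing functions combined with the kinship scaling from the "Altruistic Behaviour" slide. The exact functional forms and the meaning of the inherited parameters q, a, pt are assumptions based on their names, not the forms used in the original simulation.

def communism(my_energy, their_energy):
    # equalise energy with an identical twin (assumed form)
    return max((my_energy - their_energy) / 2.0, 0.0)

def progressive_taxation(my_energy, their_energy, q=0.3, a=20.0):
    # give a fraction q of whatever exceeds the non-taxable allowance a (assumed form)
    return max((my_energy - a) * q, 0.0)

def poll_tax(my_energy, their_energy, pt=5.0):
    # pay the same fixed amount, even if that kills you
    return pt

def donation(sharing, my_energy, their_energy, kinship):
    # amount for an identical twin, scaled by the expected shared-gene fraction
    return sharing(my_energy, their_energy) * kinship

print(donation(progressive_taxation, 100.0, 10.0, 0.5))   # to a child: (100-20)*0.3*0.5 = 12.0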
90 Modelling the Degree of Kinship
- Royalty: the entire family tree is known; in practice, just two generations back and forth (e.g. 100% for self or an identical twin, 50% for a parent or full sibling, 25% for a grandparent or half sibling).
- Prediction: based on the similarity of visible inherited features.
- Unknown: indiscriminate, optimistic (share with everyone as if a child).
91 Results: Population Size
[Charts: population size over time for each combination of kinship model (Royalty, Prediction, Unknown) and sharing function (Communism, Progressive Taxation, Poll Tax).]
92 Results: Percentage of Altruists
[Charts: percentage of altruists over time for each combination of kinship model (Royalty, Prediction, Unknown) and sharing function (Communism, Progressive Taxation, Poll Tax).]
93 Results: Initial Percentage of Altruists
- Royalty model, progressive taxation, initial levels of altruists: 0%, 25%, 50%, 75%, 100%.
- All converge to the same ratio of altruists in the population.
94 Conclusions
- Perfect knowledge of the degree of kinship, or a sharing function based on progressive taxation, promotes altruism.
- Progressive taxation supports a more altruistic population than communism (in the limit) when knowledge of kinship is uncertain.
95 Contribution
- Replicates the natural phenomenon of kinship-driven altruism in a simulated MAS.
- Implements a model of natural selection different from the one commonly used in GA and MAS, and closer to nature.
96 Bibliography
- [Alonso et al. 02] Alonso, E., Kudenko, D. and Kazakov, D. (eds.). Proceedings of the Second Symposium on Adaptive Agents and Multi-Agent Systems. Imperial College, London, 2002. ISBN 1902956280.
- [Alonso & Kudenko 01] Alonso, E. and Kudenko, D. (eds.). Proceedings of the Symposium on Adaptive Agents and Multi-Agent Systems. University of York, UK, 2001.
- [Baldwin 1896] Baldwin, J.M. A new factor in evolution. The American Naturalist 30, 1896.
- [Bryant & Muggleton 00] Bryant, C.H. and Muggleton, S. Closed loop machine learning. Technical Report YCS 330, Department of Computer Science, University of York, Heslington, York, UK, 2000.
- [Boutilier 99] Boutilier, C. Sequential Optimality and Coordination in Multiagent Systems. IJCAI 99.
- [Claus & Boutilier 98] Claus, C. and Boutilier, C. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. AAAI 98.
- [Hamilton 64] Hamilton, W.D. The genetical evolution of social behaviour (I and II). Journal of Theoretical Biology, 1964.
- [Lauer & Riedmiller 00] Lauer, M. and Riedmiller, M. An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems. Proceedings of the 17th International Conference on Machine Learning, 2000.
- [Mitchell 97] Mitchell, T. Machine Learning. McGraw Hill, 1997.
- [Mundhe & Sen 00] Mundhe, M. and Sen, S. Evaluating Concurrent Reinforcement Learners. Proceedings of the Fourth International Conference on Multiagent Systems, IEEE Press, 2000.
- [Quinn 01] Quinn, M. Evolving communication without dedicated communication channels. ECAL '01, Springer LNCS 2159.
- [Rissanen 89] Rissanen, J. Stochastic Complexity in Statistical Enquiry. World Scientific Publishing Co, Singapore, 1989.
- [Thompson et al. 99] Thompson, C., Califf, M.E. and Mooney, R. Active learning for natural language parsing and information extraction. Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
- [Vidal & Durfee 97] Vidal, J.M. and Durfee, E. Agents Learning about Agents: A Framework and Analysis. Working Notes of the AAAI-97 Workshop on Multiagent Learning, 1997.
- [Weiss & Dillenbourg 99] Weiss, G. and Dillenbourg, P. What is "Multi" in Multi-Agent Learning? In P. Dillenbourg (ed.), Collaborative Learning: Cognitive and Computational Approaches. Pergamon Press, 1999.