Title: Combining Genetics, Learning and Parenting
1. Combining Genetics, Learning and Parenting
Michael Berger
Based on "When to Apply the Fifth Commandment: The Effects of Parenting on Genetic and Learning Agents" by Michael Berger and Jeffrey S. Rosenschein, submitted to AAMAS 2004
2. Abstract Problem
- Hidden state
- Metric defined over state space
- Condition C1: when the state changes, it is only to an adjacent state
- Condition C2: state changes occur at a low, but positive rate
3. The Environment
4. Agent Definitions
- Reward: Food Presence (0 or 1)
- Perception: <Position, Food Presence>
- Action ∈ {NORTH, EAST, SOUTH, WEST, HALT}
- Memory: <<Per, Ac>, ..., <Per, Ac>, Per>
- Memory length Mem: no. of elements in memory
- No. of possible memories: (2 · Grid Width · Grid Height)^Mem · 5^(Mem-1)
- MAM - Memory-Action Mapper (see the sketch below)
  - Table
  - One entry for every possible memory
- ASF - Action-Selection Filter
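A minimal Python sketch of the agent definitions above, to make the memory/MAM structure concrete. The grid size, the restriction to Mem = 1, the default action, and all identifiers are assumptions made for this sketch, not definitions from the paper.

```python
from itertools import product

# Illustrative constants (assumed here, matching the 20x20 grid used later).
GRID_WIDTH, GRID_HEIGHT = 20, 20
ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

# A perception is <Position, Food Presence>; a memory of length Mem interleaves
# Mem perceptions with Mem - 1 actions, hence the count formula above.
def count_possible_memories(mem_len):
    perceptions = 2 * GRID_WIDTH * GRID_HEIGHT      # food presence x position
    return perceptions ** mem_len * len(ACTIONS) ** (mem_len - 1)

# The MAM is a table with one entry per possible memory, mapping it to an action.
# For Mem = 1 a memory is just the latest perception.
def build_empty_mam():
    keys = product(range(GRID_WIDTH), range(GRID_HEIGHT), (0, 1))
    return {(x, y, food): "HALT" for x, y, food in keys}   # assumed default action

print(count_possible_memories(1))   # 800 possible memories for Mem = 1
```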
5. Genetic Algorithm (I)
- Algorithm operates on a complete population, not on a single agent
- Requires the introduction of generations
  - Every generation consists of a new group of agents
  - Each agent is created at the beginning of a generation, and is terminated at its end
- Agent's life cycle: Birth --> Run (foraging) --> Possible matings --> Death
6. Genetic Algorithm (II)
- Each agent carries a gene sequence
- Each gene has a key (memory) and a value (action)
- A given memory determines the resultant action
- Gene sequence remains constant during the life-time of an agent
- Gene sequence is determined at the mating stage of an agent's parents
7. Genetic Algorithm (III)
- Mating consists of two stages
- Selection stage - determining mating rights; should be performed according to two principles:
  - Survival of the fittest (as indicated by performance during the life-time)
  - Preservation of genetic variance
- Offspring creation stage
  - One or more parents create one or more offspring
  - Offspring inherit some combination of the parents' gene sequences
- Each of the stages has many variants
8. Genetic Algorithm Variant
- Selection
  - Will be discussed later
- Offspring creation (a minimal sketch follows this list)
  - Two parents mate and create two offspring
  - The gene sequences of the parents are aligned against one another, and then two processes occur:
    - Random crossover
    - Random mutation
  - The resultant pair of gene sequences is inherited by the offspring (one sequence by each offspring)
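A hedged Python sketch of this offspring-creation variant, assuming gene sequences are stored as memory-to-action dictionaries aligned on the same keys; the default probability values and names are placeholders, not taken from the paper.

```python
import random

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

def create_offspring(parent_a, parent_b, p_crossover=0.02, p_mutation=0.005):
    """parent_a, parent_b: dicts mapping memory (key) -> action (value),
    aligned on the same set of keys. Returns two offspring gene sequences."""
    child_a, child_b = dict(parent_a), dict(parent_b)
    for key in parent_a:
        # Random crossover: swap the aligned gene pair with probability p_crossover.
        if random.random() < p_crossover:
            child_a[key], child_b[key] = child_b[key], child_a[key]
        # Random mutation: independently replace each gene's action.
        for child in (child_a, child_b):
            if random.random() < p_mutation:
                child[key] = random.choice(ACTIONS)
    return child_a, child_b   # one sequence is inherited by each offspring
```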
9. Genetic Inheritance
10. Genetic Agent
- MAM
  - Every entry is considered a gene
  - First column - possible memory (key)
  - Second column - action to take (value)
  - No changes after creation
- Parameters
  - Memory length
  - Crossover probability for each gene pair
  - Mutation probability for each gene
11. Learning Algorithm
- Reinforcement Learning type algorithm
  - After performing an action, agents receive a signal informing them how good their choice of action was (in this case, the reward)
- Selected algorithm: Q-learning with Boltzmann exploration
12. Basic Q-Learning (I)
- γ: discount factor (non-negative, less than 1)
- r_j: reward at round j
- R_n = Σ_{j≥n} γ^(j-n) · r_j: discounted sum of rewards at round n
- Q-learning attempts to maximize an agent's expected discounted sum of rewards as a function of any given memory, at any round n
13. Basic Q-Learning (II)
- Q(s,a) - Q-value: the expected discounted sum of future rewards for an agent whose memory is s, when it selects action a and follows an optimal policy thereafter
- Q(s,a) is updated every time the agent selects action a at memory s. After executing the action, the agent receives reward r and holds the new memory s'. Q(s,a) is then updated with the standard one-step rule (with learning rate α):
  Q(s,a) <- Q(s,a) + α · [r + γ · max_a' Q(s',a') - Q(s,a)]
  (a small code sketch follows below)
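A minimal Python sketch of this tabular update; the defaultdict Q-table, the default parameter values, and the function name are assumptions for illustration only.

```python
from collections import defaultdict

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]
q_table = defaultdict(float)          # Q(s, a), initialised to 0

def q_update(s, a, r, s_next, alpha=0.2, gamma=0.95):
    """One-step Q-learning update after taking action a at memory s,
    receiving reward r and arriving at memory s_next."""
    best_next = max(q_table[(s_next, a_next)] for a_next in ACTIONS)
    q_table[(s, a)] += alpha * (r + gamma * best_next - q_table[(s, a)])
```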
14. Basic Q-Learning (III)
- Q(s,a) values can be stored in different forms
  - Neural network
  - Table (nicknamed a Q-table)
- When saved as a Q-table, each row corresponds to a possible memory s, and each column to a possible action a
- When an agent holds memory s, it should simply select the action a that maximizes Q(s,a) - right? WRONG!
15. Boltzmann Exploration (I)
- Full exploitation of a Q-value might hide other, better Q-values
- Exploration of Q-values is needed, at least in the early stages
- Boltzmann exploration: the probability of selecting action a_i is proportional to e^(Q(s,a_i)/t), i.e.
  P(a_i | s) = e^(Q(s,a_i)/t) / Σ_j e^(Q(s,a_j)/t)
  (see the sketch below)
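A small Python sketch of Boltzmann action selection over a Q-table, plus the greedy selection used once the temperature freezes; the data structures and names are illustrative assumptions.

```python
import math
import random

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

def boltzmann_select(q_table, s, t):
    """Pick an action with probability proportional to exp(Q(s, a) / t)."""
    weights = [math.exp(q_table.get((s, a), 0.0) / t) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

def greedy_select(q_table, s):
    """Full exploitation, used once t drops below the freezing temperature."""
    return max(ACTIONS, key=lambda a: q_table.get((s, a), 0.0))
```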
16. Boltzmann Exploration (II)
- t - an annealing temperature, given at round n by an annealing function t(n)
- t decreases => exploration decreases, exploitation increases
- For a given s, the probability of selecting its best Q-value approaches 1 as n increases
- The variant used here adds a freezing temperature: when t drops below it, exploration is replaced by full exploitation
17. Learning Agent
- MAM
  - A Q-table (dynamic)
- Parameters
  - Memory length
  - Learning rate
  - Rewards discount factor
  - Temperature annealing function
  - Freezing temperature
18. Parenting Algorithm
- There is no classical parenting algorithm around, so it needs to be simulated
- Selected algorithm: Monte-Carlo (another Reinforcement Learning type algorithm)
19. Monte-Carlo (I)
- Some similarity to Q-learning
  - A table (nicknamed an MC-table) stores values (MC-values) that describe how good it is to take action a given memory s
  - The table dictates a policy of action-selection
- Major differences from Q-learning
  - The table isn't modified after every round, but only after episodes of rounds (in our case, a generation)
  - Q-values and MC-values have different meanings
20. Monte-Carlo (II)
- Off-line version of Monte-Carlo
  - After completing an episode (generation) in which one table has dictated the action-selection policy, a new, second table is constructed from scratch to evaluate how good each action a is for a given memory s
  - The second table will dictate the policy in the next episode (generation)
  - Equivalent to considering the second table as being built during the current episode, as long as it isn't used in the current episode
21. Monte-Carlo (III)
- MC(s,a) is defined as the average of all rewards received after memory s was encountered and action a was selected
- What if (s,a) was encountered more than once?
  - Every-visit variant (sketched below)
    - The average of all subsequent rewards is calculated for each occurrence of (s,a)
    - MC(s,a) is the average of all the calculated averages
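A hedged Python sketch of this every-visit construction, assuming the episode is available as a list of (memory, action, reward) triples; whether the visiting step's own reward counts as "subsequent" is an assumption made here.

```python
from collections import defaultdict

def build_mc_table(episode):
    """episode: list of (memory, action, reward) triples for one generation.
    Returns the MC-table mapping (s, a) -> MC(s, a)."""
    per_visit_averages = defaultdict(list)
    for i, (s, a, _) in enumerate(episode):
        # Average of all rewards from this occurrence onward (assumption:
        # the reward of the visiting step itself is included).
        subsequent = [r for (_, _, r) in episode[i:]]
        per_visit_averages[(s, a)].append(sum(subsequent) / len(subsequent))
    # MC(s, a) is the average of the per-occurrence averages (every-visit).
    return {key: sum(avgs) / len(avgs) for key, avgs in per_visit_averages.items()}
```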
22. Monte-Carlo (IV)
- The every-visit variant is more suitable than the first-visit variant (where only the first encounter with (s,a) counts)
  - The environment can change a lot after the first encounter with (s,a)
- Exploration variants are not used here
  - For a given memory s, the action a with the highest MC-value is selected
  - Full exploitation is used here because we already have the experience of the previous episode of rounds
23. Parenting Agent
- MAM
  - An MC-table (doesn't matter if dynamic or static)
  - Dictates action-selection for offspring only
- ASF
  - Selects between the actions suggested by the two parents with equal chance
- Parameters
  - Memory length
24. Complex Agent (I)
- Contains a genetic agent, a learning agent and a parenting agent in a subsumption architecture
- Mating selection (the debt from before) occurs among complex agents
  - At a generation's end, each agent's average reward serves as its score
  - Agents receive mating rights according to score strata (determined by the scores' average and standard deviation), as sketched below
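A rough Python sketch of this selection step. The exact mapping from score strata to mating rights is not spelled out above, so the three strata and the number of matings per stratum below are assumptions for illustration only.

```python
import statistics

def assign_mating_rights(scores):
    """scores: dict mapping agent id -> average reward (eating-rate)."""
    mean = statistics.mean(scores.values())
    std = statistics.pstdev(scores.values())
    rights = {}
    for agent, score in scores.items():
        if score >= mean + std:
            rights[agent] = 2      # top stratum: assumed two matings
        elif score >= mean - std:
            rights[agent] = 1      # middle stratum: assumed one mating
        else:
            rights[agent] = 0      # bottom stratum: assumed none
    return rights
```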
25. Complex Agent (II)
- Mediates between the inner agents and the environment
  - Perceptions are passed directly to the inner agents
  - Actions suggested by all inner agents are passed through an ASF, which selects one of them (see the sketch below)
- Parameters
  - ASF's probability of selecting the genetic action
  - ASF's probability of selecting the learning action
  - ASF's probability of selecting the parenting action
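A minimal Python sketch of the complex agent's ASF, assuming the three probabilities sum to 1; the function and variable names are illustrative.

```python
import random

def asf_select(genetic_action, learning_action, parenting_action,
               p_gen, p_lrn, p_par):
    """Pick one of the three suggested actions with the ASF probabilities."""
    return random.choices(
        [genetic_action, learning_action, parenting_action],
        weights=[p_gen, p_lrn, p_par], k=1)[0]

# Example: the best combination found for the static environment, (0.7, 0, 0.3).
chosen = asf_select("NORTH", "EAST", "HALT", 0.7, 0.0, 0.3)
```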
26. Complex Agent - Mating (diagram)
27. Complex Agent - Perception (diagram)
28. Complex Agent - Action (diagram)
29. Experiment (I)
- Measures
  - Eating-rate: average reward of a given agent (throughout its generation)
  - BER: Best Eating-Rate (in a generation)
- Framework
  - 20 agents per generation
  - 9500 generations
  - 30000 rounds per generation
- Dependent variable
  - Success measure (Lambda): average of the BERs over the last 1000 generations (see the sketch below)
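These measures are simple enough to state as code; a small Python sketch under the assumption that per-agent reward histories are kept for each generation (all names are illustrative).

```python
def eating_rate(rewards):
    """Average reward of one agent over its generation."""
    return sum(rewards) / len(rewards)

def ber(generation_rewards):
    """Best Eating-Rate in a generation; generation_rewards is a list of
    per-agent reward lists."""
    return max(eating_rate(r) for r in generation_rewards)

def success_measure(bers_per_generation, tail=1000):
    """Lambda: average of the BERs over the last `tail` generations."""
    last = bers_per_generation[-tail:]
    return sum(last) / len(last)
```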
30. Experiment (II)
- Environment
  - Grid: 20 x 20
  - A single food patch, 5 x 5 in size
31. Experiment (III) - Agent parameter values
- Genetic agent: Memory length = 1, Crossover probability = 0.02, Mutation probability = 0.005
- Learning agent: Memory length = 1, Learning rate = 0.2, Rewards discount factor = 0.95, Temperature annealing function t(n) = 5 · 0.999^n, Freezing temperature = 0.2
- Parenting agent: Memory length = 1
32. Experiment (IV)
- Complex agent parameters: ASF probabilities (111 combinations)
- Environment parameter: Movement Probability - the probability that, in a given round, the food patch moves in a random direction (0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1)
- One run for each combination of values
33. Results: Static Environment
- Best combination
  - Genetic-Parenting hybrid (P_Lrn = 0)
  - P_Gen > P_Par
- Pure genetics does not perform well
  - The GA converges more slowly if not assisted by learning or parenting
- Pure parenting performs poorly
- For a given P_Par, success improves as P_Lrn decreases
(Graph for movement probability 0)
34. Results: Low Dynamic Rate
- Best combination
  - Genetic-Learning-Parenting hybrid
  - P_Lrn > P_Gen + P_Par
  - P_Par > P_Gen
- Pure parenting performs poorly
(Graph for movement probability 10^-4)
35. Results: High Dynamic Rate
- Best combination
  - Pure learning (P_Gen = 0, P_Par = 0)
- Pure parenting performs poorly
  - Parenting loses effectiveness
  - Non-parenting agents have better success
(Graph for movement probability 10^-2)
36. Conclusions
- Pure parenting doesn't work
- Agent algorithm A will be defined as an action-augmentor of agent algorithm B if:
  - A and B are always used for receiving perceptions
  - B is applied for executing an action in most steps
  - A is applied for executing an action in at least 50% of the other steps
- In a static environment (C1 & C2), parenting helps when used as an action-augmentor for genetics
- In slowly changing environments (C1 & C2), parenting helps when used as an action-augmentor for learning
- In quickly changing environments (C1 only), parenting doesn't work - pure learning is best
37. Bibliography (I)
- Genetic Algorithm
  - R. Axelrod. The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press, 1997.
  - H.G. Cobb and J.J. Grefenstette. Genetic algorithms for tracking changing environments. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 523-530, San Mateo, 1993.
- Q-Learning
  - T.W. Sandholm and R.H. Crites. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37:147-166, 1996.
- Monte-Carlo methods, Q-Learning, Reinforcement Learning
  - R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
38. Bibliography (II)
- Genetic-Learning combinations
  - G.E. Hinton and S.J. Nowlan. How learning can guide evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 447-454. Addison-Wesley, 1996.
  - T.D. Johnston. Selective costs and benefits in the evolution of learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 315-358. Addison-Wesley, 1996.
  - M. Littman. Simulations combining evolution and learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 465-477. Addison-Wesley, 1996.
  - G. Mayley. Landscapes, learning costs and genetic assimilation. Evolutionary Computation, 4(3):213-234, 1996.
39. Bibliography (III)
- Genetic-Learning combinations (cont.)
  - S. Nolfi, J.L. Elman and D. Parisi. Learning and evolution in neural networks. Adaptive Behavior, 3(1):5-28, 1994.
  - S. Nolfi and D. Parisi. Learning to adapt to changing environments in evolving neural networks. Adaptive Behavior, 5(1):75-98, 1997.
  - D. Parisi and S. Nolfi. The influence of learning on evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 419-428. Addison-Wesley, 1996.
  - P.M. Todd and G.F. Miller. Exploring adaptive agency II: Simulating the evolution of associative learning. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pages 306-315, San Mateo, 1991.
40. Bibliography (IV)
- Exploitation vs. Exploration
  - D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-Agent Systems, 2(2):141-172, 1999.
- Subsumption architecture
  - R.A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, March 1986.
41. Backup - Qualitative Data
42. Qual. Data: Mov. Prob. 0
(Graph comparing Pure Parenting, Pure Genetics, Pure Learning, and the best combination (0.7, 0, 0.3))
43. Qual. Data: Mov. Prob. 10^-4
(Graph comparing Pure Parenting, Pure Learning, and the best combination (0.03, 0.9, 0.07))
44. Qual. Data: Mov. Prob. 10^-2
(Graph comparing Pure Parenting, the combination (0.09, 0.9, 0.01), and the best: Pure Learning)