Title: Reinforcement Learning
1. KI2 - 11
Reinforcement Learning
Johan Everts
Kunstmatige Intelligentie / RuG
2. What is Learning?
- Learning takes place as a result of interaction between an agent and the world.
- The idea behind learning is that percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal.
3. Learning Types
- Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given.
- Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.
- Unsupervised learning: no information at all about the correct output is given.
4. Reinforcement Learning
- Task: learn how to behave successfully to achieve a goal while interacting with an external environment; learn through experience.
- Examples:
  - Game playing: the agent knows it has won or lost, but it doesn't know the appropriate action in each state.
  - Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
5Elements of RL
Agent
Policy
Environment
- Transition model, how action influence states
- Reward R, imediate value of state-action
transition - Policy ?, maps states to actions
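To make these elements concrete, here is a minimal Python sketch of a tiny deterministic world. The states, actions, and the dictionaries delta, r, and pi are all illustrative inventions matching the slide's notation, not anything from the original lecture.

```python
# A minimal sketch of the RL elements for a tiny deterministic world.
# All names (states, actions, delta, r, pi) are illustrative.

states = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Transition model delta: how actions influence states.
delta = {
    ("s0", "right"): "s1",
    ("s1", "right"): "s2",
    ("s1", "left"):  "s0",
    ("s2", "left"):  "s1",
}

# Reward r: immediate value of each state-action transition.
r = {key: 0.0 for key in delta}
r[("s1", "right")] = 100.0   # assumed: entering s2 is the goal

# Policy pi: maps states to actions.
pi = {"s0": "right", "s1": "right", "s2": "left"}
```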
6. Elements of RL
7. Elements of RL
- Value function V: maps states to state values.
- Discount factor γ ∈ [0, 1) (here 0.9).
[Figure: V(state) values for a grid world]
8. RL task (restated)
- Execute actions in the environment, observe the results.
- Learn an action policy π : state → action that maximizes the expected discounted reward
  E[r(t) + γ r(t+1) + γ^2 r(t+2) + ...]
  from any starting state in S (a small sketch of this quantity follows below).
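As an illustration of the quantity being maximized, a hedged sketch: the function below sums one observed reward sequence with discount γ = 0.9. The reward sequence is made up.

```python
# Illustrative computation of the discounted return
# r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ... for one reward sequence.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * r(t+k) over an observed reward sequence."""
    return sum(gamma ** k * rew for k, rew in enumerate(rewards))

print(discounted_return([0, 0, 100]))  # 0 + 0.9*0 + 0.81*100 = 81.0
```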
9. Reinforcement Learning
- Target function is π : state → action.
- RL differs from other function approximation tasks:
  - Partially observable states
  - Exploration vs. exploitation
  - Delayed reward → temporal credit assignment
10. Reinforcement Learning
- Target function is π : state → action.
- However, we have no training examples of the form <state, action>.
- Training examples are of the form <<state, action>, reward>.
11. Utility-based agents
- Try to learn V^π* (abbreviated V*) and perform lookahead search to choose the best action from any state s:
  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
- Works well if the agent knows
  - δ : state × action → state
  - r : state × action → ℝ
- When the agent doesn't know δ and r, it cannot choose actions this way (a model-based sketch follows below).
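A minimal sketch of this model-based lookahead, assuming the dict-based delta and r from the earlier sketch and a learned state-value table V (also a dict); the function name is illustrative.

```python
# One-step lookahead action choice, usable only when the agent
# knows the transition model delta and the reward function r.

def best_action(s, actions, delta, r, V, gamma=0.9):
    """Choose argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]."""
    available = [a for a in actions if (s, a) in delta]
    return max(available,
               key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
```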
12. Q-learning
- Define a new function, very similar to V*:
  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
- If the agent learns Q, it can choose the optimal action even without knowing δ or r.
- Using the learned Q:
  π*(s) = argmax_a Q(s, a)
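Assuming the Q-table is stored as a dict keyed by (state, action), as in the earlier sketches, that choice collapses to a plain table lookup:

```python
# Greedy action from a learned Q-table: pi*(s) = argmax_a Q(s, a).
# No transition model delta or reward function r is needed.

def greedy_action(s, actions, Q):
    return max(actions, key=lambda a: Q[(s, a)])
```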
13. Learning the Q-value
- Note that Q and V* are closely related:
  V*(s) = max_a' Q(s, a')
- This allows us to write Q recursively as
  Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')
14. Learning the Q-value
- FOR each <s, a> DO
  - Initialize table entry Q̂(s, a) ← 0
- Observe current state s
- WHILE (true) DO
  - Select an action a and execute it
  - Receive immediate reward r
  - Observe the new state s'
  - Update the table entry: Q̂(s, a) ← r + γ max_a' Q̂(s', a')
  - Move: s ← s' (record the transition from s to s')
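A runnable sketch of this loop for the deterministic toy world used earlier (dict-based delta and r). Action selection is random here purely for simplicity; all names are illustrative.

```python
import random

# Table-based Q-learning for a deterministic world, following the
# slide's update rule: Q(s, a) <- r + gamma * max_a' Q(s', a').

def q_learning(delta, r, start, gamma=0.9, steps=1000):
    Q = {sa: 0.0 for sa in delta}        # initialize all table entries
    s = start                            # observe current state s
    for _ in range(steps):
        # select an action available in s and execute it
        a = random.choice([x for (st, x) in delta if st == s])
        reward = r[(s, a)]               # receive immediate reward r
        s_next = delta[(s, a)]           # observe the new state s'
        # update the table entry for (s, a)
        future = [Q[sa] for sa in Q if sa[0] == s_next]
        Q[(s, a)] = reward + gamma * max(future, default=0.0)
        s = s_next                       # move: s <- s'
    return Q
```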
15. Q-learning
- Q-learning learns the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a)).
[Figures: grid world showing r(state, action) immediate reward values, Q(state, action) values, and V(state) values]
16. Q-learning
- Demonstration: http://iridia.ulb.ac.be/fvandenb/qlearning/qlearning.html
- eps: probability of using a random action instead of the optimal policy.
- gam: discount factor; the closer to 1, the more weight is given to future reinforcements.
- alpha: learning rate (see the sketch below).
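The demo's exact update is not shown on the slide; the sketch below assumes the common stochastic-world form of the update, which blends the old and new estimates with alpha instead of overwriting as on slide 14. It reuses the dict-based Q, delta, and r from the earlier sketches.

```python
import random

# One eps-greedy Q-learning step with the three demo parameters.

def eps_greedy_step(Q, s, delta, r, eps=0.1, gam=0.9, alpha=0.5):
    avail = [a for (st, a) in delta if st == s]   # actions available in s
    if random.random() < eps:                     # eps: explore at random
        a = random.choice(avail)
    else:                                         # otherwise act greedily
        a = max(avail, key=lambda x: Q[(s, x)])
    s_next = delta[(s, a)]
    future = [Q[sa] for sa in Q if sa[0] == s_next]
    target = r[(s, a)] + gam * max(future, default=0.0)
    # alpha blends the old estimate with the new target
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return s_next
```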
17. Temporal Difference Learning
- Q-learning estimates a one-time-step difference:
  Q^(1)(s(t), a(t)) = r(t) + γ max_a Q̂(s(t+1), a)
- Why not do the same for n steps?
  Q^(n)(s(t), a(t)) = r(t) + γ r(t+1) + ... + γ^(n-1) r(t+n-1) + γ^n max_a Q̂(s(t+n), a)
18. Temporal Difference Learning
- TD(λ) formula:
  Q^λ(s(t), a(t)) = (1 - λ) [ Q^(1)(s(t), a(t)) + λ Q^(2)(s(t), a(t)) + λ^2 Q^(3)(s(t), a(t)) + ... ]
- Intuitive idea: use a constant 0 ≤ λ ≤ 1 to combine estimates from various lookahead distances (note the normalization factor (1 - λ)).
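A small sketch of the combination itself, assuming the n-step estimates Q^(n) have already been computed; the numbers in the example are made up.

```python
# Combine n-step estimates into the TD(lambda) value:
# Q_lam = (1 - lam) * sum over n of lam**(n-1) * Q^(n),
# truncated at however many estimates are available.

def lambda_return(q_n_estimates, lam=0.5):
    """q_n_estimates[i] is the (i+1)-step estimate Q^(i+1)."""
    total = sum(lam ** n * q for n, q in enumerate(q_n_estimates))
    return (1 - lam) * total

# With lam = 0 only the one-step (Q-learning) estimate survives;
# as lam approaches 1, longer lookaheads dominate.
print(lambda_return([1.0, 2.0, 4.0], lam=0.5))  # 0.5*(1 + 1 + 1) = 1.5
```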
19. Genetic algorithms
- Imagine the individuals as agent functions and the fitness function as the performance measure or reward function.
- No attempt is made to learn the relationship between the rewards and the actions taken by an agent.
- The method simply searches directly in the space of individuals to find one that maximizes the fitness function.
20. Genetic algorithms
- Represent an individual as a binary string.
- Selection works like this: if individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y.
- Reproduction is accomplished by cross-over and mutation (see the sketch below).
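A minimal, self-contained sketch of this whole recipe: binary-string individuals, fitness-proportional selection, single-point cross-over, and bitwise mutation. The fitness function (counting 1-bits) and all parameter values are stand-ins.

```python
import random

def fitness(ind):                 # stand-in: count of 1-bits
    return sum(ind) + 1e-9        # small offset keeps weights positive

def evolve(pop_size=20, length=16, generations=50, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(ind) for ind in pop]
        new_pop = []
        for _ in range(pop_size):
            # selection: twice the fitness -> twice the chance of being picked
            mum, dad = random.choices(pop, weights=weights, k=2)
            cut = random.randrange(1, length)       # single-point cross-over
            child = mum[:cut] + dad[cut:]
            child = [b ^ 1 if random.random() < p_mut else b
                     for b in child]                # bitwise mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

print(evolve())   # best individual found after the final generation
```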
21. Cart Pole balancing
- Demonstration: http://www.bovine.net/jlawson/hmc/pole/sane.html
22. Summary
- RL addresses the problem of learning control strategies for autonomous agents.
- In Q-learning, an evaluation function over states and actions is learned.
- TD algorithms learn by iteratively reducing the differences between the estimates produced by the agent at different times.
- In the genetic approach, the relation between rewards and actions is not learned; the space of individuals is simply searched for one that maximizes the fitness function.