Reinforcement Learning

1
KI2 - 11
Reinforcement Learning
Johan Everts
Kunstmatige Intelligentie / RuG
2
What is Learning?
  • Learning takes place as a result of interaction
    between an agent and the world.
  • The idea behind learning is that percepts
    received by an agent should be used not only for
    acting, but also for improving the agent's
    ability to behave optimally in the future to
    achieve its goal.

3
Learning Types
  • Supervised learning
  • A situation in which sample (input, output)
    pairs of the function to be learned can be
    perceived or are given
  • Reinforcement learning
  • The agent acts on its environment and receives
    some evaluation of its action (reinforcement),
    but is not told which action is the correct one
    to achieve its goal
  • Unsupervised learning
  • No information at all about the correct output

4
Reinforcement Learning
  • Task
  • Learn how to behave successfully to achieve a
    goal while interacting with an external
    environment
  • Learn through experience
  • Examples
  • Game playing: the agent knows when it has won or
    lost, but it does not know the appropriate action
    in each state
  • Controlling a traffic system: the agent can
    measure the delay of cars, but does not know how
    to decrease it

5
Elements of RL
Agent
Policy
Environment
  • Transition model: how actions influence states
  • Reward R: the immediate value of a state-action
    transition
  • Policy π: maps states to actions
    (a minimal sketch of these elements follows below)
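A small illustration of these elements, assuming a made-up 1-D grid world
(states 0..4, goal at state 4); all names (STATES, ACTIONS, transition,
reward, policy) are illustrative and not part of the original slides:

    import random

    STATES = range(5)       # states of a tiny 1-D grid world
    ACTIONS = [-1, +1]      # move left / move right
    GOAL = 4

    def transition(state, action):
        """Transition model: how an action influences the state."""
        return max(0, min(GOAL, state + action))

    def reward(state, action):
        """Immediate reward R of the state-action transition."""
        return 100 if transition(state, action) == GOAL else 0

    def policy(state):
        """A policy maps states to actions (here simply a random one)."""
        return random.choice(ACTIONS)

    # one step of agent-environment interaction
    s = 0
    a = policy(s)
    print(s, a, reward(s, a), transition(s, a))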

6
Elements of RL
7
Elements of RL
  • Value function: maps states to state values
  • Discount factor γ ∈ [0, 1) (here 0.9)

(Figure: grid of V(state) values)
8
RL task (restated)
  • Execute actions in the environment,
  • observe the results.
  • Learn an action policy π : state → action that
    maximizes the expected discounted reward
    E[r(t) + γ r(t+1) + γ² r(t+2) + ...]
    from any starting state in S
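A minimal sketch of the discounted sum being maximized, truncated to a finite,
made-up reward sequence (the reward values are purely illustrative):

    def discounted_return(rewards, gamma=0.9):
        # r(t) + gamma * r(t+1) + gamma^2 * r(t+2) + ...
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 = 72.9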

9
Reinforcement Learning
  • Target function is π : state → action
  • RL differs from other function approximation
    tasks
  • Partially observable states
  • Exploration vs. exploitation
  • Delayed reward → temporal credit assignment

10
Reinforcement Learning
  • Target function is π : state → action
  • However
  • We have no training examples of the form
    <state, action>
  • Training examples are of the form
    <<state, action>, reward>

11
Utility-based agents
  • Try to learn V^π* (abbreviated V*)
  • Perform lookahead search to choose the best
    action from any state s
  • Works well if the agent knows
  • δ : state × action → state
  • r : state × action → ℝ
  • When the agent does not know δ and r, it cannot
    choose actions this way
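A minimal sketch of that lookahead choice, assuming the agent has access to the
transition function delta, the reward function r, and a table V of learned state
values (the names and gamma = 0.9 are assumptions, not from the original slides):

    def choose_action(s, ACTIONS, delta, r, V, gamma=0.9):
        # pick the action whose immediate reward plus discounted value of the
        # successor state is largest
        return max(ACTIONS, key=lambda a: r(s, a) + gamma * V[delta(s, a)])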

12
Q-learning
  • Q-learning
  • Define a new function Q, very similar to V*
  • If the agent learns Q, it can choose the optimal
    action even without knowing δ or r
  • Using the learned Q:
    π*(s) = argmax_a Q(s, a)
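A minimal sketch of acting with a learned Q: no δ or r is needed, only a table
Q indexed by (state, action) pairs (the dict representation is an assumption):

    def choose_action_q(s, ACTIONS, Q):
        # the greedy action is the one with the highest Q-value in state s
        return max(ACTIONS, key=lambda a: Q[(s, a)])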

13
Learning the Q-value
  • Note that Q and V* are closely related:
    V*(s) = max_a Q(s, a)
  • This allows us to write Q recursively as
    Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')

14
Learning the Q-value
  • FOR each <s, a> DO
  • Initialize the table entry Q(s, a) ← 0
  • Observe the current state s
  • WHILE (true) DO
  • Select an action a and execute it
  • Receive the immediate reward r
  • Observe the new state s'
  • Update the table entry for Q(s, a) as follows:
    Q(s, a) ← r + γ max_a' Q(s', a')
  • Move: s ← s'
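A minimal sketch of this loop in code, using the made-up 1-D grid world from
the earlier sketch (all names are illustrative; a deterministic world is
assumed, so no learning rate is needed):

    import random

    STATES, ACTIONS, GOAL = range(5), [-1, +1], 4
    transition = lambda s, a: max(0, min(GOAL, s + a))
    reward = lambda s, a: 100 if transition(s, a) == GOAL else 0

    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialize table entries
    gamma = 0.9

    s = 0
    for _ in range(2000):
        a = random.choice(ACTIONS)            # select an action a and execute it
        r = reward(s, a)                      # receive immediate reward r
        s2 = transition(s, a)                 # observe new state s'
        # update the table entry: Q(s,a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = 0 if s2 == GOAL else s2           # move on (restart after reaching the goal)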

15
Q-learning
  • Q-learning learns the expected utility of taking
    a particular action a in a particular state s
    (the Q-value of the pair (s, a))

(Figures: r(state, action) immediate reward values;
Q(state, action) values; V*(state) values)
16
Q-learning
  • Demonstration
  • http://iridia.ulb.ac.be/fvandenb/qlearning/qlearning.html
  • eps: probability of taking a random action instead
    of following the optimal policy
  • gam: discount factor; the closer to 1, the more
    weight is given to future reinforcements
  • alpha: learning rate
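A minimal sketch of how these three parameters could enter an implementation
(eps-greedy selection and a learning-rate update; Q and ACTIONS as in the
earlier sketches; all names are assumptions about the demo, not its code):

    import random

    def select_action(s, Q, ACTIONS, eps=0.1):
        """With probability eps take a random action, otherwise the greedy one."""
        if random.random() < eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    def update(Q, s, a, r, s2, ACTIONS, alpha=0.5, gam=0.9):
        """Move Q(s,a) a step of size alpha toward r + gam * max_a' Q(s',a')."""
        target = r + gam * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])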

17
Temporal Difference Learning
  • Q-learning estimates the difference over a single
    time step
  • Why not over n steps?

18
Temporal Difference Learning
  • TD(λ) formula:
    Q^λ(s_t, a_t) = (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ...]
  • Intuitive idea: use a constant 0 ≤ λ ≤ 1 to combine
    estimates from various lookahead distances (note the
    normalization factor (1 - λ))
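A minimal sketch of that weighting: combine n-step estimates Q^(1), Q^(2), ...
with weights (1 - λ)λ^(n-1). The estimates below are made-up numbers, and the
list is finite (truncated), so the weights only approximately sum to 1:

    def td_lambda_estimate(n_step_estimates, lam=0.5):
        # (1 - lam) * (Q1 + lam*Q2 + lam^2*Q3 + ...)
        return (1 - lam) * sum(lam ** n * q for n, q in enumerate(n_step_estimates))

    print(td_lambda_estimate([10.0, 12.0, 13.0]))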

19
Genetic algorithms
  • Imagine the individuals as agent functions
  • The fitness function acts as the performance
    measure or reward function
  • No attempt is made to learn the relationship
    between the rewards and the actions taken by an
    agent
  • The algorithm simply searches directly in the
    space of individuals to find one that maximizes
    the fitness function

20
Genetic algorithms
  • Represent an individual as a binary string
  • Selection works like this: if individual X scores
    twice as high as Y on the fitness function, then
    X is twice as likely to be selected for
    reproduction as Y
  • Reproduction is accomplished by cross-over and
    mutation (see the sketch below)
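A minimal sketch of that selection / cross-over / mutation cycle for
binary-string individuals, with a made-up fitness function (the number of
1-bits); everything here is illustrative:

    import random

    def fitness(ind):
        return sum(ind) + 1e-9             # count of 1-bits; small constant avoids zero weights

    def select(pop):
        # fitness-proportional selection: twice the fitness, twice the chance
        return random.choices(pop, weights=[fitness(i) for i in pop], k=1)[0]

    def crossover(x, y):
        p = random.randint(1, len(x) - 1)  # single-point cross-over
        return x[:p] + y[p:]

    def mutate(ind, rate=0.05):
        return [b ^ 1 if random.random() < rate else b for b in ind]

    pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
    for _ in range(50):
        pop = [mutate(crossover(select(pop), select(pop))) for _ in range(len(pop))]
    print(max(pop, key=fitness))           # best individual found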

21
Cart Pole balancing
  • Demonstration
  • http://www.bovine.net/jlawson/hmc/pole/sane.html

22
Summary
  • RL addresses the problem of learning control
    strategies for autonomous agents
  • In Q-learning, an evaluation function over states
    and actions is learned
  • TD algorithms learn by iteratively reducing the
    differences between the estimates produced by the
    agent at different times
  • In the genetic approach, the relation between
    rewards and actions is not learned; the space of
    individuals is simply searched for one that
    maximizes the fitness function