Title: Introduction to Reinforcement Learning
1. Introduction to Reinforcement Learning
2. Overview
- General principles of RL
- Markov Decision Process as model
- Values of states V(s)
- Values of state-actions Q(a,s)
- Exploration vs. Exploitation
- Issues in RL
- Conclusion
3. General principles of RL
- Neural networks are supervised learning algorithms: for each input, we know the desired output.
- What if we don't know the output for each input?
- Flight control system example
- Let the agent learn how to achieve certain goals
itself, through interaction with the environment.
4. General principles of RL
- Let the agent learn how to achieve certain goals
itself, through interaction with the environment.
- This does not solve the problem!
5. Popular model: MDPs
- Markov Decision Process (S, A, R, T)
- Set of states S
- Set of actions A
- Reward function R
- Transition function T
- Markov property
  - $T_{ss'}$ depends only on s and s', not on the history of earlier states
- Policy π: S → A
- Problem: find the policy π that maximizes the reward
- Discounted reward: $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^n r_n$ (a code sketch of such an MDP follows below)
[Diagram: agent-environment loop; from state s0 the agent takes action a0, receives reward r0, and transitions to state s1, where it takes action a1, and so on.]
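A minimal sketch of what such an MDP might look like as a data structure, using a tiny invented three-state example; the states, rewards, transition probabilities, and policy below are made up for illustration, not taken from the slides:

```python
# Hypothetical 3-state MDP, written out explicitly for illustration.
states = ["s0", "s1", "s2"]                   # S
actions = ["left", "right"]                   # A
reward = {"s0": 0.0, "s1": 0.0, "s2": 1.0}    # R(s)

# T[(s, a)] maps to a dict {s': P(s' | s, a)} -- the Markov property:
# the distribution over s' depends only on s and a, not on earlier history.
T = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
    ("s2", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 1.0},
}

# A deterministic policy pi: S -> A
policy = {"s0": "right", "s1": "right", "s2": "right"}

# Discounted return of a reward sequence r0, r1, ..., rn
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))  # 0.81
```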
6. Values of states V^π(s)
- Definition of value V^π(s) (written out below)
  - The cumulative (discounted) reward when starting in state s and executing policy π until a terminal state is reached.
- The optimal policy yields V*(s)
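One standard way to write this definition, using the discount factor γ and the reward sequence from slide 5:

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t}\, r_{t} \;\middle|\; s_0 = s,\; a_t = \pi(s_t) \right],
\qquad
V^{*}(s) \;=\; \max_{\pi} V^{\pi}(s)
```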
7. Determining V^π(s)
- Dynamic programming: $V(s) \leftarrow R(s) + \gamma \sum_{s'} T^{\pi(s)}_{ss'} V(s')$
  - Necessary to consider all states.
- TD-learning: $V(s) \leftarrow V(s) + \alpha\,(R(s) + \gamma V(s') - V(s))$
  - Only visited states are used (both updates are sketched in code below).
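A rough sketch of both update rules, reusing the toy MDP structures from the slide-5 sketch; the discount factor, learning rate, number of sweeps/episodes, and episode length are arbitrary illustrative choices:

```python
import random

GAMMA, ALPHA = 0.9, 0.1

# Dynamic programming: repeatedly sweep over ALL states, using the model T.
def dp_policy_evaluation(states, reward, T, policy, sweeps=100):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = reward[s] + GAMMA * sum(
                p * V[s2] for s2, p in T[(s, policy[s])].items())
    return V

# TD-learning: only states actually visited get updated; T is used here
# purely to simulate the environment, never inside the update itself.
def td_policy_evaluation(states, reward, T, policy, start, episodes=1000):
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = start
        for _ in range(20):                       # fixed-length episodes
            nxt = T[(s, policy[s])]
            s2 = random.choices(list(nxt), weights=list(nxt.values()))[0]
            V[s] += ALPHA * (reward[s] + GAMMA * V[s2] - V[s])
            s = s2
    return V
```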
8. Values of state-actions Q(a,s)
- Q-values Q(a,s): the value of doing action a in a certain state s.
- Dynamic programming: $Q(a,s) = R(s) + \gamma \sum_{s'} T_{ss'} \max_{a'} Q(a',s')$
- TD-learning:
  - $Q(a,s) \leftarrow Q(a,s) + \alpha\,(R(s) + \gamma \max_{a'} Q(a',s') - Q(a,s))$
  - T does not appear in this formula: model-free learning! (A code sketch follows below.)
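A tabular Q-learning sketch along the same lines; the learner only calls an opaque env_step function, so no transition model T appears in the update. The env_step name, α, γ, ε, episode counts, and fixed episode length are assumptions of this sketch, not from the slides:

```python
import random

def q_learning(states, actions, reward, env_step, start,
               episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """env_step(s, a) -> next state s'; the learner never looks inside it."""
    Q = {(a, s): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start
        for _ in range(20):
            # epsilon-greedy action choice (see the next slide)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(act, s)])
            s2 = env_step(s, a)
            best_next = max(Q[(a2, s2)] for a2 in actions)
            # Model-free update: only R(s), the sampled s2, and Q are used.
            Q[(a, s)] += alpha * (reward[s] + gamma * best_next - Q[(a, s)])
            s = s2
    return Q
```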
9. Exploration vs. Exploitation
- Only exploitation
- New (maybe better) paths never discovered
- Only exploration
- What is learned is never exploited
- Good trade-off
  - Explore first to learn, exploit later to benefit (one concrete scheme, ε-greedy, is sketched below)
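One common way to implement this trade-off is ε-greedy action selection with a decaying ε. The sketch below is a generic illustration; the function names, decay schedule, and constants are arbitrary choices, not from the slides:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)              # exploration
    return max(actions, key=lambda a: Q[(a, s)])   # exploitation

def epsilon_schedule(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Explore a lot early on, exploit more and more as learning progresses."""
    return max(eps_end, eps_start * decay ** episode)
```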
10. Some issues
- Hidden state
  - If you don't know where you are, you can't know what to do.
- Curse of dimensionality
  - Very large state spaces.
- Continuous state/action spaces
  - The algorithms above use discrete tables. What about continuous values?
- Many of your articles discuss solutions to these problems.
11. Conclusion
- RL: learning through interaction and rewards.
- Markov Decision Process: a popular model
- Values of states V(s)
- Values of state-actions Q(a,s) (model-free!)
- Still some problems... not quite ready for complex real-world problems yet, but research is underway!
12. Literature
- Artificial Intelligence: A Modern Approach
  - Stuart Russell and Peter Norvig
- Machine Learning
  - Tom M. Mitchell
- Reinforcement Learning: A Tutorial
  - Mance E. Harmon and Stephanie S. Harmon