1
Tutorial: Echo State Networks
  • Dan Popovici
  • University of Montreal (UdeM)

MITACS 2005
2
Overview
  1. Recurrent neural networks: a 1-minute primer
  2. Echo state networks
  3. Examples, examples, examples
  4. Open Issues
3
1 Recurrent neural networks
4
Feedforward vs. recurrent NN
[Figure: a feedforward network and a recurrent network, each with input and output]
Feedforward NN:
  • connections only "from left to right", no
    connection cycle
  • activation is fed forward from input to output
    through "hidden layers"
  • no memory
Recurrent NN:
  • at least one connection cycle
  • activation can "reverberate", persist even with
    no input
  • system with memory
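The "no memory" versus "reverberation" contrast can be illustrated with a single unit. This is a minimal sketch, assuming tanh units and made-up weights (`w_in`, `w_rec` are illustrative names and values, not taken from the slides).

```python
import math

def run_feedforward(inputs, w_in=1.0):
    """Memoryless unit: the output depends only on the current input."""
    return [math.tanh(w_in * u) for u in inputs]

def run_recurrent(inputs, w_in=1.0, w_rec=0.9):
    """Single unit with a self-connection, i.e. one connection cycle."""
    x, trace = 0.0, []
    for u in inputs:
        x = math.tanh(w_in * u + w_rec * x)  # previous activation feeds back
        trace.append(x)
    return trace

# One input pulse followed by silence.
pulse = [1.0] + [0.0] * 5
ff = run_feedforward(pulse)   # drops to zero as soon as the input stops
rec = run_recurrent(pulse)    # keeps a decaying trace of the pulse
```

After the pulse, `ff` is exactly zero while `rec` decays gradually, which is the sense in which the recurrent unit is a "system with memory".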

5
Recurrent NNs: main properties
  • input time series → output time series
  • can approximate any dynamical system (universal
    approximation property)
  • mathematical analysis difficult
  • learning algorithms computationally expensive and
    difficult to master
  • few application-oriented publications, little
    research

6
Supervised training of RNNs
A. Training
  • Teacher
  • Model
B. Exploitation
  • Input
  • Correct (unknown) output
  • Model (in, out)
7
Backpropagation through time (BPTT)
  • Most widely used general-purpose supervised
    training algorithm
  • Idea: 1. stack network copies, 2. interpret as
    feedforward network, 3. use backprop algorithm.

[Figure: original RNN and its stack of copies]
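The three-step idea can be sketched on the smallest possible case. This is a minimal illustration, assuming a scalar RNN x(t) = tanh(w x(t-1) + u(t)) with a squared error on the final state only; the function names and the loss are choices made for this example, not part of the slides.

```python
import math

def forward(w, inputs, x0=0.0):
    """Unroll the network: one 'copy' per time step, all sharing weight w."""
    xs = [x0]
    for u in inputs:
        xs.append(math.tanh(w * xs[-1] + u))
    return xs

def loss(w, inputs, target):
    return 0.5 * (forward(w, inputs)[-1] - target) ** 2

def bptt_gradient(w, inputs, target):
    """Backprop through the stack of copies, summing gradients for the shared w."""
    xs = forward(w, inputs)
    delta = xs[-1] - target                 # dLoss/dx at the last copy
    grad = 0.0
    for t in range(len(inputs), 0, -1):
        dz = delta * (1.0 - xs[t] ** 2)     # back through tanh of copy t
        grad += dz * xs[t - 1]              # contribution of copy t to dLoss/dw
        delta = dz * w                      # error passed down to copy t-1
    return grad
```

A finite-difference check such as `(loss(w + eps, ...) - loss(w - eps, ...)) / (2 * eps)` should agree with `bptt_gradient` to several digits.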
8
What are ESNs?
  • training method for recurrent neural networks
  • black-box modelling of nonlinear dynamical
    systems
  • supervised training, offline and online
  • exploits linear methods for nonlinear modeling

[Figure: previous training methods vs. ESN training]
9
Introductory example: a tone generator
  • Goal: train a network to work as a tunable tone
    generator

input: frequency setting
output: sines of desired frequency
10
Tone generator: sampling
  • During the sampling period, drive the fixed "reservoir"
    network with teacher input and output.
  • Observation: internal states of the dynamical
    reservoir reflect both input and output teacher
    signals

11
Tone generator: compute weights
  • Determine reservoir-to-output weights
    such that training output is optimally
    reconstituted from internal "echo" signals.

12
Tone generator: exploitation
  • With new output weights in place, drive trained
    network with input.
  • Observation: network continues to function as in
    training.
  • internal states reflect input and output
  • output is reconstituted from internal states
  • internal states and output create each other

13
Tone generator: generalization
  • The trained generator network also works with
    input different from training input

[Figure: A. step input; B. teacher and learned output; C. some internal states]
14
Dynamical reservoir
  • large recurrent network (100 - units)
  • works as "dynamical reservoir", "echo chamber"
  • units in DR respond differently to excitation
  • output units combine different internal dynamics
    into desired dynamics

15
Rich excited dynamics
Unit impulse responses should vary greatly.
Achieve this by, e.g.,
  • inhomogeneous connectivity
  • random weights
  • different time constants
  • ...

[Figure: excitation and unit responses]
16
Notation and Update Rules
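A minimal sketch of one common form of the ESN update rule, assuming tanh reservoir units and a linear readout. The names `W_in`, `W_back` and `W_out` and the array shapes are assumptions made for this example; the slides themselves only name W and mention input and feedback weights.

```python
import numpy as np

def esn_step(x, u, y_prev, W, W_in, W_back):
    """One reservoir update: x(n+1) = tanh(W x(n) + W_in u(n+1) + W_back y(n)).

    Assumed shapes: x (N,), u (K,), y_prev (L,), W (N, N), W_in (N, K), W_back (N, L).
    """
    return np.tanh(W @ x + W_in @ u + W_back @ y_prev)

def esn_output(x, W_out):
    """Linear readout: y(n) = W_out x(n), with W_out of shape (L, N)."""
    return W_out @ x
```

Only `W_out` would be adapted by training; `W`, `W_in` and `W_back` stay fixed after random initialization.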
17
Learning: basic idea
  • Every stationary deterministic dynamical system
    can be defined by an equation like

$$y(n) = h\bigl(u(n), u(n-1), \ldots;\; y(n-1), y(n-2), \ldots\bigr)$$

where the system function h might be a monster.
Combine h from the I/O echo functions by
selecting suitable DR-to-output weights.

18
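Offline training: task definition
Recall $y(n) = \sum_i w_i x_i(n)$.
Let $y_{\text{teach}}(n)$ be the teacher output.
Compute weights $w_i$ such that the mean square error
$$\frac{1}{N} \sum_{n=1}^{N} \Bigl( y_{\text{teach}}(n) - \sum_i w_i x_i(n) \Bigr)^2$$
is minimized.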
19
Offline training: how it works
  1. Let network run with training signal
    teacher-forced.
  2. During this run, collect network states
    $x(n)$ in matrix M.
  3. Compute weights $w$ such that
    $\lVert M w - T \rVert^2$ is minimized, where T
    collects the teacher outputs.

MSE-minimizing weight computation (step 3) is a
standard operation. Many efficient
implementations are available, offline/constructive
and online/adaptive.
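The three steps can be sketched in a few lines of NumPy. This is an illustrative toy setup, not the configuration used in the talk: the reservoir size, weight ranges, spectral radius, washout length and the sine teacher signal are all assumptions, and the network has output feedback only (no external input).

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_steps, washout = 50, 400, 100

# Fixed random reservoir; these weights are never trained.
W = rng.uniform(-0.5, 0.5, size=(n_units, n_units))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))    # scale to spectral radius 0.8
W_back = rng.uniform(-1.0, 1.0, size=n_units)      # output-to-reservoir feedback

y_teach = 0.5 * np.sin(np.arange(n_steps) / 4.0)   # teacher output (a slow sine)

# Steps 1 and 2: run teacher-forced and collect the network states.
x = np.zeros(n_units)
states = np.zeros((n_steps, n_units))
for n in range(1, n_steps):
    x = np.tanh(W @ x + W_back * y_teach[n - 1])   # teacher output is fed back
    states[n] = x
M = states[washout:]     # state-collecting matrix, initial transient discarded
T = y_teach[washout:]    # teacher outputs aligned with the rows of M

# Step 3: output weights that minimize the squared error ||M w - T||^2.
w_out, *_ = np.linalg.lstsq(M, T, rcond=None)
mse = np.mean((M @ w_out - T) ** 2)
```

`np.linalg.lstsq` is one of the many standard least-squares solvers; a pseudoinverse or an online method such as RLS could be substituted for step 3 without touching steps 1 and 2.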
20
Practical Considerations
Chosen randomly
  • Spectral radius of W < 1
  • W should be sparse
  • Input and feedback weights have to be scaled
    appropriately
  • Adding noise in the update rule can increase
    generalization performance
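The first two points can be sketched as follows. The helper name `make_reservoir`, the 10% density, the uniform weight range and the target radius 0.8 are assumptions chosen for illustration.

```python
import numpy as np

def make_reservoir(n_units, density=0.1, spectral_radius=0.8, seed=0):
    """Random sparse reservoir matrix rescaled to a given spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    W *= rng.random((n_units, n_units)) < density      # keep roughly `density` of the entries
    rho = np.max(np.abs(np.linalg.eigvals(W)))         # current spectral radius
    return W * (spectral_radius / rho)

W = make_reservoir(100)
```

Dividing by the largest absolute eigenvalue and multiplying by the target value is the usual way to set the spectral radius, since scaling a matrix scales all its eigenvalues by the same factor.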

21
Echo state network training, summary
  • use large recurrent network as "excitable
    dynamical reservoir (DR)"
  • DR is not modified through learning
  • adapt only DR output weights
  • thereby combine desired system function from I/O
    history echo functions
  • use any offline or online linear regression
    algorithm to minimize error

22
3 Examples, examples, examples
23
3.1 Short-term memories
24
Delay line scheme
25
Delay line: example
  • Network size: 400
  • Delays: 1, 30, 60, 70, 80, 90, 100, 103, 106, 120
    steps
  • Training sequence length: N = 2000

training signal: random walk with resting states
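Training data for such a delay-line task could be generated as below. The slides do not specify how the random walk or its resting states were produced, so the step size, rest probability, rest length and clipping range are assumptions; only the delays and the sequence length come from the slide.

```python
import random

def random_walk_with_rests(n_steps, step=0.05, rest_prob=0.02, rest_len=20, seed=0):
    """Bounded random walk that occasionally holds its value ("resting states")."""
    rng = random.Random(seed)
    signal, value, rest = [], 0.0, 0
    for _ in range(n_steps):
        if rest == 0 and rng.random() < rest_prob:
            rest = rest_len                      # start a resting period
        if rest > 0:
            rest -= 1                            # hold the current value
        else:
            value = max(-0.5, min(0.5, value + rng.uniform(-step, step)))
        signal.append(value)
    return signal

def delayed_targets(signal, delays):
    """Row n holds signal[n - d] for each delay d (0.0 before the signal starts)."""
    return [[signal[n - d] if n >= d else 0.0 for d in delays]
            for n in range(len(signal))]

delays = [1, 30, 60, 70, 80, 90, 100, 103, 106, 120]
u = random_walk_with_rests(2000)
targets = delayed_targets(u, delays)             # one teacher channel per delay
```

Each column of `targets` would serve as the teacher signal for one output unit.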
26
Results
  • correct delayed signals and network outputs

[Figure: panels for delays -1, -30, -60, -90, -100, -103, -106, -120; traces of some DR internal units]
27
Delay line: test with different input
  • correct delayed signals and network outputs

[Figure: panels for delays -1, -30, -60, -90, -100, -103, -106, -120; traces of some DR internal units]
28
3.2 Identification of nonlinear systems
29
Identifying higher-order nonlinear systems
A tenth-order system
30
  • Results: offline learning

augmented ESN (800 parameters): NMSE_test = 0.006
previous published state of the art 1): NMSE_train = 0.24
D. Prokhorov, pers. communication 2): NMSE_test = 0.004
1) Atiya & Parlos (2000), IEEE Trans. Neural
Networks 11(3), 697-708
2) EKF-RNN, 30 units, 1000 parameters.
31
The Mackey-Glass equation
  • delay differential equation
  • delay τ > 16.8: chaotic
  • benchmark for time series prediction

[Figure: time series for τ = 17 and τ = 30]
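A Mackey-Glass series can be generated numerically as sketched below. The equation dx/dt = 0.2 x(t-τ) / (1 + x(t-τ)^10) - 0.1 x(t) is the standard benchmark form; the simple Euler integration, the step size `dt`, the constant initial history and the washout length are assumptions of this sketch, not details from the talk.

```python
def mackey_glass(n_samples, tau=17.0, dt=0.1, x0=1.2, washout=1000):
    """Euler integration of the Mackey-Glass equation, sampled once per time unit."""
    delay_steps = int(round(tau / dt))
    per_sample = int(round(1.0 / dt))
    history = [x0] * (delay_steps + 1)           # constant initial history
    samples = []
    total = (washout + n_samples) * per_sample
    for k in range(total):
        x_now = history[-1]
        x_del = history[-1 - delay_steps]        # x(t - tau)
        x_next = x_now + dt * (0.2 * x_del / (1.0 + x_del ** 10) - 0.1 * x_now)
        history.append(x_next)
        # keep one value per time unit, after the washout period
        if (k + 1) % per_sample == 0 and (k + 1) // per_sample > washout:
            samples.append(x_next)
    return samples

series = mackey_glass(3000)                      # tau = 17: mildly chaotic regime
```

Calling `mackey_glass(3000, tau=30.0)` gives the more strongly chaotic variant.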
32
Learning setup
  • network size 1000
  • training sequence N = 3000
  • sampling rate 1

33
Results for τ = 17
  • Error for 84-step prediction: NRMSE = 1E-4.2
    (averaged over 100 training runs on independently
    created data)
  • With refined training method: NRMSE = 1E-5.1
  • previous best: NRMSE = 1E-1.7

[Figure: original and learnt model]
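The error figures are given as powers of ten, so 1E-4.2 means 10^-4.2. The sketch below assumes one common definition of the NRMSE (root mean square error divided by the standard deviation of the target); the exact normalization used in the talk is not stated on the slide.

```python
import math

def nrmse(predicted, target):
    """Root mean square error divided by the standard deviation of the target."""
    n = len(target)
    mean = sum(target) / n
    var = sum((t - mean) ** 2 for t in target) / n
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / n
    return math.sqrt(mse / var)

def from_log10(exponent):
    """Convert a value written as 1E<exponent> into a plain number."""
    return 10.0 ** exponent

esn_error = from_log10(-4.2)        # about 6.3e-5
refined_error = from_log10(-5.1)    # about 7.9e-6
previous_best = from_log10(-1.7)    # about 2.0e-2
```

Under this definition a predictor that always outputs the mean of the target has NRMSE 1, which gives a scale for reading the three values.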
34
Prediction with model
visible discrepancy after about 1500 steps
35
Comparison: NRMSE for 84-step prediction
[Figure: comparison chart, axis log10(NRMSE)]
*) data from survey in Gers / Eck / Schmidhuber
2000
36
3.3 Dynamic pattern recognition
37
Dynamic pattern detection 1)
Training signal: output jumps to 1 after
occurrence of pattern instance in input
1) see GMD Report Nr 152 for detailed coverage
38
Single-instance patterns, training setup
  1. A single-instance, 10-step pattern is randomly
    fixed.
  2. It is inserted into a 500-step random signal at
    positions 200 (for training) and 350, 400, 450, 500
    (for testing).
  3. A 100-unit ESN is trained on the first 300 steps
    (single positive instance! "single-shot
    learning") and tested on the remaining 200 steps.

[Figure: the pattern; test data: 200 steps with 4 occurrences of the pattern on random background; desired output: red impulses]
39
Single-instance patterns, results
  1. trained network response on test data
  2. network response after training on 800 more
    pattern-free steps ("negative examples")
  3. like 2., but 5 positive examples in training
    data
  4. comparison: optimal linear filter, DR 3.5

discrimination ratio (DR) values shown for the ESN: DR 12.4, DR 12.1, DR 6.4
40
Event detection for robots (joint work with
J. Hertzberg & F. Schönherr)
  • Robot runs through office environment,
    experiences data streams (27 channels) like...

[Figure: 10 sec of example data streams: infrared distance sensor; left motor speed; activation of "goThruDoor"; external teacher signal, marking event category]
41
Learning setup
27 (raw) data channels
unlimited number of event detector channels
100 unit RNN
  • simulated robot (rich simulation)
  • training run spans 15 simulated minutes
  • event categories like: pass through door,
    pass by 90° corner, pass by smooth corner

42
Results
  • easy to train event hypothesis signals
  • "boolean" categories possible
  • single-shot learning possible

43
Network setup in training
29 input channels code symbols
29 output channels for next symbol hypotheses
symbols: _ a ... z
400 units
44
Trained network in "text" generation
winning symbol is next input
decision mechanism, e.g. winner-take-all
45
Results
  • Selection by random draw according to output:

yth_upsghteyshhfakeofw_io,l_yodoinglle_d_upeiuttytyr_hsymua_doey_sammusos_trll,t.krpuflvek_hwiblhooslolyoe,_wtheble_ft_a_gimllveteud_ ...

  • Winner-take-all selection:

sdear_oh,_grandmamma,_who_will_go_and_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf ...
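The two decision mechanisms can be sketched as follows. How the talk turned raw output activations into draw probabilities is not stated; clipping negative activations to zero before drawing is an assumption of this sketch, as are the function names.

```python
import random

def winner_take_all(outputs):
    """Index of the output channel with the largest activation."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

def random_draw(outputs, rng):
    """Draw a channel index with probability proportional to its activation."""
    weights = [max(o, 0.0) for o in outputs]      # assumption: negatives count as zero
    if sum(weights) == 0.0:
        return rng.randrange(len(outputs))        # fall back to a uniform draw
    return rng.choices(range(len(outputs)), weights=weights, k=1)[0]

rng = random.Random(0)
hypotheses = [0.1, 0.7, 0.2]                      # toy next-symbol hypotheses
next_symbol_wta = winner_take_all(hypotheses)     # always channel 1
next_symbol_drawn = random_draw(hypotheses, rng)  # channel 1 most of the time
```

In generation mode the chosen index would be coded as the next input symbol. A deterministic rule like winner-take-all can lock into a repeating loop, as the "said_the_wolf" sample shows, while random drawing avoids loops at the cost of noisier text.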
46
4 Open Issues
47
  • 4.2 Multiple timescales
  • 4.3 Additive mixtures of dynamics
  • 4.4 "Switching" memory
  • 4.5 High-dimensional dynamics

48
Multiple time scales
  • This is hard to learn (Laser benchmark time
    series)

Reason: 2 widely separated time scales.
Approach for future research: ESNs with different time
constants in their units.
49
Additive dynamics
  • This proved impossible to learn

Reason: requires 2 independent oscillators, but
in an ESN all dynamics are mutually coupled.
Approach for future research: modular
ESNs and unsupervised multiple expert learning.
50
"Switching" memory
  • This FSA has long memory "switches"

baaa....aaacaaa...aaabaaa...aaacaaa...aaa...
Generating such sequences not possible with
monotonic, area-bounded forgetting curves!
An ESN simply is not a model for long-term memory!
51
High-dimensional dynamics
  • High-dimensional dynamics would require very
    large ESN. Example: 6-DOF nonstationary time
    series, one-step prediction

200-unit ESN: RMS = 0.2
400-unit network: RMS = 0.1
best other training technique 1): RMS = 0.02
Approach for future research: task-specific
optimization of ESN
1) Prokhorov et al., extended Kalman filtering +
BPTT. Network size 40, 1400 trained links,
training time 3 weeks
52
Spreading trouble...
  • Signals x_i(n) of reservoir can be interpreted as
    vectors in (infinite-dimensional) signal space
  • Correlation E[xy] yields inner product <x, y>
    on this space
  • Output signal y(n) is linear combination of these
    x_i(n)
  • The more orthogonal the x_i(n), the smaller the
    output weights

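[Figure: two signal-space sketches of y as a combination of x1 and x2. Nearly parallel x1, x2: y = 30 x1 - 28 x2. Nearly orthogonal x1, x2: y = 0.5 x1 + 0.7 x2.]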
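The effect of orthogonality on the size of the output weights can be reproduced with two toy "reservoir signals". The signals below are constructed for this illustration (sines and cosines over one period), chosen so that the nearly collinear pair yields the weights 30 and -28 quoted on the slide; they are not the signals behind the original figure.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
y = 2.0 * np.sin(t) + 0.7 * np.cos(t)            # desired output signal

def output_weights(x1, x2):
    """Least-squares weights w with y ~ w[0] * x1 + w[1] * x2."""
    X = np.column_stack([x1, x2])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Orthogonal pair: small weights, approximately [2.0, 0.7].
w_orthogonal = output_weights(np.sin(t), np.cos(t))

# Nearly collinear pair: large weights of opposite sign, approximately [30, -28].
w_collinear = output_weights(np.sin(t), np.sin(t) - 0.025 * np.cos(t))
```

Both pairs reproduce y exactly, but the collinear pair does so by subtracting two large, almost equal contributions, so small changes in either signal are strongly amplified in the output.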
53
  • Eigenvectors v_k of correlation matrix R = (E[x_i x_j])
    are orthogonal signals
  • Eigenvalues λ_k indicate what "mass" of reservoir
    signals x_i (all together) is aligned with v_k
  • Eigenvalue spread λ_max/λ_min indicates overall
    "non-orthogonality" of reservoir signals

[Figure: signals x1, x2 with eigenvectors vmax, vmin, for eigenvalue spreads λ_max/λ_min of 1 and 20]
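The eigenvalue spread can be computed directly from collected reservoir signals. In this sketch the expectation E[x_i x_j] is estimated by a time average, and the two test signal pairs are toy examples made up for illustration.

```python
import numpy as np

def eigenvalue_spread(signals):
    """Spread lambda_max / lambda_min of the correlation matrix of the signals.

    signals: array of shape (n_steps, n_units), one column per signal x_i(n).
    """
    R = signals.T @ signals / signals.shape[0]    # R[i, j] estimates E[x_i x_j]
    eigvals = np.linalg.eigvalsh(R)               # ascending order, R is symmetric
    return eigvals[-1] / eigvals[0]

t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
spread_orthogonal = eigenvalue_spread(np.column_stack([np.sin(t), np.cos(t)]))
spread_collinear = eigenvalue_spread(
    np.column_stack([np.sin(t), np.sin(t) - 0.025 * np.cos(t)]))
```

The orthogonal, equal-power pair gives a spread close to 1, while the nearly collinear pair gives a spread in the thousands.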
54
  • Large eigenvalue spread → large output
    weights ...
  • harmful for generalization, because slight
    changes in reservoir signals will induce large
    changes in output
  • harmful for model accuracy, because estimation
    error contained in reservoir signals is magnified
    (does not apply to deterministic systems)
  • renders LMS online adaptive learning useless
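The last point can be illustrated by running plain LMS on two toy signal pairs, one orthogonal and one nearly collinear. The signals, the step size `mu` and the number of steps are assumptions chosen for this sketch.

```python
import math

def lms_final_mse(features, targets, mu=0.1, tail=200):
    """Run LMS once over the data; return the mean squared error of the last `tail` steps."""
    w = [0.0] * len(features[0])
    errors = []
    for x, d in zip(features, targets):
        y = sum(wi * xi for wi, xi in zip(w, x))
        e = d - y                                         # error before the update
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]    # LMS weight update
        errors.append(e * e)
    return sum(errors[-tail:]) / tail

n_steps = 2000
phase = [0.3 * n for n in range(n_steps)]
target = [2.0 * math.sin(p) + 0.7 * math.cos(p) for p in phase]
orthogonal = [(math.sin(p), math.cos(p)) for p in phase]
collinear = [(math.sin(p), math.sin(p) - 0.025 * math.cos(p)) for p in phase]

mse_orthogonal = lms_final_mse(orthogonal, target)   # converges to almost zero
mse_collinear = lms_final_mse(collinear, target)     # still far from converged
```

Both problems have an exact solution, yet with the collinear pair the error component along the weakest eigenvector shrinks so slowly that the weights are still far from their optimum after 2000 steps.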


[Figure: signals x1, x2 with eigenvectors vmax, vmin; eigenvalue spread λ_max/λ_min of 20]
55
Summary
  • Basic idea: dynamical reservoir of echo states +
    supervised teaching of output connections.
  • Seemed difficult: in nonlinear coupled systems,
    every variable interacts with every other. BUT
    seen the other way round, every variable rules
    and echoes every other. Exploit this for local
    learning and local system analysis.
  • Echo states shape the tool for the solution from
    the task.

56
Thank you.
57
References
  • H. Jaeger (2002) Tutorial on training recurrent
    neural networks, covering BPPT, RTRL, EKF and the
    "echo state network" approach. GMD Report 159,
    German National Research Center for Information
    Technology, 2002
  • Slides used by Herbert Jaeger at IK2002